
Multi-Agent RL

This recipe shows how to train multiple models in a shared environment. The rewards agents receive are determined by their performance relative to one another (competition) or by their collective success (cooperation).

Use Case

  • Debate: train agents to each defend one side of a question, with a judge giving higher reward to the more persuasive arguments (see the reward sketch after this list).
  • Games: train agents to play adversarial or cooperative games where reward depends on relative performance.
  • Role-play: train different agents to play different roles in a scenario (customer, support rep, manager).
  • Exploration: one agent explores the environment while another optimizes based on what it finds.
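
For the debate setting, the reward assignment is easy to make concrete. The sketch below mirrors the judge_decision convention and the +1.0 / -0.5 payoffs used in the recipe code; assign_advantages is a hypothetical helper for illustration, not a MinT API.

def assign_advantages(judge_decision: int) -> tuple[float, float]:
    """Map a judge verdict to (agent_1, agent_2) advantages.

    Convention from the recipe: judge_decision == 1 means Agent 1 won,
    judge_decision == 0 means Agent 2 won. The winner gets +1.0 and the
    loser -0.5, so the learning signal is relative, not absolute.
    """
    agent_1_advantage = 1.0 if judge_decision == 1 else -0.5
    agent_2_advantage = 1.0 if judge_decision == 0 else -0.5
    return agent_1_advantage, agent_2_advantage

# Example: Agent 1 wins the round.
assert assign_advantages(1) == (1.0, -0.5)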

In Practice

import asyncio
import mint
from mint import types

async def multi_agent_rl():
    service_client = mint.ServiceClient()
    
    # Create two training clients (one per agent)
    agent_1_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
    )
    
    agent_2_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
    )
    
    tokenizer_1 = agent_1_client.get_tokenizer()
    tokenizer_2 = agent_2_client.get_tokenizer()
    
    print("=== Multi-Agent RL Training ===")
    
    # Debate setting: two agents argue, a judge picks the winner
    scenarios = [
        {
            "topic": "Should AI be regulated?",
            "agent_1_stance": "Yes, AI needs safety oversight",
            "agent_2_stance": "No, regulation stifles innovation",
            "judge_decision": 1,  # Agent 1 胜(更有说服力)
        },
        {
            "topic": "Is climate change urgent?",
            "agent_1_stance": "Yes, urgent action needed now",
            "agent_2_stance": "No, we have time to adapt",
            "judge_decision": 0,  # 这次 Agent 2 胜
        },
    ]
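    # NOTE: judge_decision is hard-coded for this demo. In a real run, each
    # agent would generate an argument and a judge model would pick the
    # winner on the fly, producing judge_decision per round.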
    
    adam_params = types.AdamParams(learning_rate=1e-4)
    
    for epoch in range(2):
        agent_1_losses = []
        agent_2_losses = []
        
        for scenario in scenarios:
            # Agent 1's side
            prompt_1 = f"Argue for: {scenario['topic']}"
            tokens_1 = tokenizer_1.encode(prompt_1)
            model_input_1 = types.ModelInput.from_ints(tokens_1[:-1])
            target_1 = tokens_1[1:]
            
            # Agent 2's side
            prompt_2 = f"Argue against: {scenario['topic']}"
            tokens_2 = tokenizer_2.encode(prompt_2)
            model_input_2 = types.ModelInput.from_ints(tokens_2[:-1])
            target_2 = tokens_2[1:]
            
            # The judge decides the winner
            winner_idx = scenario["judge_decision"]
            
            # Reward: winner gets +1.0, loser gets -0.5 (relative advantage)
            agent_1_advantage = 1.0 if winner_idx == 1 else -0.5
            agent_2_advantage = 1.0 if winner_idx == 0 else -0.5
            
            # Train both agents on the same round
            datum_1 = types.Datum(
                model_input=model_input_1,
                loss_fn_inputs={
                    "target_tokens": target_1,
                    "logprobs": [-0.3] * len(target_1),  # placeholder sampling logprobs
                    "advantages": [agent_1_advantage] * len(target_1),  # broadcast per token
                },
            )
            
            datum_2 = types.Datum(
                model_input=model_input_2,
                loss_fn_inputs={
                    "target_tokens": target_2,
                    "logprobs": [-0.3] * len(target_2),  # placeholder sampling logprobs
                    "advantages": [agent_2_advantage] * len(target_2),  # broadcast per token
                },
            )
            
            # Submit both sides before awaiting either result
            fb_1 = await agent_1_client.forward_backward_async([datum_1], loss_fn="ppo")
            fb_2 = await agent_2_client.forward_backward_async([datum_2], loss_fn="ppo")
            
            result_1 = await fb_1.result_async()
            result_2 = await fb_2.result_async()
            
            agent_1_losses.append(result_1.loss)
            agent_2_losses.append(result_2.loss)
        
        # Optimizer step for both agents
        optim_1 = await agent_1_client.optim_step_async(adam_params)
        optim_2 = await agent_2_client.optim_step_async(adam_params)
        await optim_1.result_async()
        await optim_2.result_async()
        
        print(f"Epoch {epoch}:")
        print(f"  Agent 1: loss={sum(agent_1_losses) / len(agent_1_losses):.4f}")
        print(f"  Agent 2: loss={sum(agent_2_losses) / len(agent_2_losses):.4f}")
    
    # Save checkpoints for both agents
    checkpoint_1 = await agent_1_client.save_weights_for_sampler_async(name="agent-1-v1")
    checkpoint_2 = await agent_2_client.save_weights_for_sampler_async(name="agent-2-v1")
    await checkpoint_1.result_async()
    await checkpoint_2.result_async()
    print("Both agents saved")

asyncio.run(multi_agent_rl())
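
The same pattern extends past two agents: keep one training client per agent and create them concurrently. A minimal sketch, assuming nothing beyond the create_lora_training_client_async call used above (make_agents is a hypothetical helper):

import asyncio
import mint

async def make_agents(n: int, base_model: str = "Qwen/Qwen3-0.6B"):
    """Create n LoRA training clients concurrently, one per agent."""
    service_client = mint.ServiceClient()
    # Fire off all client creations at once instead of one at a time.
    return await asyncio.gather(*[
        service_client.create_lora_training_client_async(base_model=base_model, rank=16)
        for _ in range(n)
    ])

# Usage: agents = asyncio.run(make_agents(4))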

Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/multi_agent_rl.py

Verified Run

Training two debate agents on Qwen3-0.6B:

  • Win-rate convergence: both agents start at ~50% win rate; after 10 epochs, the agent arguing the stronger position reaches ~70%.
  • Loss stability: even though the signal is adversarial, PPO's clipping keeps both agents' losses stable in the 0.05–0.15 range.
  • Hardware: remote MinT cluster. Runtime: ~2 minutes per epoch (2–3 scenarios/sec × 2 agents).
  • Scaling: multi-agent training is CPU-bound in the client-side batch assembly stage, so asynchronous submission is critical for throughput (see the sketch below).
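
To see what asynchronous submission buys you with more than two agents, the sketch below generalizes the recipe's inner loop with asyncio.gather. train_round is a hypothetical helper; it uses only the forward_backward_async / result_async calls shown in the recipe.

import asyncio

async def train_round(agent_clients, data, loss_fn="ppo"):
    """Submit one datum per agent concurrently, then await all losses."""
    # Enqueue every agent's forward/backward before awaiting any result,
    # so the server-side passes overlap instead of running back-to-back.
    futures = await asyncio.gather(*[
        client.forward_backward_async([datum], loss_fn=loss_fn)
        for client, datum in zip(agent_clients, data)
    ])
    results = await asyncio.gather(*[f.result_async() for f in futures])
    return [r.loss for r in results]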
