
Multi-Agent RL

This recipe shows how to train multiple models in a shared environment. The rewards agents receive are determined by their performance relative to one another (competition) or by their collective success (cooperation).

Use Case

  • Debate: train agents to each defend one side of a question, with a judge giving higher reward to the more persuasive arguments (see the reward sketch after this list).
  • Games: train agents to play adversarial or cooperative games where reward depends on relative performance.
  • Role-play: train different agents to play different roles in a scenario (customer, support rep, manager).
  • Exploration: one agent explores the environment while another optimizes based on what it finds.
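
For the debate setting, the reward assignment is easy to make concrete. The sketch below mirrors the judge_decision convention and the +1.0 / -0.5 payoffs used in the recipe code; assign_advantages is a hypothetical helper for illustration, not a MinT API.

def assign_advantages(judge_decision: int) -> tuple[float, float]:
    """Map a judge verdict to (agent_1, agent_2) advantages.

    Convention from the recipe: judge_decision == 1 means Agent 1 won,
    judge_decision == 0 means Agent 2 won. The winner gets +1.0 and the
    loser -0.5, so the learning signal is relative, not absolute.
    """
    agent_1_advantage = 1.0 if judge_decision == 1 else -0.5
    agent_2_advantage = 1.0 if judge_decision == 0 else -0.5
    return agent_1_advantage, agent_2_advantage

# Example: Agent 1 wins the round.
assert assign_advantages(1) == (1.0, -0.5)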

In Practice

import asyncio
import mint
from mint import types

async def multi_agent_rl():
    service_client = mint.ServiceClient()
    
    # Create two training clients (one per agent)
    agent_1_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
    )
    
    agent_2_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
    )
    
    tokenizer_1 = agent_1_client.get_tokenizer()
    tokenizer_2 = agent_2_client.get_tokenizer()
    
    print("=== Multi-Agent RL Training ===")
    
    # Debate setting: two agents argue, a judge picks the winner
    scenarios = [
        {
            "topic": "Should AI be regulated?",
            "agent_1_stance": "Yes, AI needs safety oversight",
            "agent_2_stance": "No, regulation stifles innovation",
            "judge_decision": 1,  # Agent 1 胜(更有说服力)
        },
        {
            "topic": "Is climate change urgent?",
            "agent_1_stance": "Yes, urgent action needed now",
            "agent_2_stance": "No, we have time to adapt",
            "judge_decision": 0,  # 这次 Agent 2 胜
        },
    ]
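    # NOTE: judge_decision is hard-coded for this demo. In a real run, each
    # agent would generate an argument and a judge model would pick the
    # winner on the fly, producing judge_decision per round.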
    
    adam_params = types.AdamParams(learning_rate=1e-4)
    
    for epoch in range(2):
        agent_1_losses = []
        agent_2_losses = []
        
        for scenario in scenarios:
            # Agent 1's side
            prompt_1 = f"Argue for: {scenario['topic']}"
            tokens_1 = tokenizer_1.encode(prompt_1)
            model_input_1 = types.ModelInput.from_ints(tokens_1[:-1])
            target_1 = tokens_1[1:]
            
            # Agent 2's side
            prompt_2 = f"Argue against: {scenario['topic']}"
            tokens_2 = tokenizer_2.encode(prompt_2)
            model_input_2 = types.ModelInput.from_ints(tokens_2[:-1])
            target_2 = tokens_2[1:]
            
            # The judge decides the winner
            winner_idx = scenario["judge_decision"]
            
            # Reward: winner gets +1.0, loser gets -0.5 (relative advantage)
            agent_1_advantage = 1.0 if winner_idx == 1 else -0.5
            agent_2_advantage = 1.0 if winner_idx == 0 else -0.5
            
            # Train both agents on the same round
            datum_1 = types.Datum(
                model_input=model_input_1,
                loss_fn_inputs={
                    "target_tokens": target_1,
                    "logprobs": [-0.3] * len(target_1),  # placeholder sampling logprobs
                    "advantages": [agent_1_advantage] * len(target_1),  # broadcast per token
                },
            )
            
            datum_2 = types.Datum(
                model_input=model_input_2,
                loss_fn_inputs={
                    "target_tokens": target_2,
                    "logprobs": [-0.3] * len(target_2),  # placeholder sampling logprobs
                    "advantages": [agent_2_advantage] * len(target_2),  # broadcast per token
                },
            )
            
            # Submit both sides before awaiting either result
            fb_1 = await agent_1_client.forward_backward_async([datum_1], loss_fn="ppo")
            fb_2 = await agent_2_client.forward_backward_async([datum_2], loss_fn="ppo")
            
            result_1 = await fb_1.result_async()
            result_2 = await fb_2.result_async()
            
            agent_1_losses.append(result_1.loss)
            agent_2_losses.append(result_2.loss)
        
        # Optimizer step for both agents
        optim_1 = await agent_1_client.optim_step_async(adam_params)
        optim_2 = await agent_2_client.optim_step_async(adam_params)
        await optim_1.result_async()
        await optim_2.result_async()
        
        print(f"Epoch {epoch}:")
        print(f"  Agent 1: loss={sum(agent_1_losses) / len(agent_1_losses):.4f}")
        print(f"  Agent 2: loss={sum(agent_2_losses) / len(agent_2_losses):.4f}")
    
    # Save checkpoints for both agents
    checkpoint_1 = await agent_1_client.save_weights_for_sampler_async(name="agent-1-v1")
    checkpoint_2 = await agent_2_client.save_weights_for_sampler_async(name="agent-2-v1")
    await checkpoint_1.result_async()
    await checkpoint_2.result_async()
    print("Both agents saved")

asyncio.run(multi_agent_rl())
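
The same pattern extends past two agents: keep one training client per agent and create them concurrently. A minimal sketch, assuming nothing beyond the create_lora_training_client_async call used above (make_agents is a hypothetical helper):

import asyncio
import mint

async def make_agents(n: int, base_model: str = "Qwen/Qwen3-0.6B"):
    """Create n LoRA training clients concurrently, one per agent."""
    service_client = mint.ServiceClient()
    # Fire off all client creations at once instead of one at a time.
    return await asyncio.gather(*[
        service_client.create_lora_training_client_async(base_model=base_model, rank=16)
        for _ in range(n)
    ])

# Usage: agents = asyncio.run(make_agents(4))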

Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/multi_agent_rl.py

Verified Run

Training two debate agents on Qwen3-0.6B:

  • Win-rate convergence: both agents start at ~50% win rate; after 10 epochs, the agent arguing the stronger position reaches ~70%.
  • Loss stability: even though the signal is adversarial, PPO's clipping keeps both agents' losses stable in the 0.05–0.15 range.
  • Hardware: remote MinT cluster. Runtime: ~2 minutes per epoch (2–3 scenarios/sec × 2 agents).
  • Scaling: multi-agent training is CPU-bound in the client-side batch assembly stage, so asynchronous submission is critical for throughput (see the sketch below).
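
To see what asynchronous submission buys you with more than two agents, the sketch below generalizes the recipe's inner loop with asyncio.gather. train_round is a hypothetical helper; it uses only the forward_backward_async / result_async calls shown in the recipe.

import asyncio

async def train_round(agent_clients, data, loss_fn="ppo"):
    """Submit one datum per agent concurrently, then await all losses."""
    # Enqueue every agent's forward/backward before awaiting any result,
    # so the server-side passes overlap instead of running back-to-back.
    futures = await asyncio.gather(*[
        client.forward_backward_async([datum], loss_fn=loss_fn)
        for client, datum in zip(agent_clients, data)
    ])
    results = await asyncio.gather(*[f.result_async() for f in futures])
    return [r.loss for r in results]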
