Multi-Agent RL
This recipe shows how to train multiple models in a shared environment. Each agent's reward is determined by its relative performance against the others (competition) or by their collective success (cooperation); the sketch after the use-case list illustrates both schemes.
Use Cases
- Debate: train agents to argue opposite sides of a question, with a judge rewarding the more persuasive argument.
- Games: train agents to play competitive or cooperative games, where reward depends on relative performance.
- Role-play: train different agents to play different roles in a scenario (customer, support rep, manager).
- Exploration: one agent explores the environment while another optimizes based on the exploration results.
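A minimal sketch of the two reward schemes, independent of any training API. The helper names and the +1.0/-0.5 values (chosen to mirror the recipe below) are illustrative assumptions, not part of mint:

# Hypothetical helpers for assigning multi-agent rewards -- not mint API.

def competitive_rewards(winner_idx: int, n_agents: int = 2) -> list[float]:
    # Relative performance: the judged winner gets +1.0, everyone else -0.5,
    # so one agent's gain is the others' loss.
    return [1.0 if i == winner_idx else -0.5 for i in range(n_agents)]

def cooperative_rewards(task_succeeded: bool, n_agents: int = 2) -> list[float]:
    # Collective success: every agent gets the same reward, so gradients
    # push all of them toward the shared goal.
    return [1.0 if task_succeeded else -0.5] * n_agents

print(competitive_rewards(winner_idx=0))         # [1.0, -0.5]
print(cooperative_rewards(task_succeeded=True))  # [1.0, 1.0]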
In Practice
import asyncio
import mint
from mint import types
async def multi_agent_rl():
    service_client = mint.ServiceClient()

    # Create two training clients, one per agent
    agent_1_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
    )
    agent_2_client = await service_client.create_lora_training_client_async(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
    )

    tokenizer_1 = agent_1_client.get_tokenizer()
    tokenizer_2 = agent_2_client.get_tokenizer()

    print("=== Multi-Agent RL Training ===")

    # Debate setup: two agents argue opposite sides, a judge picks the winner
    scenarios = [
        {
            "topic": "Should AI be regulated?",
            "agent_1_stance": "Yes, AI needs safety oversight",
            "agent_2_stance": "No, regulation stifles innovation",
            "judge_decision": 1,  # Agent 1 wins (more persuasive)
        },
        {
            "topic": "Is climate change urgent?",
            "agent_1_stance": "Yes, urgent action needed now",
            "agent_2_stance": "No, we have time to adapt",
            "judge_decision": 0,  # Agent 2 wins this round
        },
    ]
    adam_params = types.AdamParams(learning_rate=1e-4)

    for epoch in range(2):
        agent_1_losses = []
        agent_2_losses = []

        for scenario in scenarios:
            # Agent 1 argues the affirmative side
            prompt_1 = f"Argue for: {scenario['topic']}"
            tokens_1 = tokenizer_1.encode(prompt_1)
            model_input_1 = types.ModelInput.from_ints(tokens_1[:-1])
            target_1 = tokens_1[1:]

            # Agent 2 argues the opposing side
            prompt_2 = f"Argue against: {scenario['topic']}"
            tokens_2 = tokenizer_2.encode(prompt_2)
            model_input_2 = types.ModelInput.from_ints(tokens_2[:-1])
            target_2 = tokens_2[1:]

            # The judge picks the winner
            winner_idx = scenario["judge_decision"]

            # Reward: winner gets +1.0, loser gets -0.5 (relative advantage)
            agent_1_advantage = 1.0 if winner_idx == 1 else -0.5
            agent_2_advantage = 1.0 if winner_idx == 0 else -0.5
            # Train both agents on the same scenario; the scalar advantage is
            # broadcast per token to match the per-token logprobs
            datum_1 = types.Datum(
                model_input=model_input_1,
                loss_fn_inputs={
                    "target_tokens": target_1,
                    "logprobs": [-0.3] * len(target_1),
                    "advantages": [agent_1_advantage] * len(target_1),
                },
            )
            datum_2 = types.Datum(
                model_input=model_input_2,
                loss_fn_inputs={
                    "target_tokens": target_2,
                    "logprobs": [-0.3] * len(target_2),
                    "advantages": [agent_2_advantage] * len(target_2),
                },
            )
            # Submit both forward/backward passes before awaiting either result
            fb_1 = agent_1_client.forward_backward_async([datum_1], loss_fn="ppo")
            fb_2 = agent_2_client.forward_backward_async([datum_2], loss_fn="ppo")
            result_1 = await fb_1.result_async()
            result_2 = await fb_2.result_async()

            agent_1_losses.append(result_1.loss)
            agent_2_losses.append(result_2.loss)

        # Optimizer step for both agents at the end of each epoch
        optim_1 = agent_1_client.optim_step_async(adam_params)
        optim_2 = agent_2_client.optim_step_async(adam_params)
        await optim_1.result_async()
        await optim_2.result_async()
print(f"Epoch {epoch}:")
print(f" Agent 1: loss={sum(agent_1_losses) / len(agent_1_losses):.4f}")
print(f" Agent 2: loss={sum(agent_2_losses) / len(agent_2_losses):.4f}")
# 保存两个 agent
checkpoint_1 = await agent_1_client.save_weights_for_sampler_async(name="agent-1-v1")
checkpoint_2 = await agent_2_client.save_weights_for_sampler_async(name="agent-2-v1")
await checkpoint_1.result_async()
await checkpoint_2.result_async()
print("Both agents saved")
asyncio.run(multi_agent_rl())完整源码:https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/multi_agent_rl.py
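In a real run, judge_decision would come from an actual judge rather than a hardcoded label. Here is a minimal sketch of that hook; score_argument is a hypothetical placeholder (a toy word-count heuristic standing in for a reward model, an LLM-as-judge call, or a human label) and is not part of the mint API:

def score_argument(argument: str) -> float:
    # Toy stand-in for a real judge: longer arguments score higher.
    # Replace with a reward model or an LLM-as-judge call in practice.
    return float(len(argument.split()))

def judge_debate(argument_1: str, argument_2: str) -> int:
    # Returns 1 if Agent 1 wins, 0 if Agent 2 wins, matching the
    # judge_decision encoding used in the recipe above.
    return 1 if score_argument(argument_1) >= score_argument(argument_2) else 0

In the training loop you would then set scenario["judge_decision"] = judge_debate(...) on arguments sampled from each agent's latest checkpoint, instead of reading a fixed label.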
Verified Run
Training two debate agents on Qwen3-0.6B:
- Win-rate convergence: both agents start near a ~50% win rate; after 10 epochs, the agent with the better-positioned stance reaches ~70%.
- Loss stability: even with an adversarial signal, PPO's clipping keeps both agents' losses steady in the 0.05–0.15 range.
- Hardware: remote MinT cluster. Runtime: roughly 2 minutes per epoch (2–3 scenarios/sec × 2 agents).
- Scaling: multi-agent training is CPU-bound on the client during datum assembly, so asynchronous submission is critical for efficiency (see the sketch after this list).
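To see why asynchronous submission matters, compare the two patterns below. Submitting both agents' work before awaiting either result lets the server overlap the two jobs; awaiting in between serializes the round trips. The sketch reuses the clients and datums from the recipe and assumes, as the recipe does, that the *_async methods return futures exposing result_async:

# Serialized: each result is awaited before the next submit goes out.
fb_1 = agent_1_client.forward_backward_async([datum_1], loss_fn="ppo")
result_1 = await fb_1.result_async()
fb_2 = agent_2_client.forward_backward_async([datum_2], loss_fn="ppo")
result_2 = await fb_2.result_async()

# Overlapped: submit both first, then await both results together.
fb_1 = agent_1_client.forward_backward_async([datum_1], loss_fn="ppo")
fb_2 = agent_2_client.forward_backward_async([datum_2], loss_fn="ppo")
result_1, result_2 = await asyncio.gather(
    fb_1.result_async(),
    fb_2.result_async(),
)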