Multi-Agent RL
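This recipe shows a minimal multi-agent training pattern that actually runs on MinT. It creates two independent LoRA TrainingClients from the same base model, has Agent A and Agent B sample in sequence, scores both responses, and then trains both agents with the low-level TrainingClient APIs.
This page mirrors recipes/multi_agent_rl.py.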
Use Case
- Two-agent interactions: one model answers first, and the other model answers after seeing the first response.
- Independent LoRA weights: each agent role keeps its own adapter.
- Manual rollout control: when recipe.rl.train.main() is too high-level for the interaction, use the low-level TrainingClient APIs.
- Fallback-safe demos: if concurrent clients fail, fall back to a single client that switches roles.
Main Shape
ServiceClient
├─ Agent A TrainingClient ──▶ sampler A ──▶ response A ──▶ train A
└─ Agent B TrainingClient ──▶ sampler B ──▶ response B ──▶ train B
The recipe first tries the concurrent two-client path:
agent_a = service_client.create_lora_training_client(
    base_model=MODEL,
    rank=RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
agent_b = service_client.create_lora_training_client(
    base_model=MODEL,
    rank=RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
If that fails, the recipe falls back to a single LoRA client and alternates between the two role prompts.
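The creation-time part of this fallback can be sketched as follows. This is a minimal illustration, not the recipe's actual code: `create_agents`, `StubServiceClient`, and the fallback mode string are hypothetical names, and the sketch assumes that any exception while creating the second client should trigger the fallback. It does not cover failures that only appear later, during training.

```python
import warnings


def create_agents(service_client, base_model, rank):
    """Return (mode, agent_a, agent_b). In fallback mode both names point at one client."""
    lora_kwargs = dict(
        base_model=base_model,
        rank=rank,
        train_mlp=True,
        train_attn=True,
        train_unembed=True,
    )
    # The first client is needed in both modes, so its errors propagate.
    agent_a = service_client.create_lora_training_client(**lora_kwargs)
    try:
        agent_b = service_client.create_lora_training_client(**lora_kwargs)
    except Exception as exc:  # assumption: any creation error triggers the fallback
        warnings.warn(f"second LoRA client unavailable ({exc}); switching roles on one client")
        return "single-client role-switching", agent_a, agent_a
    return "concurrent two-client", agent_a, agent_b


class StubServiceClient:
    """Hypothetical stand-in that allows at most `limit` LoRA sessions."""

    def __init__(self, limit):
        self.limit = limit
        self.created = 0

    def create_lora_training_client(self, **kwargs):
        if self.created >= self.limit:
            raise RuntimeError("too many active LoRA sessions")
        self.created += 1
        return object()


# With a one-session limit, both roles end up sharing the same client.
mode, agent_a, agent_b = create_agents(StubServiceClient(limit=1), "stub-model", 8)
print(mode, agent_a is agent_b)
```

Returning the same client under both names lets the rest of a loop stay unchanged: code that refers to `agent_a` and `agent_b` keeps working, and only the role prompt distinguishes the two agents.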
Interaction Loop
Each step uses one very small arithmetic task:
# Snapshot each agent's current weights into a sampling client.
sampler_a = agent_a.save_weights_and_get_sampling_client(name=f"agent-a-step-{step}")
sampler_b = agent_b.save_weights_and_get_sampling_client(name=f"agent-b-step-{step}")
# Agent A answers first.
prompt_a = f"Agent A: answer the math question with only the number. {question}\nAnswer:"
response_a = sample_text(sampler_a, tokenizer_a, prompt_a)
reward_a = evaluate_response(response_a, answer)
# Agent B sees Agent A's response before answering the same question.
prompt_b = (
    f"Agent B: Agent A answered '{response_a}'. "
    f"Now answer the same question with only the number. {question}\nAnswer:"
)
response_b = sample_text(sampler_b, tokenizer_b, prompt_b)
reward_b = evaluate_response(response_b, answer)
Then train both agents on the known correct answer:
train_agent(agent_a, tokenizer_a, prompt_a, answer, "Agent A")
train_agent(agent_b, tokenizer_b, prompt_b, answer, "Agent B")
The training calls are low-level and real:
fb = training_client.forward_backward([datum], loss_fn="cross_entropy").result()
training_client.optim_step(types.AdamParams(learning_rate=LR)).result()
This recipe uses low-level SFT-style training after the interaction. It is a minimal multi-agent wiring recipe, not a complete game-theoretic RL algorithm.
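The interaction loop calls evaluate_response without showing its body. One plausible shape for such a scorer is sketched below, under the assumption that the reward is 1.0 when the first integer in the response equals the known answer and 0.0 otherwise. The recipe's real implementation may differ; see the full source for the authoritative version.

```python
import re


def evaluate_response(response, answer):
    """Return 1.0 if the first integer found in `response` equals `answer`, else 0.0.

    Assumption: `answer` is an int or a string that int() can parse.
    """
    match = re.search(r"-?\d+", response)
    if match is None:
        return 0.0  # no number in the response at all
    return 1.0 if int(match.group()) == int(answer) else 0.0


print(evaluate_response("5", "5"))
print(evaluate_response("I think it is 7", "5"))
```

Matching only the first integer keeps the scorer strict about the "answer with only the number" instruction: a response that restates the question before answering scores 0.0 unless its first number happens to be correct.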
Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/multi_agent_rl.py
Verified Run
Verified on MinT with two concurrent LoRA clients:
Mode: concurrent two-client
Agent A response: 5 ... | reward=1.0
Agent B response: 5 ... | reward=1.0
Agent A train_cross_entropy=7.044352
Agent B train_cross_entropy=6.422852
Final checkpoints:
Agent A: tinker://46839c2e-2577-4133-9011-e626293cbaa2_0/sampler_weights/multi-agent-a-final
Agent B: tinker://46839c2e-2577-4133-9011-e626293cbaa2_1/sampler_weights/multi-agent-b-final
Fallback Path
If the server cannot create or train two LoRA sessions at the same time, the recipe prints a warning and runs:
single TrainingClient
├─ role prompt: Agent A
└─ role prompt: Agent B
This keeps the script useful on deployments that allow only one active LoRA session.
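A role-switching step on a single client might look like the sketch below. It reuses the two prompt templates from the interaction loop and sends both through the same sampler. The function `run_role_switching_step` and the `sample` callable are hypothetical: `sample` stands in for the recipe's `sample_text` bound to the single client's sampler and tokenizer, and a fake sampler is used here so the sketch runs without a server.

```python
from typing import Callable, Dict


def run_role_switching_step(sample: Callable[[str], str], question: str) -> Dict[str, str]:
    """Play Agent A and then Agent B through one shared sampler."""
    prompt_a = f"Agent A: answer the math question with only the number. {question}\nAnswer:"
    response_a = sample(prompt_a)
    # Agent B's prompt embeds Agent A's response, exactly as in the two-client path.
    prompt_b = (
        f"Agent B: Agent A answered '{response_a}'. "
        f"Now answer the same question with only the number. {question}\nAnswer:"
    )
    response_b = sample(prompt_b)
    return {
        "prompt_a": prompt_a,
        "response_a": response_a,
        "prompt_b": prompt_b,
        "response_b": response_b,
    }


def fake_sample(prompt: str) -> str:
    """Hypothetical sampler that always answers 5, used only for this demo."""
    return "5"


step = run_role_switching_step(fake_sample, "What is 2 + 3?")
print(step["prompt_b"])
```

Because both roles update the same adapter in this mode, the agents are distinguished only by their prompts, so the fallback exercises the wiring rather than two independently trained policies.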