Mind Lab Toolkit (MinT)

Multi-Agent RL

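This recipe shows a minimal multi-agent training pattern that actually runs on MinT. It creates two independent LoRA TrainingClients from the same base model, has Agent A and Agent B sample sequentially, scores both responses, and then trains both agents with the low-level TrainingClient APIs.

This page tracks recipes/multi_agent_rl.py.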

Use Case

  • Two-agent interactions: one model answers first, and the other model answers based on the first response.
  • Independent LoRA weights: each agent role keeps its own adapter.
  • Manual rollout control: use the low-level TrainingClient APIs when recipe.rl.train.main() is too high-level for the interaction.
  • Fallback-safe demos: if the concurrent clients fail, use the single-client role-switching fallback.

Overall Shape

ServiceClient
  ├─ Agent A TrainingClient ──▶ sampler A ──▶ response A ──▶ train A
  └─ Agent B TrainingClient ──▶ sampler B ──▶ response B ──▶ train B

The recipe first tries the concurrent two-client path:

agent_a = service_client.create_lora_training_client(
    base_model=MODEL,
    rank=RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

agent_b = service_client.create_lora_training_client(
    base_model=MODEL,
    rank=RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

If that fails, it falls back to a single LoRA client and alternates between role prompts.
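
The try-then-fall-back decision can be sketched as below. This is an illustration under assumptions, not the recipe's actual code: `create_agents` and `make_client` are hypothetical names, `make_client` stands in for a zero-argument wrapper around the `create_lora_training_client(...)` call shown above, and the fallback mode label is invented for this example. It only covers failures at creation time.

```python
from typing import Callable, List, Tuple


def create_agents(make_client: Callable[[], object]) -> Tuple[str, List[object]]:
    """Try to create two independent clients; fall back to one on failure.

    `make_client` is a hypothetical zero-argument factory supplied by the caller.
    """
    clients: List[object] = []
    try:
        clients.append(make_client())  # Agent A
        clients.append(make_client())  # Agent B
        return "concurrent two-client", clients
    except Exception as exc:  # broad on purpose: the failure type is an unknown here
        print(f"WARNING: concurrent clients unavailable ({exc}); using fallback")
        if not clients:
            # Even the first client failed; retry once and let any error propagate.
            clients.append(make_client())
        # Reuse a single client for both roles.
        return "single-client role-switching", clients[:1]
```

In the fallback case the caller gets one client and is expected to drive both roles through it with different role prompts.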

Interaction Loop

Each step uses a small arithmetic task:

sampler_a = agent_a.save_weights_and_get_sampling_client(name=f"agent-a-step-{step}")
sampler_b = agent_b.save_weights_and_get_sampling_client(name=f"agent-b-step-{step}")

prompt_a = f"Agent A: answer the math question with only the number. {question}\nAnswer:"
response_a = sample_text(sampler_a, tokenizer_a, prompt_a)
reward_a = evaluate_response(response_a, answer)

prompt_b = (
    f"Agent B: Agent A answered '{response_a}'. "
    f"Now answer the same question with only the number. {question}\nAnswer:"
)
response_b = sample_text(sampler_b, tokenizer_b, prompt_b)
reward_b = evaluate_response(response_b, answer)
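
The body of `evaluate_response` is not shown on this page. One plausible shape for an exact-match arithmetic reward is sketched here, assuming the answer is an integer and that the first integer found in the response counts as the model's answer. Treat it as an assumption; the real helper lives in the linked source file and may differ.

```python
import re


def evaluate_response(response: str, answer: str) -> float:
    """Return 1.0 if the first integer in `response` equals `answer`, else 0.0.

    Assumed behavior for illustration; not taken from the recipe source.
    """
    match = re.search(r"-?\d+", response)
    if match is None:
        return 0.0  # no number in the response at all
    return 1.0 if int(match.group()) == int(answer) else 0.0
```

Because only the first integer is read, a response that restates the question before answering would score 0.0, which matches the "only the number" instruction in both prompts.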

Then both agents are trained on the known correct answer:

train_agent(agent_a, tokenizer_a, prompt_a, answer, "Agent A")
train_agent(agent_b, tokenizer_b, prompt_b, answer, "Agent B")

The training calls are real, low-level API calls:

fb = training_client.forward_backward([datum], loss_fn="cross_entropy").result()
training_client.optim_step(types.AdamParams(learning_rate=LR)).result()

The recipe applies low-level SFT-style training after the interaction. It is a minimal recipe for multi-agent wiring, not a complete game-theoretic RL algorithm.

Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/multi_agent_rl.py
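
To make "SFT-style" concrete, the sketch below shows a common way to lay out a next-token cross-entropy example from a prompt and a known answer: inputs and targets are the same sequence shifted by one, and the loss weights mask out the prompt. The function name `build_sft_arrays` and the dictionary keys are hypothetical, and this is not necessarily how `train_agent` builds its datum.

```python
from typing import Dict, List


def build_sft_arrays(prompt_tokens: List[int], answer_tokens: List[int]) -> Dict[str, List[int]]:
    """Lay out shifted inputs, targets, and loss weights for one SFT example."""
    if not prompt_tokens or not answer_tokens:
        raise ValueError("prompt_tokens and answer_tokens must both be non-empty")
    tokens = prompt_tokens + answer_tokens
    # Input position i is trained to predict token i + 1.
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    # Weight 0 where the target is still a prompt token, 1 where it is an answer token.
    weights = [0] * (len(prompt_tokens) - 1) + [1] * len(answer_tokens)
    return {
        "input_tokens": input_tokens,
        "target_tokens": target_tokens,
        "weights": weights,
    }
```

With this layout only the answer tokens contribute to the cross-entropy loss, so the agent is pushed toward the correct number rather than toward reproducing its own prompt.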

Verified Run

Two concurrent LoRA clients have been verified on MinT:

Mode: concurrent two-client
Agent A response: 5 ... | reward=1.0
Agent B response: 5 ... | reward=1.0
Agent A train_cross_entropy=7.044352
Agent B train_cross_entropy=6.422852

Final checkpoints:

Agent A: tinker://46839c2e-2577-4133-9011-e626293cbaa2_0/sampler_weights/multi-agent-a-final
Agent B: tinker://46839c2e-2577-4133-9011-e626293cbaa2_1/sampler_weights/multi-agent-b-final

Fallback Path

If the server cannot create or train two LoRA sessions at the same time, the recipe prints a warning and runs:

single TrainingClient
  ├─ role prompt: Agent A
  └─ role prompt: Agent B

This keeps the script useful on deployments that allow only one active LoRA session.
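
A minimal sketch of role switching with one client follows. The prompt templates are copied from the Interaction Loop section; `build_prompt`, `run_fallback_step`, and the `sample` callable are hypothetical names for this illustration, with `sample` standing in for whatever wraps the single client's sampler.

```python
from typing import Callable, Dict


def build_prompt(role: str, question: str, response_a: str = "") -> str:
    """Build the role prompt for Agent A or Agent B."""
    if role == "Agent A":
        return f"Agent A: answer the math question with only the number. {question}\nAnswer:"
    if role == "Agent B":
        return (
            f"Agent B: Agent A answered '{response_a}'. "
            f"Now answer the same question with only the number. {question}\nAnswer:"
        )
    raise ValueError(f"unknown role: {role}")


def run_fallback_step(sample: Callable[[str], str], question: str) -> Dict[str, str]:
    """Run both roles through the same sampler, Agent A first, then Agent B."""
    response_a = sample(build_prompt("Agent A", question))
    response_b = sample(build_prompt("Agent B", question, response_a))
    return {"Agent A": response_a, "Agent B": response_b}
```

Both roles then share one set of LoRA weights, so the only thing that distinguishes Agent A from Agent B in this mode is the prompt.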
