Mind Lab Toolkit (MinT)

Multi-Agent RL

This recipe demonstrates a minimal multi-agent training pattern that runs real training steps on MinT. It creates two independent LoRA TrainingClients from the same base model, samples from Agent A and then from Agent B, scores both responses, and trains both agents through the low-level TrainingClient APIs.

This page matches recipes/multi_agent_rl.py.

Use Case

  • Two-agent interactions: Let one model respond, then let another model respond conditioned on the first response.
  • Independent LoRA weights: Keep separate adapters for each agent role.
  • Manual rollout control: Use low-level TrainingClient APIs when recipe.rl.train.main() is too high-level for the interaction.
  • Fallback-safe demos: If concurrent clients fail, use a single-client role-switching fallback.

Main Shape

ServiceClient
  ├─ Agent A TrainingClient ──▶ sampler A ──▶ response A ──▶ train A
  └─ Agent B TrainingClient ──▶ sampler B ──▶ response B ──▶ train B

The recipe first tries the concurrent two-client path:

agent_a = service_client.create_lora_training_client(
    base_model=MODEL,
    rank=RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

agent_b = service_client.create_lora_training_client(
    base_model=MODEL,
    rank=RANK,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

If this fails, it falls back to one LoRA client and alternates role prompts.
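The try-then-fall-back logic can be sketched as a small helper. This is a minimal sketch, not the recipe's actual code: `create_agent_clients`, the `make_client` factory argument, and the fallback mode label are hypothetical names, and the broad `except Exception` is an assumption because this page does not say which exception MinT raises. It covers only a failure at client creation, not a failure during concurrent training.

```python
from typing import Any, Callable, List, Tuple


def create_agent_clients(make_client: Callable[[], Any]) -> Tuple[str, List[Any]]:
    """Try to create two clients; fall back to one shared client on failure.

    `make_client` is a hypothetical zero-argument factory, for example:
        lambda: service_client.create_lora_training_client(
            base_model=MODEL, rank=RANK,
            train_mlp=True, train_attn=True, train_unembed=True,
        )
    """
    clients: List[Any] = []
    try:
        clients.append(make_client())  # Agent A
        clients.append(make_client())  # Agent B
        return "concurrent two-client", clients
    except Exception as exc:  # assumption: the exact error type is not documented here
        print(f"WARNING: two concurrent LoRA clients unavailable ({exc}); using one client")
        if not clients:
            # Even the first client failed; retry once and let any error propagate.
            clients.append(make_client())
        shared = clients[0]
        # Both roles point at the same client; only the role prompt differs.
        return "single-client role-switching", [shared, shared]
```

The caller can unpack the result as `mode, (agent_a, agent_b) = create_agent_clients(factory)` and run the same interaction loop in either mode.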

Interaction Loop

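Each step poses a small arithmetic question. Both agents export their current weights to a sampling client, Agent A answers first, and Agent B answers after seeing Agent A's response: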

sampler_a = agent_a.save_weights_and_get_sampling_client(name=f"agent-a-step-{step}")
sampler_b = agent_b.save_weights_and_get_sampling_client(name=f"agent-b-step-{step}")

prompt_a = f"Agent A: answer the math question with only the number. {question}\nAnswer:"
response_a = sample_text(sampler_a, tokenizer_a, prompt_a)
reward_a = evaluate_response(response_a, answer)

prompt_b = (
    f"Agent B: Agent A answered '{response_a}'. "
    f"Now answer the same question with only the number. {question}\nAnswer:"
)
response_b = sample_text(sampler_b, tokenizer_b, prompt_b)
reward_b = evaluate_response(response_b, answer)
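The body of `evaluate_response` is not shown on this page. A plausible sketch, assuming the reward is 1.0 when the first number in the response equals the expected answer and 0.0 otherwise (consistent with the `reward=1.0` lines in the verified run, but still an assumption about the recipe's scoring rule):

```python
import re


def evaluate_response(response: str, answer: str) -> float:
    """Return 1.0 if the first number in `response` equals `answer`, else 0.0.

    Hypothetical scoring rule; the recipe's real implementation may differ.
    """
    # Find the first integer or decimal, allowing a leading minus sign.
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    if match is None:
        return 0.0
    try:
        return 1.0 if float(match.group()) == float(answer) else 0.0
    except ValueError:
        # `answer` was not numeric, so nothing can match.
        return 0.0
```

Comparing as numbers rather than strings lets a response such as `5.0` count as correct for the answer `5`.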

Both agents are then trained on the known correct answer:

train_agent(agent_a, tokenizer_a, prompt_a, answer, "Agent A")
train_agent(agent_b, tokenizer_b, prompt_b, answer, "Agent B")

The training call uses the low-level API and runs a real forward-backward pass followed by an optimizer step:

fb = training_client.forward_backward([datum], loss_fn="cross_entropy").result()
training_client.optim_step(types.AdamParams(learning_rate=LR)).result()
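How `datum` is built is not shown on this page. For a cross-entropy loss on a prompt and answer pair, a common layout is next-token prediction with per-token weights that mask out the prompt, so only answer tokens contribute to the loss. The sketch below illustrates that layout with plain Python lists. `build_sft_example` and its dictionary keys are hypothetical, and wrapping the result in MinT's own datum type is left out because that API is not documented here.

```python
from typing import Dict, List


def build_sft_example(prompt_tokens: List[int], answer_tokens: List[int]) -> Dict[str, List]:
    """Build shifted inputs, targets, and loss weights for one SFT example.

    Hypothetical helper: shows the masking idea only, not MinT's datum type.
    """
    tokens = prompt_tokens + answer_tokens
    # Weight 0.0 on prompt tokens, 1.0 on answer tokens.
    weights = [0.0] * len(prompt_tokens) + [1.0] * len(answer_tokens)
    return {
        "input_tokens": tokens[:-1],   # model sees every token except the last
        "target_tokens": tokens[1:],   # and predicts the next token at each position
        "weights": weights[1:],        # weights are aligned with the targets
    }


example = build_sft_example([10, 11, 12], [20, 21])
# example["input_tokens"]  == [10, 11, 12, 20]
# example["target_tokens"] == [11, 12, 20, 21]
# example["weights"]       == [0.0, 0.0, 1.0, 1.0]
```

With this layout the loss is computed only where the target is an answer token, which matches the SFT-style training the recipe describes.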

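Training after the interaction is SFT-style: each agent takes a cross-entropy step on the known answer, and the rewards computed during the interaction are not passed to train_agent. The recipe shows minimal multi-agent wiring, not a full game-theoretic RL algorithm.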

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/multi_agent_rl.py

Verified Run

Verified on MinT with two concurrent LoRA clients:

Mode: concurrent two-client
Agent A response: 5 ... | reward=1.0
Agent B response: 5 ... | reward=1.0
Agent A train_cross_entropy=7.044352
Agent B train_cross_entropy=6.422852

Final checkpoints:

Agent A: tinker://46839c2e-2577-4133-9011-e626293cbaa2_0/sampler_weights/multi-agent-a-final
Agent B: tinker://46839c2e-2577-4133-9011-e626293cbaa2_1/sampler_weights/multi-agent-b-final

Fallback Path

If the server cannot create or train two LoRA sessions at the same time, the recipe prints a warning and runs:

single TrainingClient
  ├─ role prompt: Agent A
  └─ role prompt: Agent B

This keeps the script useful on deployments that only allow one active LoRA session.
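In fallback mode the interaction keeps the same shape, except that one sampler serves both roles and only the role prompt changes. A minimal sketch, assuming the fallback reuses the prompts from the Interaction Loop section: `run_role_switching_step` and the `sample` callable are hypothetical stand-ins for the recipe's own helpers.

```python
from typing import Callable, Dict


def run_role_switching_step(sample: Callable[[str], str], question: str) -> Dict[str, str]:
    """Run one A-then-B exchange against a single sampler.

    `sample` is a hypothetical callable mapping a prompt string to response text,
    for example a wrapper around sample_text(sampler, tokenizer, prompt).
    """
    prompt_a = f"Agent A: answer the math question with only the number. {question}\nAnswer:"
    response_a = sample(prompt_a)

    # Agent B uses the same underlying model, conditioned on Agent A's answer.
    prompt_b = (
        f"Agent B: Agent A answered '{response_a}'. "
        f"Now answer the same question with only the number. {question}\nAnswer:"
    )
    response_b = sample(prompt_b)

    return {
        "prompt_a": prompt_a,
        "response_a": response_a,
        "prompt_b": prompt_b,
        "response_b": response_b,
    }
```

Because both roles share one set of LoRA weights in this mode, training on both prompts updates the same adapter rather than two independent ones.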
