Mind Lab Toolkit (MinT)

RL Overview

Reinforcement learning (RL) refines a policy by sampling from it, scoring the outputs, and backpropagating reward signals. MinT uses GRPO (Group Relative Policy Optimization) — a variant that stabilizes training by normalizing advantages within a group of samples from the same prompt.
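
Concretely, with G samples drawn from the same prompt and rewards r_1, ..., r_G, the advantage used throughout this page is plain centering (some GRPO variants also divide by the group's reward standard deviation):

    A_i = r_i - \frac{1}{G} \sum_{j=1}^{G} r_j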

The RL loop is:

  1. Sample multiple responses per prompt from the current policy.
  2. Score each response with a reward function (verifier, judge, or environment).
  3. Compute advantages by centering rewards within the group.
  4. Train on the sampled trajectories using loss_fn="importance_sampling".

Reward sources are pluggable, making RL suitable for math (exact-match verifiers), chat (preference judges), code (execution tests), or custom signals.
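
As an illustration, a minimal exact-match reward for math could look like the sketch below. extract_answer and the 1.0 / 0.0 reward convention mirror the quickstart excerpt later on this page; the regex and the math_reward wrapper are only assumptions about the response format, not part of the library.

import re

def extract_answer(text: str) -> str | None:
    """Pull the last integer out of a sampled response (hypothetical format assumption)."""
    matches = re.findall(r"-?\d+", text)
    return matches[-1] if matches else None

def math_reward(response_text: str, gold_answer: str) -> float:
    """Exact-match verifier: 1.0 if the extracted answer equals the gold answer, else 0.0."""
    return 1.0 if extract_answer(response_text) == gold_answer else 0.0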

Configuration

The minimal RL training loop:

import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

for step in range(num_steps):
    # 1. Sample from current policy
    sampling_client = training_client.save_weights_and_get_sampling_client(
        name=f"rl-step-{step}"
    )

    # 2. Collect samples and compute rewards (see RL demos)
    training_datums: list[types.Datum] = [...]
    
    # 3. Train with importance sampling
    training_client.forward_backward(
        training_datums, 
        loss_fn="importance_sampling"
    ).result()
    training_client.optim_step(adam_params).result()

For production-grade RL loops, see demos/rl/rl_core.py, which implements the shared run_grpo(adapter) infrastructure used by the task-specific adapters below.
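
The adapter interface itself is defined in rl_core.py; the sketch below is only a rough illustration of the shape such an adapter might take. The method names here are invented for illustration and are not the actual RLAdapter protocol.

from typing import Protocol

class RLAdapter(Protocol):
    # Hypothetical shape; see demos/rl/rl_core.py for the real protocol.
    def build_prompt(self, example: dict) -> list[int]:
        """Tokenize one task example into prompt tokens."""
        ...

    def reward(self, example: dict, response_text: str) -> float:
        """Score a sampled response (verifier, judge, or sandbox)."""
        ...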

Prompting Guide

RL prompts are typically shorter than SFT prompts — the model generates a complete response rather than completing a fixed template. For example:

  • Math: "Q: What is 3 + 5?\nA:"
  • Chat: User message formatted through the chat template with add_generation_prompt=True.
  • Code: Problem statement + few-shot example + "A:" (model fills in code).
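
The math case is encoded in the snippet below; the chat and code cases might be built like this (a sketch assuming get_tokenizer() returns a Hugging Face-style tokenizer that exposes apply_chat_template; the few-shot code prompt is illustrative only):

# Chat: route the user message through the chat template, leaving the assistant turn open
chat_prompt_tokens = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me one tip for better sleep."}],
    add_generation_prompt=True,
)

# Code: problem statement plus a few-shot example, ending in "A:" for the model to complete
code_prompt_tokens = tokenizer.encode(
    "Q: Write a function that reverses a string.\n"
    "A: def reverse(s): return s[::-1]\n\n"
    "Q: Write a function that squares a number.\nA:"
)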

Prompts are encoded and passed to sampling_client.sample(...):

prompt_tokens = tokenizer.encode("Q: What is 3 + 5?\nA:")

result = sampling_client.sample(
    prompt=types.ModelInput.from_ints(tokens=prompt_tokens),
    num_samples=8,  # Sample 8 responses per prompt
    sampling_params=types.SamplingParams(
        max_tokens=16,
        temperature=0.8,  # Exploration; higher = more diversity
        stop_token_ids=[tokenizer.eos_token_id],
    ),
).result()

for seq in result.sequences:
    response_text = tokenizer.decode(seq.tokens)        # fed to the reward function
    logprobs = seq.logprobs or [0.0] * len(seq.tokens)  # stored for the importance-sampling loss

Each sequence in the result carries:

  • tokens: The generated token IDs.
  • logprobs: Per-token log-probabilities under the policy.

Output Format

RL trains on advantages, not raw rewards. The standard pattern:

  1. Collect rewards for all samples in a group.
  2. Compute group mean reward: mean_r = sum(rewards) / num_samples.
  3. Center advantages: advantage[i] = reward[i] - mean_r.
  4. Construct Datum with loss_weights set to zero for the prompt and advantages for the response.

Example from quickstart.py:

g_rewards, g_responses, g_logprobs = [], [], []
for seq in res.sequences:
    txt = tokenizer.decode(seq.tokens)
    # Exact-match verifier: reward 1.0 for a correct final answer, 0.0 otherwise
    reward = 1.0 if extract_answer(txt) == gold_answer else 0.0
    g_rewards.append(reward)
    g_responses.append(list(seq.tokens))
    g_logprobs.append(list(seq.logprobs or [0.0] * len(seq.tokens)))

# Center rewards within the group to obtain per-sample advantages
mean_r = sum(g_rewards) / len(g_rewards)
advs = [r - mean_r for r in g_rewards]

for resp_tok, lp, adv in zip(g_responses, g_logprobs, advs):
    if not resp_tok:
        continue
    # Shift-by-one alignment: the model reads full[:-1] and predicts full[1:]
    full = prompt_tokens + resp_tok
    prefix = len(prompt_tokens) - 1
    datums.append(
        types.Datum(
            model_input=types.ModelInput.from_ints(tokens=full[:-1]),
            loss_fn_inputs={
                "target_tokens": full[1:],
                "weights": [0.0] * prefix + [1.0] * len(resp_tok),      # no loss on prompt tokens
                "logprobs": [0.0] * prefix + lp,                        # per-token logprobs recorded at sampling time
                "advantages": [0.0] * prefix + [adv] * len(resp_tok),   # same centered advantage for every response token
            },
        )
    )

The loss is then scaled by the advantages: samples with reward above mean_r receive a positive advantage and their tokens are reinforced, while samples below mean_r receive a negative advantage and their tokens are pushed down. This drives the policy to increase the probability of high-reward samples and decrease that of low-reward ones.
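
In a simplified, unclipped form, the per-token importance-sampling objective behaves like the following (a common formulation; the exact server-side loss may differ):

    loss = - \sum_{t} w_t \cdot \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{old}}(y_t \mid x, y_{<t})} \cdot A_t

where w_t are the per-token weights (0 on prompt tokens, 1 on response tokens), \pi_{\text{old}} is the policy at sampling time (recovered from the stored logprobs), and A_t is the centered advantage for that sample.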

All Parameters

  • loss_fn (str, default "importance_sampling"): GRPO is the default and recommended RL algorithm. Alternatives (server accepts; not exercised in canonical scripts): "ppo", "cispo", "dro".
  • group_size (int, default 4): Samples per prompt. Larger groups reduce variance but increase memory. Typical: 4–16.
  • batch_size or groups_per_batch (int, default 8): Prompts per training step.
  • max_tokens (int, default 16): Max generation length per sample. Task-dependent: math ≈ 16, chat ≈ 128, code ≈ 256.
  • learning_rate (float, default 2e-5): Adam LR. RL typically uses lower LR than SFT (1e-5 to 4e-5).
  • temperature (float, default 0.8): Sampling temperature. Higher = more exploration. Typical: 0.7–1.0 for RL.
  • kl_penalty_coef (float, default 0.0): KL divergence penalty (reference model vs. policy). Set > 0 to add coef * KL to the loss; prevents policy collapse.
  • betas (tuple[float, float], default (0.9, 0.999)): Adam exponential decay rates.
  • eps (float, default 1e-8): Adam numerical stability.
  • weight_decay (float, default 0.0): L2 regularization.
  • base_model (str, default "Qwen/Qwen3-0.6B"): Base model ID.
  • rank (int, default 16): LoRA rank.
  • train_mlp (bool, default True): Train MLP layers.
  • train_attn (bool, default True): Train attention layers.
  • train_unembed (bool, default True): Train output layer.

Usage:

adam_params = types.AdamParams(learning_rate=2e-5)
result = training_client.forward_backward(
    datums,
    loss_fn="importance_sampling"
).result()
training_client.optim_step(adam_params).result()
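
If you need to adjust the other optimizer parameters listed above, they can be passed when constructing AdamParams. This is a sketch assuming the keyword names match the parameter names in the list; only learning_rate is exercised in the snippets on this page.

adam_params = types.AdamParams(
    learning_rate=2e-5,   # RL typically uses a lower LR than SFT (1e-5 to 4e-5)
    betas=(0.9, 0.999),   # Adam exponential decay rates
    eps=1e-8,             # numerical stability term
    weight_decay=0.0,     # L2 regularization
)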

Task-specific RL: See Math RL, Chat RL, and Code RL for concrete reward implementations.

What's next?

  • Math RL — deterministic verifiers for exact-match grading.
  • Chat RL — preference-based rewards from a judge model.
  • Code RL — execution-based rewards from a sandbox.
  • rl_core.py — the shared GRPO loop and RLAdapter protocol.
