RL Overview
Reinforcement learning (RL) refines a policy by sampling from it, scoring the outputs, and backpropagating reward-weighted gradients. MinT uses GRPO (Group Relative Policy Optimization) — a variant that stabilizes training by normalizing advantages within a group of samples from the same prompt.
The RL loop is:
- Sample multiple responses per prompt from the current policy.
- Score each response with a reward function (verifier, judge, or environment).
- Compute advantages by centering rewards within the group.
- Train on the sampled trajectories using `loss_fn="importance_sampling"`.
Reward sources are pluggable, making RL suitable for math (exact-match verifiers), chat (preference judges), code (execution tests), or custom signals.
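For instance, a math verifier can be an ordinary Python function that compares the model's parsed answer to a gold label. A minimal sketch, with a simplified stand-in for the `extract_answer` helper used in quickstart.py:

```python
# Sketch of a pluggable exact-match reward for math-style prompts.
# `extract_answer` here is a simplified stand-in; the RL demos implement
# their own answer parsing.
import re

def extract_answer(text: str) -> str | None:
    """Return the last integer found in the response, if any."""
    matches = re.findall(r"-?\d+", text)
    return matches[-1] if matches else None

def math_reward(response_text: str, gold_answer: str) -> float:
    """1.0 for an exact match against the gold answer, 0.0 otherwise."""
    return 1.0 if extract_answer(response_text) == gold_answer else 0.0
```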
Configuration
The minimal RL training loop:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

for step in range(num_steps):
    # 1. Sample from current policy
    sampling_client = training_client.save_weights_and_get_sampling_client(
        name=f"rl-step-{step}"
    )

    # 2. Collect samples and compute rewards (see RL demos)
    training_datums: list[types.Datum] = [...]

    # 3. Train with importance sampling
    training_client.forward_backward(
        training_datums,
        loss_fn="importance_sampling",
    ).result()
    training_client.optim_step(adam_params).result()
```

For production-grade RL loops, see `demos/rl/rl_core.py`, which implements the shared `run_grpo(adapter)` infrastructure used by the task-specific adapters below.
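The exact adapter interface is defined in `demos/rl/rl_core.py`; as a rough, hypothetical sketch (the method names below are illustrative, not the real protocol), a task adapter supplies prompts and rewards while `run_grpo` owns the sample/score/train loop:

```python
from typing import Protocol

class RLAdapter(Protocol):
    """Hypothetical sketch of a task adapter; see demos/rl/rl_core.py
    for the actual RLAdapter protocol and method names."""

    def build_prompt(self, example: dict) -> str:
        """Format one dataset example into a prompt string."""
        ...

    def compute_reward(self, example: dict, response_text: str) -> float:
        """Score one sampled response for this example."""
        ...
```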
Prompting Guide
RL prompts are typically shorter than SFT prompts — the model generates a complete response rather than completing a fixed template. For example:
- Math: "Q: What is 3 + 5?\nA:"
- Chat: User message formatted through the chat template with `add_generation_prompt=True`.
- Code: Problem statement + few-shot example + "A:" (model fills in code).
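For the chat case, assuming the tokenizer returned by `get_tokenizer()` exposes the Hugging Face `apply_chat_template` API (an assumption; check your tokenizer), building the prompt tokens might look like:

```python
# Sketch of chat-style prompt construction, assuming a Hugging Face-style tokenizer.
messages = [{"role": "user", "content": "Give me one tip for writing clear code."}]
prompt_tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model responds
    tokenize=True,
)
```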
Prompts are encoded and passed to sampling_client.sample(...):
```python
prompt_tokens = tokenizer.encode("Q: What is 3 + 5?\nA:")

result = sampling_client.sample(
    prompt=types.ModelInput.from_ints(tokens=prompt_tokens),
    num_samples=8,  # Sample 8 responses per prompt
    sampling_params=types.SamplingParams(
        max_tokens=16,
        temperature=0.8,  # Exploration; higher = more diversity
        stop_token_ids=[tokenizer.eos_token_id],
    ),
).result()

for seq in result.sequences:
    response_text = tokenizer.decode(seq.tokens)
    logprobs = seq.logprobs or [0.0] * len(seq.tokens)
```

Each sequence in the result carries:
- `tokens`: The generated token IDs.
- `logprobs`: Per-token log-probabilities under the policy.
Output Format
RL trains on advantages, not raw rewards. The standard pattern:
- Collect rewards for all samples in a group.
- Compute the group mean reward: `mean_r = sum(rewards) / num_samples`.
- Center advantages: `advantage[i] = reward[i] - mean_r`.
- Construct a `Datum` with `weights` and `advantages` zeroed over the prompt tokens and the per-sample advantage applied over the response tokens.
Example from quickstart.py:
```python
# Score every sampled response in the group.
g_rewards, g_responses, g_logprobs = [], [], []
for seq in res.sequences:
    txt = tokenizer.decode(seq.tokens)
    reward = 1.0 if extract_answer(txt) == gold_answer else 0.0
    g_rewards.append(reward)
    g_responses.append(list(seq.tokens))
    g_logprobs.append(list(seq.logprobs or [0.0] * len(seq.tokens)))

# Center rewards within the group to get advantages.
mean_r = sum(g_rewards) / len(g_rewards)
advs = [r - mean_r for r in g_rewards]

# Build one Datum per non-empty response.
for resp_tok, lp, adv in zip(g_responses, g_logprobs, advs):
    if not resp_tok:
        continue
    full = prompt_tokens + resp_tok
    prefix = len(prompt_tokens) - 1
    datums.append(
        types.Datum(
            model_input=types.ModelInput.from_ints(tokens=full[:-1]),
            loss_fn_inputs={
                "target_tokens": full[1:],
                "weights": [0.0] * prefix + [1.0] * len(resp_tok),
                "logprobs": [0.0] * prefix + lp,
                "advantages": [0.0] * prefix + [adv] * len(resp_tok),
            },
        )
    )
```

The loss is then scaled by the advantages: samples with reward above `mean_r` receive positive gradient signal, and samples below `mean_r` receive negative signal. This pushes the policy to increase the probability of high-reward responses and decrease the probability of low-reward ones.
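Conceptually, the `importance_sampling` loss multiplies each response token's importance ratio (current policy vs. the policy that generated the sample) by its advantage. The following stand-alone sketch is illustrative only, not MinT's internal implementation; `new_logprobs` would come from the training forward pass:

```python
import math

def importance_sampling_loss(
    new_logprobs: list[float],   # per-token logprobs under the current policy
    old_logprobs: list[float],   # per-token logprobs stored at sampling time
    advantages: list[float],     # per-token advantages (zero over the prompt)
    weights: list[float],        # 1.0 on response tokens, 0.0 on prompt tokens
) -> float:
    """Illustrative per-token advantage-weighted importance-sampling objective."""
    total, count = 0.0, 0
    for new_lp, old_lp, adv, w in zip(new_logprobs, old_logprobs, advantages, weights):
        if w == 0.0:
            continue  # skip prompt tokens
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        total += -ratio * adv              # maximize advantage-weighted likelihood
        count += 1
    return total / max(count, 1)
```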
All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `loss_fn` | str | `"importance_sampling"` | GRPO is the default and recommended RL algorithm. Alternatives (server accepts; not exercised in canonical scripts): `"ppo"`, `"cispo"`, `"dro"`. |
| `group_size` | int | 4 | Samples per prompt. Larger groups reduce variance but increase memory. Typical: 4–16. |
| `batch_size` (or `groups_per_batch`) | int | 8 | Prompts per training step. |
| `max_tokens` | int | 16 | Max generation length per sample. Task-dependent: math ≈ 16, chat ≈ 128, code ≈ 256. |
| `learning_rate` | float | 2e-5 | Adam LR. RL typically uses lower LR than SFT (1e-5 to 4e-5). |
| `temperature` | float | 0.8 | Sampling temperature. Higher = more exploration. Typical: 0.7–1.0 for RL. |
| `kl_penalty_coef` | float | 0.0 | KL divergence penalty (reference model vs. policy). Set > 0 to add `coef * KL` to the loss; prevents policy collapse (see the sketch after this table). |
| `betas` | tuple[float, float] | (0.9, 0.999) | Adam exponential decay rates. |
| `eps` | float | 1e-8 | Adam numerical stability. |
| `weight_decay` | float | 0.0 | L2 regularization. |
| `base_model` | str | `"Qwen/Qwen3-0.6B"` | Base model ID. |
| `rank` | int | 16 | LoRA rank. |
| `train_mlp` | bool | True | Train MLP layers. |
| `train_attn` | bool | True | Train attention layers. |
| `train_unembed` | bool | True | Train output layer. |
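As a rough illustration of `kl_penalty_coef` (not MinT's internal formula; the helper below is hypothetical), a KL estimate against a frozen reference model can be computed from per-token log-probabilities and added to the loss:

```python
# Illustrative only: estimate KL(policy || reference) on the sampled response
# tokens and fold it into the loss, scaled by kl_penalty_coef.
def kl_penalized_loss(
    base_loss: float,
    policy_logprobs: list[float],   # response-token logprobs under the policy
    ref_logprobs: list[float],      # same tokens scored by the frozen reference model
    kl_penalty_coef: float = 0.1,
) -> float:
    kl_est = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / max(len(policy_logprobs), 1)
    return base_loss + kl_penalty_coef * kl_est
```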
Usage:
```python
adam_params = types.AdamParams(learning_rate=2e-5)

result = training_client.forward_backward(
    datums,
    loss_fn="importance_sampling",
).result()
training_client.optim_step(adam_params).result()
```

What's next?
- Math RL — deterministic verifiers for exact-match grading.
- Chat RL — preference-based rewards from a judge model.
- Code RL — execution-based rewards from a sandbox.
- rl_core.py — the shared GRPO loop and RLAdapter protocol.