RL Overview
Reinforcement learning (RL) refines a policy by sampling from it, scoring the outputs, and backpropagating reward-weighted gradients. MinT uses GRPO (Group Relative Policy Optimization) — a variant that stabilizes training by normalizing advantages within a group of samples from the same prompt.
The RL loop is:
- Sample multiple responses per prompt from the current policy.
- Score each response with a reward function (verifier, judge, or environment).
- Compute advantages by centering rewards within the group.
- Train on the sampled trajectories using `loss_fn="importance_sampling"`.
Reward sources are pluggable, making RL suitable for math (exact-match verifiers), chat (preference judges), code (execution tests), or custom signals.
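For instance, a math verifier can be an ordinary Python function that compares the model's parsed answer to a gold label. A minimal sketch, with a simplified stand-in for the `extract_answer` helper used in quickstart.py:

```python
# Sketch of a pluggable exact-match reward for math-style prompts.
# `extract_answer` here is a simplified stand-in; the RL demos implement
# their own answer parsing.
import re

def extract_answer(text: str) -> str | None:
    """Return the last integer found in the response, if any."""
    matches = re.findall(r"-?\d+", text)
    return matches[-1] if matches else None

def math_reward(response_text: str, gold_answer: str) -> float:
    """1.0 for an exact match against the gold answer, 0.0 otherwise."""
    return 1.0 if extract_answer(response_text) == gold_answer else 0.0
```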
Configuration
The minimal RL training loop:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

for step in range(num_steps):
    # 1. Sample from current policy
    sampling_client = training_client.save_weights_and_get_sampling_client(
        name=f"rl-step-{step}"
    )

    # 2. Collect samples and compute rewards (see RL demos)
    training_datums: list[types.Datum] = [...]

    # 3. Train with importance sampling
    training_client.forward_backward(
        training_datums,
        loss_fn="importance_sampling",
    ).result()
    training_client.optim_step(adam_params).result()
```

For production-grade RL loops, see `demos/rl/rl_core.py`, which implements the shared `run_grpo(adapter)` infrastructure used by the task-specific adapters below.
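The exact adapter interface is defined in `demos/rl/rl_core.py`; as a rough, hypothetical sketch (the method names below are illustrative, not the real protocol), a task adapter supplies prompts and rewards while `run_grpo` owns the sample/score/train loop:

```python
from typing import Protocol

class RLAdapter(Protocol):
    """Hypothetical sketch of a task adapter; see demos/rl/rl_core.py
    for the actual RLAdapter protocol and method names."""

    def build_prompt(self, example: dict) -> str:
        """Format one dataset example into a prompt string."""
        ...

    def compute_reward(self, example: dict, response_text: str) -> float:
        """Score one sampled response for this example."""
        ...
```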
Prompting Guide
RL prompts are typically shorter than SFT prompts — the model generates a complete response rather than completing a fixed template. For example:
- Math: "Q: What is 3 + 5?\nA:"
- Chat: User message formatted through the chat template with `add_generation_prompt=True`.
- Code: Problem statement + few-shot example + "A:" (model fills in code).
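For the chat case, assuming the tokenizer returned by `get_tokenizer()` exposes the Hugging Face `apply_chat_template` API (an assumption; check your tokenizer), building the prompt tokens might look like:

```python
# Sketch of chat-style prompt construction, assuming a Hugging Face-style tokenizer.
messages = [{"role": "user", "content": "Give me one tip for writing clear code."}]
prompt_tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model responds
    tokenize=True,
)
```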
Prompts are encoded and passed to sampling_client.sample(...):
```python
prompt_tokens = tokenizer.encode("Q: What is 3 + 5?\nA:")

result = sampling_client.sample(
    prompt=types.ModelInput.from_ints(tokens=prompt_tokens),
    num_samples=8,  # Sample 8 responses per prompt
    sampling_params=types.SamplingParams(
        max_tokens=16,
        temperature=0.8,  # Exploration; higher = more diversity
        stop_token_ids=[tokenizer.eos_token_id],
    ),
).result()

for seq in result.sequences:
    response_text = tokenizer.decode(seq.tokens)
    logprobs = seq.logprobs or [0.0] * len(seq.tokens)
```

Each sequence in the result carries:
- `tokens`: The generated token IDs.
- `logprobs`: Per-token log-probabilities under the policy.
Output Format
RL trains on advantages, not raw rewards. The standard pattern:
- Collect rewards for all samples in a group.
- Compute the group mean reward: `mean_r = sum(rewards) / num_samples`.
- Center advantages: `advantage[i] = reward[i] - mean_r`.
- Construct a `Datum` with `weights` and `advantages` zeroed over the prompt tokens and the per-sample advantage applied over the response tokens.
Example from quickstart.py:
```python
# Score every sampled response in the group.
g_rewards, g_responses, g_logprobs = [], [], []
for seq in res.sequences:
    txt = tokenizer.decode(seq.tokens)
    reward = 1.0 if extract_answer(txt) == gold_answer else 0.0
    g_rewards.append(reward)
    g_responses.append(list(seq.tokens))
    g_logprobs.append(list(seq.logprobs or [0.0] * len(seq.tokens)))

# Center rewards within the group to get advantages.
mean_r = sum(g_rewards) / len(g_rewards)
advs = [r - mean_r for r in g_rewards]

# Build one Datum per non-empty response.
for resp_tok, lp, adv in zip(g_responses, g_logprobs, advs):
    if not resp_tok:
        continue
    full = prompt_tokens + resp_tok
    prefix = len(prompt_tokens) - 1
    datums.append(
        types.Datum(
            model_input=types.ModelInput.from_ints(tokens=full[:-1]),
            loss_fn_inputs={
                "target_tokens": full[1:],
                "weights": [0.0] * prefix + [1.0] * len(resp_tok),
                "logprobs": [0.0] * prefix + lp,
                "advantages": [0.0] * prefix + [adv] * len(resp_tok),
            },
        )
    )
```

The loss is then scaled by the advantages: samples with reward above `mean_r` receive positive gradient signal, and samples below `mean_r` receive negative signal. This pushes the policy to increase the probability of high-reward responses and decrease the probability of low-reward ones.
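Conceptually, the `importance_sampling` loss multiplies each response token's importance ratio (current policy vs. the policy that generated the sample) by its advantage. The following stand-alone sketch is illustrative only, not MinT's internal implementation; `new_logprobs` would come from the training forward pass:

```python
import math

def importance_sampling_loss(
    new_logprobs: list[float],   # per-token logprobs under the current policy
    old_logprobs: list[float],   # per-token logprobs stored at sampling time
    advantages: list[float],     # per-token advantages (zero over the prompt)
    weights: list[float],        # 1.0 on response tokens, 0.0 on prompt tokens
) -> float:
    """Illustrative per-token advantage-weighted importance-sampling objective."""
    total, count = 0.0, 0
    for new_lp, old_lp, adv, w in zip(new_logprobs, old_logprobs, advantages, weights):
        if w == 0.0:
            continue  # skip prompt tokens
        ratio = math.exp(new_lp - old_lp)  # importance ratio pi_new / pi_old
        total += -ratio * adv              # maximize advantage-weighted likelihood
        count += 1
    return total / max(count, 1)
```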
All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `loss_fn` | str | `"importance_sampling"` | GRPO is the default and recommended RL algorithm. Alternatives (server accepts; not exercised in canonical scripts): `"ppo"`, `"cispo"`, `"dro"`. |
| `group_size` | int | 4 | Samples per prompt. Larger groups reduce variance but increase memory. Typical: 4–16. |
| `batch_size` (or `groups_per_batch`) | int | 8 | Prompts per training step. |
| `max_tokens` | int | 16 | Max generation length per sample. Task-dependent: math ≈ 16, chat ≈ 128, code ≈ 256. |
| `learning_rate` | float | 2e-5 | Adam LR. RL typically uses lower LR than SFT (1e-5 to 4e-5). |
| `temperature` | float | 0.8 | Sampling temperature. Higher = more exploration. Typical: 0.7–1.0 for RL. |
| `kl_penalty_coef` | float | 0.0 | KL divergence penalty (reference model vs. policy). Set > 0 to add `coef * KL` to the loss; prevents policy collapse (see the sketch after this table). |
| `betas` | tuple[float, float] | (0.9, 0.999) | Adam exponential decay rates. |
| `eps` | float | 1e-8 | Adam numerical stability. |
| `weight_decay` | float | 0.0 | L2 regularization. |
| `base_model` | str | `"Qwen/Qwen3-0.6B"` | Base model ID. |
| `rank` | int | 16 | LoRA rank. |
| `train_mlp` | bool | True | Train MLP layers. |
| `train_attn` | bool | True | Train attention layers. |
| `train_unembed` | bool | True | Train output layer. |
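As a rough illustration of `kl_penalty_coef` (not MinT's internal formula; the helper below is hypothetical), a KL estimate against a frozen reference model can be computed from per-token log-probabilities and added to the loss:

```python
# Illustrative only: estimate KL(policy || reference) on the sampled response
# tokens and fold it into the loss, scaled by kl_penalty_coef.
def kl_penalized_loss(
    base_loss: float,
    policy_logprobs: list[float],   # response-token logprobs under the policy
    ref_logprobs: list[float],      # same tokens scored by the frozen reference model
    kl_penalty_coef: float = 0.1,
) -> float:
    kl_est = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs)) / max(len(policy_logprobs), 1)
    return base_loss + kl_penalty_coef * kl_est
```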
Usage:
```python
adam_params = types.AdamParams(learning_rate=2e-5)

result = training_client.forward_backward(
    datums,
    loss_fn="importance_sampling",
).result()
training_client.optim_step(adam_params).result()
```

What's next?
- Math RL — deterministic verifiers for exact-match grading.
- Chat RL — preference-based rewards from a judge model.
- Code RL — execution-based rewards from a sandbox.
- rl_core.py — the shared GRPO loop and RLAdapter protocol.