RL Best Practices (Practical Checklist)

This page collects rules of thumb for running reinforcement learning (RL) with MinT. It is intentionally general, so you can apply it to math, code, and tool-using agents.

1) Decide if RL is the right lever

RL is a good fit when:

  • You can define a reward function that is cheap to compute and hard to game.
  • The task has a clear success condition (verifiable answers, executable tests, pass/fail checks).
  • You care about end-to-end behavior (constraints, style, safety, cost), not just imitating examples.

Prefer other methods when:

  • You can write high-quality demonstrations: start with SFT.
  • You can compare outputs but not score them reliably: consider preference learning (e.g., DPO/RLHF).

2) Choose the loop shape: Model RL vs Agentic RL

Model RL (single-step):

  • Input: prompt
  • Output: completion
  • Reward: based mostly on the completion
  • Great for: classification, math, code generation, single-turn QA

Agentic RL (multi-step):

  • The model interacts with tools or an environment over multiple steps.
  • Reward can include both the final outcome and process metrics (tool cost, step count, policy compliance).
  • Great for: tool calling, search, workflows, multi-turn assistants

A simple mental model (pseudo-code; not a MinT API):

def sample_with_trace(prompt: str) -> tuple[str, dict]:
    # Produce an answer plus a trace you can inspect later.
    trace = {"tool_calls": [], "messages": []}
    answer = run_agent(prompt, trace)  # the agent appends tool calls / messages as it runs
    return answer, trace

def compute_reward(prompt: str, answer: str, trace: dict) -> float:
    # Your grader: any deterministic scoring of the final answer and its trace.
    return score(prompt, answer, trace)

In MinT terms: sample_with_trace(...) is just sampling / running your agent (e.g., sampling_client.sample(...)), and compute_reward(...) is your own grader. You then convert tokens + old logprobs + advantages into types.Datum for training.
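That conversion step can be sketched as follows. This is a hypothetical illustration: Datum below is a stand-in dataclass, not MinT's actual types.Datum (whose fields may differ), and the group-mean baseline is just one simple way to turn scalar rewards into per-token advantages.

```python
from dataclasses import dataclass

# Stand-in for MinT's types.Datum; the real field names may differ.
@dataclass
class Datum:
    tokens: list[int]        # prompt + completion token ids
    logprobs: list[float]    # logprobs recorded at sampling time
    advantages: list[float]  # per-token advantage (one scalar broadcast here)

def make_datums(samples: list[dict]) -> list[Datum]:
    """Turn a group of sampled completions for one prompt into datums.

    Each sample dict holds `tokens`, `logprobs`, and a scalar `reward`.
    The advantage is the reward minus the group mean (a simple baseline),
    broadcast over the sample's tokens.
    """
    mean_reward = sum(s["reward"] for s in samples) / len(samples)
    return [
        Datum(
            tokens=s["tokens"],
            logprobs=s["logprobs"],
            advantages=[s["reward"] - mean_reward] * len(s["tokens"]),
        )
        for s in samples
    ]
```

Note that a group where every sample got the same reward yields all-zero advantages, which is exactly the no-signal case discussed below.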

3) Reward design: make it learnable (and hard to hack)

  • Start with the simplest reward that reflects success, then add detail.
  • Keep the scale bounded and consistent (e.g., [0, 1]), especially across mixed tasks.
  • Prefer partial credit over pure binary rewards once the loop works (it reduces datums=0 situations).
  • Add explicit penalties for behaviors you never want (unsafe content, tool spam, excessive verbosity).
  • Always maintain a small holdout set and evaluate there; avoid tuning only on training rewards.
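A minimal grader that follows these rules might look like the sketch below. The thresholds, the substring checks, and the length cutoff are all illustrative placeholders for your task's real scoring, not MinT defaults.

```python
def grade(answer: str, expected: str, max_len: int = 2000) -> float:
    """Bounded reward in [0, 1] with partial credit and an explicit penalty.

    All thresholds here are illustrative stand-ins for task-specific checks.
    """
    reward = 0.0
    if expected in answer:
        reward = 1.0                   # full credit: expected answer appears
    elif any(tok in answer for tok in expected.split()):
        reward = 0.3                   # partial credit: some expected tokens appear
    if len(answer) > max_len:
        reward -= 0.2                  # explicit penalty for excessive verbosity
    return max(0.0, min(1.0, reward))  # keep the scale bounded
```

Keeping the final clamp in one place makes it easy to verify the scale stays consistent when you later mix tasks.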

4) Sampling: get useful diversity without blowing up cost

  • group_size increases learning signal (relative comparisons) but costs more inference.
  • Use higher temperature early to explore; lower it later to stabilize.
  • Ensure max_tokens is large enough for valid answers; truncation can create misleading rewards.
  • Use stop tokens / termination rules to avoid runaways.
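One way to implement the explore-early, stabilize-late advice is a simple temperature schedule. The linear shape and the endpoint values below are illustrative choices, not MinT defaults.

```python
def temperature_at(step: int, total_steps: int,
                   t_start: float = 1.0, t_end: float = 0.3) -> float:
    """Linearly anneal sampling temperature from t_start down to t_end.

    Higher temperature early explores more diverse completions; lowering
    it later stabilizes training. Endpoints here are illustrative.
    """
    frac = min(1.0, step / max(1, total_steps))
    return t_start + (t_end - t_start) * frac
```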

5) Data strategy: start small, then scale what works

  • Start with tens to hundreds of high-quality prompts to validate the loop.
  • Track saturation: when the model solves items perfectly, they stop providing signal.
  • Prefer curating “just hard enough” items over adding lots of easy data.
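Saturation tracking can be as simple as dropping prompts whose sampled groups all received the same reward. A sketch, assuming you keep a list of group rewards per prompt:

```python
def still_learnable(rewards_by_prompt: dict[str, list[float]]) -> list[str]:
    """Keep prompts whose sampled groups still carry learning signal.

    A group where every sample got the same reward (all solved, or all
    failed) contributes zero advantage, so we drop it from the next round.
    """
    return [
        prompt
        for prompt, rewards in rewards_by_prompt.items()
        if max(rewards) != min(rewards)  # mixed outcomes => nonzero advantages
    ]
```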

6) Stability knobs (when things get noisy)

  • importance_sampling is a good baseline for simple loops.
  • If training becomes unstable (reward spikes, collapse), consider a clipped objective (PPO/CISPO) and/or a smaller learning rate.
  • Save checkpoints frequently so you can bisect regressions.
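To make “clipped objective” concrete, here is a per-token PPO-style surrogate weight as a standalone sketch; this is illustrative pseudo-math in Python, not a MinT API.

```python
import math

def clipped_weight(logp_new: float, logp_old: float,
                   advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO-style clipped surrogate term (illustrative sketch).

    The importance ratio exp(logp_new - logp_old) is clipped to
    [1 - eps, 1 + eps], and we take the pessimistic (min) of the
    unclipped and clipped terms, which caps how far one update can move.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With equal old/new logprobs the ratio is 1 and the term reduces to the plain advantage; large ratios with positive advantage get capped at (1 + eps) times the advantage.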

7) Observability and debugging checklist

Log and inspect a few samples every step:

  • Prompt, completion, reward, length
  • Any tool calls / traces (for agentic RL)
  • A small dashboard metric: success rate + average reward

Common failure modes:

  • No learning signal: all samples get the same reward (see the datums=0 note in Mini RL Trip).
  • Reward hacking: the model learns to exploit the grader instead of solving the task.
  • Masked loss bug: you accidentally train on prompt tokens (make sure weights mask the prompt).
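The masked-loss check is easiest to get right when the mask is built explicitly. A sketch, assuming the usual prompt-then-completion token layout:

```python
def loss_weights(prompt_len: int, total_len: int) -> list[float]:
    """Zero-weight the prompt tokens so the loss covers only the completion.

    A common bug is training on the full sequence; building the mask in
    one place makes it easy to assert on in tests.
    """
    return [0.0] * prompt_len + [1.0] * (total_len - prompt_len)
```

A quick sanity check: the sum of the weights should equal the completion length, never the total sequence length.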