RL Best Practices (Practical Checklist)
This page is a set of rules of thumb for running reinforcement learning (RL) with MinT. It stays intentionally general so you can apply it to math, code, and tool-using agents.
1) Decide if RL is the right lever
RL is a good fit when:
- You can define a reward function that is cheap to compute and hard to game.
- The task has a clear success condition (verifiable answers, executable tests, pass/fail checks).
- You care about end-to-end behavior (constraints, style, safety, cost), not just imitating examples.
Prefer other methods when:
- You can write high-quality demonstrations: start with SFT.
- You can compare outputs but not score them reliably: consider preference learning (e.g., DPO/RLHF).
2) Choose the loop shape: Model RL vs Agentic RL
Model RL (single-step):
- Input: prompt
- Output: completion
- Reward: based mostly on the completion
- Great for: classification, math, code generation, single-turn QA
Agentic RL (multi-step):
- The model interacts with tools or an environment over multiple steps.
- Reward can include both the final outcome and process metrics (tool cost, step count, policy compliance).
- Great for: tool calling, search, workflows, multi-turn assistants
A simple mental model (pseudo-code; not a MinT API):
```python
def sample_with_trace(prompt: str) -> tuple[str, dict]:
    # Produce an answer and a trace you can later inspect.
    answer = run_agent(prompt)
    trace = {"tool_calls": [], "messages": []}
    return answer, trace

def compute_reward(prompt: str, answer: str, trace: dict) -> float:
    return score(prompt, answer, trace)
```

In MinT terms: `sample_with_trace(...)` is just sampling / running your agent (e.g., `sampling_client.sample(...)`), and `compute_reward(...)` is your own grader. You then convert tokens + old logprobs + advantages into `types.Datum` for training.
3) Reward design: make it learnable (and hard to hack)
- Start with the simplest reward that reflects success, then add detail.
- Keep the scale bounded and consistent (e.g., [0, 1]), especially across mixed tasks.
- Prefer partial credit over pure binary rewards once the loop works (it reduces `datums=0` situations).
- Add explicit penalties for behaviors you never want (unsafe content, tool spam, excessive verbosity).
- Always maintain a small holdout set and evaluate there; avoid tuning only on training rewards.
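Putting those rules together, here is one hypothetical grader for a verifiable string-match task (the thresholds, partial-credit rule, and tool-spam penalty are illustrative choices, not a MinT API):

```python
def grade_answer(answer: str, expected: str, tool_calls: int) -> float:
    # Bounded reward in [0, 1] with partial credit and an explicit penalty.
    if answer.strip() == expected.strip():
        reward = 1.0
    elif expected.strip() in answer:
        # Correct answer buried in verbose output: partial credit.
        reward = 0.5
    else:
        reward = 0.0
    # Penalize tool spam: every call beyond 5 costs 0.1.
    penalty = 0.1 * max(0, tool_calls - 5)
    return max(0.0, reward - penalty)
```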
4) Sampling: get useful diversity without blowing up cost
- `group_size` increases learning signal (relative comparisons) but costs more inference.
- Use higher temperature early to explore; lower it later to stabilize.
- Ensure `max_tokens` is large enough for valid answers; truncation can create misleading rewards.
- Use stop tokens / termination rules to avoid runaways.
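The reason `group_size` buys learning signal is that rewards within a group can be turned into relative advantages. A minimal sketch of mean-centering (one common choice; not a MinT API):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    # Center each reward on its group's mean: samples that beat their
    # siblings get positive advantage, the rest negative.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Note that if every sample in a group gets the same reward, all advantages are zero and the group contributes nothing to training.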
5) Data strategy: start small, then scale what works
- Start with tens to hundreds of high-quality prompts to validate the loop.
- Track saturation: when the model solves items perfectly, they stop providing signal.
- Prefer curating “just hard enough” items over adding lots of easy data.
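Saturation tracking can be as simple as filtering on per-prompt solve rates. The helper below is a sketch with illustrative thresholds (drop prompts the model always solves or never solves, since neither produces signal):

```python
def filter_saturated(solve_rates: dict[str, float],
                     low: float = 0.05, high: float = 0.95) -> list[str]:
    # Keep prompts that are "just hard enough": solve rate strictly
    # between the low and high thresholds.
    return [p for p, rate in solve_rates.items() if low < rate < high]
```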
6) Stability knobs (when things get noisy)
- `importance_sampling` is a good baseline for simple loops.
- If training becomes unstable (reward spikes, collapse), consider a clipped objective (PPO/CISPO) and/or a smaller learning rate.
- Save checkpoints frequently so you can bisect regressions.
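For intuition, the clipped objective can be sketched per token. This is a PPO-style clipped surrogate shown for illustration, not necessarily MinT's exact loss:

```python
import math

def clipped_objective(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    # The importance ratio is clipped to [1 - eps, 1 + eps] so that one
    # lucky sample cannot push the policy arbitrarily far in one step.
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)
```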
7) Observability and debugging checklist
Log and inspect a few samples every step:
- Prompt, completion, reward, length
- Any tool calls / traces (for agentic RL)
- A small dashboard metric: success rate + average reward
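A per-step metrics helper can be a few lines. The sketch below assumes each sample carries at least a `"reward"` field and that a reward of 1.0 counts as success (both illustrative conventions):

```python
def step_metrics(samples: list[dict]) -> dict:
    # Minimal per-step dashboard: success rate + average reward.
    n = len(samples)
    rewards = [s["reward"] for s in samples]
    return {
        "success_rate": sum(r >= 1.0 for r in rewards) / n,
        "avg_reward": sum(rewards) / n,
    }
```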
Common failure modes:
- No learning signal: all samples get the same reward (see the `datums=0` note in Mini RL Trip).
- Reward hacking: the model learns to exploit the grader instead of solving the task.
- Masked loss bug: you accidentally train on prompt tokens (make sure weights mask the prompt).
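A quick way to catch the masked-loss bug is to check per-token weights directly. The sketch below assumes the common convention that weight 0 masks a token and weight 1 trains on it:

```python
def make_weights(n_prompt: int, n_completion: int) -> list[float]:
    # Zero weight on prompt tokens (never train on them),
    # unit weight on completion tokens.
    return [0.0] * n_prompt + [1.0] * n_completion

def assert_prompt_masked(weights: list[float], n_prompt: int) -> None:
    # Cheap sanity check worth running on a few datums every step.
    assert all(w == 0.0 for w in weights[:n_prompt]), "training on prompt tokens!"
```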