RL Best Practices (Practical Checklist)

This page collects rules of thumb for running reinforcement learning (RL) with MinT. It is intentionally general, so you can apply it to math, code, and tool-using agents.

1) Decide if RL is the right lever

RL is a good fit when:

  • You can define a reward function that is cheap to compute and hard to game.
  • The task has a clear success condition (verifiable answers, executable tests, pass/fail checks).
  • You care about end-to-end behavior (constraints, style, safety, cost), not just imitating examples.

Prefer other methods when:

  • You can write high-quality demonstrations: start with SFT.
  • You can compare outputs but not score them reliably: consider preference learning (e.g., DPO/RLHF).

2) Choose the loop shape: Model RL vs Agentic RL

Model RL (single-step):

  • Input: prompt
  • Output: completion
  • Reward: based mostly on the completion
  • Great for: classification, math, code generation, single-turn QA

Agentic RL (multi-step):

  • The model interacts with tools or an environment over multiple steps.
  • Reward can include both the final outcome and process metrics (tool cost, step count, policy compliance).
  • Great for: tool calling, search, workflows, multi-turn assistants

A simple mental model (pseudo-code; not a MinT API):

def sample_with_trace(prompt: str) -> tuple[str, dict]:
    # Produce an answer plus a trace you can inspect later.
    trace = {"tool_calls": [], "messages": []}
    answer = run_agent(prompt, trace)  # the agent appends tool calls / messages as it runs
    return answer, trace

def compute_reward(prompt: str, answer: str, trace: dict) -> float:
    # Your grader: any deterministic scoring of the final answer and its trace.
    return score(prompt, answer, trace)

In MinT terms: sample_with_trace(...) is just sampling / running your agent (e.g., sampling_client.sample(...)), and compute_reward(...) is your own grader. You then convert tokens + old logprobs + advantages into types.Datum for training.
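That conversion step can be sketched as follows. This is a hypothetical illustration: Datum below is a stand-in dataclass, not MinT's actual types.Datum (whose fields may differ), and the group-mean baseline is just one simple way to turn scalar rewards into per-token advantages.

```python
from dataclasses import dataclass

# Stand-in for MinT's types.Datum; the real field names may differ.
@dataclass
class Datum:
    tokens: list[int]        # prompt + completion token ids
    logprobs: list[float]    # logprobs recorded at sampling time
    advantages: list[float]  # per-token advantage (one scalar broadcast here)

def make_datums(samples: list[dict]) -> list[Datum]:
    """Turn a group of sampled completions for one prompt into datums.

    Each sample dict holds `tokens`, `logprobs`, and a scalar `reward`.
    The advantage is the reward minus the group mean (a simple baseline),
    broadcast over the sample's tokens.
    """
    mean_reward = sum(s["reward"] for s in samples) / len(samples)
    return [
        Datum(
            tokens=s["tokens"],
            logprobs=s["logprobs"],
            advantages=[s["reward"] - mean_reward] * len(s["tokens"]),
        )
        for s in samples
    ]
```

Note that a group where every sample got the same reward yields all-zero advantages, which is exactly the no-signal case discussed below.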

3) Reward design: make it learnable (and hard to hack)

  • Start with the simplest reward that reflects success, then add detail.
  • Keep the scale bounded and consistent (e.g., [0, 1]), especially across mixed tasks.
  • Prefer partial credit over pure binary rewards once the loop works (it reduces datums=0 situations).
  • Add explicit penalties for behaviors you never want (unsafe content, tool spam, excessive verbosity).
  • Always maintain a small holdout set and evaluate there; avoid tuning only on training rewards.
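A minimal grader that follows these rules might look like the sketch below. The thresholds, the substring checks, and the length cutoff are all illustrative placeholders for your task's real scoring, not MinT defaults.

```python
def grade(answer: str, expected: str, max_len: int = 2000) -> float:
    """Bounded reward in [0, 1] with partial credit and an explicit penalty.

    All thresholds here are illustrative stand-ins for task-specific checks.
    """
    reward = 0.0
    if expected in answer:
        reward = 1.0                   # full credit: expected answer appears
    elif any(tok in answer for tok in expected.split()):
        reward = 0.3                   # partial credit: some expected tokens appear
    if len(answer) > max_len:
        reward -= 0.2                  # explicit penalty for excessive verbosity
    return max(0.0, min(1.0, reward))  # keep the scale bounded
```

Keeping the final clamp in one place makes it easy to verify the scale stays consistent when you later mix tasks.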

4) Sampling: get useful diversity without blowing up cost

  • group_size increases learning signal (relative comparisons) but costs more inference.
  • Use higher temperature early to explore; lower it later to stabilize.
  • Ensure max_tokens is large enough for valid answers; truncation can create misleading rewards.
  • Use stop tokens / termination rules to avoid runaways.
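One way to implement the explore-early, stabilize-late advice is a simple temperature schedule. The linear shape and the endpoint values below are illustrative choices, not MinT defaults.

```python
def temperature_at(step: int, total_steps: int,
                   t_start: float = 1.0, t_end: float = 0.3) -> float:
    """Linearly anneal sampling temperature from t_start down to t_end.

    Higher temperature early explores more diverse completions; lowering
    it later stabilizes training. Endpoints here are illustrative.
    """
    frac = min(1.0, step / max(1, total_steps))
    return t_start + (t_end - t_start) * frac
```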

5) Data strategy: start small, then scale what works

  • Start with tens to hundreds of high-quality prompts to validate the loop.
  • Track saturation: when the model solves items perfectly, they stop providing signal.
  • Prefer curating “just hard enough” items over adding lots of easy data.
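Saturation tracking can be as simple as dropping prompts whose sampled groups all received the same reward. A sketch, assuming you keep a list of group rewards per prompt:

```python
def still_learnable(rewards_by_prompt: dict[str, list[float]]) -> list[str]:
    """Keep prompts whose sampled groups still carry learning signal.

    A group where every sample got the same reward (all solved, or all
    failed) contributes zero advantage, so we drop it from the next round.
    """
    return [
        prompt
        for prompt, rewards in rewards_by_prompt.items()
        if max(rewards) != min(rewards)  # mixed outcomes => nonzero advantages
    ]
```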

6) Stability knobs (when things get noisy)

  • importance_sampling is a good baseline for simple loops.
  • If training becomes unstable (reward spikes, collapse), consider a clipped objective (PPO/CISPO) and/or a smaller learning rate.
  • Save checkpoints frequently so you can bisect regressions.
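To make “clipped objective” concrete, here is a per-token PPO-style surrogate weight as a standalone sketch; this is illustrative pseudo-math in Python, not a MinT API.

```python
import math

def clipped_weight(logp_new: float, logp_old: float,
                   advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO-style clipped surrogate term (illustrative sketch).

    The importance ratio exp(logp_new - logp_old) is clipped to
    [1 - eps, 1 + eps], and we take the pessimistic (min) of the
    unclipped and clipped terms, which caps how far one update can move.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With equal old/new logprobs the ratio is 1 and the term reduces to the plain advantage; large ratios with positive advantage get capped at (1 + eps) times the advantage.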

7) Observability and debugging checklist

Log and inspect a few samples every step:

  • Prompt, completion, reward, length
  • Any tool calls / traces (for agentic RL)
  • A small dashboard metric: success rate + average reward

Common failure modes:

  • No learning signal: all samples get the same reward (see the datums=0 note in Mini RL Trip).
  • Reward hacking: the model learns to exploit the grader instead of solving the task.
  • Masked loss bug: you accidentally train on prompt tokens (make sure weights mask the prompt).
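The masked-loss check is easiest to get right when the mask is built explicitly. A sketch, assuming the usual prompt-then-completion token layout:

```python
def loss_weights(prompt_len: int, total_len: int) -> list[float]:
    """Zero-weight the prompt tokens so the loss covers only the completion.

    A common bug is training on the full sequence; building the mask in
    one place makes it easy to assert on in tests.
    """
    return [0.0] * prompt_len + [1.0] * (total_len - prompt_len)
```

A quick sanity check: the sum of the weights should equal the completion length, never the total sequence length.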