RL Hyperparameters
This recipe runs a real RL hyperparameter sweep on MinT. It uses a tiny arithmetic MessageEnv, wraps it with EnvFromMessageEnv, and trains with recipe.rl.train.main().
The recipe does not hand-build fake PPO datums. It uses the same environment and rollout path as real multi-turn RL recipes.
Use Case
- RL smoke testing: Check that MessageEnv, rollout, sampling, and training work on MinT.
- Group-size exploration: Compare small GRPO-style groups before a larger task.
- Temperature exploration: Check how sampling temperature changes rollout behavior.
- KL penalty checks: Verify KL-regularized RL configs use a real reference model.
What the Recipe Sweeps
Default grid:
kl_penalty_coef: [0.0, 0.02]
temperature: [0.7, 1.0]
group_size: [2, 4]
max_steps: 1 per config
configs: 2 x 2 x 2 = 8

Run it:
export MINT_API_KEY=sk-your-api-key
python recipes/rl_hyperparameters.py

Override the grid:
MINT_RL_STEPS=1 \
MINT_RL_KL_COEFS=0.0,0.02 \
MINT_RL_TEMPS=0.7,1.0 \
MINT_RL_GROUPS=2,4 \
python recipes/rl_hyperparameters.py

Environment Shape
The environment is intentionally small:
class ArithmeticMessageEnv(MessageEnv):
    async def initial_observation(self):
        # The prompt: a system instruction plus one arithmetic question.
        return [
            {
                "role": "system",
                "content": "You are a precise calculator. Reply with only the final number.",
            },
            {"role": "user", "content": self.question},
        ]

    async def step(self, message):
        content = _extract_content(message).strip()
        prediction = _first_number(content)
        correct = prediction == self.answer
        # Single-turn episode: the first reply is scored and the episode ends.
        return MessageStepResult(
            reward=1.0 if correct else -0.25,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
        )

Then the dataset builder creates groups of environments:
@dataclass(frozen=True)
class ArithmeticEnvGroupBuilder(recipe.rl.types.EnvGroupBuilder):
    question: str
    answer: str
    group_size: int
    renderer_name: str
    model_name: str

    async def make_envs(self):
        tokenizer = get_tokenizer(self.model_name)
        renderer = recipe.renderers.get_renderer(self.renderer_name, tokenizer)
        # One environment per group member, all sharing the same question.
        return [
            EnvFromMessageEnv(
                renderer=renderer,
                message_env=ArithmeticMessageEnv(self.question, self.answer),
                max_trajectory_tokens=512,
                max_generation_tokens=64,
            )
            for _ in range(self.group_size)
        ]

Training Config
Each grid item calls recipe.rl.train.main():
# A reference model is attached only when the KL penalty is active.
kl_reference_config = (
    recipe.rl.train.KLReferenceConfig(base_model=MODEL)
    if kl_coef > 0
    else None
)
config = recipe.rl.train.Config(
    learning_rate=1e-5,
    dataset_builder=ArithmeticRLDatasetBuilder(...),
    model_name=MODEL,
    renderer_name="qwen3",
    lora_rank=16,
    max_tokens=64,
    temperature=temperature,
    kl_penalty_coef=kl_coef,
    kl_reference_config=kl_reference_config,
    loss_fn="importance_sampling",
    max_steps=1,
    save_every=999,
    eval_every=999,
)
await recipe.rl.train.main(config=config)

If kl_penalty_coef > 0, recipe.rl.train.Config requires kl_reference_config. The recipe sets KLReferenceConfig(base_model=MODEL) for those configs.
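The requirement described above can be illustrated with a small stand-in. This is a sketch only, not the real recipe.rl.train.Config: SketchConfig, SketchKLReference, and build_sketch_config are hypothetical names, and the check in __post_init__ is an assumption about how such a rule might be enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchKLReference:
    """Hypothetical stand-in for KLReferenceConfig."""
    base_model: str


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the KL-related fields of the training config."""
    kl_penalty_coef: float = 0.0
    kl_reference_config: Optional[SketchKLReference] = None

    def __post_init__(self) -> None:
        # A positive KL penalty needs a reference model to compare against.
        if self.kl_penalty_coef > 0 and self.kl_reference_config is None:
            raise ValueError("kl_penalty_coef > 0 requires kl_reference_config")


def build_sketch_config(kl_coef: float, model: str) -> SketchConfig:
    # Same pattern as the recipe: attach a reference only when the penalty is active.
    reference = SketchKLReference(base_model=model) if kl_coef > 0 else None
    return SketchConfig(kl_penalty_coef=kl_coef, kl_reference_config=reference)
```

With this shape, a config with a coefficient of 0.0 carries no reference, and constructing a positive-coefficient config without one fails at construction time rather than partway through a sweep.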
View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/rl_hyperparameters.py
Verified Run
Verified on MinT with Qwen/Qwen3-0.6B, one training step per config:
| KL | Temperature | Group size | Steps | Mean reward |
|---|---|---|---|---|
| 0.00 | 0.7 | 2 | 1 | -0.100 |
| 0.00 | 0.7 | 4 | 1 | -0.100 |
| 0.00 | 1.0 | 2 | 1 | -0.100 |
| 0.00 | 1.0 | 4 | 1 | -0.100 |
| 0.02 | 0.7 | 2 | 1 | -0.100 |
| 0.02 | 0.7 | 4 | 1 | -0.100 |
| 0.02 | 1.0 | 2 | 1 | -0.100 |
| 0.02 | 1.0 | 4 | 1 | -0.100 |
The tiny default task is a smoke test. The reward values are not a model-quality result; they show only that all 8 configs complete rollouts, training, KL reference handling, and checkpoint save.
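The eight rows in the table above are the Cartesian product of the three swept values. The sketch below shows one way to enumerate that grid with the documented environment-variable overrides. The variable names and defaults come from this page; the parsing logic and the helper names parse_list and build_grid are assumptions, and the actual recipe may read its overrides differently.

```python
import itertools
from typing import Callable, List, Mapping


def parse_list(raw: str, cast: Callable) -> list:
    # "0.0,0.02" -> [0.0, 0.02]; empty items are skipped.
    return [cast(item) for item in raw.split(",") if item.strip()]


def build_grid(env: Mapping[str, str]) -> List[dict]:
    # Fall back to the default grid when a variable is not set.
    kl_coefs = parse_list(env.get("MINT_RL_KL_COEFS", "0.0,0.02"), float)
    temps = parse_list(env.get("MINT_RL_TEMPS", "0.7,1.0"), float)
    groups = parse_list(env.get("MINT_RL_GROUPS", "2,4"), int)
    # KL coefficient varies slowest and group size fastest, as in the table.
    return [
        {"kl_penalty_coef": kl, "temperature": temp, "group_size": group}
        for kl, temp, group in itertools.product(kl_coefs, temps, groups)
    ]
```

Calling build_grid(os.environ) would pick up any overrides exported in the shell; passing a plain dict keeps the function easy to check in isolation.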
Why This Shape Works
MessageEnv
│ defines prompt, step(), reward
▼
EnvFromMessageEnv
│ renderer bridge: messages → tokens
▼
EnvGroupBuilder
│ creates group_size copies for group-relative RL
▼
RLDatasetBuilder
│ feeds groups into recipe.rl.train
▼
recipe.rl.train.main()
│ samples, computes rewards, trains with importance_sampling
▼
metrics.jsonl + checkpoint

This is the real RL recipe path users need for custom environments.
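The "group-relative" step in the diagram is why group_size is worth sweeping. As a minimal sketch, assume the common GRPO-style centering in which each rollout's advantage is its reward minus the mean reward of its group. Whether recipe.rl.train also rescales by the group's standard deviation is not stated on this page, so the sketch does not. group_advantages is a hypothetical helper, and the rewards in the usage note are the 1.0 and -0.25 values from the arithmetic environment.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    # Center each reward on the mean of its own group.
    if not rewards:
        raise ValueError("a group needs at least one rollout")
    mean = sum(rewards) / len(rewards)
    return [reward - mean for reward in rewards]
```

Under this assumption, a group of four with one correct answer, [1.0, -0.25, -0.25, -0.25], yields [0.9375, -0.3125, -0.3125, -0.3125]. A group in which every rollout is wrong, such as [-0.25, -0.25], yields all zeros and so carries no learning signal, which is one reason very small groups can be uninformative on a task the model rarely solves.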