RL Hyperparameters
This recipe runs a real RL hyperparameter sweep on MinT. It takes a tiny arithmetic MessageEnv, wraps it in EnvFromMessageEnv, and trains through recipe.rl.train.main().
The recipe does not hand-write fake PPO datums. It uses the same environment and rollout path that real multi-turn RL recipes use.
Use Case
- RL smoke testing: check that MessageEnv, rollout, sampling, and training all run end to end on MinT.
- Group-size exploration: compare small GRPO-style groups before moving to a larger task.
- Temperature exploration: check how sampling temperature affects rollout behavior.
- KL penalty checks: confirm that RL configs with KL regularization use a real reference model.
What to Sweep
Default grid:
kl_penalty_coef: [0.0, 0.02]
temperature: [0.7, 1.0]
group_size: [2, 4]
max_steps: 1 step per config
configs: 2 x 2 x 2 = 8
Run:
export MINT_API_KEY=sk-your-api-key
python recipes/rl_hyperparameters.py
Override the grid:
MINT_RL_STEPS=1 \
MINT_RL_KL_COEFS=0.0,0.02 \
MINT_RL_TEMPS=0.7,1.0 \
MINT_RL_GROUPS=2,4 \
python recipes/rl_hyperparameters.py
Environment Shape
The environment is intentionally small:
class ArithmeticMessageEnv(MessageEnv):
    # self.question and self.answer are set in the constructor (not shown in this excerpt).

    async def initial_observation(self):
        # The system prompt asks for a bare number so the reply is easy to parse.
        return [
            {
                "role": "system",
                "content": "You are a precise calculator. Reply with only the final number.",
            },
            {"role": "user", "content": self.question},
        ]

    async def step(self, message):
        content = _extract_content(message).strip()
        prediction = _first_number(content)
        correct = prediction == self.answer
        # Single-turn episode: score the reply and end immediately.
        return MessageStepResult(
            reward=1.0 if correct else -0.25,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
        )
The dataset builder then creates a group of environments:
@dataclass(frozen=True)
class ArithmeticEnvGroupBuilder(recipe.rl.types.EnvGroupBuilder):
    question: str
    answer: str
    group_size: int
    renderer_name: str
    model_name: str

    async def make_envs(self):
        tokenizer = get_tokenizer(self.model_name)
        renderer = recipe.renderers.get_renderer(self.renderer_name, tokenizer)
        # One environment per group member, all sharing the same question.
        return [
            EnvFromMessageEnv(
                renderer=renderer,
                message_env=ArithmeticMessageEnv(self.question, self.answer),
                max_trajectory_tokens=512,
                max_generation_tokens=64,
            )
            for _ in range(self.group_size)
        ]
Training Config
Each grid item calls recipe.rl.train.main():
# A reference model is only needed when the KL penalty is active.
kl_reference_config = (
    recipe.rl.train.KLReferenceConfig(base_model=MODEL)
    if kl_coef > 0
    else None
)
config = recipe.rl.train.Config(
    learning_rate=1e-5,
    dataset_builder=ArithmeticRLDatasetBuilder(...),
    model_name=MODEL,
    renderer_name="qwen3",
    lora_rank=16,
    max_tokens=64,
    temperature=temperature,
    kl_penalty_coef=kl_coef,
    kl_reference_config=kl_reference_config,
    loss_fn="importance_sampling",
    max_steps=1,
    save_every=999,
    eval_every=999,
)
await recipe.rl.train.main(config=config)
If kl_penalty_coef > 0, recipe.rl.train.Config must set kl_reference_config. This recipe sets KLReferenceConfig(base_model=MODEL) for those configs.
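To make the sweep concrete, the sketch below expands the default grid into individual configs and marks which ones need a KL reference. It is a standalone illustration, not the recipe's code: `parse_list` and `build_grid` are hypothetical helpers, and the assumption that the `MINT_RL_*` variables are parsed as comma-separated lists comes only from the override example on this page.

```python
import itertools


def parse_list(raw, cast, default):
    """Parse a comma-separated string such as "0.0,0.02"; use default if unset or empty."""
    if not raw:
        return list(default)
    return [cast(item) for item in raw.split(",") if item.strip()]


def build_grid(environ):
    """Expand the sweep into one dict per config (hypothetical helper)."""
    kl_coefs = parse_list(environ.get("MINT_RL_KL_COEFS"), float, [0.0, 0.02])
    temps = parse_list(environ.get("MINT_RL_TEMPS"), float, [0.7, 1.0])
    groups = parse_list(environ.get("MINT_RL_GROUPS"), int, [2, 4])
    return [
        {
            "kl_penalty_coef": kl,
            "temperature": temp,
            "group_size": group,
            # Mirrors the rule above: a reference model only when the penalty is active.
            "needs_kl_reference": kl > 0,
        }
        for kl, temp, group in itertools.product(kl_coefs, temps, groups)
    ]


grid = build_grid({})  # an empty mapping stands in for an unset environment
print(len(grid))
print(sum(item["needs_kl_reference"] for item in grid))
```

With the defaults this yields 8 configs, 4 of which need a KL reference. Passing `os.environ` instead of `{}` would pick up overrides from the shell.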
Full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/rl_hyperparameters.py
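The environment excerpt on this page calls two helpers, _extract_content and _first_number, without showing them; the real versions live in the full source. The sketch below is a hypothetical stand-in that shows one plausible shape for them. The function names `extract_content` and `first_number`, the regex, and the choice to return the number as a string (so it can be compared with the string-typed answer) are all assumptions.

```python
import re

# Matches an optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_content(message):
    """Return the text of a chat message given as a dict or a plain string."""
    if isinstance(message, dict):
        content = message.get("content", "")
    else:
        content = message
    return content if isinstance(content, str) else str(content)


def first_number(text):
    """Return the first integer or decimal found in text as a string, or None."""
    match = _NUMBER.search(text)
    return match.group(0) if match else None


reply = {"role": "assistant", "content": " The answer is 42. "}
print(first_number(extract_content(reply).strip()))
```

Because this sketch compares strings, a reply of "42.0" would not match an answer of "42". A real implementation may normalize numbers before comparing.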
Verified Run
Verified on MinT with Qwen/Qwen3-0.6B, 1 training step per config:
| KL coef | Temperature | Group size | Steps | Mean reward |
|---|---|---|---|---|
| 0.00 | 0.7 | 2 | 1 | -0.100 |
| 0.00 | 0.7 | 4 | 1 | -0.100 |
| 0.00 | 1.0 | 2 | 1 | -0.100 |
| 0.00 | 1.0 | 4 | 1 | -0.100 |
| 0.02 | 0.7 | 2 | 1 | -0.100 |
| 0.02 | 0.7 | 4 | 1 | -0.100 |
| 0.02 | 1.0 | 2 | 1 | -0.100 |
| 0.02 | 1.0 | 4 | 1 | -0.100 |
The default small task is a smoke test. The reward numbers here do not reflect model quality; they show that all 8 configs completed rollout, training, KL reference handling, and checkpoint save.
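For readers who want to interpret a mean reward anyway, the sketch below converts between accuracy and mean reward using the two reward values from step() (1.0 for correct, -0.25 for incorrect). It assumes the reported mean is a plain average of per-episode rewards, which this page does not state; the function names are illustrative.

```python
CORRECT_REWARD = 1.0   # reward for a correct answer, from step()
WRONG_REWARD = -0.25   # reward for an incorrect answer, from step()


def mean_reward(accuracy):
    """Average reward when a fraction `accuracy` of episodes is correct."""
    return accuracy * CORRECT_REWARD + (1.0 - accuracy) * WRONG_REWARD


def implied_accuracy(mean):
    """Invert mean_reward: the accuracy that would produce a given mean reward."""
    return (mean - WRONG_REWARD) / (CORRECT_REWARD - WRONG_REWARD)


print(round(implied_accuracy(-0.100), 3))  # 0.12
```

Under that assumption, a mean reward of -0.100 corresponds to about 12% of episodes being correct, and a mean of 0.0 would require 20%.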
Why This Shape Works
MessageEnv
│ defines prompt, step(), reward
▼
EnvFromMessageEnv
│ renderer bridge: messages → tokens
▼
EnvGroupBuilder
│ creates group_size copies of the environment for group-relative RL
▼
RLDatasetBuilder
│ hands the groups to recipe.rl.train
▼
recipe.rl.train.main()
│ sampling, reward computation, training with importance_sampling
▼
metrics.jsonl + checkpoint
This is the RL recipe path users actually need when building custom environments.
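The "group-relative" step in the diagram can be illustrated with a small sketch. It uses mean-centering, a common GRPO-style choice: each rollout's advantage is its reward minus the mean reward of its group. Whether recipe.rl.train computes advantages exactly this way (for example, with or without standard-deviation scaling) is not stated on this page, so treat `group_relative_advantages` as a hypothetical illustration.

```python
def group_relative_advantages(rewards):
    """Center each reward on the mean of its group (assumed GRPO-style baseline)."""
    mean = sum(rewards) / len(rewards)
    return [reward - mean for reward in rewards]


# One correct rollout in a group of 4, using this recipe's reward values.
print(group_relative_advantages([1.0, -0.25, -0.25, -0.25]))
# → [0.9375, -0.3125, -0.3125, -0.3125]

# Every rollout wrong in a group of 2: all advantages are zero.
print(group_relative_advantages([-0.25, -0.25]))
# → [0.0, 0.0]
```

The second case shows why group size is worth sweeping: when every rollout in a group receives the same reward, mean-centering leaves no learning signal, and smaller groups hit that case more often.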