Mind Lab Toolkit (MinT)

RL Hyperparameters

This recipe runs a real RL hyperparameter sweep on MinT. It uses a tiny arithmetic MessageEnv, wraps it with EnvFromMessageEnv, and trains with recipe.rl.train.main().

The recipe does not hand-build fake PPO datums. It uses the same environment and rollout path as real multi-turn RL recipes.

Use Case

  • RL smoke testing: Check that MessageEnv, rollout, sampling, and training work on MinT.
  • Group-size exploration: Compare small GRPO-style groups before a larger task.
  • Temperature exploration: Check how sampling temperature changes rollout behavior.
  • KL penalty checks: Verify KL-regularized RL configs use a real reference model.
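To make the temperature bullet concrete, here is a minimal sketch of temperature-scaled sampling probabilities. It is generic softmax arithmetic, not MinT code; the function name and the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
probs_cool = softmax_with_temperature(logits, 0.7)
probs_warm = softmax_with_temperature(logits, 1.0)

for label, probs in (("0.7", probs_cool), ("1.0", probs_warm)):
    print(label, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the top-scoring token, so the rollouts within a group tend to be more alike; a higher temperature spreads probability and yields more varied rollouts.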

What the Recipe Sweeps

Default grid:

kl_penalty_coef: [0.0, 0.02]
temperature:     [0.7, 1.0]
group_size:      [2, 4]
max_steps:       1 per config
configs:         2 x 2 x 2 = 8
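The 2 x 2 x 2 count can be sketched with a Cartesian product. This is an illustration of the grid arithmetic only; the variable names are assumptions, and the recipe's own loop may be written differently.

```python
from itertools import product

# Default grid values as listed above.
grid = {
    "kl_penalty_coef": [0.0, 0.02],
    "temperature": [0.7, 1.0],
    "group_size": [2, 4],
}

# One dict per combination; the last key varies fastest.
names = list(grid)
configs = [dict(zip(names, values)) for values in product(*grid.values())]

for index, cfg in enumerate(configs, start=1):
    print(index, cfg)
```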

Run it:

export MINT_API_KEY=sk-your-api-key
python recipes/rl_hyperparameters.py

Override the grid:

MINT_RL_STEPS=1 \
MINT_RL_KL_COEFS=0.0,0.02 \
MINT_RL_TEMPS=0.7,1.0 \
MINT_RL_GROUPS=2,4 \
python recipes/rl_hyperparameters.py
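One way a script could read comma-separated overrides like these is sketched below. The environment variable names come from the command above; the parse_list helper and its fallback behavior are hypothetical and may not match the recipe's actual parsing.

```python
import os


def parse_list(name, default, cast):
    """Read a comma-separated env var, falling back to default when unset or empty."""
    raw = os.environ.get(name, "")
    if not raw.strip():
        return list(default)
    return [cast(item) for item in raw.split(",") if item.strip()]


# Fall back to the default grid when no override is set.
kl_coefs = parse_list("MINT_RL_KL_COEFS", [0.0, 0.02], float)
temperatures = parse_list("MINT_RL_TEMPS", [0.7, 1.0], float)
group_sizes = parse_list("MINT_RL_GROUPS", [2, 4], int)

print(kl_coefs, temperatures, group_sizes)
```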

Environment Shape

The environment is intentionally small:

class ArithmeticMessageEnv(MessageEnv):
    async def initial_observation(self):
        return [
            {
                "role": "system",
                "content": "You are a precise calculator. Reply with only the final number.",
            },
            {"role": "user", "content": self.question},
        ]

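    async def step(self, message):
        # _extract_content and _first_number are helpers from the full recipe source.
        content = _extract_content(message).strip()
        prediction = _first_number(content)
        correct = prediction == self.answer
        # Single-turn task: the episode ends after the model's first reply.
        return MessageStepResult(
            reward=1.0 if correct else -0.25,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
        )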
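The scoring logic in step() can be tried on its own with a stand-in for the number-extraction helper. The first_number function below is a hypothetical regex-based substitute for the recipe's _first_number, whose real implementation is in the full source; only the reward values (1.0 and -0.25) are taken from the excerpt above.

```python
import re

# Matches an optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def first_number(text):
    """Return the first integer or decimal in text as a string, or None."""
    match = _NUMBER.search(text)
    return match.group(0) if match else None


def score(reply, answer):
    """Mirror the excerpt's reward rule: 1.0 if correct, otherwise -0.25."""
    correct = first_number(reply.strip()) == answer
    return 1.0 if correct else -0.25


for reply in ("42", "The answer is 42.", "I think 41", "no idea"):
    print(repr(reply), score(reply, "42"))
```

Because this sketch compares strings, a reply of "42.0" would not match an answer of "42"; a real helper might normalize numbers before comparing.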

Then the dataset builder creates groups of environments:

@dataclass(frozen=True)
class ArithmeticEnvGroupBuilder(recipe.rl.types.EnvGroupBuilder):
    question: str
    answer: str
    group_size: int
    renderer_name: str
    model_name: str

    async def make_envs(self):
        tokenizer = get_tokenizer(self.model_name)
        renderer = recipe.renderers.get_renderer(self.renderer_name, tokenizer)
        return [
            EnvFromMessageEnv(
                renderer=renderer,
                message_env=ArithmeticMessageEnv(self.question, self.answer),
                max_trajectory_tokens=512,
                max_generation_tokens=64,
            )
            for _ in range(self.group_size)
        ]
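Why group_size matters can be sketched with mean-centered advantages. This assumes GRPO-style centering of rewards within each group; it is not taken from recipe.rl.train, and some variants also divide by the group's standard deviation.

```python
def group_relative_advantages(rewards):
    """Center each reward on its group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# A group of 4 with one correct answer (1.0) and three wrong (-0.25).
mixed = group_relative_advantages([1.0, -0.25, -0.25, -0.25])

# A group of 2 where both rollouts are wrong.
uniform = group_relative_advantages([-0.25, -0.25])

print(mixed)
print(uniform)
```

Under this assumption, a group whose rollouts all receive the same reward produces zero advantages and no learning signal, which is one reason to compare group sizes before a larger task.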

Training Config

Each grid item calls recipe.rl.train.main():

kl_reference_config = (
    recipe.rl.train.KLReferenceConfig(base_model=MODEL)
    if kl_coef > 0
    else None
)

config = recipe.rl.train.Config(
    learning_rate=1e-5,
    dataset_builder=ArithmeticRLDatasetBuilder(...),
    model_name=MODEL,
    renderer_name="qwen3",
    lora_rank=16,
    max_tokens=64,
    temperature=temperature,
    kl_penalty_coef=kl_coef,
    kl_reference_config=kl_reference_config,
    loss_fn="importance_sampling",
    max_steps=1,
    save_every=999,
    eval_every=999,
)

await recipe.rl.train.main(config=config)

If kl_penalty_coef > 0, recipe.rl.train.Config requires kl_reference_config. The recipe sets KLReferenceConfig(base_model=MODEL) for those configs.
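The requirement can be illustrated with stand-in dataclasses. SketchKLReference and SketchConfig below are hypothetical and only mimic the rule described here; they are not the recipe.rl.train classes, whose validation may be implemented differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchKLReference:
    """Stand-in for a KL reference config (hypothetical)."""
    base_model: str


@dataclass(frozen=True)
class SketchConfig:
    """Stand-in for a training config that enforces the KL rule (hypothetical)."""
    kl_penalty_coef: float
    kl_reference_config: Optional[SketchKLReference] = None

    def __post_init__(self):
        if self.kl_penalty_coef > 0 and self.kl_reference_config is None:
            raise ValueError("kl_penalty_coef > 0 requires kl_reference_config")


def build_config(kl_coef, model):
    """Attach a reference model only when the KL penalty is active."""
    reference = SketchKLReference(base_model=model) if kl_coef > 0 else None
    return SketchConfig(kl_penalty_coef=kl_coef, kl_reference_config=reference)


for coef in (0.0, 0.02):
    print(coef, build_config(coef, "Qwen/Qwen3-0.6B"))
```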

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/rl_hyperparameters.py

Verified Run

Verified on MinT with Qwen/Qwen3-0.6B, one training step per config:

KL    Temperature  Group size  Steps  Mean reward
0.00  0.7          2           1      -0.100
0.00  0.7          4           1      -0.100
0.00  1.0          2           1      -0.100
0.00  1.0          4           1      -0.100
0.02  0.7          2           1      -0.100
0.02  0.7          4           1      -0.100
0.02  1.0          2           1      -0.100
0.02  1.0          4           1      -0.100

The tiny default task is a smoke test. The reward values do not measure model quality; they show that all 8 configs complete rollouts, training, KL reference handling, and checkpoint save.

Why This Shape Works

MessageEnv
  │ defines prompt, step(), reward
  ▼
EnvFromMessageEnv
  │ renderer bridge: messages → tokens
  ▼
EnvGroupBuilder
  │ creates group_size copies for group-relative RL
  ▼
RLDatasetBuilder
  │ feeds groups into recipe.rl.train
  ▼
recipe.rl.train.main()
  │ samples, computes rewards, trains with importance_sampling
  ▼
metrics.jsonl + checkpoint

This is the real RL recipe path users need for custom environments.
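For readers unfamiliar with the term, a generic importance-sampling policy-gradient objective can be sketched as follows. This is a textbook form under stated assumptions (per-token log-probabilities and advantages, with a plain mean); the loss that MinT applies for loss_fn="importance_sampling" may differ in its details.

```python
import math


def importance_sampling_loss(new_logprobs, old_logprobs, advantages):
    """Negative mean of (probability ratio * advantage) over sampled tokens."""
    terms = [
        math.exp(new - old) * adv  # ratio of current policy to sampling policy
        for new, old, adv in zip(new_logprobs, old_logprobs, advantages)
    ]
    return -sum(terms) / len(terms)


# Made-up values: the current policy equals the sampling policy, so every ratio is 1.
logprobs = [-1.2, -0.4]
advantages = [1.0, 0.5]
on_policy_loss = importance_sampling_loss(logprobs, logprobs, advantages)
print(on_policy_loss)
```

When the sampling policy and the training policy match, the ratio is 1 and the objective reduces to the negative mean advantage; the ratio only changes the weighting once the two policies drift apart.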
