
Math RL

Math RL uses a deterministic verifier to score arithmetic or algebra problems. Given a ground-truth answer, the verifier extracts the predicted answer from the model's output and returns 1.0 (correct) or 0.0 (incorrect). This simple reward signal is highly effective for structured tasks.

The canonical implementation is demos/rl/adapters/verifiable_math.py, which solves addition problems with exact-match grading.
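
For reference, such a dataset can be generated as (question, answer) pairs. The helper below is a hypothetical illustration, not code from verifiable_math.py:

import random

def make_addition_problems(n: int, lo: int = 0, hi: int = 99) -> list[tuple[str, int]]:
    # Each sample pairs a rendered question with its ground-truth integer answer.
    problems = []
    for _ in range(n):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        problems.append((f"What is {a} + {b}?", a + b))
    return problems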

Configuration

Set up Math RL using the standard GRPO loop:

import mint
from mint import types

# Connect to the MinT service and create a LoRA training client
# for the base model.
service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,              # LoRA rank
    train_mlp=True,       # adapt the MLP layers
    train_attn=True,      # adapt the attention layers
    train_unembed=True,   # adapt the output (unembedding) layer
)

tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

Then run the adapter from verifiable_math.py:

from demos.rl.adapters.verifiable_math import VerifiableMathAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=16,
    temperature=0.8,
)

run_grpo(VerifiableMathAdapter(), cfg)

Prompting Guide

Math problems are rendered with a few-shot example and then the question:

class VerifiableMathAdapter(RLAdapter):
    FEWSHOT = "Q: What is 4 + 5?\nA: 9\n\n"

    def make_prompt(self, sample: tuple[str, int], tokenizer) -> list[int]:
        # Wrap the question in the same Q/A format as the few-shot example,
        # ending with "A:" so the model generates only the answer.
        question, _ = sample
        return tokenizer.encode(self.FEWSHOT + f"Q: {question}\nA:")

For a problem (question="What is 3 + 7?", answer=10), the encoded prompt becomes:

Q: What is 4 + 5?
A: 9

Q: What is 3 + 7?
A:

The model then generates a response. The verifier extracts the numeric answer and compares it to the ground truth.

Key pattern: The question and few-shot prefix are part of the prompt (zero loss weight during training). Only the answer tokens get loss weight and advantage scaling.
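
To make that concrete, the per-token weights for one group member might be assembled like this (a minimal sketch with placeholder token IDs; the actual datum construction in rl_core.py may differ):

# Zero loss weight on the prompt; advantage-scaled weight on the answer.
prompt_tokens = [101, 102, 103]   # few-shot prefix + question (placeholder IDs)
answer_tokens = [104, 105]        # sampled answer tokens (placeholder IDs)
advantage = 0.5                   # this sample's group-centered reward

tokens = prompt_tokens + answer_tokens
weights = [0.0] * len(prompt_tokens) + [advantage] * len(answer_tokens)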

Output Format

The verifier extracts the first integer found in the response:

import re

def compute_reward(self, response: str, sample: tuple[str, int]) -> float:
    # Exact-match grading: take the first (possibly negative) integer.
    _, answer = sample
    match = re.search(r"-?\d+", response)
    return 1.0 if match and int(match.group()) == answer else 0.0

  • Reward = 1.0: the extracted answer matches the ground truth.
  • Reward = 0.0: no number was extracted, or the answer mismatches.

Within a group of group_size samples from the same prompt, advantages are computed by centering: adv[i] = reward[i] - mean_reward_in_group. This encourages high-reward samples and discourages low-reward ones.
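
For example, with group = 4 and two correct samples:

def center_advantages(rewards: list[float]) -> list[float]:
    # adv[i] = reward[i] - mean reward within the group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]        # one group of 4 samples, same prompt
print(center_advantages(rewards))     # [0.5, -0.5, -0.5, 0.5]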

Optional partial credit (not in the canonical script but worth considering for harder datasets): award 0.5 if the response contains the correct format (e.g., a number, even if wrong) but not the exact answer.
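
A sketch of that variant (the function name is hypothetical; adapt the format check to your dataset):

import re

def compute_reward_with_partial_credit(response: str, answer: int) -> float:
    # 1.0 for an exact match, 0.5 for a well-formed but wrong number, else 0.0.
    match = re.search(r"-?\d+", response)
    if not match:
        return 0.0
    return 1.0 if int(match.group()) == answer else 0.5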

All Parameters

| Parameter     | Type  | Default           | Meaning |
|---------------|-------|-------------------|---------|
| steps         | int   | 5                 | Training steps (sampling + training). |
| batch         | int   | 8                 | Problems per step. |
| group         | int   | 4                 | Samples per problem (group_size). |
| learning_rate | float | 2e-5              | Adam learning rate. For Math RL: 1e-5 to 4e-5. |
| max_tokens    | int   | 16                | Max generation length. Math: 8–32 (short answers). |
| temperature   | float | 0.8               | Sampling temperature; higher = more diversity. Typical: 0.7–1.0. |
| base_model    | str   | "Qwen/Qwen3-0.6B" | Base model. |
| rank          | int   | 16                | LoRA rank. |
| train_mlp     | bool  | True              | Train the MLP layers. |
| train_attn    | bool  | True              | Train the attention layers. |
| train_unembed | bool  | True              | Train the output (unembed) layer. |

Environment variables (from rl_core.py):

export MINT_BASE_MODEL="Qwen/Qwen3-0.6B"
export MINT_LORA_RANK=16
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
export MINT_RL_MAX_TOKENS=16
export MINT_RL_TEMPERATURE=0.8

Usage:

cfg = RLConfig.from_env()
run_grpo(VerifiableMathAdapter(), cfg)

Extending to other math domains: Swap VerifiableMathAdapter with your own RLAdapter that implements build_dataset(), make_prompt(), and compute_reward(). For GSM8K-style word problems, increase max_tokens to 256 to allow chain-of-thought reasoning.
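
A skeleton for such an adapter might look like the sketch below. It assumes RLAdapter can be imported from demos.rl.rl_core and that build_dataset() takes a sample count; check rl_core.py for the actual interface:

import re

from demos.rl.rl_core import RLAdapter  # assumed location; adjust if needed

class WordProblemAdapter(RLAdapter):
    """Hypothetical adapter for GSM8K-style word problems."""

    def build_dataset(self, n: int) -> list[tuple[str, int]]:
        # Return (question, integer answer) pairs from your dataset.
        raise NotImplementedError

    def make_prompt(self, sample: tuple[str, int], tokenizer) -> list[int]:
        question, _ = sample
        return tokenizer.encode(f"Q: {question}\nA:")

    def compute_reward(self, response: str, sample: tuple[str, int]) -> float:
        # Grade the last number in the response, so chain-of-thought
        # reasoning before the final answer is allowed.
        _, answer = sample
        numbers = re.findall(r"-?\d+", response)
        return 1.0 if numbers and int(numbers[-1]) == answer else 0.0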
