# Math RL
Math RL uses a deterministic verifier to score arithmetic or algebra problems. Given a ground-truth answer, the verifier extracts the predicted answer from the model's output and returns 1.0 (correct) or 0.0 (incorrect). This simple reward signal is highly effective for structured tasks.
The canonical implementation is `demos/rl/adapters/verifiable_math.py`, which trains on addition problems with exact-match grading.
## Configuration
Set up Math RL using the standard GRPO loop:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)
```

Then run the adapter from `verifiable_math.py`:
```python
from demos.rl.adapters.verifiable_math import VerifiableMathAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=16,
    temperature=0.8,
)
run_grpo(VerifiableMathAdapter(), cfg)
```

## Prompting Guide
Math problems are rendered with a few-shot example and then the question:
```python
class VerifiableMathAdapter(RLAdapter):
    FEWSHOT = "Q: What is 4 + 5?\nA: 9\n\n"

    def make_prompt(self, sample: tuple[str, int], tokenizer) -> list[int]:
        question, _ = sample
        return tokenizer.encode(f"{self.FEWSHOT}Q: {question}\nA:")
```

For a problem `(question="What is 3 + 7?", answer=10)`, the encoded prompt becomes:
```
Q: What is 4 + 5?
A: 9

Q: What is 3 + 7?
A:
```

The model then generates a response. The verifier extracts the numeric answer and compares it to the ground truth.
**Key pattern:** the few-shot prefix and the question are part of the prompt (zero loss weight during training). Only the answer tokens get loss weight and advantage scaling.
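That split can be sketched in a few lines. `token_loss_weights` is an illustrative helper, not part of the `demos/rl` API:

```python
def token_loss_weights(prompt_ids: list[int],
                       completion_ids: list[int],
                       advantage: float) -> list[float]:
    # Prompt tokens (few-shot prefix + question) contribute nothing to the
    # loss; each sampled answer token carries the group-centered advantage.
    return [0.0] * len(prompt_ids) + [advantage] * len(completion_ids)
```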
## Output Format
The verifier extracts the first integer found in the response:
```python
def compute_reward(self, response: str, sample: tuple[str, int]) -> float:
    _, answer = sample
    match = re.search(r"-?\d+", response)
    return 1.0 if match and int(match.group()) == answer else 0.0
```

- Reward = 1.0: the extracted answer matches the ground truth.
- Reward = 0.0: no number was extracted, or the number mismatches.
Within a group of `group_size` samples from the same prompt, advantages are computed by centering: `adv[i] = reward[i] - mean_reward_in_group`. This reinforces above-average samples and suppresses below-average ones.
Optional partial credit (not in the canonical script but worth considering for harder datasets): award 0.5 if the response contains the correct format (e.g., a number, even if wrong) but not the exact answer.
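A sketch of that partial-credit variant (the function name and the 0.5 value are illustrative choices, not part of the canonical script):

```python
import re

def partial_credit_reward(response: str, answer: int) -> float:
    # 1.0 for the exact answer, 0.5 for a well-formed but wrong number,
    # 0.0 when no integer can be extracted at all.
    match = re.search(r"-?\d+", response)
    if match is None:
        return 0.0
    return 1.0 if int(match.group()) == answer else 0.5
```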
## All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| steps | int | 5 | Training steps (sampling + training). |
| batch | int | 8 | Problems per step. |
| group | int | 4 | Samples per problem (`group_size`). |
| learning_rate | float | 2e-5 | Adam LR. Math RL: 1e-5 to 4e-5. |
| max_tokens | int | 16 | Max generation length. Math: 8–32 (short answers). |
| temperature | float | 0.8 | Sampling temperature. Higher = more diversity. Typical: 0.7–1.0. |
| base_model | str | "Qwen/Qwen3-0.6B" | Base model. |
| rank | int | 16 | LoRA rank. |
| train_mlp | bool | True | Train MLP layers. |
| train_attn | bool | True | Train attention layers. |
| train_unembed | bool | True | Train the output (unembedding) layer. |
Environment variables (from `rl_core.py`):

```bash
export MINT_BASE_MODEL="Qwen/Qwen3-0.6B"
export MINT_LORA_RANK=16
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
export MINT_RL_MAX_TOKENS=16
export MINT_RL_TEMPERATURE=0.8
```

Usage:
```python
cfg = RLConfig.from_env()
run_grpo(VerifiableMathAdapter(), cfg)
```

**Extending to other math domains:** swap `VerifiableMathAdapter` for your own `RLAdapter` that implements `build_dataset()`, `make_prompt()`, and `compute_reward()`. For GSM8K-style word problems, increase `max_tokens` to 256 to allow chain-of-thought reasoning.
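As a sketch, a hypothetical GSM8K-style adapter could look like this. The class, few-shot example, and dataset are illustrative (shown standalone here; a real adapter would subclass `RLAdapter` and load actual data):

```python
import re

class WordProblemAdapter:
    """Illustrative sketch of a custom word-problem adapter."""

    FEWSHOT = "Q: Tom has 2 apples and buys 3 more. How many apples?\nA: 5\n\n"

    def build_dataset(self) -> list[tuple[str, int]]:
        # Hypothetical inline data; a real adapter would load GSM8K here.
        return [("Sara reads 4 pages a day for 3 days. How many pages?", 12)]

    def make_prompt(self, sample: tuple[str, int], tokenizer) -> list[int]:
        question, _ = sample
        return tokenizer.encode(f"{self.FEWSHOT}Q: {question}\nA:")

    def compute_reward(self, response: str, sample: tuple[str, int]) -> float:
        _, answer = sample
        # With chain-of-thought output, grade the *last* integer so
        # intermediate arithmetic is not mistaken for the final answer.
        numbers = re.findall(r"-?\d+", response)
        return 1.0 if numbers and int(numbers[-1]) == answer else 0.0
```

Grading the last number (rather than the first) is the usual choice once `max_tokens` is large enough for the model to show its working.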