Custom Environment
RL training requires an environment that takes the model's action (generated tokens) and returns a reward signal. MinT's RLAdapter base class lets you implement custom environments for any task: math verification, code execution, multi-turn conversations, tool use, or domain-specific feedback.
Concept
An RL environment in MinT implements the RLAdapter interface:
```python
class MyEnvironment(RLAdapter):
    def __call__(self, prompt: ModelInput, response: ModelInput) -> Reward:
        # Take the model's response and compute a reward
        # Return: Reward(score=float, metadata=dict)
        ...
```

The environment receives:
- `prompt` — The input context (tokens the model was given).
- `response` — The model's generated completion (tokens it produced).

The environment returns:
- `score` — A scalar reward (higher = better).
- `metadata` — Optional dict with debug info (e.g., error messages, intermediate scores).
Common environment patterns:
- Verification — Check if the response is correct (e.g., math solution is right). Return 1.0 or 0.0.
- Scoring — Use a metric function (BLEU, edit distance, semantic similarity). Return 0.0–1.0 (see the sketch after this list).
- External system — Call an API or subprocess (code execution, fact-checking). Return score + error info.
- Multi-step — Simulate a conversational environment; check if the model's response achieves a goal.
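As an illustration of the scoring pattern, here is a minimal sketch that turns character-level similarity into a dense 0.0–1.0 reward using Python's standard-library difflib. The hardcoded reference string is a stand-in for a real ground-truth lookup; everything else follows the RLAdapter interface above.

```python
import difflib

import mint
from mint.rl_core import RLAdapter, Reward


class SimilarityEnv(RLAdapter):
    """Dense reward: character-level similarity between response and reference."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())

        reference_text = "Paris"  # (in practice, look up the ground truth for this prompt)

        # SequenceMatcher.ratio() already lands in [0.0, 1.0], so no rescaling is needed.
        score = difflib.SequenceMatcher(None, response_text, reference_text).ratio()
        return Reward(score=score, metadata={"similarity": score})
```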
Pattern
```python
import re

import mint
from mint.rl_core import RLAdapter, Reward


class MathVerificationEnv(RLAdapter):
    """Verify math problem solutions by extracting and comparing numeric answers."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()

        # Decode tokens to text
        response_text = tokenizer.decode(response.to_ints())

        # Extract the numeric answer from the response
        match = re.search(r'\d+', response_text)
        if not match:
            return Reward(score=0.0, metadata={"error": "No number found"})

        predicted_answer = int(match.group())
        expected_answer = 42  # (in practice, extract from prompt or ground truth)

        # Reward: 1.0 if correct, 0.0 if incorrect
        is_correct = predicted_answer == expected_answer
        return Reward(
            score=1.0 if is_correct else 0.0,
            metadata={"predicted": predicted_answer, "expected": expected_answer},
        )


class ToolUseEnv(RLAdapter):
    """Reward model for learning to use tools."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())

        # Check if the model called a tool (e.g., contains <tool_call>)
        used_tool = "<tool_call>" in response_text
        tool_score = 1.0 if used_tool else 0.0

        # Check if the response is coherent (crude length heuristic)
        coherence_score = 1.0 if len(response_text) > 20 else 0.0

        # Combine scores
        total_score = 0.7 * tool_score + 0.3 * coherence_score
        return Reward(
            score=total_score,
            metadata={"used_tool": used_tool, "coherence": coherence_score},
        )
```
```python
# Usage during RL training
training_client = service_client.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=16)
env = MathVerificationEnv(training_client.get_tokenizer())

# Sample from the model
sampler = service_client.create_sampling_client(base_model="Qwen/Qwen3-0.6B")
response_tokens = sampler.sample(prompt=..., sampling_params=...)  # Get response tokens

# Evaluate with the environment
reward = env(prompt_tokens, response_tokens)
print(f"Reward: {reward.score}, Metadata: {reward.metadata}")
```

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/demos/rl/adapters/environment_tooluse.py
API Surface
| Class | Method | Purpose |
|---|---|---|
| RLAdapter | `__call__(prompt, response)` | Compute reward for a (prompt, response) pair |
| Reward | `score: float` | Scalar reward value |
| Reward | `metadata: dict` | Optional debug/logging info |
Helper methods in RLAdapter:
- `get_tokenizer()` — Retrieve the tokenizer for decoding tokens.
- `get_model_info()` — Query model family and configuration.
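A minimal sketch exercising both helpers inside an adapter. The zero score and stringified model info are illustrative, not a recommended reward design:

```python
import mint
from mint.rl_core import RLAdapter, Reward


class InspectionEnv(RLAdapter):
    """Illustrative only: decodes the response and logs model info in metadata."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        text = self.get_tokenizer().decode(response.to_ints())
        # get_model_info() is stringified here because its exact fields are model-dependent.
        return Reward(
            score=0.0,
            metadata={"chars": len(text), "model_info": str(self.get_model_info())},
        )
```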
Caveats & Pitfalls
- Reward scale: Rewards should be normalized to a reasonable range (e.g., 0–1 or -1 to 1). Extremely large rewards can cause instability in gradient updates.
- Sparse rewards: Binary rewards (correct/incorrect) provide weak training signals. If possible, provide dense feedback (e.g., partial credit, semantic similarity scores).
- Execution safety: Be cautious with code-execution environments. Sandboxing is essential to keep untrusted generated code from damaging your system; a minimal timeout-bounded sketch follows this list.
- Latency: If your environment calls external APIs (web search, code execution), use timeouts and consider async patterns. Slow environments bottleneck training.
- Determinism: If your environment is stochastic (random elements), consider seeding for reproducible training. Log sources of randomness in metadata.
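To make the execution-safety and latency caveats concrete, here is a minimal sketch that runs generated code in a subprocess with a hard timeout. The class name and the pass/fail scoring are illustrative, and a bare subprocess is not a real sandbox; use containers or a dedicated sandboxing service for untrusted code.

```python
import subprocess

import mint
from mint.rl_core import RLAdapter, Reward


class CodeExecEnv(RLAdapter):
    """Run the generated snippet in a subprocess with a hard timeout.

    NOTE: a subprocess is NOT a sandbox. In production, execute untrusted
    code inside a container, jail, or dedicated sandboxing service.
    """

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        code = self.get_tokenizer().decode(response.to_ints())
        try:
            # Hard timeout so a hung program cannot stall the training loop.
            result = subprocess.run(
                ["python", "-c", code],
                capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return Reward(score=0.0, metadata={"error": "timeout"})

        # Binary pass/fail; stderr is truncated and kept for debugging.
        score = 1.0 if result.returncode == 0 else 0.0
        return Reward(
            score=score,
            metadata={"returncode": result.returncode, "stderr": result.stderr[:500]},
        )
```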