Custom Environment
RL training requires an environment that takes the model's action (generated tokens) and returns a reward signal. MinT's RLAdapter base class lets you implement custom environments for any task: math verification, code execution, multi-turn conversations, tool use, or domain-specific feedback.
Concept
An RL environment in MinT implements the RLAdapter interface:
```python
class MyEnvironment(RLAdapter):
    def __call__(self, prompt: ModelInput, response: ModelInput) -> Reward:
        # Take the model's response and compute a reward
        # Return: Reward(score=float, metadata=dict)
        ...
```

The environment receives:
- `prompt` — The input context (tokens the model was given).
- `response` — The model's generated completion (tokens it produced).

The environment returns:
- `score` — A scalar reward (higher = better).
- `metadata` — Optional dict with debug info (e.g., error messages, intermediate scores).
Common environment patterns:
- Verification — Check if the response is correct (e.g., math solution is right). Return 1.0 or 0.0.
- Scoring — Use a metric function (BLEU, edit distance, semantic similarity). Return 0.0–1.0 (see the sketch after this list).
- External system — Call an API or subprocess (code execution, fact-checking). Return score + error info.
- Multi-step — Simulate a conversational environment; check if the model's response achieves a goal.
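As an illustration of the scoring pattern, here is a minimal sketch that turns character-level similarity into a dense 0.0–1.0 reward using Python's standard-library difflib. The hardcoded reference string is a stand-in for a real ground-truth lookup; everything else follows the RLAdapter interface above.

```python
import difflib

import mint
from mint.rl_core import RLAdapter, Reward


class SimilarityEnv(RLAdapter):
    """Dense reward: character-level similarity between response and reference."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())

        reference_text = "Paris"  # (in practice, look up the ground truth for this prompt)

        # SequenceMatcher.ratio() already lands in [0.0, 1.0], so no rescaling is needed.
        score = difflib.SequenceMatcher(None, response_text, reference_text).ratio()
        return Reward(score=score, metadata={"similarity": score})
```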
Pattern
```python
import re

import mint
from mint.rl_core import RLAdapter, Reward


class MathVerificationEnv(RLAdapter):
    """Verify math problem solutions by extracting and comparing numeric answers."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()

        # Decode tokens to text
        response_text = tokenizer.decode(response.to_ints())

        # Extract the numeric answer from the response
        match = re.search(r'\d+', response_text)
        if not match:
            return Reward(score=0.0, metadata={"error": "No number found"})

        predicted_answer = int(match.group())
        expected_answer = 42  # (in practice, extract from prompt or ground truth)

        # Reward: 1.0 if correct, 0.0 if incorrect
        is_correct = predicted_answer == expected_answer
        return Reward(
            score=1.0 if is_correct else 0.0,
            metadata={"predicted": predicted_answer, "expected": expected_answer},
        )


class ToolUseEnv(RLAdapter):
    """Reward model for learning to use tools."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())

        # Check if the model called a tool (e.g., contains <tool_call>)
        used_tool = "<tool_call>" in response_text
        tool_score = 1.0 if used_tool else 0.0

        # Check if the response is coherent (crude length heuristic)
        coherence_score = 1.0 if len(response_text) > 20 else 0.0

        # Combine scores
        total_score = 0.7 * tool_score + 0.3 * coherence_score
        return Reward(
            score=total_score,
            metadata={"used_tool": used_tool, "coherence": coherence_score},
        )
```
```python
# Usage during RL training
training_client = service_client.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=16)
env = MathVerificationEnv(training_client.get_tokenizer())

# Sample from the model
sampler = service_client.create_sampling_client(base_model="Qwen/Qwen3-0.6B")
response_tokens = sampler.sample(prompt=..., sampling_params=...)  # Get response tokens

# Evaluate with the environment
reward = env(prompt_tokens, response_tokens)
print(f"Reward: {reward.score}, Metadata: {reward.metadata}")
```

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/demos/rl/adapters/environment_tooluse.py
API Surface
| Class | Method | Purpose |
|---|---|---|
| RLAdapter | `__call__(prompt, response)` | Compute reward for a (prompt, response) pair |
| Reward | `score: float` | Scalar reward value |
| Reward | `metadata: dict` | Optional debug/logging info |
Helper methods in RLAdapter:
- `get_tokenizer()` — Retrieve the tokenizer for decoding tokens.
- `get_model_info()` — Query model family and configuration.
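A minimal sketch exercising both helpers inside an adapter. The zero score and stringified model info are illustrative, not a recommended reward design:

```python
import mint
from mint.rl_core import RLAdapter, Reward


class InspectionEnv(RLAdapter):
    """Illustrative only: decodes the response and logs model info in metadata."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        text = self.get_tokenizer().decode(response.to_ints())
        # get_model_info() is stringified here because its exact fields are model-dependent.
        return Reward(
            score=0.0,
            metadata={"chars": len(text), "model_info": str(self.get_model_info())},
        )
```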
Caveats & Pitfalls
- Reward scale: Rewards should be normalized to a reasonable range (e.g., 0–1 or -1 to 1). Extremely large rewards can cause instability in gradient updates.
- Sparse rewards: Binary rewards (correct/incorrect) provide weak training signals. If possible, provide dense feedback (e.g., partial credit, semantic similarity scores).
- Execution safety: Be cautious with code-execution environments. Sandboxing is essential to keep untrusted generated code from damaging your system; a minimal timeout-bounded sketch follows this list.
- Latency: If your environment calls external APIs (web search, code execution), use timeouts and consider async patterns. Slow environments bottleneck training.
- Determinism: If your environment is stochastic (random elements), consider seeding for reproducible training. Log sources of randomness in metadata.
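To make the execution-safety and latency caveats concrete, here is a minimal sketch that runs generated code in a subprocess with a hard timeout. The class name and the pass/fail scoring are illustrative, and a bare subprocess is not a real sandbox; use containers or a dedicated sandboxing service for untrusted code.

```python
import subprocess

import mint
from mint.rl_core import RLAdapter, Reward


class CodeExecEnv(RLAdapter):
    """Run the generated snippet in a subprocess with a hard timeout.

    NOTE: a subprocess is NOT a sandbox. In production, execute untrusted
    code inside a container, jail, or dedicated sandboxing service.
    """

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        code = self.get_tokenizer().decode(response.to_ints())
        try:
            # Hard timeout so a hung program cannot stall the training loop.
            result = subprocess.run(
                ["python", "-c", code],
                capture_output=True, text=True, timeout=5,
            )
        except subprocess.TimeoutExpired:
            return Reward(score=0.0, metadata={"error": "timeout"})

        # Binary pass/fail; stderr is truncated and kept for debugging.
        score = 1.0 if result.returncode == 0 else 0.0
        return Reward(
            score=score,
            metadata={"returncode": result.returncode, "stderr": result.stderr[:500]},
        )
```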