Mind Lab Toolkit (MinT)

Custom Environment

RL training requires an environment that takes the model's action (generated tokens) and returns a reward signal. MinT's RLAdapter base class lets you implement custom environments for any task: math verification, code execution, multi-turn conversations, tool use, or domain-specific feedback.

Concept

An RL environment in MinT implements the RLAdapter interface:

from mint.rl_core import RLAdapter, Reward
from mint.types import ModelInput

class MyEnvironment(RLAdapter):
    def __call__(self, prompt: ModelInput, response: ModelInput) -> Reward:
        # Take the model's response and compute a reward,
        # then return Reward(score=float, metadata=dict)
        ...

The environment receives:

  • prompt — The input context (tokens the model was given).
  • response — The model's generated completion (tokens it produced).

The environment returns:

  • score — A scalar reward (higher = better).
  • metadata — Optional dict with debug info (e.g., error messages, intermediate scores).
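
Before looking at common patterns, here is a minimal toy environment that simply rewards longer responses. It is a sketch only: the class name and the 200-character cap are illustrative choices, and the tokenizer access relies on the get_tokenizer() helper described later on this page.

import mint
from mint.rl_core import RLAdapter, Reward

class LengthRewardEnv(RLAdapter):
    """Toy environment: reward grows with response length, capped at 1.0."""

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())

        # Scale the reward linearly up to 200 characters, then cap at 1.0
        score = min(len(response_text) / 200.0, 1.0)
        return Reward(score=score, metadata={"length": len(response_text)})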

Common environment patterns:

  • Verification — Check if the response is correct (e.g., math solution is right). Return 1.0 or 0.0.
  • Scoring — Use a metric function (BLEU, edit distance, semantic similarity). Return 0.0–1.0 (see the sketch after this list).
  • External system — Call an API or subprocess (code execution, fact-checking). Return score + error info.
  • Multi-step — Simulate a conversational environment; check if the model's response achieves a goal.
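
For the scoring pattern, a dense reward can come from string similarity against a reference answer. The sketch below uses Python's standard difflib; passing the reference through the constructor (and forwarding the tokenizer to RLAdapter's constructor) is an assumption about how you might wire it up, not a fixed MinT convention.

import difflib

import mint
from mint.rl_core import RLAdapter, Reward

class SimilarityScoringEnv(RLAdapter):
    """Dense reward: similarity between the decoded response and a reference answer."""

    def __init__(self, tokenizer, reference_text: str):
        super().__init__(tokenizer)           # assumed RLAdapter constructor signature
        self.reference_text = reference_text  # in practice, supplied per training example

    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())

        # SequenceMatcher's ratio is already in [0, 1], so it works directly as a reward
        score = difflib.SequenceMatcher(None, response_text, self.reference_text).ratio()
        return Reward(score=score, metadata={"similarity": score})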

Pattern

import re

import mint
from mint.rl_core import RLAdapter, Reward

class MathVerificationEnv(RLAdapter):
    """Verify math problem solutions via symbolic computation."""
    
    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        
        # Decode tokens to text
        response_text = tokenizer.decode(response.to_ints())
        
        # Extract the numeric answer from the response
        match = re.search(r'\d+', response_text)
        if not match:
            return Reward(score=0.0, metadata={"error": "No number found"})
        
        predicted_answer = int(match.group())
        expected_answer = 42  # (in practice, extract from prompt or ground truth)
        
        # Reward: 1.0 if correct, 0.0 if incorrect
        is_correct = predicted_answer == expected_answer
        return Reward(
            score=1.0 if is_correct else 0.0,
            metadata={"predicted": predicted_answer, "expected": expected_answer},
        )

class ToolUseEnv(RLAdapter):
    """Reward model for learning to use tools."""
    
    def __call__(self, prompt: mint.types.ModelInput, response: mint.types.ModelInput) -> Reward:
        tokenizer = self.get_tokenizer()
        response_text = tokenizer.decode(response.to_ints())
        
        # Check if model called a tool (e.g., contains <tool_call>)
        used_tool = "<tool_call>" in response_text
        tool_score = 1.0 if used_tool else 0.0
        
        # Check if response is coherent
        coherence_score = 1.0 if len(response_text) > 20 else 0.0
        
        # Combine scores
        total_score = 0.7 * tool_score + 0.3 * coherence_score
        
        return Reward(
            score=total_score,
            metadata={"used_tool": used_tool, "coherence": coherence_score},
        )

# Usage during RL training
training_client = service_client.create_lora_training_client(base_model="Qwen/Qwen3-0.6B", rank=16)
env = MathVerificationEnv(training_client.get_tokenizer())

# Sample from the model and evaluate
sampler = service_client.create_sampling_client(base_model="Qwen/Qwen3-0.6B")
response_tokens = sampler.sample(prompt=prompt_tokens, sampling_params=...)  # get response tokens

# Evaluate with environment
reward = env(prompt_tokens, response_tokens)
print(f"Reward: {reward.score}, Metadata: {reward.metadata}")

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/demos/rl/adapters/environment_tooluse.py

API Surface

Class       Member                        Purpose
RLAdapter   __call__(prompt, response)    Compute reward for a (prompt, response) pair
Reward      score: float                  Scalar reward value
Reward      metadata: dict                Optional debug/logging info

Helper methods in RLAdapter:

  • get_tokenizer() — Retrieve the tokenizer for decoding tokens.
  • get_model_info() — Query model family and configuration.

Caveats & Pitfalls

  • Reward scale: Normalize rewards to a reasonable range (e.g., 0 to 1 or -1 to 1); extremely large rewards can destabilize gradient updates. See the clipping sketch after this list.
  • Sparse rewards: Binary rewards (correct/incorrect) provide weak training signals. If possible, provide dense feedback (e.g., partial credit, semantic similarity scores).
  • Execution safety: Be cautious with code execution environments. Sandboxing is essential to prevent malicious code from damaging your system.
  • Latency: If your environment calls external APIs (web search, code execution), use async patterns and timeouts; slow environments bottleneck training. See the timeout sketch after this list.
  • Determinism: If your environment is stochastic (random elements), consider seeding for reproducible training. Log sources of randomness in metadata.
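
For the reward-scale caveat, one simple hedge is to clip scores before returning them. The helper below is a sketch; the clipped_reward name and the default bounds are arbitrary choices, not MinT defaults.

from mint.rl_core import Reward

def clipped_reward(raw_score: float, low: float = -1.0, high: float = 1.0, **metadata) -> Reward:
    """Clamp a raw score into [low, high] and keep the original value for debugging."""
    score = max(low, min(high, raw_score))
    return Reward(score=score, metadata={"raw_score": raw_score, **metadata})

An environment can then return clipped_reward(model_score, solved=True) instead of passing an unbounded score straight through.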
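
For the latency caveat, give every external call a hard timeout so a stuck process cannot stall training. The sketch below runs candidate code in a subprocess; the python3 invocation and 10-second limit are illustrative, and a bare subprocess is not a sandbox, so the execution-safety caveat above still applies.

import subprocess

from mint.rl_core import Reward

def run_code_reward(code: str, timeout_s: float = 10.0) -> Reward:
    """Execute candidate code in a subprocess; reward 1.0 only if it exits cleanly in time."""
    try:
        result = subprocess.run(
            ["python3", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        score = 1.0 if result.returncode == 0 else 0.0
        return Reward(score=score, metadata={"stderr": result.stderr[-500:]})
    except subprocess.TimeoutExpired:
        return Reward(score=0.0, metadata={"error": f"timed out after {timeout_s}s"})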
