Mind Lab Toolkit (MinT)

Code RL

Code RL uses execution-based rewards to score generated code. The model generates code, which is then executed in a sandbox against test cases. By default the reward is binary (all tests pass → 1.0, anything else → 0.0); with partial credit enabled, it is the fraction of tests passed. This reward is "delayed" — the policy doesn't see any reward until execution finishes.
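
The fraction-of-tests variant can be sketched in a few lines (the function name here is illustrative, not part of the toolkit API):

```python
def fraction_reward(test_results: list[bool]) -> float:
    """Reward = fraction of test cases passed (0.0 if there are none)."""
    if not test_results:
        return 0.0
    return sum(test_results) / len(test_results)

# 3 of 4 tests pass → 0.75; the binary variant would give 0.0 here
```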

The canonical adapter is demos/rl/adapters/environment_tooluse.py, which solves simple function-writing problems.

Configuration

Set up Code RL with the standard GRPO loop, but with larger max_tokens to accommodate code:

import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

Then run the Code RL adapter:

from demos.rl.adapters.environment_tooluse import EnvironmentToolUseAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=256,      # Much longer for code
    temperature=0.8,
)

run_grpo(EnvironmentToolUseAdapter(), cfg)

Prompting Guide

Code problems include a few-shot example and then the problem statement:

class EnvironmentToolUseAdapter(RLAdapter):
    FEWSHOT = """Q: Write a function `double(x)` that returns x * 2.
A: ```python
def double(x):
    return x * 2
```

"""

    def make_prompt(self, sample: dict, tokenizer) -> list[int]:
        return tokenizer.encode(self.FEWSHOT + f"Q: {sample['q']}\nA:")

For each problem, the model generates a response containing code. The adapter extracts the code block (regex match for ```python ... ```), executes it, and runs the test cases.

Example problem:

{
    "q": "Write `add(a, b)` that returns a + b.",
    "tests": [("add(1,2)", 3), ("add(-1,1)", 0)],
}

Model generates:

def add(a, b):
    return a + b
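
To see how the test tuples are applied to a generation like this, here is a standalone sketch (it mirrors what the adapter's reward does, but is not copied from it):

```python
code = "def add(a, b):\n    return a + b"          # extracted model output
tests = [("add(1,2)", 3), ("add(-1,1)", 0)]        # from the problem dict

ns = {}
exec(code, ns)                                     # define add() in a fresh namespace
results = [eval(expr, ns) == expected for expr, expected in tests]
print(results)  # [True, True] → all tests pass, reward 1.0
```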

Output Format

Code extraction and execution:

import re
from typing import Any


def _extract_code(response: str) -> str | None:
    # Prefer the last fenced code block in the response
    blocks = re.findall(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    if blocks:
        return blocks[-1].strip()
    # Fall back to bare code starting at the first function definition
    if "def " in response:
        return response[response.find("def "):].strip()
    return None


def compute_reward(self, response: str, sample: dict) -> float:
    code = _extract_code(response)
    if not code:
        return 0.0
    try:
        ns: dict[str, Any] = {}
        exec(code, ns)  # Execute in a fresh dict namespace (not a security sandbox)
        for expr, expected in sample["tests"]:
            if eval(expr, ns) != expected:  # Run each test
                return 0.0
        return 1.0  # All tests passed
    except Exception:
        return 0.0  # Execution or syntax error

Reward is binary:

  • 1.0: Code extracts, executes, and passes all test cases.
  • 0.0: Code does not extract, fails to execute, or fails any test.

Within a group, advantages are centered: adv[i] = reward[i] - mean_reward.
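
The centering step is small enough to write out directly (rewards here are illustrative):

```python
# One group of 4 samples drawn for the same problem
rewards = [1.0, 0.0, 1.0, 1.0]
mean_reward = sum(rewards) / len(rewards)        # 0.75
advantages = [r - mean_reward for r in rewards]  # [0.25, -0.75, 0.25, 0.25]
```

With binary rewards, samples that pass are pushed up and samples that fail are pushed down, proportionally to how rare each outcome is within the group.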

Sandbox note: The demo uses exec() in a dictionary namespace — not production-safe. For real deployments, use a proper sandbox (Docker, gVisor, Firecracker) to isolate untrusted code. Never run this on untrusted inputs.
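
A minimal step up from in-process exec() is to run generated code in a separate interpreter process with a wall-clock timeout. This is a sketch, not the demo's implementation, and it is still not a security boundary — it only keeps crashes and infinite loops from taking down the trainer:

```python
import subprocess
import sys


def run_untrusted(code: str, timeout_s: float = 5.0) -> bool:
    """Run code in a child Python process; True iff it exits cleanly in time.

    NOT a sandbox: the child has full filesystem and network access.
    Use Docker, gVisor, or Firecracker for genuinely untrusted code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```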

All Parameters

Parameter        Type    Default             Meaning
steps            int     10                  Training steps.
batch            int     8                   Problems per step.
group            int     4                   Samples per problem.
learning_rate    float   2e-5                Adam LR. Code RL: 1e-5 to 4e-5.
max_tokens       int     256                 Max generation length. Code: 128–512 depending on problem complexity.
temperature      float   0.8                 Sampling temperature. Typical: 0.7–1.0.
base_model       str     "Qwen/Qwen3-0.6B"   Base model.
rank             int     16                  LoRA rank.
train_mlp        bool    True                Train MLP layers.
train_attn       bool    True                Train attention layers.
train_unembed    bool    True                Train the output (unembedding) layer.

Sandbox parameters (environment-specific):

  • sandbox_timeout_s: float — timeout per code execution. Default: 5.0 seconds. Increase for slow tests.
  • max_tests_per_problem: int — max test cases to run. Default: 10. Cap to avoid runaway loops.
  • partial_credit: bool — whether to award fractional reward based on tests passed (e.g., 2/4 tests → 0.5). Default: False (binary reward).
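
With partial credit enabled, the reward becomes the passing fraction instead of 0/1. A sketch of that variant (standalone function, mirroring but not copied from the demo's compute_reward):

```python
from typing import Any


def partial_credit_reward(code: str, tests: list[tuple[str, Any]]) -> float:
    """Fraction of tests passed; 0.0 if the code itself fails to execute."""
    ns: dict[str, Any] = {}
    try:
        exec(code, ns)
    except Exception:
        return 0.0
    passed = 0
    for expr, expected in tests:
        try:
            if eval(expr, ns) == expected:
                passed += 1
        except Exception:
            pass  # a test that raises simply scores 0 for that test
    return passed / len(tests) if tests else 0.0
```

Partial credit gives a denser learning signal on multi-test problems, at the cost of sometimes rewarding code that is only partially correct.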

Environment variables:

export MINT_RL_MAX_TOKENS=256
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
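
One way a training script might pick these up — an illustrative pattern, not necessarily how rl_core reads them — is os.environ lookups with defaults:

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer override from the environment, falling back to default."""
    return int(os.environ.get(name, default))


max_tokens = env_int("MINT_RL_MAX_TOKENS", 256)
steps = env_int("MINT_RL_STEPS", 10)
```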

Extending to other languages: The demo uses Python's exec(). For other languages (JavaScript, Rust, Go), you must handle code extraction and compilation separately. Most teams swap the reward function but keep the GRPO loop unchanged.
