Code RL
Code RL uses execution-based rewards to score generated code. The model generates code, which is then executed in a sandbox against test cases. Reward = fraction of tests passed (or binary: all pass → 1.0, any fail → 0.0). This reward is "delayed" — the policy doesn't see reward until after execution finishes.
The canonical adapter is `demos/rl/adapters/environment_tooluse.py`, which poses simple function-writing problems.
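The scoring rule can be sketched in a few lines (the helper name `score_tests` is illustrative, not part of the demo):
# Illustrative sketch of the reward rule: binary by default, fractional otherwise.
def score_tests(n_passed: int, n_total: int, binary: bool = True) -> float:
    if n_total == 0:
        return 0.0
    if binary:
        return 1.0 if n_passed == n_total else 0.0  # all pass -> 1.0, any fail -> 0.0
    return n_passed / n_total  # fraction of tests passed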
Configuration
Set up Code RL with the standard GRPO loop, but with larger max_tokens to accommodate code:
import mint
from mint import types
service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)
Then run the Code RL adapter:
from demos.rl.adapters.environment_tooluse import EnvironmentToolUseAdapter
from demos.rl.rl_core import RLConfig, run_grpo
cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=256,  # Much longer for code
    temperature=0.8,
)
run_grpo(EnvironmentToolUseAdapter(), cfg)
Prompting Guide
Code problems include a few-shot example and then the problem statement:
class EnvironmentToolUseAdapter(RLAdapter):
    FEWSHOT = """Q: Write a function `double(x)` that returns x * 2.
A: ```python
def double(x):
    return x * 2
```
"""

    def make_prompt(self, sample: dict, tokenizer) -> list[int]:
        return tokenizer.encode(self.FEWSHOT + f"Q: {sample['q']}\nA:")
For each problem, the model generates a response containing code. The adapter extracts the code block (a regex match for ```python ... ```), executes it, and runs the test cases.
Example problem:
{
    "q": "Write `add(a, b)` that returns a + b.",
    "tests": [("add(1,2)", 3), ("add(-1,1)", 0)],
}
Model generates:
def add(a, b):
    return a + b
Output Format
Code extraction and execution:
import re
from typing import Any

def _extract_code(response: str) -> str | None:
    match = re.findall(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    if match:
        return match[-1].strip()
    if "def " in response:
        return response[response.find("def "):].strip()
    return None

def compute_reward(self, response: str, sample: dict) -> float:
    code = _extract_code(response)
    if not code:
        return 0.0
    try:
        ns: dict[str, Any] = {}
        exec(code, ns)  # Execute in isolated namespace
        for expr, expected in sample["tests"]:
            if eval(expr, ns) != expected:  # Run each test
                return 0.0
        return 1.0  # All tests passed
    except Exception:
        return 0.0  # Execution or syntax error
Reward is binary:
- 1.0: Code extracts, executes, and passes all test cases.
- 0.0: Code does not extract, fails to execute, or fails any test.
Within a group, advantages are centered: adv[i] = reward[i] - mean_reward.
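Concretely, the centering step looks like this (the values are illustrative):
# One group = `group` samples for the same problem; rewards come from compute_reward.
rewards = [1.0, 0.0, 1.0, 0.0]                    # e.g. 2 of 4 samples passed all tests
mean_reward = sum(rewards) / len(rewards)         # 0.5
advantages = [r - mean_reward for r in rewards]   # [0.5, -0.5, 0.5, -0.5]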
Sandbox note: The demo uses exec() in a dictionary namespace — not production-safe. For real deployments, use a proper sandbox (Docker, gVisor, Firecracker) to isolate untrusted code. Never run this on untrusted inputs.
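A step short of a real sandbox is to run each test in a child process with a wall-clock timeout; this contains infinite loops and hard crashes but is still not a security boundary. A minimal sketch, assuming a hypothetical helper that is not part of the demo:
import subprocess
import sys

def run_test_in_subprocess(code: str, expr: str, expected, timeout_s: float = 5.0) -> bool:
    # Hypothetical helper: evaluate one test in a separate Python process.
    # Limits runaway loops and crashes; it is NOT a substitute for a proper sandbox.
    script = f"{code}\nprint(repr({expr}))"
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == repr(expected)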
All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| steps | int | 10 | Training steps. |
| batch | int | 8 | Problems per step. |
| group | int | 4 | Samples per problem. |
| learning_rate | float | 2e-5 | Adam LR. Code RL: 1e-5 to 4e-5. |
| max_tokens | int | 256 | Max generation length. Code: 128–512 depending on problem complexity. |
| temperature | float | 0.8 | Sampling temperature. Typical: 0.7–1.0. |
| base_model | str | "Qwen/Qwen3-0.6B" | Base model. |
| rank | int | 16 | LoRA rank. |
| train_mlp | bool | True | Train MLP. |
| train_attn | bool | True | Train attention. |
| train_unembed | bool | True | Train output layer. |
Sandbox parameters (environment-specific):
- sandbox_timeout_s (float): timeout per code execution. Default: 5.0 seconds. Increase for slow tests.
- max_tests_per_problem (int): max test cases to run. Default: 10. Cap to avoid runaway loops.
- partial_credit (bool): whether to award fractional reward based on tests passed (e.g., 2/4 tests → 0.5). Default: False (binary reward); see the sketch after this list.
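A rough sketch of how partial_credit and max_tests_per_problem could plug into the reward function (this variant is an assumption, not the demo's implementation; sandbox_timeout_s is omitted because an in-process exec() cannot be cleanly timed out, see the subprocess sketch above):
def compute_reward_with_options(
    response: str,
    sample: dict,
    partial_credit: bool = False,
    max_tests_per_problem: int = 10,
) -> float:
    # Assumed variant of compute_reward; reuses _extract_code from the listing above.
    code = _extract_code(response)
    if not code:
        return 0.0
    tests = sample["tests"][:max_tests_per_problem]  # cap the number of tests run
    passed = 0
    try:
        ns: dict = {}
        exec(code, ns)
        for expr, expected in tests:
            if eval(expr, ns) == expected:
                passed += 1
            elif not partial_credit:
                return 0.0  # binary mode: any failure zeroes the reward
    except Exception:
        return 0.0
    if partial_credit:
        return passed / len(tests) if tests else 0.0
    return 1.0  # binary mode: reaching here means all tests passed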
Environment variables:
export MINT_RL_MAX_TOKENS=256
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
Extending to other languages: The demo uses Python's exec(). For other languages (JavaScript, Rust, Go), you must handle code extraction and compilation separately. Most teams swap the reward function but keep the GRPO loop unchanged.
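For JavaScript, for example, a swapped-in reward function might shell out to Node and leave the rest of the loop untouched. This is a hypothetical sketch (the helper name, test format, and Node invocation are assumptions, not part of the demo):
import json
import subprocess

def compute_reward_js(response: str, sample: dict, timeout_s: float = 5.0) -> float:
    # Hypothetical: score a JavaScript completion by running each test under Node.
    code = _extract_code(response)  # same extraction helper as the Python path
    if not code:
        return 0.0
    for expr, expected in sample["tests"]:
        script = f"{code}\nconsole.log(JSON.stringify({expr}));"
        try:
            result = subprocess.run(
                ["node", "-e", script],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return 0.0  # timed out, or Node is not installed
        if result.returncode != 0 or result.stdout.strip() != json.dumps(expected):
            return 0.0
    return 1.0  # all tests passed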