Code RL
Code RL uses execution-based rewards to score generated code. The model generates code, which is then executed in a sandbox against test cases. Reward = fraction of tests passed (or binary: all pass → 1.0, any fail → 0.0). This reward is "delayed" — the policy doesn't see reward until after execution finishes.
The canonical adapter is `demos/rl/adapters/environment_tooluse.py`, which poses simple function-writing problems.
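The scoring rule can be sketched in a few lines (the helper name `score_tests` is illustrative, not part of the demo):
# Illustrative sketch of the reward rule: binary by default, fractional otherwise.
def score_tests(n_passed: int, n_total: int, binary: bool = True) -> float:
    if n_total == 0:
        return 0.0
    if binary:
        return 1.0 if n_passed == n_total else 0.0  # all pass -> 1.0, any fail -> 0.0
    return n_passed / n_total  # fraction of tests passed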
Configuration
Set up Code RL with the standard GRPO loop, but with larger max_tokens to accommodate code:
import mint
from mint import types
service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)
Then run the Code RL adapter:
from demos.rl.adapters.environment_tooluse import EnvironmentToolUseAdapter
from demos.rl.rl_core import RLConfig, run_grpo
cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=256,  # Much longer for code
    temperature=0.8,
)
run_grpo(EnvironmentToolUseAdapter(), cfg)
Prompting Guide
Code problems include a few-shot example and then the problem statement:
class EnvironmentToolUseAdapter(RLAdapter):
    FEWSHOT = """Q: Write a function `double(x)` that returns x * 2.
A: ```python
def double(x):
    return x * 2
```
"""

    def make_prompt(self, sample: dict, tokenizer) -> list[int]:
        return tokenizer.encode(self.FEWSHOT + f"Q: {sample['q']}\nA:")
For each problem, the model generates a response containing code. The adapter extracts the code block (a regex match for ```python ... ```), executes it, and runs the test cases.
Example problem:
{
    "q": "Write `add(a, b)` that returns a + b.",
    "tests": [("add(1,2)", 3), ("add(-1,1)", 0)],
}
Model generates:
def add(a, b):
    return a + b
Output Format
Code extraction and execution:
import re
from typing import Any

def _extract_code(response: str) -> str | None:
    match = re.findall(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    if match:
        return match[-1].strip()
    if "def " in response:
        return response[response.find("def "):].strip()
    return None

def compute_reward(self, response: str, sample: dict) -> float:
    code = _extract_code(response)
    if not code:
        return 0.0
    try:
        ns: dict[str, Any] = {}
        exec(code, ns)  # Execute in isolated namespace
        for expr, expected in sample["tests"]:
            if eval(expr, ns) != expected:  # Run each test
                return 0.0
        return 1.0  # All tests passed
    except Exception:
        return 0.0  # Execution or syntax error
Reward is binary:
- 1.0: Code extracts, executes, and passes all test cases.
- 0.0: Code does not extract, fails to execute, or fails any test.
Within a group, advantages are centered: adv[i] = reward[i] - mean_reward.
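Concretely, the centering step looks like this (the values are illustrative):
# One group = `group` samples for the same problem; rewards come from compute_reward.
rewards = [1.0, 0.0, 1.0, 0.0]                    # e.g. 2 of 4 samples passed all tests
mean_reward = sum(rewards) / len(rewards)         # 0.5
advantages = [r - mean_reward for r in rewards]   # [0.5, -0.5, 0.5, -0.5]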
Sandbox note: The demo uses exec() in a dictionary namespace — not production-safe. For real deployments, use a proper sandbox (Docker, gVisor, Firecracker) to isolate untrusted code. Never run this on untrusted inputs.
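A step short of a real sandbox is to run each test in a child process with a wall-clock timeout; this contains infinite loops and hard crashes but is still not a security boundary. A minimal sketch, assuming a hypothetical helper that is not part of the demo:
import subprocess
import sys

def run_test_in_subprocess(code: str, expr: str, expected, timeout_s: float = 5.0) -> bool:
    # Hypothetical helper: evaluate one test in a separate Python process.
    # Limits runaway loops and crashes; it is NOT a substitute for a proper sandbox.
    script = f"{code}\nprint(repr({expr}))"
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == repr(expected)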
All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| steps | int | 10 | Training steps. |
| batch | int | 8 | Problems per step. |
| group | int | 4 | Samples per problem. |
| learning_rate | float | 2e-5 | Adam LR. Code RL: 1e-5 to 4e-5. |
| max_tokens | int | 256 | Max generation length. Code: 128–512 depending on problem complexity. |
| temperature | float | 0.8 | Sampling temperature. Typical: 0.7–1.0. |
| base_model | str | "Qwen/Qwen3-0.6B" | Base model. |
| rank | int | 16 | LoRA rank. |
| train_mlp | bool | True | Train MLP. |
| train_attn | bool | True | Train attention. |
| train_unembed | bool | True | Train output layer. |
Sandbox parameters (environment-specific):
- sandbox_timeout_s (float): timeout per code execution. Default: 5.0 seconds. Increase for slow tests.
- max_tests_per_problem (int): max test cases to run. Default: 10. Cap to avoid runaway loops.
- partial_credit (bool): whether to award fractional reward based on tests passed (e.g., 2/4 tests → 0.5). Default: False (binary reward); see the sketch after this list.
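A rough sketch of how partial_credit and max_tests_per_problem could plug into the reward function (this variant is an assumption, not the demo's implementation; sandbox_timeout_s is omitted because an in-process exec() cannot be cleanly timed out, see the subprocess sketch above):
def compute_reward_with_options(
    response: str,
    sample: dict,
    partial_credit: bool = False,
    max_tests_per_problem: int = 10,
) -> float:
    # Assumed variant of compute_reward; reuses _extract_code from the listing above.
    code = _extract_code(response)
    if not code:
        return 0.0
    tests = sample["tests"][:max_tests_per_problem]  # cap the number of tests run
    passed = 0
    try:
        ns: dict = {}
        exec(code, ns)
        for expr, expected in tests:
            if eval(expr, ns) == expected:
                passed += 1
            elif not partial_credit:
                return 0.0  # binary mode: any failure zeroes the reward
    except Exception:
        return 0.0
    if partial_credit:
        return passed / len(tests) if tests else 0.0
    return 1.0  # binary mode: reaching here means all tests passed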
Environment variables:
export MINT_RL_MAX_TOKENS=256
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
Extending to other languages: The demo uses Python's exec(). For other languages (JavaScript, Rust, Go), you must handle code extraction and compilation separately. Most teams swap the reward function but keep the GRPO loop unchanged.
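For JavaScript, for example, a swapped-in reward function might shell out to Node and leave the rest of the loop untouched. This is a hypothetical sketch (the helper name, test format, and Node invocation are assumptions, not part of the demo):
import json
import subprocess

def compute_reward_js(response: str, sample: dict, timeout_s: float = 5.0) -> float:
    # Hypothetical: score a JavaScript completion by running each test under Node.
    code = _extract_code(response)  # same extraction helper as the Python path
    if not code:
        return 0.0
    for expr, expected in sample["tests"]:
        script = f"{code}\nconsole.log(JSON.stringify({expr}));"
        try:
            result = subprocess.run(
                ["node", "-e", script],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return 0.0  # timed out, or Node is not installed
        if result.returncode != 0 or result.stdout.strip() != json.dumps(expected):
            return 0.0
    return 1.0  # all tests passed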