# Chat RL
Chat RL uses a preference reward function to score conversational responses. Unlike Math RL (which has a ground-truth answer), Chat RL evaluates quality on multiple dimensions: length, structure, and diversity. The canonical implementation uses a lightweight proxy scorer — no expensive LLM-as-judge calls, just heuristics.
The canonical adapter is `demos/rl/adapters/preference_chat.py`.
## Configuration
Use the standard GRPO setup:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)
```

Then run the Chat RL adapter:
```python
from demos.rl.adapters.preference_chat import PreferenceChatAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=128,  # longer responses than math
    temperature=0.8,
)
run_grpo(PreferenceChatAdapter(), cfg)
```

## Prompting Guide
Chat prompts are typically short user questions rendered through the model's chat template:
```python
class PreferenceChatAdapter(RLAdapter):
    PROMPTS = [
        "Explain what a variable is in programming.",
        "Write a short poem about the ocean.",
        "What are three benefits of exercise?",
        "Describe how to make a cup of tea.",
        "Why is the sky blue?",
    ]

    def make_prompt(self, sample: str, tokenizer) -> list[int]:
        messages = [{"role": "user", "content": sample}]
        if hasattr(tokenizer, "apply_chat_template"):
            return _coerce_chat_template_tokens(
                tokenizer.apply_chat_template(
                    messages, tokenize=True, add_generation_prompt=True
                )
            )
        return tokenizer.encode(f"User: {sample}\nAssistant:")
```

For each prompt, the policy generates `group` different responses (the group size). The reward function then evaluates each one.
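Conceptually, that inner loop looks something like this. This is an illustrative sketch only: the real implementation lives in `run_grpo` (`demos/rl/rl_core.py`), and `generate` here is a hypothetical sampling helper, not part of the demo API.

```python
# Sketch of the per-step sampling-and-scoring loop; generate() is a
# hypothetical helper that draws one completion from the current policy.
def sample_and_score(adapter, prompts, tokenizer, group, **gen_kwargs):
    results = []
    for sample in prompts:
        prompt_tokens = adapter.make_prompt(sample, tokenizer)
        # Draw `group` independent completions for the same prompt.
        responses = [generate(prompt_tokens, **gen_kwargs) for _ in range(group)]
        rewards = [adapter.compute_reward(r, sample) for r in responses]
        results.append((sample, responses, rewards))
    return results
```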
## Output Format
The preference reward function scores responses on multiple dimensions:
```python
def compute_reward(self, response: str, sample: str) -> float:
    r = 0.0
    words = len(response.split())

    # Length reward: prefer 20–100 words
    if 20 <= words <= 100:
        r += 0.4
    elif 10 <= words < 20 or 100 < words <= 150:
        r += 0.2

    # Structure reward: prefer 2+ sentences
    if response.count(".") >= 2:
        r += 0.3

    # Diversity reward: prefer varied vocabulary
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * 0.5:
        r += 0.3

    return min(r, 1.0)  # cap at 1.0
```

Rewards range from 0.0 to 1.0, combining:
- Length: 0–0.4 (full credit for 20–100 words, partial credit just outside that band).
- Structure: 0–0.3 (two or more sentences preferred).
- Diversity: 0–0.3 (varied vocabulary preferred, measured as the unique-to-total word ratio).

For example, a 40-word, three-sentence answer with 30 unique words scores 0.4 + 0.3 + 0.3 = 1.0.
Within a group, advantages are centered: `adv[i] = reward[i] - mean_reward`. This encourages high-quality responses while suppressing low-quality ones.
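A minimal sketch of that centering step (illustrative only; the actual computation lives in `run_grpo`):

```python
def center_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean so above-average responses get positive advantage."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]

# Example: a group of 4 responses scored [1.0, 0.7, 0.4, 0.3]
# has mean 0.6, yielding advantages [0.4, 0.1, -0.2, -0.3].
print(center_advantages([1.0, 0.7, 0.4, 0.3]))
```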
## All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `steps` | int | 10 | Training steps. |
| `batch` | int | 8 | Prompts per step. |
| `group` | int | 4 | Samples per prompt. |
| `lr` | float | 2e-5 | Adam learning rate. Chat RL: 1e-5 to 4e-5. |
| `max_tokens` | int | 128 | Max generation length. Chat: 64–256. |
| `temperature` | float | 0.8 | Sampling temperature. Typical: 0.7–1.0. |
| `base_model` | str | "Qwen/Qwen3-0.6B" | Base model. |
| `rank` | int | 16 | LoRA rank. |
| `train_mlp` | bool | True | Train MLP layers. |
| `train_attn` | bool | True | Train attention layers. |
| `train_unembed` | bool | True | Train the output (unembedding) layer. |
Reward parameters (tunable in `compute_reward`):

- `length_threshold`: (20, 100). Preferred word-count range; adjust for the task.
- `min_sentences`: 2. Minimum sentences to earn the structure reward; increase for longer-form tasks.
- `unique_word_ratio`: 0.5. Minimum unique-words / total-words ratio; increase to encourage vocabulary diversity.
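One way to make these knobs explicit is to lift them into keyword arguments. A minimal sketch, assuming you are free to change the adapter's method signature (the defaults mirror the values above, and the partial-credit band is generalized from the original 10–20 / 100–150 cutoffs):

```python
# Sketch: the same heuristics with the thresholds lifted into arguments.
# The widened signature is an assumption, not the demo's actual API.
def compute_reward(
    self,
    response: str,
    sample: str,
    length_threshold: tuple[int, int] = (20, 100),
    min_sentences: int = 2,
    unique_word_ratio: float = 0.5,
) -> float:
    r = 0.0
    words = len(response.split())
    lo, hi = length_threshold
    if lo <= words <= hi:
        r += 0.4
    elif lo // 2 <= words < lo or hi < words <= int(hi * 1.5):
        r += 0.2  # partial credit just outside the preferred band
    if response.count(".") >= min_sentences:
        r += 0.3
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * unique_word_ratio:
        r += 0.3
    return min(r, 1.0)
```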
Environment variables:
```bash
export MINT_RL_MAX_TOKENS=128
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
```

Extending to LLM-as-judge: replace `compute_reward` with a call to an evaluation model. The pattern: send (prompt, response) to a judge model, parse a score, and return it normalized to [0, 1]. This is more expensive but can handle subjective criteria such as writing quality or truthfulness.
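A minimal sketch of that pattern. The `query_judge` helper and the 0–10 scoring scale are assumptions, not part of the demo; wire it to whatever judge endpoint you have:

```python
import re

JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the user's question on a 0-10 scale "
    "for helpfulness and accuracy. Reply with only the number.\n\n"
    "Question: {prompt}\nAnswer: {response}\nScore:"
)

def compute_reward(self, response: str, sample: str) -> float:
    # query_judge() is a hypothetical helper that sends text to your
    # judge model and returns its completion as a string.
    raw = query_judge(JUDGE_TEMPLATE.format(prompt=sample, response=response))
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return min(float(match.group()), 10.0) / 10.0  # normalize to [0, 1]
```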