# Chat RL
Chat RL uses a preference reward function to score conversational responses. Unlike Math RL (which has a ground-truth answer), Chat RL evaluates quality on multiple dimensions: length, structure, and diversity. The canonical implementation uses a lightweight proxy scorer — no expensive LLM-as-judge calls, just heuristics.
The canonical adapter is `demos/rl/adapters/preference_chat.py`.
## Configuration
Use the standard GRPO setup:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)
```

Then run the Chat RL adapter:
```python
from demos.rl.adapters.preference_chat import PreferenceChatAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=128,  # longer responses than math
    temperature=0.8,
)
run_grpo(PreferenceChatAdapter(), cfg)
```

## Prompting Guide
Chat prompts are typically short user questions rendered through the model's chat template:
```python
class PreferenceChatAdapter(RLAdapter):
    PROMPTS = [
        "Explain what a variable is in programming.",
        "Write a short poem about the ocean.",
        "What are three benefits of exercise?",
        "Describe how to make a cup of tea.",
        "Why is the sky blue?",
    ]

    def make_prompt(self, sample: str, tokenizer) -> list[int]:
        messages = [{"role": "user", "content": sample}]
        if hasattr(tokenizer, "apply_chat_template"):
            return _coerce_chat_template_tokens(
                tokenizer.apply_chat_template(
                    messages, tokenize=True, add_generation_prompt=True
                )
            )
        return tokenizer.encode(f"User: {sample}\nAssistant:")
```

For each prompt, the policy generates `group` different responses (the group size). The reward function then evaluates each one.
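Conceptually, that inner loop looks something like this. This is an illustrative sketch only: the real implementation lives in `run_grpo` (`demos/rl/rl_core.py`), and `generate` here is a hypothetical sampling helper, not part of the demo API.

```python
# Sketch of the per-step sampling-and-scoring loop; generate() is a
# hypothetical helper that draws one completion from the current policy.
def sample_and_score(adapter, prompts, tokenizer, group, **gen_kwargs):
    results = []
    for sample in prompts:
        prompt_tokens = adapter.make_prompt(sample, tokenizer)
        # Draw `group` independent completions for the same prompt.
        responses = [generate(prompt_tokens, **gen_kwargs) for _ in range(group)]
        rewards = [adapter.compute_reward(r, sample) for r in responses]
        results.append((sample, responses, rewards))
    return results
```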
## Output Format
The preference reward function scores responses on multiple dimensions:
```python
def compute_reward(self, response: str, sample: str) -> float:
    r = 0.0
    words = len(response.split())

    # Length reward: prefer 20–100 words
    if 20 <= words <= 100:
        r += 0.4
    elif 10 <= words < 20 or 100 < words <= 150:
        r += 0.2

    # Structure reward: prefer 2+ sentences
    if response.count(".") >= 2:
        r += 0.3

    # Diversity reward: prefer varied vocabulary
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * 0.5:
        r += 0.3

    return min(r, 1.0)  # cap at 1.0
```

Rewards range from 0.0 to 1.0, combining:
- Length: 0–0.4 (full credit for 20–100 words, partial credit just outside that band).
- Structure: 0–0.3 (two or more sentences preferred).
- Diversity: 0–0.3 (varied vocabulary preferred, measured as the unique-to-total word ratio).

For example, a 40-word, three-sentence answer with 30 unique words scores 0.4 + 0.3 + 0.3 = 1.0.
Within a group, advantages are centered: `adv[i] = reward[i] - mean_reward`. This encourages high-quality responses while suppressing low-quality ones.
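A minimal sketch of that centering step (illustrative only; the actual computation lives in `run_grpo`):

```python
def center_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean so above-average responses get positive advantage."""
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]

# Example: a group of 4 responses scored [1.0, 0.7, 0.4, 0.3]
# has mean 0.6, yielding advantages [0.4, 0.1, -0.2, -0.3].
print(center_advantages([1.0, 0.7, 0.4, 0.3]))
```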
## All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `steps` | int | 10 | Training steps. |
| `batch` | int | 8 | Prompts per step. |
| `group` | int | 4 | Samples per prompt. |
| `lr` | float | 2e-5 | Adam learning rate. Chat RL: 1e-5 to 4e-5. |
| `max_tokens` | int | 128 | Max generation length. Chat: 64–256. |
| `temperature` | float | 0.8 | Sampling temperature. Typical: 0.7–1.0. |
| `base_model` | str | "Qwen/Qwen3-0.6B" | Base model. |
| `rank` | int | 16 | LoRA rank. |
| `train_mlp` | bool | True | Train MLP layers. |
| `train_attn` | bool | True | Train attention layers. |
| `train_unembed` | bool | True | Train the output (unembedding) layer. |
Reward parameters (tunable in `compute_reward`):

- `length_threshold`: (20, 100). Preferred word-count range; adjust for the task.
- `min_sentences`: 2. Minimum sentences to earn the structure reward; increase for longer-form tasks.
- `unique_word_ratio`: 0.5. Minimum unique-words / total-words ratio; increase to encourage vocabulary diversity.
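One way to make these knobs explicit is to lift them into keyword arguments. A minimal sketch, assuming you are free to change the adapter's method signature (the defaults mirror the values above, and the partial-credit band is generalized from the original 10–20 / 100–150 cutoffs):

```python
# Sketch: the same heuristics with the thresholds lifted into arguments.
# The widened signature is an assumption, not the demo's actual API.
def compute_reward(
    self,
    response: str,
    sample: str,
    length_threshold: tuple[int, int] = (20, 100),
    min_sentences: int = 2,
    unique_word_ratio: float = 0.5,
) -> float:
    r = 0.0
    words = len(response.split())
    lo, hi = length_threshold
    if lo <= words <= hi:
        r += 0.4
    elif lo // 2 <= words < lo or hi < words <= int(hi * 1.5):
        r += 0.2  # partial credit just outside the preferred band
    if response.count(".") >= min_sentences:
        r += 0.3
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * unique_word_ratio:
        r += 0.3
    return min(r, 1.0)
```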
Environment variables:
```bash
export MINT_RL_MAX_TOKENS=128
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
```

Extending to LLM-as-judge: replace `compute_reward` with a call to an evaluation model. The pattern: send (prompt, response) to a judge model, parse a score, and return it normalized to [0, 1]. This is more expensive but can handle subjective criteria such as writing quality or truthfulness.
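A minimal sketch of that pattern. The `query_judge` helper and the 0–10 scoring scale are assumptions, not part of the demo; wire it to whatever judge endpoint you have:

```python
import re

JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the user's question on a 0-10 scale "
    "for helpfulness and accuracy. Reply with only the number.\n\n"
    "Question: {prompt}\nAnswer: {response}\nScore:"
)

def compute_reward(self, response: str, sample: str) -> float:
    # query_judge() is a hypothetical helper that sends text to your
    # judge model and returns its completion as a string.
    raw = query_judge(JUDGE_TEMPLATE.format(prompt=sample, response=response))
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return min(float(match.group()), 10.0) / 10.0  # normalize to [0, 1]
```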