
Chat RL

Chat RL uses a preference reward function to score conversational responses. Unlike Math RL, which can check against a ground-truth answer, Chat RL evaluates quality along multiple dimensions: length, structure, and vocabulary diversity. The canonical implementation uses a lightweight proxy scorer built from simple heuristics; no expensive LLM-as-judge calls are required.

The canonical adapter is demos/rl/adapters/preference_chat.py.

Configuration

Use the standard GRPO setup:

import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

Then run the Chat RL adapter:

from demos.rl.adapters.preference_chat import PreferenceChatAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=128,      # Longer responses than math
    temperature=0.8,
)

run_grpo(PreferenceChatAdapter(), cfg)

Prompting Guide

Chat prompts are typically short user questions rendered through the model's chat template:

class PreferenceChatAdapter(RLAdapter):
    PROMPTS = [
        "Explain what a variable is in programming.",
        "Write a short poem about the ocean.",
        "What are three benefits of exercise?",
        "Describe how to make a cup of tea.",
        "Why is the sky blue?",
    ]

    def make_prompt(self, sample: str, tokenizer) -> list[int]:
        # Render the user question through the model's chat template when available.
        messages = [{"role": "user", "content": sample}]
        if hasattr(tokenizer, "apply_chat_template"):
            return _coerce_chat_template_tokens(
                tokenizer.apply_chat_template(
                    messages, tokenize=True, add_generation_prompt=True
                )
            )
        # Fallback for tokenizers without a chat template.
        return tokenizer.encode(f"User: {sample}\nAssistant:")

For each prompt, the policy samples a group of candidate responses (group = 4 in the configuration above). The reward function then evaluates each one independently.
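
To make the group mechanics concrete, here is a minimal sketch of the per-prompt loop that run_grpo performs internally. The generate function below is a hypothetical stand-in for the policy's actual sampling call, not part of the toolkit's API:

def generate(prompt: str) -> str:
    # Hypothetical stand-in for sampling from the policy.
    return "A variable is a named storage location. Its value can change as the program runs."

adapter = PreferenceChatAdapter()
prompt = adapter.PROMPTS[0]

# One group: cfg.group sampled responses for the same prompt, each scored.
responses = [generate(prompt) for _ in range(cfg.group)]
rewards = [adapter.compute_reward(r, prompt) for r in responses]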

Output Format

The preference reward function scores responses on multiple dimensions:

def compute_reward(self, response: str, sample: str) -> float:
    r = 0.0
    words = len(response.split())
    
    # Length reward: prefer 20–100 words
    if 20 <= words <= 100:
        r += 0.4
    elif 10 <= words < 20 or 100 < words <= 150:
        r += 0.2
    
    # Structure reward: prefer 2+ sentences
    if response.count(".") >= 2:
        r += 0.3
    
    # Diversity reward: prefer varied vocabulary
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * 0.5:
        r += 0.3
    
    return min(r, 1.0)  # Cap at 1.0

Rewards range from 0.0 to 1.0, combining:

  • Length: 0–0.4 (responses in the 20–100 word band score highest; lengths just outside it earn partial credit).
  • Structure: 0–0.3 (two or more sentences preferred).
  • Diversity: 0–0.3 (more than half the words unique preferred).
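
As a quick sanity check, a response inside the preferred band with several sentences and varied vocabulary earns the full reward. A sketch, assuming the adapter instantiates with no arguments:

adapter = PreferenceChatAdapter()
response = (
    "Exercise improves cardiovascular health and strengthens muscles. "
    "It also lifts mood by releasing endorphins. "
    "Finally, regular activity supports better sleep."
)
# 20 words (length: +0.4), 3 periods (structure: +0.3),
# all words unique (diversity: +0.3) -> 1.0
print(adapter.compute_reward(response, "What are three benefits of exercise?"))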

Within a group, advantages are centered: adv[i] = reward[i] - mean_reward. Responses scoring above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages and are suppressed.
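
The centering arithmetic is simple enough to show inline (run_grpo computes this internally; the reward values here are made up):

rewards = [0.9, 0.4, 0.7, 0.2]             # one group of 4 scored responses
mean_reward = sum(rewards) / len(rewards)  # 0.55
advantages = [r - mean_reward for r in rewards]
print(advantages)  # ~[0.35, -0.15, 0.15, -0.35]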

All Parameters

Parameter       Type   Default             Meaning
steps           int    10                  Training steps.
batch           int    8                   Prompts per step.
group           int    4                   Samples per prompt.
learning_rate   float  2e-5                Adam LR. Chat RL: 1e-5 to 4e-5.
max_tokens      int    128                 Max generation length. Chat: 64–256.
temperature     float  0.8                 Sampling temperature. Typical: 0.7–1.0.
base_model      str    "Qwen/Qwen3-0.6B"   Base model.
rank            int    16                  LoRA rank.
train_mlp       bool   True                Train MLP layers.
train_attn      bool   True                Train attention layers.
train_unembed   bool   True                Train the output (unembedding) layer.

Reward parameters (tunable constants in compute_reward; see the sketch after this list):

  • length_threshold: (20, 100) — preferred word count range. Adjust for task.
  • min_sentences: 2 — minimum sentences to earn structure reward. Increase for longer-form tasks.
  • unique_word_ratio: 0.5 — minimum (unique_words / total_words) ratio. Increase to encourage vocabulary diversity.
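
The canonical adapter hardcodes these values as inline literals; the names above are descriptive, not actual arguments. One way to make them tunable is to lift them into module-level constants, as in this sketch:

# Sketch: the same heuristic with the tunable values named.
LENGTH_THRESHOLD = (20, 100)  # preferred word-count range
MIN_SENTENCES = 2             # periods required for the structure bonus
UNIQUE_WORD_RATIO = 0.5       # unique/total words required for the diversity bonus

def compute_reward(response: str, sample: str) -> float:
    r = 0.0
    words = len(response.split())
    low, high = LENGTH_THRESHOLD
    if low <= words <= high:
        r += 0.4
    elif low // 2 <= words < low or high < words <= high + 50:
        r += 0.2  # partial credit just outside the band
    if response.count(".") >= MIN_SENTENCES:
        r += 0.3
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * UNIQUE_WORD_RATIO:
        r += 0.3
    return min(r, 1.0)

With the defaults (low // 2 = 10, high + 50 = 150) this reproduces the original partial-credit bands exactly.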

Environment variables:

export MINT_RL_MAX_TOKENS=128
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
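
Assuming rl_core maps these variables onto the corresponding RLConfig fields (the actual loading code is not shown here), the override pattern is equivalent to this sketch:

import os

def _env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, default))

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=_env_int("MINT_RL_STEPS", 10),
    batch=_env_int("MINT_RL_BATCH", 8),
    group=_env_int("MINT_RL_GROUP", 4),
    lr=float(os.environ.get("MINT_RL_LR", "2e-5")),
    max_tokens=_env_int("MINT_RL_MAX_TOKENS", 128),
    temperature=0.8,
)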

Extending to LLM-as-judge: Replace compute_reward with a call to an evaluation model. Example pattern: send (prompt, response) to a judge model, get a score, and return the normalized score. This is more expensive but can handle subjective tasks like writing quality or truthfulness.
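
A sketch of that pattern; judge_model_score is a hypothetical placeholder for whatever evaluation endpoint you call, since MinT does not ship a judge client:

def judge_model_score(prompt: str, response: str) -> float:
    # Hypothetical: call a strong evaluation model, ask it to rate the
    # response from 0 to 10, and parse the number out of its reply.
    raise NotImplementedError("wire this to your judge model API")

class JudgedChatAdapter(PreferenceChatAdapter):
    def compute_reward(self, response: str, sample: str) -> float:
        raw = judge_model_score(sample, response)  # one judge call per sampled response
        return max(0.0, min(raw / 10.0, 1.0))      # normalize to [0, 1] like the heuristic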
