Chat RL

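Chat RL scores dialogue responses with a preference reward function. Unlike Math RL, which has a ground-truth answer, Chat RL evaluates quality along several dimensions: length, structure, and diversity. The reference implementation uses a lightweight proxy score: it makes no calls to an expensive LLM-as-judge and relies only on heuristic rules.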

The reference adapter is demos/rl/adapters/preference_chat.py.

Configuration

Use the standard GRPO setup:

import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=2e-5)

Then run the chat RL adapter:

from demos.rl.adapters.preference_chat import PreferenceChatAdapter
from demos.rl.rl_core import RLConfig, run_grpo

cfg = RLConfig(
    model="Qwen/Qwen3-0.6B",
    rank=16,
    steps=10,
    batch=8,
    group=4,
    lr=2e-5,
    max_tokens=128,      # chat responses run longer than math answers
    temperature=0.8,
)

run_grpo(PreferenceChatAdapter(), cfg)

Prompting Guide

Chat prompts are usually short user questions, rendered through the model's chat template:

class PreferenceChatAdapter(RLAdapter):
    PROMPTS = [
        "Explain what a variable is in programming.",
        "Write a short poem about the ocean.",
        "What are three benefits of exercise?",
        "Describe how to make a cup of tea.",
        "Why is the sky blue?",
    ]

    def make_prompt(self, sample: str, tokenizer) -> list[int]:
        messages = [{"role": "user", "content": sample}]
        if hasattr(tokenizer, "apply_chat_template"):
            return _coerce_chat_template_tokens(
                tokenizer.apply_chat_template(
                    messages, tokenize=True, add_generation_prompt=True
                )
            )
        return tokenizer.encode(f"User: {sample}\nAssistant:")

For each prompt, the policy generates group_size different responses (the `group` setting in the config). The reward function then scores each response separately.

Output Format

The preference reward function scores a response along several dimensions:

def compute_reward(self, response: str, sample: str) -> float:
    r = 0.0
    words = len(response.split())

    # Length bonus: prefer 20–100 words; partial credit just outside that range
    if 20 <= words <= 100:
        r += 0.4
    elif 10 <= words < 20 or 100 < words <= 150:
        r += 0.2

    # Structure bonus: prefer 2+ sentences (approximated by counting periods)
    if response.count(".") >= 2:
        r += 0.3

    # Diversity bonus: prefer varied vocabulary (more than half the words unique)
    unique_words = len(set(response.lower().split()))
    if words > 0 and unique_words > words * 0.5:
        r += 0.3

    return min(r, 1.0)  # cap at 1.0

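The reward ranges from 0.0 to 1.0 and has three components:

Length: 0–0.4 (0.4 for 20–100 words, 0.2 for 10–19 or 101–150 words, otherwise 0).
Structure: 0 or 0.3 (awarded when the response contains at least two periods, a proxy for multiple sentences).
Diversity: 0 or 0.3 (awarded when more than half of the words are unique).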

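Within a group, advantages are mean-centered: adv[i] = reward[i] - mean_reward. This pushes the policy toward responses that score above the group average and away from those that score below it.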
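The centering step can be illustrated with a minimal sketch. The helper name `center_advantages` is hypothetical and is not taken from the reference implementation; it only applies the formula above to one group of rewards.

```python
def center_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group's mean reward from each reward in the group."""
    if not rewards:
        return []
    mean_reward = sum(rewards) / len(rewards)
    return [r - mean_reward for r in rewards]


# One prompt, group of 4 sampled responses with their heuristic rewards.
group_rewards = [1.0, 0.7, 0.4, 0.3]
advantages = center_advantages(group_rewards)
# Approximately [0.4, 0.1, -0.2, -0.3]: the advantages sum to about zero.
print(advantages)
```

Note that if every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal under this formula.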

All Parameters

Parameter	Type	Default	Description
`steps`	int	`10`	Number of training steps.
`batch`	int	`8`	Prompts per step.
`group`	int	`4`	Samples per prompt.
`learning_rate`	float	`2e-5`	Adam learning rate (`lr` in `RLConfig`). For Chat RL, use 1e-5 to 4e-5.
`max_tokens`	int	`128`	Maximum generation length. Chat: 64–256.
`temperature`	float	`0.8`	Sampling temperature. Typical values: 0.7–1.0.
`base_model`	str	`"Qwen/Qwen3-0.6B"`	Base model (`model` in `RLConfig`).
`rank`	int	`16`	LoRA rank.
`train_mlp`	bool	`True`	Train the MLP layers.
`train_attn`	bool	`True`	Train the attention layers.
`train_unembed`	bool	`True`	Train the output (unembedding) layer.

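Reward parameters (adjustable inside compute_reward):

length_threshold: (20, 100). The preferred word-count range. Tune it per task.
min_sentences: 2. The minimum sentence count needed to earn the structure bonus. Raise it for longer tasks.
unique_word_ratio: 0.5. The minimum unique_words / total_words ratio. Raising it encourages more varied vocabulary.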
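In the compute_reward shown earlier these values appear as literals. One way to expose them, sketched here as a standalone function, is to lift them into keyword arguments. The function name `preference_reward` is hypothetical, and deriving the partial-credit bands as half the lower bound and 1.5 times the upper bound is an assumption chosen so that the defaults reproduce the original 10–19 and 101–150 word bands.

```python
def preference_reward(
    response: str,
    length_threshold: tuple[int, int] = (20, 100),
    min_sentences: int = 2,
    unique_word_ratio: float = 0.5,
) -> float:
    """Heuristic reward in [0, 1] with tunable length, structure, and diversity knobs."""
    tokens = response.split()
    words = len(tokens)
    low, high = length_threshold
    r = 0.0

    # Length: full credit inside the range, partial credit just outside it.
    # The partial bands (low // 2 and 1.5 * high) are an assumption of this sketch.
    if low <= words <= high:
        r += 0.4
    elif low // 2 <= words < low or high < words <= int(high * 1.5):
        r += 0.2

    # Structure: periods are used as a rough sentence count.
    if response.count(".") >= min_sentences:
        r += 0.3

    # Diversity: share of distinct (lowercased) words must exceed the ratio.
    unique_words = len({t.lower() for t in tokens})
    if words > 0 and unique_words > words * unique_word_ratio:
        r += 0.3

    return min(r, 1.0)


# A longer-form task might widen the length range and require more sentences.
score = preference_reward(
    "Tea is simple. Boil water. Steep the leaves. Pour and enjoy.",
    length_threshold=(10, 200),
    min_sentences=3,
)
print(score)
```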

Environment variables:

export MINT_RL_MAX_TOKENS=128
export MINT_RL_STEPS=10
export MINT_RL_BATCH=8
export MINT_RL_GROUP=4
export MINT_RL_LR=2e-5
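How the demo consumes these variables is not shown on this page. The following is a minimal sketch of one plausible approach, assuming each variable maps onto the `RLConfig` field with the matching meaning; the helper `rl_overrides_from_env` is hypothetical and not part of the reference code.

```python
import os


def rl_overrides_from_env(env=None) -> dict:
    """Read MINT_RL_* variables, falling back to the documented defaults.

    Hypothetical helper: the real demo may parse these differently.
    """
    env = os.environ if env is None else env
    # variable name -> (assumed RLConfig field, type cast, default)
    spec = {
        "MINT_RL_MAX_TOKENS": ("max_tokens", int, 128),
        "MINT_RL_STEPS": ("steps", int, 10),
        "MINT_RL_BATCH": ("batch", int, 8),
        "MINT_RL_GROUP": ("group", int, 4),
        "MINT_RL_LR": ("lr", float, 2e-5),
    }
    return {
        field: cast(env.get(var, default))
        for var, (field, cast, default) in spec.items()
    }


overrides = rl_overrides_from_env()
print(overrides)
```

The resulting dictionary could then be unpacked into the config, for example `RLConfig(model="Qwen/Qwen3-0.6B", rank=16, temperature=0.8, **overrides)`, if the field names match as assumed.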

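Extending to LLM-as-judge: replace compute_reward with a call to an evaluator model. The pattern is to send the (prompt, response) pair to the judge model, take back a score, normalize it, and return it. This approach costs more, but it can handle subjective criteria such as writing quality and truthfulness.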
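The pattern can be sketched as follows. Everything here is an assumption for illustration: `judge` stands for any callable that wraps your evaluator model and returns a raw numeric score, and the 1–10 scale is an arbitrary choice. No specific judge API is implied.

```python
from typing import Callable


def judge_reward(
    prompt: str,
    response: str,
    judge: Callable[[str, str], float],
    lo: float = 1.0,
    hi: float = 10.0,
) -> float:
    """Score a response with an external judge and normalize to [0, 1].

    `judge` is a hypothetical callable; `lo` and `hi` are the assumed
    bounds of its raw score scale.
    """
    raw = judge(prompt, response)
    normalized = (raw - lo) / (hi - lo)
    # Clamp in case the judge returns a score outside its nominal scale.
    return max(0.0, min(1.0, normalized))


# Stand-in judge for local testing: it ignores its inputs and returns a fixed score.
def stub_judge(prompt: str, response: str) -> float:
    return 5.5


print(judge_reward("Why is the sky blue?", "Because of Rayleigh scattering.", stub_judge))
```

Keeping the judge behind a plain callable makes it easy to swap in a stub during development and to cache or batch real judge calls later, since each training step issues batch × group scoring requests.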
