# Chat RL

This page documents `demos/rl/adapters/preference_chat.py` in mint-quickstart.
## What this demo does

- Runs RL-2 Preference Chat via the shared `demos/rl/rl_core.py` GRPO loop and a task-specific adapter.
- Trains a LoRA over generic prompts using a proxy preference reward (length + structure + diversity).
- Samples `MINT_RL_GROUP` completions per prompt, applies group-relative advantages, and updates via `importance_sampling`.
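Group-relative advantages mean each completion is scored against the average of its own sampled group rather than a learned value baseline. The sketch below is illustrative (names are hypothetical; `rl_core`'s actual implementation may also normalize by the group's standard deviation):

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each completion = its reward minus the group mean.

    Completions that beat their group's average get a positive advantage
    (and are upweighted); below-average ones get a negative advantage.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# One prompt, MINT_RL_GROUP=4 sampled completions:
advs = group_relative_advantages([0.7, 0.3, 0.5, 0.5])
# advantages sum to zero within each group by construction
```

Because the baseline comes from sibling completions of the same prompt, only relative quality within a group drives the update.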
## Expected output

- Prints `Model: ...` and per-step `Step N: avg_reward=..., datums=...`; finishes with `Saved: ...`.
## Common gotchas

- The reward is intentionally heuristic and gameable; for production alignment, replace `compute_reward()` with real preference judges or rule sets.
- If rewards are flat (or `datums=0`), tune `MINT_RL_TEMPERATURE` / `MINT_RL_GROUP` / `MINT_RL_BATCH` and check prompt diversity.
- Prompt formatting uses `tokenizer.apply_chat_template` when available, and otherwise falls back to `"User: ...\nAssistant:"`; if your model is chat-template sensitive, adjust `make_prompt()`.
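To see concretely why the reward is gameable, here is the same scoring logic as `compute_reward()`, extracted into a standalone function for illustration. A degenerate response built purely to hit the word-count, punctuation, and uniqueness buckets earns the maximum score:

```python
def proxy_reward(response: str) -> float:
    """Same heuristic as the demo's compute_reward(): length + structure + diversity."""
    r = 0.0
    words = len(response.split())
    if 20 <= words <= 100:
        r += 0.4
    elif 10 <= words < 20 or 100 < words <= 150:
        r += 0.2
    if response.count(".") >= 2:            # "structure": at least two sentences
        r += 0.3
    unique = len(set(response.lower().split()))
    if words > 0 and unique > words * 0.5:  # "diversity": >50% unique tokens
        r += 0.3
    return min(r, 1.0)

# 25 unique "words", each ending in a period: hits every bucket,
# despite being meaningless text.
gamed = " ".join(f"word{i}." for i in range(25))
```

This is exactly the failure mode a real preference judge avoids: the heuristic rewards surface statistics, not helpfulness.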
## Prerequisites

- Python >= 3.11
- `MINT_API_KEY` set (or configured via `.env`)
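If you keep the key in a `.env` file and your setup doesn't load it automatically, a minimal stdlib loader looks like this (illustrative only, not part of mint-quickstart):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored.

    Uses setdefault so variables already exported in the shell win.
    """
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env present; rely on the shell environment
```

Call `load_env_file()` before importing anything that reads `MINT_API_KEY`.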
## How to run

```bash
export MINT_API_KEY=sk-mint-...
python demos/rl/adapters/preference_chat.py
```

## Parameters (env vars)

- `MINT_BASE_MODEL`: default `Qwen/Qwen3-0.6B`
- `MINT_LORA_RANK`: default `16`
- `MINT_RL_STEPS`: default `10`
- `MINT_RL_BATCH`: default `8`
- `MINT_RL_GROUP`: default `4`
- `MINT_RL_LR`: default `1e-4`
- `MINT_RL_MAX_TOKENS`: default `128`
- `MINT_RL_TEMPERATURE`: default `1.0`
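Since every knob is a plain environment variable, overrides can be set inline for a single run. For example, a run with a larger group and cooler sampling (values here are illustrative, not recommendations):

```shell
export MINT_API_KEY=sk-mint-...
MINT_RL_GROUP=8 MINT_RL_TEMPERATURE=0.7 MINT_RL_STEPS=20 \
  python demos/rl/adapters/preference_chat.py
```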
## Full script

```python
#!/usr/bin/env python3
"""RL-2 Preference Chat — adapter for rl_core.

Reward: proxy helpfulness score (length + structure + diversity).
Run: python demos/rl/adapters/preference_chat.py
"""
from __future__ import annotations

import os
import random
import sys
from pathlib import Path
from typing import Any

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))

from rl_core import RLAdapter, RLConfig, run_grpo  # noqa: E402

PROMPTS = [
    "Explain what a variable is in programming.",
    "Write a short poem about the ocean.",
    "What are three benefits of exercise?",
    "Describe how to make a cup of tea.",
    "Why is the sky blue?",
    "Give tips for better sleep.",
    "What is machine learning?",
    "How do plants make food?",
]


class PreferenceChatAdapter(RLAdapter):
    def build_dataset(self) -> list[str]:
        return [random.choice(PROMPTS) for _ in range(50)]

    def make_prompt(self, sample: str, tokenizer: Any) -> list[int]:
        messages = [{"role": "user", "content": sample}]
        if hasattr(tokenizer, "apply_chat_template"):
            return list(tokenizer.apply_chat_template(
                messages, tokenize=True, add_generation_prompt=True
            ))
        return tokenizer.encode(f"User: {sample}\nAssistant:")

    def compute_reward(self, response: str, sample: str) -> float:
        r = 0.0
        words = len(response.split())
        if 20 <= words <= 100:
            r += 0.4
        elif 10 <= words < 20 or 100 < words <= 150:
            r += 0.2
        if response.count(".") >= 2:
            r += 0.3
        unique_words = len(set(response.lower().split()))
        if words > 0 and unique_words > words * 0.5:
            r += 0.3
        return min(r, 1.0)

    def evaluate(self, step: int, rewards: list[float], num_datums: int) -> None:
        avg = sum(rewards) / len(rewards) if rewards else 0.0
        print(f"Step {step}: avg_reward={avg:.3f}, datums={num_datums}")


if __name__ == "__main__":
    cfg = RLConfig.from_env()
    cfg.steps = int(os.environ.get("MINT_RL_STEPS", "10"))
    cfg.batch = int(os.environ.get("MINT_RL_BATCH", "8"))
    cfg.max_tokens = int(os.environ.get("MINT_RL_MAX_TOKENS", "128"))
    run_grpo(PreferenceChatAdapter(), cfg)
```

## Next steps
- The final line prints `Saved: <path>`. You can load it in a new process:

```python
import mint

service_client = mint.ServiceClient()
sampling_client = service_client.create_sampling_client(model_path="<paste Saved path>")
```

- For sampling + tokenization details, see `/using-the-api/saving-and-loading` and `/api-reference/sampling-client`.