# chat-dpo
Eval-first DPO experiment for chat-quality preference pairs. The benchmark is a held-out pairwise preference set, not a generation benchmark with an external grader.
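A pairwise preference record pairs one prompt with a preferred and a dispreferred response. The field names below are an assumption for illustration (check the upstream README for the actual schema); this sketch just shows the one-JSON-object-per-line shape such a `full.jsonl` file typically has:

```python
import json

# Hypothetical schema for one preference pair (field names "prompt",
# "chosen", "rejected" are an assumption, not confirmed by this repo).
record = {
    "prompt": "How do I reverse a list in Python?",
    "chosen": "Use list.reverse() in place, or reversed()/slicing for a copy.",
    "rejected": "You can't reverse lists in Python.",
}

line = json.dumps(record)    # one JSON object per line in a .jsonl file
parsed = json.loads(line)
assert set(parsed) == {"prompt", "chosen", "rejected"}
print(parsed["prompt"])
```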
## At a glance
| | |
| --- | --- |
| Algorithm | DPO (pairwise preference) |
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Training data | local `data/train/full.jsonl` |
| Benchmark | held-out pairwise preference eval on `data/eval/full.jsonl` |
| Primary metric | `eval_pair_accuracy` |
| Upstream README | Open in mint-cookbook → |
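Because the benchmark is a held-out preference set rather than a generated-output benchmark, the primary metric can be computed without an external grader: it is the fraction of held-out pairs where the model scores the chosen response above the rejected one. A minimal sketch, assuming per-pair log-likelihoods have already been computed by the policy (the numbers below are illustrative, not real model outputs):

```python
def eval_pair_accuracy(scores):
    """Fraction of pairs where the chosen response outscores the rejected one.

    `scores` is a list of (chosen_logp, rejected_logp) tuples. In the real
    eval these would be sequence log-likelihoods from the policy model;
    here they are mocked to keep the sketch self-contained.
    """
    wins = sum(1 for chosen, rejected in scores if chosen > rejected)
    return wins / len(scores)

# Mocked per-pair log-likelihoods: chosen wins in 3 of 4 pairs.
pairs = [(-12.3, -15.1), (-8.0, -7.5), (-20.4, -22.0), (-5.2, -6.6)]
print(eval_pair_accuracy(pairs))  # → 0.75
```

A random, untrained scorer hovers around 0.5 on this metric, which makes it a convenient sanity floor for the `--eval-only` step before training.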
For setup, runnable commands, and the full eval protocol, see the upstream README. The experiment follows the shared cookbook lifecycle: `uv sync` → `--dry-run` → `--eval-only` → train.
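For orientation, the per-pair objective behind the DPO row above can be sketched in a few lines. This is the standard DPO loss (negative log-sigmoid of a β-scaled policy-vs-reference margin), written with plain floats rather than the actual training stack, and `beta=0.1` is an illustrative default, not a value taken from this experiment's config:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    The margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    Inputs are sequence log-likelihoods; beta=0.1 is an assumed default.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin (policy prefers chosen more than the reference does)
# pulls the loss below log(2) ≈ 0.693, the zero-margin starting point.
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))
```

At initialization the policy equals the reference, every margin is zero, and the loss sits at log(2); training pushes margins positive, which is exactly what `eval_pair_accuracy` measures on the held-out pairs.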