# chat-dpo
Eval-first DPO experiment for chat-quality preference pairs. The benchmark is a held-out pairwise preference set, not a generation benchmark with an external grader.
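A pairwise preference record pairs one prompt with a preferred and a dispreferred response. The field names below are an assumption for illustration (check the upstream README for the actual schema); this sketch just shows the one-JSON-object-per-line shape such a `full.jsonl` file typically has:

```python
import json

# Hypothetical schema for one preference pair (field names "prompt",
# "chosen", "rejected" are an assumption, not confirmed by this repo).
record = {
    "prompt": "How do I reverse a list in Python?",
    "chosen": "Use list.reverse() in place, or reversed()/slicing for a copy.",
    "rejected": "You can't reverse lists in Python.",
}

line = json.dumps(record)    # one JSON object per line in a .jsonl file
parsed = json.loads(line)
assert set(parsed) == {"prompt", "chosen", "rejected"}
print(parsed["prompt"])
```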
## At a glance
| | |
| --- | --- |
| Algorithm | DPO (pairwise preference) |
| Base model | Qwen/Qwen3-4B-Instruct-2507 |
| Training data | local `data/train/full.jsonl` |
| Benchmark | held-out pairwise preference eval on `data/eval/full.jsonl` |
| Primary metric | `eval_pair_accuracy` |
| Upstream README | Open in mint-cookbook → |
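Because the benchmark is a held-out preference set rather than a generated-output benchmark, the primary metric can be computed without an external grader: it is the fraction of held-out pairs where the model scores the chosen response above the rejected one. A minimal sketch, assuming per-pair log-likelihoods have already been computed by the policy (the numbers below are illustrative, not real model outputs):

```python
def eval_pair_accuracy(scores):
    """Fraction of pairs where the chosen response outscores the rejected one.

    `scores` is a list of (chosen_logp, rejected_logp) tuples. In the real
    eval these would be sequence log-likelihoods from the policy model;
    here they are mocked to keep the sketch self-contained.
    """
    wins = sum(1 for chosen, rejected in scores if chosen > rejected)
    return wins / len(scores)

# Mocked per-pair log-likelihoods: chosen wins in 3 of 4 pairs.
pairs = [(-12.3, -15.1), (-8.0, -7.5), (-20.4, -22.0), (-5.2, -6.6)]
print(eval_pair_accuracy(pairs))  # → 0.75
```

A random, untrained scorer hovers around 0.5 on this metric, which makes it a convenient sanity floor for the `--eval-only` step before training.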
For setup, runnable commands, and the full eval protocol, see the upstream README. The experiment follows the shared cookbook lifecycle: `uv sync` → `--dry-run` → `--eval-only` → train.
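For orientation, the per-pair objective behind the DPO row above can be sketched in a few lines. This is the standard DPO loss (negative log-sigmoid of a β-scaled policy-vs-reference margin), written with plain floats rather than the actual training stack, and `beta=0.1` is an illustrative default, not a value taken from this experiment's config:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    The margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    Inputs are sequence log-likelihoods; beta=0.1 is an assumed default.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin (policy prefers chosen more than the reference does)
# pulls the loss below log(2) ≈ 0.693, the zero-margin starting point.
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))
```

At initialization the policy equals the reference, every margin is zero, and the loss sits at log(2); training pushes margins positive, which is exactly what `eval_pair_accuracy` measures on the held-out pairs.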