DPO Overview
Direct Preference Optimization (DPO) trains a model to prefer a chosen response over a rejected response for the same prompt.
In MinT, this recipe uses the low-level TrainingClient.forward_backward_custom() API. There is no built-in loss_fn="dpo" in this recipe. The Bradley-Terry loss is a normal Python function that runs on the client side and receives model logprobs from MinT.
This page matches recipes/dpo_native.py.
Data Shape
The training data starts as (prompt, chosen, rejected) triples:
    from dataclasses import dataclass


    @dataclass(frozen=True)
    class PreferencePair:
        prompt: str
        chosen: str
        rejected: str


    pairs = [
        PreferencePair(
            prompt="Explain why regular backups matter.",
            chosen="Backups protect data by creating copies that can be restored...",
            rejected="Backups are good.",
        ),
    ]

The recipe flattens each pair into two Datum objects:
    [chosen₀, rejected₀, chosen₁, rejected₁, ...]
      even       odd      even       odd

This order is required. The loss assumes even-indexed datums are chosen and odd-indexed datums are rejected.
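The interleaving described above can be sketched in plain Python. This is a minimal illustration, not the recipe's own `flatten_preference_pairs`: the names `flatten_pairs`, `split_pairs`, and the `build` callback (a stand-in for whatever turns a prompt and completion into a Datum) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def flatten_pairs(
    pairs: Sequence[PreferencePair],
    build: Callable[[str, str], T],  # hypothetical stand-in for datum construction
) -> List[T]:
    """Return [chosen_0, rejected_0, chosen_1, rejected_1, ...]."""
    flat: List[T] = []
    for pair in pairs:
        flat.append(build(pair.prompt, pair.chosen))    # even index: chosen
        flat.append(build(pair.prompt, pair.rejected))  # odd index: rejected
    return flat


def split_pairs(flat: Sequence[T]) -> List[Tuple[T, T]]:
    """Recover (chosen, rejected) tuples; reject a list with an odd length."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of datums (chosen/rejected pairs)")
    return list(zip(flat[::2], flat[1::2]))


demo_pairs = [
    PreferencePair("p0", "good0", "bad0"),
    PreferencePair("p1", "good1", "bad1"),
]
# Use a tuple as a placeholder datum so the layout is easy to inspect.
demo_flat = flatten_pairs(demo_pairs, lambda prompt, completion: (prompt, completion))
```

A length check like the one in `split_pairs` is cheap insurance: if a datum is dropped during preprocessing, every later pair would silently be misaligned.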
Datum Construction
The prompt tokens have zero loss weight. The completion tokens have weight 1.0:
    def build_datum(prompt_tokens, completion_text, tokenizer):
        completion_tokens = tokenizer.encode(f" {completion_text}", add_special_tokens=False)
        completion_tokens.append(tokenizer.eos_token_id)
        all_tokens = prompt_tokens + completion_tokens
        # Shift by one position for next-token prediction.
        input_tokens = all_tokens[:-1]
        target_tokens = all_tokens[1:]
        # One weight per target token: 0.0 for prompt targets, 1.0 for completion targets.
        weights = [0.0] * (len(prompt_tokens) - 1) + [1.0] * len(completion_tokens)
        return types.Datum(
            model_input=types.ModelInput.from_ints(tokens=input_tokens),
            loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
        )

The prompt itself uses the model chat template when available:
    def build_prompt_tokens(prompt, tokenizer):
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
        )

Custom Bradley-Terry Loss
forward_backward_custom() sends the datums through the model, then calls your Python loss function with (data, logprobs_list).
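To see why a weighted sum over per-token logprobs gives the completion's log-probability, here is a toy walk-through of the same shift-by-one layout that build_datum uses. The token IDs and logprob values are made up, the helper names are hypothetical, and the sketch assumes each logprobs entry holds one value per target token.

```python
def shift_and_weight(prompt_tokens, completion_tokens):
    """Mirror the input/target/weight layout of build_datum on plain lists."""
    all_tokens = prompt_tokens + completion_tokens
    input_tokens = all_tokens[:-1]
    target_tokens = all_tokens[1:]
    # Targets that are still prompt tokens get 0.0; completion targets get 1.0.
    weights = [0.0] * (len(prompt_tokens) - 1) + [1.0] * len(completion_tokens)
    return input_tokens, target_tokens, weights


def weighted_sum(logprobs, weights):
    """Plain-Python analogue of a dot product between logprobs and weights."""
    return sum(lp * w for lp, w in zip(logprobs, weights))


# Made-up IDs: three prompt tokens, two completion tokens.
inputs, targets, weights = shift_and_weight([101, 102, 103], [201, 202])

# Made-up per-target logprobs; only the last two (completion) positions count.
toy_logprobs = [-0.5, -0.25, -1.0, -2.0]
score = weighted_sum(toy_logprobs, weights)
```

Here `targets` is `[102, 103, 201, 202]` and `weights` is `[0.0, 0.0, 1.0, 1.0]`, so `score` sums only the two completion logprobs.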
The core loss:
    import torch
    import torch.nn.functional as F


    def sequence_logprob(logprobs, weights):
        # Important: keep logprobs as a Tensor so gradients are preserved.
        logprob_tensor = logprobs.flatten().float()
        # _to_float_tensor is a helper from the recipe source (not shown on this page).
        weight_tensor = _to_float_tensor(weights)
        return torch.dot(logprob_tensor, weight_tensor)


    def pairwise_preference_loss(data, logprobs_list):
        chosen_scores = []
        rejected_scores = []
        # Even indices hold chosen datums, odd indices hold rejected datums.
        for chosen_datum, rejected_datum, chosen_logprobs, rejected_logprobs in zip(
            data[::2], data[1::2], logprobs_list[::2], logprobs_list[1::2]
        ):
            chosen_scores.append(
                sequence_logprob(chosen_logprobs, chosen_datum.loss_fn_inputs["weights"])
            )
            rejected_scores.append(
                sequence_logprob(rejected_logprobs, rejected_datum.loss_fn_inputs["weights"])
            )
        margins = torch.stack(chosen_scores) - torch.stack(rejected_scores)
        loss = -F.logsigmoid(margins).mean()
        metrics = {
            "loss": float(loss.detach().cpu()),
            "pair_accuracy": float((margins > 0).float().mean().detach().cpu()),
            "mean_margin": float(margins.mean().detach().cpu()),
        }
        return loss, metrics

Do not convert logprobs to a Python list before computing the loss. That detaches the tensor from autograd and breaks forward_backward_custom().
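Written out, the loss function shown in this section computes the following. The symbols are notation introduced here for illustration: $w_t$ are the datum weights, $y^{+}$ and $y^{-}$ are the chosen and rejected completions for prompt $x$, $N$ is the number of pairs, and $\sigma$ is the logistic sigmoid.

```latex
s_\theta(x, y) = \sum_{t} w_t \,\log \pi_\theta(y_t \mid x, y_{<t}),
\qquad
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \log \sigma\bigl( s_\theta(x_i, y_i^{+}) - s_\theta(x_i, y_i^{-}) \bigr)
```

As written, the margin uses only the trained model's log-probabilities; no reference-model term or scaling coefficient appears in this code.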
Training Loop
    service_client = mint.ServiceClient()
    training_client = service_client.create_lora_training_client(
        base_model="Qwen/Qwen3-0.6B",
        rank=16,
        train_mlp=True,
        train_attn=True,
        train_unembed=True,
    )
    tokenizer = training_client.get_tokenizer()
    data = flatten_preference_pairs(PREFERENCE_PAIRS, tokenizer)

    for step in range(1, DPO_STEPS + 1):
        result = training_client.forward_backward_custom(
            data,
            pairwise_preference_loss,
        ).result()
        metrics = result.metrics or {}
        training_client.optim_step(types.AdamParams(learning_rate=1e-5)).result()
        print(
            f"Step {step}: loss={metrics['loss']:.6f}, "
            f"pair_accuracy={metrics['pair_accuracy']:.2f}"
        )

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/dpo_native.py
Verified Run
Verified on MinT with Qwen/Qwen3-0.6B, 4 preference pairs, 3 DPO steps:
    Step 1: loss=34.563499, pair_accuracy=0.00, mean_margin=-34.563488
    Step 2: loss=34.331955, pair_accuracy=0.00, mean_margin=-34.331944
    Step 3: loss=33.277603, pair_accuracy=0.00, mean_margin=-33.277576

Final checkpoint:

    tinker://06770ead-184f-4638-824a-21138820dc4f_0/sampler_weights/dpo-native-final

The tiny sample data is only for API verification. pair_accuracy=0.00 is valid because the base model initially scores the rejected completions higher for these examples. The important verification is that the custom loss is finite, gradients flow, optimizer steps complete, and metrics are returned.
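The reported loss values sit very close to the negated mean margin, which is what the loss formula predicts: for a strongly negative margin m, -log σ(m) is approximately -m. The sketch below reproduces the three metrics in plain Python (no torch) so the relationship is easy to check. The function names are hypothetical and the per-pair margins are made-up values, not the ones from the run above.

```python
import math


def neg_log_sigmoid(margin: float) -> float:
    """Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin)."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_metrics(margins):
    """Compute the same three metrics as the recipe, on a list of floats."""
    n = len(margins)
    return {
        "loss": sum(neg_log_sigmoid(m) for m in margins) / n,
        "pair_accuracy": sum(1.0 for m in margins if m > 0) / n,
        "mean_margin": sum(margins) / n,
    }


# Made-up margins: every rejected completion scores far higher than its chosen one.
toy_metrics = pairwise_metrics([-34.0, -35.0])

# A single margin equal to the reported step-1 mean_margin.
single_loss = neg_log_sigmoid(-34.563488)
```

With margins this negative, `toy_metrics["loss"]` is essentially `-toy_metrics["mean_margin"]` and `pair_accuracy` is 0.0. The small gap between the reported loss and mean_margin in the run is plausibly reduced-precision rounding, though that is an assumption rather than something the log confirms.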
Parameters Used by This Recipe
| Parameter | Default | Meaning |
|---|---|---|
| MINT_BASE_MODEL | Qwen/Qwen3-0.6B | Base model to train. |
| MINT_LORA_RANK | 16 | LoRA rank. |
| MINT_DPO_STEPS | 3 | Number of custom-loss training steps. |
| MINT_DPO_LR | 1e-5 | Adam learning rate. |
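One way to wire these parameters into a script is sketched below. It assumes they are read from environment variables of the same names, which the table suggests but does not state; `load_recipe_settings` and the dictionary keys are hypothetical names used only for illustration.

```python
import os
from typing import Mapping, Optional


def load_recipe_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the four parameters, falling back to the defaults from the table."""
    env = os.environ if env is None else env
    return {
        "base_model": env.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B"),
        "lora_rank": int(env.get("MINT_LORA_RANK", "16")),
        "dpo_steps": int(env.get("MINT_DPO_STEPS", "3")),
        "learning_rate": float(env.get("MINT_DPO_LR", "1e-5")),
    }


# Passing an explicit mapping keeps the example independent of the real environment.
defaults = load_recipe_settings({})
overridden = load_recipe_settings({"MINT_DPO_STEPS": "10", "MINT_DPO_LR": "5e-6"})
```

Parsing to `int` and `float` at load time surfaces a malformed value immediately instead of partway through training.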