Mind Lab Toolkit (MinT)

DPO Overview

Direct Preference Optimization (DPO) trains a model to prefer a chosen response over a rejected response for the same prompt.

In MinT, this recipe uses the low-level TrainingClient.forward_backward_custom() API. There is no built-in loss_fn="dpo" in this recipe. The Bradley-Terry loss is a normal Python function that runs on the client side and receives model logprobs from MinT.

This page matches recipes/dpo_native.py.

Data Shape

The training data starts as (prompt, chosen, rejected) triples:

from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

pairs = [
    PreferencePair(
        prompt="Explain why regular backups matter.",
        chosen="Backups protect data by creating copies that can be restored...",
        rejected="Backups are good.",
    ),
]

The recipe flattens each pair into two Datum objects:

[chosen₀, rejected₀, chosen₁, rejected₁, ...]
   even      odd       even      odd

This order is required. The loss assumes even-indexed datums are chosen and odd-indexed datums are rejected.
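
The interleaving can be sketched in a few lines. This is an illustration, not the recipe's flatten_preference_pairs(): the names flatten_pairs_sketch, make_datum, and check_pair_layout are hypothetical, and make_datum stands in for the prompt tokenization and datum construction covered in the next section.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def flatten_pairs_sketch(
    pairs: Sequence[PreferencePair],
    make_datum: Callable[[str, str], Any],
) -> List[Any]:
    """Interleave datums as [chosen0, rejected0, chosen1, rejected1, ...].

    make_datum(prompt, completion) is a hypothetical stand-in for the
    recipe's own datum-building helpers.
    """
    flat: List[Any] = []
    for pair in pairs:
        flat.append(make_datum(pair.prompt, pair.chosen))    # lands on an even index
        flat.append(make_datum(pair.prompt, pair.rejected))  # lands on an odd index
    return flat


def check_pair_layout(flat: Sequence[Any]) -> None:
    """Fail fast if the list cannot be split into (chosen, rejected) pairs."""
    if len(flat) % 2 != 0:
        raise ValueError(f"expected an even number of datums, got {len(flat)}")
```

A length check like check_pair_layout() is cheap insurance before training: if a datum is dropped or filtered out, every later pair would otherwise be silently misaligned.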

Datum Construction

The prompt tokens have zero loss weight. The completion tokens have weight 1.0:

def build_datum(prompt_tokens, completion_text, tokenizer):
    completion_tokens = tokenizer.encode(f" {completion_text}", add_special_tokens=False)
    completion_tokens.append(tokenizer.eos_token_id)

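    all_tokens = prompt_tokens + completion_tokens
    # Next-token prediction: position i of the input predicts token i + 1.
    input_tokens = all_tokens[:-1]
    target_tokens = all_tokens[1:]
    # After the shift, the first len(prompt_tokens) - 1 targets are prompt tokens (weight 0).
    weights = [0.0] * (len(prompt_tokens) - 1) + [1.0] * len(completion_tokens)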

    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
    )
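
To see why the zero-weight prefix has length len(prompt_tokens) - 1, it can help to walk through the shift with toy values. The token ids below are made up for illustration and do not come from any real tokenizer.

```python
prompt_tokens = [101, 102, 103]    # toy ids: 3 prompt tokens
completion_tokens = [201, 202, 9]  # toy ids: 9 plays the role of EOS

all_tokens = prompt_tokens + completion_tokens
input_tokens = all_tokens[:-1]
target_tokens = all_tokens[1:]
weights = [0.0] * (len(prompt_tokens) - 1) + [1.0] * len(completion_tokens)

# Each row: position, input token, target token it must predict, loss weight.
for position, (inp, tgt, w) in enumerate(zip(input_tokens, target_tokens, weights)):
    print(position, inp, tgt, w)

# The targets that carry weight are exactly the completion tokens.
scored_targets = [t for t, w in zip(target_tokens, weights) if w > 0]
```

The last prompt token (103) is still an input, but its target is the first completion token (201), so that position already carries weight 1.0.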

The prompt itself uses the model chat template when available:

def build_prompt_tokens(prompt, tokenizer):
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
    )

Custom Bradley-Terry Loss

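forward_backward_custom() runs the datums through the model, then calls your Python loss function with (data, logprobs_list), where logprobs_list[i] holds the logprobs for data[i]. The function returns a scalar loss tensor and a dict of metrics.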
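
In symbols, the code in this section computes the following. The notation is ours rather than the recipe's: w_t is the per-token weight stored in the datum (0 on prompt tokens, 1 on completion tokens), y⁺ is the chosen response, y⁻ is the rejected response, and N is the number of pairs.

```latex
s_\theta(x, y) = \sum_{t} w_t \,\log \pi_\theta(y_t \mid x, y_{<t}),
\qquad
m_i = s_\theta(x_i, y_i^{+}) - s_\theta(x_i, y_i^{-}),
\qquad
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma(m_i)
```

As the code is written, the margin m_i contains no reference-model log-ratio and no β scaling, both of which appear in the DPO objective as it is usually stated.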

The core loss:

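import torch
import torch.nn.functional as F


def sequence_logprob(logprobs, weights):
    # Important: keep logprobs as a Tensor so gradients are preserved.
    logprob_tensor = logprobs.flatten().float()
    # _to_float_tensor (defined in the recipe source, not shown) returns a 1-D float tensor.
    weight_tensor = _to_float_tensor(weights)
    # Weighted sum of token logprobs; prompt positions have weight 0.
    return torch.dot(logprob_tensor, weight_tensor)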


def pairwise_preference_loss(data, logprobs_list):
    chosen_scores = []
    rejected_scores = []

    for chosen_datum, rejected_datum, chosen_logprobs, rejected_logprobs in zip(
        data[::2], data[1::2], logprobs_list[::2], logprobs_list[1::2]
    ):
        chosen_scores.append(
            sequence_logprob(chosen_logprobs, chosen_datum.loss_fn_inputs["weights"])
        )
        rejected_scores.append(
            sequence_logprob(rejected_logprobs, rejected_datum.loss_fn_inputs["weights"])
        )

    margins = torch.stack(chosen_scores) - torch.stack(rejected_scores)
    loss = -F.logsigmoid(margins).mean()
    metrics = {
        "loss": float(loss.detach().cpu()),
        "pair_accuracy": float((margins > 0).float().mean().detach().cpu()),
        "mean_margin": float(margins.mean().detach().cpu()),
    }
    return loss, metrics

Do not convert logprobs to a Python list before computing the loss. That detaches the tensor from autograd and breaks forward_backward_custom().
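
The difference is easy to check in isolation. The sketch below uses a locally created tensor in place of the logprobs MinT would supply, and its _to_float_tensor is a hypothetical stand-in for the recipe's helper of the same name, whose actual implementation is not shown on this page.

```python
import torch


def _to_float_tensor(values):
    """Hypothetical stand-in: convert weights to a 1-D float32 tensor.

    Handles torch tensors and plain sequences only. An SDK-specific
    container would need its own unwrapping, and on GPU the result would
    also have to be moved to the device of the logprobs.
    """
    if isinstance(values, torch.Tensor):
        return values.detach().flatten().float()
    return torch.tensor(list(values), dtype=torch.float32)


logprobs = torch.tensor([-0.5, -1.0, -2.0], requires_grad=True)
weights = [0.0, 1.0, 1.0]

# Tensor path: the score stays connected to the autograd graph.
good = torch.dot(logprobs.flatten().float(), _to_float_tensor(weights))

# List path: tolist() yields plain Python floats, so the graph is lost.
bad = torch.dot(torch.tensor(logprobs.tolist()), _to_float_tensor(weights))

print(good.requires_grad)  # True
print(bad.requires_grad)   # False

good.backward()
print(logprobs.grad)       # gradient of the score w.r.t. each logprob equals its weight
```

Converting the weights to a fresh tensor is harmless because they are constants. Only the logprobs need to keep their graph.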

Training Loop

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

tokenizer = training_client.get_tokenizer()
data = flatten_preference_pairs(PREFERENCE_PAIRS, tokenizer)

for step in range(1, DPO_STEPS + 1):
    result = training_client.forward_backward_custom(
        data,
        pairwise_preference_loss,
    ).result()
    metrics = result.metrics or {}
    training_client.optim_step(types.AdamParams(learning_rate=1e-5)).result()
    print(
        f"Step {step}: loss={metrics['loss']:.6f}, "
        f"pair_accuracy={metrics['pair_accuracy']:.2f}, "
        f"mean_margin={metrics['mean_margin']:.6f}"
    )

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/recipes/dpo_native.py

Verified Run

Verified on MinT with Qwen/Qwen3-0.6B, 4 preference pairs, 3 DPO steps:

Step 1: loss=34.563499, pair_accuracy=0.00, mean_margin=-34.563488
Step 2: loss=34.331955, pair_accuracy=0.00, mean_margin=-34.331944
Step 3: loss=33.277603, pair_accuracy=0.00, mean_margin=-33.277576

Final checkpoint:

tinker://06770ead-184f-4638-824a-21138820dc4f_0/sampler_weights/dpo-native-final

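The tiny sample dataset exists only to verify the API. pair_accuracy=0.00 is a valid result: the base model initially assigns the rejected completions a higher summed logprob on these examples. Because sequence_logprob sums token logprobs without length normalization, a longer completion tends to receive a lower total score, which may contribute here. What the run verifies is that the custom loss is finite, gradients flow, optimizer steps complete, and metrics are returned.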
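
In the log above, the loss is almost exactly the negative of mean_margin. That is expected when every margin is strongly negative: -log σ(m) equals softplus(-m), which approaches -m as m becomes very negative. The pure-Python sketch below checks this for the step-1 margin. The function name neg_log_sigmoid is ours, and it works on a single float rather than a batch.

```python
import math


def neg_log_sigmoid(margin: float) -> float:
    """Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin)."""
    z = -margin
    # softplus(z) = max(z, 0) + log1p(exp(-|z|)) avoids overflow for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


for margin in (-34.563488, 0.0, 5.0):
    print(f"margin={margin:+.6f}  loss={neg_log_sigmoid(margin):.6f}")
```

For the margin of -34.563488 this gives about 34.563488, matching the logged loss to roughly five decimal places. The remaining difference in the last digits is plausibly float32 rounding in the tensor computation. A margin of 0 gives log 2 (about 0.693), which is the loss of a model that cannot tell the two responses apart.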

Parameters Used by This Recipe

Parameter        Default          Meaning
MINT_BASE_MODEL  Qwen/Qwen3-0.6B  Base model to train.
MINT_LORA_RANK   16               LoRA rank.
MINT_DPO_STEPS   3                Number of custom-loss training steps.
MINT_DPO_LR      1e-5             Adam learning rate.
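
The parameter names look like environment variables. Assuming that is how the recipe reads them (this page does not show it), a loader could be sketched as follows. RecipeConfig and load_config are hypothetical names, not part of MinT.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RecipeConfig:
    base_model: str
    lora_rank: int
    dpo_steps: int
    learning_rate: float


def load_config(env: Optional[Mapping[str, str]] = None) -> RecipeConfig:
    """Read the four parameters, falling back to the documented defaults.

    Assumption: the parameters are environment variables. Pass a dict as
    `env` to override os.environ, for example in tests.
    """
    env = os.environ if env is None else env
    return RecipeConfig(
        base_model=env.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B"),
        lora_rank=int(env.get("MINT_LORA_RANK", "16")),
        dpo_steps=int(env.get("MINT_DPO_STEPS", "3")),
        learning_rate=float(env.get("MINT_DPO_LR", "1e-5")),
    )
```

Parsing the values once into a frozen dataclass keeps type conversion errors at startup rather than in the middle of a training run.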
