SFT Overview
Supervised fine-tuning (SFT) teaches a language model to match target responses to given prompts. You provide labeled (prompt, response) pairs, and MinT updates the model weights to minimize the prediction loss on the response tokens only — the prompt tokens are masked with zero loss weight.
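The effect of response-only masking can be illustrated with a toy calculation. This is a minimal sketch of a weighted cross-entropy average, not the MinT API; the per-token probabilities are made up:

```python
import math

# Toy per-token probabilities the model assigned to each target token:
# the first two positions are prompt tokens, the last two are response tokens.
token_probs = [0.9, 0.8, 0.5, 0.7]
weights = [0.0, 0.0, 1.0, 1.0]  # prompt masked, response active

# Weighted cross-entropy: prompt tokens contribute zero loss.
losses = [-math.log(p) for p in token_probs]
masked_loss = sum(l * w for l, w in zip(losses, weights)) / sum(weights)
print(f"{masked_loss:.4f}")  # average NLL over the response tokens only
```

However confident the model is on the prompt tokens, only the response positions move the loss, so gradients push the model toward reproducing the target response.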
Configuration
SFT requires a ServiceClient, a LoRA training client, and an optimizer configuration. Here is a minimal setup:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=5e-5)
```

Environment variables:

- `MINT_API_KEY`: Your MinT API key (required). Request one from macaron.im/mindlab.
- `MINT_BASE_URL`: MinT server endpoint. Default: `https://mint.macaron.xin/` (mainland China: `mint-cn.macaron.xin`).
- `MINT_BASE_MODEL`: Base model name. Default: `Qwen/Qwen3-0.6B`.
- `MINT_LORA_RANK`: LoRA rank. Default: `16`.
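As a hedged sketch (the MinT client presumably reads these variables itself), a script can mirror the documented defaults when it wants them visible locally; the lookup logic below is an illustration, not the package's internal behavior:

```python
import os

# Mirror the documented environment-variable defaults; how the mint
# package itself resolves these is an assumption here.
base_model = os.environ.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B")
lora_rank = int(os.environ.get("MINT_LORA_RANK", "16"))
base_url = os.environ.get("MINT_BASE_URL", "https://mint.macaron.xin/")
print(base_model, lora_rank, base_url)
```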
All training happens on the remote MinT server. Your Python script calls forward_backward(...) and blocks on .result() until the batch finishes.
Prompting Guide
SFT trains the model on the response portion only. You build Datum objects from (prompt, response) pairs by:
- Encoding the prompt and the response separately.
- Concatenating the token IDs.
- Zeroing loss weights on prompt tokens and setting them to `1.0` on response tokens.
Here is the canonical pattern from quickstart.py:

```python
import re

from mint import types

def process_sft_example(ex: dict, tokenizer) -> types.Datum:
    # ex = {"question": "What is 3 * 4?"}
    a, b = map(int, re.findall(r"\d+", ex["question"]))
    answer = str(a * b)
    prompt = f"Question: {ex['question']}\nAnswer:"
    completion = f" {answer}"
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    completion_tokens = tokenizer.encode(completion, add_special_tokens=False)
    completion_tokens.append(tokenizer.eos_token_id)
    all_tokens = prompt_tokens + completion_tokens
    all_weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)
    # Shift for teacher forcing: input = all_tokens[:-1], target = all_tokens[1:]
    input_tokens = all_tokens[:-1]
    target_tokens = all_tokens[1:]
    weights = all_weights[1:]
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
    )
```

Key points:

- The prompt is always masked (`loss_weight=0.0`), so gradients do not flow through it.
- The response is always active (`loss_weight=1.0`), so the model learns to predict the target tokens.
- Use your tokenizer's `apply_chat_template(...)` method for chat-style instruction tuning (see Rendering).
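For chat data, a helper along the same lines might look like this. `apply_chat_template` is the standard Hugging Face tokenizer method; the helper name and its exact signature are illustrative, not part of MinT:

```python
def build_chat_tokens(messages: list, answer: str, tokenizer):
    """Return (input_tokens, target_tokens, weights) for one chat example.

    Hypothetical helper: assumes a Hugging Face tokenizer whose chat
    template matches the base model.
    """
    # Prompt = the rendered conversation plus the assistant generation header.
    prompt_tokens = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )
    completion_tokens = tokenizer.encode(answer, add_special_tokens=False)
    completion_tokens.append(tokenizer.eos_token_id)
    all_tokens = prompt_tokens + completion_tokens
    all_weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)
    # Same shift as process_sft_example: predict token t+1 from tokens <= t.
    return all_tokens[:-1], all_tokens[1:], all_weights[1:]
```

The three returned lists slot into a `types.Datum` exactly as in `process_sft_example` above.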
Output Format
SFT does not directly produce text — it updates the model weights. After each forward_backward step, you get back a ForwardBackwardResult containing:
- `loss`: The weighted cross-entropy loss (scalar).
- `loss_fn_outputs`: Per-datum metadata (logprobs, etc.).
To generate text from your trained model, you must first save the LoRA weights and create a sampling client:
```python
# After SFT training completes:
checkpoint = training_client.save_weights_and_get_sampling_client(name="my-sft-v1")

prompt_ids = tokenizer.encode("3 * 7 =")
samples = checkpoint.sample(
    prompt=types.ModelInput.from_ints(prompt_ids),
    sampling_params=types.SamplingParams(max_tokens=16, temperature=0.7),
    num_samples=4,
)
for seq in samples.sequences:
    print(tokenizer.decode(seq.tokens))
```

For systematic evaluation (hold-out test sets, benchmark metrics, sampling logs), see Concepts → Evaluations.
All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
base_model | str | "Qwen/Qwen3-0.6B" | Hugging Face model ID for the base model. |
rank | int | 16 | LoRA rank. Higher = more expressive, higher memory. Typical: 8–64. |
train_mlp | bool | True | Update the MLP (feed-forward) layers. |
train_attn | bool | True | Update the attention layers. |
train_unembed | bool | True | Update the unembedding (output) layer. |
loss_fn | str | "cross_entropy" | Loss function. SFT always uses "cross_entropy"; no alternatives. |
learning_rate | float | 5e-5 | Adam learning rate. Typical: 1e-5 to 1e-4 for instruction tuning. |
betas | tuple[float, float] | (0.9, 0.999) | Adam exponential decay rates for 1st and 2nd moments. |
eps | float | 1e-8 | Adam numerical stability term. |
weight_decay | float | 0.0 | L2 regularization coefficient. Typical: 0.0–0.01 for LoRA. |
Usage:

```python
adam_params = types.AdamParams(
    learning_rate=5e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,
)

for step, batch in enumerate(batches):
    result = training_client.forward_backward(batch, loss_fn="cross_entropy").result()
    training_client.optim_step(adam_params).result()
    print(f"Step {step}: loss={result.loss:.4f}")
```

Tinker compatibility notes:

- Do not call `zero_grad_async()`; gradient zeroing is automatic on the MinT server.
- The `loss_fn` parameter is passed to `forward_backward(...)`, not stored in `AdamParams`.
- `save_weights_for_sampler(...)` and `save_weights_and_get_sampling_client(...)` have the same semantics: both serialize the LoRA weights.
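The training loop above consumes batches of `Datum` objects. A minimal batching sketch follows; the helper name and the shuffle-and-drop-remainder policy are assumptions for illustration, not MinT API:

```python
import random

def make_batches(data: list, batch_size: int):
    """Yield shuffled fixed-size batches; a trailing partial batch is dropped."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    for start in range(0, len(idx) - batch_size + 1, batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]
```

Each element of `data` would be a `types.Datum` built by `process_sft_example`, and each yielded batch is what `forward_backward(batch, ...)` consumes.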