SFT Overview
Supervised fine-tuning (SFT) teaches a language model to match target responses to given prompts. You provide labeled (prompt, response) pairs, and MinT updates the model weights to minimize the prediction loss on the response tokens only — the prompt tokens are masked with zero loss weight.
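The effect of response-only masking can be illustrated with a toy calculation. This is a minimal sketch of a weighted cross-entropy average, not the MinT API; the per-token probabilities are made up:

```python
import math

# Toy per-token probabilities the model assigned to each target token:
# the first two positions are prompt tokens, the last two are response tokens.
token_probs = [0.9, 0.8, 0.5, 0.7]
weights = [0.0, 0.0, 1.0, 1.0]  # prompt masked, response active

# Weighted cross-entropy: prompt tokens contribute zero loss.
losses = [-math.log(p) for p in token_probs]
masked_loss = sum(l * w for l, w in zip(losses, weights)) / sum(weights)
print(f"{masked_loss:.4f}")  # average NLL over the response tokens only
```

However confident the model is on the prompt tokens, only the response positions move the loss, so gradients push the model toward reproducing the target response.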
Configuration
SFT requires a ServiceClient, a LoRA training client, and an optimizer configuration. Here is a minimal setup:
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=5e-5)
```

Environment variables:

- `MINT_API_KEY`: Your MinT API key (required). Request one from macaron.im/mindlab.
- `MINT_BASE_URL`: MinT server endpoint. Default: `https://mint.macaron.xin/` (mainland China: `mint-cn.macaron.xin`).
- `MINT_BASE_MODEL`: Base model name. Default: `Qwen/Qwen3-0.6B`.
- `MINT_LORA_RANK`: LoRA rank. Default: `16`.
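As a hedged sketch (the MinT client presumably reads these variables itself), a script can mirror the documented defaults when it wants them visible locally; the lookup logic below is an illustration, not the package's internal behavior:

```python
import os

# Mirror the documented environment-variable defaults; how the mint
# package itself resolves these is an assumption here.
base_model = os.environ.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B")
lora_rank = int(os.environ.get("MINT_LORA_RANK", "16"))
base_url = os.environ.get("MINT_BASE_URL", "https://mint.macaron.xin/")
print(base_model, lora_rank, base_url)
```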
All training happens on the remote MinT server. Your Python script calls forward_backward(...) and blocks on .result() until the batch finishes.
Prompting Guide
SFT trains the model on the response portion only. You build Datum objects from (prompt, response) pairs by:
- Encoding the prompt and the response separately.
- Concatenating the token IDs.
- Zeroing loss weights on prompt tokens and setting them to `1.0` on response tokens.
Here is the canonical pattern from quickstart.py:

```python
import re

from mint import types

def process_sft_example(ex: dict, tokenizer) -> types.Datum:
    # ex = {"question": "What is 3 * 4?"}
    a, b = map(int, re.findall(r"\d+", ex["question"]))
    answer = str(a * b)
    prompt = f"Question: {ex['question']}\nAnswer:"
    completion = f" {answer}"
    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    completion_tokens = tokenizer.encode(completion, add_special_tokens=False)
    completion_tokens.append(tokenizer.eos_token_id)
    all_tokens = prompt_tokens + completion_tokens
    all_weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)
    # Shift for teacher forcing: input = all_tokens[:-1], target = all_tokens[1:]
    input_tokens = all_tokens[:-1]
    target_tokens = all_tokens[1:]
    weights = all_weights[1:]
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
    )
```

Key points:

- The prompt is always masked (`loss_weight=0.0`), so gradients do not flow through it.
- The response is always active (`loss_weight=1.0`), so the model learns to predict the target tokens.
- Use your tokenizer's `apply_chat_template(...)` method for chat-style instruction tuning (see Rendering).
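For chat data, a helper along the same lines might look like this. `apply_chat_template` is the standard Hugging Face tokenizer method; the helper name and its exact signature are illustrative, not part of MinT:

```python
def build_chat_tokens(messages: list, answer: str, tokenizer):
    """Return (input_tokens, target_tokens, weights) for one chat example.

    Hypothetical helper: assumes a Hugging Face tokenizer whose chat
    template matches the base model.
    """
    # Prompt = the rendered conversation plus the assistant generation header.
    prompt_tokens = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )
    completion_tokens = tokenizer.encode(answer, add_special_tokens=False)
    completion_tokens.append(tokenizer.eos_token_id)
    all_tokens = prompt_tokens + completion_tokens
    all_weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)
    # Same shift as process_sft_example: predict token t+1 from tokens <= t.
    return all_tokens[:-1], all_tokens[1:], all_weights[1:]
```

The three returned lists slot into a `types.Datum` exactly as in `process_sft_example` above.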
Output Format
SFT does not directly produce text — it updates the model weights. After each forward_backward step, you get back a ForwardBackwardResult containing:
- `loss`: The weighted cross-entropy loss (scalar).
- `loss_fn_outputs`: Per-datum metadata (logprobs, etc.).
To generate text from your trained model, you must first save the LoRA weights and create a sampling client:
```python
# After SFT training completes:
checkpoint = training_client.save_weights_and_get_sampling_client(name="my-sft-v1")

prompt_ids = tokenizer.encode("3 * 7 =")
samples = checkpoint.sample(
    prompt=types.ModelInput.from_ints(prompt_ids),
    sampling_params=types.SamplingParams(max_tokens=16, temperature=0.7),
    num_samples=4,
)
for seq in samples.sequences:
    print(tokenizer.decode(seq.tokens))
```

For systematic evaluation (hold-out test sets, benchmark metrics, sampling logs), see Concepts → Evaluations.
All Parameters
| Parameter | Type | Default | Meaning |
|---|---|---|---|
base_model | str | "Qwen/Qwen3-0.6B" | Hugging Face model ID for the base model. |
rank | int | 16 | LoRA rank. Higher = more expressive, higher memory. Typical: 8–64. |
train_mlp | bool | True | Update the MLP (feed-forward) layers. |
train_attn | bool | True | Update the attention layers. |
train_unembed | bool | True | Update the unembedding (output) layer. |
loss_fn | str | "cross_entropy" | Loss function. SFT always uses "cross_entropy"; no alternatives. |
learning_rate | float | 5e-5 | Adam learning rate. Typical: 1e-5 to 1e-4 for instruction tuning. |
betas | tuple[float, float] | (0.9, 0.999) | Adam exponential decay rates for 1st and 2nd moments. |
eps | float | 1e-8 | Adam numerical stability term. |
weight_decay | float | 0.0 | L2 regularization coefficient. Typical: 0.0–0.01 for LoRA. |
Usage:

```python
adam_params = types.AdamParams(
    learning_rate=5e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,
)

for step, batch in enumerate(batches):
    result = training_client.forward_backward(batch, loss_fn="cross_entropy").result()
    training_client.optim_step(adam_params).result()
    print(f"Step {step}: loss={result.loss:.4f}")
```

Tinker compatibility notes:

- Do not call `zero_grad_async()`; gradient zeroing is automatic on the MinT server.
- The `loss_fn` parameter is passed to `forward_backward(...)`, not stored in `AdamParams`.
- `save_weights_for_sampler(...)` and `save_weights_and_get_sampling_client(...)` have the same semantics: both serialize the LoRA weights.
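The training loop above consumes batches of `Datum` objects. A minimal batching sketch follows; the helper name and the shuffle-and-drop-remainder policy are assumptions for illustration, not MinT API:

```python
import random

def make_batches(data: list, batch_size: int):
    """Yield shuffled fixed-size batches; a trailing partial batch is dropped."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    for start in range(0, len(idx) - batch_size + 1, batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]
```

Each element of `data` would be a `types.Datum` built by `process_sft_example`, and each yielded batch is what `forward_backward(batch, ...)` consumes.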