Mind Lab Toolkit (MinT)
Customize SFT

SFT Overview

Supervised fine-tuning (SFT) teaches a language model to match target responses to given prompts. You provide labeled (prompt, response) pairs, and MinT updates the model weights to minimize the prediction loss on the response tokens only — the prompt tokens are masked with zero loss weight.
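The masking can be pictured with a toy weighted cross-entropy over per-token probabilities. This is a minimal sketch in plain Python, not the MinT API: prompt tokens carry weight 0 and drop out of the loss, response tokens carry weight 1.

```python
import math

def weighted_cross_entropy(token_probs, weights):
    """Toy weighted cross-entropy: each token contributes -w * log(p),
    normalized by the total active weight."""
    losses = [-w * math.log(p) for p, w in zip(token_probs, weights)]
    return sum(losses) / max(sum(weights), 1)

# Four prompt tokens (masked) followed by two response tokens (active):
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.25]
weights = [0, 0, 0, 0, 1, 1]
loss = weighted_cross_entropy(probs, weights)
# Only the last two probabilities contribute to the loss.
```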

Configuration

SFT requires a ServiceClient, a LoRA training client, and an optimizer configuration. Here is a minimal setup:

import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)

tokenizer = training_client.get_tokenizer()
adam_params = types.AdamParams(learning_rate=5e-5)

Environment variables:

  • MINT_API_KEY: Your MinT API key (required). Request one from macaron.im/mindlab.
  • MINT_BASE_URL: MinT server endpoint. Default: https://mint.macaron.xin/ (mainland China: mint-cn.macaron.xin).
  • MINT_BASE_MODEL: Base model name. Default: Qwen/Qwen3-0.6B.
  • MINT_LORA_RANK: LoRA rank. Default: 16.

All training happens on the remote MinT server. Your Python script calls forward_backward(...) and blocks on .result() until the batch finishes.

Prompting Guide

SFT trains the model on the response portion only. You build Datum objects from (prompt, response) pairs by:

  1. Encoding both the prompt and response separately.
  2. Concatenating the token IDs.
  3. Zeroing loss weights on prompt tokens and setting them to 1.0 on response tokens.

Here is the canonical pattern from quickstart.py:

import re

def process_sft_example(ex: dict, tokenizer) -> types.Datum:
    # ex = {"question": "What is 3 * 4?"}
    a, b = map(int, re.findall(r"\d+", ex["question"]))
    answer = str(a * b)
    prompt = f"Question: {ex['question']}\nAnswer:"
    completion = f" {answer}"

    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    completion_tokens = tokenizer.encode(completion, add_special_tokens=False)
    completion_tokens.append(tokenizer.eos_token_id)

    all_tokens = prompt_tokens + completion_tokens
    all_weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)

    # Shift for teacher-forcing: input = all_tokens[:-1], target = all_tokens[1:]
    input_tokens = all_tokens[:-1]
    target_tokens = all_tokens[1:]
    weights = all_weights[1:]

    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
    )
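The shift-and-mask arithmetic can be sanity-checked without the MinT types by replaying the same construction with plain lists. This stand-in uses hard-coded token IDs instead of a real tokenizer:

```python
def build_masked_example(prompt_tokens, completion_tokens, eos_id):
    """Same construction as process_sft_example, but returning plain lists:
    append EOS, concatenate, build 0/1 weights, then shift by one for
    teacher-forcing."""
    completion = completion_tokens + [eos_id]
    all_tokens = prompt_tokens + completion
    all_weights = [0] * len(prompt_tokens) + [1] * len(completion)
    return {
        "input_tokens": all_tokens[:-1],
        "target_tokens": all_tokens[1:],
        "weights": all_weights[1:],
    }

ex = build_masked_example(prompt_tokens=[10, 11, 12], completion_tokens=[20, 21], eos_id=2)
# Invariants: inputs, targets, and weights share one length, and the first
# active weight lands on the first completion token.
assert len(ex["input_tokens"]) == len(ex["target_tokens"]) == len(ex["weights"])
```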

Key points:

  • The prompt is always masked (loss_weight=0.0) so gradients do not flow through it.
  • The response is always active (loss_weight=1.0) so the model learns to predict the target tokens.
  • Use your tokenizer's apply_chat_template(...) method for chat-style instruction tuning (see Rendering).
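For chat-style data the same masking idea applies: everything the template renders before the assistant turn is masked, and the assistant's reply is active. The template below is a simplified stand-in for illustration only, not the real Qwen chat template that `apply_chat_template(...)` would produce:

```python
def render_chat(messages):
    """Render messages into one training string and report the character
    offset where the assistant response begins, so everything before it
    can be loss-masked (weight 0) and everything after kept active (weight 1)."""
    rendered = ""
    response_start = None
    for msg in messages:
        header = f"<|{msg['role']}|>\n"
        if msg["role"] == "assistant" and response_start is None:
            response_start = len(rendered) + len(header)
        rendered += header + msg["content"] + "\n"
    return rendered, response_start

messages = [
    {"role": "user", "content": "What is 3 * 4?"},
    {"role": "assistant", "content": "12"},
]
text, start = render_chat(messages)
# text[:start] is the prompt portion (weight 0); text[start:] is the response (weight 1).
```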

Output Format

SFT does not directly produce text — it updates the model weights. After each forward_backward step, you get back a ForwardBackwardResult containing:

  • loss: The weighted cross-entropy loss (scalar).
  • loss_fn_outputs: Per-datum metadata (logprobs, etc.).

To generate text from your trained model, you must first save the LoRA weights and create a sampling client:

# After SFT training completes:
checkpoint = training_client.save_weights_and_get_sampling_client(name="my-sft-v1")

prompt_ids = tokenizer.encode("3 * 7 =")
samples = checkpoint.sample(
    prompt=types.ModelInput.from_ints(prompt_ids),
    sampling_params=types.SamplingParams(max_tokens=16, temperature=0.7),
    num_samples=4,
)

for seq in samples.sequences:
    print(tokenizer.decode(seq.tokens))

For systematic evaluation — hold-out test sets, benchmark metrics, sampling logs — see Concepts → Evaluations.
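As a minimal illustration of the kind of hold-out check the Evaluations page covers, here is an exact-match accuracy sketch over already-decoded samples. The decoded strings are stubbed in; in practice they would come from `tokenizer.decode(seq.tokens)` as above.

```python
def exact_match_accuracy(decoded_samples, reference):
    """Fraction of sampled completions whose stripped text equals the reference."""
    hits = sum(1 for s in decoded_samples if s.strip() == reference)
    return hits / len(decoded_samples)

# Stubbed decoded outputs for the prompt "3 * 7 =":
samples = [" 21", "21", " 20", " 21\n"]
acc = exact_match_accuracy(samples, "21")  # 3 of 4 match after stripping
```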

All Parameters

Parameter      Type                 Default            Meaning
base_model     str                  "Qwen/Qwen3-0.6B"  Hugging Face model ID for the base model.
rank           int                  16                 LoRA rank. Higher = more expressive, higher memory. Typical: 8–64.
train_mlp      bool                 True               Update the MLP (feed-forward) layers.
train_attn     bool                 True               Update the attention layers.
train_unembed  bool                 True               Update the unembedding (output) layer.
loss_fn        str                  "cross_entropy"    Loss function. SFT always uses "cross_entropy"; no alternatives.
learning_rate  float                5e-5               Adam learning rate. Typical: 1e-5 to 1e-4 for instruction tuning.
betas          tuple[float, float]  (0.9, 0.999)       Adam exponential decay rates for the 1st and 2nd moments.
eps            float                1e-8               Adam numerical stability term.
weight_decay   float                0.0                L2 regularization coefficient. Typical: 0.0–0.01 for LoRA.

Usage:

adam_params = types.AdamParams(
    learning_rate=5e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,
)

for step, batch in enumerate(batches):
    result = training_client.forward_backward(batch, loss_fn="cross_entropy").result()
    training_client.optim_step(adam_params).result()
    print(f"Step {step}: loss={result.loss:.4f}")
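The loop above iterates over pre-built batches of Datum objects. A simple chunking helper (an illustration, not part of the MinT API) might look like:

```python
def make_batches(data, batch_size):
    """Split a list of Datum objects into fixed-size batches;
    the last batch may be shorter than batch_size."""
    return [data[i : i + batch_size] for i in range(0, len(data), batch_size)]

# e.g. 10 examples with batch_size=4 -> batches of sizes 4, 4, 2
batches = make_batches(list(range(10)), 4)
```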

Tinker compatibility notes:

  • Do not call zero_grad_async() — gradient zeroing is automatic on the MinT server.
  • The loss_fn parameter is passed to forward_backward(...), not stored in AdamParams.
  • save_weights_for_sampler(...) and save_weights_and_get_sampling_client(...) have the same semantics: both serialize the LoRA weights.
