Mind Lab Toolkit (MinT)
Advanced / Checkpoint

Resume Training from a Checkpoint

This page documents the resume subcommand of advanced/checkpoint.py in mint-quickstart.

For a true training resume, create a fresh LoRA training client with the same model/rank/options, then load the checkpoint with optimizer state:

training_client = service_client.create_lora_training_client(
    base_model=model,
    rank=rank,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
training_client.load_state_with_optimizer(resume_path).result()

This is the shape used by advanced/checkpoint.py resume --with-optimizer. Note that load_state(...) on its own is not a full training resume: it loads weights only and resets optimizer state.

Two resume modes

  • With optimizer: recommended when you want to continue training from the same optimizer state. It uses create_lora_training_client(...) plus load_state_with_optimizer(path) and requires matching MINT_BASE_MODEL, MINT_LORA_RANK, and LoRA options.
  • Weights only: useful when optimizer state does not matter. The script first tries create_training_client_from_state(path) for auto-detection. If the metadata lookup returns 404 for a raw checkpoint path, it falls back to create_lora_training_client(...) plus load_state(path) using MINT_BASE_MODEL / MINT_LORA_RANK (or their defaults).
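The weights-only fallback described above can be sketched as follows. The helper name make_resume_client is hypothetical, service_client is assumed to be an already-constructed MinT service client, the bare except around the 404 metadata lookup is a simplification, and the default model/rank values mirror the example commands on this page:

```python
import os

def make_resume_client(service_client, path):
    """Weights-only resume sketch: prefer auto-detection from checkpoint
    metadata, fall back to an explicitly configured LoRA client."""
    try:
        # Auto-detect base model / rank from the checkpoint's metadata.
        return service_client.create_training_client_from_state(path)
    except Exception:
        # Metadata lookup returned 404 for a raw checkpoint path:
        # build the client from MINT_BASE_MODEL / MINT_LORA_RANK (or defaults).
        model = os.environ.get("MINT_BASE_MODEL", "Qwen/Qwen3-0.6B")
        rank = int(os.environ.get("MINT_LORA_RANK", "16"))
        client = service_client.create_lora_training_client(
            base_model=model, rank=rank
        )
        # Weights only: optimizer state is reset.
        client.load_state(path).result()
        return client
```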

Use the MinT endpoint that matches your region:

  • Mainland China: https://mint-cn.macaron.xin/
  • Outside Mainland China: https://mint.macaron.xin/
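If you prefer to pick the endpoint programmatically, a minimal sketch (the helper name and the in_mainland_china flag are illustrative, not part of the toolkit):

```python
def mint_base_url(in_mainland_china: bool) -> str:
    """Return the MinT endpoint for the caller's region."""
    return (
        "https://mint-cn.macaron.xin/"
        if in_mainland_china
        else "https://mint.macaron.xin/"
    )
```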

Commands

# Preserve optimizer state
export MINT_API_KEY=sk-...
export MINT_BASE_MODEL=Qwen/Qwen3-0.6B
export MINT_LORA_RANK=16
python advanced/checkpoint.py resume tinker://<run-id>/weights/<checkpoint-name> --with-optimizer --steps 3

# Weights only; optimizer resets
export MINT_API_KEY=sk-...
python advanced/checkpoint.py resume tinker://<run-id>/weights/<checkpoint-name>
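Before running the commands above, it can help to confirm the required environment variables are set for the chosen mode. This checker is a hypothetical convenience, not part of the script; the variable names come from the commands above:

```python
import os

def check_resume_env(with_optimizer: bool) -> list:
    """Return the names of environment variables missing for the chosen
    resume mode (sketch)."""
    required = ["MINT_API_KEY"]
    if with_optimizer:
        # --with-optimizer needs the same model/rank the checkpoint used.
        required += ["MINT_BASE_MODEL", "MINT_LORA_RANK"]
    return [name for name in required if not os.environ.get(name)]
```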

Useful flags:

  • --with-optimizer: preserve optimizer state
  • --steps: number of post-resume SFT steps to run
  • --lr: learning rate for those steps
  • --save-name: name of the checkpoint saved after the resume steps finish
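The flags above could be wired with argparse roughly like this; the default values are assumptions for illustration (the --save-name default echoes the checkpoint name in the expected output below), not the script's actual defaults:

```python
import argparse

def build_resume_parser() -> argparse.ArgumentParser:
    """CLI sketch mirroring the documented resume flags."""
    parser = argparse.ArgumentParser(prog="checkpoint.py resume")
    parser.add_argument("path",
                        help="tinker://<run-id>/weights/<checkpoint-name>")
    parser.add_argument("--with-optimizer", action="store_true",
                        help="preserve optimizer state")
    parser.add_argument("--steps", type=int, default=1,
                        help="number of post-resume SFT steps to run")
    parser.add_argument("--lr", type=float, default=1e-4,
                        help="learning rate for those steps")
    parser.add_argument("--save-name", default="resumed-checkpoint",
                        help="name of the checkpoint saved afterwards")
    return parser
```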

Core APIs

# Full training resume: weights + optimizer state
training_client = service_client.create_lora_training_client(
    base_model=model,
    rank=rank,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
training_client.load_state_with_optimizer(resume_path).result()

# Weights-only load: optimizer state resets
training_client = service_client.create_lora_training_client(base_model=model, rank=rank)
training_client.load_state(resume_path).result()

Expected output

[resume] path=tinker://.../weights/my-ckpt-state with_optimizer=True steps=3
[resume] fallback to explicit training client: model=Qwen/Qwen3-0.6B rank=16
[resume] loading state from tinker://.../weights/my-ckpt-state...
[resume] loaded, running 3 SFT step(s)...
[resume] step 1/3 done
[resume] saved: tinker://.../weights/resumed-checkpoint

Common failure cases

  • the checkpoint path is missing or invalid
  • --with-optimizer is used without matching MINT_BASE_MODEL / MINT_LORA_RANK
  • the checkpoint was saved for a different adapter shape than the new client
  • the base model is unavailable for your account
  • load_state(...) is used when you expected optimizer-preserving resume

Related pages

  • Generate a checkpoint to resume from: Save
  • Pull a server-side checkpoint archive to local disk: Download
  • Push a local archive back to MinT: Upload
