Resuming Training from a Checkpoint

This page corresponds to advanced/checkpoint.py resume in mint-quickstart.

Two resume modes

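Resume with optimizer state: recommended when you are genuinely continuing a training run. It uses create_lora_training_client(...) followed by load_state_with_optimizer(path), and requires MINT_BASE_MODEL, MINT_LORA_RANK, and the LoRA options to match the values used when the checkpoint was saved.
Weights only: suitable when you do not need the optimizer state. The script first tries create_training_client_from_state(path), which detects the model and rank automatically; if the metadata lookup for the original checkpoint path returns 404, it falls back to create_lora_training_client(...) + load_state(path), using MINT_BASE_MODEL / MINT_LORA_RANK (or their defaults).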
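
The weights-only fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: resume_weights_only and the is_not_found predicate are hypothetical names, create_training_client_from_state is assumed to return a ready training client, and because the way a 404 surfaces (exception type, status attribute) is not specified here, the caller supplies that check.

```python
from typing import Any, Callable


def resume_weights_only(
    service_client: Any,
    path: str,
    base_model: str,
    rank: int,
    is_not_found: Callable[[Exception], bool],
) -> Any:
    """Return a training client with weights loaded from `path`.

    The optimizer state is NOT restored in either branch.
    """
    try:
        # Preferred path: model and rank are detected from checkpoint metadata.
        return service_client.create_training_client_from_state(path)
    except Exception as exc:
        # Assumption: the caller knows how a 404 is reported by the SDK.
        if not is_not_found(exc):
            raise

    # Fallback: build the client explicitly, then load only the weights.
    training_client = service_client.create_lora_training_client(
        base_model=base_model,
        rank=rank,
    )
    training_client.load_state(path).result()
    return training_client
```

Passing the 404 check in as a predicate keeps the sketch independent of any particular exception class; any other error is re-raised instead of silently triggering the fallback.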

Choose the MinT domain for your region:

Mainland China: https://mint-cn.macaron.xin/
Outside mainland China: https://mint.macaron.xin/

Commands

# Resume with optimizer state
export MINT_API_KEY=sk-...
export MINT_BASE_MODEL=Qwen/Qwen3-0.6B
export MINT_LORA_RANK=16
python advanced/checkpoint.py resume tinker://<run-id>/weights/<checkpoint-name> --with-optimizer --steps 3

# Weights only; the optimizer is reset
export MINT_API_KEY=sk-...
python advanced/checkpoint.py resume tinker://<run-id>/weights/<checkpoint-name>

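Common flags:

--with-optimizer: restore the optimizer state along with the weights
--steps: number of SFT steps to run after resuming
--lr: learning rate used for those steps
--save-name: name of the new checkpoint saved once the resumed steps finish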
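
As a rough picture of how these flags fit together, here is a minimal argparse sketch. It is not taken from advanced/checkpoint.py: build_parser is a hypothetical helper, and every default value shown is an assumption chosen only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags listed above (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="checkpoint.py")
    sub = parser.add_subparsers(dest="command", required=True)

    resume = sub.add_parser("resume", help="resume training from a checkpoint")
    resume.add_argument("path", help="checkpoint path to resume from")
    resume.add_argument(
        "--with-optimizer",
        action="store_true",
        help="restore the optimizer state along with the weights",
    )
    # The defaults below are hypothetical, not the script's real defaults.
    resume.add_argument("--steps", type=int, default=1, help="SFT steps to run")
    resume.add_argument("--lr", type=float, default=1e-4, help="learning rate")
    resume.add_argument("--save-name", default=None, help="new checkpoint name")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Modelling --with-optimizer as a store_true switch matches the commands above, where the weights-only invocation simply omits the flag.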

Core API

# Full resume: weights + optimizer state
training_client = service_client.create_lora_training_client(
    base_model=model,
    rank=rank,
    train_mlp=True,
    train_attn=True,
    train_unembed=True,
)
training_client.load_state_with_optimizer(resume_path).result()

# Load weights only: the optimizer state is reset
training_client = service_client.create_lora_training_client(base_model=model, rank=rank)
training_client.load_state(resume_path).result()
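
The choice between the two load calls above could be wrapped in a single helper, as in the sketch below. load_checkpoint is a hypothetical name; the sketch assumes only the two methods shown above and that each returns an object with a blocking .result().

```python
from typing import Any


def load_checkpoint(training_client: Any, resume_path: str, with_optimizer: bool) -> None:
    """Load a checkpoint into an existing training client and wait for completion."""
    if with_optimizer:
        # Full resume: weights and optimizer state.
        future = training_client.load_state_with_optimizer(resume_path)
    else:
        # Weights only: the optimizer starts from a fresh state.
        future = training_client.load_state(resume_path)
    # Block until the load has finished before running any further steps.
    future.result()
```

Keeping the decision in one place makes it harder to call load_state(...) by accident when a full resume was intended.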

Expected output

[resume] path=tinker://.../weights/my-ckpt-state with_optimizer=True steps=3
[resume] fallback to explicit training client: model=Qwen/Qwen3-0.6B rank=16
[resume] loading state from tinker://.../weights/my-ckpt-state...
[resume] loaded, running 3 SFT step(s)...
[resume] step 1/3 done
[resume] saved: tinker://.../weights/resumed-checkpoint

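Common failures

The checkpoint path does not exist or is invalid
--with-optimizer is used without a matching MINT_BASE_MODEL / MINT_LORA_RANK
The checkpoint's adapter shapes do not match the new client
The base model is not available to the current account
load_state(...) was used when a full resume that preserves the optimizer state was intended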
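
Some of these failures can be caught locally before contacting the service. The sketch below is a hypothetical preflight check, not part of the script: preflight_errors is an invented name, and it assumes every valid path follows the tinker://<run-id>/weights/<checkpoint-name> shape used in the commands above. It cannot detect a missing checkpoint, an adapter shape mismatch, or an unavailable base model; those only show up when the service is called.

```python
from typing import List, Mapping


def preflight_errors(path: str, with_optimizer: bool, env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found before any API call is made."""
    errors: List[str] = []

    # Assumption: checkpoint paths look like tinker://<run-id>/weights/<name>.
    if not path.startswith("tinker://") or "/weights/" not in path:
        errors.append(f"unexpected checkpoint path format: {path!r}")

    if with_optimizer:
        # A full resume needs the same model and rank as the saved run.
        for name in ("MINT_BASE_MODEL", "MINT_LORA_RANK"):
            if not env.get(name):
                errors.append(f"{name} must be set when using --with-optimizer")
        rank = env.get("MINT_LORA_RANK", "")
        if rank and not rank.isdigit():
            errors.append(f"MINT_LORA_RANK must be a positive integer, got {rank!r}")

    return errors
```

In practice the env argument would be os.environ; taking it as a parameter keeps the check easy to exercise in isolation.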
