# Checkpoints & Weights
MinT stores LoRA checkpoints on the server. This page covers the full checkpoint lifecycle: saving for inference, resuming training, managing checkpoint metadata, and downloading weights for local deployment or merging.
## Concept
Training produces two types of checkpoints:
- Inference checkpoint (`save_weights_for_sampler`) — LoRA weights optimized for sampling. Use this to create a `SamplingClient` for inference or evaluation.
- Training state (`save_state`) — Full training state including gradients, optimizer moments, and loss history. Use this to resume training from a checkpoint without losing momentum.
Checkpoints are identified by a server-side name and can be listed, assigned a TTL (time-to-live), published to the HuggingFace Hub, or downloaded for local use. The workflow is:

```
Training loop:
  forward_backward() -> optim_step() -> save_weights_for_sampler()
                                     -> save_state()               (for resuming)
                                     -> get_checkpoint_metadata()
                                     -> publish_checkpoint()
```

## Pattern
```python
import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
)
tokenizer = training_client.get_tokenizer()

# Train for a few steps
for step in range(10):
    # Build a batch (simplified)
    text = "Example training text for step {}".format(step)
    tokens = tokenizer.encode(text)
    model_input = types.ModelInput.from_ints(tokens[:-1])
    target_tokens = tokens[1:]
    weights = [1.0] * len(target_tokens)
    datum = types.Datum(
        model_input=model_input,
        loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
    )
    result = training_client.forward_backward([datum], loss_fn="cross_entropy").result()
    adam_params = types.AdamParams(learning_rate=5e-5)
    training_client.optim_step(adam_params).result()

    if step % 5 == 0:
        # Save for inference
        sampling_client = training_client.save_weights_for_sampler(
            name=f"checkpoint-step-{step}"
        ).result()
        print(f"Saved checkpoint at step {step}")
        # Save full state for resuming
        training_client.save_state(name=f"state-step-{step}").result()

# Later, resume from a checkpoint
checkpoint_state = "state-step-5"
resumed_client = service_client.create_lora_training_client_from_state(
    checkpoint_state=checkpoint_state
).result()

# Or create a sampling client from a saved checkpoint
sampling_client = service_client.create_sampling_client_from_checkpoint(
    checkpoint_name="checkpoint-step-5"
).result()
```

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/advanced/checkpoint.py
## API Surface
| Method | Purpose | Returns |
|---|---|---|
| `save_weights_for_sampler(name)` | Save LoRA weights for inference | `SamplingClient` (ready to use) |
| `save_state(name)` | Save full training state for resuming | None |
| `create_lora_training_client_from_state(checkpoint_state)` | Resume training from a state | `TrainingClient` |
| `create_sampling_client_from_checkpoint(checkpoint_name)` | Load a checkpoint for inference | `SamplingClient` |
| `get_checkpoint_metadata(name)` | Query checkpoint size, creation time, TTL | `CheckpointMetadata` |
| `set_checkpoint_ttl(name, ttl_hours)` | Set expiry time for a checkpoint | None |
| `publish_checkpoint(name, hub_id)` | Publish checkpoint to HuggingFace Hub | None |
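The metadata and TTL methods are not exercised in the pattern above. The sketch below shows how they might be combined into a simple retention policy. It uses a local stand-in dataclass rather than the real `CheckpointMetadata` — the field names `name`, `created_at`, and `ttl_hours` are illustrative assumptions, not confirmed API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

@dataclass
class CheckpointMetadata:
    """Local stand-in for the server's metadata record.

    The real object returned by get_checkpoint_metadata() may differ;
    these field names are illustrative assumptions.
    """
    name: str
    created_at: datetime
    ttl_hours: Optional[float]  # None = no expiry set

def checkpoints_to_expire(metas: List[CheckpointMetadata],
                          max_age_hours: float,
                          now: Optional[datetime] = None) -> List[str]:
    """Pick checkpoints older than max_age_hours that still have no TTL.

    In a real run you would call set_checkpoint_ttl(name, ttl_hours)
    on each returned name to reclaim server storage.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [m.name for m in metas
            if m.ttl_hours is None and m.created_at < cutoff]

now = datetime.now(timezone.utc)
metas = [
    CheckpointMetadata("checkpoint-step-0", now - timedelta(hours=48), None),
    CheckpointMetadata("checkpoint-step-5", now - timedelta(hours=1), None),
    CheckpointMetadata("state-step-5", now - timedelta(hours=72), 24.0),
]
print(checkpoints_to_expire(metas, max_age_hours=24))  # -> ['checkpoint-step-0']
```

In a live script, the list of metadata records would come from calling `get_checkpoint_metadata(name)` for each saved checkpoint name.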
## Caveats & Pitfalls
- Sampler desync after weight reload: After saving weights, the old `SamplingClient` continues sampling from the previous weights. Always create a new sampling client from the new checkpoint.
- State vs. weights: `save_state()` is heavier (includes gradients, optimizer moments) but preserves training momentum. `save_weights_for_sampler()` is lightweight but loses optimizer history. Use state for long training runs; weights for final deployment.
- Checkpoint naming: Checkpoint names are user-defined strings. Use descriptive names like `"math-v1-step-100"` to avoid confusion. Names must be unique per checkpoint type (inference vs. state).
- TTL defaults: Saved checkpoints persist by default. Set a TTL to auto-expire old checkpoints and reclaim server storage.
- Hub publishing: `publish_checkpoint()` requires a valid HuggingFace Hub token in the environment. See Deployment: Publish to Hub.
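The sampler-desync caveat can be illustrated with a local toy model. Nothing below is the MinT API — `FakeSampler` and `FakeTrainingClient` are stand-ins — but it shows the essential behavior: a sampler created before an optimizer step keeps serving the weight snapshot it was bound to:

```python
class FakeSampler:
    """Toy sampler bound to the weight snapshot it was created from."""
    def __init__(self, weights):
        self._weights = weights

    def weights_version(self):
        return self._weights["version"]

class FakeTrainingClient:
    """Toy training client; each optim step produces a new weight snapshot."""
    def __init__(self):
        self._weights = {"version": 0}

    def optim_step(self):
        # New snapshot; samplers holding the old one are not updated.
        self._weights = {"version": self._weights["version"] + 1}

    def save_weights_for_sampler(self, name):
        # Like the real call, returns a sampler tied to the current weights.
        return FakeSampler(self._weights)

client = FakeTrainingClient()
old_sampler = client.save_weights_for_sampler("v0")
client.optim_step()
new_sampler = client.save_weights_for_sampler("v1")

# The old sampler still serves the pre-update weights:
print(old_sampler.weights_version(), new_sampler.weights_version())  # -> 0 1
```

The fix mirrors the toy code: after every `save_weights_for_sampler()` call, switch to the `SamplingClient` it returns rather than reusing one created earlier.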