Mind Lab Toolkit (MinT)

Checkpoints & Weights

MinT stores LoRA checkpoints on the server. This page covers the full checkpoint lifecycle: saving for inference, resuming training, managing checkpoint metadata, and downloading weights for local deployment or merging.

Concept

Training produces two types of checkpoints:

  • Inference checkpoint (save_weights_for_sampler) — LoRA weights optimized for sampling. Use this to create a SamplingClient for inference or evaluation.
  • Training state (save_state) — Full training state including gradients, optimizer moments, and loss history. Use this to resume training from a checkpoint without losing momentum.
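The two checkpoint types correspond to two calls on the training client. A minimal sketch, assuming a training client created as in the Pattern below; the step-based naming scheme is illustrative, not required:

```python
def checkpoint_both(training_client, step):
    """Save an inference checkpoint and the full training state at `step`."""
    # LoRA weights only: lightweight, immediately usable for sampling.
    sampler = training_client.save_weights_for_sampler(
        name=f"checkpoint-step-{step}"
    ).result()
    # Full state (optimizer moments, loss history): heavier, but lets
    # training resume later without losing momentum.
    training_client.save_state(name=f"state-step-{step}").result()
    return sampler
```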

Checkpoints are identified by a server-side name and can be listed, assigned a TTL (time-to-live), published to the HuggingFace Hub, or downloaded for local use. The workflow is:

Training loop:
  forward_backward() -> optim_step() -> save_weights_for_sampler()
                                     -> save_state() (for resuming)
                                     -> get_checkpoint_metadata()
                                     -> publish_checkpoint()

Pattern

import mint
from mint import types

service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-0.6B",
    rank=16,
)
tokenizer = training_client.get_tokenizer()

# Train for a few steps
for step in range(10):
    # Build a batch (simplified)
    text = "Example training text for step {}".format(step)
    tokens = tokenizer.encode(text)
    model_input = types.ModelInput.from_ints(tokens[:-1])
    target_tokens = tokens[1:]
    weights = [1.0] * len(target_tokens)
    
    datum = types.Datum(
        model_input=model_input,
        loss_fn_inputs={"target_tokens": target_tokens, "weights": weights},
    )
    
    training_client.forward_backward([datum], loss_fn="cross_entropy").result()
    adam_params = types.AdamParams(learning_rate=5e-5)
    training_client.optim_step(adam_params).result()
    
    if step % 5 == 0:
        # Save for inference
        sampling_client = training_client.save_weights_for_sampler(
            name=f"checkpoint-step-{step}"
        ).result()
        print(f"Saved checkpoint at step {step}")
        
        # Save full state for resuming
        training_client.save_state(name=f"state-step-{step}").result()

# Later, resume from a checkpoint
checkpoint_state = "state-step-5"
resumed_client = service_client.create_lora_training_client_from_state(
    checkpoint_state=checkpoint_state
).result()

# Or create a sampling client from a saved checkpoint
sampling_client = service_client.create_sampling_client_from_checkpoint(
    checkpoint_name="checkpoint-step-5"
).result()

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/advanced/checkpoint.py

API Surface

Method                                                   | Purpose                                   | Returns
save_weights_for_sampler(name)                           | Save LoRA weights for inference           | SamplingClient (ready to use)
save_state(name)                                         | Save full training state for resuming     | None
create_lora_training_client_from_state(checkpoint_state) | Resume training from a saved state        | TrainingClient
create_sampling_client_from_checkpoint(checkpoint_name)  | Load a checkpoint for inference           | SamplingClient
get_checkpoint_metadata(name)                            | Query checkpoint size, creation time, TTL | CheckpointMetadata
set_checkpoint_ttl(name, ttl_hours)                      | Set expiry time for a checkpoint          | None
publish_checkpoint(name, hub_id)                         | Publish checkpoint to HuggingFace Hub     | None
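The three lifecycle methods at the bottom of the table compose into a simple housekeeping routine. A hedged sketch — it assumes these methods live on the service client and that CheckpointMetadata exposes size and creation-time attributes, which may differ in the actual API:

```python
def housekeep_checkpoint(service_client, name, ttl_hours=72, hub_id=None):
    """Inspect a checkpoint, cap its lifetime, and optionally publish it."""
    # Query metadata (the attribute names read by callers are illustrative).
    meta = service_client.get_checkpoint_metadata(name)

    # Checkpoints persist by default; setting a TTL auto-expires them
    # and reclaims server storage.
    service_client.set_checkpoint_ttl(name, ttl_hours=ttl_hours)

    # Publishing requires a valid HuggingFace Hub token in the environment.
    if hub_id is not None:
        service_client.publish_checkpoint(name, hub_id=hub_id)
    return meta
```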

Caveats & Pitfalls

  • Sampler desync after weight reload: After saving weights, the old SamplingClient continues sampling from the previous weights. Always create a new sampling client from the new checkpoint.
  • State vs. weights: save_state() is heavier (includes gradients, optimizer moments) but preserves training momentum. save_weights_for_sampler() is lightweight but loses optimizer history. Use state for long training runs; weights for final deployment.
  • Checkpoint naming: Checkpoint names are user-defined strings. Use descriptive names like "math-v1-step-100" to avoid confusion. Names must be unique per checkpoint type (inference vs. state).
  • TTL defaults: Saved checkpoints persist by default. Set a TTL to auto-expire old checkpoints and reclaim server storage.
  • Hub publishing: publish_checkpoint() requires a valid HuggingFace Hub token in the environment. See Deployment: Publish to Hub.
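The first caveat suggests a pattern worth making habitual: treat the sampling client as disposable and rebind it on every save. A minimal sketch; the run-name convention is illustrative:

```python
def save_and_rebind(training_client, run_name, step):
    """Save weights and return a fresh SamplingClient bound to them.

    Any previously held sampler keeps serving the old weights, so the
    caller should replace it with the returned client.
    """
    name = f"{run_name}-step-{step}"  # descriptive and unique per run
    return training_client.save_weights_for_sampler(name=name).result()
```

For example, `sampler = save_and_rebind(training_client, "math-v1", 100)` replaces any stale sampler held from an earlier step.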
