
VLM

Vision-Language Model (VLM) fine-tuning on MinT is coming soon. The MinT server is rolling out VLM-capable base models; the client SDK paths described here will activate once those models are available on mint.macaron.xin and mint-cn.macaron.xin.

To register interest, contact sales@mindlab.ltd or Schedule a Demo and mention VLM in your request.

What this page will cover

When VLM lands, this page will follow the same four-section shape as the other algorithm pages:

  • Configuration — mint.ServiceClient, create_lora_training_client for VLM-capable base models, image-processor selection, image-token budget (a speculative sketch follows this list).
  • Prompting Guide — <image> placeholder placement in chat-template messages; multi-image and resolution caveats.
  • Output Format — how the assistant response is parsed when grounded in an image; bounding-box / region-of-interest extraction for visual QA.
  • All Parameters — VLM-specific knobs (vision-encoder freeze, image patch size, max image tokens) layered on top of the standard SFT / RL parameters.
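
None of these paths are callable yet, but here is a rough sketch of how the Configuration and Prompting Guide pieces may fit together once VLM-capable base models are served. Only mint.ServiceClient and create_lora_training_client come from the list above; the base-model name, the base_model keyword, and the images field are assumptions and may change when this page is finalized.

```python
# Speculative sketch only: VLM fine-tuning is not yet live on MinT. Any name
# below that is not listed on this page (the base-model string, the
# `base_model=` keyword, the `images` field) is an assumption, not a final API.
import mint

# mint.ServiceClient and create_lora_training_client are the entry points named
# in the Configuration bullet above; they are expected to accept VLM-capable
# base models once those are served on mint.macaron.xin / mint-cn.macaron.xin.
service_client = mint.ServiceClient()
training_client = service_client.create_lora_training_client(
    base_model="Qwen2.5-VL-7B-Instruct",  # placeholder name, not yet served
)

# Assumed chat-template shape for the Prompting Guide section: an <image>
# placeholder marks where the image is spliced into the user turn, while the
# image data itself travels in a separate field alongside the messages.
example = {
    "messages": [
        {"role": "user", "content": "<image>\nWhat is the reading on the gauge?"},
        {"role": "assistant", "content": "The gauge reads 42 psi."},
    ],
    "images": ["gauge_photo.jpg"],  # field name and format are assumptions
}
```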

While VLM is in flight, the closest shipped pieces are:

  • Concepts → Rendering — covers the renderer abstraction that VLM training reuses.
  • VLA — vision-language-action embodied training is available today via the OpenPI integration.
