VLM
Vision-Language Model (VLM) fine-tuning on MinT is coming soon. The MinT server is rolling out VLM-capable base models; the client SDK paths described here will activate once those models are available on mint.macaron.xin and mint-cn.macaron.xin.
To register interest, contact sales@mindlab.ltd or Schedule a Demo and mention VLM in your request.
What this page will cover
When VLM lands, this page will document the same four sections as the other algorithm pages:
- Configuration — `mint.ServiceClient`, `create_lora_training_client` for VLM-capable base models, image processor selection, image-token budget (see the sketch after this list).
- Prompting Guide — `<image>` placeholder placement in chat-template messages; multi-image and resolution caveats.
- Output Format — how the assistant response is parsed when grounded in an image; bounding-box / region-of-interest extraction for visual QA.
- All Parameters — VLM-specific knobs (vision-encoder freeze, image patch size, max image tokens) layered on top of the standard SFT / RL parameters.
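As a shape preview only, the sketch below shows roughly how the configuration and prompting flow could look once VLM-capable base models are live. `mint.ServiceClient` and `create_lora_training_client` are the entry points this page names; the base-model name, the image-related arguments, and the image-attachment field are placeholders that may not match the final API.

```python
# Illustrative sketch only: VLM support has not shipped yet. Every VLM-specific
# argument below (base_model value, image_processor, max_image_tokens, images)
# is a placeholder and may differ from the final API.
import mint

service_client = mint.ServiceClient()

training_client = service_client.create_lora_training_client(
    base_model="vlm-capable-base-model",  # placeholder: no VLM base models are listed yet
    # image_processor="default",          # assumed knob: image processor selection
    # max_image_tokens=1024,              # assumed knob: image-token budget
)

# Chat-template message with an <image> placeholder, as the Prompting Guide will describe.
messages = [
    {
        "role": "user",
        "content": "<image>\nWhat objects are visible in this picture?",
        # "images": ["photo.jpg"],        # assumed mechanism for attaching the image
    }
]
```

Treat this as a preview of the shape only; the authoritative knobs will appear in the All Parameters section once VLM lands.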
Related reading
While VLM is in flight, the closest shipped pieces are:
- Concepts → Rendering — covers the renderer abstraction that VLM training reuses.
- VLA — vision-language-action (embodied) training, shipped today via the OpenPI integration.