# VLA
Vision-Language-Action (VLA) models take images and language instructions as input and output robot actions. MinT supports OpenPI-compatible VLA training via two paths: a Python SDK for convenience, and direct HTTP for integration with non-Python systems.

The canonical implementations are `demos/embodied/openpi_vla_sdk.py` (SDK) and `demos/embodied/openpi_vla_http.py` (HTTP).
## Configuration
The SDK path is the recommended integration for Python workflows:
```python
import mint
import mint.mint as mintx

service_client = mint.ServiceClient()
training_client = mintx.create_openpi_training_client(
    service_client,
    base_model=mintx.OPENPI_FAST_MODEL,
    rank=mintx.OPENPI_FAST_LORA_RANK,  # Default: 16
    create_timeout_seconds=1200.0,
    user_metadata={"example": "vla-training"},
)

info = training_client.get_info()
print(f"Model: {info.model_name}, LoRA rank: {info.lora_rank}")
```

Environment variables:
```bash
export MINT_API_KEY=sk-your-key
export MINT_BASE_URL=https://mint.macaron.xin/
export MINT_OPENPI_SDK_BASE_MODEL="openpi/pi0-fast-libero-low-mem-finetune"
export MINT_OPENPI_SDK_LORA_RANK=16
export MINT_OPENPI_SDK_LR=0.003
```

The HTTP path bypasses the SDK and sends raw JSON over HTTPS. Use it for non-Python clients or direct integration tests:
```bash
curl -X POST https://mint.macaron.xin/api/v1/sessions/create \
  -H "Authorization: Bearer $MINT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["vla-training"], "type": "create_session"}'
```

The response includes a `session_id`. Then create a model:
```bash
curl -X POST https://mint.macaron.xin/api/v1/models/create \
  -H "Authorization: Bearer $MINT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "...",
    "base_model": "openpi/pi0-fast-libero-low-mem-finetune",
    "lora_config": {"rank": 16, "train_attn": true, "train_mlp": true, "train_unembed": true}
  }'
```

See `demos/embodied/openpi_vla_http.py` for the full wire protocol and request builders.
## Prompting Guide
VLA prompts bundle three modalities: images, state, and language. The canonical structure:
```python
from mint.mint import build_openpi_fast_datum, CAMERA_LAYOUT

# Images from three fixed cameras
images: dict[str, bytes] = {
    "base_0_rgb": load_png("base_cam.png"),
    "left_wrist_0_rgb": load_png("left_cam.png"),
    "right_wrist_0_rgb": load_png("right_cam.png"),
}

# State vector (proprioceptive)
state = [0.1, -0.2, 0.05, ...]  # Joint angles, gripper position, etc.

# Action tokens (what the model generates)
target_tokens = [42, 43, 44, ...]  # Action quantization

# Construct datum
datum = build_openpi_fast_datum(
    prefix_tokens=[],                        # Optional instruction tokens
    image_bytes_by_camera=images,
    state=state,
    target_tokens=target_tokens,
    weights=[1.0] * len(target_tokens),      # Train on all actions
    token_ar_mask=[1] * len(target_tokens),  # Autoregressive mask
)
```

Key fields:
- Images: Three fixed camera views (base, left wrist, right wrist) in RGB PNG format; a minimal `load_png` sketch follows this list.
- State: Robot proprioception (joint angles, gripper state, end-effector pose).
- Actions: Quantized as token IDs; the model predicts the next token in the action sequence.
- `token_ar_mask`: Autoregressive decoding mask (1 = generate this token, 0 = skip).
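The prompting snippet above calls a `load_png` helper that is not defined in this doc. A minimal version, under the assumption that the datum builder accepts raw PNG bytes with no client-side re-encoding:

```python
from pathlib import Path

def load_png(path: str) -> bytes:
    # Return the raw PNG bytes for one camera view.
    # Assumption: images are already RGB PNGs at the resolution the model expects.
    return Path(path).read_bytes()
```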
## Output Format
The VLA model outputs action tokens, which must be dequantized back to continuous control signals. For OpenPI FAST:
- Token shape: `[seq_len]`, one token per timestep.
- Token range: 0–511 (8-bit quantization per dimension, 2 dimensions = 2 tokens per step).
- Dequantization: token `i` → continuous value `(i / 256) - 1.0`, rescaled to the robot's action ranges (see the sketch below).
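A worked sketch of the stated dequantization rule. The `low`/`high` rescaling bounds and the uniform layout are illustrative assumptions; the exact OpenPI FAST tokenizer layout may differ:

```python
def dequantize(tokens: list[int], low: float = -1.0, high: float = 1.0) -> list[float]:
    """Map action tokens back to continuous control values."""
    values = []
    for i in tokens:
        norm = (i / 256.0) - 1.0  # stated rule: token i -> (i / 256) - 1.0
        # Rescale from the normalized range [-1, 1) to the robot's action range.
        values.append(low + (norm + 1.0) * (high - low) / 2.0)
    return values

# Example: token 0 -> -1.0, token 256 -> 0.0, token 511 -> ~0.996 (for low=-1, high=1).
print(dequantize([0, 256, 511]))
```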
The training loop computes the loss on action predictions (a multi-step variant is sketched after the save/sample snippet below):

```python
from mint import types  # assumed import path for AdamParams

result = training_client.train_step(
    [datum],
    loss_fn="cross_entropy",
    adam_params=types.AdamParams(learning_rate=0.003),
).result()
print(f"Loss: {result.metrics.get('loss')}")
```

After training, save the weights and sample:
```python
sampler = training_client.save_weights_for_sampler(
    name="vla-checkpoint-1",
    ttl_seconds=3600,
).result()
# Use sampler for inference (not yet documented in MinT)
```
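Putting the pieces together, a minimal multi-step training loop, as a sketch; `dataset`, the epoch count, and the checkpoint name are illustrative:

```python
# Sketch only: `dataset` is any iterable of datums from build_openpi_fast_datum.
from mint import types  # assumed import path, as above

for epoch in range(3):
    for datum in dataset:
        result = training_client.train_step(
            [datum],
            loss_fn="cross_entropy",
            adam_params=types.AdamParams(learning_rate=3e-3),
        ).result()
    print(f"epoch {epoch}: loss={result.metrics.get('loss')}")

# Export a checkpoint once the loss plateaus.
sampler = training_client.save_weights_for_sampler(
    name="vla-checkpoint-final",  # illustrative name
    ttl_seconds=3600,
).result()
```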
## All Parameters

| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `base_model` | str | `"openpi/pi0-fast-libero-low-mem-finetune"` | OpenPI model variant. FAST = lightweight, ~0.6B params. |
| `rank` | int | 16 | LoRA rank. VLA typically uses 8–32. |
| `train_mlp` | bool | True | Train MLP layers. |
| `train_attn` | bool | True | Train attention layers. |
| `train_unembed` | bool | True | Train the output layer (action head). |
| `learning_rate` | float | 0.003 | Adam LR. VLA: 1e-4 to 1e-2, higher than typical language-model LRs. |
| `max_frames` | int | 10 | Max frames (images) per batch. VLA: 1–32. |
| `action_dim` | int | 2 | Action dimensionality. Default = (dx, dy) for the gripper. |
| `quantization_levels` | int | 256 | Levels per action dimension (256 = 8-bit); see the sketch below. |
| `create_timeout_seconds` | float | 1200.0 | Timeout for session/model creation. |
| `step_timeout_seconds` | float | 1200.0 | Timeout per training step. |
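To make `action_dim` and `quantization_levels` concrete, here is a sketch of uniform per-dimension quantization. The binning and the `[-1, 1]` normalization are illustrative assumptions, not the exact OpenPI FAST tokenizer:

```python
def quantize_action(action: list[float], levels: int = 256,
                    low: float = -1.0, high: float = 1.0) -> list[int]:
    # One token per action dimension: action_dim=2 -> 2 tokens per timestep.
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clamp into the action range
        # Map [low, high] uniformly onto {0, ..., levels - 1}.
        tokens.append(min(int((a - low) / (high - low) * levels), levels - 1))
    return tokens

print(quantize_action([0.0, -0.5]))  # [128, 64] for the default 256 levels
```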
SDK-specific (environment variables):

```bash
export MINT_OPENPI_SDK_BASE_MODEL="..."
export MINT_OPENPI_SDK_LORA_RANK=16
export MINT_OPENPI_SDK_LR=0.003
export MINT_OPENPI_SDK_CREATE_TIMEOUT_SECONDS=1200
export MINT_OPENPI_SDK_STEP_TIMEOUT_SECONDS=1200
```

HTTP-specific (wire protocol):
- `create_session`: Initialize a training session.
- `create_model`: Allocate LoRA for a model within the session.
- `train_step`: Submit data and receive loss/metrics.
- `save_weights_for_sampler`: Export weights for inference.
- `delete_model`: Clean up.
See `demos/embodied/openpi_vla_http.py` for the exact JSON schemas.
**Status:** VLA support in MinT is new and evolving. The OpenPI FAST model is optimized for low-latency robot control; larger model variants for high-dimensional tasks (full manipulation) may be released in future versions.