# VLA
Vision-Language-Action (VLA) models take images and language instructions as input and output robot actions. MinT supports OpenPI-compatible VLA training via two paths: a Python SDK for convenience, and direct HTTP for integration with non-Python systems.

The canonical implementations are `demos/embodied/openpi_vla_sdk.py` (SDK) and `demos/embodied/openpi_vla_http.py` (HTTP).
## Configuration
The SDK path is the recommended integration for Python workflows:
```python
import mint
import mint.mint as mintx

service_client = mint.ServiceClient()
training_client = mintx.create_openpi_training_client(
    service_client,
    base_model=mintx.OPENPI_FAST_MODEL,
    rank=mintx.OPENPI_FAST_LORA_RANK,  # Default: 16
    create_timeout_seconds=1200.0,
    user_metadata={"example": "vla-training"},
)

info = training_client.get_info()
print(f"Model: {info.model_name}, LoRA rank: {info.lora_rank}")
```

Environment variables:
```bash
export MINT_API_KEY=sk-your-key
export MINT_BASE_URL=https://mint.macaron.xin/
export MINT_OPENPI_SDK_BASE_MODEL="openpi/pi0-fast-libero-low-mem-finetune"
export MINT_OPENPI_SDK_LORA_RANK=16
export MINT_OPENPI_SDK_LR=0.003
```

The HTTP path bypasses the SDK and sends raw JSON over HTTPS. Use it for non-Python clients or direct integration tests:
```bash
curl -X POST https://mint.macaron.xin/api/v1/sessions/create \
  -H "Authorization: Bearer $MINT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["vla-training"], "type": "create_session"}'
```

The response includes a `session_id`. Then create a model:
```bash
curl -X POST https://mint.macaron.xin/api/v1/models/create \
  -H "Authorization: Bearer $MINT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "...",
    "base_model": "openpi/pi0-fast-libero-low-mem-finetune",
    "lora_config": {"rank": 16, "train_attn": true, "train_mlp": true, "train_unembed": true}
  }'
```

See `demos/embodied/openpi_vla_http.py` for the full wire protocol and request builders.
## Prompting Guide
VLA prompts bundle three modalities: images, state, and language. The canonical structure:
```python
from mint.mint import build_openpi_fast_datum, CAMERA_LAYOUT

# Images from three fixed cameras
images: dict[str, bytes] = {
    "base_0_rgb": load_png("base_cam.png"),
    "left_wrist_0_rgb": load_png("left_cam.png"),
    "right_wrist_0_rgb": load_png("right_cam.png"),
}

# State vector (proprioceptive)
state = [0.1, -0.2, 0.05, ...]  # Joint angles, gripper position, etc.

# Action tokens (what the model generates)
target_tokens = [42, 43, 44, ...]  # Action quantization

# Construct datum
datum = build_openpi_fast_datum(
    prefix_tokens=[],                        # Optional instruction tokens
    image_bytes_by_camera=images,
    state=state,
    target_tokens=target_tokens,
    weights=[1.0] * len(target_tokens),      # Train on all actions
    token_ar_mask=[1] * len(target_tokens),  # Autoregressive mask
)
```

Key fields:
- Images: Three fixed camera views (base, left wrist, right wrist) in RGB PNG format; a minimal `load_png` sketch follows this list.
- State: Robot proprioception (joint angles, gripper state, end-effector pose).
- Actions: Quantized as token IDs; the model predicts the next token in the action sequence.
- `token_ar_mask`: Autoregressive decoding mask (1 = generate this token, 0 = skip).
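The prompting snippet above calls a `load_png` helper that is not defined in this doc. A minimal version, under the assumption that the datum builder accepts raw PNG bytes with no client-side re-encoding:

```python
from pathlib import Path

def load_png(path: str) -> bytes:
    # Return the raw PNG bytes for one camera view.
    # Assumption: images are already RGB PNGs at the resolution the model expects.
    return Path(path).read_bytes()
```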
## Output Format
The VLA model outputs action tokens, which must be dequantized back to continuous control signals. For OpenPI FAST:
- Token shape: `[seq_len]`, one token per timestep.
- Token range: 0–511 (8-bit quantization per dimension, 2 dimensions = 2 tokens per step).
- Dequantization: token `i` → continuous value `(i / 256) - 1.0`, rescaled to the robot's action ranges (see the sketch below).
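A worked sketch of the stated dequantization rule. The `low`/`high` rescaling bounds and the uniform layout are illustrative assumptions; the exact OpenPI FAST tokenizer layout may differ:

```python
def dequantize(tokens: list[int], low: float = -1.0, high: float = 1.0) -> list[float]:
    """Map action tokens back to continuous control values."""
    values = []
    for i in tokens:
        norm = (i / 256.0) - 1.0  # stated rule: token i -> (i / 256) - 1.0
        # Rescale from the normalized range [-1, 1) to the robot's action range.
        values.append(low + (norm + 1.0) * (high - low) / 2.0)
    return values

# Example: token 0 -> -1.0, token 256 -> 0.0, token 511 -> ~0.996 (for low=-1, high=1).
print(dequantize([0, 256, 511]))
```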
The training loop computes the loss on action predictions (a multi-step variant is sketched after the save/sample snippet below):

```python
from mint import types  # assumed import path for AdamParams

result = training_client.train_step(
    [datum],
    loss_fn="cross_entropy",
    adam_params=types.AdamParams(learning_rate=0.003),
).result()
print(f"Loss: {result.metrics.get('loss')}")
```

After training, save the weights and sample:
```python
sampler = training_client.save_weights_for_sampler(
    name="vla-checkpoint-1",
    ttl_seconds=3600,
).result()
# Use sampler for inference (not yet documented in MinT)
```
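Putting the pieces together, a minimal multi-step training loop, as a sketch; `dataset`, the epoch count, and the checkpoint name are illustrative:

```python
# Sketch only: `dataset` is any iterable of datums from build_openpi_fast_datum.
from mint import types  # assumed import path, as above

for epoch in range(3):
    for datum in dataset:
        result = training_client.train_step(
            [datum],
            loss_fn="cross_entropy",
            adam_params=types.AdamParams(learning_rate=3e-3),
        ).result()
    print(f"epoch {epoch}: loss={result.metrics.get('loss')}")

# Export a checkpoint once the loss plateaus.
sampler = training_client.save_weights_for_sampler(
    name="vla-checkpoint-final",  # illustrative name
    ttl_seconds=3600,
).result()
```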
## All Parameters

| Parameter | Type | Default | Meaning |
|---|---|---|---|
| `base_model` | str | `"openpi/pi0-fast-libero-low-mem-finetune"` | OpenPI model variant. FAST = lightweight, ~0.6B params. |
| `rank` | int | 16 | LoRA rank. VLA typically uses 8–32. |
| `train_mlp` | bool | True | Train MLP layers. |
| `train_attn` | bool | True | Train attention layers. |
| `train_unembed` | bool | True | Train the output layer (action head). |
| `learning_rate` | float | 0.003 | Adam LR. VLA: 1e-4 to 1e-2, higher than typical language-model LRs. |
| `max_frames` | int | 10 | Max frames (images) per batch. VLA: 1–32. |
| `action_dim` | int | 2 | Action dimensionality. Default = (dx, dy) for the gripper. |
| `quantization_levels` | int | 256 | Levels per action dimension (256 = 8-bit); see the sketch below. |
| `create_timeout_seconds` | float | 1200.0 | Timeout for session/model creation. |
| `step_timeout_seconds` | float | 1200.0 | Timeout per training step. |
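To make `action_dim` and `quantization_levels` concrete, here is a sketch of uniform per-dimension quantization. The binning and the `[-1, 1]` normalization are illustrative assumptions, not the exact OpenPI FAST tokenizer:

```python
def quantize_action(action: list[float], levels: int = 256,
                    low: float = -1.0, high: float = 1.0) -> list[int]:
    # One token per action dimension: action_dim=2 -> 2 tokens per timestep.
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clamp into the action range
        # Map [low, high] uniformly onto {0, ..., levels - 1}.
        tokens.append(min(int((a - low) / (high - low) * levels), levels - 1))
    return tokens

print(quantize_action([0.0, -0.5]))  # [128, 64] for the default 256 levels
```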
SDK-specific (environment variables):

```bash
export MINT_OPENPI_SDK_BASE_MODEL="..."
export MINT_OPENPI_SDK_LORA_RANK=16
export MINT_OPENPI_SDK_LR=0.003
export MINT_OPENPI_SDK_CREATE_TIMEOUT_SECONDS=1200
export MINT_OPENPI_SDK_STEP_TIMEOUT_SECONDS=1200
```

HTTP-specific (wire protocol):
- `create_session`: Initialize a training session.
- `create_model`: Allocate LoRA for a model within the session.
- `train_step`: Submit data and receive loss/metrics.
- `save_weights_for_sampler`: Export weights for inference.
- `delete_model`: Clean up.
See `demos/embodied/openpi_vla_http.py` for the exact JSON schemas.
**Status:** VLA support in MinT is new and evolving. The OpenPI FAST model is optimized for low-latency robot control; larger model variants for high-dimensional tasks (full manipulation) may be released in future versions.