Mind Lab Toolkit (MinT)

Evaluations

Evaluations measure how well your trained model performs on held-out data. MinT supports three scoring patterns, all run over sampled completions on a test set: exact match, custom metric functions (BLEU, ROUGE, F1, or other task-specific scorers), and LLM-as-judge scoring with a separate evaluator model.

Concept

Evaluation happens outside the training loop. You create a SamplingClient from your checkpoint, then run completions on a test set. The results are scored using:

  • Exact match — Does the model's output exactly match the target? Used for tasks with deterministic answers (math, code, facts).
  • Custom metric functions — User-defined scorers that compute task-specific metrics (BLEU, ROUGE, F1, task success rate).
  • LLM-as-judge — A separate evaluator model scores the quality of completions. Use this for subjective tasks (helpfulness, coherence, style) where exact match doesn't apply.

All evaluation results are logged to a structured format (JSONL) for analysis and reporting.

Pattern

import mint
from mint import types
from mint.completers import TinkerMessageCompleter

# Load your trained checkpoint
service_client = mint.ServiceClient()
sampling_client = service_client.create_sampling_client_from_checkpoint(
    checkpoint_name="my-model-v1"
).result()

tokenizer = sampling_client.get_tokenizer()
renderer = mint.renderers.get_renderer("qwen3", tokenizer)

# Create a message completer for evaluation
completer = TinkerMessageCompleter(sampling_client=sampling_client, renderer=renderer)

# Example test set: simple Q&A
test_examples = [
    {
        "question": "What is 2 + 2?",
        "expected_answer": "4",
    },
    {
        "question": "What is the capital of France?",
        "expected_answer": "Paris",
    },
]

# Run completions
results = []
for example in test_examples:
    messages = [{"role": "user", "content": example["question"]}]
    
    response = completer.complete(
        messages=messages,
        sampling_params=types.SamplingParams(max_tokens=32, temperature=0.0),
    )
    
    # Extract the answer text
    answer_text = response["content"]
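    # Lenient containment check: count the answer as correct if the expected
    # string appears anywhere in the output (not a strict exact match)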
    is_correct = example["expected_answer"].lower() in answer_text.lower()
    
    results.append({
        "question": example["question"],
        "expected": example["expected_answer"],
        "predicted": answer_text,
        "correct": is_correct,
    })

# Compute metrics
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.2%}")

# For more complex metrics, define custom scorers
def custom_score(example, prediction):
    # Example: strict exact-match scoring; swap in BLEU, ROUGE, F1, etc. as needed
    return 1.0 if prediction == example["expected_answer"] else 0.0

custom_scores = [
    custom_score(example, result["predicted"])
    for example, result in zip(test_examples, results)
]
print(f"Custom metric mean: {sum(custom_scores) / len(custom_scores):.4f}")

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/concepts/evaluations.py

API Surface

Component               | Purpose                          | Input                 | Output
------------------------|----------------------------------|-----------------------|---------------------------------
TinkerMessageCompleter  | Generate completions on test set | list[Message]         | Message (response)
SamplingClient.sample() | Token-level generation           | ModelInput            | SampleOutput (tokens + logprobs)
Custom metric functions | Score individual examples        | (example, prediction) | float (score)
LLM-as-judge pattern    | Use another model to score       | (example, prediction) | Judge response
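
The judge pattern in the last row reuses the same completer interface as the Pattern above. A minimal sketch, assuming judge_completer is a second TinkerMessageCompleter built from a separate evaluator checkpoint exactly as completer is built above (the prompt wording and the judge_score name are illustrative):

def judge_score(example, prediction):
    # Ask the evaluator model for a 1-5 rating of the prediction
    prompt = (
        "Rate the following answer from 1 to 5 for correctness.\n"
        f"Question: {example['question']}\n"
        f"Answer: {prediction}\n"
        "Reply with a single integer."
    )
    response = judge_completer.complete(
        messages=[{"role": "user", "content": prompt}],
        sampling_params=types.SamplingParams(max_tokens=4, temperature=0.0),
    )
    try:
        return int(response["content"].strip())
    except ValueError:
        return None  # unparseable judge output; track these separately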

Logging results:

  • Write results to JSONL, one JSON object per line (see the sketch after this list): {"question": "...", "expected": "...", "predicted": "...", "score": ...}
  • Use pandas or your analysis tool to aggregate scores across the test set.
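
Writing the results list from the Pattern above needs only the standard library (the filename eval_results.jsonl is arbitrary):

import json

# One JSON object per line, following the JSONL convention
with open("eval_results.jsonl", "w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")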

Caveats & Pitfalls

  • Determinism with sampling: For fair comparisons, use temperature=0.0 during evaluation. Non-zero temperature introduces randomness, making results harder to reproduce.
  • Test set size: Smaller test sets (< 100 examples) give noisy metric estimates. Use larger test sets (1000+) for reliable signal.
  • Metric choice: Exact match is only appropriate if your task has a single correct answer. For open-ended tasks (summarization, creative writing), use custom metrics or LLM judges.
  • LLM judge bias: Judge models can have their own biases and preferences. Always validate judge scores on a hand-annotated subset before trusting them for decision-making.
  • Batch evaluation: For efficiency, batch completions through the async APIs: submit multiple sampling requests as futures, then collect the results together (see the sketch below).
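
A sketch of that batching pattern. It assumes an async submission method on the completer (complete_async is a hypothetical name; check your MinT version for the actual async API) that returns a future with .result(), mirroring the futures pattern ServiceClient uses in the Pattern above:

# Hypothetical batched evaluation: complete_async is an assumed method
# name; substitute whatever async/futures API your MinT version exposes
futures = []
for example in test_examples:
    messages = [{"role": "user", "content": example["question"]}]
    futures.append(completer.complete_async(
        messages=messages,
        sampling_params=types.SamplingParams(max_tokens=32, temperature=0.0),
    ))

# Collect responses only after every request has been submitted
responses = [f.result() for f in futures]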
