# Evaluations
Evaluations measure how well your trained model performs on held-out data. MinT supports three evaluation patterns: benchmark metrics (exact match, BLEU, custom functions), sampled completions on a test set, and LLM-as-judge scoring with a separate evaluator model.
## Concept

Evaluation happens outside the training loop. You create a `SamplingClient` from your checkpoint, then run completions on a test set. The results are scored using:
- Exact match — Does the model's output exactly match the target? Used for tasks with deterministic answers (math, code, facts).
- Custom metric functions — User-defined scorers that compute task-specific metrics (BLEU, ROUGE, F1, task success rate).
- LLM-as-judge — A separate evaluator model scores the quality of completions. Use this for subjective tasks (helpfulness, coherence, style) where exact match doesn't apply.
All evaluation results are logged to a structured format (JSONL) for analysis and reporting.
## Pattern
```python
import mint
from mint import types
from mint.completers import TinkerMessageCompleter

# Load your trained checkpoint
service_client = mint.ServiceClient()
sampling_client = service_client.create_sampling_client_from_checkpoint(
    checkpoint_name="my-model-v1"
).result()
tokenizer = sampling_client.get_tokenizer()
renderer = mint.renderers.get_renderer("qwen3", tokenizer)

# Create a message completer for evaluation
completer = TinkerMessageCompleter(sampling_client=sampling_client, renderer=renderer)

# Example test set: simple Q&A
test_examples = [
    {
        "question": "What is 2 + 2?",
        "expected_answer": "4",
    },
    {
        "question": "What is the capital of France?",
        "expected_answer": "Paris",
    },
]

# Run completions
results = []
for example in test_examples:
    messages = [{"role": "user", "content": example["question"]}]
    response = completer.complete(
        messages=messages,
        sampling_params=types.SamplingParams(max_tokens=32, temperature=0.0),
    )
    # Extract the answer text and score with a case-insensitive substring match
    answer_text = response["content"]
    is_correct = example["expected_answer"].lower() in answer_text.lower()
    results.append({
        "question": example["question"],
        "expected": example["expected_answer"],
        "predicted": answer_text,
        "correct": is_correct,
    })

# Compute metrics
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"Accuracy: {accuracy:.2%}")

# For more complex metrics, define custom scorers
def custom_score(example, prediction):
    # Example: task-specific scoring logic (strict exact match here)
    return 1.0 if prediction == example["expected_answer"] else 0.0

custom_scores = [
    custom_score(example, r["predicted"])
    for example, r in zip(test_examples, results)
]
print(f"Custom metric mean: {sum(custom_scores) / len(custom_scores):.4f}")
```

View full source: https://github.com/MindLab-Research/mint-quickstart/blob/main/concepts/evaluations.py
## API Surface

| Component | Purpose | Input | Output |
|---|---|---|---|
| `TinkerMessageCompleter` | Generate completions on a test set | `list[Message]` | `Message` (response) |
| `SamplingClient.sample()` | Token-level generation | `ModelInput` | `SampleOutput` (tokens + logprobs) |
| Custom metric functions | Score individual examples | `(example, prediction)` | `float` (score) |
| LLM-as-judge pattern | Use another model to score | `(example, prediction)` | Judge response |
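The Pattern code covers exact match and custom scorers but not the judge row above. A minimal LLM-as-judge sketch, reusing `types` and the completer interface from the Pattern section; `judge_completer` is assumed to be a second `TinkerMessageCompleter` pointed at a separate evaluator model, and `JUDGE_PROMPT` is an illustrative rubric, not a MinT API:

```python
# Hypothetical judge rubric: ask for a single-digit score so parsing is trivial.
JUDGE_PROMPT = (
    "Rate the following answer for correctness on a scale of 1-5. "
    "Reply with a single digit.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def judge_score(judge_completer, example, prediction):
    messages = [{
        "role": "user",
        "content": JUDGE_PROMPT.format(
            question=example["question"], answer=prediction
        ),
    }]
    response = judge_completer.complete(
        messages=messages,
        sampling_params=types.SamplingParams(max_tokens=4, temperature=0.0),
    )
    # Parse the leading digit; fall back to 0 if the judge reply is malformed.
    try:
        return int(response["content"].strip()[0])
    except (ValueError, IndexError):
        return 0
```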
Logging results:

- Write results to JSONL: `{"question": "...", "expected": "...", "predicted": "...", "score": ...}`
- Use pandas or your analysis tool to aggregate scores across the test set.
Caveats & Pitfalls
- Determinism with sampling: For fair comparisons, use
temperature=0.0during evaluation. Non-zero temperature introduces randomness, making results harder to reproduce. - Test set size: Smaller test sets (< 100 examples) give noisy metric estimates. Use larger test sets (1000+) for reliable signal.
- Metric choice: Exact match is only appropriate if your task has a single correct answer. For open-ended tasks (summarization, creative writing), use custom metrics or LLM judges.
- LLM judge bias: Judge models can have their own biases and preferences. Always validate judge scores on a hand-annotated subset before trusting them for decision-making.
- Batch evaluation: For efficiency, batch your completions using async APIs: gather multiple sampling futures before awaiting results.
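MinT's async sampling surface isn't shown in this section, so the sketch below fans the synchronous `completer.complete()` call out over threads with `asyncio`. It assumes the completer is thread-safe; if the library exposes native async sampling futures, prefer gathering those instead.

```python
import asyncio

# Run completions concurrently by wrapping the sync call in threads,
# with a semaphore capping in-flight requests.
async def run_batch(completer, examples, max_concurrency=8):
    sem = asyncio.Semaphore(max_concurrency)

    async def one(example):
        async with sem:
            messages = [{"role": "user", "content": example["question"]}]
            return await asyncio.to_thread(
                completer.complete,
                messages=messages,
                sampling_params=types.SamplingParams(
                    max_tokens=32, temperature=0.0
                ),
            )

    return await asyncio.gather(*(one(e) for e in examples))

# Usage: responses arrive in the same order as test_examples.
# responses = asyncio.run(run_batch(completer, test_examples))
```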