Judges
Configure VLM judges: Gemini, GPT-5.2, Claude, Ollama, or custom.
Evalytic uses Vision-Language Models (VLMs) as judges to score AI-generated images.
Any VLM that can analyze images and output structured JSON can serve as a judge.
The default is gemini-2.5-flash.
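For instance, a judge's structured output might look like the following. This is a hypothetical shape for illustration only — the dimension names and field layout are assumptions, not Evalytic's actual response schema:

```python
import json

# Hypothetical judge reply: per-dimension scores plus a rationale.
raw = """
{
  "scores": {"prompt_adherence": 8.5, "aesthetics": 7.0},
  "rationale": "Subject matches the prompt; lighting is flat."
}
"""
reply = json.loads(raw)
overall = sum(reply["scores"].values()) / len(reply["scores"])  # simple mean
```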
Supported Providers
| Provider | Models | API Key Env | Base URL |
|---|---|---|---|
| Gemini | gemini-2.5-flash, gemini-2.5-pro | GEMINI_API_KEY | generativelanguage.googleapis.com |
| OpenAI | openai/gpt-5.2 | OPENAI_API_KEY | api.openai.com |
| Anthropic | anthropic/claude-sonnet-4-6, anthropic/claude-haiku-4-5 | ANTHROPIC_API_KEY | api.anthropic.com |
| Ollama | ollama/qwen2.5-vl:7b | None (local) | localhost:11434 |
| LM Studio | lmstudio/&lt;model&gt; | None (local) | localhost:1234 |
| Custom | local/&lt;model&gt; | None | localhost:8090 (or --judge-url) |
Judge Format
The --judge flag accepts two formats:
```shell
# Format 1: model name only (Gemini assumed)
--judge gemini-2.5-flash

# Format 2: provider/model
--judge openai/gpt-5.2
--judge anthropic/claude-sonnet-4-6
--judge ollama/qwen2.5-vl:7b
--judge lmstudio/my-model
--judge local/my-model
```
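The two formats could be split with logic like this (an illustrative sketch, not Evalytic's internal code — the function name is made up, and the gemini default follows the behavior documented above):

```python
def parse_judge(spec: str) -> tuple[str, str]:
    """Split a --judge spec into (provider, model).

    A bare model name implies the gemini provider. Splitting only on the
    first '/' keeps model tags like 'qwen2.5-vl:7b' intact.
    """
    if "/" not in spec:
        return "gemini", spec
    provider, model = spec.split("/", 1)
    return provider, model
```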
Gemini (Default)
Gemini is the default judge — fast, affordable, and easy to set up.
```shell
# These are equivalent (gemini is the default provider)
evaly bench -m flux-schnell -p "A cat"
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-flash

# Use Gemini Pro for higher-quality scoring
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-pro
```
Requires the GEMINI_API_KEY env var.
OpenAI
```shell
export OPENAI_API_KEY=sk-...
evaly bench -m flux-schnell -p "A cat" -j openai/gpt-5.2
```
Anthropic
```shell
export ANTHROPIC_API_KEY=sk-ant-...
evaly bench -m flux-schnell -p "A cat" -j anthropic/claude-sonnet-4-6
```
Ollama (Local)
Run judges entirely on your machine with no API costs:
```shell
# First, start Ollama and pull a vision model
ollama pull qwen2.5-vl:7b

# Use it as judge
evaly bench -m flux-schnell -p "A cat" -j ollama/qwen2.5-vl:7b
```
Ollama must be running at localhost:11434. Quality may vary compared to cloud judges like Gemini or GPT-5.2.
Custom Judge URL
Use --judge-url to point to any OpenAI-compatible API:
```shell
evaly bench \
  -m flux-schnell \
  -p "A cat" \
  -j local/my-model \
  --judge-url http://my-server:8080/v1
```
Choosing a Judge
| Judge | Quality | Speed | Cost | Best For |
|---|---|---|---|---|
| gemini-2.5-flash | Good | Fast | Low | Default, development, small benchmarks |
| gemini-2.5-pro | Excellent | Medium | Paid | High-stakes evaluations, publication |
| openai/gpt-5.2 | Excellent | Medium | Paid | Cross-validation with a different VLM |
| anthropic/claude-sonnet-4-6 | Excellent | Medium | Paid | Cross-validation with a different VLM |
| ollama/qwen2.5-vl:7b | Moderate | Slow | Free (local) | Privacy-sensitive, offline, experimentation |
Consensus Mode
A single VLM judge can be biased — high-score clustering, dimension-specific inconsistency. Consensus mode uses 2–3 judges with an adaptive 2+1 algorithm for more reliable scores.
How it works
For each (image, dimension) pair:
- Two primary judges score in parallel
- If they agree (within 0.5 points) → average of the two, marked as `high` agreement
- If they disagree (> 0.5 difference) → a third judge breaks the tie with a median, marked as `disputed`
- If one judge fails → the other judge's score is used, marked as `degraded`
This keeps cost at ~2.3x instead of 3x, since the tiebreaker is only called for disputed dimensions.
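The 2+1 flow above can be sketched as follows. This is a simplified illustration under stated assumptions — the function name, the `None`-means-failure convention, and the exact failure handling are ours, not Evalytic internals:

```python
from statistics import median

def consensus_score(score_a, score_b, tiebreak=None, threshold=0.5):
    """Adaptive 2+1 consensus for one (image, dimension) pair.

    score_a / score_b: scores from the two primary judges (None = judge failed).
    tiebreak: zero-arg callable invoking the third judge, or None when only
    two judges are configured.
    """
    # One primary judge failed: fall back to the surviving score.
    if score_a is None or score_b is None:
        survivor = score_a if score_a is not None else score_b
        return survivor, "degraded"
    # Primary judges agree within the threshold: average them.
    if abs(score_a - score_b) <= threshold:
        return (score_a + score_b) / 2, "high"
    # Disagreement: only now call the third judge (this is what keeps cost
    # at ~2.3x instead of 3x) and take the median of all three scores.
    if tiebreak is not None:
        return median([score_a, score_b, tiebreak()]), "disputed"
    # With only two judges, average anyway but flag the dispute.
    return (score_a + score_b) / 2, "disputed"
```

With two judges configured (`tiebreak=None`), disagreements are still averaged but flagged as disputed, matching the two-judge usage shown below.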
Usage
```shell
# 2 judges — average when they agree, flag disputes
evaly bench -m flux-schnell -p "A cat" \
  --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges — disputed dimensions get a tiebreaker (median)
evaly bench -m flux-schnell -p prompts.json \
  --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"
```
Recommended combinations
| Combination | Cost | Best For |
|---|---|---|
| gemini-2.5-flash, gpt-5.2 | Low | Budget consensus — two cheap judges, no tiebreaker |
| gemini-2.5-flash, gpt-5.2, claude-haiku-4-5 | Low–Medium | Full consensus with tiebreaker, 3 different providers |
| gemini-2.5-pro, gpt-5.2 | Medium | High-quality consensus, two strong judges |
Report output
In consensus mode, reports include:
- Agree column in the model comparison table — percentage of dimensions with high agreement
- Per-judge scores in score details — which judge gave what score
- Agreement badges (`high`, `disputed`, `degraded`) per dimension
- Per-provider costs in the cost summary
Config file
```toml
# evalytic.toml
[bench]
judges = ["gemini-2.5-flash", "gpt-5.2", "claude-haiku-4-5"]
```
Each listed judge needs its provider's API key: for gemini-2.5-flash + gpt-5.2, both GEMINI_API_KEY and OPENAI_API_KEY must be set.
Configuration
Set a default judge in evalytic.toml:
```toml
# evalytic.toml — single judge
[bench]
judge = "openai/gpt-5.2"

# Or consensus mode (overrides judge)
# judges = ["gemini-2.5-flash", "gpt-5.2"]
```
CLI flags always override config file settings. See Configuration for the full precedence order.