Judges

Configure VLM judges: Gemini, GPT-5.2, Claude, Ollama, or custom.

Evalytic uses Vision-Language Models (VLMs) as judges to score AI-generated images. Any VLM that can analyze images and output structured JSON can serve as a judge. The default is gemini-2.5-flash.

Supported Providers

| Provider | Models | API Key Env | Base URL |
| --- | --- | --- | --- |
| Gemini | gemini-2.5-flash, gemini-2.5-pro | GEMINI_API_KEY | generativelanguage.googleapis.com |
| OpenAI | openai/gpt-5.2 | OPENAI_API_KEY | api.openai.com |
| Anthropic | anthropic/claude-sonnet-4-6, anthropic/claude-haiku-4-5 | ANTHROPIC_API_KEY | api.anthropic.com |
| Ollama | ollama/qwen2.5-vl:7b | None (local) | localhost:11434 |
| LM Studio | lmstudio/<model> | None (local) | localhost:1234 |
| Custom | local/<model> | None | localhost:8090 (or --judge-url) |

Judge Format

The --judge flag accepts two formats:

# Format 1: model name only (Gemini assumed)
--judge gemini-2.5-flash

# Format 2: provider/model
--judge openai/gpt-5.2
--judge anthropic/claude-sonnet-4-6
--judge ollama/qwen2.5-vl:7b
--judge lmstudio/my-model
--judge local/my-model
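The two-format rule above can be sketched as a small parser. This is an illustrative sketch only; parse_judge is a hypothetical helper, not part of Evalytic's actual API:

```python
def parse_judge(spec: str) -> tuple[str, str]:
    """Split a --judge value into (provider, model).

    Format 1: bare model name -> the Gemini provider is assumed.
    Format 2: provider/model  -> split on the first slash only,
    so model tags containing colons (e.g. qwen2.5-vl:7b) pass
    through untouched.
    """
    if "/" in spec:
        provider, model = spec.split("/", 1)
        return provider, model
    return "gemini", spec
```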

Gemini (Default)

Gemini is the default judge — fast, affordable, and easy to set up.

# These are equivalent (gemini is the default provider)
evaly bench -m flux-schnell -p "A cat"
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-flash

# Use Gemini Pro for higher quality scoring
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-pro
Get your Gemini API key at aistudio.google.com/apikey and set it as the GEMINI_API_KEY environment variable.

OpenAI

export OPENAI_API_KEY=sk-...
evaly bench -m flux-schnell -p "A cat" -j openai/gpt-5.2

Anthropic

export ANTHROPIC_API_KEY=sk-ant-...
evaly bench -m flux-schnell -p "A cat" -j anthropic/claude-sonnet-4-6

Ollama (Local)

Run judges entirely on your machine with no API costs:

# First, start Ollama and pull a vision model
ollama pull qwen2.5-vl:7b

# Use it as judge
evaly bench -m flux-schnell -p "A cat" -j ollama/qwen2.5-vl:7b
Ollama must be running at localhost:11434. Quality may vary compared to cloud judges like Gemini or GPT-5.2.

Custom Judge URL

Use --judge-url to point to any OpenAI-compatible API:

evaly bench \
    -m flux-schnell \
    -p "A cat" \
    -j local/my-model \
    --judge-url http://my-server:8080/v1
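"OpenAI-compatible" means the server accepts the standard chat-completions request shape, with images passed as base64 data URLs. Evalytic's exact request is not documented here; the sketch below shows the typical payload such an endpoint expects, with a hypothetical helper name:

```python
import base64


def build_judge_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-compatible chat payload that asks a VLM to
    score one image. Field names follow the standard chat-completions
    format; this is an illustration, not Evalytic's actual request."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        # Ask for structured JSON scores, as the judge prompt requires
        "response_format": {"type": "json_object"},
    }
```

POSTing this payload to http://my-server:8080/v1/chat/completions (with any required auth header) is all an OpenAI-compatible judge backend needs to support.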

Choosing a Judge

| Judge | Quality | Speed | Cost | Best For |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash | Good | Fast | Low | Default, development, small benchmarks |
| gemini-2.5-pro | Excellent | Medium | Paid | High-stakes evaluations, publication |
| openai/gpt-5.2 | Excellent | Medium | Paid | Cross-validation with a different VLM |
| anthropic/claude-sonnet-4-6 | Excellent | Medium | Paid | Cross-validation with a different VLM |
| ollama/qwen2.5-vl:7b | Moderate | Slow | Free (local) | Privacy-sensitive, offline, experimentation |

Consensus Mode

A single VLM judge can exhibit systematic biases, such as clustering scores near the top of the scale or scoring some dimensions inconsistently. Consensus mode uses 2–3 judges with an adaptive 2+1 algorithm to produce more reliable scores.

How it works

For each (image, dimension) pair:

  1. Two primary judges score in parallel
  2. If they agree (within 0.5 points) → average, marked as high agreement
  3. If they disagree (>0.5 difference) → third judge breaks the tie with a median, marked as disputed
  4. If one judge fails → the other judge's score is used, marked as degraded

This keeps cost at ~2.3x instead of 3x, since the tiebreaker is only called for disputed dimensions.
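The adaptive 2+1 logic above can be sketched as follows. This is a minimal illustration of the algorithm as described, not Evalytic's actual implementation; the function and parameter names are hypothetical:

```python
from statistics import median


def consensus_score(score_a, score_b, tiebreak, threshold=0.5):
    """Combine two primary judge scores for one (image, dimension) pair.

    score_a, score_b: primary judges' scores, or None if a judge failed.
    tiebreak: callable invoked only on disagreement, returning the
              third judge's score.
    Returns (score, agreement) where agreement is one of
    "high", "disputed", or "degraded".
    """
    if score_a is None and score_b is None:
        raise RuntimeError("both primary judges failed")
    if score_a is None or score_b is None:
        # One judge failed: fall back to the surviving score
        return (score_a if score_a is not None else score_b), "degraded"
    if abs(score_a - score_b) <= threshold:
        # Agreement within 0.5 points: average the two scores
        return (score_a + score_b) / 2, "high"
    # Disagreement: third judge breaks the tie via the median
    return median([score_a, score_b, tiebreak()]), "disputed"
```

Because tiebreak is only called on the disputed branch, the third judge's cost is incurred only for the fraction of dimensions where the primaries disagree, which is how the average cost lands near 2.3x rather than 3x.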

Usage

# 2 judges — average when they agree, flag disputes
evaly bench -m flux-schnell -p "A cat" \
    --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges — disputed dimensions get a tiebreaker (median)
evaly bench -m flux-schnell -p prompts.json \
    --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"

Recommended combinations

| Combination | Cost | Best For |
| --- | --- | --- |
| gemini-2.5-flash, gpt-5.2 | Low | Budget consensus: two cheap judges, no tiebreaker |
| gemini-2.5-flash, gpt-5.2, claude-haiku-4-5 | Low–Medium | Full consensus with tiebreaker, 3 different providers |
| gemini-2.5-pro, gpt-5.2 | Medium | High-quality consensus, two strong judges |

Report output

In consensus mode, reports include:

  • Agree column in the model comparison table — percentage of dimensions with high agreement
  • Per-judge scores in score details — which judge gave what score
  • Agreement badges (high, disputed, degraded) per dimension
  • Per-provider costs in the cost summary

Config file

# evalytic.toml
[bench]
judges = ["gemini-2.5-flash", "gpt-5.2", "claude-haiku-4-5"]
Each judge requires its own API key. When using gemini-2.5-flash + gpt-5.2, both GEMINI_API_KEY and OPENAI_API_KEY must be set.

Configuration

Set a default judge in evalytic.toml:

# evalytic.toml — single judge
[bench]
judge = "openai/gpt-5.2"

# Or consensus mode (overrides judge)
# judges = ["gemini-2.5-flash", "gpt-5.2"]

CLI flags always override config file settings. See Configuration for the full precedence order.