Judges

Configure VLM judges: Gemini, GPT-5.2, Claude, Ollama, or custom.

Evalytic uses Vision-Language Models (VLMs) as judges to score AI-generated images. Any VLM that can analyze images and output structured JSON can serve as a judge. The default is gemini-2.5-flash.

Supported Providers

| Provider | Models | API Key Env | Base URL |
| --- | --- | --- | --- |
| Gemini | gemini-2.5-flash, gemini-2.5-pro | GEMINI_API_KEY | generativelanguage.googleapis.com |
| OpenAI | openai/gpt-5.2 | OPENAI_API_KEY | api.openai.com |
| Anthropic | anthropic/claude-sonnet-4-6, anthropic/claude-haiku-4-5 | ANTHROPIC_API_KEY | api.anthropic.com |
| Ollama | ollama/qwen2.5-vl:7b | None (local) | localhost:11434 |
| LM Studio | lmstudio/<model> | None (local) | localhost:1234 |
| Custom | local/<model> | None | localhost:8090 (or --judge-url) |

Judge Format

The --judge flag accepts two formats:

# Format 1: model name only (Gemini assumed)
--judge gemini-2.5-flash

# Format 2: provider/model
--judge openai/gpt-5.2
--judge anthropic/claude-sonnet-4-6
--judge ollama/qwen2.5-vl:7b
--judge lmstudio/my-model
--judge local/my-model
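The two-format rule above can be sketched as a small parser. This is an illustrative sketch only; parse_judge is a hypothetical helper, not part of Evalytic's actual API:

```python
def parse_judge(spec: str) -> tuple[str, str]:
    """Split a --judge value into (provider, model).

    Format 1: bare model name -> the Gemini provider is assumed.
    Format 2: provider/model  -> split on the first slash only,
    so model tags containing colons (e.g. qwen2.5-vl:7b) pass
    through untouched.
    """
    if "/" in spec:
        provider, model = spec.split("/", 1)
        return provider, model
    return "gemini", spec
```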

Gemini (Default)

Gemini is the default judge — fast, affordable, and easy to set up.

# These are equivalent (gemini is the default provider)
evaly bench -m flux-schnell -p "A cat"
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-flash

# Use Gemini Pro for higher quality scoring
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-pro
Get your Gemini API key at aistudio.google.com/apikey and set it as the GEMINI_API_KEY environment variable.

OpenAI

export OPENAI_API_KEY=sk-...
evaly bench -m flux-schnell -p "A cat" -j openai/gpt-5.2

Anthropic

export ANTHROPIC_API_KEY=sk-ant-...
evaly bench -m flux-schnell -p "A cat" -j anthropic/claude-sonnet-4-6

Ollama (Local)

Run judges entirely on your machine with no API costs:

# First, start Ollama and pull a vision model
ollama pull qwen2.5-vl:7b

# Use it as judge
evaly bench -m flux-schnell -p "A cat" -j ollama/qwen2.5-vl:7b
Ollama must be running at localhost:11434. Quality may vary compared to cloud judges like Gemini or GPT-5.2.

Custom Judge URL

Use --judge-url to point to any OpenAI-compatible API:

evaly bench \
    -m flux-schnell \
    -p "A cat" \
    -j local/my-model \
    --judge-url http://my-server:8080/v1
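"OpenAI-compatible" means the server accepts the standard chat-completions request shape, with images passed as base64 data URLs. Evalytic's exact request is not documented here; the sketch below shows the typical payload such an endpoint expects, with a hypothetical helper name:

```python
import base64


def build_judge_request(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-compatible chat payload that asks a VLM to
    score one image. Field names follow the standard chat-completions
    format; this is an illustration, not Evalytic's actual request."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        # Ask for structured JSON scores, as the judge prompt requires
        "response_format": {"type": "json_object"},
    }
```

POSTing this payload to http://my-server:8080/v1/chat/completions (with any required auth header) is all an OpenAI-compatible judge backend needs to support.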

Choosing a Judge

| Judge | Quality | Speed | Cost | Best For |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash | Good | Fast | Low | Default, development, small benchmarks |
| gemini-2.5-pro | Excellent | Medium | Paid | High-stakes evaluations, publication |
| openai/gpt-5.2 | Excellent | Medium | Paid | Cross-validation with a different VLM |
| anthropic/claude-sonnet-4-6 | Excellent | Medium | Paid | Cross-validation with a different VLM |
| ollama/qwen2.5-vl:7b | Moderate | Slow | Free (local) | Privacy-sensitive, offline, experimentation |

Consensus Mode

A single VLM judge can exhibit systematic biases, such as clustering scores near the top of the scale or scoring some dimensions inconsistently. Consensus mode uses 2–3 judges with an adaptive 2+1 algorithm to produce more reliable scores.

How it works

For each (image, dimension) pair:

  1. Two primary judges score in parallel
  2. If they agree (within 0.5 points) → average, marked as high agreement
  3. If they disagree (>0.5 difference) → third judge breaks the tie with a median, marked as disputed
  4. If one judge fails → the other judge's score is used, marked as degraded

This keeps cost at ~2.3x instead of 3x, since the tiebreaker is only called for disputed dimensions.
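The adaptive 2+1 logic above can be sketched as follows. This is a minimal illustration of the algorithm as described, not Evalytic's actual implementation; the function and parameter names are hypothetical:

```python
from statistics import median


def consensus_score(score_a, score_b, tiebreak, threshold=0.5):
    """Combine two primary judge scores for one (image, dimension) pair.

    score_a, score_b: primary judges' scores, or None if a judge failed.
    tiebreak: callable invoked only on disagreement, returning the
              third judge's score.
    Returns (score, agreement) where agreement is one of
    "high", "disputed", or "degraded".
    """
    if score_a is None and score_b is None:
        raise RuntimeError("both primary judges failed")
    if score_a is None or score_b is None:
        # One judge failed: fall back to the surviving score
        return (score_a if score_a is not None else score_b), "degraded"
    if abs(score_a - score_b) <= threshold:
        # Agreement within 0.5 points: average the two scores
        return (score_a + score_b) / 2, "high"
    # Disagreement: third judge breaks the tie via the median
    return median([score_a, score_b, tiebreak()]), "disputed"
```

Because tiebreak is only called on the disputed branch, the third judge's cost is incurred only for the fraction of dimensions where the primaries disagree, which is how the average cost lands near 2.3x rather than 3x.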

Usage

# 2 judges — average when they agree, flag disputes
evaly bench -m flux-schnell -p "A cat" \
    --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges — disputed dimensions get a tiebreaker (median)
evaly bench -m flux-schnell -p prompts.json \
    --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"

Recommended combinations

| Combination | Cost | Best For |
| --- | --- | --- |
| gemini-2.5-flash, gpt-5.2 | Low | Budget consensus: two cheap judges, no tiebreaker |
| gemini-2.5-flash, gpt-5.2, claude-haiku-4-5 | Low–Medium | Full consensus with tiebreaker, 3 different providers |
| gemini-2.5-pro, gpt-5.2 | Medium | High-quality consensus, two strong judges |

Report output

In consensus mode, reports include:

  • Agree column in the model comparison table — percentage of dimensions with high agreement
  • Per-judge scores in score details — which judge gave what score
  • Agreement badges (high, disputed, degraded) per dimension
  • Per-provider costs in the cost summary

Config file

# evalytic.toml
[bench]
judges = ["gemini-2.5-flash", "gpt-5.2", "claude-haiku-4-5"]
Each judge requires its own API key. When using gemini-2.5-flash + gpt-5.2, both GEMINI_API_KEY and OPENAI_API_KEY must be set.

Configuration

Set a default judge in evalytic.toml:

# evalytic.toml — single judge
[bench]
judge = "openai/gpt-5.2"

# Or consensus mode (overrides judge)
# judges = ["gemini-2.5-flash", "gpt-5.2"]

CLI flags always override config file settings. See Configuration for the full precedence order.