# Judges
Configure VLM judges: Gemini, GPT-5.2, Claude, Ollama, fal.ai, or custom.
Evalytic uses Vision-Language Models (VLMs) as judges to score AI-generated images.
Any VLM that can analyze images and output structured JSON can serve as a judge.
The default judge is `gemini-2.5-flash`.

A single `FAL_KEY` covers both image generation and judging: `fal/`-prefixed judges (e.g. `fal/gemini-2.5-flash`) give access to Gemini, GPT-5.2, and Claude through one API key.
## Supported Providers

| Provider | Models | API Key Env | Base URL |
|---|---|---|---|
| Gemini | `gemini-2.5-flash`, `gemini-2.5-pro` | `GEMINI_API_KEY` | generativelanguage.googleapis.com |
| OpenAI | `openai/gpt-5.2` | `OPENAI_API_KEY` | api.openai.com |
| Anthropic | `anthropic/claude-sonnet-4-6`, `anthropic/claude-haiku-4-5` | `ANTHROPIC_API_KEY` | api.anthropic.com |
| fal.ai | `fal/gemini-2.5-flash`, `fal/gpt-5.2`, `fal/claude-sonnet-4-6` | `FAL_KEY` | fal.run (OpenRouter) |
| Ollama | `ollama/qwen2.5-vl:7b` | None (local) | localhost:11434 |
| LM Studio | `lmstudio/<model>` | None (local) | localhost:1234 |
| Custom | `local/<model>` | None | localhost:8090 (or `--judge-url`) |
## Judge Format

The `--judge` flag accepts two formats:

```bash
# Format 1: model name only (Gemini assumed)
--judge gemini-2.5-flash

# Format 2: provider/model
--judge openai/gpt-5.2
--judge anthropic/claude-sonnet-4-6
--judge fal/gemini-2.5-flash
--judge fal/gpt-5.2
--judge fal/claude-sonnet-4-6
--judge ollama/qwen2.5-vl:7b
--judge lmstudio/my-model
--judge local/my-model
```
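A minimal sketch of how the two formats map to a (provider, model) pair — illustrative only, not Evalytic's actual internals; it assumes Gemini is the default when no provider prefix is given, as documented above:

```python
# Illustrative: resolve a --judge value into (provider, model).
def parse_judge(spec: str) -> tuple[str, str]:
    """Split a judge spec; bare model names default to the gemini provider."""
    if "/" in spec:
        provider, model = spec.split("/", 1)  # split once: model may contain ":"
        return provider, model
    return "gemini", spec  # Format 1: model name only

print(parse_judge("gemini-2.5-flash"))      # ('gemini', 'gemini-2.5-flash')
print(parse_judge("ollama/qwen2.5-vl:7b"))  # ('ollama', 'qwen2.5-vl:7b')
```

Note the single split: Ollama model tags like `qwen2.5-vl:7b` keep their `:` suffix intact.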
## Gemini (Default)

Gemini is the default judge: fast, affordable, and easy to set up.

```bash
# These are equivalent (gemini is the default provider)
evaly bench -m flux-schnell -p "A cat"
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-flash

# Use Gemini Pro for higher-quality scoring
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-pro
```

Gemini requires only the `GEMINI_API_KEY` env var.
## fal.ai (One Key for Everything)

If you already have a `FAL_KEY` for image generation, you can use it for judging too; there is no need for separate Gemini, OpenAI, or Anthropic API keys. fal.ai routes requests to multiple VLM providers via OpenRouter.

```bash
export FAL_KEY=your_fal_key

# Use Gemini via fal.ai
evaly bench -m flux-schnell -p "A cat" -j fal/gemini-2.5-flash

# Use GPT-5.2 via fal.ai
evaly bench -m flux-schnell -p "A cat" -j fal/gpt-5.2

# Use Claude via fal.ai
evaly bench -m flux-schnell -p "A cat" -j fal/claude-sonnet-4-6
```
### Available models via fal.ai

| Judge | Routes to | Cost/image |
|---|---|---|
| `fal/gemini-2.5-flash` | Google Gemini 2.5 Flash | ~$0.0006 |
| `fal/gemini-2.5-pro` | Google Gemini 2.5 Pro | ~$0.003 |
| `fal/gemini-3-flash` | Google Gemini 3 Flash | ~$0.0006 |
| `fal/gpt-5.2` | OpenAI GPT-5.2 | ~$0.0025 |
| `fal/gpt-4o` | OpenAI GPT-4o | ~$0.002 |
| `fal/gpt-4o-mini` | OpenAI GPT-4o Mini | ~$0.0003 |
| `fal/claude-sonnet-4-6` | Anthropic Claude Sonnet 4.6 | ~$0.004 |
| `fal/claude-haiku-4-5` | Anthropic Claude Haiku 4.5 | ~$0.001 |
| `fal/claude-opus-4-6` | Anthropic Claude Opus 4.6 | ~$0.02 |
Get a `FAL_KEY` at fal.ai/dashboard/keys ($10 free credit). It handles both image generation and VLM judging; no other API keys are needed.
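The per-image costs in the table make it easy to estimate a judging budget before running a benchmark. A quick sketch (the dictionary below copies a few approximate figures from the table; it is not an Evalytic API):

```python
# Approximate judging cost per image, taken from the fal.ai table above.
COST_PER_IMAGE = {
    "fal/gemini-2.5-flash": 0.0006,
    "fal/gpt-5.2": 0.0025,
    "fal/claude-sonnet-4-6": 0.004,
}

def judging_cost(judge: str, n_images: int) -> float:
    """Estimated total judging cost in USD for a benchmark run."""
    return COST_PER_IMAGE[judge] * n_images

# Judging 500 images with fal/gemini-2.5-flash costs roughly $0.30.
print(f"${judging_cost('fal/gemini-2.5-flash', 500):.2f}")
```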
## OpenAI

```bash
export OPENAI_API_KEY=sk-...
evaly bench -m flux-schnell -p "A cat" -j openai/gpt-5.2
```

## Anthropic

```bash
export ANTHROPIC_API_KEY=sk-ant-...
evaly bench -m flux-schnell -p "A cat" -j anthropic/claude-sonnet-4-6
```
## Ollama (Local)

Run judges entirely on your machine with no API costs:

```bash
# First, start Ollama and pull a vision model
ollama pull qwen2.5-vl:7b

# Use it as the judge
evaly bench -m flux-schnell -p "A cat" -j ollama/qwen2.5-vl:7b
```

Evalytic connects to Ollama at localhost:11434 by default. Quality may vary compared to cloud judges like Gemini or GPT-5.2.
## Custom Judge URL

Use `--judge-url` to point to any OpenAI-compatible API:

```bash
evaly bench \
  -m flux-schnell \
  -p "A cat" \
  -j local/my-model \
  --judge-url http://my-server:8080/v1
```
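Before pointing Evalytic at a custom endpoint, it can help to confirm the server actually speaks the OpenAI-compatible API. A small sketch (not part of Evalytic; the server address is a placeholder) that lists the models an endpoint exposes via the standard `/models` route:

```python
# Illustrative: sanity-check an OpenAI-compatible --judge-url endpoint.
import json
import urllib.request

def models_url(base_url: str) -> str:
    """Build the /models route from an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"

def list_models(base_url: str) -> list[str]:
    """Return the model IDs the endpoint reports."""
    with urllib.request.urlopen(models_url(base_url)) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]

# Example (requires a running server at this address):
# print(list_models("http://my-server:8080/v1"))
```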
## Choosing a Judge

| Judge | Quality | Speed | Cost | Best For |
|---|---|---|---|---|
| `gemini-2.5-flash` | Good | Fast | Low | Default, development, small benchmarks |
| `gemini-2.5-pro` | Excellent | Medium | Paid | High-stakes evaluations, publication |
| `openai/gpt-5.2` | Excellent | Medium | Paid | Cross-validation with a different VLM |
| `anthropic/claude-sonnet-4-6` | Excellent | Medium | Paid | Cross-validation with a different VLM |
| `fal/gemini-2.5-flash` | Good | Fast | Low | Single-key setup, same quality as direct Gemini |
| `fal/gpt-5.2` | Excellent | Medium | Paid | Cross-validation without a separate OpenAI key |
| `ollama/qwen2.5-vl:7b` | Moderate | Slow | Free (local) | Privacy-sensitive, offline, experimentation |
## Consensus Mode

A single VLM judge can be biased, showing high-score clustering or dimension-specific inconsistency. Consensus mode uses 2–3 judges with an adaptive 2+1 algorithm for more reliable scores.

### How it works

For each (image, dimension) pair:

- Two primary judges score in parallel.
- If they agree (within 0.5 points), their scores are averaged and marked as `high` agreement.
- If they disagree (>0.5 difference), a third judge breaks the tie with a median, marked as `disputed`.
- If one judge fails, the other judge's score is used, marked as `degraded`.

This keeps cost at ~2.3x instead of 3x, since the tiebreaker is only called for disputed dimensions.
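The steps above can be sketched in a few lines — an illustrative rendering of the 2+1 algorithm, not Evalytic's actual source. Each judge is modeled as a callable that returns a numeric score or raises on failure:

```python
# Illustrative sketch of adaptive 2+1 consensus for one (image, dimension) pair.
from statistics import median

def consensus(judge_a, judge_b, tiebreaker=None, threshold=0.5):
    """Return (score, agreement_label) for one (image, dimension) pair."""
    scores = []
    for judge in (judge_a, judge_b):
        try:
            scores.append(judge())
        except Exception:
            pass  # a failed judge is simply skipped
    if not scores:
        raise RuntimeError("both primary judges failed")
    if len(scores) == 1:
        return scores[0], "degraded"        # one primary failed
    a, b = scores
    if abs(a - b) <= threshold:
        return (a + b) / 2, "high"          # primaries agree: average
    if tiebreaker is not None:
        return median([a, b, tiebreaker()]), "disputed"  # third judge breaks tie
    return (a + b) / 2, "disputed"          # 2-judge mode: average, flag dispute

# The tiebreaker runs only on disputes, so expected cost is 2 calls plus
# one extra per disputed dimension; a ~30% dispute rate gives ~2.3x, not 3x.
```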
### Usage

```bash
# 2 judges: average when they agree, flag disputes
evaly bench -m flux-schnell -p "A cat" \
  --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges: disputed dimensions get a tiebreaker (median)
evaly bench -m flux-schnell -p prompts.json \
  --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"

# Consensus with a single FAL_KEY: no other API keys needed
evaly bench -m flux-schnell -p "A cat" \
  --judges "fal/gemini-2.5-flash,fal/gpt-5.2"
```
### Recommended combinations

| Combination | Cost | Best For |
|---|---|---|
| `gemini-2.5-flash`, `gpt-5.2` | Low | Budget consensus: two cheap judges, no tiebreaker |
| `gemini-2.5-flash`, `gpt-5.2`, `claude-haiku-4-5` | Low–Medium | Full consensus with tiebreaker, 3 different providers |
| `gemini-2.5-pro`, `gpt-5.2` | Medium | High-quality consensus, two strong judges |
| `fal/gemini-2.5-flash`, `fal/gpt-5.2` | Low | Single `FAL_KEY` consensus, no other API keys needed |
| `fal/gemini-2.5-flash`, `fal/gpt-5.2`, `fal/claude-haiku-4-5` | Low–Medium | Full 3-provider consensus with a single `FAL_KEY` |
### Report output

In consensus mode, reports include:

- An **Agree** column in the model comparison table: the percentage of dimensions with high agreement
- Per-judge scores in the score details: which judge gave what score
- Agreement badges (`high`, `disputed`, `degraded`) per dimension
- Per-provider costs in the cost summary
### Config file

```toml
# evalytic.toml
[bench]
judges = ["gemini-2.5-flash", "gpt-5.2", "claude-haiku-4-5"]
```

Consensus also works with `fal/`-prefixed judges: with `fal/gemini-2.5-flash` + `fal/gpt-5.2`, only `FAL_KEY` is needed.
## Configuration

Set a default judge in `evalytic.toml`:

```toml
# evalytic.toml: single judge
[bench]
judge = "openai/gpt-5.2"

# Or consensus mode (overrides judge)
# judges = ["gemini-2.5-flash", "gpt-5.2"]
```

CLI flags always override config file settings. See Configuration for the full precedence order.