# Judges
Configure VLM judges: Gemini, GPT-5.2, Claude, Ollama, fal.ai, or custom.
Evalytic uses Vision-Language Models (VLMs) as judges to score AI-generated images.
Any VLM that can analyze images and output structured JSON can serve as a judge.
The default judge is `gemini-2.5-flash`.

A single `FAL_KEY` covers both image generation and judging: `fal/`-prefixed judges (e.g. `fal/gemini-2.5-flash`) give access to Gemini, GPT-5.2, and Claude through one API key.
## Supported Providers

| Provider | Models | API Key Env | Base URL |
|---|---|---|---|
| Gemini | `gemini-2.5-flash`, `gemini-2.5-pro` | `GEMINI_API_KEY` | generativelanguage.googleapis.com |
| OpenAI | `openai/gpt-5.2` | `OPENAI_API_KEY` | api.openai.com |
| Anthropic | `anthropic/claude-sonnet-4-6`, `anthropic/claude-haiku-4-5` | `ANTHROPIC_API_KEY` | api.anthropic.com |
| fal.ai | `fal/gemini-2.5-flash`, `fal/gpt-5.2`, `fal/claude-sonnet-4-6` | `FAL_KEY` | fal.run (OpenRouter) |
| Ollama | `ollama/qwen2.5-vl:7b` | None (local) | localhost:11434 |
| LM Studio | `lmstudio/<model>` | None (local) | localhost:1234 |
| Custom | `local/<model>` | None | localhost:8090 (or `--judge-url`) |
## Judge Format

The `--judge` flag accepts two formats:

```bash
# Format 1: model name only (Gemini assumed)
--judge gemini-2.5-flash

# Format 2: provider/model
--judge openai/gpt-5.2
--judge anthropic/claude-sonnet-4-6
--judge fal/gemini-2.5-flash
--judge fal/gpt-5.2
--judge fal/claude-sonnet-4-6
--judge ollama/qwen2.5-vl:7b
--judge lmstudio/my-model
--judge local/my-model
```
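A minimal sketch of how the two formats map to a (provider, model) pair — illustrative only, not Evalytic's actual internals; it assumes Gemini is the default when no provider prefix is given, as documented above:

```python
# Illustrative: resolve a --judge value into (provider, model).
def parse_judge(spec: str) -> tuple[str, str]:
    """Split a judge spec; bare model names default to the gemini provider."""
    if "/" in spec:
        provider, model = spec.split("/", 1)  # split once: model may contain ":"
        return provider, model
    return "gemini", spec  # Format 1: model name only

print(parse_judge("gemini-2.5-flash"))      # ('gemini', 'gemini-2.5-flash')
print(parse_judge("ollama/qwen2.5-vl:7b"))  # ('ollama', 'qwen2.5-vl:7b')
```

Note the single split: Ollama model tags like `qwen2.5-vl:7b` keep their `:` suffix intact.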
## Gemini (Default)

Gemini is the default judge: fast, affordable, and easy to set up.

```bash
# These are equivalent (gemini is the default provider)
evaly bench -m flux-schnell -p "A cat"
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-flash

# Use Gemini Pro for higher-quality scoring
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-pro
```

Gemini requires only the `GEMINI_API_KEY` env var.
## fal.ai (One Key for Everything)

If you already have a `FAL_KEY` for image generation, you can use it for judging too; there is no need for separate Gemini, OpenAI, or Anthropic API keys. fal.ai routes requests to multiple VLM providers via OpenRouter.

```bash
export FAL_KEY=your_fal_key

# Use Gemini via fal.ai
evaly bench -m flux-schnell -p "A cat" -j fal/gemini-2.5-flash

# Use GPT-5.2 via fal.ai
evaly bench -m flux-schnell -p "A cat" -j fal/gpt-5.2

# Use Claude via fal.ai
evaly bench -m flux-schnell -p "A cat" -j fal/claude-sonnet-4-6
```
### Available models via fal.ai

| Judge | Routes to | Cost/image |
|---|---|---|
| `fal/gemini-2.5-flash` | Google Gemini 2.5 Flash | ~$0.0006 |
| `fal/gemini-2.5-pro` | Google Gemini 2.5 Pro | ~$0.003 |
| `fal/gemini-3-flash` | Google Gemini 3 Flash | ~$0.0006 |
| `fal/gpt-5.2` | OpenAI GPT-5.2 | ~$0.0025 |
| `fal/gpt-4o` | OpenAI GPT-4o | ~$0.002 |
| `fal/gpt-4o-mini` | OpenAI GPT-4o Mini | ~$0.0003 |
| `fal/claude-sonnet-4-6` | Anthropic Claude Sonnet 4.6 | ~$0.004 |
| `fal/claude-haiku-4-5` | Anthropic Claude Haiku 4.5 | ~$0.001 |
| `fal/claude-opus-4-6` | Anthropic Claude Opus 4.6 | ~$0.02 |
Get a `FAL_KEY` at fal.ai/dashboard/keys ($10 free credit). It handles both image generation and VLM judging; no other API keys are needed.
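The per-image costs in the table make it easy to estimate a judging budget before running a benchmark. A quick sketch (the dictionary below copies a few approximate figures from the table; it is not an Evalytic API):

```python
# Approximate judging cost per image, taken from the fal.ai table above.
COST_PER_IMAGE = {
    "fal/gemini-2.5-flash": 0.0006,
    "fal/gpt-5.2": 0.0025,
    "fal/claude-sonnet-4-6": 0.004,
}

def judging_cost(judge: str, n_images: int) -> float:
    """Estimated total judging cost in USD for a benchmark run."""
    return COST_PER_IMAGE[judge] * n_images

# Judging 500 images with fal/gemini-2.5-flash costs roughly $0.30.
print(f"${judging_cost('fal/gemini-2.5-flash', 500):.2f}")
```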
## OpenAI

```bash
export OPENAI_API_KEY=sk-...
evaly bench -m flux-schnell -p "A cat" -j openai/gpt-5.2
```

## Anthropic

```bash
export ANTHROPIC_API_KEY=sk-ant-...
evaly bench -m flux-schnell -p "A cat" -j anthropic/claude-sonnet-4-6
```
## Ollama (Local)

Run judges entirely on your machine with no API costs:

```bash
# First, start Ollama and pull a vision model
ollama pull qwen2.5-vl:7b

# Use it as the judge
evaly bench -m flux-schnell -p "A cat" -j ollama/qwen2.5-vl:7b
```

Evalytic connects to Ollama at localhost:11434 by default. Quality may vary compared to cloud judges like Gemini or GPT-5.2.
## Custom Judge URL

Use `--judge-url` to point to any OpenAI-compatible API:

```bash
evaly bench \
  -m flux-schnell \
  -p "A cat" \
  -j local/my-model \
  --judge-url http://my-server:8080/v1
```
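Before pointing Evalytic at a custom endpoint, it can help to confirm the server actually speaks the OpenAI-compatible API. A small sketch (not part of Evalytic; the server address is a placeholder) that lists the models an endpoint exposes via the standard `/models` route:

```python
# Illustrative: sanity-check an OpenAI-compatible --judge-url endpoint.
import json
import urllib.request

def models_url(base_url: str) -> str:
    """Build the /models route from an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"

def list_models(base_url: str) -> list[str]:
    """Return the model IDs the endpoint reports."""
    with urllib.request.urlopen(models_url(base_url)) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]

# Example (requires a running server at this address):
# print(list_models("http://my-server:8080/v1"))
```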
## Choosing a Judge

| Judge | Quality | Speed | Cost | Best For |
|---|---|---|---|---|
| `gemini-2.5-flash` | Good | Fast | Low | Default, development, small benchmarks |
| `gemini-2.5-pro` | Excellent | Medium | Paid | High-stakes evaluations, publication |
| `openai/gpt-5.2` | Excellent | Medium | Paid | Cross-validation with a different VLM |
| `anthropic/claude-sonnet-4-6` | Excellent | Medium | Paid | Cross-validation with a different VLM |
| `fal/gemini-2.5-flash` | Good | Fast | Low | Single-key setup, same quality as direct Gemini |
| `fal/gpt-5.2` | Excellent | Medium | Paid | Cross-validation without a separate OpenAI key |
| `ollama/qwen2.5-vl:7b` | Moderate | Slow | Free (local) | Privacy-sensitive, offline, experimentation |
## Consensus Mode

A single VLM judge can be biased, showing high-score clustering or dimension-specific inconsistency. Consensus mode uses 2–3 judges with an adaptive 2+1 algorithm for more reliable scores.

### How it works

For each (image, dimension) pair:

- Two primary judges score in parallel.
- If they agree (within 0.5 points), their scores are averaged and marked as `high` agreement.
- If they disagree (>0.5 difference), a third judge breaks the tie with a median, marked as `disputed`.
- If one judge fails, the other judge's score is used, marked as `degraded`.

This keeps cost at ~2.3x instead of 3x, since the tiebreaker is only called for disputed dimensions.
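The steps above can be sketched in a few lines — an illustrative rendering of the 2+1 algorithm, not Evalytic's actual source. Each judge is modeled as a callable that returns a numeric score or raises on failure:

```python
# Illustrative sketch of adaptive 2+1 consensus for one (image, dimension) pair.
from statistics import median

def consensus(judge_a, judge_b, tiebreaker=None, threshold=0.5):
    """Return (score, agreement_label) for one (image, dimension) pair."""
    scores = []
    for judge in (judge_a, judge_b):
        try:
            scores.append(judge())
        except Exception:
            pass  # a failed judge is simply skipped
    if not scores:
        raise RuntimeError("both primary judges failed")
    if len(scores) == 1:
        return scores[0], "degraded"        # one primary failed
    a, b = scores
    if abs(a - b) <= threshold:
        return (a + b) / 2, "high"          # primaries agree: average
    if tiebreaker is not None:
        return median([a, b, tiebreaker()]), "disputed"  # third judge breaks tie
    return (a + b) / 2, "disputed"          # 2-judge mode: average, flag dispute

# The tiebreaker runs only on disputes, so expected cost is 2 calls plus
# one extra per disputed dimension; a ~30% dispute rate gives ~2.3x, not 3x.
```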
### Usage

```bash
# 2 judges: average when they agree, flag disputes
evaly bench -m flux-schnell -p "A cat" \
  --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges: disputed dimensions get a tiebreaker (median)
evaly bench -m flux-schnell -p prompts.json \
  --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"

# Consensus with a single FAL_KEY: no other API keys needed
evaly bench -m flux-schnell -p "A cat" \
  --judges "fal/gemini-2.5-flash,fal/gpt-5.2"
```
### Recommended combinations

| Combination | Cost | Best For |
|---|---|---|
| `gemini-2.5-flash`, `gpt-5.2` | Low | Budget consensus: two cheap judges, no tiebreaker |
| `gemini-2.5-flash`, `gpt-5.2`, `claude-haiku-4-5` | Low–Medium | Full consensus with tiebreaker, 3 different providers |
| `gemini-2.5-pro`, `gpt-5.2` | Medium | High-quality consensus, two strong judges |
| `fal/gemini-2.5-flash`, `fal/gpt-5.2` | Low | Single `FAL_KEY` consensus, no other API keys needed |
| `fal/gemini-2.5-flash`, `fal/gpt-5.2`, `fal/claude-haiku-4-5` | Low–Medium | Full 3-provider consensus with a single `FAL_KEY` |
### Report output

In consensus mode, reports include:

- An **Agree** column in the model comparison table: the percentage of dimensions with high agreement
- Per-judge scores in the score details: which judge gave what score
- Agreement badges (`high`, `disputed`, `degraded`) per dimension
- Per-provider costs in the cost summary
### Config file

```toml
# evalytic.toml
[bench]
judges = ["gemini-2.5-flash", "gpt-5.2", "claude-haiku-4-5"]
```

Consensus also works with `fal/`-prefixed judges: with `fal/gemini-2.5-flash` + `fal/gpt-5.2`, only `FAL_KEY` is needed.
## Configuration

Set a default judge in `evalytic.toml`:

```toml
# evalytic.toml: single judge
[bench]
judge = "openai/gpt-5.2"

# Or consensus mode (overrides judge)
# judges = ["gemini-2.5-flash", "gpt-5.2"]
```

CLI flags always override config file settings. See Configuration for the full precedence order.