evaly bench

One-command model benchmarking: generate, score, and report.

evaly bench [OPTIONS]

The bench command is Evalytic's primary workflow. It generates images from prompts using fal.ai models, scores them with a VLM judge, and outputs rich reports. Everything in a single command.

Smart defaults: If you omit --models, Evalytic offers to run a demo with flux-schnell. If you omit --prompts, a demo prompt is used automatically. This makes the simplest possible invocation just evaly bench -y.

Basic Usage

# Quickest start: uses demo model + demo prompt
evaly bench -y

# Single model, inline prompt
evaly bench -m flux-schnell -p "A cat on a windowsill" --yes

# Multiple models, prompts from file
evaly bench -m flux-schnell -m flux-dev -m flux-pro -p prompts.json

# With HTML report and browser review
evaly bench -m flux-schnell flux-dev -p prompts.json -o report.html --review

# Disable auto-enabled metrics
evaly bench -m flux-schnell -p "Hello World" --no-metrics

Options

Running evaly bench --help groups options into Essential and Advanced sections. Most workflows only need the essential ones.

Essential Options

| Flag | Type | Description |
| --- | --- | --- |
| --models, -m | TEXT (multiple) | Model short names or fal.ai endpoints. Omit for interactive demo. |
| --prompts, -p | TEXT | JSON file path or inline prompt string. Omit for demo prompt. |
| --inputs, -i | TEXT | JSON file with img2img inputs (for transformation benchmarks). |
| --images | TEXT | JSON file with pre-existing images (skip generation, score-only). |
| --dataset | TEXT | Path to dataset file (enriched prompts with metadata/expected scores). |
| --check-expected | FLAG | Compare results against expected scores in dataset. |
| --output, -o | TEXT (multiple) | Output file paths. Supports .json and .html. |
| --output-dir | TEXT | Output directory. Creates a timestamped subfolder with report.html, report.json, and errors.log. |
| --yes, -y | FLAG | Skip cost confirmation prompt. |
| --review | FLAG | Open browser review server after scoring. |
| --judge, -j | TEXT | VLM judge (default: gemini-2.5-flash). Format: model or provider/model. |
| --judges | TEXT | Comma-separated judges for consensus mode (e.g. "gemini-2.5-flash,gpt-5.2"). Min 2, max 3. |
| --list-models | FLAG | Print the model registry and exit. |
See Judges for the full list of supported providers and judge format examples.

Advanced Options

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --judge-url | TEXT | | Custom judge API base URL (overrides provider default). |
| --dimensions, -d | TEXT (multiple) | auto | Quality dimensions to score. Auto-detected from context if omitted. |
| --concurrency | INT | 4 | Max parallel generation requests. |
| --image-size | TEXT | | Image size (e.g., landscape_16_9, square_hd). |
| --seed | INT | | Fixed seed for reproducible generation. |
| --fal-params | TEXT | | Path to JSON file with additional fal.ai parameters. |
| --cache-dir | TEXT | | Local image cache directory (skip re-generation). |
| --timeout | INT | 300 | Max seconds per generation request. |
| --name | TEXT | | Human-readable name for this bench run. |
| --quiet, -q | FLAG | | Suppress progress bars. |
| --no-terminal | FLAG | | Suppress Rich terminal output. |
| --review-port | INT | 3847 | Port for review server. |
| --metrics | TEXT (multiple) | auto | Local metrics to compute: clip, lpips, face. Auto-detected by pipeline when evalytic[metrics] is installed. |
| --no-metrics | FLAG | | Disable automatic metrics (CLIP/LPIPS). |
| --clip-threshold | FLOAT | 0.18 | CLIP score flag threshold. |
| --clip-weight | FLOAT | 0.20 | CLIP weight in overall score. |
| --lpips-threshold | FLOAT | 0.40 | LPIPS flag threshold. |
| --lpips-weight | FLOAT | 0.20 | LPIPS weight in overall score. |
| --face-threshold | FLOAT | 0.60 | Face similarity flag threshold (img2img only). |
| --face-weight | FLOAT | 0.20 | Face metric weight in overall score. |
| --no-metric-scoring | FLAG | | Show metrics but exclude from overall score. |
Metrics are auto-enabled when evalytic[metrics] is installed: CLIP for text2img, LPIPS for img2img. Install with pip install "evalytic[metrics]" (~2GB). Use --no-metrics to disable.
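The threshold flags above mark suspicious results rather than changing scores. A minimal sketch of that flagging logic, assuming the usual metric directions (low CLIP similarity, high LPIPS distance, and low face similarity are the bad cases); this is an illustration, not Evalytic's implementation:

```python
def flag_metrics(clip=None, lpips=None, face=None,
                 clip_threshold=0.18, lpips_threshold=0.40, face_threshold=0.60):
    """Return the names of metrics whose values cross their flag thresholds.

    Assumed directions: CLIP and face similarity are flagged when too LOW
    (weak prompt/identity match); LPIPS is a distance, so it is flagged
    when too HIGH.
    """
    flags = []
    if clip is not None and clip < clip_threshold:
        flags.append("clip")
    if lpips is not None and lpips > lpips_threshold:
        flags.append("lpips")
    if face is not None and face < face_threshold:
        flags.append("face")
    return flags
```

Overriding a flag such as --clip-threshold simply moves the corresponding cutoff.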

Smart Defaults

To minimize the barrier to your first benchmark, bench can run with almost no arguments:

No models specified

If you omit --models in an interactive terminal, Evalytic asks if you want a demo run with flux-schnell. You can also set default models in evalytic.toml:

# evalytic.toml
[bench]
models = ["flux-schnell", "flux-dev"]

No prompts specified

If you omit --prompts (and --inputs/--images), a demo prompt is used automatically: "A cat sitting on a windowsill at sunset". You can also set default prompts in evalytic.toml:

# evalytic.toml
[bench]
prompts = "prompts.json"

Prompts Format

Prompts can be provided as an inline string or as a JSON file:

// prompts.json — array of strings
[
  "A photorealistic cat on a windowsill at sunset",
  "A modern minimalist logo for 'ACME Corp'",
  "Product photo: white sneakers on marble"
]
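When prompts come from another system, a file in this format can be produced with the standard library (a sketch; the filename is your choice):

```python
import json

prompts = [
    "A photorealistic cat on a windowsill at sunset",
    "A modern minimalist logo for 'ACME Corp'",
]

# bench expects a JSON array of strings, so json.dump on a list is enough.
with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```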

Dataset Mode

Use --dataset to run a benchmark from a dataset file instead of plain prompts. Datasets can include metadata and expected scores for regression detection. The --dataset flag cannot be combined with --prompts or --inputs.

# Run bench from a dataset file
evaly bench -m flux-schnell --dataset golden.json -y

Add --check-expected to compare actual scores against the expected values defined in the dataset. Dimensions that fall below expected are flagged in the terminal output:

# Check results against expected scores
evaly bench -m flux-schnell --dataset golden.json --check-expected -y
See evaly dataset for how to create datasets, add items, and generate golden test sets from bench reports.
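A dataset file pairs each prompt with metadata and expected per-dimension scores. The exact schema is documented under evaly dataset; the field names below are only an illustration of the shape, not the authoritative format:

```json
{
  "name": "golden",
  "items": [
    {
      "prompt": "A photorealistic cat on a windowsill at sunset",
      "metadata": { "category": "animals" },
      "expected": { "visual_quality": 8.0, "prompt_adherence": 8.5 }
    }
  ]
}
```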

Model Registry

Run evaly bench --list-models to see all registered models. Common ones:

| Short Name | fal.ai Endpoint | Cost/image |
| --- | --- | --- |
| flux-schnell | fal-ai/flux/schnell | ~$0.003 |
| flux-dev | fal-ai/flux/dev | ~$0.025 |
| flux-pro | fal-ai/flux-pro/v1.1 | ~$0.05 |
| flux-pro-ultra | fal-ai/flux-pro/v1.1-ultra | ~$0.06 |
| kontext | fal-ai/flux-pro/kontext | ~$0.04 |

You can also pass full fal.ai endpoint paths directly:

evaly bench -m fal-ai/flux-realism -p "A portrait photo"

Output Formats

Terminal (always)

Rich-formatted table with per-model, per-dimension scores. Suppressed with --no-terminal.

JSON

Machine-readable report with all scores, metadata, and cost breakdown. Used for CI/CD integration with evaly gate.

evaly bench -m flux-schnell -p "A cat" -o report.json --yes
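For a hand-rolled CI check without evaly gate, the JSON report can be parsed directly. The schema here is assumed (a top-level models list with per-model overall scores, hypothetical field names); check your own report.json for the real structure:

```python
import json

def failing_models(report, min_score=7.0):
    """Return names of models whose (hypothetical) overall_score is below min_score."""
    return [m["name"] for m in report.get("models", [])
            if m.get("overall_score", 0.0) < min_score]

# Sample report shape, for illustration only.
report = {"models": [{"name": "flux-schnell", "overall_score": 7.8},
                     {"name": "flux-dev", "overall_score": 6.4}]}
assert failing_models(report) == ["flux-dev"]
```

In a pipeline you would `json.load` the real report.json and fail the build when the list is non-empty.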

HTML

Interactive report with image grids, score charts, and judge reasoning.

evaly bench -m flux-schnell flux-dev -p prompts.json -o report.html --review

Output Directory

Use --output-dir to write all outputs to a timestamped subfolder. Each run creates a directory like bench-abc123_20260227-143000/ containing report.html, report.json, and errors.log (only when errors occurred).

evaly bench -m flux-schnell flux-dev -p prompts.json --output-dir ./reports -y

# Creates: reports/bench-abc123_20260227-143000/
#   report.html
#   report.json
#   errors.log    (only if errors occurred)

--output-dir works alongside --output — both can be used in the same command. You can also set a default in evalytic.toml:

# evalytic.toml
[bench]
output_dir = "./reports"

Error Log

When generation or scoring errors occur (API timeouts, content policy violations, judge parse failures), they are collected into a structured errors.log file inside the output directory. The terminal also shows a brief error summary at the end of each run.

Examples

Compare 3 models with JSON + HTML output

evaly bench \
    -m flux-schnell -m flux-dev -m flux-pro \
    -p prompts.json \
    -o report.json -o report.html \
    --review

Text rendering evaluation

evaly bench \
    -m flux-pro \
    -p "A sign that says 'OPEN 24/7'" \
    -d text_rendering -d visual_quality

Reproducible benchmark with seed

evaly bench \
    -m flux-dev \
    -p prompts.json \
    --seed 42 \
    --image-size landscape_16_9

Image-to-image benchmark

# Compare img2img models with face identity metric
evaly bench \
    -m flux-kontext -m seedream-edit -m reve-edit \
    -i inputs.json \
    -d identity_preservation \
    --metrics face \
    -o report.html --review
--metrics face computes ArcFace embedding cosine similarity between input and output images. Requires pip install "evalytic[metrics]". Automatically skipped when no face is detected in either image.
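--inputs points at a JSON file describing the source images to transform. The field names below are illustrative only (the real schema may differ); the idea is one entry per input image plus its edit instruction:

```json
[
  {
    "image_url": "https://example.com/portrait.jpg",
    "prompt": "Turn the portrait into a watercolor painting"
  }
]
```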

Use GPT-5.2 as judge instead of Gemini

evaly bench \
    -m flux-schnell \
    -p "A sunset over the ocean" \
    -j openai/gpt-5.2

Local judge with Ollama

evaly bench \
    -m flux-schnell \
    -p "A cat" \
    -j ollama/qwen2.5-vl:7b

Consensus mode (multi-judge)

Use --judges with 2-3 comma-separated judge names for more reliable scores via adaptive consensus:

# 2 judges — average when they agree, flag disputes
evaly bench \
    -m flux-schnell \
    -p "A cat" \
    --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges — disputed dimensions get a tiebreaker (median)
evaly bench \
    -m flux-schnell flux-dev \
    -p prompts.json \
    --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"
Adaptive 2+1 algorithm: Two primary judges score in parallel. If they agree (within 0.5 points), the average is used. If they disagree, the third judge breaks the tie with a median. This keeps cost at ~2.3x instead of 3x. Reports show an "Agree" column indicating the percentage of high-agreement dimensions per model.
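The 2+1 flow for a single dimension can be sketched in a few lines of Python (illustrative only, not Evalytic's implementation; the tiebreaker is a callable so the third judge is invoked, and paid for, only on disagreement):

```python
from statistics import median

def consensus_score(judge_a, judge_b, tiebreaker, agreement=0.5):
    """Adaptive 2+1 consensus for one dimension.

    Returns (score, agreed): the average when the two primary judges are
    within `agreement` points of each other, otherwise the median of three
    scores after calling the tiebreaker judge.
    """
    if abs(judge_a - judge_b) <= agreement:
        return (judge_a + judge_b) / 2, True
    return median([judge_a, judge_b, tiebreaker()]), False

# Agreement: 8.0 and 8.5 are within 0.5 points, so the third judge is never called.
score, agreed = consensus_score(8.0, 8.5, tiebreaker=lambda: 0.0)
```

The "Agree" column in reports corresponds to the fraction of dimensions where the second return value is true.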