evaly bench
One-command model benchmarking: generate, score, and report.
evaly bench [OPTIONS]
The bench command is Evalytic's primary workflow. It generates images from prompts using fal.ai models,
scores them with a VLM judge, and outputs rich reports. Everything in a single command.
If you omit --models, Evalytic offers to run a demo with flux-schnell.
If you omit --prompts, a demo prompt is used automatically. This makes the simplest possible invocation just evaly bench -y.
Basic Usage
# Quickest start: uses demo model + demo prompt
evaly bench -y
# Single model, inline prompt
evaly bench -m flux-schnell -p "A cat on a windowsill" --yes
# Multiple models, prompts from file
evaly bench -m flux-schnell -m flux-dev -m flux-pro -p prompts.json
# With HTML report and browser review
evaly bench -m flux-schnell -m flux-dev -p prompts.json -o report.html --review
# Disable auto-enabled metrics
evaly bench -m flux-schnell -p "Hello World" --no-metrics
Options
Running evaly bench --help groups options into Essential and Advanced sections.
Most workflows only need the essential ones.
Essential Options
| Flag | Type | Description |
|---|---|---|
| --models, -m | TEXT (multiple) | Model short names or fal.ai endpoints. Omit for interactive demo. |
| --prompts, -p | TEXT | JSON file path or inline prompt string. Omit for demo prompt. |
| --inputs, -i | TEXT | JSON file with img2img inputs for transformation benchmarks (format sketched under the img2img example below). |
| --images | TEXT | JSON file with pre-existing images to score without generating (see the sketch after this table). |
| --dataset | TEXT | Path to dataset file (enriched prompts with metadata/expected scores). |
| --check-expected | FLAG | Compare results against expected scores in dataset. |
| --output, -o | TEXT (multiple) | Output file paths. Supports .json and .html. |
| --output-dir | TEXT | Output directory. Creates a timestamped subfolder with report.html, report.json, and errors.log. |
| --yes, -y | FLAG | Skip cost confirmation prompt. |
| --review | FLAG | Open browser review server after scoring. |
| --judge, -j | TEXT | VLM judge (default: gemini-2.5-flash). Format: model or provider/model. |
| --judges | TEXT | Comma-separated judges for consensus mode (e.g. "gemini-2.5-flash,gpt-5.2"). Min 2, max 3. |
| --list-models | FLAG | Print the model registry and exit. |
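The --images file format is not pinned down in this reference; here is a minimal sketch of one plausible shape, assuming each entry pairs a prompt with an already-generated image URL or local path (field names are illustrative, not a confirmed schema):
// images.json — hypothetical score-only input; field names are illustrative
[
  { "prompt": "A cat on a windowsill", "image": "https://example.com/cat.png" },
  { "prompt": "A modern minimalist logo", "image": "./outputs/logo.png" }
]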
Advanced Options
| Flag | Type | Default | Description |
|---|---|---|---|
| --judge-url | TEXT | — | Custom judge API base URL (overrides provider default). |
| --dimensions, -d | TEXT (multiple) | auto | Quality dimensions to score. Auto-detected from context if omitted. |
| --concurrency | INT | 4 | Max parallel generation requests. |
| --image-size | TEXT | — | Image size (e.g., landscape_16_9, square_hd). |
| --seed | INT | — | Fixed seed for reproducible generation. |
| --fal-params | TEXT | — | Path to JSON file with additional fal.ai parameters (see the sketch below the table). |
| --cache-dir | TEXT | — | Local image cache directory (skip re-generation). |
| --timeout | INT | 300 | Max seconds per generation request. |
| --name | TEXT | — | Human-readable name for this bench run. |
| --quiet, -q | FLAG | — | Suppress progress bars. |
| --no-terminal | FLAG | — | Suppress Rich terminal output. |
| --review-port | INT | 3847 | Port for review server. |
| --metrics | TEXT (multiple) | auto | Local metrics to compute: clip, lpips, face. Auto-detected by the pipeline when evalytic[metrics] is installed. |
| --no-metrics | FLAG | — | Disable automatic metrics (CLIP/LPIPS). |
| --clip-threshold | FLOAT | 0.18 | CLIP score flag threshold. |
| --clip-weight | FLOAT | 0.20 | CLIP weight in overall score. |
| --lpips-threshold | FLOAT | 0.40 | LPIPS flag threshold. |
| --lpips-weight | FLOAT | 0.20 | LPIPS weight in overall score. |
| --face-threshold | FLOAT | 0.60 | Face similarity flag threshold (img2img only). |
| --face-weight | FLOAT | 0.20 | Face metric weight in overall score. |
| --no-metric-scoring | FLAG | — | Show metrics but exclude from overall score. |
Metrics are enabled automatically when evalytic[metrics] is installed: CLIP for text2img, LPIPS for img2img. Install with pip install "evalytic[metrics]" (~2 GB). Use --no-metrics to disable.
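As a rough illustration of --fal-params, the file is forwarded to the fal.ai endpoint; accepted keys depend on the endpoint, and the two shown here (guidance_scale, num_inference_steps) are common fal.ai generation parameters used as an assumption, not a confirmed schema:
// fal-params.json — keys are endpoint-specific; values here are illustrative
{
  "guidance_scale": 3.5,
  "num_inference_steps": 28
}
# Pass it through on any bench run
evaly bench -m flux-dev -p "A cat" --fal-params fal-params.json -y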
Smart Defaults
To minimize the barrier to your first benchmark, bench can run with almost no arguments:
No models specified
If you omit --models in an interactive terminal, Evalytic asks if you want a demo run with flux-schnell.
You can also set default models in evalytic.toml:
# evalytic.toml
[bench]
models = ["flux-schnell", "flux-dev"]
No prompts specified
If you omit --prompts (and --inputs/--images), a demo prompt is used automatically:
"A cat sitting on a windowsill at sunset". You can also set default prompts in evalytic.toml:
# evalytic.toml
[bench]
prompts = "prompts.json"
Prompts Format
Prompts can be provided as an inline string or as a JSON file:
// prompts.json — array of strings
[
"A photorealistic cat on a windowsill at sunset",
"A modern minimalist logo for 'ACME Corp'",
"Product photo: white sneakers on marble"
]
Dataset Mode
Use --dataset to run a benchmark from a dataset file instead of
plain prompts. Datasets can include metadata and expected scores for regression detection.
The --dataset flag cannot be combined with --prompts or --inputs.
# Run bench from a dataset file
evaly bench -m flux-schnell --dataset golden.json -y
Add --check-expected to compare actual scores against the expected values defined in the dataset.
Dimensions that score below their expected values are flagged in the terminal output:
# Check results against expected scores
evaly bench -m flux-schnell --dataset golden.json --check-expected -y
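The dataset schema is not fully specified in this reference; a minimal sketch, assuming each entry carries a prompt plus optional metadata and expected per-dimension scores (field names are hypothetical):
// golden.json — hypothetical dataset entries; field names are illustrative
[
  {
    "prompt": "A photorealistic cat on a windowsill at sunset",
    "metadata": { "category": "photorealism" },
    "expected": { "visual_quality": 8.0, "prompt_adherence": 9.0 }
  }
]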
Model Registry
Run evaly bench --list-models to see all registered models. Common ones:
| Short Name | fal.ai Endpoint | Cost/image |
|---|---|---|
| flux-schnell | fal-ai/flux/schnell | ~$0.003 |
| flux-dev | fal-ai/flux/dev | ~$0.025 |
| flux-pro | fal-ai/flux-pro/v1.1 | ~$0.05 |
| flux-pro-ultra | fal-ai/flux-pro/v1.1-ultra | ~$0.06 |
| kontext | fal-ai/flux-pro/kontext | ~$0.04 |
You can also pass full fal.ai endpoint paths directly:
evaly bench -m fal-ai/flux-realism -p "A portrait photo"
Output Formats
Terminal (always)
Rich-formatted table with per-model, per-dimension scores. Suppressed with --no-terminal.
JSON
Machine-readable report with all scores, metadata, and cost breakdown. Used for CI/CD integration with evaly gate.
evaly bench -m flux-schnell -p "A cat" -o report.json --yes
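For a quick smoke check in CI you can inspect report.json with standard tools before wiring up evaly gate. The field path below is an assumption about the report shape, not a documented schema:
# Hypothetical: pull a score out of the JSON report (field names are illustrative)
jq '.models[0].overall_score' report.json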
HTML
Interactive report with image grids, score charts, and judge reasoning.
evaly bench -m flux-schnell -m flux-dev -p prompts.json -o report.html --review
Output Directory
Use --output-dir to write all outputs to a timestamped subfolder. Each run creates a directory like
bench-abc123_20260227-143000/ containing report.html, report.json, and
errors.log (only when errors occurred).
evaly bench -m flux-schnell -m flux-dev -p prompts.json --output-dir ./reports -y
# Creates: reports/bench-abc123_20260227-143000/
# report.html
# report.json
# errors.log (only if errors occurred)
--output-dir works alongside --output — both can be used in the same command.
You can also set a default in evalytic.toml:
# evalytic.toml
[bench]
output_dir = "./reports"
Error Log
When generation or scoring errors occur (API timeouts, content policy violations, judge parse failures),
they are collected into a structured errors.log file inside the output directory.
The terminal also shows a brief error summary at the end of each run.
Examples
Compare 3 models with JSON + HTML output
evaly bench \
-m flux-schnell -m flux-dev -m flux-pro \
-p prompts.json \
-o report.json -o report.html \
--review
Text rendering evaluation
evaly bench \
-m flux-pro \
-p "A sign that says 'OPEN 24/7'" \
-d text_rendering -d visual_quality
Reproducible benchmark with seed
evaly bench \
-m flux-dev \
-p prompts.json \
--seed 42 \
--image-size landscape_16_9
Image-to-image benchmark
# Compare img2img models with face identity metric
evaly bench \
-m flux-kontext -m seedream-edit -m reve-edit \
-i inputs.json \
-d identity_preservation \
--metrics face \
-o report.html --review
--metrics face computes ArcFace embedding cosine similarity between input and output images.
Requires pip install "evalytic[metrics]". Automatically skipped when no face is detected in either image.
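The example above reads img2img pairs from -i inputs.json. Its schema is not shown in this reference, but a minimal sketch might pair a source image with an edit instruction (field names are hypothetical):
// inputs.json — hypothetical img2img entries; field names are illustrative
[
  {
    "image": "./refs/portrait.png",
    "prompt": "Same person, wearing a red scarf"
  }
]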
Use GPT-5.2 as judge instead of Gemini
evaly bench \
-m flux-schnell \
-p "A sunset over the ocean" \
-j openai/gpt-5.2
Local judge with Ollama
evaly bench \
-m flux-schnell \
-p "A cat" \
-j ollama/qwen2.5-vl:7b
Consensus mode (multi-judge)
Use --judges with 2-3 comma-separated judge names for more reliable scores via adaptive consensus:
# 2 judges — average when they agree, flag disputes
evaly bench \
-m flux-schnell \
-p "A cat" \
--judges "gemini-2.5-flash,gpt-5.2"
# 3 judges — disputed dimensions get a tiebreaker (median)
evaly bench \
-m flux-schnell -m flux-dev \
-p prompts.json \
--judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"