evaly bench

One-command model benchmarking: generate, score, and report.

evaly bench [OPTIONS]

The bench command is Evalytic's primary workflow. It generates images from prompts using fal.ai models, scores them with a VLM judge, and outputs rich reports. Everything in a single command.

Smart defaults: If you omit --models, Evalytic offers to run a demo with flux-schnell. If you omit --prompts, a demo prompt is used automatically. This makes the simplest possible invocation just evaly bench -y.

Basic Usage

# Quickest start: uses demo model + demo prompt
evaly bench -y

# Single model, inline prompt
evaly bench -m flux-schnell -p "A cat on a windowsill" --yes

# Multiple models, prompts from file
evaly bench -m flux-schnell -m flux-dev -m flux-pro -p prompts.json

# With HTML report and browser review
evaly bench -m flux-schnell flux-dev -p prompts.json -o report.html --review

# Disable auto-enabled metrics
evaly bench -m flux-schnell -p "Hello World" --no-metrics

Options

Running evaly bench --help groups options into Essential and Advanced sections. Most workflows only need the essential ones.

Essential Options

| Flag | Type | Description |
| --- | --- | --- |
| --models, -m | TEXT (multiple) | Model short names or fal.ai endpoints. Omit for interactive demo. |
| --prompts, -p | TEXT | JSON file path or inline prompt string. Omit for demo prompt. |
| --inputs, -i | TEXT | JSON file with img2img inputs (for transformation benchmarks). |
| --images | TEXT | JSON file with pre-existing images (skip generation, score-only). |
| --dataset | TEXT | Path to dataset file (enriched prompts with metadata/expected scores). |
| --check-expected | FLAG | Compare results against expected scores in dataset. |
| --output, -o | TEXT (multiple) | Output file paths. Supports .json and .html. |
| --output-dir | TEXT | Output directory. Creates a timestamped subfolder with report.html, report.json, and errors.log. |
| --yes, -y | FLAG | Skip cost confirmation prompt. |
| --review | FLAG | Open browser review server after scoring. |
| --judge, -j | TEXT | VLM judge (default: gemini-2.5-flash). Format: model or provider/model. |
| --judges | TEXT | Comma-separated judges for consensus mode (e.g. "gemini-2.5-flash,gpt-5.2"). Min 2, max 3. |
| --list-models | FLAG | Print the model registry and exit. |
See Judges for the full list of supported providers and judge format examples.

Advanced Options

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --judge-url | TEXT | | Custom judge API base URL (overrides provider default). |
| --dimensions, -d | TEXT (multiple) | auto | Quality dimensions to score. Auto-detected from context if omitted. |
| --concurrency | INT | 4 | Max parallel generation requests. |
| --image-size | TEXT | | Image size (e.g., landscape_16_9, square_hd). |
| --seed | INT | | Fixed seed for reproducible generation. |
| --fal-params | TEXT | | Path to JSON file with additional fal.ai parameters. |
| --cache-dir | TEXT | | Local image cache directory (skip re-generation). |
| --timeout | INT | 300 | Max seconds per generation request. |
| --name | TEXT | | Human-readable name for this bench run. |
| --quiet, -q | FLAG | | Suppress progress bars. |
| --no-terminal | FLAG | | Suppress Rich terminal output. |
| --review-port | INT | 3847 | Port for review server. |
| --metrics | TEXT (multiple) | auto | Local metrics to compute: clip, lpips, face. Auto-detected by pipeline when evalytic[metrics] is installed. |
| --no-metrics | FLAG | | Disable automatic metrics (CLIP/LPIPS). |
| --clip-threshold | FLOAT | 0.18 | CLIP score flag threshold. |
| --clip-weight | FLOAT | 0.20 | CLIP weight in overall score. |
| --lpips-threshold | FLOAT | 0.40 | LPIPS flag threshold. |
| --lpips-weight | FLOAT | 0.20 | LPIPS weight in overall score. |
| --face-threshold | FLOAT | 0.60 | Face similarity flag threshold (img2img only). |
| --face-weight | FLOAT | 0.20 | Face metric weight in overall score. |
| --no-metric-scoring | FLAG | | Show metrics but exclude from overall score. |
Metrics are auto-enabled when evalytic[metrics] is installed: CLIP for text2img, LPIPS for img2img. Install with pip install "evalytic[metrics]" (~2GB). Use --no-metrics to disable.
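The threshold flags above mark suspicious results rather than changing scores. A minimal sketch of that flagging logic, assuming the usual metric directions (low CLIP similarity, high LPIPS distance, and low face similarity are the bad cases); this is an illustration, not Evalytic's implementation:

```python
def flag_metrics(clip=None, lpips=None, face=None,
                 clip_threshold=0.18, lpips_threshold=0.40, face_threshold=0.60):
    """Return the names of metrics whose values cross their flag thresholds.

    Assumed directions: CLIP and face similarity are flagged when too LOW
    (weak prompt/identity match); LPIPS is a distance, so it is flagged
    when too HIGH.
    """
    flags = []
    if clip is not None and clip < clip_threshold:
        flags.append("clip")
    if lpips is not None and lpips > lpips_threshold:
        flags.append("lpips")
    if face is not None and face < face_threshold:
        flags.append("face")
    return flags
```

Overriding a flag such as --clip-threshold simply moves the corresponding cutoff.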

Smart Defaults

To minimize the barrier to your first benchmark, bench can run with almost no arguments:

No models specified

If you omit --models in an interactive terminal, Evalytic asks if you want a demo run with flux-schnell. You can also set default models in evalytic.toml:

# evalytic.toml
[bench]
models = ["flux-schnell", "flux-dev"]

No prompts specified

If you omit --prompts (and --inputs/--images), a demo prompt is used automatically: "A cat sitting on a windowsill at sunset". You can also set default prompts in evalytic.toml:

# evalytic.toml
[bench]
prompts = "prompts.json"

Prompts Format

Prompts can be provided as an inline string or as a JSON file:

// prompts.json — array of strings
[
  "A photorealistic cat on a windowsill at sunset",
  "A modern minimalist logo for 'ACME Corp'",
  "Product photo: white sneakers on marble"
]
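When prompts come from another system, a file in this format can be produced with the standard library (a sketch; the filename is your choice):

```python
import json

prompts = [
    "A photorealistic cat on a windowsill at sunset",
    "A modern minimalist logo for 'ACME Corp'",
]

# bench expects a JSON array of strings, so json.dump on a list is enough.
with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```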

Dataset Mode

Use --dataset to run a benchmark from a dataset file instead of plain prompts. Datasets can include metadata and expected scores for regression detection. The --dataset flag cannot be combined with --prompts or --inputs.

# Run bench from a dataset file
evaly bench -m flux-schnell --dataset golden.json -y

Add --check-expected to compare actual scores against the expected values defined in the dataset. Dimensions that fall below expected are flagged in the terminal output:

# Check results against expected scores
evaly bench -m flux-schnell --dataset golden.json --check-expected -y
See evaly dataset for how to create datasets, add items, and generate golden test sets from bench reports.
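A dataset file pairs each prompt with metadata and expected per-dimension scores. The exact schema is documented under evaly dataset; the field names below are only an illustration of the shape, not the authoritative format:

```json
{
  "name": "golden",
  "items": [
    {
      "prompt": "A photorealistic cat on a windowsill at sunset",
      "metadata": { "category": "animals" },
      "expected": { "visual_quality": 8.0, "prompt_adherence": 8.5 }
    }
  ]
}
```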

Model Registry

Run evaly bench --list-models to see all registered models. Common ones:

| Short Name | fal.ai Endpoint | Cost/image |
| --- | --- | --- |
| flux-schnell | fal-ai/flux/schnell | ~$0.003 |
| flux-dev | fal-ai/flux/dev | ~$0.025 |
| flux-pro | fal-ai/flux-pro/v1.1 | ~$0.05 |
| flux-pro-ultra | fal-ai/flux-pro/v1.1-ultra | ~$0.06 |
| kontext | fal-ai/flux-pro/kontext | ~$0.04 |

You can also pass full fal.ai endpoint paths directly:

evaly bench -m fal-ai/flux-realism -p "A portrait photo"

Output Formats

Terminal (always)

Rich-formatted table with per-model, per-dimension scores. Suppressed with --no-terminal.

JSON

Machine-readable report with all scores, metadata, and cost breakdown. Used for CI/CD integration with evaly gate.

evaly bench -m flux-schnell -p "A cat" -o report.json --yes
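For a hand-rolled CI check without evaly gate, the JSON report can be parsed directly. The schema here is assumed (a top-level models list with per-model overall scores, hypothetical field names); check your own report.json for the real structure:

```python
import json

def failing_models(report, min_score=7.0):
    """Return names of models whose (hypothetical) overall_score is below min_score."""
    return [m["name"] for m in report.get("models", [])
            if m.get("overall_score", 0.0) < min_score]

# Sample report shape, for illustration only.
report = {"models": [{"name": "flux-schnell", "overall_score": 7.8},
                     {"name": "flux-dev", "overall_score": 6.4}]}
assert failing_models(report) == ["flux-dev"]
```

In a pipeline you would `json.load` the real report.json and fail the build when the list is non-empty.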

HTML

Interactive report with image grids, score charts, and judge reasoning.

evaly bench -m flux-schnell flux-dev -p prompts.json -o report.html --review

Output Directory

Use --output-dir to write all outputs to a timestamped subfolder. Each run creates a directory like bench-abc123_20260227-143000/ containing report.html, report.json, and errors.log (only when errors occurred).

evaly bench -m flux-schnell flux-dev -p prompts.json --output-dir ./reports -y

# Creates: reports/bench-abc123_20260227-143000/
#   report.html
#   report.json
#   errors.log    (only if errors occurred)

--output-dir works alongside --output — both can be used in the same command. You can also set a default in evalytic.toml:

# evalytic.toml
[bench]
output_dir = "./reports"

Error Log

When generation or scoring errors occur (API timeouts, content policy violations, judge parse failures), they are collected into a structured errors.log file inside the output directory. The terminal also shows a brief error summary at the end of each run.

Examples

Compare 3 models with JSON + HTML output

evaly bench \
    -m flux-schnell -m flux-dev -m flux-pro \
    -p prompts.json \
    -o report.json -o report.html \
    --review

Text rendering evaluation

evaly bench \
    -m flux-pro \
    -p "A sign that says 'OPEN 24/7'" \
    -d text_rendering -d visual_quality

Reproducible benchmark with seed

evaly bench \
    -m flux-dev \
    -p prompts.json \
    --seed 42 \
    --image-size landscape_16_9

Image-to-image benchmark

# Compare img2img models with face identity metric
evaly bench \
    -m flux-kontext -m seedream-edit -m reve-edit \
    -i inputs.json \
    -d identity_preservation \
    --metrics face \
    -o report.html --review
--metrics face computes ArcFace embedding cosine similarity between input and output images. Requires pip install "evalytic[metrics]". Automatically skipped when no face is detected in either image.
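--inputs points at a JSON file describing the source images to transform. The field names below are illustrative only (the real schema may differ); the idea is one entry per input image plus its edit instruction:

```json
[
  {
    "image_url": "https://example.com/portrait.jpg",
    "prompt": "Turn the portrait into a watercolor painting"
  }
]
```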

Use GPT-5.2 as judge instead of Gemini

evaly bench \
    -m flux-schnell \
    -p "A sunset over the ocean" \
    -j openai/gpt-5.2

Local judge with Ollama

evaly bench \
    -m flux-schnell \
    -p "A cat" \
    -j ollama/qwen2.5-vl:7b

Consensus mode (multi-judge)

Use --judges with 2-3 comma-separated judge names for more reliable scores via adaptive consensus:

# 2 judges — average when they agree, flag disputes
evaly bench \
    -m flux-schnell \
    -p "A cat" \
    --judges "gemini-2.5-flash,gpt-5.2"

# 3 judges — disputed dimensions get a tiebreaker (median)
evaly bench \
    -m flux-schnell flux-dev \
    -p prompts.json \
    --judges "gemini-2.5-flash,gpt-5.2,claude-haiku-4-5"
Adaptive 2+1 algorithm: Two primary judges score in parallel. If they agree (within 0.5 points), the average is used. If they disagree, the third judge breaks the tie with a median. This keeps cost at ~2.3x instead of 3x. Reports show an "Agree" column indicating the percentage of high-agreement dimensions per model.
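The 2+1 flow for a single dimension can be sketched in a few lines of Python (illustrative only, not Evalytic's implementation; the tiebreaker is a callable so the third judge is invoked, and paid for, only on disagreement):

```python
from statistics import median

def consensus_score(judge_a, judge_b, tiebreaker, agreement=0.5):
    """Adaptive 2+1 consensus for one dimension.

    Returns (score, agreed): the average when the two primary judges are
    within `agreement` points of each other, otherwise the median of three
    scores after calling the tiebreaker judge.
    """
    if abs(judge_a - judge_b) <= agreement:
        return (judge_a + judge_b) / 2, True
    return median([judge_a, judge_b, tiebreaker()]), False

# Agreement: 8.0 and 8.5 are within 0.5 points, so the third judge is never called.
score, agreed = consensus_score(8.0, 8.5, tiebreaker=lambda: 0.0)
```

The "Agree" column in reports corresponds to the fraction of dimensions where the second return value is true.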