# Quickstart
From zero to your first benchmark report in 3 minutes.
## Install

```shell
pip install evalytic
```

Requires Python 3.10+. Installs the CLI, VLM judge, and Rich terminal output (~5 MB).
## Setup

```shell
evaly init
```

The interactive wizard walks you through everything: it picks your use case (text2img or img2img), collects your API keys, validates them, and writes your .env and evalytic.toml config.
```
$ evaly init
? What do you want to evaluate? [text2img / img2img / both]
? Gemini API key: AIza...  validated
? fal.ai API key: fal_...  validated
.env written (2 keys)
evalytic.toml written
Ready! Run: evaly bench -y
```
## First Benchmark

```shell
evaly bench -y
```

Zero arguments needed. Smart defaults: generates one image with Flux Schnell, scores it with Gemini, and prints a terminal report. Done in ~15 seconds.
```
$ evaly bench -y
Evalytic Bench
Models: flux-schnell | Prompts: 1 | Dimensions: auto
Est. cost: ~$0.01
Generating... flux-schnell: 1/1
Scoring...    gemini-2.5-flash: 2/2

Rankings
┌────────────────┬─────────────────┬──────────────────┬─────────┐
│ Model          │ visual_quality  │ prompt_adherence │ Overall │
├────────────────┼─────────────────┼──────────────────┼─────────┤
│ flux-schnell   │ 4.5             │ 4.0              │ 4.2     │
└────────────────┴─────────────────┴──────────────────┴─────────┘
Cost: $0.004 gen + $0.000 judge = $0.004 total
```
Evalytic generated an image with Flux Schnell, scored it with Gemini, and printed the results — all from a single command.
## Compare Models
Create a prompts.json file with your test prompts:
```json
[
  "A photorealistic cat on a windowsill at sunset",
  "A modern minimalist logo for 'ACME Corp'",
  "Product photo: white sneakers on marble",
  "A watercolor painting of a mountain landscape"
]
```
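If you prefer to generate the prompt file from a script (for example, to template many variations), a minimal sketch using only the Python standard library; the file name and prompt strings are exactly those used above:

```python
import json

# The same four test prompts shown above, written as JSON for evaly bench -p.
prompts = [
    "A photorealistic cat on a windowsill at sunset",
    "A modern minimalist logo for 'ACME Corp'",
    "Product photo: white sneakers on marble",
    "A watercolor painting of a mountain landscape",
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```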
Then run a multi-model benchmark with an HTML report:
```shell
evaly bench \
  -m flux-schnell -m flux-dev -m flux-pro \
  -p prompts.json \
  -o report.html \
  --review
```
The --review flag opens an interactive HTML report in your browser with side-by-side image comparison, per-dimension scores, radar charts, and cost breakdown.
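To run the same multi-model benchmark from a CI script, the command can be assembled programmatically. A minimal Python sketch; the flags and model names are exactly those shown above, and nothing here is an Evalytic Python API (the tool is invoked as a subprocess):

```python
import subprocess

# Build the same command as above: one -m flag per model.
models = ["flux-schnell", "flux-dev", "flux-pro"]
cmd = ["evaly", "bench"]
for model in models:
    cmd += ["-m", model]
cmd += ["-p", "prompts.json", "-o", "report.html", "--review"]

# subprocess.run(cmd, check=True)  # uncomment to actually run the benchmark
print(" ".join(cmd))
```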
Already have an image? You can score it directly, with no generation step:

```shell
evaly eval --image photo.jpg --prompt "A product photo of sneakers"
```

This only needs a Gemini API key (free). See evaly eval for the full reference.
## Real Results

These benchmarks were run with Evalytic — same CLI you just installed.

**Do I really need the flagship model?**
Schnell scores 4.3 at $0.003/img. Pro scores 4.7 at $0.05. Is 0.4 points worth 16× the cost? 3 Flux models compared. See benchmark →

**Is my product photo still my product?**
AI edits warp shapes, lose logos, change colors. Input fidelity scoring catches every drift. seedream-edit leads at 5.0/5. See benchmark →

**Why do users say "that's not me"?**
Face edits lose identity. ArcFace + VLM judges agree (r=0.99). flux-dev-i2i scores 0.04 face similarity — unusable. See benchmark →

## CLIP/LPIPS Metrics
Install the metrics extra to unlock deterministic scoring alongside VLM judges:
```shell
pip install "evalytic[metrics]"   # adds CLIP Score, LPIPS, ArcFace (~2 GB)
```
Once installed, metrics are auto-enabled: CLIP for text2img, LPIPS for img2img. Use --no-metrics to disable, or --metrics face to add ArcFace for identity preservation.
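Both CLIP Score and ArcFace identity similarity reduce to a cosine similarity between embedding vectors, which is what makes them deterministic. A toy sketch of that core computation; the three-dimensional vectors are stand-ins, since real CLIP and ArcFace embeddings have hundreds of dimensions and come from the respective models:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for an image embedding and a text embedding.
image_emb = [0.2, 0.9, 0.4]
text_emb = [0.25, 0.8, 0.5]

score = cosine_similarity(image_emb, text_emb)
```

Identical embeddings score 1.0 and orthogonal ones 0.0, which is why a face similarity of 0.04, as in the benchmark above, means the edited face shares almost nothing with the original.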