# Quickstart
From zero to your first benchmark report in 3 minutes.
## Install

```shell
pip install evalytic
```

Requires Python 3.10+. Installs the CLI, VLM judge, and Rich terminal output (~5 MB).
## Setup

```shell
evaly init
```

The interactive wizard walks you through everything: it picks your use case (text2img or img2img), collects your API keys, validates them, and writes your .env and evalytic.toml config.
```
$ evaly init
? What do you want to evaluate? [text2img / img2img / both]
? Gemini API key: AIza...  validated
? fal.ai API key: fal_...  validated
.env written (2 keys)
evalytic.toml written
Ready! Run: evaly bench -y
```
## First Benchmark

```shell
evaly bench -y
```

Zero arguments needed. Smart defaults: generates one image with Flux Schnell, scores it with Gemini, and prints a terminal report. Done in ~15 seconds.
```
$ evaly bench -y
Evalytic Bench
Models: flux-schnell | Prompts: 1 | Dimensions: auto
Est. cost: ~$0.01
Generating... flux-schnell: 1/1
Scoring...    gemini-2.5-flash: 2/2

Rankings
┌────────────────┬─────────────────┬──────────────────┬─────────┐
│ Model          │ visual_quality  │ prompt_adherence │ Overall │
├────────────────┼─────────────────┼──────────────────┼─────────┤
│ flux-schnell   │ 4.5             │ 4.0              │ 4.2     │
└────────────────┴─────────────────┴──────────────────┴─────────┘
Cost: $0.004 gen + $0.000 judge = $0.004 total
```
Evalytic generated an image with Flux Schnell, scored it with Gemini, and printed the results — all from a single command.
## Compare Models
Create a prompts.json file with your test prompts:
```json
[
  "A photorealistic cat on a windowsill at sunset",
  "A modern minimalist logo for 'ACME Corp'",
  "Product photo: white sneakers on marble",
  "A watercolor painting of a mountain landscape"
]
```
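If you prefer to generate the prompt file from a script (for example, to template many variations), a minimal sketch using only the Python standard library; the file name and prompt strings are exactly those used above:

```python
import json

# The same four test prompts shown above, written as JSON for evaly bench -p.
prompts = [
    "A photorealistic cat on a windowsill at sunset",
    "A modern minimalist logo for 'ACME Corp'",
    "Product photo: white sneakers on marble",
    "A watercolor painting of a mountain landscape",
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```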
Then run a multi-model benchmark with an HTML report:
```shell
evaly bench \
  -m flux-schnell -m flux-dev -m flux-pro \
  -p prompts.json \
  -o report.html \
  --review
```
The --review flag opens an interactive HTML report in your browser with side-by-side image comparison, per-dimension scores, radar charts, and cost breakdown.
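To run the same multi-model benchmark from a CI script, the command can be assembled programmatically. A minimal Python sketch; the flags and model names are exactly those shown above, and nothing here is an Evalytic Python API (the tool is invoked as a subprocess):

```python
import subprocess

# Build the same command as above: one -m flag per model.
models = ["flux-schnell", "flux-dev", "flux-pro"]
cmd = ["evaly", "bench"]
for model in models:
    cmd += ["-m", model]
cmd += ["-p", "prompts.json", "-o", "report.html", "--review"]

# subprocess.run(cmd, check=True)  # uncomment to actually run the benchmark
print(" ".join(cmd))
```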
Already have an image? You can score it directly, with no generation step:

```shell
evaly eval --image photo.jpg --prompt "A product photo of sneakers"
```

This only needs a Gemini API key (free). See evaly eval for the full reference.
## Real Results

These benchmarks were run with Evalytic — same CLI you just installed.

**Do I really need the flagship model?**
Schnell scores 4.3 at $0.003/img. Pro scores 4.7 at $0.05. Is 0.4 points worth 16× the cost? 3 Flux models compared. See benchmark →

**Is my product photo still my product?**
AI edits warp shapes, lose logos, change colors. Input fidelity scoring catches every drift. seedream-edit leads at 5.0/5. See benchmark →

**Why do users say "that's not me"?**
Face edits lose identity. ArcFace + VLM judges agree (r=0.99). flux-dev-i2i scores 0.04 face similarity — unusable. See benchmark →

## CLIP/LPIPS Metrics
Install the metrics extra to unlock deterministic scoring alongside VLM judges:
```shell
pip install "evalytic[metrics]"   # adds CLIP Score, LPIPS, ArcFace (~2 GB)
```
Once installed, metrics are auto-enabled: CLIP for text2img, LPIPS for img2img. Use --no-metrics to disable, or --metrics face to add ArcFace for identity preservation.
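Both CLIP Score and ArcFace identity similarity reduce to a cosine similarity between embedding vectors, which is what makes them deterministic. A toy sketch of that core computation; the three-dimensional vectors are stand-ins, since real CLIP and ArcFace embeddings have hundreds of dimensions and come from the respective models:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for an image embedding and a text embedding.
image_emb = [0.2, 0.9, 0.4]
text_emb = [0.25, 0.8, 0.5]

score = cosine_similarity(image_emb, text_emb)
```

Identical embeddings score 1.0 and orthogonal ones 0.0, which is why a face similarity of 0.04, as in the benchmark above, means the edited face shares almost nothing with the original.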