Evalytic

Pytest for images. Know if your AI-generated visuals are good — before your users tell you they're not.

What Evalytic Does
  • Model Selection — Compare Flux Schnell vs Dev vs Pro with real prompts
  • Prompt Optimization — Measure how well models follow your prompts
  • Regression Detection — Catch quality drops when models update
  • CI/CD Quality Gate — Block deploys when image quality falls below threshold
  • 7 Semantic Dimensions — VLM judges score visual quality, prompt adherence, text rendering, and more

Evalytic is an open-source SDK for automated quality evaluation of AI-generated images. One command benchmarks multiple models, scores outputs across 7 semantic dimensions using VLM judges (Gemini, GPT-4o, Claude, or local models), and generates rich reports.

Quick Example

$ pip install evalytic

$ evaly bench \
    --models flux-schnell flux-dev flux-pro \
    --prompts "A photorealistic cat on a windowsill" \
    --output report.html

# Generates images, scores with Gemini 2.5 Flash, opens HTML report

Key Features

One Command

Generate, score, and report in a single evaly bench call.

Multi-Judge

Gemini, GPT-4o, Claude, Ollama, LM Studio — any VLM as judge.

CI/CD Gate

Pass/fail quality gates with exit codes for GitHub Actions.

Rich Reports

Terminal (Rich), HTML with image grids, JSON for automation.

Local Metrics

CLIP Score and LPIPS run automatically alongside VLM scores when their optional dependencies are installed.

7 Dimensions

Visual quality, prompt adherence, text rendering, input fidelity, identity preservation, and more.

How It Works

Your Prompts → fal.ai Models → VLM Judge → Scores + Report
  1. Generate — Send prompts to one or more fal.ai models (Flux Schnell, Dev, Pro, Kontext, etc.)
  2. Score — VLM judge evaluates each image across selected dimensions (1–5 scale)
  3. Report — Rich terminal output + HTML report + JSON for automation
  4. Gate — Optionally enforce quality thresholds in CI/CD pipelines
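
The gating step (4) boils down to reading the JSON report and exiting non-zero when scores fall below a threshold. Here is a minimal sketch of that logic, assuming a hypothetical report shape of `{"results": [{"model": ..., "scores": {dimension: 1-5}}]}` — the actual evalytic JSON schema and CLI gate flags may differ:

```python
import json
import sys


def gate(report: dict, threshold: float = 4.0) -> bool:
    """Return True if every model's mean dimension score meets the threshold.

    Assumes a hypothetical report shape; the real evalytic schema may differ.
    """
    for result in report["results"]:
        # e.g. {"visual_quality": 5, "prompt_adherence": 4, ...} on the 1-5 scale
        scores = result["scores"]
        mean = sum(scores.values()) / len(scores)
        if mean < threshold:
            print(f"FAIL {result['model']}: mean {mean:.2f} < {threshold}")
            return False
        print(f"PASS {result['model']}: mean {mean:.2f}")
    return True


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        report = json.load(f)
    # A non-zero exit code fails the CI job, like a failing pytest run would.
    sys.exit(0 if gate(report) else 1)
```

In a GitHub Actions workflow, running this script (or the built-in gate, if the CLI provides one) as its own step is enough: the step fails whenever the exit code is non-zero, blocking the deploy.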

Next Steps