Evalytic
Pytest for images. Know if your AI-generated visuals are good — before your users tell you they're not.
- Model Selection — Compare Flux Schnell vs Dev vs Pro with real prompts
- Prompt Optimization — Measure how well models follow your prompts
- Regression Detection — Catch quality drops when models update
- CI/CD Quality Gate — Block deploys when image quality falls below threshold
- 7 Semantic Dimensions — VLM judges score visual quality, prompt adherence, text rendering, and more
Evalytic is an open-source SDK for automated quality evaluation of AI-generated images. One command benchmarks multiple models, scores outputs across 7 semantic dimensions using VLM judges (Gemini, GPT-4o, Claude, or local models), and generates rich reports.
Quick Example
$ pip install evalytic

$ evaly bench \
    --models flux-schnell flux-dev flux-pro \
    --prompts "A photorealistic cat on a windowsill" \
    --output report.html

# Generates images, scores with Gemini 2.5 Flash, opens HTML report
Key Features
One Command
Generate, score, and report in a single evaly bench call.
Multi-Judge
Gemini, GPT-4o, Claude, Ollama, LM Studio — any VLM as judge.
CI/CD Gate
Pass/fail quality gates with exit codes for GitHub Actions.
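The gate works the way any CI check does: the process exits 0 on pass and non-zero on fail, so GitHub Actions marks the job red automatically. A minimal sketch of that logic (the function name and threshold are illustrative, not Evalytic's actual API):

```python
def quality_gate(scores, threshold=3.5):
    """Return exit code 0 (pass) if the mean judge score on the 1-5 scale
    meets the threshold, else 1 (fail). Illustrative only."""
    mean_score = sum(scores) / len(scores)
    return 0 if mean_score >= threshold else 1

# In a CI step you would end with sys.exit(exit_code),
# so a low-quality run fails the pipeline.
exit_code = quality_gate([4.2, 3.8, 4.5], threshold=3.5)
```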
Rich Reports
Terminal (Rich), HTML with image grids, JSON for automation.
Local Metrics
CLIP Score + LPIPS auto-enabled alongside VLM scores when installed.
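For context, CLIP Score is the standard reference-free metric here: 100 times the cosine similarity between the CLIP embeddings of the image and the prompt, floored at zero. A sketch with toy embeddings (real embeddings come from a CLIP model, e.g. via torchmetrics or open_clip):

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIP Score: 100 * cosine similarity between image and text
    embeddings, clipped at zero (the standard definition)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return max(100.0 * float(image_emb @ text_emb), 0.0)

# Toy 2-D embeddings for illustration; real CLIP embeddings are 512-D+.
img = np.array([0.6, 0.8])
txt = np.array([0.8, 0.6])
print(round(clip_score(img, txt), 1))  # → 96.0
```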
7 Dimensions
Visual quality, prompt adherence, text rendering, input fidelity, identity preservation, and more.
How It Works
- Generate — Send prompts to one or more fal.ai models (Flux Schnell, Dev, Pro, Kontext, etc.)
- Score — VLM judge evaluates each image across selected dimensions (1–5 scale)
- Report — Rich terminal output + HTML report + JSON for automation
- Gate — Optionally enforce quality thresholds in CI/CD pipelines
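The four steps above can be sketched as a single loop. Everything here is a hypothetical stand-in — the generation call, judge call, and dimension names are placeholders, not Evalytic's actual interfaces:

```python
from statistics import mean

DIMENSIONS = ["visual_quality", "prompt_adherence", "text_rendering"]

def score_image(image, prompt):
    # Placeholder for a VLM judge call returning 1-5 scores per dimension.
    return {dim: 4.0 for dim in DIMENSIONS}

def run_bench(prompts, threshold=3.5):
    report = []
    for prompt in prompts:
        image = f"generated:{prompt}"       # stand-in for a fal.ai model call
        scores = score_image(image, prompt)  # Score step
        report.append({"prompt": prompt, "scores": scores})
    # Report step: aggregate; Gate step: compare against the threshold.
    overall = mean(s for r in report for s in r["scores"].values())
    return report, overall, overall >= threshold
```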