Evalytic
Pytest for AI outputs.
Evaluate images, text, RAG, and agents with LLM judges, local metrics, and CI gates.
Know if your AI is good — before your users tell you it's not.
- Model Selection — Benchmark candidate models or prompts before you ship them
- Regression Detection — Compare baseline and candidate runs to catch quality drops early
- CI/CD Quality Gates — Turn report JSON into pass/fail checks in automation
- Reference-Free Evaluation — Score RAG answers and open-ended outputs when you do not have labels
- Reference-Based Evaluation — Measure expected-vs-actual quality for text, RAG, and agent workflows
Evalytic is an open-source SDK for evaluating AI outputs across visual, retrieval, text, and agent workflows. Use the same CLI to benchmark generation models, score existing outputs, compare runs, and gate releases with JSON-native reports.
Quick Example
pip install evalytic
Visual Benchmark
$ evaly bench \
    --models flux-schnell flux-dev flux-pro \
    --prompts "A photorealistic cat on a windowsill" \
    --output report.html
RAG Answer
$ evaly rag eval \
    --query "What is Evalytic?" \
    --response "Evalytic evaluates AI outputs." \
    --context "Evalytic evaluates images, text, RAG, and agents." \
    --metrics faithfulness
Key Features
Multi-Modal Eval
Benchmark generated images, score existing images, and evaluate RAG answers, text outputs, and agent runs in one SDK.
LLM-as-Judge
Use Gemini, GPT-5.2, Claude, Ollama, LM Studio, or fal.ai-backed judges, with consensus when you want multiple opinions.
Local Metrics & Embeddings
Combine sharpness, CLIP, LPIPS, NIMA, OCR, and local embeddings with judge-based scoring when you need more signal.
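Deterministic metrics like sharpness are cheap to compute locally. One classic sharpness proxy is the variance of a Laplacian response: strong edges produce large Laplacian values, so higher variance usually means a sharper image. A pure-Python sketch on a toy grayscale grid (an illustration of the idea, not Evalytic's implementation):

```python
def laplacian_variance(image):
    """Sharpness proxy: variance of the 4-neighbour Laplacian response.

    `image` is a 2-D list of grayscale values. Higher variance means
    stronger edge responses, which usually correlates with sharpness.
    """
    responses = []
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0, 0, 255, 255]] * 4    # hard vertical edge
blurry = [[0, 85, 170, 255]] * 4  # gradual ramp, same endpoints
print(laplacian_variance(sharp) > laplacian_variance(blurry))  # True
```

Real implementations run this kind of kernel over full-resolution images (typically with NumPy or OpenCV), but the signal is the same: edges up, variance up.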
Compare Reports
Diff baseline and candidate runs for bench, RAG, text, and agent reports without writing custom scripts.
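At its core, a run comparison is a diff of per-metric scores between two report payloads, with drops beyond some tolerance flagged as regressions. A minimal sketch of that logic, assuming a hypothetical `{"scores": {...}}` layout rather than Evalytic's actual report schema:

```python
def diff_runs(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return per-metric deltas, flagging drops larger than the tolerance."""
    deltas = {}
    for metric, base_score in baseline["scores"].items():
        cand_score = candidate["scores"].get(metric)
        if cand_score is None:
            continue  # metric missing from the candidate run; skip it
        delta = cand_score - base_score
        deltas[metric] = {"delta": round(delta, 4), "regressed": delta < -tolerance}
    return deltas

baseline = {"scores": {"faithfulness": 0.91, "relevance": 0.78}}
candidate = {"scores": {"faithfulness": 0.84, "relevance": 0.80}}
print(diff_runs(baseline, candidate))
# faithfulness drops by 0.07, beyond the 0.02 tolerance, so it is flagged
```

The tolerance matters in practice: judge-based scores are noisy run to run, so a small slack keeps the comparison from flagging every fluctuation as a regression.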
CI Quality Gates
Turn report JSON into release criteria with overall, per-dimension, or per-metric thresholds in CI/CD.
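Because the reports are plain JSON, a gate is just a script that reads scores, checks thresholds, and exits non-zero on failure so the CI step fails. A minimal sketch; the `overall_score` / per-metric `scores` layout here is an assumed schema for illustration, not Evalytic's documented format:

```python
import json
import sys

# Assumed report layout -- the real Evalytic JSON schema may differ.
report = json.loads("""
{
  "overall_score": 0.86,
  "scores": {"faithfulness": 0.91, "relevance": 0.78}
}
""")

# Release criteria: one overall floor plus per-metric floors.
THRESHOLDS = {"overall": 0.80, "faithfulness": 0.85, "relevance": 0.75}

failures = []
if report["overall_score"] < THRESHOLDS["overall"]:
    failures.append(f"overall {report['overall_score']:.2f} < {THRESHOLDS['overall']:.2f}")
for metric, minimum in THRESHOLDS.items():
    if metric in report["scores"] and report["scores"][metric] < minimum:
        failures.append(f"{metric} {report['scores'][metric]:.2f} < {minimum:.2f}")

if failures:
    print("Quality gate FAILED:", "; ".join(failures))
    sys.exit(1)  # non-zero exit fails the CI step
print("Quality gate passed")
```

Wired into a pipeline, this runs right after the eval command that produced the report, so a regression blocks the merge or deploy instead of shipping.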
Reports for Humans and Automation
Review results in Rich terminal tables, JSON payloads, or HTML reports that are easy to share and automate.
How It Works
- Capture — Start from prompts, generated outputs, retrieved contexts, or agent tool traces
- Evaluate — Run judge-based metrics and deterministic metrics through one CLI surface
- Review — Inspect terminal, JSON, or HTML reports locally or in automation
- Ship Safely — Compare runs and enforce quality thresholds before regressions reach users
Start Here
Visual Quickstart
Benchmark image generation models or score existing images.
RAG Quickstart
Evaluate query-response-context triples with faithfulness and related metrics.
Text & Agent Quickstart
Evaluate non-visual outputs, expected answers, and tool-using agent runs.
Installation
See extras, install variants, and which packages you need for each workflow.