Evalytic

Pytest for AI outputs.

Evaluate images, text, RAG, and agents with LLM judges, local metrics, and CI gates.

Know if your AI is good — before your users tell you it's not.

What Evalytic Does
  • Model Selection — Benchmark candidate models or prompts before you ship them
  • Regression Detection — Compare baseline and candidate runs to catch quality drops early
  • CI/CD Quality Gates — Turn report JSON into pass/fail checks in automation
  • Reference-Free Evaluation — Score RAG answers and open-ended outputs when you do not have labels
  • Reference-Based Evaluation — Measure expected-vs-actual quality for text, RAG, and agent workflows

Evalytic is an open-source SDK for evaluating AI outputs across visual, retrieval, text, and agent workflows. Use the same CLI to benchmark generation models, score existing outputs, compare runs, and gate releases with JSON-native reports.

Quick Example

pip install evalytic

Visual Benchmark

$ evaly bench \
    --models flux-schnell flux-dev flux-pro \
    --prompts "A photorealistic cat on a windowsill" \
    --output report.html

RAG Answer

$ evaly rag eval \
    --query "What is Evalytic?" \
    --response "Evalytic evaluates AI outputs." \
    --context "Evalytic evaluates images, text, RAG, and agents." \
    --metrics faithfulness
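
Faithfulness-style metrics ask how much of the response is actually supported by the retrieved context. Evalytic's real implementation is not shown here; as a rough intuition only, the idea can be sketched with a naive word-overlap check (the function name and scoring rule below are illustrative, not the Evalytic API — production faithfulness metrics use LLM judges, not word overlap):

```python
import re

def naive_faithfulness(response: str, context: str) -> float:
    """Toy faithfulness: fraction of response words that also appear
    in the context. Illustrative only; real metrics judge claims, not words."""
    resp = set(re.findall(r"[a-z]+", response.lower()))
    ctx = set(re.findall(r"[a-z]+", context.lower()))
    return len(resp & ctx) / len(resp) if resp else 0.0

score = naive_faithfulness(
    "Evalytic evaluates AI outputs.",
    "Evalytic evaluates images, text, RAG, and agents.",
)
```

Here two of the four response words ("Evalytic", "evaluates") are grounded in the context, so the toy score is 0.5 — a judge-based metric would instead ask whether the *claim* is supported.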

Key Features

Multi-Modal Eval

Benchmark generated images, score existing images, and evaluate RAG answers, text outputs, and agent runs in one SDK.

LLM-as-Judge

Use Gemini, GPT-5.2, Claude, Ollama, LM Studio, or fal.ai-backed judges, with consensus when you want multiple opinions.
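
Consensus across judges can be as simple as a robust aggregate over their scores. A minimal sketch of the idea, with made-up judge names and scores (this is not the Evalytic consensus API):

```python
from statistics import median

# Hypothetical per-judge scores for one output on a 0-10 scale.
judge_scores = {"gemini": 8.0, "gpt": 7.5, "claude": 9.0}

# Median is robust to a single outlier judge; mean would shift with it.
consensus = median(judge_scores.values())
```

With these scores the consensus lands at 8.0; the design choice is that one unusually harsh or generous judge cannot drag the result far.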

Local Metrics & Embeddings

Combine sharpness, CLIP, LPIPS, NIMA, OCR, and local embeddings with judge-based scoring when you need more signal.
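
One common way to combine deterministic metrics with a judge score is a weighted mean over normalized values. A sketch under assumed metric names and weights (not Evalytic's actual aggregation):

```python
# Hypothetical normalized scores (0-1) from local metrics and a judge.
scores = {"sharpness": 0.9, "clip_similarity": 0.8, "judge": 0.7}

# Weights sum to 1.0; here the judge carries half the signal.
weights = {"sharpness": 0.2, "clip_similarity": 0.3, "judge": 0.5}

combined = sum(scores[k] * weights[k] for k in scores)
```

This yields a combined score of 0.77; tuning the weights is how you decide whether cheap local metrics or the judge dominates the final number.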

Compare Reports

Diff baseline and candidate runs for bench, RAG, text, and agent reports without writing custom scripts.
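
The core of a report diff is per-metric deltas between two runs. A minimal sketch with invented metric names and scores (the real report schema may differ):

```python
# Hypothetical per-metric scores from two runs.
baseline = {"faithfulness": 0.91, "relevance": 0.88}
candidate = {"faithfulness": 0.84, "relevance": 0.90}

# Per-metric deltas; a negative delta means the candidate regressed.
deltas = {m: round(candidate[m] - baseline[m], 4) for m in baseline}
regressions = [m for m, d in deltas.items() if d < 0]
```

In this example relevance improved but faithfulness dropped by 0.07, so `regressions` flags exactly the metric worth investigating before shipping.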

CI Quality Gates

Turn report JSON into release criteria with overall, per-dimension, or per-metric thresholds in CI/CD.
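
A quality gate boils down to loading the report JSON, checking each threshold, and failing the build on any violation. A sketch assuming a simplified report shape (Evalytic's real JSON schema may differ; the keys below are illustrative):

```python
import json

# Hypothetical report payload; real reports come from an eval run.
report = json.loads(
    '{"overall": 0.82, "metrics": {"faithfulness": 0.78, "relevance": 0.9}}'
)

# Floors for the overall score and individual metrics.
thresholds = {"overall": 0.8, "faithfulness": 0.75}

failures = []
if report["overall"] < thresholds["overall"]:
    failures.append("overall")
for metric, floor in thresholds.items():
    if metric in report["metrics"] and report["metrics"][metric] < floor:
        failures.append(metric)

# In CI, a nonzero exit code fails the pipeline stage.
exit_code = 1 if failures else 0
```

Here both floors are met, so the gate passes with exit code 0; in a CI script you would end with `sys.exit(exit_code)`.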

Reports for Humans and Automation

Review results in Rich terminal tables, JSON payloads, or HTML reports that are easy to share and automate.

Where Evalytic is strongest today: visual evaluation is the most mature public workflow. Text, RAG, and agent evaluation ship in the same SDK and reuse the same judge, report, compare, and gate surface.

How It Works

Your prompts / outputs / traces → Evalytic judges + metrics → Scores + reports → Compare + gate
  1. Capture — Start from prompts, generated outputs, retrieved contexts, or agent tool traces
  2. Evaluate — Run judge-based metrics and deterministic metrics through one CLI surface
  3. Review — Inspect terminal, JSON, or HTML reports locally or in automation
  4. Ship Safely — Compare runs and enforce quality thresholds before regressions reach users
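
The four steps above can be sketched as one function chain. All names and values here are placeholders standing in for real Evalytic calls, not the SDK's API:

```python
# Illustrative pipeline mirroring Capture -> Evaluate -> Review -> Ship Safely.

def capture():
    # Step 1: gather prompts/outputs (stand-in sample).
    return [{"prompt": "What is Evalytic?",
             "response": "Evalytic evaluates AI outputs."}]

def evaluate(samples):
    # Step 2: stand-in scorer; real runs call judges and metrics.
    return {"overall": 0.9, "n": len(samples)}

def review(report):
    # Step 3: human-readable summary of the report.
    return f"overall={report['overall']:.2f} over {report['n']} sample(s)"

def gate(report, floor=0.8):
    # Step 4: release only if the score clears the floor.
    return report["overall"] >= floor

report = evaluate(capture())
summary = review(report)
shipped = gate(report)
```

The point of the shape: every stage consumes the previous stage's output, so the same report object feeds both the human-facing summary and the automated gate.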

Start Here