Evalytic
Pytest for AI outputs.
Evaluate images, text, RAG, and agents with LLM judges, local metrics, and CI gates.
Know if your AI is good — before your users tell you it's not.
- Model Selection — Benchmark candidate models or prompts before you ship them
- Regression Detection — Compare baseline and candidate runs to catch quality drops early
- CI/CD Quality Gates — Turn report JSON into pass/fail checks in automation
- Reference-Free Evaluation — Score RAG answers and open-ended outputs when you do not have labels
- Reference-Based Evaluation — Measure expected-vs-actual quality for text, RAG, and agent workflows
Evalytic is an open-source SDK for evaluating AI outputs across visual, retrieval, text, and agent workflows. Use the same CLI to benchmark generation models, score existing outputs, compare runs, and gate releases with JSON-native reports.
Quick Example
pip install evalytic
Visual Benchmark
$ evaly bench \
    --models flux-schnell flux-dev flux-pro \
    --prompts "A photorealistic cat on a windowsill" \
    --output report.html
RAG Answer
$ evaly rag eval \
    --query "What is Evalytic?" \
    --response "Evalytic evaluates AI outputs." \
    --context "Evalytic evaluates images, text, RAG, and agents." \
    --metrics faithfulness
Key Features
Multi-Modal Eval
Benchmark generated images, score existing images, and evaluate RAG answers, text outputs, and agent runs in one SDK.
LLM-as-Judge
Use Gemini, GPT-5.2, Claude, Ollama, LM Studio, or fal.ai-backed judges, with consensus when you want multiple opinions.
Local Metrics & Embeddings
Combine sharpness, CLIP, LPIPS, NIMA, OCR, and local embeddings with judge-based scoring when you need more signal.
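Deterministic metrics like sharpness are cheap to compute locally. One classic sharpness proxy is the variance of a Laplacian response: strong edges produce large Laplacian values, so higher variance usually means a sharper image. A pure-Python sketch on a toy grayscale grid (an illustration of the idea, not Evalytic's implementation):

```python
def laplacian_variance(image):
    """Sharpness proxy: variance of the 4-neighbour Laplacian response.

    `image` is a 2-D list of grayscale values. Higher variance means
    stronger edge responses, which usually correlates with sharpness.
    """
    responses = []
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0, 0, 255, 255]] * 4    # hard vertical edge
blurry = [[0, 85, 170, 255]] * 4  # gradual ramp, same endpoints
print(laplacian_variance(sharp) > laplacian_variance(blurry))  # True
```

Real implementations run this kind of kernel over full-resolution images (typically with NumPy or OpenCV), but the signal is the same: edges up, variance up.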
Compare Reports
Diff baseline and candidate runs for bench, RAG, text, and agent reports without writing custom scripts.
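At its core, a run comparison is a diff of per-metric scores between two report payloads, with drops beyond some tolerance flagged as regressions. A minimal sketch of that logic, assuming a hypothetical `{"scores": {...}}` layout rather than Evalytic's actual report schema:

```python
def diff_runs(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return per-metric deltas, flagging drops larger than the tolerance."""
    deltas = {}
    for metric, base_score in baseline["scores"].items():
        cand_score = candidate["scores"].get(metric)
        if cand_score is None:
            continue  # metric missing from the candidate run; skip it
        delta = cand_score - base_score
        deltas[metric] = {"delta": round(delta, 4), "regressed": delta < -tolerance}
    return deltas

baseline = {"scores": {"faithfulness": 0.91, "relevance": 0.78}}
candidate = {"scores": {"faithfulness": 0.84, "relevance": 0.80}}
print(diff_runs(baseline, candidate))
# faithfulness drops by 0.07, beyond the 0.02 tolerance, so it is flagged
```

The tolerance matters in practice: judge-based scores are noisy run to run, so a small slack keeps the comparison from flagging every fluctuation as a regression.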
CI Quality Gates
Turn report JSON into release criteria with overall, per-dimension, or per-metric thresholds in CI/CD.
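Because the reports are plain JSON, a gate is just a script that reads scores, checks thresholds, and exits non-zero on failure so the CI step fails. A minimal sketch; the `overall_score` / per-metric `scores` layout here is an assumed schema for illustration, not Evalytic's documented format:

```python
import json
import sys

# Assumed report layout -- the real Evalytic JSON schema may differ.
report = json.loads("""
{
  "overall_score": 0.86,
  "scores": {"faithfulness": 0.91, "relevance": 0.78}
}
""")

# Release criteria: one overall floor plus per-metric floors.
THRESHOLDS = {"overall": 0.80, "faithfulness": 0.85, "relevance": 0.75}

failures = []
if report["overall_score"] < THRESHOLDS["overall"]:
    failures.append(f"overall {report['overall_score']:.2f} < {THRESHOLDS['overall']:.2f}")
for metric, minimum in THRESHOLDS.items():
    if metric in report["scores"] and report["scores"][metric] < minimum:
        failures.append(f"{metric} {report['scores'][metric]:.2f} < {minimum:.2f}")

if failures:
    print("Quality gate FAILED:", "; ".join(failures))
    sys.exit(1)  # non-zero exit fails the CI step
print("Quality gate passed")
```

Wired into a pipeline, this runs right after the eval command that produced the report, so a regression blocks the merge or deploy instead of shipping.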
Reports for Humans and Automation
Review results in Rich terminal tables, JSON payloads, or HTML reports that are easy to share and automate.
How It Works
- Capture — Start from prompts, generated outputs, retrieved contexts, or agent tool traces
- Evaluate — Run judge-based metrics and deterministic metrics through one CLI surface
- Review — Inspect terminal, JSON, or HTML reports locally or in automation
- Ship Safely — Compare runs and enforce quality thresholds before regressions reach users
Start Here
Visual Quickstart
Benchmark image generation models or score existing images.
RAG Quickstart
Evaluate query-response-context triples with faithfulness and related metrics.
Text & Agent Quickstart
Evaluate non-visual outputs, expected answers, and tool-using agent runs.
Installation
See extras, install variants, and which packages you need for each workflow.