Quickstart
Pick the path closest to what you want to evaluate.
Run evaly demo to browse real visual benchmark reports in your browser. No API keys are needed.
Choose Your Path
Visual
Benchmark image generation models with evaly bench or score an existing image with evaly eval.
RAG
Evaluate query-response-context triples with evaly rag eval and gate them with per-metric thresholds.
Text / Agent
Evaluate text outputs, references, and tool-using agent runs with the same report, compare, and gate surface.
Shared Install
pip install evalytic
The core install includes the CLI, judge provider support, JSON/HTML reports, and the command groups for visual, RAG, text, and agent evaluation. Add extras only when your workflow needs them:
| Use Case | Install | Why |
|---|---|---|
| Visual benchmarks via fal.ai | evalytic[generation] | Adds fal-client for fal.ai image generation. |
| Visual benchmarks via Parel | evalytic | Parel builtins use the core httpx dependency. Set PAREL_API_KEY. |
| Visual local metrics | evalytic[metrics] / evalytic[ocr] | Adds CLIP, LPIPS, NIMA, ArcFace, and OCR scoring. |
| RAG / semantic text metrics | evalytic[embeddings] | Adds local embeddings for answer_relevancy and semantic_similarity. |
| Everything | evalytic[all] | Installs generation, metrics, OCR, and embeddings in one go. |
Visual
Use this path when you want to benchmark image generation models or score an existing image without generation.
pip install "evalytic[generation]"
export FAL_KEY=your_fal_key
evaly bench -y
That single command generates an image, scores it, and prints a terminal report.
You can also run Parel builtin models by setting PAREL_API_KEY and using parel/ model names:
export PAREL_API_KEY=your_parel_key
evaly bench -m parel/flux-schnell -p "A product photo on marble" --yes
If you already have an output image, use the visual-only scoring path instead:
export GEMINI_API_KEY=your_gemini_key
evaly eval --image output.png --prompt "A product photo of sneakers"
Next: evaly bench for generation benchmarks and evaly eval for existing image scoring.
RAG
Use this path when you already have a user query, a model response, and one or more retrieved context chunks.
pip install "evalytic[embeddings]"
export GEMINI_API_KEY=your_gemini_key
evaly rag eval \
--query "What does Evalytic evaluate?" \
--response "Evalytic evaluates images, text, RAG, and agents." \
--context "Evalytic is an evaluation SDK for AI outputs." \
--context "It supports visual, text, RAG, and agent workflows." \
-o rag.json
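When the retrieved chunks come from code rather than being typed by hand, the same command can be assembled programmatically. The sketch below uses only the flags shown above and repeats --context once per chunk. build_rag_eval_cmd is a hypothetical helper for illustration, not part of Evalytic.

```python
import shlex


def build_rag_eval_cmd(query, response, contexts, out_path):
    """Build the argv for `evaly rag eval` (hypothetical helper)."""
    if not contexts:
        # The RAG path expects one or more retrieved context chunks.
        raise ValueError("at least one context chunk is required")
    cmd = ["evaly", "rag", "eval", "--query", query, "--response", response]
    for chunk in contexts:
        cmd += ["--context", chunk]  # one --context flag per chunk
    cmd += ["-o", out_path]
    return cmd


cmd = build_rag_eval_cmd(
    query="What does Evalytic evaluate?",
    response="Evalytic evaluates images, text, RAG, and agents.",
    contexts=[
        "Evalytic is an evaluation SDK for AI outputs.",
        "It supports visual, text, RAG, and agent workflows.",
    ],
    out_path="rag.json",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To actually run the evaluation, the list can be handed to subprocess.run(cmd, check=True).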
Use per-metric gates for RAG reports:
evaly gate --report rag.json \
--metric-threshold faithfulness:0.8 \
--metric-threshold hallucination:0.9 \
--metric-threshold contextual_relevancy:0.75 \
--metric-threshold answer_relevancy:0.7
Or assert the same thresholds directly inside pytest with evalytic.testing.assert_test.
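If the thresholds live in a Python mapping, the gate call can also be driven from a script or test by shelling out to the CLI. This is a minimal sketch using only the evaly gate flags shown above. The helper names are hypothetical, and treating exit code 0 as "all thresholds passed" is an assumption rather than documented behavior. It deliberately does not call evalytic.testing.assert_test, whose signature is not covered on this page.

```python
import subprocess

# Thresholds copied from the gate command above.
THRESHOLDS = {
    "faithfulness": 0.8,
    "hallucination": 0.9,
    "contextual_relevancy": 0.75,
    "answer_relevancy": 0.7,
}


def build_gate_cmd(report_path, thresholds):
    """Build the argv for `evaly gate` (hypothetical helper)."""
    cmd = ["evaly", "gate", "--report", report_path]
    for metric, threshold in thresholds.items():
        cmd += ["--metric-threshold", f"{metric}:{threshold}"]
    return cmd


def gate_passes(report_path, thresholds, run=subprocess.run):
    """Run the gate. ASSUMPTION: exit code 0 means every threshold passed."""
    result = run(build_gate_cmd(report_path, thresholds))
    return result.returncode == 0
```

Inside a pytest test this reduces to a single line such as assert gate_passes("rag.json", THRESHOLDS), provided rag.json was produced by an earlier evaly rag eval run. The run parameter exists so the subprocess call can be swapped for a stub in unit tests.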
Next: evaly rag for the full command reference and evaly gate for report-type-aware gating.
Text / Agent
Use this path when you want to evaluate plain text outputs, rubric-based responses, or tool-using agent runs.
pip install "evalytic[embeddings]"
export GEMINI_API_KEY=your_gemini_key
Text Output
evaly text eval \
--input "Summarize the incident in one sentence." \
--output-text "The service was unavailable for 12 minutes." \
--expected "A brief outage lasted 12 minutes." \
-o text.json
Agent Run
evaly agent eval \
--input "Find pricing and summarize it." \
--final-output 'The Pro plan costs $99 per month.' \
--tool-call web.search \
--expected-tool web.search \
-o agent.json
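Agent outputs often contain characters the shell treats specially, such as the $ in the price above. When the command is launched from a script, passing an argument list avoids shell parsing altogether, and shlex.quote produces a safe string for logging or copy-paste. The sketch below uses only the flags shown above; the variable names are illustrative.

```python
import shlex

final_output = "The Pro plan costs $99 per month."

# An argv list reaches the program as-is: no shell expansion is applied.
cmd = [
    "evaly", "agent", "eval",
    "--input", "Find pricing and summarize it.",
    "--final-output", final_output,
    "--tool-call", "web.search",
    "--expected-tool", "web.search",
    "-o", "agent.json",
]

# shlex.quote wraps strings with shell metacharacters in single quotes.
quoted = shlex.quote(final_output)
print(quoted)
print(shlex.join(cmd))  # full command, safe to paste into a POSIX shell
```

Running the list with subprocess.run(cmd, check=True) keeps the dollar amount intact, whereas an unescaped $99 inside double quotes in a POSIX shell would be expanded before evaly ever sees it.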
Compare two runs of the same report type with a single command:
evaly compare \
--baseline run-a.json \
--candidate run-b.json
Next: evaly text, evaly agent, and evaly compare.