# evaly rag

Evaluate RAG answers and retrieval quality.

```
evaly rag eval [OPTIONS]
```
The `rag` command evaluates a response against its original query and retrieved context. Use it for faithfulness and answer relevancy, and add reference-based retrieval metrics when your dataset includes references.
## Basic Usage

```sh
# Inline evaluation with repeated --context
evaly rag eval \
  --query "What does Evalytic evaluate?" \
  --response "Evalytic evaluates images, text, RAG, and agents." \
  --context "Evalytic is an evaluation SDK for AI outputs." \
  --context "It supports visual, text, RAG, and agent workflows."

# Evaluate a dataset instead
evaly rag eval --dataset rag-cases.json
```
## Options

| Flag | Type | Description |
|---|---|---|
| `--query` | TEXT | Original user question. |
| `--response` | TEXT | Model answer to evaluate. |
| `--context` | TEXT (multiple) | Retrieved context chunk. Repeat the flag to pass multiple chunks. |
| `--reference` | TEXT | Optional reference answer for reference-based metrics. |
| `--dataset` | TEXT | Path to a `type: "rag"` dataset JSON file. |
| `--metrics` | TEXT | Comma-separated metric IDs. Default: `faithfulness,answer_relevancy`. |
| `--judge`, `-j` | TEXT | Judge model. Defaults to `gemini-2.5-flash`, or `fal/gemini-2.5-flash` when only `FAL_KEY` is set. |
| `--judges` | TEXT | Comma-separated judges for consensus mode. |
| `--judge-url` | TEXT | Custom judge API base URL. |
| `--output`, `-o` | TEXT | Write the report JSON to a file. |
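In CI it can be convenient to assemble the command line above programmatically rather than hand-editing pipeline YAML. A minimal sketch, using only the flags from the options table (the helper name and dataset path are illustrative, not part of Evalytic):

```python
# Build an `evaly rag eval` argument vector from the documented flags.
# Nothing here is executed; the list can be passed to subprocess.run later.
def build_rag_eval_cmd(dataset, metrics=("faithfulness", "answer_relevancy"),
                       judge=None, output=None):
    cmd = ["evaly", "rag", "eval", "--dataset", dataset,
           "--metrics", ",".join(metrics)]
    if judge:
        cmd += ["--judge", judge]
    if output:
        cmd += ["--output", output]
    return cmd

cmd = build_rag_eval_cmd("rag-cases.json", output="rag.json")
# cmd == ["evaly", "rag", "eval", "--dataset", "rag-cases.json",
#         "--metrics", "faithfulness,answer_relevancy", "--output", "rag.json"]
```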
## Common Metrics

| Metric | What it answers | Notes |
|---|---|---|
| `faithfulness` | Is the response supported by the retrieved context? | Judge-based. Requires at least one context chunk. |
| `answer_relevancy` | Does the response stay on-topic for the original query? | Judge + embeddings. |
| `context_precision` | Did retrieval bring back mostly useful context? | Reference-based. Best with `--reference` or a dataset reference. |
| `context_recall` | Did retrieval cover the facts needed for a good answer? | Reference-based. Best with `--reference` or a dataset reference. |

> **Embeddings recommended:** install `evalytic[embeddings]` if you want `answer_relevancy` to work locally without depending on an external embeddings API.
## Dataset Mode

Dataset mode accepts canonical dataset files with `type: "rag"` and an `items` array. Each item should contain `query`, `response`, and `contexts`; `reference` is optional and enables the reference-based metrics.
```json
{
  "type": "rag",
  "items": [
    {
      "query": "What does Evalytic evaluate?",
      "response": "Evalytic evaluates images, text, RAG, and agents.",
      "contexts": [
        { "text": "Evalytic is an evaluation SDK for AI outputs.", "rank": 1 }
      ],
      "reference": "Evalytic is a toolkit for evaluating AI outputs."
    }
  ]
}
```
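If your cases come from code (e.g. exported from a retrieval pipeline), a short sketch that writes the canonical shape above — field names follow the documented schema, and the single case is toy data:

```python
import json

# Assemble a type: "rag" dataset and write it where --dataset can find it.
cases = [
    {
        "query": "What does Evalytic evaluate?",
        "response": "Evalytic evaluates images, text, RAG, and agents.",
        "contexts": [
            {"text": "Evalytic is an evaluation SDK for AI outputs.", "rank": 1}
        ],
        "reference": "Evalytic is a toolkit for evaluating AI outputs.",
    }
]

dataset = {"type": "rag", "items": cases}

with open("rag-cases.json", "w") as f:
    json.dump(dataset, f, indent=2)
```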
## Output and Gating

RAG reports use `eval_type: "rag"` and expose `summary.metric_averages`. Gate them with `--metric-threshold`:

```sh
evaly rag eval --dataset rag-cases.json -o rag.json
evaly gate --report rag.json --metric-threshold faithfulness:0.8
```
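For custom gating logic (say, different floors per environment), you can read the report JSON yourself. A minimal sketch, assuming only the documented shape (`eval_type` plus `summary.metric_averages`); `evaly gate` already covers the common case:

```python
# Apply per-metric floors to a saved RAG report.
def passes_thresholds(report, thresholds):
    averages = report.get("summary", {}).get("metric_averages", {})
    # A metric missing from the report counts as a failure, not a silent pass.
    return all(averages.get(m, 0.0) >= floor for m, floor in thresholds.items())

# Toy report in the documented shape; in practice, json.load("rag.json").
report = {
    "eval_type": "rag",
    "summary": {"metric_averages": {"faithfulness": 0.86,
                                    "answer_relevancy": 0.91}},
}
print(passes_thresholds(report, {"faithfulness": 0.8}))  # True
```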