evaly rag

Evaluate RAG answers and retrieval quality.

evaly rag eval [OPTIONS]

The rag command evaluates a response against its original query and retrieved context. Use it for faithfulness and answer relevancy, and, when your data includes reference answers, for the reference-based retrieval metrics.

Basic Usage

# Inline evaluation with repeated --context
evaly rag eval \
    --query "What does Evalytic evaluate?" \
    --response "Evalytic evaluates images, text, RAG, and agents." \
    --context "Evalytic is an evaluation SDK for AI outputs." \
    --context "It supports visual, text, RAG, and agent workflows."

# Evaluate a dataset instead
evaly rag eval --dataset rag-cases.json

Options

| Flag | Type | Description |
| --- | --- | --- |
| `--query` | TEXT | Original user question. |
| `--response` | TEXT | Model answer to evaluate. |
| `--context` | TEXT (multiple) | Retrieved context chunk. Repeat for multiple chunks. |
| `--reference` | TEXT | Optional reference answer for reference-based metrics. |
| `--dataset` | TEXT | Path to a `type: "rag"` dataset JSON file. |
| `--metrics` | TEXT | Comma-separated metric IDs. Default: `faithfulness,answer_relevancy`. |
| `--judge`, `-j` | TEXT | Judge model. Defaults to `gemini-2.5-flash`, or `fal/gemini-2.5-flash` when only `FAL_KEY` is set. |
| `--judges` | TEXT | Comma-separated judges for consensus mode. |
| `--judge-url` | TEXT | Custom judge API base URL. |
| `--output`, `-o` | TEXT | Write report JSON to file. |

Common Metrics

| Metric | What it answers | Notes |
| --- | --- | --- |
| `faithfulness` | Is the response supported by the retrieved context? | Judge-based. Requires at least one context chunk. |
| `answer_relevancy` | Does the response stay on-topic for the original query? | Judge + embeddings. |
| `context_precision` | Did retrieval bring back mostly useful context? | Reference-based. Best with `--reference` or a dataset reference. |
| `context_recall` | Did retrieval cover the facts needed for a good answer? | Reference-based. Best with `--reference` or a dataset reference. |

Embeddings recommended: install evalytic[embeddings] if you want answer_relevancy to work locally without depending on an external embeddings API.
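The embeddings half of answer_relevancy boils down to comparing query and response vectors. A minimal sketch, assuming hypothetical pre-computed vectors (Evalytic's actual embedding backend is not shown here, and the numbers below are illustrative only):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a query and a response. Real vectors would
# come from a local model (evalytic[embeddings]) or an embeddings API.
query_vec = [0.1, 0.8, 0.3]
response_vec = [0.2, 0.7, 0.4]

print(round(cosine_similarity(query_vec, response_vec), 3))
```

A score near 1.0 indicates the response points in the same semantic direction as the query; the judge component then refines that signal.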

Dataset Mode

Dataset mode accepts canonical dataset files with type: "rag" and an items array. Each item must contain query, response, and contexts; an optional reference enables the reference-based metrics.

{
  "type": "rag",
  "items": [
    {
      "query": "What does Evalytic evaluate?",
      "response": "Evalytic evaluates images, text, RAG, and agents.",
      "contexts": [
        { "text": "Evalytic is an evaluation SDK for AI outputs.", "rank": 1 }
      ],
      "reference": "Evalytic is a toolkit for evaluating AI outputs."
    }
  ]
}
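If you assemble dataset files in code rather than by hand, a small stdlib-only sketch (not an Evalytic API) can build and sanity-check the shape shown above:

```python
import json

def build_rag_dataset(items):
    """Assemble a type: "rag" dataset, checking each item's required keys."""
    for item in items:
        missing = {"query", "response", "contexts"} - item.keys()
        if missing:
            raise ValueError(f"item missing keys: {sorted(missing)}")
    return {"type": "rag", "items": items}

dataset = build_rag_dataset([
    {
        "query": "What does Evalytic evaluate?",
        "response": "Evalytic evaluates images, text, RAG, and agents.",
        "contexts": [
            {"text": "Evalytic is an evaluation SDK for AI outputs.", "rank": 1}
        ],
        # Optional: enables context_precision / context_recall.
        "reference": "Evalytic is a toolkit for evaluating AI outputs.",
    }
])

with open("rag-cases.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

The resulting file can be passed directly to evaly rag eval --dataset rag-cases.json.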

Output and Gating

RAG reports use eval_type: "rag" and expose summary.metric_averages. Gate them with --metric-threshold:

evaly rag eval --dataset rag-cases.json -o rag.json
evaly gate --report rag.json --metric-threshold faithfulness:0.8
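For scripting around reports, the threshold check that the gate step performs can be approximated in plain Python. The report dict below is hypothetical and mirrors only the eval_type and summary.metric_averages fields described above:

```python
def passes_threshold(report, metric, minimum):
    """Return True when a metric average in the report meets the threshold."""
    averages = report["summary"]["metric_averages"]
    return averages.get(metric, 0.0) >= minimum

# Hypothetical report contents, shaped like the documented fields.
report = {
    "eval_type": "rag",
    "summary": {"metric_averages": {"faithfulness": 0.86, "answer_relevancy": 0.91}},
}

# Roughly equivalent to: --metric-threshold faithfulness:0.8
metric, minimum = "faithfulness:0.8".split(":")
print(passes_threshold(report, metric, float(minimum)))
```

Missing metrics are treated as failing here; whether evaly gate does the same is not specified in this page.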