evaly rag

Evaluate RAG answers and retrieval quality.

evaly rag eval [OPTIONS]

The rag command evaluates a response against its original query and retrieved context. Use it for faithfulness and answer relevancy, and, when your data includes reference answers, for the reference-based retrieval metrics.

Basic Usage

# Inline evaluation with repeated --context
evaly rag eval \
    --query "What does Evalytic evaluate?" \
    --response "Evalytic evaluates images, text, RAG, and agents." \
    --context "Evalytic is an evaluation SDK for AI outputs." \
    --context "It supports visual, text, RAG, and agent workflows."

# Evaluate a dataset instead
evaly rag eval --dataset rag-cases.json

Options

| Flag | Type | Description |
| --- | --- | --- |
| `--query` | TEXT | Original user question. |
| `--response` | TEXT | Model answer to evaluate. |
| `--context` | TEXT (multiple) | Retrieved context chunk. Repeat for multiple chunks. |
| `--reference` | TEXT | Optional reference answer for reference-based metrics. |
| `--dataset` | TEXT | Path to a `type: "rag"` dataset JSON file. |
| `--metrics` | TEXT | Comma-separated metric IDs. Default: `faithfulness,answer_relevancy`. |
| `--judge`, `-j` | TEXT | Judge model. Defaults to `gemini-2.5-flash`, or `fal/gemini-2.5-flash` when only `FAL_KEY` is set. |
| `--judges` | TEXT | Comma-separated judges for consensus mode. |
| `--judge-url` | TEXT | Custom judge API base URL. |
| `--output`, `-o` | TEXT | Write report JSON to file. |

Common Metrics

| Metric | What it answers | Notes |
| --- | --- | --- |
| `faithfulness` | Is the response supported by the retrieved context? | Judge-based. Requires at least one context chunk. |
| `answer_relevancy` | Does the response stay on-topic for the original query? | Judge + embeddings. |
| `context_precision` | Did retrieval bring back mostly useful context? | Reference-based. Best with `--reference` or a dataset reference. |
| `context_recall` | Did retrieval cover the facts needed for a good answer? | Reference-based. Best with `--reference` or a dataset reference. |

Embeddings recommended: install evalytic[embeddings] if you want answer_relevancy to work locally without depending on an external embeddings API.
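The embeddings half of answer_relevancy boils down to comparing query and response vectors. A minimal sketch, assuming hypothetical pre-computed vectors (Evalytic's actual embedding backend is not shown here, and the numbers below are illustrative only):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a query and a response. Real vectors would
# come from a local model (evalytic[embeddings]) or an embeddings API.
query_vec = [0.1, 0.8, 0.3]
response_vec = [0.2, 0.7, 0.4]

print(round(cosine_similarity(query_vec, response_vec), 3))
```

A score near 1.0 indicates the response points in the same semantic direction as the query; the judge component then refines that signal.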

Dataset Mode

Dataset mode accepts canonical dataset files with type: "rag" and an items array. Each item must contain query, response, and contexts; an optional reference enables the reference-based metrics.

{
  "type": "rag",
  "items": [
    {
      "query": "What does Evalytic evaluate?",
      "response": "Evalytic evaluates images, text, RAG, and agents.",
      "contexts": [
        { "text": "Evalytic is an evaluation SDK for AI outputs.", "rank": 1 }
      ],
      "reference": "Evalytic is a toolkit for evaluating AI outputs."
    }
  ]
}
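If you assemble dataset files in code rather than by hand, a small stdlib-only sketch (not an Evalytic API) can build and sanity-check the shape shown above:

```python
import json

def build_rag_dataset(items):
    """Assemble a type: "rag" dataset, checking each item's required keys."""
    for item in items:
        missing = {"query", "response", "contexts"} - item.keys()
        if missing:
            raise ValueError(f"item missing keys: {sorted(missing)}")
    return {"type": "rag", "items": items}

dataset = build_rag_dataset([
    {
        "query": "What does Evalytic evaluate?",
        "response": "Evalytic evaluates images, text, RAG, and agents.",
        "contexts": [
            {"text": "Evalytic is an evaluation SDK for AI outputs.", "rank": 1}
        ],
        # Optional: enables context_precision / context_recall.
        "reference": "Evalytic is a toolkit for evaluating AI outputs.",
    }
])

with open("rag-cases.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

The resulting file can be passed directly to evaly rag eval --dataset rag-cases.json.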

Output and Gating

RAG reports use eval_type: "rag" and expose summary.metric_averages. Gate them with --metric-threshold:

evaly rag eval --dataset rag-cases.json -o rag.json
evaly gate --report rag.json --metric-threshold faithfulness:0.8
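For scripting around reports, the threshold check that the gate step performs can be approximated in plain Python. The report dict below is hypothetical and mirrors only the eval_type and summary.metric_averages fields described above:

```python
def passes_threshold(report, metric, minimum):
    """Return True when a metric average in the report meets the threshold."""
    averages = report["summary"]["metric_averages"]
    return averages.get(metric, 0.0) >= minimum

# Hypothetical report contents, shaped like the documented fields.
report = {
    "eval_type": "rag",
    "summary": {"metric_averages": {"faithfulness": 0.86, "answer_relevancy": 0.91}},
}

# Roughly equivalent to: --metric-threshold faithfulness:0.8
metric, minimum = "faithfulness:0.8".split(":")
print(passes_threshold(report, metric, float(minimum)))
```

Missing metrics are treated as failing here; whether evaly gate does the same is not specified in this page.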