# evaly text

Evaluate text outputs against references or criteria.

```
evaly text eval [OPTIONS]
```
The `text` command evaluates plain LLM outputs with judge-based, embedding-based, and deterministic metrics.
Use it for summaries, structured answers, rubric checks, and expected-vs-actual comparisons.
## Basic Usage

```bash
# Evaluate one output inline
evaly text eval \
  --input "Summarize the incident in one sentence." \
  --output-text "The service was unavailable for 12 minutes." \
  --expected "A brief outage lasted 12 minutes."

# Evaluate a dataset instead
evaly text eval --dataset text-cases.json
```
## Options

| Flag | Type | Description |
|---|---|---|
| `--input` | TEXT | Original prompt or task input. |
| `--output-text` | TEXT | Model output to evaluate. |
| `--expected` | TEXT | Optional expected or reference output. |
| `--criteria` | TEXT | Optional rubric or evaluation criteria. |
| `--dataset` | TEXT | Path to a `type: "text"` dataset JSON file. |
| `--metrics` | TEXT | Comma-separated metric IDs. Default: `factual_correctness,semantic_similarity`. |
| `--judge`, `-j` | TEXT | Judge model. |
| `--judges` | TEXT | Comma-separated judges for consensus mode. |
| `--judge-url` | TEXT | Custom judge API base URL. |
| `--output`, `-o` | TEXT | Write the report JSON to a file. |
## Common Metrics

| Metric | What it answers | Notes |
|---|---|---|
| `factual_correctness` | Is the output correct relative to the expected answer? | Judge-based. |
| `semantic_similarity` | How semantically close is the output to the expected answer? | Requires embeddings and `--expected`. |
| `g_eval` | Does the output satisfy a custom rubric? | Requires `--criteria`. |
| `bleu`, `rouge`, `exact_match`, `levenshtein`, `string_presence` | Deterministic reference-based checks. | Best when you have an expected answer. |
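For intuition, the simplest deterministic checks can be approximated in a few lines of Python. This is an illustrative sketch, not evaly's actual implementation; both functions return a score in `[0, 1]`:

```python
def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings match exactly after trimming, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def levenshtein_similarity(output: str, expected: str) -> float:
    """Edit distance between the strings, normalized to a similarity score."""
    m, n = len(output), len(expected)
    if m == 0 and n == 0:
        return 1.0
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == expected[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return 1.0 - prev[n] / max(m, n)
```

A score of 1.0 means identical strings; lower scores mean more edits were needed relative to the longer string.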
`semantic_similarity` relies on embeddings. Install `evalytic[embeddings]` if you want that metric to run locally.
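For intuition only: embedding-based similarity metrics typically reduce to a cosine similarity between two embedding vectors, where the vectors come from whichever embedding backend is installed. A minimal sketch of that final comparison step (not evaly's actual code):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate vector; treat as no similarity
    return dot / (norm_a * norm_b)
```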
## Dataset Mode

Dataset mode accepts canonical dataset files with `type: "text"`. Each item should contain
`input` and `output`, plus optional `expected` and `criteria`.
```json
{
  "type": "text",
  "items": [
    {
      "input": "Summarize the incident in one sentence.",
      "output": "The service was unavailable for 12 minutes.",
      "expected": "A brief outage lasted 12 minutes.",
      "criteria": "Be concise and factually correct."
    }
  ]
}
```
## Output and Gating

Text reports use `eval_type: "text"` and expose `summary.metric_averages`.
Gate them with `--metric-threshold`:

```bash
evaly text eval --dataset text-cases.json -o text.json
evaly gate --report text.json --metric-threshold factual_correctness:0.8
```