evaly text

Evaluate text outputs against references or criteria.

evaly text eval [OPTIONS]

The text command evaluates plain LLM outputs with judge-based, embedding-based, and deterministic metrics. Use it for summaries, structured answers, rubric checks, and expected-vs-actual comparisons.

Basic Usage

# Evaluate one output inline
evaly text eval \
    --input "Summarize the incident in one sentence." \
    --output-text "The service was unavailable for 12 minutes." \
    --expected "A brief outage lasted 12 minutes."

# Evaluate a dataset instead
evaly text eval --dataset text-cases.json

Options

| Flag | Type | Description |
|------|------|-------------|
| --input | TEXT | Original prompt or task input. |
| --output-text | TEXT | Model output to evaluate. |
| --expected | TEXT | Optional expected or reference output. |
| --criteria | TEXT | Optional rubric or evaluation criteria. |
| --dataset | TEXT | Path to a type: "text" dataset JSON file. |
| --metrics | TEXT | Comma-separated metric IDs. Default: factual_correctness,semantic_similarity. |
| --judge, -j | TEXT | Judge model. |
| --judges | TEXT | Comma-separated judges for consensus mode. |
| --judge-url | TEXT | Custom judge API base URL. |
| --output, -o | TEXT | Write report JSON to file. |

Common Metrics

| Metric | What it answers | Notes |
|--------|-----------------|-------|
| factual_correctness | Is the output correct relative to the expected answer? | Judge-based. |
| semantic_similarity | How semantically close is the output to the expected answer? | Requires embeddings and --expected. |
| g_eval | Does the output satisfy a custom rubric? | Requires --criteria. |
| bleu, rouge, exact_match, levenshtein, string_presence | Deterministic reference-based checks. | Best when you have an expected answer. |
Note: semantic_similarity relies on embeddings. Install evalytic[embeddings] to run that metric locally.
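Metric IDs are passed to --metrics as a single comma-separated value. A minimal Python sketch for validating such a value in wrapper scripts before shelling out; the helper name is illustrative, and the ID set is taken only from the table above:

```python
# Metric IDs listed in the table above.
KNOWN_METRICS = {
    "factual_correctness", "semantic_similarity", "g_eval",
    "bleu", "rouge", "exact_match", "levenshtein", "string_presence",
}

def parse_metrics(value: str) -> list[str]:
    """Split a comma-separated --metrics value and reject unknown IDs."""
    metrics = [m.strip() for m in value.split(",") if m.strip()]
    unknown = [m for m in metrics if m not in KNOWN_METRICS]
    if unknown:
        raise ValueError(f"unknown metric IDs: {unknown}")
    return metrics

print(parse_metrics("factual_correctness,semantic_similarity"))
```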

Dataset Mode

Dataset mode accepts canonical dataset files with type: "text". Each item should contain input and output, plus optional expected and criteria.

{
  "type": "text",
  "items": [
    {
      "input": "Summarize the incident in one sentence.",
      "output": "The service was unavailable for 12 minutes.",
      "expected": "A brief outage lasted 12 minutes.",
      "criteria": "Be concise and factually correct."
    }
  ]
}
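For larger suites it is often easier to generate the dataset file from code. A minimal Python sketch, assuming only the schema shown above (the filename and item contents are placeholders):

```python
import json

# Build a canonical text dataset: the top-level "type" must be "text",
# and each item carries "input" and "output" plus optional
# "expected" and "criteria" fields.
items = [
    {
        "input": "Summarize the incident in one sentence.",
        "output": "The service was unavailable for 12 minutes.",
        "expected": "A brief outage lasted 12 minutes.",
        "criteria": "Be concise and factually correct.",
    },
]

dataset = {"type": "text", "items": items}

with open("text-cases.json", "w") as f:
    json.dump(dataset, f, indent=2)
```

The resulting file can be passed directly to evaly text eval --dataset text-cases.json.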

Output and Gating

Text reports use eval_type: "text" and expose summary.metric_averages. Gate them with --metric-threshold:

evaly text eval --dataset text-cases.json -o text.json
evaly gate --report text.json --metric-threshold factual_correctness:0.8
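If you want to inspect a gate condition yourself, for example in CI glue code, the report can be read directly. A hedged Python sketch: it relies only on the documented fields (eval_type and summary.metric_averages), treats the threshold as inclusive (an assumption about --metric-threshold semantics), and uses an in-memory report shaped like those fields rather than a real text.json:

```python
def passes_threshold(report: dict, metric: str, threshold: float) -> bool:
    """Check one metric average against a threshold, mirroring the
    --metric-threshold metric:value form (inclusive >= is assumed)."""
    averages = report["summary"]["metric_averages"]
    return averages.get(metric, 0.0) >= threshold

# Example report shaped like the documented fields.
report = {
    "eval_type": "text",
    "summary": {"metric_averages": {"factual_correctness": 0.85}},
}

print(passes_threshold(report, "factual_correctness", 0.8))
```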