# evaly text

Evaluate text outputs against references or criteria.

```
evaly text eval [OPTIONS]
```
The `text` command evaluates plain LLM outputs with judge-based, embedding-based, and deterministic metrics.
Use it for summaries, structured answers, rubric checks, and expected-vs-actual comparisons.
## Basic Usage

```bash
# Evaluate one output inline
evaly text eval \
  --input "Summarize the incident in one sentence." \
  --output-text "The service was unavailable for 12 minutes." \
  --expected "A brief outage lasted 12 minutes."

# Evaluate a dataset instead
evaly text eval --dataset text-cases.json
```
## Options

| Flag | Type | Description |
|---|---|---|
| `--input` | TEXT | Original prompt or task input. |
| `--output-text` | TEXT | Model output to evaluate. |
| `--expected` | TEXT | Optional expected or reference output. |
| `--criteria` | TEXT | Optional rubric or evaluation criteria. |
| `--dataset` | TEXT | Path to a `type: "text"` dataset JSON file. |
| `--metrics` | TEXT | Comma-separated metric IDs. Default: `factual_correctness,semantic_similarity`. |
| `--judge`, `-j` | TEXT | Judge model. |
| `--judges` | TEXT | Comma-separated judges for consensus mode. |
| `--judge-url` | TEXT | Custom judge API base URL. |
| `--output`, `-o` | TEXT | Write the report JSON to a file. |
## Common Metrics

| Metric | What it answers | Notes |
|---|---|---|
| `factual_correctness` | Is the output correct relative to the expected answer? | Judge-based. |
| `semantic_similarity` | How semantically close is the output to the expected answer? | Requires embeddings and `--expected`. |
| `g_eval` | Does the output satisfy a custom rubric? | Requires `--criteria`. |
| `bleu`, `rouge`, `exact_match`, `levenshtein`, `string_presence` | Deterministic reference-based checks. | Best when you have an expected answer. |
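For intuition, the simplest deterministic checks can be approximated in a few lines of Python. This is an illustrative sketch, not evaly's actual implementation; both functions return a score in `[0, 1]`:

```python
def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings match exactly after trimming, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def levenshtein_similarity(output: str, expected: str) -> float:
    """Edit distance between the strings, normalized to a similarity score."""
    m, n = len(output), len(expected)
    if m == 0 and n == 0:
        return 1.0
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if output[i - 1] == expected[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return 1.0 - prev[n] / max(m, n)
```

A score of 1.0 means identical strings; lower scores mean more edits were needed relative to the longer string.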
`semantic_similarity` relies on embeddings. Install `evalytic[embeddings]` if you want that metric to run locally.
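For intuition only: embedding-based similarity metrics typically reduce to a cosine similarity between two embedding vectors, where the vectors come from whichever embedding backend is installed. A minimal sketch of that final comparison step (not evaly's actual code):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate vector; treat as no similarity
    return dot / (norm_a * norm_b)
```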
## Dataset Mode

Dataset mode accepts canonical dataset files with `type: "text"`. Each item should contain
`input` and `output`, plus optional `expected` and `criteria`.
```json
{
  "type": "text",
  "items": [
    {
      "input": "Summarize the incident in one sentence.",
      "output": "The service was unavailable for 12 minutes.",
      "expected": "A brief outage lasted 12 minutes.",
      "criteria": "Be concise and factually correct."
    }
  ]
}
```
## Output and Gating

Text reports use `eval_type: "text"` and expose `summary.metric_averages`.
Gate them with `--metric-threshold`:

```bash
evaly text eval --dataset text-cases.json -o text.json
evaly gate --report text.json --metric-threshold factual_correctness:0.8
```