evaly gate
CI/CD quality gate with pass/fail exit codes.
evaly gate --report <JSON_PATH> [OPTIONS]
The gate command reads an Evalytic report JSON and exits with code 0 (pass), 1 (fail), or 2 (error).
Bench reports use overall and per-dimension checks. RAG, text, and agent reports use per-metric checks.
Exit Codes
| Code | Meaning |
|---|---|
| 0 | All configured checks passed |
| 1 | One or more checks failed |
| 2 | Error (invalid input, bad JSON, or incompatible flags) |
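The three exit codes in the table above can be turned into readable CI log lines. The following is a minimal sketch; `describe_gate_exit` is a hypothetical helper written for this example, not part of Evalytic, and the usage comment assumes `evaly` is on `PATH`.

```shell
#!/bin/sh
# describe_gate_exit: hypothetical helper (not an Evalytic command).
# Maps an `evaly gate` exit code to a short log message.
describe_gate_exit() {
  case "$1" in
    0) echo "gate passed: all configured checks passed" ;;
    1) echo "gate failed: one or more checks failed" ;;
    2) echo "gate error: invalid input, bad JSON, or incompatible flags" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage in a CI step (assumes evaly is installed and bench.json exists):
#   code=0
#   evaly gate --report bench.json --threshold 3.8 || code=$?
#   describe_gate_exit "$code"
#   exit "$code"
```

Capturing the code with `|| code=$?` keeps the step working under `set -e`, where a bare failing command would end the script before the message is printed.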
Report Types
| Report Type | What gate checks | Recommended flags |
|---|---|---|
| bench | overall_score, dimension_averages, optional confidence, optional baseline regression | --threshold, --dimension-threshold, --min-confidence, --baseline |
| rag | metric_averages, optional baseline regression | --metric-threshold, --baseline |
| text | metric_averages, optional baseline regression | --metric-threshold, --baseline |
| agent | metric_averages, optional baseline regression | --metric-threshold, --baseline |
Options
| Flag | Type | Description |
|---|---|---|
| --report | TEXT | Required. Path to the report JSON file. |
| --threshold | FLOAT | Minimum overall score for bench reports only. |
| --dimension-threshold | TEXT (multiple) | Per-dimension threshold as dim:value for bench reports only. |
| --metric-threshold | TEXT (multiple) | Per-metric threshold as metric:value for RAG, text, and agent reports. |
| --min-confidence | FLOAT | Minimum average confidence score for bench reports only. |
| --baseline | TEXT | Path to a baseline report of the same eval_type for regression detection. |
| --regression-threshold | FLOAT | Maximum allowed score drop vs baseline. Applies to dimensions for bench, metrics for RAG/text/agent. |
| --json-output | TEXT | Write machine-readable gate results to a file. Use - for stdout. |
--threshold, --dimension-threshold, and --min-confidence only work with bench reports. --metric-threshold only works with rag, text, and agent reports.
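A CI script that gates several report types may want to pick the per-score threshold flag from the report type instead of hard-coding it. This is one possible sketch; `threshold_flag_for` is a hypothetical helper for illustration only, and the usage comment assumes `evaly` is on `PATH`.

```shell
#!/bin/sh
# threshold_flag_for: hypothetical helper (not an Evalytic command).
# Prints the per-score threshold flag that matches a report type,
# following the compatibility rule stated above.
threshold_flag_for() {
  case "$1" in
    bench)          echo "--dimension-threshold" ;;
    rag|text|agent) echo "--metric-threshold" ;;
    *)              echo "unknown report type: $1" >&2; return 2 ;;
  esac
}

# Usage (assumes rag.json is a RAG report):
#   evaly gate --report rag.json "$(threshold_flag_for rag)" faithfulness:0.8
```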
Bench Reports
Use overall, per-dimension, confidence, and baseline checks for visual benchmark reports.
# Fail if overall score is below 3.8
evaly gate --report bench.json --threshold 3.8
# Require specific dimension minimums
evaly gate \
--report bench.json \
--dimension-threshold visual_quality:4.0 \
--dimension-threshold prompt_adherence:3.8
# Add confidence and baseline regression checks
evaly gate \
--report bench.json \
--min-confidence 0.8 \
--baseline bench-baseline.json \
--regression-threshold 0.3
With a bench baseline, Evalytic checks both summary-level dimension regression and per-item regression. The per-item check catches cases where one prompt regresses badly even though the average looks stable.
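To see why a per-item check matters, consider a small worked example. The scores below are invented for illustration and are not Evalytic output, and `summarize_drops` is a hypothetical helper. The sketch only does the arithmetic; how Evalytic compares a drop against --regression-threshold internally is not specified here.

```shell
#!/bin/sh
# Illustrative data only (made-up scores).
# Columns: prompt_id baseline_score current_score
scores='p1 4.0 4.25
p2 4.0 4.25
p3 4.0 4.25
p4 4.0 3.25'

# summarize_drops: hypothetical helper. Compares the drop in the average
# score with the largest drop on any single item.
summarize_drops() {
  printf '%s\n' "$scores" | awk '
    { base += $2; cur += $3; d = $2 - $3
      if (d > worst) { worst = d; id = $1 } }
    END { printf "avg_drop=%.2f worst_item=%s worst_drop=%.2f\n",
                 (base - cur) / NR, id, worst }'
}

summarize_drops
# prints: avg_drop=0.00 worst_item=p4 worst_drop=0.75
```

Here the average does not move at all, yet prompt p4 loses 0.75 points. Assuming the threshold is compared against the raw score drop, a summary-only check with --regression-threshold 0.3 would pass, while a per-item check would flag p4.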
RAG, Text, and Agent Reports
Use --metric-threshold for metric-first reports. Baseline checks compare metric averages against a baseline report of the same type.
# Gate a RAG report
evaly gate \
--report rag.json \
--metric-threshold faithfulness:0.8 \
--metric-threshold answer_relevancy:0.7
# Gate a text report against a baseline
evaly gate \
--report text.json \
--metric-threshold factual_correctness:0.8 \
--baseline text-baseline.json \
--regression-threshold 0.05
# Gate an agent report
evaly gate \
--report agent.json \
--metric-threshold goal_accuracy:0.75 \
--metric-threshold tool_call_accuracy:0.9
JSON Output
Use --json-output for machine-readable results in CI jobs, alerting, or custom automation.
The payload includes status, eval_type, all checks, and a summary count.
evaly gate \
--report rag.json \
--metric-threshold faithfulness:0.8 \
--json-output gate-result.json
{
  "status": "fail",
  "eval_type": "rag",
  "checks": [
    {
      "type": "metric_threshold",
      "metric": "faithfulness",
      "value": 0.72,
      "threshold": 0.8,
      "passed": false
    }
  ],
  "summary": {
    "total_checks": 1,
    "passed": 0,
    "failed": 1
  }
}
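A CI job can pull the failed checks out of this payload for a log line or an alert. The sketch below assumes `jq` is installed and that every check carries the fields shown in the sample above (`type`, `metric`, `value`, `threshold`, `passed`); other check types may use different fields. `list_failed_checks` is a hypothetical helper, not part of Evalytic.

```shell
#!/bin/sh
# list_failed_checks: hypothetical helper (not an Evalytic command).
# Prints one line per failed check from a gate result file.
# Assumes jq is available and the payload matches the sample shown above.
list_failed_checks() {
  jq -r '.checks[]
         | select(.passed == false)
         | "\(.type) \(.metric // "n/a"): \(.value) (threshold \(.threshold))"' "$1"
}

# Usage (assumes gate-result.json was written with --json-output):
#   list_failed_checks gate-result.json
```

For the sample payload this would print a single line for the faithfulness check. The process exit code remains the authoritative pass/fail signal; the JSON is best treated as detail for reporting.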
Use evaly gate --baseline when you want enforced regression checks.