evaly gate

CI/CD quality gate with pass/fail exit codes.

evaly gate --report <JSON_PATH> [OPTIONS]

The gate command reads an Evalytic report JSON and exits with code 0 (pass), 1 (fail), or 2 (error). Bench reports use overall and per-dimension checks. RAG, text, and agent reports use per-metric checks.

Exit Codes

Code	Meaning
0	All configured checks passed
1	One or more checks failed
2	Error (invalid input, bad JSON, or incompatible flags)
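
As a rough sketch of how a CI script might branch on these codes, the helper below maps an exit code to a short label. The function name `describe_gate_exit` is hypothetical and not part of Evalytic; only the 0/1/2 meanings come from the table above.

```shell
# Map an `evaly gate` exit code to a short label for CI logs.
# describe_gate_exit is an illustrative helper, not an Evalytic command.
describe_gate_exit() {
    case "$1" in
        0) echo "pass" ;;     # all configured checks passed
        1) echo "fail" ;;     # one or more checks failed
        2) echo "error" ;;    # invalid input, bad JSON, or incompatible flags
        *) echo "unknown" ;;  # anything outside the documented codes
    esac
}

# Usage in a CI step (commented out so the snippet runs without evaly installed):
# evaly gate --report bench.json --threshold 3.8
# code=$?
# echo "gate result: $(describe_gate_exit "$code")"
```

In scripts that run under `set -e`, a non-zero exit stops the script immediately, so capture the code with `evaly gate ... || code=$?` if you need to tell a failed check (1) apart from a usage error (2).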

Report Types

Report Type	What gate checks	Recommended flags
`bench`	`overall_score`, `dimension_averages`, optional confidence, optional baseline regression	`--threshold`, `--dimension-threshold`, `--min-confidence`, `--baseline`
`rag`	`metric_averages`, optional baseline regression	`--metric-threshold`, `--baseline`
`text`	`metric_averages`, optional baseline regression	`--metric-threshold`, `--baseline`
`agent`	`metric_averages`, optional baseline regression	`--metric-threshold`, `--baseline`

Options

Flag	Type	Description
--report	TEXT	Required. Path to the report JSON file.
--threshold	FLOAT	Minimum overall score for bench reports only.
--dimension-threshold	TEXT (multiple)	Per-dimension threshold as `dim:value` for bench reports only.
--metric-threshold	TEXT (multiple)	Per-metric threshold as `metric:value` for RAG, text, and agent reports.
--min-confidence	FLOAT	Minimum average confidence score for bench reports only.
--baseline	TEXT	Path to a baseline report of the same `eval_type` for regression detection.
--regression-threshold	FLOAT	Maximum allowed score drop vs baseline. Applies to dimensions for bench, metrics for RAG/text/agent.
--json-output	TEXT	Write machine-readable gate results to a file. Use `-` for stdout.

Incompatible flag combinations fail fast with exit code 2. --threshold, --dimension-threshold, and --min-confidence only work with bench reports. --metric-threshold only works with rag, text, and agent reports.
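
If one CI script gates several report types, it can help to check flag compatibility before calling the CLI. The sketch below encodes the rules stated above as a shell function. `gate_flag_allowed` is a hypothetical helper, and treating --report, --baseline, --regression-threshold, and --json-output as valid for every report type is an assumption based on the Options table.

```shell
# Return 0 if a flag is valid for the given report type, 1 otherwise.
# gate_flag_allowed is an illustrative helper, not an Evalytic command.
gate_flag_allowed() {
    report_type="$1"
    flag="$2"
    case "$flag" in
        --threshold|--dimension-threshold|--min-confidence)
            # Bench-only flags.
            [ "$report_type" = "bench" ] ;;
        --metric-threshold)
            # Metric-first report types only.
            case "$report_type" in
                rag|text|agent) return 0 ;;
                *) return 1 ;;
            esac ;;
        --report|--baseline|--regression-threshold|--json-output)
            # Assumed to be accepted for every report type.
            return 0 ;;
        *)
            # Flags not listed in the Options table.
            return 1 ;;
    esac
}

# Example: warn before invoking the gate with a mismatched flag.
if ! gate_flag_allowed rag --threshold; then
    echo "--threshold is bench-only; use --metric-threshold for rag reports"
fi
```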

Bench Reports

Use overall, per-dimension, confidence, and baseline checks for visual benchmark reports.

# Fail if overall score is below 3.8
evaly gate --report bench.json --threshold 3.8

# Require specific dimension minimums
evaly gate \
    --report bench.json \
    --dimension-threshold visual_quality:4.0 \
    --dimension-threshold prompt_adherence:3.8

# Add confidence and baseline regression checks
evaly gate \
    --report bench.json \
    --min-confidence 0.8 \
    --baseline bench-baseline.json \
    --regression-threshold 0.3

With a bench baseline, Evalytic checks both summary-level dimension regression and per-item regression. That catches cases where one prompt regresses badly even if the average looks stable.
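
To make the regression rule concrete, here is a minimal sketch of the comparison for a single score pair. It assumes the drop is computed as baseline minus current and that a check fails only when the drop is strictly greater than --regression-threshold. This page does not spell out the exact comparison Evalytic uses, so treat both as assumptions. `regression_ok` is a hypothetical helper.

```shell
# Return 0 if the score drop is within the allowed threshold, 1 otherwise.
# Assumption: drop = baseline - current, and only drop > threshold fails.
# regression_ok is an illustrative helper, not an Evalytic command.
regression_ok() {
    awk -v base="$1" -v cur="$2" -v max_drop="$3" \
        'BEGIN { exit (((base - cur) > max_drop) ? 1 : 0) }'
}

# A dimension that fell from 4.1 to 3.7 exceeds a 0.3 threshold.
if regression_ok 4.1 3.7 0.3; then
    echo "within threshold"
else
    echo "regression detected"
fi
```

Running the same comparison once per item as well as on the averages is what lets a single badly regressed prompt fail the gate even when the average stays flat.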

RAG, Text, and Agent Reports

Use --metric-threshold for metric-first reports. Baselines compare metric averages against the same report type.

# Gate a RAG report
evaly gate \
    --report rag.json \
    --metric-threshold faithfulness:0.8 \
    --metric-threshold answer_relevancy:0.7

# Gate a text report against a baseline
evaly gate \
    --report text.json \
    --metric-threshold factual_correctness:0.8 \
    --baseline text-baseline.json \
    --regression-threshold 0.05

# Gate an agent report
evaly gate \
    --report agent.json \
    --metric-threshold goal_accuracy:0.75 \
    --metric-threshold tool_call_accuracy:0.9

JSON Output

Use --json-output for machine-readable results in CI jobs, alerting, or custom automation. The payload includes status, eval_type, all checks, and a summary count.

evaly gate \
    --report rag.json \
    --metric-threshold faithfulness:0.8 \
    --json-output gate-result.json

{
  "status": "fail",
  "eval_type": "rag",
  "checks": [
    {
      "type": "metric_threshold",
      "metric": "faithfulness",
      "value": 0.72,
      "threshold": 0.8,
      "passed": false
    }
  ],
  "summary": {
    "total_checks": 1,
    "passed": 0,
    "failed": 1
  }
}
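
A CI job can read the result file to drive follow-up steps. The sketch below pulls the top-level status with `sed`, assuming the file is pretty-printed with one key per line as in the sample above. For anything beyond a quick check, a real JSON parser is a safer choice. `gate_status` is a hypothetical helper.

```shell
# Print the top-level "status" value from a gate result file.
# Assumes pretty-printed JSON with one key per line, as in the sample payload.
# gate_status is an illustrative helper, not an Evalytic command.
gate_status() {
    sed -n 's/^[[:space:]]*"status":[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage after a gate run that wrote gate-result.json:
# if [ "$(gate_status gate-result.json)" = "fail" ]; then
#     echo "quality gate failed; see gate-result.json for the failing checks"
# fi
```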

Need summary-level diffs without pass/fail behavior? Use evaly compare. Use evaly gate --baseline when you want enforced regression checks.
