evaly gate

CI/CD quality gate with pass/fail exit codes.

evaly gate --report <JSON_PATH> [OPTIONS]

The gate command reads an Evalytic report JSON and exits with code 0 (pass), 1 (fail), or 2 (error). Bench reports use overall and per-dimension checks. RAG, text, and agent reports use per-metric checks.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All configured checks passed |
| 1 | One or more checks failed |
| 2 | Error (invalid input, bad JSON, or incompatible flags) |
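
As a rough illustration of how a CI script might act on these codes, the Python sketch below wraps the command with `subprocess`. The helper names `describe_exit_code` and `run_gate` are hypothetical, and the sketch assumes the `evaly` executable is installed and on the PATH. Only the `--report` flag and the three exit codes come from this page.

```python
import subprocess

# Exit-code meanings taken from the table above.
EXIT_MEANINGS = {
    0: "pass",
    1: "fail",
    2: "error",
}


def describe_exit_code(code: int) -> str:
    """Map a gate exit code to a short label (hypothetical helper)."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_gate(report_path: str, *extra_args: str) -> str:
    """Run `evaly gate` and return 'pass', 'fail', 'error', or 'unknown'.

    Assumes the `evaly` executable is installed and on PATH.
    """
    result = subprocess.run(["evaly", "gate", "--report", report_path, *extra_args])
    return describe_exit_code(result.returncode)
```

A CI step could call `run_gate("bench.json", "--threshold", "3.8")` and treat "fail" as a blocking quality result, while treating "error" as a problem with the pipeline configuration itself.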

Report Types

| Report Type | What gate checks | Recommended flags |
| --- | --- | --- |
| bench | overall_score, dimension_averages, optional confidence, optional baseline regression | --threshold, --dimension-threshold, --min-confidence, --baseline |
| rag | metric_averages, optional baseline regression | --metric-threshold, --baseline |
| text | metric_averages, optional baseline regression | --metric-threshold, --baseline |
| agent | metric_averages, optional baseline regression | --metric-threshold, --baseline |

Options

| Flag | Type | Description |
| --- | --- | --- |
| --report | TEXT | Required. Path to the report JSON file. |
| --threshold | FLOAT | Minimum overall score for bench reports only. |
| --dimension-threshold | TEXT (multiple) | Per-dimension threshold as dim:value for bench reports only. |
| --metric-threshold | TEXT (multiple) | Per-metric threshold as metric:value for RAG, text, and agent reports. |
| --min-confidence | FLOAT | Minimum average confidence score for bench reports only. |
| --baseline | TEXT | Path to a baseline report of the same eval_type for regression detection. |
| --regression-threshold | FLOAT | Maximum allowed score drop vs baseline. Applies to dimensions for bench, metrics for RAG/text/agent. |
| --json-output | TEXT | Write machine-readable gate results to a file. Use - for stdout. |

Invalid combinations fail fast. --threshold, --dimension-threshold, and --min-confidence only work with bench reports. --metric-threshold only works with rag, text, and agent reports.
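
The compatibility rule can be mirrored in a small pre-flight check, for example when a CI script assembles flags dynamically. This is a sketch of the rule as stated here, not Evalytic's own validation code. The function name `incompatible_flags` is hypothetical, and the caller is assumed to already know the report type.

```python
# Flags restricted to particular report types, per the note above.
BENCH_ONLY = {"--threshold", "--dimension-threshold", "--min-confidence"}
METRIC_ONLY = {"--metric-threshold"}
METRIC_TYPES = {"rag", "text", "agent"}


def incompatible_flags(eval_type: str, flags: list[str]) -> list[str]:
    """Return the flags that do not apply to the given report type."""
    if eval_type == "bench":
        blocked = METRIC_ONLY
    elif eval_type in METRIC_TYPES:
        blocked = BENCH_ONLY
    else:
        raise ValueError(f"unknown report type: {eval_type}")
    # Preserve the caller's order so error messages are easy to read.
    return [flag for flag in flags if flag in blocked]


print(incompatible_flags("rag", ["--threshold", "--metric-threshold"]))
# ['--threshold']
```

Flags such as --baseline and --json-output are never reported, since this page lists them for every report type.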

Bench Reports

Use overall, per-dimension, confidence, and baseline checks for visual benchmark reports.

# Fail if overall score is below 3.8
evaly gate --report bench.json --threshold 3.8

# Require specific dimension minimums
evaly gate \
    --report bench.json \
    --dimension-threshold visual_quality:4.0 \
    --dimension-threshold prompt_adherence:3.8

# Add confidence and baseline regression checks
evaly gate \
    --report bench.json \
    --min-confidence 0.8 \
    --baseline bench-baseline.json \
    --regression-threshold 0.3

With a bench baseline, Evalytic checks both summary-level dimension regression and per-item regression. That catches cases where one prompt regresses badly even if the average looks stable.
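
To see why the per-item check matters, consider the following Python sketch with made-up scores. It is an illustration of the idea only: the helper `drops_over_limit`, the item names, and the strict "greater than" comparison are assumptions, and Evalytic's actual matching and comparison logic may differ.

```python
from statistics import mean


def drops_over_limit(baseline: dict, current: dict, max_drop: float) -> dict:
    """Return {key: drop} for every shared key whose score fell by more than max_drop."""
    drops = {}
    for key, old_score in baseline.items():
        if key in current:
            drop = round(old_score - current[key], 3)
            if drop > max_drop:
                drops[key] = drop
    return drops


# Hypothetical per-item scores for one dimension (not real Evalytic output).
baseline_items = {"prompt_a": 4.0, "prompt_b": 4.0, "prompt_c": 4.0}
current_items = {"prompt_a": 4.4, "prompt_b": 4.4, "prompt_c": 3.1}

# Summary-level view: the averages barely move.
average_drop = round(mean(baseline_items.values()) - mean(current_items.values()), 3)

# Per-item view: one prompt fell well past the 0.3 limit.
item_regressions = drops_over_limit(baseline_items, current_items, max_drop=0.3)

print(average_drop)      # 0.033
print(item_regressions)  # {'prompt_c': 0.9}
```

With a regression threshold of 0.3, the average drop of 0.033 would pass a summary-only check, while the 0.9 drop on the third prompt is exactly the kind of case a per-item check surfaces.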

RAG, Text, and Agent Reports

Use --metric-threshold for metric-first reports. Baseline checks compare the report's metric averages against a baseline report of the same type.

# Gate a RAG report
evaly gate \
    --report rag.json \
    --metric-threshold faithfulness:0.8 \
    --metric-threshold answer_relevancy:0.7

# Gate a text report against a baseline
evaly gate \
    --report text.json \
    --metric-threshold factual_correctness:0.8 \
    --baseline text-baseline.json \
    --regression-threshold 0.05

# Gate an agent report
evaly gate \
    --report agent.json \
    --metric-threshold goal_accuracy:0.75 \
    --metric-threshold tool_call_accuracy:0.9
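
When thresholds live in a config file or dictionary, the repeated --metric-threshold flags can be generated instead of typed by hand. The Python sketch below assembles an argument list using only flags documented on this page. The function name `build_gate_command` is hypothetical.

```python
def build_gate_command(report, metric_thresholds, baseline=None, regression_threshold=None):
    """Assemble an `evaly gate` argument list for a RAG, text, or agent report."""
    command = ["evaly", "gate", "--report", report]
    for metric, value in metric_thresholds.items():
        # Each threshold uses the documented metric:value form.
        command += ["--metric-threshold", f"{metric}:{value}"]
    if baseline is not None:
        command += ["--baseline", baseline]
    if regression_threshold is not None:
        command += ["--regression-threshold", str(regression_threshold)]
    return command


# Reproduces the text-report example above as an argument list.
text_gate = build_gate_command(
    "text.json",
    {"factual_correctness": 0.8},
    baseline="text-baseline.json",
    regression_threshold=0.05,
)
print(" ".join(text_gate))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues in CI scripts.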

JSON Output

Use --json-output for machine-readable results in CI jobs, alerting, or custom automation. The payload includes status, eval_type, all checks, and a summary count.

evaly gate \
    --report rag.json \
    --metric-threshold faithfulness:0.8 \
    --json-output gate-result.json

{
  "status": "fail",
  "eval_type": "rag",
  "checks": [
    {
      "type": "metric_threshold",
      "metric": "faithfulness",
      "value": 0.72,
      "threshold": 0.8,
      "passed": false
    }
  ],
  "summary": {
    "total_checks": 1,
    "passed": 0,
    "failed": 1
  }
}
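
A downstream job might read this payload to build an alert or a pull request comment. The Python sketch below lists the failed checks from a payload shaped like the sample above. The helper `failed_checks` is hypothetical, and because only a metric_threshold check is shown on this page, the fallback for other check types is an assumption.

```python
import json


def failed_checks(payload: dict) -> list[str]:
    """Summarize each failed check in a gate result payload as one line."""
    lines = []
    for check in payload.get("checks", []):
        if not check.get("passed", False):
            # "metric" appears in the sample above; other check types may
            # name their subject differently, so fall back to the type.
            subject = check.get("metric", check.get("type", "unknown"))
            lines.append(
                f"{subject}: value={check.get('value')} threshold={check.get('threshold')}"
            )
    return lines


# The sample payload from above, inlined so the sketch runs on its own.
sample = json.loads("""
{
  "status": "fail",
  "eval_type": "rag",
  "checks": [
    {"type": "metric_threshold", "metric": "faithfulness",
     "value": 0.72, "threshold": 0.8, "passed": false}
  ],
  "summary": {"total_checks": 1, "passed": 0, "failed": 1}
}
""")

print(failed_checks(sample))  # ['faithfulness: value=0.72 threshold=0.8']
```

In a real job the payload would be loaded from the file written by --json-output, for example with `json.load` on gate-result.json.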

Need summary-level diffs without pass/fail behavior? Use evaly compare. Use gate --baseline when you want enforced regression checks.