# evaly gate

CI/CD quality gate with pass/fail exit codes.

```
evaly gate --report <JSON_PATH> [OPTIONS]
```
The gate command reads a bench report JSON and exits with code 0 (pass) or 1 (fail)
based on quality thresholds. Works with any scheduler — cron, GitHub Actions, Airflow, or a simple shell script.
## Exit Codes
| Code | Meaning |
|---|---|
| 0 | All checks passed |
| 1 | One or more checks failed |
| 2 | Error (invalid input, missing file) |
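A pipeline can dispatch on these codes directly. A minimal sketch, where the `handle_gate_exit` helper is hypothetical (it is not part of evaly, just an illustration of branching on the exit code):

```shell
# Hypothetical helper mapping evaly gate's exit codes to pipeline actions.
handle_gate_exit() {
  case "$1" in
    0) echo "pass" ;;      # all checks passed
    1) echo "fail" ;;      # one or more checks failed
    2) echo "error" ;;     # invalid input or missing file
    *) echo "unknown" ;;
  esac
}

# Real usage would be:
#   evaly gate --report report.json --threshold 3.5
#   action=$(handle_gate_exit $?)
action=$(handle_gate_exit 1)
echo "$action"   # prints "fail"
```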
## Options
| Flag | Type | Default | Description |
|---|---|---|---|
| --report | TEXT | — | Required. Path to bench report JSON file. |
| --threshold | FLOAT | — | Minimum overall score to pass (1.0–5.0). |
| --dimension-threshold | TEXT (multiple) | — | Per-dimension threshold as dim:value. |
| --min-confidence | FLOAT | — | Minimum average confidence score (0.0–1.0) to pass. |
| --baseline | TEXT | — | Path to baseline report JSON for regression detection. |
| --regression-threshold | FLOAT | 0.3 | Max allowed score drop per dimension vs baseline. Applied to both average and per-item comparisons. |
| --json-output | TEXT | — | Write gate results as JSON to file. Use - for stdout. |
## Check Types
The gate runs up to five types of checks. If no check is configured, the gate passes with a warning.
| Check | Triggered By | Description |
|---|---|---|
| threshold | --threshold | Overall score must meet the minimum for every model. |
| dimension_threshold | --dimension-threshold | Per-dimension average must meet the minimum for every model. |
| confidence | --min-confidence | Average judge confidence must meet the minimum for every model. |
| regression | --baseline | Per-dimension average drop from baseline must stay within the threshold. |
| item_regression | --baseline | Per-item, per-dimension score drop must stay within the threshold. Catches regressions on specific inputs that averages might hide. |
Item regressions are matched by item_id. This catches cases where one specific prompt degrades badly while the average stays stable. Items not present in the baseline are skipped (there is no regression to detect), as are items with status: "failed".
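The pass/fail rule behind both regression checks is simple arithmetic: a comparison fails when the baseline score minus the current score exceeds the regression threshold. A sketch of that rule, using awk for the floating-point comparison (this is an illustration, not evaly's implementation):

```shell
# Succeeds (exit 0) when the score drop exceeds the regression threshold.
# Illustrative only; evaly performs this comparison internally.
exceeds_regression() {
  awk -v base="$1" -v cur="$2" -v thr="$3" 'BEGIN { exit !((base - cur) > thr) }'
}

exceeds_regression 4.5 2.0 0.3 && echo "regression"        # drop 2.5 > 0.3
exceeds_regression 3.6 3.5 0.3 || echo "within threshold"  # drop 0.1 <= 0.3
```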
## Basic Usage
### Overall threshold

```bash
# Fail if any model's overall score is below 3.5
evaly gate --report report.json --threshold 3.5
```
### Per-dimension thresholds

```bash
# Different thresholds per dimension
evaly gate \
  --report report.json \
  --dimension-threshold visual_quality:4.0 \
  --dimension-threshold prompt_adherence:3.5 \
  --dimension-threshold text_rendering:3.0
```
### Confidence check

```bash
# Fail if the judge isn't confident enough in its scores
evaly gate --report report.json --min-confidence 0.8
```
### Regression detection

```bash
# Fail if any dimension drops more than 0.5 from baseline
# Checks both averages and individual items
evaly gate \
  --report report.json \
  --baseline baseline.json \
  --regression-threshold 0.5
```
### Combine all checks

```bash
evaly gate \
  --report report.json \
  --threshold 3.5 \
  --dimension-threshold visual_quality:4.0 \
  --min-confidence 0.8 \
  --baseline baseline.json \
  --regression-threshold 0.3
```
## JSON Output
Use --json-output for machine-readable results. This suppresses terminal output and writes
structured JSON instead — ideal for cron jobs, webhooks, and alerting integrations.
```bash
# Write to file
evaly gate --report report.json --threshold 3.5 --json-output gate-result.json

# Write to stdout (pipe to jq, curl, etc.)
evaly gate --report report.json --threshold 3.5 --json-output -
```
JSON output structure:
```jsonc
{
  "status": "fail",              // "pass", "fail", or "error"
  "checks": [
    {
      "type": "threshold",
      "model": "flux-schnell",
      "check": "overall_score",
      "value": 3.2,
      "threshold": 3.5,
      "passed": false
    },
    {
      "type": "item_regression",
      "model": "flux-schnell",
      "item_id": "product-photo-3",
      "dimension": "visual_quality",
      "baseline": 4.5,
      "current": 2.0,
      "drop": 2.5,
      "threshold": 0.3,
      "passed": false
    }
  ],
  "summary": {
    "total_checks": 12,
    "passed": 10,
    "failed": 2
  },
  "item_regressions": 1          // only present when > 0
}
```
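Downstream tooling can consume this structure directly. A sketch that extracts the failing checks, assuming python3 is on the PATH; the sample data below mirrors the structure above and is not real evaly output:

```shell
# Write a sample gate result (same shape as the structure above), then
# list the failing checks with python3. The values are illustrative.
cat > gate-result.json <<'EOF'
{
  "status": "fail",
  "checks": [
    {"type": "threshold", "model": "flux-schnell", "value": 3.2,
     "threshold": 3.5, "passed": false}
  ],
  "summary": {"total_checks": 12, "passed": 10, "failed": 2}
}
EOF

python3 - <<'EOF'
import json

with open("gate-result.json") as f:
    result = json.load(f)

for check in result["checks"]:
    if not check["passed"]:
        print(f"FAILED {check['type']} ({check['model']}): "
              f"{check['value']} < {check['threshold']}")
EOF
```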
## Scheduled Benchmarks
The gate command works with any scheduler. The typical pattern is:
run evaly bench to generate a report, then evaly gate to check it.
### Cron job (weekly quality check)

```bash
#!/bin/bash
# Run weekly, alert on failure
REPORT="/data/benchmarks/$(date +%Y-%m-%d).json"
BASELINE="/data/benchmarks/baseline.json"

# Generate benchmark
evaly bench \
  -m flux-kontext \
  -i test-inputs.json \
  -o "$REPORT" \
  --yes

# Check with JSON output for alerting
evaly gate \
  --report "$REPORT" \
  --baseline "$BASELINE" \
  --threshold 3.5 \
  --regression-threshold 0.5 \
  --json-output gate-result.json

if [ $? -ne 0 ]; then
  # Send alert (Slack, email, PagerDuty, etc.)
  curl -X POST "$SLACK_WEBHOOK" \
    -d "@gate-result.json"
fi
```
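One caveat on the alerting step: if the webhook is a Slack incoming webhook, Slack expects a payload with a `text` field, so posting the raw gate JSON only works when the endpoint accepts arbitrary JSON. A sketch that builds a Slack-style message from the summary, assuming python3 is available (the summary values are inline samples; a real script would load them from gate-result.json):

```shell
# Build a Slack-style payload from a gate summary (sample values inline).
payload=$(python3 - <<'EOF'
import json

# In a real script, read these fields from gate-result.json instead.
summary = {"status": "fail", "failed": 2, "total_checks": 12}
print(json.dumps({"text": f"Quality gate {summary['status']}: "
                          f"{summary['failed']}/{summary['total_checks']} checks failed"}))
EOF
)
echo "$payload"
# Then: curl -X POST "$SLACK_WEBHOOK" -H 'Content-Type: application/json' -d "$payload"
```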
### GitHub Actions

```yaml
# .github/workflows/visual-qa.yml
name: Visual Quality Gate

on:
  schedule:
    - cron: "0 6 * * 1" # Every Monday at 06:00
  workflow_dispatch: # Manual trigger

jobs:
  visual-qa:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Evalytic
        run: pip install evalytic

      - name: Run benchmark
        run: |
          evaly bench \
            -m flux-schnell \
            -p test-prompts.json \
            -o report.json \
            --yes
        env:
          FAL_KEY: ${{ secrets.FAL_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

      - name: Quality gate
        run: |
          evaly gate \
            --report report.json \
            --threshold 3.5 \
            --dimension-threshold visual_quality:4.0 \
            --baseline baseline.json
```
## Typical Workflow

1. Run `evaly bench` with `-o report.json` to generate a benchmark report
2. Run `evaly gate --report report.json` with your thresholds
3. The gate exits 0 (pass) or 1 (fail), and your pipeline acts accordingly
For regression detection, save your baseline report and compare against it:
```bash
# Save baseline after a known-good run
cp report.json baseline.json

# Later, compare new results against baseline
evaly gate --report new-report.json --baseline baseline.json
```