evaly gate

CI/CD quality gate with pass/fail exit codes.

evaly gate --report <JSON_PATH> [OPTIONS]

The gate command reads a bench report JSON and exits with code 0 (pass) or 1 (fail) based on quality thresholds. Works with any scheduler — cron, GitHub Actions, Airflow, or a simple shell script.

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | All checks passed |
| 1 | One or more checks failed |
| 2 | Error (invalid input, missing file) |
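In a deploy script it is worth branching on all three codes rather than treating any non-zero status as a plain failure, since 2 signals a broken invocation rather than a quality problem. A minimal sketch; here gate_exit stands in for the $? you would capture after running evaly gate:

```shell
# In CI you would capture the real status first:
#   evaly gate --report report.json --threshold 3.5
#   gate_exit=$?
gate_exit=1  # stand-in value for illustration

case "$gate_exit" in
  0) msg="gate passed" ;;
  1) msg="gate failed: one or more checks below threshold" ;;
  2) msg="gate error: invalid input or missing report" ;;
  *) msg="unexpected exit code: $gate_exit" ;;
esac
echo "$msg"
```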

Options

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --report | TEXT | | Required. Path to bench report JSON file. |
| --threshold | FLOAT | | Minimum overall score to pass (1.0–5.0). |
| --dimension-threshold | TEXT (multiple) | | Per-dimension threshold as dim:value. |
| --min-confidence | FLOAT | | Minimum average confidence score (0.0–1.0) to pass. |
| --baseline | TEXT | | Path to baseline report JSON for regression detection. |
| --regression-threshold | FLOAT | 0.3 | Max allowed score drop per dimension vs. baseline. Applied to both average and per-item comparisons. |
| --json-output | TEXT | | Write gate results as JSON to a file. Use - for stdout. |

Check Types

The gate runs up to five check types. If no checks are configured, the gate passes with a warning.

| Check | Triggered By | Description |
|-------|--------------|-------------|
| threshold | --threshold | Overall score must meet the minimum for every model. |
| dimension_threshold | --dimension-threshold | Per-dimension average must meet the minimum for every model. |
| confidence | --min-confidence | Average judge confidence must meet the minimum for every model. |
| regression | --baseline | Per-dimension average drop from baseline must stay within the threshold. |
| item_regression | --baseline | Per-item, per-dimension score drop must stay within the threshold. Catches regressions on specific inputs that averages might hide. |
Per-item regression compares individual items by item_id. This catches cases where one specific prompt degrades badly while the average stays stable. Items not present in the baseline are skipped (no regression to detect). Items with status: "failed" are also skipped.
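When the gate also writes machine-readable results (--json-output), the specific failing items can be pulled out for triage. A sketch using jq (an assumption about your environment, not a dependency of evaly), run here against an inline sample shaped like the schema in the JSON Output section below:

```shell
# Sample gate result with one failed item-level regression; in CI this
# file would come from: evaly gate ... --json-output gate-result.json
cat > gate-result.json <<'EOF'
{
  "status": "fail",
  "checks": [
    {
      "type": "item_regression",
      "model": "flux-schnell",
      "item_id": "product-photo-3",
      "dimension": "visual_quality",
      "baseline": 4.5,
      "current": 2.0,
      "drop": 2.5,
      "threshold": 0.3,
      "passed": false
    }
  ],
  "summary": {"total_checks": 12, "passed": 11, "failed": 1},
  "item_regressions": 1
}
EOF

# List each failed item regression as "item_id: dimension dropped N"
jq -r '.checks[]
       | select(.type == "item_regression" and (.passed | not))
       | "\(.item_id): \(.dimension) dropped \(.drop)"' gate-result.json
# prints: product-photo-3: visual_quality dropped 2.5
```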

Basic Usage

Overall threshold

# Fail if any model's overall score is below 3.5
evaly gate --report report.json --threshold 3.5

Per-dimension thresholds

# Different thresholds per dimension
evaly gate \
    --report report.json \
    --dimension-threshold visual_quality:4.0 \
    --dimension-threshold prompt_adherence:3.5 \
    --dimension-threshold text_rendering:3.0

Confidence check

# Fail if the judge isn't confident enough in its scores
evaly gate --report report.json --min-confidence 0.8

Regression detection

# Fail if any dimension drops more than 0.5 from baseline
# Checks both averages and individual items
evaly gate \
    --report report.json \
    --baseline baseline.json \
    --regression-threshold 0.5

Combine all checks

evaly gate \
    --report report.json \
    --threshold 3.5 \
    --dimension-threshold visual_quality:4.0 \
    --min-confidence 0.8 \
    --baseline baseline.json \
    --regression-threshold 0.3

JSON Output

Use --json-output for machine-readable results. This suppresses terminal output and writes structured JSON instead — ideal for cron jobs, webhooks, and alerting integrations.

# Write to file
evaly gate --report report.json --threshold 3.5 --json-output gate-result.json

# Write to stdout (pipe to jq, curl, etc.)
evaly gate --report report.json --threshold 3.5 --json-output -

JSON output structure:

{
  "status": "fail",           // "pass", "fail", or "error"
  "checks": [
    {
      "type": "threshold",
      "model": "flux-schnell",
      "check": "overall_score",
      "value": 3.2,
      "threshold": 3.5,
      "passed": false
    },
    {
      "type": "item_regression",
      "model": "flux-schnell",
      "item_id": "product-photo-3",
      "dimension": "visual_quality",
      "baseline": 4.5,
      "current": 2.0,
      "drop": 2.5,
      "threshold": 0.3,
      "passed": false
    }
  ],
  "summary": {
    "total_checks": 12,
    "passed": 10,
    "failed": 2
  },
  "item_regressions": 1      // only present when > 0
}
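A consumer script can branch on the status field and pull the summary counts before alerting. A sketch using jq (an assumption about your environment) against an inline sample shaped like the structure above; in CI the file would come from --json-output:

```shell
# Inline sample mirroring the structure above.
cat > gate-result.json <<'EOF'
{
  "status": "fail",
  "checks": [
    {"type": "threshold", "model": "flux-schnell", "check": "overall_score",
     "value": 3.2, "threshold": 3.5, "passed": false}
  ],
  "summary": {"total_checks": 12, "passed": 10, "failed": 2}
}
EOF

status=$(jq -r '.status' gate-result.json)
failed=$(jq -r '.summary.failed' gate-result.json)
total=$(jq -r '.summary.total_checks' gate-result.json)

# Only alert on "fail" or "error"; "pass" needs no action.
if [ "$status" != "pass" ]; then
  echo "alerting: $failed of $total checks failed"
fi
```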

Scheduled Benchmarks

The gate command works with any scheduler. The typical pattern is: run evaly bench to generate a report, then evaly gate to check it.

Cron job (weekly quality check)

#!/bin/bash
# Run weekly, alert on failure

REPORT="/data/benchmarks/$(date +%Y-%m-%d).json"
BASELINE="/data/benchmarks/baseline.json"

# Generate benchmark
evaly bench \
    -m flux-kontext \
    -i test-inputs.json \
    -o "$REPORT" \
    --yes

# Check with JSON output for alerting
evaly gate \
    --report "$REPORT" \
    --baseline "$BASELINE" \
    --threshold 3.5 \
    --regression-threshold 0.5 \
    --json-output gate-result.json

if [ $? -ne 0 ]; then
    # Send alert (Slack, email, PagerDuty, etc.)
    curl -X POST "$SLACK_WEBHOOK" \
        -d "@gate-result.json"
fi

GitHub Actions

# .github/workflows/visual-qa.yml
name: Visual Quality Gate

on:
  schedule:
    - cron: "0 6 * * 1"  # Every Monday at 06:00
  workflow_dispatch:        # Manual trigger

jobs:
  visual-qa:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Evalytic
        run: pip install evalytic

      - name: Run benchmark
        run: |
          evaly bench \
            -m flux-schnell \
            -p test-prompts.json \
            -o report.json \
            --yes
        env:
          FAL_KEY: ${{ secrets.FAL_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

      - name: Quality gate
        run: |
          evaly gate \
            --report report.json \
            --threshold 3.5 \
            --dimension-threshold visual_quality:4.0 \
            --baseline baseline.json

Typical Workflow

  1. Run evaly bench with -o report.json to generate a benchmark report
  2. Run evaly gate --report report.json with your thresholds
  3. Gate exits 0 (pass), 1 (fail), or 2 (error) — your pipeline acts accordingly

For regression detection, save your baseline report and compare against it:

# Save baseline after a known-good run
cp report.json baseline.json

# Later, compare new results against baseline
evaly gate --report new-report.json --baseline baseline.json