evaly gate

CI/CD quality gate with pass/fail exit codes.

evaly gate --report <JSON_PATH> [OPTIONS]

The gate command reads a bench report JSON and exits with code 0 (pass) or 1 (fail) based on quality thresholds. Works with any scheduler — cron, GitHub Actions, Airflow, or a simple shell script.

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | All checks passed |
| 1 | One or more checks failed |
| 2 | Error (invalid input, missing file) |
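In a deploy script it is worth branching on all three codes rather than treating any non-zero status as a plain failure, since 2 signals a broken invocation rather than a quality problem. A minimal sketch; here gate_exit stands in for the $? you would capture after running evaly gate:

```shell
# In CI you would capture the real status first:
#   evaly gate --report report.json --threshold 3.5
#   gate_exit=$?
gate_exit=1  # stand-in value for illustration

case "$gate_exit" in
  0) msg="gate passed" ;;
  1) msg="gate failed: one or more checks below threshold" ;;
  2) msg="gate error: invalid input or missing report" ;;
  *) msg="unexpected exit code: $gate_exit" ;;
esac
echo "$msg"
```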

Options

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --report | TEXT | | Required. Path to bench report JSON file. |
| --threshold | FLOAT | | Minimum overall score to pass (1.0–5.0). |
| --dimension-threshold | TEXT (multiple) | | Per-dimension threshold as dim:value. |
| --min-confidence | FLOAT | | Minimum average confidence score (0.0–1.0) to pass. |
| --baseline | TEXT | | Path to baseline report JSON for regression detection. |
| --regression-threshold | FLOAT | 0.3 | Max allowed score drop per dimension vs. baseline. Applied to both average and per-item comparisons. |
| --json-output | TEXT | | Write gate results as JSON to a file. Use - for stdout. |

Check Types

The gate runs up to five check types. If no checks are configured, the gate passes with a warning.

| Check | Triggered By | Description |
|-------|--------------|-------------|
| threshold | --threshold | Overall score must meet the minimum for every model. |
| dimension_threshold | --dimension-threshold | Per-dimension average must meet the minimum for every model. |
| confidence | --min-confidence | Average judge confidence must meet the minimum for every model. |
| regression | --baseline | Per-dimension average drop from baseline must stay within the threshold. |
| item_regression | --baseline | Per-item, per-dimension score drop must stay within the threshold. Catches regressions on specific inputs that averages might hide. |
Per-item regression compares individual items by item_id. This catches cases where one specific prompt degrades badly while the average stays stable. Items not present in the baseline are skipped (no regression to detect). Items with status: "failed" are also skipped.
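When the gate also writes machine-readable results (--json-output), the specific failing items can be pulled out for triage. A sketch using jq (an assumption about your environment, not a dependency of evaly), run here against an inline sample shaped like the schema in the JSON Output section below:

```shell
# Sample gate result with one failed item-level regression; in CI this
# file would come from: evaly gate ... --json-output gate-result.json
cat > gate-result.json <<'EOF'
{
  "status": "fail",
  "checks": [
    {
      "type": "item_regression",
      "model": "flux-schnell",
      "item_id": "product-photo-3",
      "dimension": "visual_quality",
      "baseline": 4.5,
      "current": 2.0,
      "drop": 2.5,
      "threshold": 0.3,
      "passed": false
    }
  ],
  "summary": {"total_checks": 12, "passed": 11, "failed": 1},
  "item_regressions": 1
}
EOF

# List each failed item regression as "item_id: dimension dropped N"
jq -r '.checks[]
       | select(.type == "item_regression" and (.passed | not))
       | "\(.item_id): \(.dimension) dropped \(.drop)"' gate-result.json
# prints: product-photo-3: visual_quality dropped 2.5
```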

Basic Usage

Overall threshold

# Fail if any model's overall score is below 3.5
evaly gate --report report.json --threshold 3.5

Per-dimension thresholds

# Different thresholds per dimension
evaly gate \
    --report report.json \
    --dimension-threshold visual_quality:4.0 \
    --dimension-threshold prompt_adherence:3.5 \
    --dimension-threshold text_rendering:3.0

Confidence check

# Fail if the judge isn't confident enough in its scores
evaly gate --report report.json --min-confidence 0.8

Regression detection

# Fail if any dimension drops more than 0.5 from baseline
# Checks both averages and individual items
evaly gate \
    --report report.json \
    --baseline baseline.json \
    --regression-threshold 0.5

Combine all checks

evaly gate \
    --report report.json \
    --threshold 3.5 \
    --dimension-threshold visual_quality:4.0 \
    --min-confidence 0.8 \
    --baseline baseline.json \
    --regression-threshold 0.3

JSON Output

Use --json-output for machine-readable results. This suppresses terminal output and writes structured JSON instead — ideal for cron jobs, webhooks, and alerting integrations.

# Write to file
evaly gate --report report.json --threshold 3.5 --json-output gate-result.json

# Write to stdout (pipe to jq, curl, etc.)
evaly gate --report report.json --threshold 3.5 --json-output -

JSON output structure:

{
  "status": "fail",           // "pass", "fail", or "error"
  "checks": [
    {
      "type": "threshold",
      "model": "flux-schnell",
      "check": "overall_score",
      "value": 3.2,
      "threshold": 3.5,
      "passed": false
    },
    {
      "type": "item_regression",
      "model": "flux-schnell",
      "item_id": "product-photo-3",
      "dimension": "visual_quality",
      "baseline": 4.5,
      "current": 2.0,
      "drop": 2.5,
      "threshold": 0.3,
      "passed": false
    }
  ],
  "summary": {
    "total_checks": 12,
    "passed": 10,
    "failed": 2
  },
  "item_regressions": 1      // only present when > 0
}
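A consumer script can branch on the status field and pull the summary counts before alerting. A sketch using jq (an assumption about your environment) against an inline sample shaped like the structure above; in CI the file would come from --json-output:

```shell
# Inline sample mirroring the structure above.
cat > gate-result.json <<'EOF'
{
  "status": "fail",
  "checks": [
    {"type": "threshold", "model": "flux-schnell", "check": "overall_score",
     "value": 3.2, "threshold": 3.5, "passed": false}
  ],
  "summary": {"total_checks": 12, "passed": 10, "failed": 2}
}
EOF

status=$(jq -r '.status' gate-result.json)
failed=$(jq -r '.summary.failed' gate-result.json)
total=$(jq -r '.summary.total_checks' gate-result.json)

# Only alert on "fail" or "error"; "pass" needs no action.
if [ "$status" != "pass" ]; then
  echo "alerting: $failed of $total checks failed"
fi
```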

Scheduled Benchmarks

The gate command works with any scheduler. The typical pattern is: run evaly bench to generate a report, then evaly gate to check it.

Cron job (weekly quality check)

#!/bin/bash
# Run weekly, alert on failure

REPORT="/data/benchmarks/$(date +%Y-%m-%d).json"
BASELINE="/data/benchmarks/baseline.json"

# Generate benchmark
evaly bench \
    -m flux-kontext \
    -i test-inputs.json \
    -o "$REPORT" \
    --yes

# Check with JSON output for alerting
evaly gate \
    --report "$REPORT" \
    --baseline "$BASELINE" \
    --threshold 3.5 \
    --regression-threshold 0.5 \
    --json-output gate-result.json

if [ $? -ne 0 ]; then
    # Send alert (Slack, email, PagerDuty, etc.)
    curl -X POST "$SLACK_WEBHOOK" \
        -d "@gate-result.json"
fi

GitHub Actions

# .github/workflows/visual-qa.yml
name: Visual Quality Gate

on:
  schedule:
    - cron: "0 6 * * 1"  # Every Monday at 06:00
  workflow_dispatch:        # Manual trigger

jobs:
  visual-qa:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Evalytic
        run: pip install evalytic

      - name: Run benchmark
        run: |
          evaly bench \
            -m flux-schnell \
            -p test-prompts.json \
            -o report.json \
            --yes
        env:
          FAL_KEY: ${{ secrets.FAL_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

      - name: Quality gate
        run: |
          evaly gate \
            --report report.json \
            --threshold 3.5 \
            --dimension-threshold visual_quality:4.0 \
            --baseline baseline.json

Typical Workflow

  1. Run evaly bench with -o report.json to generate a benchmark report
  2. Run evaly gate --report report.json with your thresholds
  3. Gate exits 0 (pass), 1 (fail), or 2 (error) — your pipeline acts accordingly

For regression detection, save your baseline report and compare against it:

# Save baseline after a known-good run
cp report.json baseline.json

# Later, compare new results against baseline
evaly gate --report new-report.json --baseline baseline.json