Reports

Understanding bench report sections: rankings, cost efficiency, consensus analysis, and score details.

Every evaly bench run produces three report formats: terminal (Rich), HTML (self-contained), and JSON. All three contain the same data — this page documents each section and field.

Summary Band

The top-level banner shows the key result at a glance:

  • Winner — Model with the highest overall score.
  • Score — The winner's overall score on the 1–5 scale.
  • n= — Sample count: the number of prompts/inputs scored (excludes failures).
  • Best Value — Model with the highest Score/$ (only shown if different from the winner).
  • Total cost — Sum of generation, judge, and metrics costs for the entire run.
  • Duration — Wall-clock time for the entire bench run.

When n < 10, a small sample warning is shown. Scores may vary significantly with more prompts. We recommend at least 10–20 prompts for publishable results.

Model Rankings Table

The main comparison table ranks all models by overall score. Columns are sortable in HTML reports.

  • Rank — Position by overall score (1 = best).
  • Model — Model name. The winner gets a badge.
  • Dimension scores — Average VLM score per dimension (1–5). Color-coded: green ≥ 4, yellow ≥ 3, red < 3. Hover for stddev and sample count.
  • Metrics — CLIP Score, LPIPS, face_similarity (when --metrics is enabled). Shows ⚠ if below threshold, ✓ if above.
  • Overall — Mean of all dimension averages. Shows the n=X sample count. Hover for ±stddev.
  • Conf — Average confidence (0–100%): the judge's self-assessed certainty in its scores.
  • Agree — Consensus mode only. Percentage of dimensions with "high" agreement across judges.
  • Success — Only shown when failures exist. Ratio of successful to total items. Failed items count as 0.0 across all dimensions.

Dimension Markers

Dimension column headers may show markers indicating cross-model variance:

  • Differentiator — Cross-model standard deviation ≥ 0.5. Models score very differently on this dimension, making it a key factor in the ranking.
  • Ceiling — All models within 0.3 of each other with an average ≥ 4.5. Everyone does well, so this dimension doesn't help distinguish models.
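
In sketch form, the marker logic amounts to the following checks over each model's per-dimension averages (hypothetical helper, not evalytic internals):

import statistics

def dimension_marker(per_model_avgs):
    """per_model_avgs: each model's average for one dimension, e.g. {"flux": 4.8, "sdxl": 4.6}."""
    values = list(per_model_avgs.values())
    if statistics.pstdev(values) >= 0.5:
        return "differentiator"   # models diverge: a key ranking factor
    if max(values) - min(values) <= 0.3 and statistics.fmean(values) >= 4.5:
        return "ceiling"          # everyone scores near the top: not discriminative
    return None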

Standard Deviation & Sample Count

Each score includes a standard deviation (σ) and sample count (n). In HTML reports these appear on hover; in the terminal they appear inline for overall scores.

  • ±stddev — Population standard deviation of per-item scores. Lower = more consistent. A model scoring 4.2 ±0.3 is more reliable than 4.2 ±1.5.
  • n= — Number of items scored. Shown inline for overall scores. Higher n = more statistical confidence.
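
For instance, both values can be reproduced from the per-item scores with Python's statistics module (hypothetical data):

import statistics

item_scores = [4.0, 4.5, 4.0, 4.5, 4.0]   # per-item scores for one model/dimension
mean = statistics.fmean(item_scores)       # 4.2
sigma = statistics.pstdev(item_scores)     # population stddev, ~0.24
n = len(item_scores)                       # shown as n=5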

Weighted Overall Score

When objective metrics (CLIP, LPIPS, face_similarity) are configured with weights via evalytic.toml, the "Overall" column becomes "Overall (weighted)". This blends VLM judge averages with normalized metric scores for a more robust ranking.

The weighted score is calculated as:

  1. Each metric value is linearly normalized to a 0–5 scale using its configured range (e.g., CLIP 0.18–0.35 → 0–5).
  2. If a metric is below its flag threshold, it's excluded (flagged) — it doesn't count toward the weighted score.
  3. The remaining VLM average gets weight 1 - sum(metric_weights), and each included metric gets its configured weight.
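
A minimal sketch of that calculation, with hypothetical function and argument names (not the evalytic API):

def weighted_overall(vlm_avg, metrics, specs):
    """vlm_avg: mean of the VLM dimension averages (1-5 scale).
    metrics: raw metric values, e.g. {"clip_score": 0.265}.
    specs: per-metric (weight, flag_threshold, range_lo, range_hi)."""
    included = []
    for name, value in metrics.items():
        weight, threshold, lo, hi = specs[name]
        if value < threshold:
            continue  # step 2: flagged metrics are excluded
        normalized = max(0.0, min(5.0, (value - lo) / (hi - lo) * 5.0))  # step 1
        included.append((weight, normalized))
    vlm_weight = 1.0 - sum(w for w, _ in included)  # step 3
    return vlm_weight * vlm_avg + sum(w * s for w, s in included)

# With only CLIP enabled: 0.265 normalizes to (0.265 - 0.18) / (0.35 - 0.18) * 5 = 2.5,
# so the result is 0.8 * 4.2 + 0.2 * 2.5 = 3.86.
print(weighted_overall(4.2, {"clip_score": 0.265}, {"clip_score": (0.20, 0.18, 0.18, 0.35)}))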

Default weights (when evalytic[metrics] is installed):

  • clip_score — weight 0.20, flag threshold 0.18, normalize range 0.18–0.35.
  • lpips — weight 0.20, flag threshold 0.40, normalize range 0.40–0.95.
  • face_similarity — weight 0.20, flag threshold 0.60, normalize range 0.60–0.95.

With all three metrics enabled and passing thresholds: VLM average gets 40% weight, each metric gets 20%.

When evalytic[metrics] is installed, CLIP/LPIPS are auto-enabled and weighted scoring activates automatically. Use --no-metrics to get a pure VLM dimension average (no weighting).

Dimension Profile (Radar Chart)

When 3 or more dimensions are scored, the HTML report includes a radar chart overlaying each model's dimension averages. This makes it easy to spot where models excel or fall behind — for example, a model may have top visual quality but weak text rendering.

Cost Efficiency Table

Ranks models by score-per-dollar. Only shown when 2+ models are compared.

  • Score — Overall quality score (1–5).
  • Cost/Image — Average generation cost per image for this model.
  • Score/$ — Quality divided by cost. Higher is more efficient. Use this to compare models at different price points.
  • vs Winner — Quality gap and cost comparison relative to the winner. Example: "-0.3 quality, 40% cheaper" means 0.3 points less quality at 40% lower cost.

A model is labeled BEST VALUE when it has the highest Score/$ and is not the winner. The best value model offers the most quality for the money — often a better choice than the winner if the quality gap is small.

Score/$ is a naive metric that favors cheap models. The "vs Winner" column provides context — check the quality gap before choosing the cheapest option.
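
A worked example of the efficiency ranking, with hypothetical numbers:

models = {
    "winner":     {"score": 4.5, "cost_per_image": 0.050},   # 4.5 / 0.050 = 90 score/$
    "challenger": {"score": 4.2, "cost_per_image": 0.030},   # 4.2 / 0.030 = 140 score/$
}
for stats in models.values():
    stats["score_per_dollar"] = stats["score"] / stats["cost_per_image"]
# The challenger has the higher Score/$ and is labeled BEST VALUE: 0.3 points
# less quality at 40% lower cost, as in the "vs Winner" example above.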

Metric-VLM Correlation

When objective metrics (CLIP, LPIPS, face_similarity) are enabled, this table shows Pearson correlation between the metric and the VLM judge's corresponding dimension score.

  • Pearson r — Correlation coefficient (-1 to +1). A higher absolute value means stronger agreement between metric and judge.
  • p-value — Statistical significance. Below 0.05 is generally meaningful.
  • Agreement — high_agreement (|r| ≥ 0.7), moderate (|r| ≥ 0.4), or low_agreement (|r| < 0.4).

High correlation validates that the VLM judge agrees with deterministic metrics. Low correlation may indicate the judge is biased — consider using consensus mode or switching judges.
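
The same check can be reproduced from a report's raw scores with scipy (hypothetical data; only the thresholds come from this page):

from scipy.stats import pearsonr

clip_scores = [0.21, 0.25, 0.30, 0.28, 0.33]   # per-image CLIP scores
judge_scores = [2.5, 3.0, 4.0, 3.5, 4.5]       # judge's matching dimension scores
r, p = pearsonr(clip_scores, judge_scores)
level = ("high_agreement" if abs(r) >= 0.7
         else "moderate" if abs(r) >= 0.4 else "low_agreement")
print(f"Pearson r={r:.2f}, p={p:.3f} -> {level}")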

Score Details (Per Image)

Each prompt/input has an expandable section showing per-model results:

  • Image — The generated image (click to zoom in HTML reports). For img2img, the input image is shown as the first card with an accent border.
  • Overall score — Mean of dimension scores for this specific image.
  • Generation time — API response time in milliseconds.
  • Generation cost — Cost for this specific image generation.
  • Dimension breakdown — Per-dimension score, confidence, explanation, and evidence from the VLM judge. In consensus mode, also shows per-judge scores and an agreement badge.
  • Metrics — CLIP Score, LPIPS, face_similarity values (when enabled).
  • Flags — Metric warnings (e.g., CLIP below threshold).

Failed Items

When image generation fails (API error, timeout, content policy violation), the item card shows a red error message instead of the normal image and scores. Failed items are handled as follows:

  • All dimension scores are counted as 0.0 (not skipped) — this penalizes unreliable models.
  • The Success column appears in the rankings table showing the pass ratio (e.g., 4/5).
  • The model's overall_score reflects the penalty — a model with 1 failure out of 5 items will have ~20% lower score than if all succeeded.
  • Failed items do count toward total_items but not toward sample_count (n=).
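
A quick worked example of the penalty, with hypothetical scores:

item_scores = [4.0, 4.0, 4.0, 4.0]   # 4 successful items, each averaging 4.0
n_failed = 1                          # 1 failed item, counted as 0.0
overall = sum(item_scores) / (len(item_scores) + n_failed)   # 16.0 / 5 = 3.2
# vs 4.0 if all 5 had succeeded: ~20% lower. n= shows 4; total_items is 5.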

Dimension Score Fields

Each dimension score in the details section includes:

  • Score (1–5) — The final consensus or single-judge score.
  • Confidence (0–100%) — Judge's self-assessed certainty. Low confidence may indicate ambiguous images.
  • Explanation — Free-text rationale from the judge.
  • Evidence — Specific observations supporting the score (e.g., "smooth edges", "no artifacts around face").

Metric Warnings

When a model's objective metric falls below its configured threshold, a warning is shown:

  • The metric gets a ⚠ flag in the rankings table.
  • A warning box lists all flagged metrics with their values.
  • Flagged metrics are excluded from the weighted overall score to prevent low-quality outliers from distorting rankings.

Thresholds are configurable via evalytic.toml. See Configuration.

Cost Summary

Breakdown of total costs by category:

  • fal.ai generation — Total cost for all image generations, with a per-model breakdown.
  • Judge — VLM judge costs. In consensus mode, shows a per-provider breakdown (e.g., "gemini-2.5-flash: $0.01, gpt-5.2: $0.02").
  • Local metrics — Always $0.00; CLIP, LPIPS, and face metrics run locally.
  • Total — Sum of all categories.

Configuration

The report includes a collapsible "Configuration" section recording all settings used:

  • Models — List of evaluated models.
  • Judge — VLM judge (single or consensus).
  • Judges — Consensus mode only. List of judges used.
  • Dimensions — Scored dimensions.
  • Pipeline — text2img or img2img.
  • Metric Scoring — Thresholds and weights for CLIP/LPIPS/face.
  • Evalytic Version — SDK version used.
  • Platform — OS and Python version.

This makes every report reproducible from its recorded settings.

Consensus Analysis

When running with --judges (2–3 judges), the report includes a Consensus Analysis panel. This section helps you evaluate judge reliability and identify where judges disagree.

For background on the consensus algorithm, see Judges → Consensus Mode.

Agreement Levels

Each dimension on each image is classified into one of three agreement levels:

  • high — The two primary judges scored within 0.5 points of each other. Score: the average of the two judges' scores.
  • disputed — The two primary judges disagreed by more than 0.5 points, so a third tiebreaker judge was called. Score: the median of all three judges' scores.
  • degraded — One judge failed (API error, timeout, etc.). Score: the surviving judge's score, used as-is.
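
In code, the rule looks roughly like this (hypothetical helper, not the evalytic source):

import statistics

def consensus_score(primary_a, primary_b, call_tiebreaker):
    """primary_a/primary_b: the two primary judges' scores (None if that judge failed).
    call_tiebreaker: invokes the third judge only when needed."""
    if primary_a is None or primary_b is None:
        survivor = primary_a if primary_a is not None else primary_b
        return survivor, "degraded"
    if abs(primary_a - primary_b) <= 0.5:
        return (primary_a + primary_b) / 2, "high"
    third = call_tiebreaker()
    return statistics.median([primary_a, primary_b, third]), "disputed"

print(consensus_score(5.0, 4.0, lambda: 5.0))   # (5.0, 'disputed'): median of 5, 4, 5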

Summary Statistics

The top of the consensus panel shows aggregate stats:

  • High Agreement % — Percentage of (image, dimension) pairs where the judges agreed. ≥70% is good; below 50% suggests the judges have fundamentally different criteria.
  • Disputed % — Percentage of pairs requiring a tiebreaker. A high dispute rate increases cost (~3x instead of ~2.3x).
  • Degraded — Count of scores where a judge failed. Should be 0; non-zero means API reliability issues.
  • Total Scores — Total (image × dimension) pairs scored. For context: 3 models × 5 prompts × 3 dimensions = 45 total.
  • Tiebreakers — Number of times the third judge was called. Drives the cost multiplier above 2x.

Judge Scoring Bias

A table showing each judge's average score and deviation from the consensus average:

  • Judge — Full judge name (e.g., gemini-2.5-flash).
  • Role — primary (scores every dimension) or tiebreaker (scores only disputed dimensions).
  • Avg Score — Mean score this judge gave across all dimensions it scored.
  • vs Consensus — Deviation from the consensus average. Green (< 0.3) means close to consensus; red (> 0.6) indicates significant bias. Tiebreakers show "n/a" because they only score disputed dimensions (selection bias).
  • Scores Given — Number of individual scores this judge produced. Primary judges score all dimensions; tiebreakers score fewer.

Tiebreaker bias is not directly comparable to primary judges. The tiebreaker only scores dimensions where primary judges disagreed — these are inherently harder to score, so the tiebreaker's average reflects the difficulty of disputed dimensions, not overall scoring tendency.
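
In sketch form, the bias value is a judge's mean signed deviation from the consensus over the scores it actually produced (hypothetical helper):

def scoring_bias(judge_scores, consensus_scores):
    """judge_scores/consensus_scores: aligned lists over the dimensions this judge scored."""
    deltas = [j - c for j, c in zip(judge_scores, consensus_scores)]
    return sum(deltas) / len(deltas)   # |bias| < 0.3 renders green; > 0.6 renders red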

Disputed Dimensions

A collapsible list showing every dimension where judges disagreed. Each entry shows:

  • Model/Dimension — Which model and dimension had the dispute.
  • Consensus score — The final median score used in the report.
  • Per-judge scores — What each judge gave. Scores far from the consensus are highlighted in red.

This is useful for identifying systematic disagreements. For example, if judges consistently disagree on text_rendering, that dimension may be poorly defined for your use case.

Per-Image Consensus Data

In the Score Details section, each dimension score shows additional consensus fields:

  • Agreement badge — high, disputed, or degraded: the agreement level for this specific (image, dimension) pair.
  • Per-judge scores — Individual scores from each judge (e.g., "gemini-2.5-flash: 5.0, gpt-5.2: 4.0, claude-haiku-4-5: 5.0").

Output Formats

Terminal (Rich)

Always printed after a bench run. Uses Rich for colored tables and progress bars. Cannot be disabled.

HTML

Self-contained single-file HTML with embedded images (base64). Includes interactive features: sortable tables, radar chart, image lightbox, pagination. Generated with --html report.html or the html_report config option.

JSON

Machine-readable format containing all raw data: item-level scores, per-judge scores, metrics, costs, and configuration. Generated with --json report.json or the json_report config option. Useful for CI/CD pipelines and custom analysis.
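
For example, a CI quality gate might read the JSON report like this. The key names here ("models", "name", "overall_score") are assumptions for illustration; check your report's actual schema:

import json, sys

with open("report.json") as f:
    report = json.load(f)

best = max(report["models"], key=lambda m: m["overall_score"])   # assumed schema
if best["overall_score"] < 4.0:
    sys.exit(f"Quality gate failed: {best['name']} scored {best['overall_score']:.2f}")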

Browser Review

Interactive human-in-the-loop review via --review. Opens a local HTTP server where you can adjust scores and add notes. Human scores are merged into the report and saved. See evaly bench → --review.

# Generate all formats
evaly bench -m flux-schnell -p prompts.json \
    --html report.html \
    --json report.json

# Generate + open browser review
evaly bench -m flux-schnell -p prompts.json --review