Reports

Understanding bench report sections: rankings, cost efficiency, consensus analysis, and score details.

Every evaly bench run produces three report formats: terminal (Rich), HTML (self-contained), and JSON. All three contain the same data — this page documents each section and field.

Summary Band

The top-level banner shows the key result at a glance:

  • Winner — Model with the highest overall score.
  • Score — The winner's overall score on the 1–5 scale.
  • n= — Sample count: the number of prompts/inputs scored (excludes failures).
  • Best Value — Model with the highest Score/$ (only shown if different from the winner).
  • Total cost — Sum of generation, judge, and metrics costs for the entire run.
  • Duration — Wall-clock time for the entire bench run.

When n < 10, a small sample warning is shown. Scores may vary significantly with more prompts. We recommend at least 10–20 prompts for publishable results.

Model Rankings Table

The main comparison table ranks all models by overall score. Columns are sortable in HTML reports.

  • Rank — Position by overall score (1 = best).
  • Model — Model name. The winner gets a badge.
  • Dimension scores — Average VLM score per dimension (1–5). Color-coded: green ≥ 4, yellow ≥ 3, red < 3. Hover for stddev and sample count.
  • Metrics — CLIP Score, LPIPS, face_similarity (when --metrics is enabled). Shows ⚠ if below threshold, ✓ if above.
  • Overall — Mean of all dimension averages. Shows the n=X sample count. Hover for ±stddev.
  • Conf — Average confidence (0–100%): the judge's self-assessed certainty in its scores.
  • Agree — Consensus mode only. Percentage of dimensions with "high" agreement across judges.
  • Success — Only shown when failures exist. Ratio of successful to total items. Failed items count as 0.0 across all dimensions.

Dimension Markers

Dimension column headers may show markers indicating cross-model variance:

  • Differentiator — Cross-model standard deviation ≥ 0.5. Models score very differently on this dimension, making it a key factor in the ranking.
  • Ceiling — All models within 0.3 of each other with an average ≥ 4.5. Everyone does well, so this dimension doesn't help distinguish models.
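
In sketch form, the marker logic amounts to the following checks over each model's per-dimension averages (hypothetical helper, not evalytic internals):

import statistics

def dimension_marker(per_model_avgs):
    """per_model_avgs: each model's average for one dimension, e.g. {"flux": 4.8, "sdxl": 4.6}."""
    values = list(per_model_avgs.values())
    if statistics.pstdev(values) >= 0.5:
        return "differentiator"   # models diverge: a key ranking factor
    if max(values) - min(values) <= 0.3 and statistics.fmean(values) >= 4.5:
        return "ceiling"          # everyone scores near the top: not discriminative
    return None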

Standard Deviation & Sample Count

Each score includes a standard deviation (σ) and sample count (n). In HTML reports these appear on hover; in the terminal they appear inline for overall scores.

  • ±stddev — Population standard deviation of per-item scores. Lower = more consistent. A model scoring 4.2 ±0.3 is more reliable than 4.2 ±1.5.
  • n= — Number of items scored. Shown inline for overall scores. Higher n = more statistical confidence.
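
For instance, both values can be reproduced from the per-item scores with Python's statistics module (hypothetical data):

import statistics

item_scores = [4.0, 4.5, 4.0, 4.5, 4.0]   # per-item scores for one model/dimension
mean = statistics.fmean(item_scores)       # 4.2
sigma = statistics.pstdev(item_scores)     # population stddev, ~0.24
n = len(item_scores)                       # shown as n=5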

Weighted Overall Score

When objective metrics (CLIP, LPIPS, face_similarity) are configured with weights via evalytic.toml, the "Overall" column becomes "Overall (weighted)". This blends VLM judge averages with normalized metric scores for a more robust ranking.

The weighted score is calculated as:

  1. Each metric value is linearly normalized to a 0–5 scale using its configured range (e.g., CLIP 0.18–0.35 → 0–5).
  2. If a metric is below its flag threshold, it's excluded (flagged) — it doesn't count toward the weighted score.
  3. The remaining VLM average gets weight 1 - sum(metric_weights), and each included metric gets its configured weight.
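
A minimal sketch of that calculation, with hypothetical function and argument names (not the evalytic API):

def weighted_overall(vlm_avg, metrics, specs):
    """vlm_avg: mean of the VLM dimension averages (1-5 scale).
    metrics: raw metric values, e.g. {"clip_score": 0.265}.
    specs: per-metric (weight, flag_threshold, range_lo, range_hi)."""
    included = []
    for name, value in metrics.items():
        weight, threshold, lo, hi = specs[name]
        if value < threshold:
            continue  # step 2: flagged metrics are excluded
        normalized = max(0.0, min(5.0, (value - lo) / (hi - lo) * 5.0))  # step 1
        included.append((weight, normalized))
    vlm_weight = 1.0 - sum(w for w, _ in included)  # step 3
    return vlm_weight * vlm_avg + sum(w * s for w, s in included)

# With only CLIP enabled: 0.265 normalizes to (0.265 - 0.18) / (0.35 - 0.18) * 5 = 2.5,
# so the result is 0.8 * 4.2 + 0.2 * 2.5 = 3.86.
print(weighted_overall(4.2, {"clip_score": 0.265}, {"clip_score": (0.20, 0.18, 0.18, 0.35)}))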

Default weights (when evalytic[metrics] is installed):

  • clip_score — weight 0.20, flag threshold 0.18, normalize range 0.18–0.35.
  • lpips — weight 0.20, flag threshold 0.40, normalize range 0.40–0.95.
  • face_similarity — weight 0.20, flag threshold 0.60, normalize range 0.60–0.95.

With all three metrics enabled and passing thresholds: VLM average gets 40% weight, each metric gets 20%.

When evalytic[metrics] is installed, CLIP/LPIPS are auto-enabled and weighted scoring activates automatically. Use --no-metrics to get a pure VLM dimension average (no weighting).

Dimension Profile (Radar Chart)

When 3 or more dimensions are scored, the HTML report includes a radar chart overlaying each model's dimension averages. This makes it easy to spot where models excel or fall behind — for example, a model may have top visual quality but weak text rendering.

Cost Efficiency Table

Ranks models by score-per-dollar. Only shown when 2+ models are compared.

  • Score — Overall quality score (1–5).
  • Cost/Image — Average generation cost per image for this model.
  • Score/$ — Quality divided by cost. Higher is more efficient. Use this to compare models at different price points.
  • vs Winner — Quality gap and cost comparison relative to the winner. Example: "-0.3 quality, 40% cheaper" means 0.3 points less quality at 40% lower cost.

A model is labeled BEST VALUE when it has the highest Score/$ and is not the winner. The best value model offers the most quality for the money — often a better choice than the winner if the quality gap is small.

Score/$ is a naive metric that favors cheap models. The "vs Winner" column provides context — check the quality gap before choosing the cheapest option.
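
A worked example of the efficiency ranking, with hypothetical numbers:

models = {
    "winner":     {"score": 4.5, "cost_per_image": 0.050},   # 4.5 / 0.050 = 90 score/$
    "challenger": {"score": 4.2, "cost_per_image": 0.030},   # 4.2 / 0.030 = 140 score/$
}
for stats in models.values():
    stats["score_per_dollar"] = stats["score"] / stats["cost_per_image"]
# The challenger has the higher Score/$ and is labeled BEST VALUE: 0.3 points
# less quality at 40% lower cost, as in the "vs Winner" example above.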

Metric-VLM Correlation

When objective metrics (CLIP, LPIPS, face_similarity) are enabled, this table shows Pearson correlation between the metric and the VLM judge's corresponding dimension score.

  • Pearson r — Correlation coefficient (-1 to +1). A higher absolute value means stronger agreement between metric and judge.
  • p-value — Statistical significance. Below 0.05 is generally meaningful.
  • Agreement — high_agreement (|r| ≥ 0.7), moderate (|r| ≥ 0.4), or low_agreement (|r| < 0.4).

High correlation validates that the VLM judge agrees with deterministic metrics. Low correlation may indicate the judge is biased — consider using consensus mode or switching judges.
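
The same check can be reproduced from a report's raw scores with scipy (hypothetical data; only the thresholds come from this page):

from scipy.stats import pearsonr

clip_scores = [0.21, 0.25, 0.30, 0.28, 0.33]   # per-image CLIP scores
judge_scores = [2.5, 3.0, 4.0, 3.5, 4.5]       # judge's matching dimension scores
r, p = pearsonr(clip_scores, judge_scores)
level = ("high_agreement" if abs(r) >= 0.7
         else "moderate" if abs(r) >= 0.4 else "low_agreement")
print(f"Pearson r={r:.2f}, p={p:.3f} -> {level}")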

Score Details (Per Image)

Each prompt/input has an expandable section showing per-model results:

  • Image — The generated image (click to zoom in HTML reports). For img2img, the input image is shown as the first card with an accent border.
  • Overall score — Mean of dimension scores for this specific image.
  • Generation time — API response time in milliseconds.
  • Generation cost — Cost for this specific image generation.
  • Dimension breakdown — Per-dimension score, confidence, explanation, and evidence from the VLM judge. In consensus mode, also shows per-judge scores and an agreement badge.
  • Metrics — CLIP Score, LPIPS, face_similarity values (when enabled).
  • Flags — Metric warnings (e.g., CLIP below threshold).

Failed Items

When image generation fails (API error, timeout, content policy violation), the item card shows a red error message instead of the normal image and scores. Failed items are handled as follows:

  • All dimension scores are counted as 0.0 (not skipped) — this penalizes unreliable models.
  • The Success column appears in the rankings table showing the pass ratio (e.g., 4/5).
  • The model's overall_score reflects the penalty — a model with 1 failure out of 5 items will have ~20% lower score than if all succeeded.
  • Failed items do count toward total_items but not toward sample_count (n=).
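
A quick worked example of the penalty, with hypothetical scores:

item_scores = [4.0, 4.0, 4.0, 4.0]   # 4 successful items, each averaging 4.0
n_failed = 1                          # 1 failed item, counted as 0.0
overall = sum(item_scores) / (len(item_scores) + n_failed)   # 16.0 / 5 = 3.2
# vs 4.0 if all 5 had succeeded: ~20% lower. n= shows 4; total_items is 5.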

Dimension Score Fields

Each dimension score in the details section includes:

  • Score (1–5) — The final consensus or single-judge score.
  • Confidence (0–100%) — Judge's self-assessed certainty. Low confidence may indicate ambiguous images.
  • Explanation — Free-text rationale from the judge.
  • Evidence — Specific observations supporting the score (e.g., "smooth edges", "no artifacts around face").

Metric Warnings

When a model's objective metric falls below its configured threshold, a warning is shown:

  • The metric gets a ⚠ flag in the rankings table.
  • A warning box lists all flagged metrics with their values.
  • Flagged metrics are excluded from the weighted overall score to prevent low-quality outliers from distorting rankings.

Thresholds are configurable via evalytic.toml. See Configuration.

Cost Summary

Breakdown of total costs by category:

  • fal.ai generation — Total cost for all image generations, with a per-model breakdown.
  • Judge — VLM judge costs. In consensus mode, shows a per-provider breakdown (e.g., "gemini-2.5-flash: $0.01, gpt-5.2: $0.02").
  • Local metrics — Always $0.00; CLIP, LPIPS, and face metrics run locally.
  • Total — Sum of all categories.

Configuration

The report includes a collapsible "Configuration" section recording all settings used:

  • Models — List of evaluated models.
  • Judge — VLM judge (single or consensus).
  • Judges — Consensus mode only. List of judges used.
  • Dimensions — Scored dimensions.
  • Pipeline — text2img or img2img.
  • Metric Scoring — Thresholds and weights for CLIP/LPIPS/face.
  • Evalytic Version — SDK version used.
  • Platform — OS and Python version.

This makes every report reproducible from its recorded settings.

Consensus Analysis

When running with --judges (2–3 judges), the report includes a Consensus Analysis panel. This section helps you evaluate judge reliability and identify where judges disagree.

For background on the consensus algorithm, see Judges → Consensus Mode.

Agreement Levels

Each dimension on each image is classified into one of three agreement levels:

  • high — The two primary judges scored within 0.5 points of each other. Score: the average of the two judges' scores.
  • disputed — The two primary judges disagreed by more than 0.5 points, so a third tiebreaker judge was called. Score: the median of all three judges' scores.
  • degraded — One judge failed (API error, timeout, etc.). Score: the surviving judge's score, used as-is.
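
In code, the rule looks roughly like this (hypothetical helper, not the evalytic source):

import statistics

def consensus_score(primary_a, primary_b, call_tiebreaker):
    """primary_a/primary_b: the two primary judges' scores (None if that judge failed).
    call_tiebreaker: invokes the third judge only when needed."""
    if primary_a is None or primary_b is None:
        survivor = primary_a if primary_a is not None else primary_b
        return survivor, "degraded"
    if abs(primary_a - primary_b) <= 0.5:
        return (primary_a + primary_b) / 2, "high"
    third = call_tiebreaker()
    return statistics.median([primary_a, primary_b, third]), "disputed"

print(consensus_score(5.0, 4.0, lambda: 5.0))   # (5.0, 'disputed'): median of 5, 4, 5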

Summary Statistics

The top of the consensus panel shows aggregate stats:

  • High Agreement % — Percentage of (image, dimension) pairs where the judges agreed. ≥70% is good; below 50% suggests the judges have fundamentally different criteria.
  • Disputed % — Percentage of pairs requiring a tiebreaker. A high dispute rate increases cost (~3x instead of ~2.3x).
  • Degraded — Count of scores where a judge failed. Should be 0; non-zero means API reliability issues.
  • Total Scores — Total (image × dimension) pairs scored. For context: 3 models × 5 prompts × 3 dimensions = 45 total.
  • Tiebreakers — Number of times the third judge was called. Drives the cost multiplier above 2x.

Judge Scoring Bias

A table showing each judge's average score and deviation from the consensus average:

  • Judge — Full judge name (e.g., gemini-2.5-flash).
  • Role — primary (scores every dimension) or tiebreaker (scores only disputed dimensions).
  • Avg Score — Mean score this judge gave across all dimensions it scored.
  • vs Consensus — Deviation from the consensus average. Green (< 0.3) means close to consensus; red (> 0.6) indicates significant bias. Tiebreakers show "n/a" because they only score disputed dimensions (selection bias).
  • Scores Given — Number of individual scores this judge produced. Primary judges score all dimensions; tiebreakers score fewer.

Tiebreaker bias is not directly comparable to primary judges. The tiebreaker only scores dimensions where primary judges disagreed — these are inherently harder to score, so the tiebreaker's average reflects the difficulty of disputed dimensions, not overall scoring tendency.
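
In sketch form, the bias value is a judge's mean signed deviation from the consensus over the scores it actually produced (hypothetical helper):

def scoring_bias(judge_scores, consensus_scores):
    """judge_scores/consensus_scores: aligned lists over the dimensions this judge scored."""
    deltas = [j - c for j, c in zip(judge_scores, consensus_scores)]
    return sum(deltas) / len(deltas)   # |bias| < 0.3 renders green; > 0.6 renders red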

Disputed Dimensions

A collapsible list showing every dimension where judges disagreed. Each entry shows:

  • Model/Dimension — Which model and dimension had the dispute.
  • Consensus score — The final median score used in the report.
  • Per-judge scores — What each judge gave. Scores far from the consensus are highlighted in red.

This is useful for identifying systematic disagreements. For example, if judges consistently disagree on text_rendering, that dimension may be poorly defined for your use case.

Per-Image Consensus Data

In the Score Details section, each dimension score shows additional consensus fields:

  • Agreement badge — high, disputed, or degraded: the agreement level for this specific (image, dimension) pair.
  • Per-judge scores — Individual scores from each judge (e.g., "gemini-2.5-flash: 5.0, gpt-5.2: 4.0, claude-haiku-4-5: 5.0").

Output Formats

Terminal (Rich)

Always printed after a bench run. Uses Rich for colored tables and progress bars. Cannot be disabled.

HTML

Self-contained single-file HTML with embedded images (base64). Includes interactive features: sortable tables, radar chart, image lightbox, pagination. Generated with --html report.html or the html_report config option.

JSON

Machine-readable format containing all raw data: item-level scores, per-judge scores, metrics, costs, and configuration. Generated with --json report.json or the json_report config option. Useful for CI/CD pipelines and custom analysis.
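
For example, a CI quality gate might read the JSON report like this. The key names here ("models", "name", "overall_score") are assumptions for illustration; check your report's actual schema:

import json, sys

with open("report.json") as f:
    report = json.load(f)

best = max(report["models"], key=lambda m: m["overall_score"])   # assumed schema
if best["overall_score"] < 4.0:
    sys.exit(f"Quality gate failed: {best['name']} scored {best['overall_score']:.2f}")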

Browser Review

Interactive human-in-the-loop review via --review. Opens a local HTTP server where you can adjust scores and add notes. Human scores are merged into the report and saved. See evaly bench → --review.

# Generate all formats
evaly bench -m flux-schnell -p prompts.json \
    --html report.html \
    --json report.json

# Generate + open browser review
evaly bench -m flux-schnell -p prompts.json --review