Reports
Understanding bench report sections: rankings, cost efficiency, consensus analysis, and score details.
Every evaly bench run produces three report formats: terminal (Rich), HTML (self-contained), and JSON.
All three contain the same data — this page documents each section and field.
Summary Band
The top-level banner shows the key result at a glance:
| Field | Description |
|---|---|
| Winner | Model with the highest overall score. |
| Score | Winner's overall score on the 1–5 scale. |
| n= | Sample count — number of prompts/inputs scored (excludes failures). |
| Best Value | Model with the highest Score/$ (only shown if different from winner). |
| Total cost | Sum of generation + judge + metrics costs for the entire run. |
| Duration | Wall-clock time for the entire bench run. |
Model Rankings Table
The main comparison table ranks all models by overall score. Columns are sortable in HTML reports.
| Column | Description |
|---|---|
| Rank | Position by overall score (1 = best). |
| Model | Model name. Winner gets a badge. |
| Dimension scores | Average VLM score per dimension (1–5). Color-coded: green ≥4, yellow ≥3, red <3. Hover for stddev and sample count. |
| Metrics | CLIP Score, LPIPS, face_similarity (when --metrics enabled). Shows ⚠ if below threshold, ✓ if above. |
| Overall | Mean of all dimension averages. Shows n=X sample count. Hover for ±stddev. |
| Conf | Average confidence (0–100%). The judge's self-assessed certainty in its scores. |
| Agree | Consensus mode only. Percentage of dimensions with "high" agreement across judges. |
| Success | Only shown when failures exist. Ratio of successful/total items. Failed items count as 0.0 across all dimensions. |
Dimension Markers
Dimension column headers may show markers indicating cross-model variance:
| Marker | Name | Meaning |
|---|---|---|
| ★ | Differentiator | Cross-model standard deviation ≥ 0.5. Models score very differently on this dimension — it's a key factor in the ranking. |
| ≈ | Ceiling | All models within 0.3 of each other and average ≥ 4.5. Everyone does well — this dimension doesn't help distinguish models. |
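The marker logic above can be sketched in a few lines of Python. This is an assumed implementation based on the documented thresholds, not the tool's actual code:

```python
# Sketch of the dimension-marker rules (assumed logic; thresholds
# match the table above).
from statistics import pstdev

def dimension_marker(model_scores: list[float]) -> str:
    """Classify a dimension given each model's average score on it."""
    spread = pstdev(model_scores)                 # cross-model std deviation
    mean = sum(model_scores) / len(model_scores)
    if spread >= 0.5:
        return "★"   # differentiator: models score very differently
    if max(model_scores) - min(model_scores) <= 0.3 and mean >= 4.5:
        return "≈"   # ceiling: everyone does well
    return ""        # no marker

print(dimension_marker([4.8, 3.2, 4.1]))  # → ★
print(dimension_marker([4.7, 4.8, 4.6]))  # → ≈
```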
Standard Deviation & Sample Count
Each score includes a standard deviation (σ) and sample count (n), visible on hover in HTML reports and inline in terminal for overall scores.
- ±stddev — Population standard deviation of per-item scores. Lower = more consistent. A model scoring 4.2 ±0.3 is more reliable than 4.2 ±1.5.
- n= — Number of items scored. Shown inline for overall scores. Higher n = more statistical confidence.
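For example, the ±stddev shown for an overall score is the population standard deviation of the per-item scores (a sketch with made-up scores):

```python
from statistics import mean, pstdev

scores = [4.5, 4.0, 4.5, 3.5, 4.5]   # hypothetical per-item scores
n = len(scores)
avg = mean(scores)                    # 4.2
sigma = pstdev(scores)                # population std deviation: 0.4
print(f"{avg:.1f} ±{sigma:.1f} (n={n})")   # 4.2 ±0.4 (n=5)
```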
Weighted Overall Score
When objective metrics (CLIP, LPIPS, face_similarity) are configured with weights via evalytic.toml,
the "Overall" column becomes "Overall (weighted)". This blends VLM judge averages with
normalized metric scores for a more robust ranking.
The weighted score is calculated as:
- Each metric value is linearly normalized to a 0–5 scale using its configured range (e.g., CLIP 0.18–0.35 → 0–5).
- If a metric is below its flag threshold, it's excluded (flagged) — it doesn't count toward the weighted score.
- The remaining VLM average gets weight 1 - sum(metric_weights), and each included metric gets its configured weight.
Default weights (when evalytic[metrics] is installed):
| Metric | Weight | Flag Threshold | Normalize Range |
|---|---|---|---|
| clip_score | 0.20 | 0.18 | 0.18 – 0.35 |
| lpips | 0.20 | 0.40 | 0.40 – 0.95 |
| face_similarity | 0.20 | 0.60 | 0.60 – 0.95 |
With all three metrics enabled and passing thresholds: VLM average gets 40% weight, each metric gets 20%.
When evalytic[metrics] is installed, CLIP/LPIPS are auto-enabled and weighted scoring activates automatically. Use --no-metrics to get a pure VLM dimension average (no weighting).
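The weighted-score arithmetic can be illustrated as follows. This is an assumed implementation of the rules described above (in particular, how flagged metrics' weight is redistributed to the VLM average is an assumption); the ranges and weights mirror the defaults table:

```python
# Assumed implementation of weighted overall scoring, for illustration.
def normalize(value: float, lo: float, hi: float) -> float:
    """Linearly map a metric value from [lo, hi] onto the 0-5 scale."""
    clipped = min(max(value, lo), hi)
    return 5.0 * (clipped - lo) / (hi - lo)

def weighted_overall(vlm_avg: float, metrics: dict) -> float:
    """metrics: {name: (value, weight, flag_threshold, (lo, hi))}"""
    included = {
        name: (normalize(v, lo, hi), w)
        for name, (v, w, thresh, (lo, hi)) in metrics.items()
        if v >= thresh                      # flagged metrics are excluded
    }
    metric_weight = sum(w for _, w in included.values())
    total = vlm_avg * (1 - metric_weight)   # VLM gets the remaining weight
    total += sum(score * w for score, w in included.values())
    return total

metrics = {
    "clip_score": (0.30, 0.20, 0.18, (0.18, 0.35)),
    "lpips":      (0.50, 0.20, 0.40, (0.40, 0.95)),
}
print(round(weighted_overall(4.2, metrics), 2))  # → 3.41
```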
Dimension Profile (Radar Chart)
When 3 or more dimensions are scored, the HTML report includes a radar chart overlaying each model's dimension averages. This makes it easy to spot where models excel or fall behind — for example, a model may have top visual quality but weak text rendering.
Cost Efficiency Table
Ranks models by score-per-dollar. Only shown when 2+ models are compared.
| Column | Description |
|---|---|
| Score | Overall quality score (1–5). |
| Cost/Image | Average generation cost per image for this model. |
| Score/$ | Quality divided by cost. Higher is more efficient. Use this to compare models at different price points. |
| vs Winner | Quality gap and cost comparison relative to the winner. Example: "-0.3 quality, 40% cheaper" means 0.3 points less quality at 40% lower cost. |
A model is labeled BEST VALUE when it has the highest Score/$ and is not the winner. The best value model offers the most quality for the money — often a better choice than the winner if the quality gap is small.
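The winner/best-value distinction boils down to two different maximizations (hypothetical models and prices):

```python
# Score/$ comparison with made-up numbers, for illustration only.
models = {
    "flux-pro":     {"score": 4.5, "cost_per_image": 0.05},
    "flux-schnell": {"score": 4.2, "cost_per_image": 0.003},
}
for m in models.values():
    m["score_per_dollar"] = m["score"] / m["cost_per_image"]

winner = max(models, key=lambda n: models[n]["score"])
best_value = max(models, key=lambda n: models[n]["score_per_dollar"])
print(winner, best_value)  # flux-pro wins on quality, flux-schnell on value
```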
Metric-VLM Correlation
When objective metrics (CLIP, LPIPS, face_similarity) are enabled, this table shows Pearson correlation between the metric and the VLM judge's corresponding dimension score.
| Field | Description |
|---|---|
| Pearson r | Correlation coefficient (-1 to +1). Higher absolute value = stronger agreement between metric and judge. |
| p-value | Statistical significance. Below 0.05 is generally meaningful. |
| Agreement | high_agreement (|r| ≥ 0.7), moderate (0.4 ≤ |r| < 0.7), or low_agreement (|r| < 0.4). |
High correlation validates that the VLM judge agrees with deterministic metrics. Low correlation may indicate the judge is biased — consider using consensus mode or switching judges.
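A minimal sketch of the correlation check, using a hand-rolled Pearson r and the agreement buckets from the table above (the per-image values are hypothetical):

```python
from math import sqrt

def pearson_r(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def agreement(r: float) -> str:
    # Labels as used in the table above.
    if abs(r) >= 0.7:
        return "high_agreement"
    if abs(r) >= 0.4:
        return "moderate"
    return "low_agreement"

clip  = [0.30, 0.22, 0.28, 0.19, 0.33]   # metric value per image
judge = [4.5, 3.0, 4.0, 2.5, 5.0]        # VLM dimension score per image
r = pearson_r(clip, judge)
print(agreement(r))  # → high_agreement
```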
Score Details (Per Image)
Each prompt/input has an expandable section showing per-model results:
| Field | Description |
|---|---|
| Image | Generated image (click to zoom in HTML reports). For img2img, the input image is shown as the first card with an accent border. |
| Overall score | Mean of dimension scores for this specific image. |
| Generation time | API response time in milliseconds. |
| Generation cost | Cost for this specific image generation. |
| Dimension breakdown | Per-dimension score, confidence, explanation, and evidence from the VLM judge. In consensus mode, also shows per-judge scores and agreement badge. |
| Metrics | CLIP Score, LPIPS, face_similarity values (when enabled). |
| Flags | Metric warnings (e.g., CLIP below threshold). |
Failed Items
When image generation fails (API error, timeout, content policy violation), the item card shows a red error message instead of the normal image and scores. Failed items are handled as follows:
- All dimension scores are counted as 0.0 (not skipped) — this penalizes unreliable models.
- The Success column appears in the rankings table showing the pass ratio (e.g., 4/5).
- The model's overall_score reflects the penalty — a model with 1 failure out of 5 items will have a ~20% lower score than if all items succeeded.
- Failed items do count toward total_items but not toward sample_count (n=).
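The penalty arithmetic in concrete terms (hypothetical per-item scores):

```python
# One failure out of five, counted as 0.0 per the rule above.
item_scores = [4.5, 4.0, 4.5, 4.0, 0.0]          # last item failed
overall = sum(item_scores) / len(item_scores)     # failures included: 3.4
without_failure = sum(item_scores[:4]) / 4        # 4.25
print(overall, without_failure)  # 3.4 is exactly 20% below 4.25
```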
Dimension Score Fields
Each dimension score in the details section includes:
- Score (1–5) — The final consensus or single-judge score.
- Confidence (0–100%) — Judge's self-assessed certainty. Low confidence may indicate ambiguous images.
- Explanation — Free-text rationale from the judge.
- Evidence — Specific observations supporting the score (e.g., "smooth edges", "no artifacts around face").
Metric Warnings
When a model's objective metric falls below its configured threshold, a warning is shown:
- The metric gets a ⚠ flag in the rankings table.
- A warning box lists all flagged metrics with their values.
- Flagged metrics are excluded from the weighted overall score to prevent low-quality outliers from distorting rankings.
Thresholds are configurable via evalytic.toml. See Configuration.
Cost Summary
Breakdown of total costs by category:
| Category | Description |
|---|---|
| fal.ai generation | Total cost for all image generations, with per-model breakdown. |
| Judge | VLM judge costs. In consensus mode, shows per-provider breakdown (e.g., "gemini-2.5-flash: $0.01, gpt-5.2: $0.02"). |
| Local metrics | Always $0.00 — CLIP, LPIPS, and face metrics run locally. |
| Total | Sum of all categories. |
Configuration
The report includes a collapsible "Configuration" section recording all settings used:
- Models — List of evaluated models.
- Judge — VLM judge (single or consensus).
- Judges — Consensus mode only. List of judges used.
- Dimensions — Scored dimensions.
- Pipeline — text2img or img2img.
- Metric Scoring — Thresholds and weights for CLIP/LPIPS/face.
- Evalytic Version — SDK version used.
- Platform — OS and Python version.
This ensures every report is fully reproducible.
Consensus Analysis
When running with --judges (2–3 judges), the report includes a Consensus Analysis panel.
This section helps you evaluate judge reliability and identify where judges disagree.
Agreement Levels
Each dimension on each image is classified into one of three agreement levels:
| Level | Meaning | How Score is Calculated |
|---|---|---|
| high | Two primary judges scored within 0.5 points of each other. | Average of the two judges' scores. |
| disputed | Two primary judges disagreed by more than 0.5 points. A third tiebreaker judge was called. | Median of all three judges' scores. |
| degraded | One judge failed (API error, timeout, etc.). | The surviving judge's score is used as-is. |
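The three rules above can be sketched as a single function. This is an assumed implementation of the documented behavior, not the tool's actual code:

```python
from statistics import median

def consensus(primary_a, primary_b, tiebreaker=None):
    """Return (score, level) per the agreement rules above (assumed logic)."""
    if primary_a is None or primary_b is None:    # one judge failed
        survivor = primary_a if primary_a is not None else primary_b
        return survivor, "degraded"
    if abs(primary_a - primary_b) <= 0.5:         # within 0.5 points
        return (primary_a + primary_b) / 2, "high"
    return median([primary_a, primary_b, tiebreaker]), "disputed"

print(consensus(4.5, 4.0))       # → (4.25, 'high')
print(consensus(5.0, 3.5, 4.0))  # → (4.0, 'disputed')
print(consensus(None, 4.0))      # → (4.0, 'degraded')
```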
Summary Statistics
The top of the consensus panel shows aggregate stats:
| Stat | Description | What to look for |
|---|---|---|
| High Agreement % | Percentage of (image, dimension) pairs where judges agreed. | ≥70% is good. Below 50% suggests judges have fundamentally different criteria. |
| Disputed % | Percentage requiring a tiebreaker. | High dispute rate increases cost (~3x instead of ~2.3x). |
| Degraded | Count of scores where a judge failed. | Should be 0. Non-zero means API reliability issues. |
| Total Scores | Total (image × dimension) pairs scored. | For context: 3 models × 5 prompts × 3 dimensions = 45 total. |
| Tiebreakers | Number of times the third judge was called. | Drives the cost multiplier above 2x. |
Judge Scoring Bias
A table showing each judge's average score and deviation from the consensus average:
| Column | Description |
|---|---|
| Judge | Full judge name (e.g., gemini-2.5-flash). |
| Role | primary (scores every dimension) or tiebreaker (only scores disputed dimensions). |
| Avg Score | Mean score this judge gave across all dimensions it scored. |
| vs Consensus | Deviation from the consensus average. Green (< 0.3) = close to consensus. Red (> 0.6) = significant bias. Tiebreakers show "n/a" because they only score disputed dimensions (selection bias). |
| Scores Given | Number of individual scores this judge produced. Primary judges score all dimensions; tiebreakers score fewer. |
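The bias computation amounts to comparing each judge's mean score against the consensus mean (a sketch with hypothetical scores; here the consensus per item is taken as the two primaries' average):

```python
# Per-judge scoring bias, illustrated with made-up scores.
judge_scores = {
    "gemini-2.5-flash": [4.5, 4.0, 5.0, 3.5],
    "gpt-5.2":          [4.0, 3.5, 4.5, 3.0],
}
n = 4
consensus = [sum(col) / 2 for col in zip(*judge_scores.values())]
consensus_avg = sum(consensus) / n            # 4.0

bias = {}
for judge, scores in judge_scores.items():
    avg = sum(scores) / n
    bias[judge] = avg - consensus_avg
    flag = "close" if abs(bias[judge]) < 0.3 else "bias"
    print(f"{judge}: avg={avg:.2f}, vs consensus {bias[judge]:+.2f} ({flag})")
```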
Disputed Dimensions
A collapsible list showing every dimension where judges disagreed. Each entry shows:
- Model/Dimension — Which model and dimension had the dispute.
- Consensus score — The final median score used in the report.
- Per-judge scores — What each judge gave. Scores far from the consensus are highlighted in red.
This is useful for identifying systematic disagreements. For example, if judges consistently dispute on text_rendering, it may mean that dimension is poorly defined for your use case.
Per-Image Consensus Data
In the Score Details section, each dimension score shows additional consensus fields:
| Field | Description |
|---|---|
| Agreement badge | high, disputed, or degraded — the agreement level for this specific (image, dimension). |
| Per-judge scores | Individual scores from each judge (e.g., "gemini-2.5-flash: 5.0, gpt-5.2: 4.0, claude-haiku-4-5: 5.0"). |
Output Formats
Terminal (Rich)
Always printed after a bench run. Uses Rich for colored tables and progress bars. Cannot be disabled.
HTML
Self-contained single-file HTML with embedded images (base64). Includes interactive features: sortable tables, radar chart, image lightbox, pagination. Generated with --html report.html or the html_report config option.
JSON
Machine-readable format containing all raw data: item-level scores, per-judge scores, metrics, costs, and configuration. Generated with --json report.json or the json_report config option. Useful for CI/CD pipelines and custom analysis.
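A typical CI use is gating a pipeline on the winner's score. The field names below ("winner", "models", "overall_score") are assumptions for illustration — check your generated report's actual schema:

```python
# Hypothetical CI quality gate over the JSON report.
# Field names are assumed, not the documented schema.
import json

report = json.loads("""{
  "winner": "flux-schnell",
  "models": {"flux-schnell": {"overall_score": 4.3}}
}""")

winner = report["winner"]
score = report["models"][winner]["overall_score"]
assert score >= 4.0, f"{winner} regressed below the 4.0 quality bar"
print(f"OK: {winner} scored {score}")
```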
Browser Review
Interactive human-in-the-loop review via --review. Opens a local HTTP server where you can adjust scores and add notes. Human scores are merged into the report and saved. See evaly bench → --review.
```shell
# Generate all formats
evaly bench -m flux-schnell -p prompts.json \
  --html report.html \
  --json report.json

# Generate + open browser review
evaly bench -m flux-schnell -p prompts.json --review
```