Dimensions

7 quality dimensions for evaluating AI-generated images.

Evalytic scores images across up to 7 semantic dimensions, each evaluated on a 1.0–5.0 scale (0.1 increments) by a VLM judge. Dimensions are grouped by pipeline type: text-to-image (text2img) and image-to-image (img2img).

All Dimensions

Dimension                Pipeline   Images Evaluated   Description
visual_quality           Both       Output only        Overall visual quality: composition, lighting, color, sharpness.
prompt_adherence         text2img   Output only        How well the image matches the text prompt.
text_rendering           text2img   Output only        Quality of rendered text in the image (legibility, accuracy).
input_fidelity           img2img    Input + Output     How well the output preserves key elements from the input.
transformation_quality   img2img    Input + Output     Quality of the transformation applied (style transfer, enhancement).
artifact_detection       img2img    Input + Output     Presence of artifacts, glitches, or unwanted modifications.
identity_preservation    img2img    Input + Output     Face and identity preservation in transformations. Opt-in only.

Scoring Scale

All dimensions use a 1.0–5.0 scale with 0.1 increments (e.g., 3.7, 4.2):

Score   Meaning
1       Poor — Major issues, unusable
2       Below Average — Noticeable issues
3       Average — Acceptable with some issues
4       Good — High quality, minor issues
5       Excellent — Production-ready, no issues
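As a concrete sketch of the scale's granularity, a raw judge score can be clamped to the 1.0–5.0 range and snapped to 0.1 increments like this (`clamp_score` is an illustrative helper, not part of the evaly CLI):

```python
def clamp_score(raw: float) -> float:
    """Clamp a raw judge score to the 1.0-5.0 scale, rounded to 0.1 increments."""
    clamped = min(5.0, max(1.0, raw))
    return round(clamped, 1)

print(clamp_score(3.74))  # 3.7
print(clamp_score(6.2))   # 5.0
```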

Text-to-Image Dimensions

visual_quality

Evaluates overall visual quality, independent of the prompt. Considers:

  • Composition and framing
  • Lighting and exposure
  • Color accuracy and harmony
  • Sharpness and detail
  • Absence of visual artifacts

visual_quality is always included in auto-detection regardless of context. The built-in sharpness metric (Variance of Laplacian) provides a deterministic cross-check for the sharpness component — no extra install required.

prompt_adherence

Measures how accurately the generated image reflects the text prompt. Evaluates:

  • Subject presence and accuracy
  • Scene composition matching prompt description
  • Attribute accuracy (colors, sizes, quantities)
  • Spatial relationships
  • Style and mood alignment

Requires a prompt. Auto-selected when --prompt is provided to evaly eval; always included for evaly bench.

text_rendering

Evaluates the quality of text rendered within the image:

  • Character accuracy (correct spelling)
  • Legibility and readability
  • Font consistency
  • Text placement and integration
  • Absence of garbled characters

Auto-selected when the prompt contains text-related keywords: "text", "word", "letter", "write", "say", "font", "type", "sign".
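Taken together, the text2img auto-detection rules above can be sketched as follows. `select_dimensions` and `TEXT_KEYWORDS` are illustrative names, not part of the evaly API, and the exact-word matching here is an assumption — the real keyword matching may differ:

```python
# Keywords that trigger text_rendering, as listed above.
TEXT_KEYWORDS = {"text", "word", "letter", "write", "say", "font", "type", "sign"}

def select_dimensions(prompt=None):
    """Sketch of text2img dimension auto-detection."""
    dims = ["visual_quality"]            # always included
    if prompt:
        dims.append("prompt_adherence")  # requires a prompt
        words = set(prompt.lower().split())
        if words & TEXT_KEYWORDS:        # exact-word match; an approximation
            dims.append("text_rendering")
    return dims

print(select_dimensions("A sign that says HELLO"))
# ['visual_quality', 'prompt_adherence', 'text_rendering']
```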

Image-to-Image Dimensions

input_fidelity

Measures how well the output preserves important elements from the input image:

  • Subject identity preservation
  • Key feature retention
  • Color palette consistency
  • Structural integrity

transformation_quality

Evaluates the quality of the applied transformation:

  • Smoothness and consistency of the transformation
  • Appropriate level of modification
  • Natural-looking results
  • Effective application of the intended effect

artifact_detection

Checks for unwanted artifacts introduced during transformation:

  • Edge artifacts and halos
  • Color bleeding or banding
  • Unnatural distortions
  • Missing or duplicated elements
  • Noise or grain introduction

For artifact_detection, a higher score means fewer artifacts. Score 5 = no artifacts detected.

identity_preservation

Evaluates face and identity preservation when transforming images containing people:

  • Facial feature accuracy (eyes, nose, mouth, jawline)
  • Skin tone and complexion consistency
  • Body proportions and posture
  • Expression preservation

Opt-in only. This dimension is not included in auto-detection. Add it explicitly with -d identity_preservation. When no human faces are detected in the input image, it automatically scores 5 (not applicable) so it doesn't penalize the overall score.

Pair with --metrics face for a deterministic ArcFace embedding comparison alongside the VLM assessment.
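The not-applicable behavior can be pictured as a guard in front of the judge call. `count_faces` and `judge_identity` below are hypothetical placeholder callables, not evaly API:

```python
def score_identity_preservation(input_img, output_img, count_faces, judge_identity):
    """Sketch of the opt-in identity check with its not-applicable guard."""
    if count_faces(input_img) == 0:
        return 5.0  # no faces: not applicable, does not drag down the overall score
    return judge_identity(input_img, output_img)

# Stubbed usage: no faces detected, so the judge is never consulted.
score = score_identity_preservation(
    "in.png", "out.png",
    count_faces=lambda img: 0,
    judge_identity=lambda a, b: 4.2,
)
print(score)  # 5.0
```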

Weighting Dimensions

By default, all active dimensions contribute equally to the overall score (1/n). You can customize weights to match your use case — for example, an e-commerce app might weight input_fidelity heavily while a creative tool might prioritize visual_quality.

# CLI: JSON string
evaly bench -m flux-kontext -i inputs.json \
    --dim-weights '{"input_fidelity": 0.5, "visual_quality": 0.1}'

# evalytic.toml
[bench.dimension_weights]
input_fidelity = 0.5
visual_quality = 0.1

How weighting works:

  • Specified weights are used directly
  • Unspecified dimensions share the remaining weight equally
  • Weights are normalized to sum to 1.0
  • Reports show weight percentages in column headers when active
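Under those rules, the effective weights can be computed roughly like this (`effective_weights` is an illustrative helper, not the evaly implementation):

```python
def effective_weights(active_dims, specified):
    """Sketch of weight resolution: specified weights are kept, the remainder
    is shared equally among unspecified dimensions, then all are normalized."""
    remaining = 1.0 - sum(specified.values())
    unspecified = [d for d in active_dims if d not in specified]
    share = remaining / len(unspecified) if unspecified else 0.0
    weights = {d: specified.get(d, share) for d in active_dims}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}  # normalize to sum to 1.0

print(effective_weights(
    ["input_fidelity", "visual_quality", "artifact_detection"],
    {"input_fidelity": 0.5, "visual_quality": 0.1},
))
# {'input_fidelity': 0.5, 'visual_quality': 0.1, 'artifact_detection': 0.4}
```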

Use case examples

Use Case                    Recommended Weights
E-commerce product photos   input_fidelity: 0.5, visual_quality: 0.3
Portrait editing            identity_preservation: 0.5, visual_quality: 0.2
Marketing creative          visual_quality: 0.4, prompt_adherence: 0.4
Logo/text generation        text_rendering: 0.5, visual_quality: 0.3

Selecting Dimensions

You can select specific dimensions or rely on auto-detection:

# Auto-detect (recommended)
evaly bench -m flux-schnell -p "A cat"

# Explicit selection
evaly bench -m flux-schnell -p "A cat" -d visual_quality -d prompt_adherence

# All text2img dimensions
evaly bench -m flux-pro -p "Sign: HELLO" \
    -d visual_quality -d prompt_adherence -d text_rendering

Set default dimensions in evalytic.toml:

# evalytic.toml
[bench]
dimensions = ["visual_quality", "prompt_adherence", "text_rendering"]