Dimensions
7 quality dimensions for evaluating AI-generated images.
Evalytic scores images across up to 7 semantic dimensions, each evaluated on a 1.0–5.0 scale (0.1 increments) by a VLM judge. Dimensions are grouped by pipeline type: text-to-image (text2img) and image-to-image (img2img).
All Dimensions
| Dimension | Pipeline | Images Evaluated | Description |
|---|---|---|---|
| visual_quality | Both | Output only | Overall visual quality: composition, lighting, color, sharpness. |
| prompt_adherence | text2img | Output only | How well the image matches the text prompt. |
| text_rendering | text2img | Output only | Quality of rendered text in the image (legibility, accuracy). |
| input_fidelity | img2img | Input + Output | How well the output preserves key elements from the input. |
| transformation_quality | img2img | Input + Output | Quality of the transformation applied (style transfer, enhancement). |
| artifact_detection | img2img | Input + Output | Presence of artifacts, glitches, or unwanted modifications. |
| identity_preservation | img2img | Input + Output | Face and identity preservation in transformations. Opt-in only. |
Scoring Scale
All dimensions use a 1.0–5.0 scale with 0.1 increments (e.g., 3.7, 4.2):
| Score | Meaning |
|---|---|
| 1 | Poor — Major issues, unusable |
| 2 | Below Average — Noticeable issues |
| 3 | Average — Acceptable with some issues |
| 4 | Good — High quality, minor issues |
| 5 | Excellent — Production-ready, no issues |
Text-to-Image Dimensions
visual_quality
Evaluates overall visual quality independent of prompt. Considers:
- Composition and framing
- Lighting and exposure
- Color accuracy and harmony
- Sharpness and detail
- Absence of visual artifacts
visual_quality is always included in auto-detection regardless of context.
The built-in sharpness metric (Variance of Laplacian) provides a deterministic cross-check for the sharpness component — no extra install required.
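Variance of Laplacian is a standard sharpness proxy: apply a Laplacian filter and take the variance of the response, since sharp images have more high-frequency detail. A minimal dependency-free sketch (the function name and array conventions are ours, not Evalytic's API):

```python
import numpy as np

def variance_of_laplacian(gray: np.ndarray) -> float:
    """Sharpness proxy: variance of the 3x3 Laplacian response.

    `gray` is a 2-D float array (grayscale image). Higher values
    indicate more high-frequency detail, i.e. a sharper image.
    """
    k = np.array([[0, 1, 0],
                  [1, -4, 1],
                  [0, 1, 0]], dtype=float)
    h, w = gray.shape
    # "Valid" convolution via shifted slices (kernel is symmetric,
    # so correlation and convolution coincide).
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * gray[i:i + h - 2, j:j + w - 2]
    return float(out.var())
```

A uniform image scores 0.0; any edge or texture raises the score. Note the raw value is resolution- and content-dependent, so it is best used for comparisons, not as an absolute threshold.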
prompt_adherence
Measures how accurately the generated image reflects the text prompt. Evaluates:
- Subject presence and accuracy
- Scene composition matching prompt description
- Attribute accuracy (colors, sizes, quantities)
- Spatial relationships
- Style and mood alignment
Requires a prompt. Auto-selected when --prompt is provided to evaly eval, or always included for evaly bench.
text_rendering
Evaluates the quality of text rendered within the image:
- Character accuracy (correct spelling)
- Legibility and readability
- Font consistency
- Text placement and integration
- Absence of garbled characters
Auto-selected when the prompt contains text-related keywords: "text", "word", "letter", "write", "say", "font", "type", "sign".
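The keyword trigger above can be sketched as a simple whole-word check. Whether Evalytic matches whole words or substrings is not specified here; this hypothetical helper assumes whole-word matching without stemming:

```python
import re

# Keywords listed in the docs that trigger text_rendering auto-selection.
TEXT_KEYWORDS = {"text", "word", "letter", "write", "say", "font", "type", "sign"}

def wants_text_rendering(prompt: str) -> bool:
    """True if the prompt contains any text-related keyword (whole-word match)."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & TEXT_KEYWORDS)
```

For example, "Sign: HELLO" matches on "sign", while "a serene mountain lake" matches nothing.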
Image-to-Image Dimensions
input_fidelity
Measures how well the output preserves important elements from the input image:
- Subject identity preservation
- Key feature retention
- Color palette consistency
- Structural integrity
transformation_quality
Evaluates the quality of the applied transformation:
- Smoothness and consistency of the transformation
- Appropriate level of modification
- Natural-looking results
- Effective application of the intended effect
artifact_detection
Checks for unwanted artifacts introduced during transformation:
- Edge artifacts and halos
- Color bleeding or banding
- Unnatural distortions
- Missing or duplicated elements
- Noise or grain introduction
For artifact_detection, a higher score means fewer artifacts: a score of 5 means no artifacts were detected.
identity_preservation
Evaluates face and identity preservation when transforming images containing people:
- Facial feature accuracy (eyes, nose, mouth, jawline)
- Skin tone and complexion consistency
- Body proportions and posture
- Expression preservation
Opt-in only: enable with -d identity_preservation.
When no human faces are detected in the input image, it automatically scores 5 (not applicable) so it doesn't penalize the overall score.
Pair with --metrics face for a deterministic ArcFace embedding comparison alongside the VLM assessment.
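The embedding comparison behind a face check like this reduces to cosine similarity between two embedding vectors. A minimal sketch, assuming the embeddings have already been extracted (the extraction model and the function name here are ours, not Evalytic's actual implementation):

```python
import numpy as np

def face_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two face embeddings, in [-1, 1].

    Values near 1.0 suggest the same identity. The threshold that
    counts as "preserved" is model- and use-case-specific.
    """
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))
```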
Weighting Dimensions
By default, all active dimensions contribute equally to the overall score (1/n). You can customize weights to
match your use case — for example, an e-commerce app might weight input_fidelity heavily
while a creative tool might prioritize visual_quality.
```bash
# CLI: JSON string
evaly bench -m flux-kontext -i inputs.json \
  --dim-weights '{"input_fidelity": 0.5, "visual_quality": 0.1}'
```

```toml
# evalytic.toml
[bench.dimension_weights]
input_fidelity = 0.5
visual_quality = 0.1
```
How weighting works:
- Specified weights are used directly
- Unspecified dimensions share the remaining weight equally
- Weights are normalized to sum to 1.0
- Reports show weight percentages in column headers when active
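The rules above can be sketched as follows (a hypothetical helper illustrating the documented behavior, not Evalytic's actual code):

```python
def resolve_weights(active_dims: list[str], user_weights: dict[str, float]) -> dict[str, float]:
    """Distribute dimension weights per the documented rules:
    specified weights are used directly, unspecified dimensions share
    the remaining weight equally, and the result sums to 1.0.
    """
    specified = {d: w for d, w in user_weights.items() if d in active_dims}
    rest = [d for d in active_dims if d not in specified]
    remaining = max(0.0, 1.0 - sum(specified.values()))
    weights = dict(specified)
    for d in rest:
        weights[d] = remaining / len(rest)
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}
```

With the weights from the example above and four active img2img dimensions, input_fidelity keeps 0.5, visual_quality keeps 0.1, and transformation_quality and artifact_detection each receive 0.2 of the remaining weight.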
Use case examples
| Use Case | Recommended Weights |
|---|---|
| E-commerce product photos | input_fidelity: 0.5, visual_quality: 0.3 |
| Portrait editing | identity_preservation: 0.5, visual_quality: 0.2 |
| Marketing creative | visual_quality: 0.4, prompt_adherence: 0.4 |
| Logo/text generation | text_rendering: 0.5, visual_quality: 0.3 |
Selecting Dimensions
You can select specific dimensions or rely on auto-detection:
```bash
# Auto-detect (recommended)
evaly bench -m flux-schnell -p "A cat"

# Explicit selection
evaly bench -m flux-schnell -p "A cat" -d visual_quality -d prompt_adherence

# All text2img dimensions
evaly bench -m flux-pro -p "Sign: HELLO" \
  -d visual_quality -d prompt_adherence -d text_rendering
```
Set default dimensions in evalytic.toml:
```toml
# evalytic.toml
[bench]
dimensions = ["visual_quality", "prompt_adherence", "text_rendering"]
```