Dimensions

7 quality dimensions for evaluating AI-generated images.

Evalytic scores images across up to 7 semantic dimensions, each evaluated on a 1.0–5.0 scale (0.1 increments) by a VLM judge. Dimensions are grouped by pipeline type: text-to-image (text2img) and image-to-image (img2img).

All Dimensions

Dimension                Pipeline   Images Evaluated   Description
visual_quality           Both       Output only        Overall visual quality: composition, lighting, color, sharpness.
prompt_adherence         text2img   Output only        How well the image matches the text prompt.
text_rendering           text2img   Output only        Quality of rendered text in the image (legibility, accuracy).
input_fidelity           img2img    Input + Output     How well the output preserves key elements from the input.
transformation_quality   img2img    Input + Output     Quality of the transformation applied (style transfer, enhancement).
artifact_detection       img2img    Input + Output     Presence of artifacts, glitches, or unwanted modifications.
identity_preservation    img2img    Input + Output     Face and identity preservation in transformations. Opt-in only.

Scoring Scale

All dimensions use a 1.0–5.0 scale with 0.1 increments (e.g., 3.7, 4.2):

Score   Meaning
1       Poor — Major issues, unusable
2       Below Average — Noticeable issues
3       Average — Acceptable with some issues
4       Good — High quality, minor issues
5       Excellent — Production-ready, no issues
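As a concrete sketch of the scale's granularity, a raw judge score can be clamped to the 1.0–5.0 range and snapped to 0.1 increments like this (`clamp_score` is an illustrative helper, not part of the evaly CLI):

```python
def clamp_score(raw: float) -> float:
    """Clamp a raw judge score to the 1.0-5.0 scale, rounded to 0.1 increments."""
    clamped = min(5.0, max(1.0, raw))
    return round(clamped, 1)

print(clamp_score(3.74))  # 3.7
print(clamp_score(6.2))   # 5.0
```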

Text-to-Image Dimensions

visual_quality

Evaluates overall visual quality, independent of the prompt. Considers:

  • Composition and framing
  • Lighting and exposure
  • Color accuracy and harmony
  • Sharpness and detail
  • Absence of visual artifacts

visual_quality is always included in auto-detection regardless of context. The built-in sharpness metric (Variance of Laplacian) provides a deterministic cross-check for the sharpness component — no extra install required.

prompt_adherence

Measures how accurately the generated image reflects the text prompt. Evaluates:

  • Subject presence and accuracy
  • Scene composition matching prompt description
  • Attribute accuracy (colors, sizes, quantities)
  • Spatial relationships
  • Style and mood alignment

Requires a prompt. Auto-selected when --prompt is provided to evaly eval; always included for evaly bench.

text_rendering

Evaluates the quality of text rendered within the image:

  • Character accuracy (correct spelling)
  • Legibility and readability
  • Font consistency
  • Text placement and integration
  • Absence of garbled characters

Auto-selected when the prompt contains text-related keywords: "text", "word", "letter", "write", "say", "font", "type", "sign".
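Taken together, the text2img auto-detection rules above can be sketched as follows. `select_dimensions` and `TEXT_KEYWORDS` are illustrative names, not part of the evaly API, and the exact-word matching here is an assumption — the real keyword matching may differ:

```python
# Keywords that trigger text_rendering, as listed above.
TEXT_KEYWORDS = {"text", "word", "letter", "write", "say", "font", "type", "sign"}

def select_dimensions(prompt=None):
    """Sketch of text2img dimension auto-detection."""
    dims = ["visual_quality"]            # always included
    if prompt:
        dims.append("prompt_adherence")  # requires a prompt
        words = set(prompt.lower().split())
        if words & TEXT_KEYWORDS:        # exact-word match; an approximation
            dims.append("text_rendering")
    return dims

print(select_dimensions("A sign that says HELLO"))
# ['visual_quality', 'prompt_adherence', 'text_rendering']
```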

Image-to-Image Dimensions

input_fidelity

Measures how well the output preserves important elements from the input image:

  • Subject identity preservation
  • Key feature retention
  • Color palette consistency
  • Structural integrity

transformation_quality

Evaluates the quality of the applied transformation:

  • Smoothness and consistency of the transformation
  • Appropriate level of modification
  • Natural-looking results
  • Effective application of the intended effect

artifact_detection

Checks for unwanted artifacts introduced during transformation:

  • Edge artifacts and halos
  • Color bleeding or banding
  • Unnatural distortions
  • Missing or duplicated elements
  • Noise or grain introduction

For artifact_detection, a higher score means fewer artifacts. Score 5 = no artifacts detected.

identity_preservation

Evaluates face and identity preservation when transforming images containing people:

  • Facial feature accuracy (eyes, nose, mouth, jawline)
  • Skin tone and complexion consistency
  • Body proportions and posture
  • Expression preservation

Opt-in only. This dimension is not included in auto-detection. Add it explicitly with -d identity_preservation. When no human faces are detected in the input image, it automatically scores 5 (not applicable) so it doesn't penalize the overall score.

Pair with --metrics face for a deterministic ArcFace embedding comparison alongside the VLM assessment.
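The not-applicable behavior can be pictured as a guard in front of the judge call. `count_faces` and `judge_identity` below are hypothetical placeholder callables, not evaly API:

```python
def score_identity_preservation(input_img, output_img, count_faces, judge_identity):
    """Sketch of the opt-in identity check with its not-applicable guard."""
    if count_faces(input_img) == 0:
        return 5.0  # no faces: not applicable, does not drag down the overall score
    return judge_identity(input_img, output_img)

# Stubbed usage: no faces detected, so the judge is never consulted.
score = score_identity_preservation(
    "in.png", "out.png",
    count_faces=lambda img: 0,
    judge_identity=lambda a, b: 4.2,
)
print(score)  # 5.0
```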

Weighting Dimensions

By default, all active dimensions contribute equally to the overall score (1/n). You can customize weights to match your use case — for example, an e-commerce app might weight input_fidelity heavily while a creative tool might prioritize visual_quality.

# CLI: JSON string
evaly bench -m flux-kontext -i inputs.json \
    --dim-weights '{"input_fidelity": 0.5, "visual_quality": 0.1}'

# evalytic.toml
[bench.dimension_weights]
input_fidelity = 0.5
visual_quality = 0.1

How weighting works:

  • Specified weights are used directly
  • Unspecified dimensions share the remaining weight equally
  • Weights are normalized to sum to 1.0
  • Reports show weight percentages in column headers when active
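Under those rules, the effective weights can be computed roughly like this (`effective_weights` is an illustrative helper, not the evaly implementation):

```python
def effective_weights(active_dims, specified):
    """Sketch of weight resolution: specified weights are kept, the remainder
    is shared equally among unspecified dimensions, then all are normalized."""
    remaining = 1.0 - sum(specified.values())
    unspecified = [d for d in active_dims if d not in specified]
    share = remaining / len(unspecified) if unspecified else 0.0
    weights = {d: specified.get(d, share) for d in active_dims}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}  # normalize to sum to 1.0

print(effective_weights(
    ["input_fidelity", "visual_quality", "artifact_detection"],
    {"input_fidelity": 0.5, "visual_quality": 0.1},
))
# {'input_fidelity': 0.5, 'visual_quality': 0.1, 'artifact_detection': 0.4}
```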

Use case examples

Use Case                    Recommended Weights
E-commerce product photos   input_fidelity: 0.5, visual_quality: 0.3
Portrait editing            identity_preservation: 0.5, visual_quality: 0.2
Marketing creative          visual_quality: 0.4, prompt_adherence: 0.4
Logo/text generation        text_rendering: 0.5, visual_quality: 0.3

Selecting Dimensions

You can select specific dimensions or rely on auto-detection:

# Auto-detect (recommended)
evaly bench -m flux-schnell -p "A cat"

# Explicit selection
evaly bench -m flux-schnell -p "A cat" -d visual_quality -d prompt_adherence

# All text2img dimensions
evaly bench -m flux-pro -p "Sign: HELLO" \
    -d visual_quality -d prompt_adherence -d text_rendering

Set default dimensions in evalytic.toml:

# evalytic.toml
[bench]
dimensions = ["visual_quality", "prompt_adherence", "text_rendering"]