evaly dataset
Manage evaluation datasets with metadata and expected scores.
evaly dataset <SUBCOMMAND>
Datasets are enriched prompt files with metadata and expected scores. They extend plain prompt arrays with structure for regression detection, golden test sets, and reproducible benchmarks. Use datasets to define what “good” looks like, then catch regressions automatically.
"prompts" objects, "inputs" objects — and normalize them automatically.
Your existing prompt files work as-is with evaly bench --dataset.
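The normalization described above can be sketched in Python. This is a simplified illustration of the idea, not evaly's actual loader; the helper name and defaulting behavior are assumptions.

```python
def normalize_dataset(data):
    """Normalize the accepted input shapes -- a plain list of prompt
    strings, or an object keyed by "prompts", "inputs", or "items" --
    into a uniform list of item dicts. Illustrative sketch only."""
    if isinstance(data, list):
        # Plain array: entries are prompt strings or item dicts.
        items = data
    else:
        items = data.get("items") or data.get("inputs") or data.get("prompts") or []
    normalized = []
    for item in items:
        if isinstance(item, str):
            item = {"prompt": item}
        # Fill in empty metadata/expected so downstream code can rely on them.
        normalized.append({"metadata": {}, "expected": {}, **item})
    return normalized
```

Because every shape collapses to the same item list, the rest of the tooling only ever sees one structure.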
Dataset Format
A dataset is a JSON file with the following structure:
{
"name": "product-photos-golden",
"description": "Golden test set for product photo generation",
"pipeline": "text2img",
"items": [
{
"prompt": "A product photo of white sneakers on marble",
"metadata": { "category": "footwear", "style": "minimal" },
"expected": { "visual_quality": 4.5, "prompt_adherence": 4.0 }
},
{
"prompt": "A modern minimalist logo for 'ACME Corp'",
"metadata": { "category": "branding" },
"expected": { "visual_quality": 4.0, "text_rendering": 3.5 }
}
]
}
For img2img datasets, items include image_url and instruction instead of prompt:
{
"name": "bg-editing-golden",
"pipeline": "img2img",
"items": [
{
"image_url": "https://example.com/product.jpg",
"instruction": "Place the product on a kitchen counter",
"expected": { "input_fidelity": 4.5, "transformation_quality": 4.0 }
}
]
}
Subcommands
| Subcommand | Description |
|---|---|
| create | Create a new empty dataset file. |
| show | Show dataset contents as a table. |
| add | Add an item to an existing dataset. |
| from-bench | Create a dataset from a bench report with expected scores. |
| validate | Validate a dataset file for errors and warnings. |
| stats | Show dataset statistics and expected score distribution. |
evaly dataset create
Creates a new empty dataset file with metadata.
| Flag | Type | Default | Description |
|---|---|---|---|
| --name, -n | TEXT | — | Required. Dataset name. |
| --pipeline | CHOICE | text2img | Pipeline type: text2img or img2img. |
| --description, -d | TEXT | — | Dataset description. |
| --output, -o | TEXT | <name>.json | Output file path. |
# Create a text2img dataset
evaly dataset create -n "product-photos" -d "Product photography golden set"
# Create an img2img dataset with custom output path
evaly dataset create -n "bg-editing" --pipeline img2img -o datasets/bg-editing.json
evaly dataset show
Displays dataset contents as a Rich table with prompts, metadata, and expected scores.
evaly dataset show product-photos.json
evaly dataset add
Adds a single item to an existing dataset. Preserves the original file format (plain array, "prompts", "inputs", or "items").
| Flag | Type | Description |
|---|---|---|
| --prompt | TEXT | Prompt text (for text2img items). |
| --image | TEXT | Input image URL (for img2img items). |
| --instruction | TEXT | Edit instruction (for img2img items). |
| --metadata, -m | TEXT (multiple) | Metadata as key=value pairs. |
| --expected, -e | TEXT (multiple) | Expected scores as dim:value pairs. |
text2img item
evaly dataset add product-photos.json \
--prompt "A product photo of white sneakers on marble" \
-m category=footwear \
-e visual_quality:4.5 -e prompt_adherence:4.0
img2img item
evaly dataset add bg-editing.json \
--image "https://example.com/product.jpg" \
--instruction "Place the product on a kitchen counter" \
-m scene=kitchen \
-e input_fidelity:4.5 -e transformation_quality:4.0
evaly dataset from-bench
Creates a dataset from a bench report JSON. Scores from the report become expected values, turning a successful benchmark into a golden test set for regression detection.
| Flag | Type | Default | Description |
|---|---|---|---|
| --output, -o | TEXT | — | Required. Output file path. |
| --min-score | FLOAT | — | Only include items with overall score ≥ this value. |
| --model | TEXT | — | Use scores from a specific model (default: winner or first). |
| --name, -n | TEXT | from-<report> | Dataset name. |
Basic usage
# Turn a bench report into a golden dataset
evaly dataset from-bench report.json -o golden.json
Filter by minimum score
# Only include high-scoring items (4.0+)
evaly dataset from-bench report.json --min-score 4.0 -o golden.json
Use a specific model's scores
# Use flux-pro scores as expected values
evaly dataset from-bench report.json --model flux-pro -o golden.json
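The conversion that from-bench performs can be sketched as follows. The report shape used here ({"results": [{"prompt", "scores", "overall"}]}) is a hypothetical assumption for illustration; the real evaly report schema may differ.

```python
def report_to_dataset(report, name, min_score=None):
    """Turn bench results into a golden dataset: observed scores become
    expected values. Assumes a hypothetical report shape:
    {"results": [{"prompt": str, "scores": {dim: val}, "overall": float}]}."""
    items = []
    for result in report.get("results", []):
        if min_score is not None and result.get("overall", 0.0) < min_score:
            continue  # skip low scorers, mirroring the --min-score filter
        items.append({
            "prompt": result["prompt"],
            "expected": dict(result["scores"]),  # scores become expectations
        })
    return {"name": name, "pipeline": "text2img", "items": items}
```

The key idea is unchanged regardless of schema details: a benchmark you are happy with is frozen into a baseline that future runs must meet.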
evaly dataset validate
Validates a dataset file and reports warnings. Checks for empty datasets, pipeline consistency
(e.g., img2img items missing image_url), and expected score ranges (0–5).
evaly dataset validate golden.json
# Output: Valid -- 12 items, no issues found.
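The checks listed above can be expressed as a short Python sketch (an illustration of the described rules, not evaly's implementation):

```python
def validate_dataset(dataset):
    """Collect validation issues analogous to those described above:
    empty datasets, img2img items missing image_url, and expected
    scores outside the 0-5 range. Illustrative sketch only."""
    issues = []
    items = dataset.get("items", [])
    if not items:
        issues.append("dataset has no items")
    for i, item in enumerate(items):
        if dataset.get("pipeline") == "img2img" and "image_url" not in item:
            issues.append(f"item {i}: img2img item missing image_url")
        for dim, value in item.get("expected", {}).items():
            if not 0 <= value <= 5:
                issues.append(f"item {i}: expected {dim}={value} outside 0-5")
    return issues
```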
evaly dataset stats
Shows dataset statistics including item count, expected score distribution (min/avg/max per dimension), and metadata key frequency.
evaly dataset stats golden.json
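The statistics described above reduce to a simple aggregation over items. A minimal sketch (not evaly's implementation, and the output shape is an assumption):

```python
def dataset_stats(dataset):
    """Compute per-dimension min/avg/max of expected scores and
    metadata key frequencies. Illustrative sketch only."""
    scores = {}     # dimension -> list of expected values
    meta_freq = {}  # metadata key -> occurrence count
    for item in dataset.get("items", []):
        for dim, value in item.get("expected", {}).items():
            scores.setdefault(dim, []).append(value)
        for key in item.get("metadata", {}):
            meta_freq[key] = meta_freq.get(key, 0) + 1
    summary = {
        dim: {"min": min(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for dim, vals in scores.items()
    }
    return {"items": len(dataset.get("items", [])),
            "expected": summary, "metadata_keys": meta_freq}
```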
Typical Workflow
The dataset commands support a “Bench → Golden Set → Regression Detection” workflow:
- Benchmark: Run evaly bench to generate and score images, saving a JSON report.
- Golden set: Use evaly dataset from-bench to extract high-scoring results as expected values.
- Regression detection: Run evaly bench --dataset golden.json --check-expected to compare future results against your baseline.
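The regression comparison at the heart of this workflow can be sketched as a per-dimension check. The tolerance parameter and function name here are assumptions for illustration; evaly's actual --check-expected logic may differ.

```python
def check_expected(actual_scores, expected_scores, tolerance=0.5):
    """Flag dimensions where an actual score falls more than `tolerance`
    below its expected value. Tolerance is an assumed parameter, not
    necessarily how evaly compares scores."""
    regressions = {}
    for dim, expected in expected_scores.items():
        actual = actual_scores.get(dim)
        if actual is not None and actual < expected - tolerance:
            regressions[dim] = {"expected": expected, "actual": actual}
    return regressions
```

An empty result means the run met its baseline; any entries identify the dimensions that regressed.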
Full Example
# 1. Create a dataset
evaly dataset create -n "sneakers" -d "Product photo test set"
# 2. Add test cases with expected scores
evaly dataset add sneakers.json \
--prompt "White sneakers on marble, studio lighting" \
-m category=footwear \
-e visual_quality:4.5 -e prompt_adherence:4.0
evaly dataset add sneakers.json \
--prompt "Running shoes on a track, action shot" \
-m category=footwear \
-e visual_quality:4.0 -e prompt_adherence:4.0
# 3. Run benchmark with dataset
evaly bench -m flux-schnell --dataset sneakers.json -o report.json -y
# 4. Check results against expected scores
evaly bench -m flux-schnell --dataset sneakers.json --check-expected -y
# Or: create a golden set from a successful report
evaly dataset from-bench report.json --min-score 4.0 -o golden.json