evaly dataset
Manage evaluation datasets with metadata and expected scores.
evaly dataset <SUBCOMMAND>
Datasets are enriched prompt files with metadata and expected scores. They extend plain prompt arrays with structure for regression detection, golden test sets, and reproducible benchmarks. Use datasets to define what “good” looks like, then catch regressions automatically.
"prompts" objects, "inputs" objects — and normalize them automatically.
Your existing prompt files work as-is with evaly bench --dataset.
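The normalization described above can be sketched in Python. This is a simplified illustration of the idea, not evaly's actual loader; the helper name and defaulting behavior are assumptions.

```python
def normalize_dataset(data):
    """Normalize the accepted input shapes -- a plain list of prompt
    strings, or an object keyed by "prompts", "inputs", or "items" --
    into a uniform list of item dicts. Illustrative sketch only."""
    if isinstance(data, list):
        # Plain array: entries are prompt strings or item dicts.
        items = data
    else:
        items = data.get("items") or data.get("inputs") or data.get("prompts") or []
    normalized = []
    for item in items:
        if isinstance(item, str):
            item = {"prompt": item}
        # Fill in empty metadata/expected so downstream code can rely on them.
        normalized.append({"metadata": {}, "expected": {}, **item})
    return normalized
```

Because every shape collapses to the same item list, the rest of the tooling only ever sees one structure.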
Dataset Format
A dataset is a JSON file with the following structure:
{
"name": "product-photos-golden",
"description": "Golden test set for product photo generation",
"pipeline": "text2img",
"items": [
{
"prompt": "A product photo of white sneakers on marble",
"metadata": { "category": "footwear", "style": "minimal" },
"expected": { "visual_quality": 4.5, "prompt_adherence": 4.0 }
},
{
"prompt": "A modern minimalist logo for 'ACME Corp'",
"metadata": { "category": "branding" },
"expected": { "visual_quality": 4.0, "text_rendering": 3.5 }
}
]
}
For img2img datasets, items include image_url and instruction instead of prompt:
{
"name": "bg-editing-golden",
"pipeline": "img2img",
"items": [
{
"image_url": "https://example.com/product.jpg",
"instruction": "Place the product on a kitchen counter",
"expected": { "input_fidelity": 4.5, "transformation_quality": 4.0 }
}
]
}
Subcommands
| Subcommand | Description |
|---|---|
| create | Create a new empty dataset file. |
| show | Show dataset contents as a table. |
| add | Add an item to an existing dataset. |
| from-bench | Create a dataset from a bench report with expected scores. |
| validate | Validate a dataset file for errors and warnings. |
| stats | Show dataset statistics and expected score distribution. |
evaly dataset create
Creates a new empty dataset file with metadata.
| Flag | Type | Default | Description |
|---|---|---|---|
| --name, -n | TEXT | — | Required. Dataset name. |
| --pipeline | CHOICE | text2img | Pipeline type: text2img or img2img. |
| --description, -d | TEXT | — | Dataset description. |
| --output, -o | TEXT | <name>.json | Output file path. |
# Create a text2img dataset
evaly dataset create -n "product-photos" -d "Product photography golden set"
# Create an img2img dataset with custom output path
evaly dataset create -n "bg-editing" --pipeline img2img -o datasets/bg-editing.json
evaly dataset show
Displays dataset contents as a Rich table with prompts, metadata, and expected scores.
evaly dataset show product-photos.json
evaly dataset add
Adds a single item to an existing dataset. Preserves the original file format (plain array, "prompts", "inputs", or "items").
| Flag | Type | Description |
|---|---|---|
| --prompt | TEXT | Prompt text (for text2img items). |
| --image | TEXT | Input image URL (for img2img items). |
| --instruction | TEXT | Edit instruction (for img2img items). |
| --metadata, -m | TEXT (multiple) | Metadata as key=value pairs. |
| --expected, -e | TEXT (multiple) | Expected scores as dim:value pairs. |
text2img item
evaly dataset add product-photos.json \
--prompt "A product photo of white sneakers on marble" \
-m category=footwear \
-e visual_quality:4.5 -e prompt_adherence:4.0
img2img item
evaly dataset add bg-editing.json \
--image "https://example.com/product.jpg" \
--instruction "Place the product on a kitchen counter" \
-m scene=kitchen \
-e input_fidelity:4.5 -e transformation_quality:4.0
evaly dataset from-bench
Creates a dataset from a bench report JSON. Scores from the report become expected values, turning a successful benchmark into a golden test set for regression detection.
| Flag | Type | Default | Description |
|---|---|---|---|
| --output, -o | TEXT | — | Required. Output file path. |
| --min-score | FLOAT | — | Only include items with overall score ≥ this value. |
| --model | TEXT | — | Use scores from a specific model (default: winner or first). |
| --name, -n | TEXT | from-<report> | Dataset name. |
Basic usage
# Turn a bench report into a golden dataset
evaly dataset from-bench report.json -o golden.json
Filter by minimum score
# Only include high-scoring items (4.0+)
evaly dataset from-bench report.json --min-score 4.0 -o golden.json
Use a specific model's scores
# Use flux-pro scores as expected values
evaly dataset from-bench report.json --model flux-pro -o golden.json
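The conversion that from-bench performs can be sketched as follows. The report shape used here ({"results": [{"prompt", "scores", "overall"}]}) is a hypothetical assumption for illustration; the real evaly report schema may differ.

```python
def report_to_dataset(report, name, min_score=None):
    """Turn bench results into a golden dataset: observed scores become
    expected values. Assumes a hypothetical report shape:
    {"results": [{"prompt": str, "scores": {dim: val}, "overall": float}]}."""
    items = []
    for result in report.get("results", []):
        if min_score is not None and result.get("overall", 0.0) < min_score:
            continue  # skip low scorers, mirroring the --min-score filter
        items.append({
            "prompt": result["prompt"],
            "expected": dict(result["scores"]),  # scores become expectations
        })
    return {"name": name, "pipeline": "text2img", "items": items}
```

The key idea is unchanged regardless of schema details: a benchmark you are happy with is frozen into a baseline that future runs must meet.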
evaly dataset validate
Validates a dataset file and reports warnings. Checks for empty datasets, pipeline consistency
(e.g., img2img items missing image_url), and expected score ranges (0–5).
evaly dataset validate golden.json
# Output: Valid -- 12 items, no issues found.
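The checks listed above can be expressed as a short Python sketch (an illustration of the described rules, not evaly's implementation):

```python
def validate_dataset(dataset):
    """Collect validation issues analogous to those described above:
    empty datasets, img2img items missing image_url, and expected
    scores outside the 0-5 range. Illustrative sketch only."""
    issues = []
    items = dataset.get("items", [])
    if not items:
        issues.append("dataset has no items")
    for i, item in enumerate(items):
        if dataset.get("pipeline") == "img2img" and "image_url" not in item:
            issues.append(f"item {i}: img2img item missing image_url")
        for dim, value in item.get("expected", {}).items():
            if not 0 <= value <= 5:
                issues.append(f"item {i}: expected {dim}={value} outside 0-5")
    return issues
```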
evaly dataset stats
Shows dataset statistics including item count, expected score distribution (min/avg/max per dimension), and metadata key frequency.
evaly dataset stats golden.json
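The statistics described above reduce to a simple aggregation over items. A minimal sketch (not evaly's implementation, and the output shape is an assumption):

```python
def dataset_stats(dataset):
    """Compute per-dimension min/avg/max of expected scores and
    metadata key frequencies. Illustrative sketch only."""
    scores = {}     # dimension -> list of expected values
    meta_freq = {}  # metadata key -> occurrence count
    for item in dataset.get("items", []):
        for dim, value in item.get("expected", {}).items():
            scores.setdefault(dim, []).append(value)
        for key in item.get("metadata", {}):
            meta_freq[key] = meta_freq.get(key, 0) + 1
    summary = {
        dim: {"min": min(vals), "avg": sum(vals) / len(vals), "max": max(vals)}
        for dim, vals in scores.items()
    }
    return {"items": len(dataset.get("items", [])),
            "expected": summary, "metadata_keys": meta_freq}
```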
Typical Workflow
The dataset commands support a “Bench → Golden Set → Regression Detection” workflow:
- Benchmark: Run evaly bench to generate and score images, saving a JSON report.
- Golden set: Use evaly dataset from-bench to extract high-scoring results as expected values.
- Regression detection: Run evaly bench --dataset golden.json --check-expected to compare future results against your baseline.
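The regression comparison at the heart of this workflow can be sketched as a per-dimension check. The tolerance parameter and function name here are assumptions for illustration; evaly's actual --check-expected logic may differ.

```python
def check_expected(actual_scores, expected_scores, tolerance=0.5):
    """Flag dimensions where an actual score falls more than `tolerance`
    below its expected value. Tolerance is an assumed parameter, not
    necessarily how evaly compares scores."""
    regressions = {}
    for dim, expected in expected_scores.items():
        actual = actual_scores.get(dim)
        if actual is not None and actual < expected - tolerance:
            regressions[dim] = {"expected": expected, "actual": actual}
    return regressions
```

An empty result means the run met its baseline; any entries identify the dimensions that regressed.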
Full Example
# 1. Create a dataset
evaly dataset create -n "sneakers" -d "Product photo test set"
# 2. Add test cases with expected scores
evaly dataset add sneakers.json \
--prompt "White sneakers on marble, studio lighting" \
-m category=footwear \
-e visual_quality:4.5 -e prompt_adherence:4.0
evaly dataset add sneakers.json \
--prompt "Running shoes on a track, action shot" \
-m category=footwear \
-e visual_quality:4.0 -e prompt_adherence:4.0
# 3. Run benchmark with dataset
evaly bench -m flux-schnell --dataset sneakers.json -o report.json -y
# 4. Check results against expected scores
evaly bench -m flux-schnell --dataset sneakers.json --check-expected -y
# Or: create a golden set from a successful report
evaly dataset from-bench report.json --min-score 4.0 -o golden.json