evaly dataset
Manage evaluation datasets across visual, RAG, text, and agent workflows.
evaly dataset <SUBCOMMAND>
Datasets are reusable evaluation inputs with metadata and optional expected scores. They extend plain prompt arrays with structure for regression detection, golden test sets, and reproducible benchmarks across visual, RAG, text, and agent workflows.
The type field takes one of five canonical values: "text2img", "img2img", "rag", "text", or "agent". Legacy formats such as plain string arrays, "prompts", and "inputs" still normalize automatically.
Dataset Format
A dataset is a JSON file with a top-level type and an items array:
{
"name": "product-photos-golden",
"description": "Golden test set for product photo generation",
"type": "text2img",
"pipeline": "text2img",
"items": [
{
"prompt": "A product photo of white sneakers on marble",
"metadata": { "category": "footwear", "style": "minimal" },
"expected": { "visual_quality": 4.5, "prompt_adherence": 4.0 }
},
{
"prompt": "A modern minimalist logo for 'ACME Corp'",
"metadata": { "category": "branding" },
"expected": { "visual_quality": 4.0, "text_rendering": 3.5 }
}
]
}
For img2img datasets, items include image_url and instruction instead of prompt:
{
"name": "bg-editing-golden",
"type": "img2img",
"pipeline": "img2img",
"items": [
{
"image_url": "https://example.com/product.jpg",
"instruction": "Place the product on a kitchen counter",
"expected": { "input_fidelity": 4.5, "transformation_quality": 4.0 }
}
]
}
For rag, text, and agent datasets, use items objects with the same
fields expected by the matching CLI commands. dataset show, dataset validate, and
dataset stats understand all five canonical dataset types.
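The format rules above, together with the legacy shapes mentioned in the introduction, can be sketched as a small normalizer. This is an illustration only, not evaly's actual code: the function names, the exact layout of the legacy "prompts" and "inputs" files (assumed here to be objects holding a list of strings), the fallback from "pipeline" to "type", and the text2img default are all assumptions.

```python
import json
from typing import Any

CANONICAL_TYPES = {"text2img", "img2img", "rag", "text", "agent"}


def _as_item(entry: Any) -> dict:
    # Bare strings become {"prompt": ...}; dict items are copied unchanged.
    return {"prompt": entry} if isinstance(entry, str) else dict(entry)


def normalize_dataset(raw: Any, default_type: str = "text2img") -> dict:
    """Return a dict with "type" and "items" keys (hypothetical helper)."""
    # Assumed legacy shape: a plain JSON array of prompt strings.
    if isinstance(raw, list):
        return {"type": default_type, "items": [_as_item(e) for e in raw]}
    if not isinstance(raw, dict):
        raise ValueError("dataset must be a JSON array or object")

    # Canonical files use "items"; "prompts" and "inputs" are the legacy keys.
    for key in ("items", "prompts", "inputs"):
        if key in raw:
            entries = raw[key]
            break
    else:
        raise ValueError("dataset has no items, prompts, or inputs key")

    # Assumption: "pipeline" acts as a fallback when "type" is absent.
    dataset_type = raw.get("type") or raw.get("pipeline") or default_type
    if dataset_type not in CANONICAL_TYPES:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")

    normalized = {k: v for k, v in raw.items()
                  if k not in ("items", "prompts", "inputs")}
    normalized["type"] = dataset_type
    normalized["items"] = [_as_item(e) for e in entries]
    return normalized


# Usage: a legacy plain array becomes a canonical dataset dict.
legacy = json.loads('["A red bicycle", "A blue teapot"]')
print(json.dumps(normalize_dataset(legacy), indent=2))
```

Copying every other top-level key (name, description, and so on) through unchanged keeps the sketch from silently dropping fields it does not know about.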
Subcommands
| Subcommand | Description |
|---|---|
| create | Create a new empty dataset file. |
| show | Show dataset contents as a table. |
| add | Add an item to an existing dataset. |
| from-bench | Create a dataset from a bench report with expected scores. |
| validate | Validate a dataset file for errors and warnings. |
| stats | Show dataset statistics and expected score distribution. |
evaly dataset create
Creates a new empty dataset file with metadata.
| Flag | Type | Default | Description |
|---|---|---|---|
| --name, -n | TEXT | — | Required. Dataset name. |
| --type | CHOICE | text2img | Canonical dataset type: text2img, img2img, rag, text, or agent. |
| --pipeline | CHOICE | — | Legacy alias for visual dataset types only: text2img or img2img. |
| --description, -d | TEXT | — | Dataset description. |
| --output, -o | TEXT | <name>.json | Output file path. |
# Create a text2img dataset
evaly dataset create -n "product-photos" --type text2img -d "Product photography golden set"
# Create an img2img dataset with custom output path
evaly dataset create -n "bg-editing" --type img2img -o datasets/bg-editing.json
# Create a RAG dataset shell
evaly dataset create -n "rag-golden" --type rag
dataset create, show, validate, and stats work across all dataset types. dataset add and from-bench are still visual-first helpers.
evaly dataset show
Displays dataset contents as a Rich table with prompts, metadata, and expected scores.
evaly dataset show product-photos.json
evaly dataset add
Adds a single item to an existing dataset. Preserves the original file format (plain array, "prompts", "inputs", or "items").
| Flag | Type | Description |
|---|---|---|
| --prompt | TEXT | Prompt text (for text2img items). |
| --image | TEXT | Input image URL (for img2img items). |
| --instruction | TEXT | Edit instruction (for img2img items). |
| --metadata, -m | TEXT (multiple) | Metadata as key=value pairs. |
| --expected, -e | TEXT (multiple) | Expected scores as dim:value pairs. |
text2img item
evaly dataset add product-photos.json \
--prompt "A product photo of white sneakers on marble" \
-m category=footwear \
-e visual_quality:4.5 -e prompt_adherence:4.0
img2img item
evaly dataset add bg-editing.json \
--image "https://example.com/product.jpg" \
--instruction "Place the product on a kitchen counter" \
-m scene=kitchen \
-e input_fidelity:4.5 -e transformation_quality:4.0
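To make the key=value and dim:value conventions concrete, here is a minimal Python sketch of how such pairs could be turned into the metadata and expected objects of a dataset item. The helper names are hypothetical and the parsing rules (metadata values kept as strings, scores converted to floats) are assumptions, not a description of evaly's own parser.

```python
def parse_metadata(pairs: list[str]) -> dict[str, str]:
    """Parse ["category=footwear", ...] into a metadata dict."""
    metadata = {}
    for pair in pairs:
        # Split on the first "=" so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        metadata[key] = value
    return metadata


def parse_expected(pairs: list[str]) -> dict[str, float]:
    """Parse ["visual_quality:4.5", ...] into an expected-score dict."""
    expected = {}
    for pair in pairs:
        # Split on the last ":" and convert the score to a float.
        dim, sep, value = pair.rpartition(":")
        if not sep or not dim:
            raise ValueError(f"expected dim:value, got {pair!r}")
        expected[dim] = float(value)
    return expected


# Usage: build the item that the text2img example above describes.
item = {
    "prompt": "A product photo of white sneakers on marble",
    "metadata": parse_metadata(["category=footwear"]),
    "expected": parse_expected(["visual_quality:4.5", "prompt_adherence:4.0"]),
}
print(item)
```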
evaly dataset from-bench
Creates a dataset from a bench report JSON. Scores from the report become expected values, turning a successful benchmark into a golden test set for regression detection.
| Flag | Type | Default | Description |
|---|---|---|---|
| --output, -o | TEXT | — | Required. Output file path. |
| --min-score | FLOAT | — | Only include items with overall score ≥ this value. |
| --model | TEXT | — | Use scores from a specific model (default: winner or first). |
| --name, -n | TEXT | from-<report> | Dataset name. |
Basic usage
# Turn a bench report into a golden dataset
evaly dataset from-bench report.json -o golden.json
Filter by minimum score
# Only include high-scoring items (4.0+)
evaly dataset from-bench report.json --min-score 4.0 -o golden.json
Use a specific model's scores
# Use flux-pro scores as expected values
evaly dataset from-bench report.json --model flux-pro -o golden.json
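The filtering behind --min-score and --model can be pictured with the Python sketch below. The bench report layout is not documented in this section, so the structure used here (a list of result records with prompt, model, scores, and overall fields) is purely hypothetical, as is the function name; only the "overall score ≥ threshold" rule comes from the flag table.

```python
from typing import Optional


def golden_items(results: list[dict],
                 min_score: Optional[float] = None,
                 model: Optional[str] = None) -> list[dict]:
    """Turn hypothetical bench results into dataset items with expected scores."""
    items = []
    for result in results:
        # Keep only the requested model's scores, if one was named.
        if model is not None and result["model"] != model:
            continue
        # Keep only results whose overall score meets the threshold.
        if min_score is not None and result["overall"] < min_score:
            continue
        items.append({"prompt": result["prompt"],
                      "expected": dict(result["scores"])})
    return items


# Usage with made-up records in the assumed shape.
results = [
    {"prompt": "White sneakers on marble", "model": "flux-pro",
     "scores": {"visual_quality": 4.5}, "overall": 4.5},
    {"prompt": "Running shoes on a track", "model": "flux-pro",
     "scores": {"visual_quality": 3.0}, "overall": 3.0},
]
print(golden_items(results, min_score=4.0, model="flux-pro"))
```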
evaly dataset validate
Validates a dataset file and reports errors and warnings. Checks for empty datasets, pipeline consistency (e.g., img2img items missing image_url), and expected score ranges (0–5).
evaly dataset validate golden.json
# Output: Valid -- 12 items, no issues found.
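The three checks named above can be approximated in a few lines of Python. This is a sketch under assumptions rather than the command's real logic: the function name and message wording are invented, and the real validator may separate errors from warnings or check more than this.

```python
def validate_dataset(dataset: dict) -> list[str]:
    """Return a list of human-readable issues; an empty list means valid."""
    issues = []
    items = dataset.get("items", [])

    # Check 1: empty datasets.
    if not items:
        issues.append("dataset is empty")

    dataset_type = dataset.get("type") or dataset.get("pipeline")
    for index, item in enumerate(items):
        # Check 2: pipeline consistency, e.g. img2img items need image_url.
        if dataset_type == "img2img" and not item.get("image_url"):
            issues.append(f"item {index}: img2img item is missing image_url")

        # Check 3: expected scores must be numbers in the 0-5 range.
        for dim, score in item.get("expected", {}).items():
            if not isinstance(score, (int, float)) or not 0 <= score <= 5:
                issues.append(f"item {index}: expected {dim}={score!r} is outside 0-5")
    return issues


# Usage: an img2img item without image_url and with an out-of-range score.
broken = {
    "type": "img2img",
    "items": [{"instruction": "Place the product on a kitchen counter",
               "expected": {"input_fidelity": 6.0}}],
}
for issue in validate_dataset(broken):
    print(issue)
```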
evaly dataset stats
Shows dataset statistics including item count, expected score distribution (min/avg/max per dimension), and metadata key frequency.
evaly dataset stats golden.json
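As a rough picture of what these statistics mean, the Python sketch below computes the same three quantities from a dataset dict. The function name and the shape of the returned summary are assumptions made for illustration; the real command's output layout is not shown in this section.

```python
from collections import Counter, defaultdict
from statistics import mean


def dataset_stats(dataset: dict) -> dict:
    """Summarize item count, per-dimension expected scores, and metadata keys."""
    items = dataset.get("items", [])
    scores = defaultdict(list)   # dimension name -> list of expected scores
    metadata_keys = Counter()    # metadata key -> number of items using it

    for item in items:
        for dim, value in item.get("expected", {}).items():
            scores[dim].append(float(value))
        metadata_keys.update(item.get("metadata", {}).keys())

    return {
        "item_count": len(items),
        "expected": {
            dim: {"min": min(vals), "avg": mean(vals), "max": max(vals)}
            for dim, vals in scores.items()
        },
        "metadata_keys": dict(metadata_keys),
    }


# Usage with the two items from the product-photos-golden example.
golden = {
    "items": [
        {"prompt": "A product photo of white sneakers on marble",
         "metadata": {"category": "footwear", "style": "minimal"},
         "expected": {"visual_quality": 4.5, "prompt_adherence": 4.0}},
        {"prompt": "A modern minimalist logo for 'ACME Corp'",
         "metadata": {"category": "branding"},
         "expected": {"visual_quality": 4.0, "text_rendering": 3.5}},
    ]
}
print(dataset_stats(golden))
```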
Typical Workflow
The dataset commands support a “Bench → Golden Set → Regression Detection” workflow:
- Benchmark: Run evaly bench to generate and score images, saving a JSON report.
- Golden set: Use evaly dataset from-bench to extract high-scoring results as expected values.
- Regression detection: Run evaly bench --dataset golden.json --check-expected to compare future results against your baseline.
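The regression detection step boils down to comparing fresh scores against the stored expected values. The Python sketch below shows one plausible comparison rule; how --check-expected actually decides pass or fail is not specified here, so the function name, the "score below expected" rule, and the tolerance parameter are all assumptions.

```python
def find_regressions(expected: dict[str, float],
                     actual: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """List dimensions whose new score falls below the expected baseline."""
    regressions = []
    for dim, target in expected.items():
        score = actual.get(dim)
        if score is None:
            # A dimension with a baseline but no new score is reported too.
            regressions.append(f"{dim}: no score reported")
        elif score < target - tolerance:
            regressions.append(f"{dim}: {score} is below expected {target}")
    return regressions


# Usage: one dimension dropped by half a point, the other held steady.
baseline = {"visual_quality": 4.5, "prompt_adherence": 4.0}
latest = {"visual_quality": 4.0, "prompt_adherence": 4.0}
print(find_regressions(baseline, latest))
print(find_regressions(baseline, latest, tolerance=0.5))
```

A nonzero tolerance is one way to keep ordinary scoring noise from being flagged as a regression.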
Full Example
# 1. Create a dataset
evaly dataset create -n "sneakers" -d "Product photo test set"
# 2. Add test cases with expected scores
evaly dataset add sneakers.json \
--prompt "White sneakers on marble, studio lighting" \
-m category=footwear \
-e visual_quality:4.5 -e prompt_adherence:4.0
evaly dataset add sneakers.json \
--prompt "Running shoes on a track, action shot" \
-m category=footwear \
-e visual_quality:4.0 -e prompt_adherence:4.0
# 3. Run benchmark with dataset
evaly bench -m flux-schnell --dataset sneakers.json -o report.json -y
# 4. Check results against expected scores
evaly bench -m flux-schnell --dataset sneakers.json --check-expected -y
# Or: create a golden set from a successful report
evaly dataset from-bench report.json --min-score 4.0 -o golden.json