evaly dataset

Manage evaluation datasets across visual, RAG, text, and agent workflows.

evaly dataset <SUBCOMMAND>

Datasets are reusable evaluation inputs with metadata and optional expected scores. They extend plain prompt arrays with the structure needed for regression detection, golden test sets, and reproducible benchmarks.

Canonical type field: Datasets declare a top-level "type" set to one of "text2img", "img2img", "rag", "text", or "agent". Legacy formats (plain string arrays and files keyed by "prompts" or "inputs") are still normalized automatically.
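
To make the legacy normalization concrete, here is a minimal Python sketch of how the three older shapes could be coerced into the canonical form. It is not evaly's implementation: the function name normalize_dataset, the "text2img" fallback type, the use of "pipeline" as a fallback for a missing "type", and the mapping of bare strings to {"prompt": ...} are all assumptions made for illustration.

```python
from typing import Any

LEGACY_KEYS = ("items", "prompts", "inputs")


def normalize_dataset(raw: Any, default_type: str = "text2img") -> dict:
    """Coerce a legacy dataset shape into a {"type", "items"} dictionary.

    Hypothetical helper: the real normalizer may accept more shapes.
    """
    if isinstance(raw, list):
        # Legacy shape: a plain array of prompt strings.
        entries, meta = raw, {}
    elif isinstance(raw, dict):
        # Legacy shapes: {"prompts": [...]} or {"inputs": [...]}.
        # Canonical files already store their entries under "items".
        for key in LEGACY_KEYS:
            if key in raw:
                entries = raw[key]
                break
        else:
            entries = []
        meta = {k: v for k, v in raw.items() if k not in LEGACY_KEYS}
    else:
        raise TypeError("dataset must be a JSON array or object")

    # Assumption: a bare string is a prompt; objects are copied unchanged.
    items = [{"prompt": e} if isinstance(e, str) else dict(e) for e in entries]

    dataset = dict(meta)
    # Assumption: "pipeline" stands in for a missing "type".
    dataset["type"] = meta.get("type") or meta.get("pipeline") or default_type
    dataset["items"] = items
    return dataset


print(normalize_dataset(["White sneakers on marble"]))
```

Under these assumptions, a plain array, a "prompts" file, and an "inputs" file all end up with the same top-level keys, which is what lets downstream commands treat them uniformly.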

Dataset Format

A dataset is a JSON file with a top-level type and an items array:

{
  "name": "product-photos-golden",
  "description": "Golden test set for product photo generation",
  "type": "text2img",
  "pipeline": "text2img",
  "items": [
    {
      "prompt": "A product photo of white sneakers on marble",
      "metadata": { "category": "footwear", "style": "minimal" },
      "expected": { "visual_quality": 4.5, "prompt_adherence": 4.0 }
    },
    {
      "prompt": "A modern minimalist logo for 'ACME Corp'",
      "metadata": { "category": "branding" },
      "expected": { "visual_quality": 4.0, "text_rendering": 3.5 }
    }
  ]
}

For img2img datasets, items include image_url and instruction instead of prompt:

{
  "name": "bg-editing-golden",
  "type": "img2img",
  "pipeline": "img2img",
  "items": [
    {
      "image_url": "https://example.com/product.jpg",
      "instruction": "Place the product on a kitchen counter",
      "expected": { "input_fidelity": 4.5, "transformation_quality": 4.0 }
    }
  ]
}
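
The two item shapes above can also be produced programmatically. The following Python sketch builds items for the two visual types and prints a dataset in the documented layout. The helper make_item is hypothetical and is not part of evaly; only the field names are taken from the examples above.

```python
import json


def make_item(dataset_type, *, prompt=None, image_url=None,
              instruction=None, metadata=None, expected=None):
    """Build one dataset item for a visual dataset type (hypothetical helper)."""
    if dataset_type == "text2img":
        if not prompt:
            raise ValueError("text2img items need a prompt")
        item = {"prompt": prompt}
    elif dataset_type == "img2img":
        if not (image_url and instruction):
            raise ValueError("img2img items need image_url and instruction")
        item = {"image_url": image_url, "instruction": instruction}
    else:
        # rag, text, and agent item fields are not specified here.
        raise ValueError(f"unsupported type in this sketch: {dataset_type}")
    # metadata and expected are optional on every item.
    if metadata:
        item["metadata"] = dict(metadata)
    if expected:
        item["expected"] = dict(expected)
    return item


dataset = {
    "name": "bg-editing-golden",
    "type": "img2img",
    "pipeline": "img2img",
    "items": [
        make_item(
            "img2img",
            image_url="https://example.com/product.jpg",
            instruction="Place the product on a kitchen counter",
            expected={"input_fidelity": 4.5, "transformation_quality": 4.0},
        )
    ],
}
print(json.dumps(dataset, indent=2))
```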

For rag, text, and agent datasets, each object in items uses the same fields that the matching CLI command expects. dataset show, dataset validate, and dataset stats understand all five canonical dataset types.

Subcommands

Subcommand   Description
create       Create a new empty dataset file.
show         Show dataset contents as a table.
add          Add an item to an existing dataset.
from-bench   Create a dataset from a bench report with expected scores.
validate     Validate a dataset file for errors and warnings.
stats        Show dataset statistics and expected score distribution.

evaly dataset create

Creates a new empty dataset file with metadata.

Flag               Type     Default       Description
--name, -n         TEXT     -             Required. Dataset name.
--type             CHOICE   text2img      Canonical dataset type: text2img, img2img, rag, text, or agent.
--pipeline         CHOICE   -             Legacy alias for visual dataset types only: text2img or img2img.
--description, -d  TEXT     -             Dataset description.
--output, -o       TEXT     <name>.json   Output file path.

# Create a text2img dataset
evaly dataset create -n "product-photos" --type text2img -d "Product photography golden set"

# Create an img2img dataset with custom output path
evaly dataset create -n "bg-editing" --type img2img -o datasets/bg-editing.json

# Create a RAG dataset shell
evaly dataset create -n "rag-golden" --type rag

Authoring note: dataset create, show, validate, and stats work across all dataset types. dataset add and from-bench are still visual-first helpers.

evaly dataset show

Displays dataset contents as a Rich table with prompts, metadata, and expected scores.

evaly dataset show product-photos.json

evaly dataset add

Adds a single item to an existing dataset. Preserves the original file format (plain array, "prompts", "inputs", or "items").

Flag             Type              Description
--prompt         TEXT              Prompt text (for text2img items).
--image          TEXT              Input image URL (for img2img items).
--instruction    TEXT              Edit instruction (for img2img items).
--metadata, -m   TEXT (multiple)   Metadata as key=value pairs.
--expected, -e   TEXT (multiple)   Expected scores as dim:value pairs.

text2img item

evaly dataset add product-photos.json \
    --prompt "A product photo of white sneakers on marble" \
    -m category=footwear \
    -e visual_quality:4.5 -e prompt_adherence:4.0

img2img item

evaly dataset add bg-editing.json \
    --image "https://example.com/product.jpg" \
    --instruction "Place the product on a kitchen counter" \
    -m scene=kitchen \
    -e input_fidelity:4.5 -e transformation_quality:4.0
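
The -m and -e flags in the commands above take key=value and dim:value pairs. As a rough illustration of how such pairs map onto an item's metadata and expected objects, here is a small Python sketch. The function names are hypothetical, and how evaly itself handles malformed pairs or value types is not documented here, so the error handling below is an assumption.

```python
def parse_metadata(pairs):
    """Turn ["category=footwear", ...] into {"category": "footwear", ...}."""
    metadata = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Assumption: metadata values are kept as strings.
        metadata[key] = value
    return metadata


def parse_expected(pairs):
    """Turn ["visual_quality:4.5", ...] into {"visual_quality": 4.5, ...}."""
    expected = {}
    for pair in pairs:
        dim, sep, value = pair.partition(":")
        if not sep or not dim:
            raise ValueError(f"expected dim:value, got {pair!r}")
        # float() raises ValueError if the score is not numeric.
        expected[dim] = float(value)
    return expected


item = {
    "prompt": "A product photo of white sneakers on marble",
    "metadata": parse_metadata(["category=footwear"]),
    "expected": parse_expected(["visual_quality:4.5", "prompt_adherence:4.0"]),
}
print(item)
```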

evaly dataset from-bench

Creates a dataset from a bench report JSON. Scores from the report become expected values, turning a successful benchmark into a golden test set for regression detection.

Flag           Type    Default         Description
--output, -o   TEXT    -               Required. Output file path.
--min-score    FLOAT   -               Only include items with overall score ≥ this value.
--model        TEXT    -               Use scores from a specific model (default: winner or first).
--name, -n     TEXT    from-<report>   Dataset name.

Basic usage

# Turn a bench report into a golden dataset
evaly dataset from-bench report.json -o golden.json

Filter by minimum score

# Only include high-scoring items (4.0+)
evaly dataset from-bench report.json --min-score 4.0 -o golden.json

Use a specific model's scores

# Use flux-pro scores as expected values
evaly dataset from-bench report.json --model flux-pro -o golden.json
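
Conceptually, from-bench selects one model's scores per item, optionally drops items below a threshold, and stores the remaining scores as expected values. The Python sketch below illustrates that idea only. The bench report schema is not described in this page, so the "results" and "scores" layout is invented for the example, and treating the overall score as the mean of the per-dimension scores is likewise an assumption.

```python
from statistics import mean


def golden_from_report(report, model, min_score=None, name="from-report"):
    """Build a golden dataset from a (hypothetical) bench report structure."""
    items = []
    for result in report["results"]:
        scores = result["scores"].get(model)
        if not scores:
            continue  # this model produced no scores for the item
        # Assumption: overall score = mean of the per-dimension scores.
        if min_score is not None and mean(scores.values()) < min_score:
            continue
        items.append({"prompt": result["prompt"], "expected": dict(scores)})
    return {"name": name, "type": "text2img", "items": items}


# Invented report layout, used only to exercise the sketch.
report = {
    "results": [
        {"prompt": "White sneakers on marble",
         "scores": {"flux-pro": {"visual_quality": 4.5, "prompt_adherence": 4.0}}},
        {"prompt": "Running shoes on a track",
         "scores": {"flux-pro": {"visual_quality": 3.5, "prompt_adherence": 2.5}}},
    ]
}

golden = golden_from_report(report, "flux-pro", min_score=4.0)
print(len(golden["items"]))
```

With the sample data, only the first item passes the 4.0 threshold, mirroring what --min-score 4.0 is described as doing.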

evaly dataset validate

Validates a dataset file and reports errors and warnings. It checks for empty datasets, pipeline consistency (for example, img2img items missing image_url), and expected scores outside the 0–5 range.

evaly dataset validate golden.json

# Output: Valid -- 12 items, no issues found.
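
The three checks named above (empty datasets, img2img items missing image_url, and expected scores outside 0–5) can be sketched in a few lines of Python. This is an illustration of the checks as described, not the validator evaly ships; the function name and the warning wording are assumptions.

```python
def validate_dataset(dataset):
    """Return a list of warning strings; an empty list means no issues."""
    warnings = []
    items = dataset.get("items", [])
    if not items:
        warnings.append("dataset has no items")

    # Assumption: fall back to the legacy "pipeline" field if "type" is absent.
    dataset_type = dataset.get("type") or dataset.get("pipeline")

    for index, item in enumerate(items):
        if dataset_type == "img2img" and not item.get("image_url"):
            warnings.append(f"item {index}: img2img item is missing image_url")
        for dim, score in item.get("expected", {}).items():
            if not 0 <= score <= 5:
                warnings.append(f"item {index}: expected {dim}={score} is outside 0-5")
    return warnings


broken = {
    "type": "img2img",
    "items": [
        {"instruction": "Place the product on a kitchen counter",
         "expected": {"input_fidelity": 6.0}},
    ],
}
for warning in validate_dataset(broken):
    print(warning)
```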

evaly dataset stats

Shows dataset statistics including item count, expected score distribution (min/avg/max per dimension), and metadata key frequency.

evaly dataset stats golden.json
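
To show what these statistics amount to, the Python sketch below computes the item count, min/avg/max per expected-score dimension, and metadata key frequency for the two-item product-photos example from the Dataset Format section. The function name and the shape of the returned dictionary are assumptions; the actual command's output format may differ.

```python
from collections import Counter, defaultdict


def dataset_stats(dataset):
    """Summarize expected scores and metadata keys for a dataset dict."""
    items = dataset.get("items", [])
    by_dim = defaultdict(list)
    key_counts = Counter()
    for item in items:
        for dim, score in item.get("expected", {}).items():
            by_dim[dim].append(score)
        # Count how many items carry each metadata key.
        key_counts.update(item.get("metadata", {}).keys())
    expected = {
        dim: {"min": min(v), "avg": sum(v) / len(v), "max": max(v)}
        for dim, v in by_dim.items()
    }
    return {
        "item_count": len(items),
        "expected": expected,
        "metadata_keys": dict(key_counts),
    }


dataset = {
    "type": "text2img",
    "items": [
        {"prompt": "A product photo of white sneakers on marble",
         "metadata": {"category": "footwear", "style": "minimal"},
         "expected": {"visual_quality": 4.5, "prompt_adherence": 4.0}},
        {"prompt": "A modern minimalist logo for 'ACME Corp'",
         "metadata": {"category": "branding"},
         "expected": {"visual_quality": 4.0, "text_rendering": 3.5}},
    ],
}
print(dataset_stats(dataset))
```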

Typical Workflow

The dataset commands support a “Bench → Golden Set → Regression Detection” workflow:

  1. Benchmark: Run evaly bench to generate and score images, saving a JSON report.
  2. Golden set: Use evaly dataset from-bench to extract high-scoring results as expected values.
  3. Regression detection: Run evaly bench --dataset golden.json --check-expected to compare future results against your baseline.
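
The regression-detection step boils down to comparing fresh scores against the stored expected values. The Python sketch below shows one plausible comparison rule. How --check-expected actually decides that a score has regressed is not specified on this page, so the tolerance parameter, its 0.5 default, and the function name are all hypothetical.

```python
def find_regressions(expected, actual, tolerance=0.5):
    """List (dimension, baseline, score) tuples that fell below the baseline.

    Assumption: a dimension regresses when its new score is more than
    `tolerance` below the expected value. Dimensions absent from `actual`
    are skipped rather than reported.
    """
    regressions = []
    for dim, baseline in expected.items():
        score = actual.get(dim)
        if score is None:
            continue
        if score < baseline - tolerance:
            regressions.append((dim, baseline, score))
    return regressions


expected = {"visual_quality": 4.5, "prompt_adherence": 4.0}
actual = {"visual_quality": 3.5, "prompt_adherence": 3.8}
print(find_regressions(expected, actual))
```

Here visual_quality dropped by a full point and is flagged, while prompt_adherence stays within the assumed tolerance.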

Full Example

# 1. Create a dataset
evaly dataset create -n "sneakers" -d "Product photo test set"

# 2. Add test cases with expected scores
evaly dataset add sneakers.json \
    --prompt "White sneakers on marble, studio lighting" \
    -m category=footwear \
    -e visual_quality:4.5 -e prompt_adherence:4.0

evaly dataset add sneakers.json \
    --prompt "Running shoes on a track, action shot" \
    -m category=footwear \
    -e visual_quality:4.0 -e prompt_adherence:4.0

# 3. Run benchmark with dataset
evaly bench -m flux-schnell --dataset sneakers.json -o report.json -y

# 4. Check results against expected scores
evaly bench -m flux-schnell --dataset sneakers.json --check-expected -y

# Or: create a golden set from a successful report
evaly dataset from-bench report.json --min-score 4.0 -o golden.json