# Configuration

Configure Evalytic with `evalytic.toml`, environment variables, and CLI flags.
## Precedence Order

Configuration is resolved in this order (highest priority first):

1. CLI flags — `--judge openai/gpt-5.2`
2. Environment variables — `GEMINI_API_KEY=...`
3. `.env` file — auto-loaded from the current directory
4. `evalytic.toml` — project config file
5. Defaults — built-in defaults
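The resolution above amounts to a first-match lookup across layers, from highest to lowest priority. A minimal sketch of that idea (hypothetical code for illustration, not Evalytic's actual implementation):

```python
def resolve(key, cli_flags, env, dotenv, toml_cfg, defaults=None):
    """Return the first value found, scanning the highest-priority layer first."""
    for layer in (cli_flags, env, dotenv, toml_cfg, defaults or {}):
        if key in layer:
            return layer[key]
    return None

# A CLI flag beats the same key in evalytic.toml:
judge = resolve(
    "judge",
    cli_flags={"judge": "openai/gpt-5.2"},
    env={},
    dotenv={},
    toml_cfg={"judge": "gemini-2.5-flash"},
)
```

With the flag absent, the same lookup would fall through to the `evalytic.toml` value.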
## evalytic.toml

Create an `evalytic.toml` in your project root. The easiest way is the interactive wizard:

```shell
evaly init
```

Evalytic searches for config files in:

- `./evalytic.toml` (current directory)
- `~/.evalytic/config.toml` (user home)
### Full example

```toml
# evalytic.toml

[keys]
fal = "fal_key_xxx"
gemini = "gemini_key_xxx"
openai = "sk-xxx"
anthropic = "sk-ant-xxx"

[bench]
judge = "gemini-2.5-flash"
concurrency = 4
dimensions = ["visual_quality", "prompt_adherence"]
image_size = "landscape_16_9"
seed = 42
output_dir = "./reports"

# Weight VLM dimensions (default: equal)
[bench.dimension_weights]
input_fidelity = 0.5
visual_quality = 0.1

[bench.metrics]
clip_threshold = 0.18
clip_weight = 0.20
clip_range = [0.20, 0.40]
lpips_threshold = 0.40
lpips_weight = 0.20
lpips_range = [0.40, 0.95]
face_range = [0.60, 0.95]

# Override model cost or settings
[bench.model_overrides.flux-kontext]
cost = 0.06

[bench.model_overrides.my-custom-model]
endpoint = "fal-ai/my-custom/v1"
pipeline = "img2img"
cost = 0.04
image_field = "image_urls"
```
## `[keys]` Section

API keys defined here are set as environment variables when Evalytic loads:

| Config Key | Environment Variable | Used By |
|---|---|---|
| `fal` | `FAL_KEY` | fal.ai image generation + `fal/*` judges |
| `gemini` | `GEMINI_API_KEY` | Gemini judge |
| `openai` | `OPENAI_API_KEY` | OpenAI judge |
| `anthropic` | `ANTHROPIC_API_KEY` | Anthropic judge |
Never commit an `evalytic.toml` containing API keys to version control. Add it to `.gitignore`, or use environment variables / `.env` instead.
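Conceptually, the `[keys]` table is just mapped onto the environment variables above when the config loads. A rough sketch of that mapping (hypothetical helper names, not Evalytic's API; in practice the target would be `os.environ`):

```python
KEY_ENV_MAP = {
    "fal": "FAL_KEY",
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def export_keys(keys_section: dict, env: dict) -> None:
    # setdefault: a variable already present in the environment wins over
    # the TOML value, matching the precedence order (env vars rank above
    # evalytic.toml)
    for key, value in keys_section.items():
        env.setdefault(KEY_ENV_MAP[key], value)

env = {}
export_keys({"fal": "fal_key_xxx", "gemini": "gemini_key_xxx"}, env)
# env now holds FAL_KEY and GEMINI_API_KEY
```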
## `[bench]` Section

Default settings for the `evaly bench` command:

| Key | Type | Default | Description |
|---|---|---|---|
| `judge` | string | `"gemini-2.5-flash"` | Default VLM judge (single mode) |
| `judges` | string[] | — | Multi-judge consensus mode (2-3 judges). Overrides `judge` when set. |
| `models` | string[] | — | Default models for `evaly bench` (avoids the `-m` flag) |
| `prompts` | string | — | Default prompts file path or inline prompt |
| `concurrency` | int | 4 | Max parallel generation requests |
| `dimensions` | string[] | auto | Default dimensions to score |
| `image_size` | string | — | Default image size |
| `seed` | int | — | Fixed seed for reproducibility |
| `output_dir` | string | — | Default output directory. Each run creates a timestamped subfolder with reports and error log. |
## `[bench.dimension_weights]` Section

Customize how VLM dimensions contribute to the overall score. By default, all dimensions are weighted equally (1/n). When you specify weights, unspecified dimensions share the remaining weight equally. Weights are normalized to sum to 1.0.

```toml
# E-commerce: product shape matters most
[bench.dimension_weights]
input_fidelity = 0.5
visual_quality = 0.1
# Remaining 0.4 split equally among other active dimensions
```
The same weights can be passed on the CLI with `--dim-weights '{"input_fidelity": 0.5}'`. CLI flags override TOML values.
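The weighting rule described above (unspecified dimensions split the remainder equally, then everything is normalized to sum to 1.0) can be sketched as follows. This illustrates the documented behavior; it is not Evalytic's internal code:

```python
def resolve_weights(active_dims, overrides):
    specified = sum(overrides.get(d, 0.0) for d in active_dims)
    unspecified = [d for d in active_dims if d not in overrides]
    # unspecified dimensions split the remaining weight equally
    share = (1.0 - specified) / len(unspecified) if unspecified else 0.0
    weights = {d: overrides.get(d, share) for d in active_dims}
    # normalize so the weights sum to exactly 1.0
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

weights = resolve_weights(
    ["input_fidelity", "visual_quality", "prompt_adherence"],
    {"input_fidelity": 0.5, "visual_quality": 0.1},
)
# prompt_adherence receives the remaining 0.4
```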
## `[bench.metrics]` Section

Thresholds, weights, and normalization ranges for local metrics. Sharpness is always available (no torch required); CLIP/LPIPS/face require `evalytic[metrics]`.

| Key | Type | Default | Description |
|---|---|---|---|
| `clip_threshold` | float | 0.18 | CLIP score flag threshold |
| `clip_weight` | float | 0.20 | CLIP weight in the overall score |
| `clip_range` | float[2] | [0.18, 0.35] | CLIP normalization range [min, max] for mapping to 0–5 |
| `lpips_threshold` | float | 0.40 | LPIPS flag threshold |
| `lpips_weight` | float | 0.20 | LPIPS weight in the overall score |
| `lpips_range` | float[2] | [0.40, 0.95] | LPIPS normalization range [min, max] |
| `face_range` | float[2] | [0.60, 0.95] | Face similarity normalization range [min, max] |
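A `*_range = [min, max]` pair maps a raw metric value onto the 0–5 scale by linear interpolation, clamping values outside the range. A plausible sketch of that mapping (an assumption for illustration; Evalytic's exact rounding, and any inversion for distance-style metrics like LPIPS where lower is better, may differ):

```python
def to_score(value, rng):
    lo, hi = rng
    t = (value - lo) / (hi - lo)   # linear position inside [min, max]
    t = min(max(t, 0.0), 1.0)      # clamp values outside the range
    return t * 5.0                 # map onto the 0-5 scale

# With the default CLIP range [0.18, 0.35]:
top = to_score(0.35, (0.18, 0.35))   # at the max -> 5.0
mid = to_score(0.265, (0.18, 0.35))  # halfway -> 2.5
low = to_score(0.10, (0.18, 0.35))   # below the min -> clamped to 0.0
```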
## `[bench.model_overrides]` Section

Override cost or settings for any model. Useful when fal.ai prices change or you're using a custom endpoint. Overrides take priority over both the built-in registry and auto-detected pricing.

```toml
# Override cost for an existing model
[bench.model_overrides.flux-kontext]
cost = 0.06

# Register a custom model
[bench.model_overrides.my-custom-model]
endpoint = "fal-ai/my-custom/v1"
pipeline = "img2img"
cost = 0.04
image_field = "image_urls"
```

| Key | Type | Description |
|---|---|---|
| `endpoint` | string | fal.ai endpoint path |
| `pipeline` | string | `"text2img"` or `"img2img"` |
| `cost` | float | USD per image (overrides auto-detect) |
| `image_field` | string | `"image_url"` or `"image_urls"` |
Cost precedence: `model_overrides` > fal.ai live pricing (auto-detected) > built-in registry defaults. Run `evaly bench --list-models` to see current prices.
## .env File

Evalytic auto-loads `.env` from the current directory using python-dotenv:

```shell
# .env
FAL_KEY=fal_key_xxx
GEMINI_API_KEY=gemini_key_xxx
OPENAI_API_KEY=sk-xxx
ANTHROPIC_API_KEY=sk-ant-xxx
```
## Environment Variables

| Variable | Description |
|---|---|
| `FAL_KEY` | fal.ai API key for image generation + `fal/*` judges |
| `GEMINI_API_KEY` | Google Gemini API key for the judge |
| `OPENAI_API_KEY` | OpenAI API key for the judge |
| `ANTHROPIC_API_KEY` | Anthropic API key for the judge |
## Example Configurations

### Single key (fal.ai only)

```toml
# One key for both generation and judging
[keys]
fal = "fal_key_xxx"

[bench]
judge = "fal/gemini-2.5-flash"
```

### Two keys (fal.ai + Gemini)

```toml
[keys]
fal = "fal_key_xxx"
gemini = "gemini_key_xxx"
```

### CI/CD with GPT-5.2 judge

```toml
[bench]
judge = "openai/gpt-5.2"
concurrency = 2
dimensions = ["visual_quality", "prompt_adherence", "text_rendering"]
```

### Local development with Ollama

```toml
[keys]
fal = "fal_key_xxx"

[bench]
judge = "ollama/qwen2.5-vl:7b"
seed = 42
```

### Consensus mode (multi-judge)

```toml
# Consensus via fal.ai — single key, multiple judges
[keys]
fal = "fal_key_xxx"

[bench]
judges = ["fal/gemini-2.5-flash", "fal/gpt-5.2"]

# Or with separate API keys per provider
# [keys]
# gemini = "gemini_key_xxx"
# openai = "sk-xxx"
# [bench]
# judges = ["gemini-2.5-flash", "gpt-5.2"]
```

### Default models and prompts

```toml
# Saves you from typing -m and -p every time
[keys]
fal = "fal_key_xxx"
gemini = "gemini_key_xxx"

[bench]
models = ["flux-schnell", "flux-dev"]
prompts = "prompts.json"
```

With this config, `evaly bench -y` is all you need — models and prompts are loaded from the config file.
## Inspect Configuration

Use `evaly config show` to see the active configuration, which keys are loaded, and where they came from:

```shell
evaly config show
```