# evaly agent

Evaluate tool-using agent runs.

```shell
evaly agent eval [OPTIONS]
```
The agent eval command evaluates a single tool-using agent run against three metrics: tool_call_accuracy, goal_accuracy, and step_efficiency.
## Basic Usage

```shell
evaly agent eval \
  --input "Find pricing and summarize it." \
  --final-output "The Pro plan costs $99 per month." \
  --tool-call web.search \
  --expected-tool web.search \
  --expected-max-steps 3
```
## Options
| Flag | Type | Description |
|---|---|---|
| --input | TEXT | Required. User task for the agent. |
| --final-output | TEXT | Required. Final agent output. |
| --expected-output | TEXT | Optional expected final output. |
| --tool-call | TEXT (multiple) | Observed tool call name. Repeat for multiple calls. |
| --expected-tool | TEXT (multiple) | Expected tool call name. Repeat for multiple calls. |
| --expected-max-steps | INT | Optional maximum number of tool steps. |
| --judge, -j | TEXT | Judge model for goal_accuracy. |
| --judges | TEXT | Comma-separated judges for consensus mode. |
| --judge-url | TEXT | Custom judge API base URL. |
| --output, -o | TEXT | Write report JSON to file. |
## Metrics
| Metric | What it measures | Notes |
|---|---|---|
| tool_call_accuracy | How well executed tool names match expected tool names. | Deterministic. Falls back to a lightweight heuristic if no expected tools are provided. |
| goal_accuracy | Whether the final output achieved the user goal. | Uses embeddings when --expected-output is provided and embeddings are available; otherwise uses the judge. |
| step_efficiency | How efficiently the agent used its tool budget. | Deterministic. Uses --expected-max-steps when available. |
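To make the two deterministic metrics concrete, here is a minimal Python sketch. The exact formulas are illustrative assumptions, not evaly's actual implementation: tool_call_accuracy is shown as multiset overlap between observed and expected tool names, and step_efficiency as a proportional penalty once the agent exceeds --expected-max-steps.

```python
from collections import Counter


def tool_call_accuracy(observed: list[str], expected: list[str]) -> float:
    """Fraction of expected tool calls matched by observed calls (multiset overlap)."""
    if not expected:
        return 1.0  # the lightweight heuristic fallback would apply here instead
    obs, exp = Counter(observed), Counter(expected)
    matched = sum(min(obs[name], count) for name, count in exp.items())
    return matched / sum(exp.values())


def step_efficiency(steps_taken: int, expected_max_steps: int) -> float:
    """Full score within budget; proportional penalty once the budget is exceeded."""
    if steps_taken <= expected_max_steps:
        return 1.0
    return expected_max_steps / steps_taken


# tool_call_accuracy(["web.search"], ["web.search"]) -> 1.0
# step_efficiency(4, 3) -> 0.75
```

For the Basic Usage example above (one observed call matching one expected call, within a 3-step budget), both sketched metrics would score 1.0.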
Optional embeddings boost: if you install evalytic[embeddings] and pass --expected-output, goal_accuracy can use embedding similarity instead of judge scoring alone.
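As a rough picture of what embedding-based scoring involves, here is a plain cosine-similarity sketch. The actual embedding model and scoring rule used by evalytic[embeddings] may differ; this only illustrates the comparison between the --final-output and --expected-output vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
# goal_accuracy would then be derived from the similarity between the
# embedding of --final-output and the embedding of --expected-output.
```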
## Output and Gating
Agent reports use eval_type: "agent" with summary.metric_averages.
Gate them with per-metric thresholds:
```shell
evaly agent eval ... -o agent.json
evaly gate --report agent.json \
  --metric-threshold goal_accuracy:0.75 \
  --metric-threshold tool_call_accuracy:0.9
```
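For custom pipelines, the same gating logic can be approximated in a few lines of Python. The report shape (summary.metric_averages) follows the description above, but the gate helper itself is a hypothetical sketch, not part of evaly:

```python
import json


def gate(report_path: str, thresholds: dict[str, float]) -> bool:
    """True only if every thresholded metric average meets its floor."""
    with open(report_path) as f:
        report = json.load(f)
    averages = report["summary"]["metric_averages"]
    return all(averages.get(metric, 0.0) >= floor
               for metric, floor in thresholds.items())
```

Calling gate("agent.json", {"goal_accuracy": 0.75, "tool_call_accuracy": 0.9}) mirrors the CLI invocation above.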