# evaly agent

Evaluate tool-using agent runs.

```shell
evaly agent eval [OPTIONS]
```
The agent eval command evaluates a single tool-using agent run against three metrics: tool_call_accuracy, goal_accuracy, and step_efficiency.
## Basic Usage

```shell
evaly agent eval \
  --input "Find pricing and summarize it." \
  --final-output "The Pro plan costs $99 per month." \
  --tool-call web.search \
  --expected-tool web.search \
  --expected-max-steps 3
```
## Options
| Flag | Type | Description |
|---|---|---|
| --input | TEXT | Required. User task for the agent. |
| --final-output | TEXT | Required. Final agent output. |
| --expected-output | TEXT | Optional expected final output. |
| --tool-call | TEXT (multiple) | Observed tool call name. Repeat for multiple calls. |
| --expected-tool | TEXT (multiple) | Expected tool call name. Repeat for multiple calls. |
| --expected-max-steps | INT | Optional maximum number of tool steps. |
| --judge, -j | TEXT | Judge model for goal_accuracy. |
| --judges | TEXT | Comma-separated judges for consensus mode. |
| --judge-url | TEXT | Custom judge API base URL. |
| --output, -o | TEXT | Write report JSON to file. |
## Metrics
| Metric | What it measures | Notes |
|---|---|---|
| tool_call_accuracy | How well executed tool names match expected tool names. | Deterministic. Falls back to a lightweight heuristic if no expected tools are provided. |
| goal_accuracy | Whether the final output achieved the user goal. | Uses embeddings when --expected-output is provided and embeddings are available; otherwise uses the judge. |
| step_efficiency | How efficiently the agent used its tool budget. | Deterministic. Uses --expected-max-steps when available. |
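To make the two deterministic metrics concrete, here is a minimal Python sketch. The exact formulas are illustrative assumptions, not evaly's actual implementation: tool_call_accuracy is shown as multiset overlap between observed and expected tool names, and step_efficiency as a proportional penalty once the agent exceeds --expected-max-steps.

```python
from collections import Counter


def tool_call_accuracy(observed: list[str], expected: list[str]) -> float:
    """Fraction of expected tool calls matched by observed calls (multiset overlap)."""
    if not expected:
        return 1.0  # the lightweight heuristic fallback would apply here instead
    obs, exp = Counter(observed), Counter(expected)
    matched = sum(min(obs[name], count) for name, count in exp.items())
    return matched / sum(exp.values())


def step_efficiency(steps_taken: int, expected_max_steps: int) -> float:
    """Full score within budget; proportional penalty once the budget is exceeded."""
    if steps_taken <= expected_max_steps:
        return 1.0
    return expected_max_steps / steps_taken


# tool_call_accuracy(["web.search"], ["web.search"]) -> 1.0
# step_efficiency(4, 3) -> 0.75
```

For the Basic Usage example above (one observed call matching one expected call, within a 3-step budget), both sketched metrics would score 1.0.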
Optional embeddings boost: if you install evalytic[embeddings] and pass --expected-output, goal_accuracy can use embedding similarity instead of judge scoring alone.
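As a rough picture of what embedding-based scoring involves, here is a plain cosine-similarity sketch. The actual embedding model and scoring rule used by evalytic[embeddings] may differ; this only illustrates the comparison between the --final-output and --expected-output vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
# goal_accuracy would then be derived from the similarity between the
# embedding of --final-output and the embedding of --expected-output.
```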
## Output and Gating
Agent reports use eval_type: "agent" with summary.metric_averages.
Gate them with per-metric thresholds:
```shell
evaly agent eval ... -o agent.json
evaly gate --report agent.json \
  --metric-threshold goal_accuracy:0.75 \
  --metric-threshold tool_call_accuracy:0.9
```
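For custom pipelines, the same gating logic can be approximated in a few lines of Python. The report shape (summary.metric_averages) follows the description above, but the gate helper itself is a hypothetical sketch, not part of evaly:

```python
import json


def gate(report_path: str, thresholds: dict[str, float]) -> bool:
    """True only if every thresholded metric average meets its floor."""
    with open(report_path) as f:
        report = json.load(f)
    averages = report["summary"]["metric_averages"]
    return all(averages.get(metric, 0.0) >= floor
               for metric, floor in thresholds.items())
```

Calling gate("agent.json", {"goal_accuracy": 0.75, "tool_call_accuracy": 0.9}) mirrors the CLI invocation above.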