evaly agent

Evaluate tool-using agent runs.

evaly agent eval [OPTIONS]

The agent eval command evaluates a single tool-using agent run. It scores three metrics: tool_call_accuracy, goal_accuracy, and step_efficiency.

Basic Usage

evaly agent eval \
    --input "Find pricing and summarize it." \
    --final-output "The Pro plan costs $99 per month." \
    --tool-call web.search \
    --expected-tool web.search \
    --expected-max-steps 3

Options

Flag                  Type             Description
--input               TEXT             Required. User task for the agent.
--final-output        TEXT             Required. Final agent output.
--expected-output     TEXT             Optional expected final output.
--tool-call           TEXT (multiple)  Observed tool call name. Repeat for multiple calls.
--expected-tool       TEXT (multiple)  Expected tool call name. Repeat for multiple calls.
--expected-max-steps  INT              Optional maximum number of tool steps.
--judge, -j           TEXT             Judge model for goal_accuracy.
--judges              TEXT             Comma-separated judges for consensus mode.
--judge-url           TEXT             Custom judge API base URL.
--output, -o          TEXT             Write report JSON to file.
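The (multiple) flags accumulate one value per repetition, in order. A minimal sketch of that behavior using Python's argparse (an illustration of the flag semantics, not evaly's implementation):

```python
import argparse

# Repeatable flags collect into ordered lists via action="append".
parser = argparse.ArgumentParser(prog="evaly-agent-eval-sketch")
parser.add_argument("--tool-call", action="append", default=[], dest="tool_calls")
parser.add_argument("--expected-tool", action="append", default=[], dest="expected_tools")

args = parser.parse_args([
    "--tool-call", "web.search",
    "--tool-call", "web.fetch",
    "--expected-tool", "web.search",
])
print(args.tool_calls)      # ['web.search', 'web.fetch']
print(args.expected_tools)  # ['web.search']
```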

Metrics

Metric              What it measures                                          Notes
tool_call_accuracy  How well executed tool names match expected tool names.   Deterministic. Falls back to a lightweight heuristic if no expected tools are provided.
goal_accuracy       Whether the final output achieved the user goal.          Uses embeddings when --expected-output is provided and embeddings are available; otherwise uses the judge.
step_efficiency     How efficiently the agent used its tool budget.           Deterministic. Uses --expected-max-steps when available.

Optional embeddings boost: if you install evalytic[embeddings] and pass --expected-output, goal_accuracy can use embedding similarity instead of judge scoring alone.
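To make the deterministic metrics concrete, here is a rough sketch of how scores like these could be computed. The function names and formulas are assumptions for illustration, not evaly's actual implementation:

```python
from collections import Counter

def tool_call_accuracy(observed, expected):
    # Hypothetical: fraction of expected tool names matched by observed
    # calls (multiset overlap), ignoring call order.
    if not expected:
        return 1.0  # placeholder; the real CLI uses a lightweight heuristic here
    overlap = Counter(observed) & Counter(expected)
    return sum(overlap.values()) / len(expected)

def step_efficiency(n_steps, max_steps):
    # Hypothetical: full credit within budget, linearly penalized beyond it.
    if max_steps is None or n_steps <= max_steps:
        return 1.0
    return max(0.0, max_steps / n_steps)

print(tool_call_accuracy(["web.search"], ["web.search"]))  # 1.0
print(step_efficiency(2, 3))                               # 1.0
```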

Output and Gating

Agent reports use eval_type: "agent" with summary.metric_averages. Gate them with per-metric thresholds:

evaly agent eval ... -o agent.json
evaly gate --report agent.json \
    --metric-threshold goal_accuracy:0.75 \
    --metric-threshold tool_call_accuracy:0.9
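A minimal sketch of the gating step, assuming only the report shape described above (eval_type plus summary.metric_averages). Threshold parsing mirrors the metric:value flag syntax; this is an illustration, not evaly's gate implementation:

```python
import json

report = json.loads("""{
  "eval_type": "agent",
  "summary": {"metric_averages": {"goal_accuracy": 0.81,
                                  "tool_call_accuracy": 0.95,
                                  "step_efficiency": 1.0}}
}""")

# Parse "metric:threshold" pairs as they would be passed via --metric-threshold.
thresholds = dict(
    (name, float(value))
    for name, value in (t.split(":") for t in ["goal_accuracy:0.75",
                                               "tool_call_accuracy:0.9"])
)

averages = report["summary"]["metric_averages"]
failures = {m: averages[m] for m, t in thresholds.items() if averages[m] < t}
print("PASS" if not failures else f"FAIL: {failures}")  # PASS
```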