A lot of AI evaluation starts with a reasonable instinct: look at the answer and decide whether it seems good.
That is not enough for agents.
Agentic systems do their work across many steps. They choose tools, pass arguments, inspect results, recover from errors, ask for approval, and decide when to stop. If the eval only looks at the final text, it misses most of the behavior that determines whether the system is reliable.
The evaluation target should be behavior, not vibes.
For an agent that calls tools, useful questions include:
- Did it choose the right tool?
- Did it call the tool with valid arguments?
- Did it avoid destructive actions without approval?
- Did it recover when the first attempt failed?
- Did it expose enough context for a human to review?
- Did it stop once the task was complete?
- Did it preserve user intent?
- Did it stay within cost, latency, and policy boundaries?
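Several of these questions can become deterministic checks over a recorded trace. Here is a minimal sketch in Python, assuming a hypothetical trace format (`ToolCall`, `Trace`) and an assumed policy list of destructive tools; real agent frameworks log richer events:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict
    ok: bool                 # did the call succeed?
    approved: bool = False   # human approval, where required

@dataclass
class Trace:
    calls: list
    final_answer: str

DESTRUCTIVE = {"delete_record", "send_email"}  # assumed policy list

def check_trace(trace: Trace, expected_tool: str, max_calls: int = 5) -> dict:
    """Deterministic checks over one recorded agent run."""
    results = {}
    tools_used = [c.tool for c in trace.calls]
    # Did it choose the right tool at some point?
    results["right_tool"] = expected_tool in tools_used
    # Did it avoid destructive actions without approval?
    results["no_unapproved_destruction"] = all(
        c.approved for c in trace.calls if c.tool in DESTRUCTIVE
    )
    # Did it recover when the first attempt failed?
    failures = [i for i, c in enumerate(trace.calls) if not c.ok]
    results["recovered"] = (not failures) or any(
        c.ok for c in trace.calls[failures[0] + 1:]
    )
    # Did it stop once the task was complete (no runaway loops)?
    results["stopped_in_budget"] = len(trace.calls) <= max_calls
    return results
```

A run where the first call fails and a retry succeeds would pass all four checks; a run that keeps calling tools past the budget would fail `stopped_in_budget`. Checks like argument validity or cost tracking slot in the same way.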
For an AI coach like Swoleby, the questions are different but the principle is the same. A response can sound supportive and still fail the product. It might be too long, too vague, too generic, too pushy, too repetitive, or disconnected from the user’s current state.
A better eval asks whether the response creates a useful next action. Is it short enough to read in an SMS thread? Does it acknowledge the user’s constraint? Does it avoid shame? Does it suggest something concrete? Does it remember the right amount? Does it respect opt-out and safety boundaries?
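Some of these questions are cheap to automate with crude deterministic proxies. The sketch below assumes an SMS length budget, a shame-word blocklist, and an action-verb list, all illustrative; tone, memory, and relevance still need a model grader or human review:

```python
import re

SMS_LIMIT = 320  # assumed budget: roughly two SMS segments
SHAME_WORDS = {"lazy", "excuse", "failure"}              # illustrative blocklist
ACTION_VERBS = {"try", "walk", "stretch", "log", "drink", "schedule"}

def check_coach_reply(reply: str) -> dict:
    """Crude deterministic proxies for a coaching reply."""
    words = set(re.findall(r"[a-z']+", reply.lower()))
    return {
        # Short enough to read in an SMS thread?
        "sms_length": len(reply) <= SMS_LIMIT,
        # Avoids shaming language?
        "no_shame": not (words & SHAME_WORDS),
        # Suggests something concrete? (verb presence is a weak proxy)
        "has_action": bool(words & ACTION_VERBS),
    }
```

These proxies catch regressions fast and for free; the judgment calls they cannot make are exactly where model-graded checks earn their keep.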
This is why evals need product judgment. The “right” behavior depends on the job the system is doing.
Generic answer-quality scoring can still help, but it should not be the whole system. Good evals combine deterministic checks, model-graded judgments, trace review, golden examples, regression tests, and human inspection. They make quality visible across the workflow, not only at the final message.
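One way to combine these pieces is a small harness that scores a set of golden examples against both deterministic checks and model-graded judgments. Everything here is illustrative, not a real framework: the check signature is an assumption, and the model grader is a stub standing in for an LLM judge with a rubric:

```python
from typing import Callable

# A check takes an output and returns (name, passed).
Check = Callable[[str], tuple[str, bool]]

def length_check(output: str) -> tuple[str, bool]:
    # Deterministic check: fits an assumed 320-character budget.
    return ("under_320_chars", len(output) <= 320)

def stub_model_grader(output: str) -> tuple[str, bool]:
    # Placeholder: a real grader would prompt a judge model with a rubric.
    return ("judge_says_actionable", "try" in output.lower())

def run_eval(golden: list[tuple[str, str]], checks: list[Check]) -> dict:
    """Score every check against every (input, output) golden example."""
    scores = {name: 0 for name, _ in (c("") for c in checks)}
    for _, output in golden:
        for check in checks:
            name, passed = check(output)
            scores[name] += int(passed)
    return {name: n / len(golden) for name, n in scores.items()}
```

Running this over the golden set on every change turns "the bot feels worse" into per-check pass rates that can be compared across versions, which is the regression-testing half of the workflow.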
The practical goal is not to prove that an AI system is perfect. It is to catch regressions, make tradeoffs explicit, and help teams improve the behavior that matters.
If an eval cannot tell the difference between a polished answer and a useful action, it is measuring the wrong thing.