Agent evaluation is the highest-growth specialty in AI training in 2026. Few raters do it well because the rubric isn't obvious. Here's the practical framework for scoring agent runs accurately.
What an agent run looks like
You're given:
- The user's goal in plain English.
- The agent's full execution trace — every action, observation, tool call, and intermediate decision.
- The final outcome.
Your job: score the run on multiple dimensions. Most platforms use a 5-point scale per dimension; some use binary pass/fail.
The five scoring dimensions
1. Goal completion
Did the agent achieve what the user asked? Three sub-questions: was the goal correctly understood, was the final outcome actually the right thing, and if the agent only partially completed the task, does the rubric award partial credit?
2. Planning quality
Did the agent decompose the task sensibly? Score the plan separately from execution. Sometimes agents have great plans they fail to execute; sometimes they execute weak plans well.
3. Tool selection and use
Did the agent choose appropriate tools? Did it use them correctly (parameters, sequencing)? Did it handle tool errors gracefully?
4. Recovery from failures
When something went wrong (broken page, tool error, unexpected response), did the agent recover or persist mindlessly?
5. Side effects and safety
Did the agent take any unexpected actions? Modify state it shouldn't have? Send unintended emails? Make irreversible changes? Side effects matter especially in production-deployed agents.
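Taken together, the five dimensions can be kept as a simple scoring record. This is an illustrative sketch, not any platform's real schema: the dimension keys and the 1–5 scale are assumptions based on the common default noted above.

```python
from dataclasses import dataclass, field

# Hypothetical dimension names mirroring the five above; not a real schema.
DIMENSIONS = (
    "goal_completion",
    "planning_quality",
    "tool_use",
    "recovery",
    "side_effects_safety",
)

@dataclass
class AgentRunScore:
    run_id: str
    scores: dict                      # dimension name -> integer score 1..5
    side_effects: list = field(default_factory=list)  # observed state changes

    def validate(self):
        """Check every dimension is present and scored on the 1-5 scale."""
        for dim in DIMENSIONS:
            value = self.scores.get(dim)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{dim}: expected an integer 1-5, got {value!r}")
        return True
```

Keeping the dimensions as explicit keys forces you to score each one, which is the point: a run can't silently inherit a high planning score from a lucky outcome.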
The common scoring mistakes
Outcome bias
The agent succeeded → therefore it did well. Wrong. An agent that succeeded by luck or brute force should score lower than an agent that failed but planned well. Decouple outcome from process.
Hindsight bias
"Obviously the agent should have done X." Maybe — but only if X was knowable from the agent's information state at that step. Score against what was actually visible to the agent, not against perfect hindsight.
Underweighting side effects
Agent achieves the goal but also sends 3 unintended emails. Most raters score the goal completion and miss the side effects. Senior raters always check for unexpected state changes.
Treating "many steps" as "good planning"
An agent that takes 30 actions to complete a 3-step task isn't planning better — it's planning worse. Step count isn't a proxy for thoroughness.
The mental model that scores well
For each agent run, ask:
- Was the plan sound at the outset? Score before seeing the outcome.
- Was each tool call appropriate at that step? Score independently of later steps.
- When something went wrong, what did the agent do? Recovery quality is a major dimension.
- What state did the agent change? List actual side effects.
- What's the comparison to the best possible run? Senior raters benchmark against ideal, not just acceptable.
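The "what state did the agent change" question can be made partly mechanical. A minimal sketch, assuming a trace is a list of step dicts with hypothetical `action` and `target` fields, and that you know which action types mutate state:

```python
# Illustrative only: the action names and step fields below are assumptions,
# not a real agent-platform schema.
MUTATING_ACTIONS = {"send_email", "delete_file", "write_record"}

def list_side_effects(trace, authorized):
    """Return (step index, action, target) for every state-changing action
    the user did not explicitly authorize."""
    effects = []
    for i, step in enumerate(trace):
        action = step.get("action")
        if action in MUTATING_ACTIONS and action not in authorized:
            effects.append((i, action, step.get("target")))
    return effects
```

Listing side effects explicitly, rather than scanning for them informally, is what keeps the safety dimension from being swallowed by goal completion.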
Specific failure patterns to look for
- Repeated tool calls with the same parameters. Agent stuck in a loop without realizing it.
- Hallucinated tool outputs. Agent uses information that wasn't actually returned by the tool.
- Ignoring error responses. Tool returned an error; agent proceeds as if success.
- Goal drift. Original task forgotten; agent doing something tangentially related.
- Premature termination. Agent stops before goal is met but reports completion.
- Unsafe action. Agent makes irreversible state changes that the user didn't authorize.
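Two of these patterns, loops and ignored errors, can be checked mechanically before you read the trace in detail. A sketch under an assumed trace format: each step is a dict with `tool`, `args`, and `result` keys (hypothetical names, not any platform's real schema):

```python
def find_loops(trace, threshold=3):
    """Flag runs of `threshold` or more identical tool calls with the
    same parameters - the classic stuck-in-a-loop signature."""
    flagged, run = [], 1
    for prev, cur in zip(trace, trace[1:]):
        if (cur["tool"], cur["args"]) == (prev["tool"], prev["args"]):
            run += 1
            if run == threshold:
                flagged.append((cur["tool"], cur["args"]))
        else:
            run = 1
    return flagged

def find_ignored_errors(trace):
    """Flag step indices where the tool returned an error but the next
    step proceeds without retrying."""
    return [
        i for i, step in enumerate(trace[:-1])
        if step["result"].get("error") and not trace[i + 1].get("is_retry")
    ]
```

Checks like these only surface candidates; you still have to judge whether a repeated call was a genuine retry strategy or mindless persistence.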
How to write strong agent eval justifications
- State the overall verdict (succeeded / failed / partial).
- Identify the specific decision points that mattered most.
- Describe what the agent did vs what an ideal agent would have done at those points.
- List any side effects you observed.
- Give concrete suggestions for what would have changed the outcome.
Aim for 200–350 words. Longer than coding eval justifications because there's more to evaluate.
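As a memory aid, the structure above can be kept as a fill-in skeleton. The field names here are illustrative, not a platform requirement:

```python
# Hypothetical skeleton mirroring the five justification elements above.
JUSTIFICATION_TEMPLATE = """\
Verdict: {verdict} (succeeded / failed / partial)
Key decision points: {decision_points}
Agent vs ideal at those points: {comparison}
Side effects observed: {side_effects}
What would have changed the outcome: {suggestions}
"""

def draft_justification(**fields):
    """Fill the skeleton; raises KeyError if any element is missing."""
    return JUSTIFICATION_TEMPLATE.format(**fields)
```

The skeleton's value is the forcing function: a missing element fails loudly instead of being silently skipped in a free-form write-up.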
Bottom line
Agent evaluation requires explicit decoupling of plan, execution, recovery, and outcome. Senior agent evaluators score each dimension independently and document side effects systematically. The skill is learnable but takes 4–6 weeks of focused practice. Pay reflects the specialty: $95–$130/hr at senior tier vs $75–$95 at senior generalist coding.