AI agent evaluation deep dive.

How to score agent runs across goal completion, tool use, planning, recovery, and side effects. Practical framework for the fastest-growing AI eval category.

Agent evaluation is the highest-growth specialty in AI training in 2026. Few raters do it well because the rubric isn't obvious. Here's the practical framework for scoring agent runs accurately.

What an agent run looks like

You're given:

  • The user's goal in plain English.
  • The agent's full execution trace — every action, observation, tool call, and intermediate decision.
  • The final outcome.

Your job: score the run on multiple dimensions. Most platforms use a 5-point scale per dimension; some use binary pass/fail.
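
Concretely, the bundle you score is just those three pieces: a goal, an ordered trace, and a reported outcome. A minimal sketch in Python (the field names are illustrative, not any platform's actual schema; later sketches in this article build on these two types):

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One entry in the agent's execution trace."""
    thought: str          # the agent's stated reasoning, if the platform exposes it
    tool: str | None      # tool called at this step; None for pure reasoning steps
    params: dict | None   # arguments passed to the tool
    observation: str      # what came back: tool output, page content, error text

@dataclass
class AgentRun:
    """Everything the rater sees before scoring."""
    goal: str                                         # the user's goal in plain English
    trace: list[TraceStep] = field(default_factory=list)
    outcome: str = ""                                 # what the agent reported at the end
```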

The five scoring dimensions

1. Goal completion

Did the agent achieve what the user asked? Three sub-questions: was the goal correctly understood, was the final outcome actually the right thing, and, if completion was only partial, how much partial credit is warranted?

2. Planning quality

Did the agent decompose the task sensibly? Score the plan separately from execution. Sometimes agents have great plans they fail to execute; sometimes they execute weak plans well.

3. Tool selection and use

Did the agent choose appropriate tools? Did it use them correctly (parameters, sequencing)? Did it handle tool errors gracefully?

4. Recovery from failures

When something went wrong (broken page, tool error, unexpected response), did the agent recover or persist mindlessly?

5. Side effects and safety

Did the agent take any unexpected actions? Modify state it shouldn't have? Send unintended emails? Make irreversible changes? Side effects matter especially for production-deployed agents.
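
Taken together, the five dimensions make up a score sheet like the sketch below. The dimension names follow this article; the 1–5 anchor comments and the pass rule are illustrative assumptions, not any platform's actual rubric:

```python
from dataclasses import dataclass

DIMENSIONS = (
    "goal_completion", "planning_quality", "tool_use",
    "recovery", "side_effects",
)

@dataclass
class ScoreSheet:
    goal_completion: int   # 1 = wrong goal entirely, 5 = achieved exactly as asked
    planning_quality: int  # 1 = no coherent plan, 5 = sound decomposition
    tool_use: int          # 1 = wrong tools or parameters, 5 = right tools, right order
    recovery: int          # 1 = persisted mindlessly through errors, 5 = detected and adapted
    side_effects: int      # 1 = irreversible unauthorized changes, 5 = none at all

    def __post_init__(self) -> None:
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be 1-5, got {value}")

    def passes(self) -> bool:
        # Hypothetical binary conversion: every dimension must reach at
        # least 3, so a low side_effects score fails the run no matter
        # how well everything else went.
        return all(getattr(self, name) >= 3 for name in DIMENSIONS)
```

The point of the separate fields is the decoupling the next section insists on: a run can legitimately earn a 5 on planning and a 2 on goal completion.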


The common scoring mistakes

Outcome bias

The agent succeeded → therefore it did well. Wrong. An agent that succeeded by luck or brute force should score lower than an agent that failed but planned well. Decouple outcome from process.

Hindsight bias

"Obviously the agent should have done X." Maybe — but only if X was knowable from the agent's information state at that step. Score against what was actually visible to the agent, not against perfect hindsight.

Underweighting side effects

Agent achieves the goal but also sends 3 unintended emails. Most raters score the goal completion and miss the side effects. Senior raters always check for unexpected state changes.

Treating "many steps" as "good planning"

An agent that takes 30 actions to complete a 3-step task isn't planning better — it's planning worse. Step count isn't a proxy for thoroughness.

The mental model that scores well

For each agent run, ask, in order (a sketch of this review workflow follows the list):

  • Was the plan sound at the outset? Score before seeing the outcome.
  • Was each tool call appropriate at that step? Score independently of later steps.
  • When something went wrong, what did the agent do? Recovery quality is a major dimension.
  • What state did the agent change? List actual side effects.
  • How does the run compare to the best possible run? Senior raters benchmark against the ideal, not just the acceptable.
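
One way to hold yourself to that order is to review the run in stages, committing each judgment before the next stage reveals more. A hypothetical helper, building on the AgentRun sketch above; the questions mirror the list, and matching "error" in observations is a stand-in for however the platform actually marks failures:

```python
def review_stages(run: AgentRun):
    """Yield (question, evidence) pairs in strict review order: the plan
    is judged from the opening steps alone, each step from the trace
    prefix visible at that point, and the outcome only at the end."""
    yield "Was the plan sound at the outset?", run.trace[:2]
    for i in range(len(run.trace)):
        yield (f"Was step {i} appropriate given what the agent could see?",
               run.trace[: i + 1])
    yield ("Where did things go wrong, and what did the agent do next?",
           [s for s in run.trace if "error" in s.observation.lower()])
    yield ("What state did the agent actually change?",
           [s for s in run.trace if s.tool is not None])
    yield "How does the outcome compare to the ideal run?", run.outcome
```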

Specific failure patterns to look for

  • Repeated tool calls with the same parameters. Agent stuck in a loop without realizing it; the sketch after this list pre-screens for this pattern.
  • Hallucinated tool outputs. Agent uses information that wasn't actually returned by the tool.
  • Ignoring error responses. Tool returned an error; agent proceeds as if success.
  • Goal drift. Original task forgotten; agent doing something tangentially related.
  • Premature termination. Agent stops before goal is met but reports completion.
  • Unsafe action. Agent makes irreversible state changes that the user didn't authorize.
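
Two of these patterns, loops and ignored errors, are mechanical enough to pre-screen with a script before reading the trace in full; hallucinated outputs, goal drift, and the rest still take a careful read. A rough sketch against the AgentRun structure above, with the same caveat that matching "error" in observations stands in for real error signalling:

```python
def flag_suspect_steps(run: AgentRun) -> list[str]:
    """Cheap pre-screen for loops and ignored errors. The flags only say
    where to look first; the rater still reads the whole trace."""
    flags: list[str] = []
    call_counts: dict[tuple[str, str], int] = {}
    prev_step_errored = False
    for i, step in enumerate(run.trace):
        if step.tool is not None:
            key = (step.tool, repr(sorted((step.params or {}).items())))
            call_counts[key] = call_counts.get(key, 0) + 1
            if call_counts[key] > 2:
                flags.append(f"step {i}: {step.tool} repeated with identical "
                             f"parameters (possible loop)")
            if prev_step_errored:
                flags.append(f"step {i}: pressed on after an error without "
                             f"visibly addressing it")
        prev_step_errored = "error" in step.observation.lower()
    return flags
```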

How to write strong agent eval justifications

  1. State the overall verdict (succeeded / failed / partial).
  2. Identify the specific decision points that mattered most.
  3. Describe what the agent did vs what an ideal agent would have done at those points.
  4. List any side effects you observed.
  5. Give concrete suggestions for what would have changed the outcome.

Aim for 200–350 words: longer than a coding eval justification, because there's more to evaluate.
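
Condensed far below the target length, a skeleton of that shape for a wholly invented run might read:

  Partial success. The agent found the right flight but booked a refundable fare instead of the cheapest one the user asked for. The decisive point was step 6: the search returned both fares and the agent committed without re-checking the user's price constraint. An ideal agent would have re-read the goal before booking. Side effects: one booking created (reversible). A single constraint check at step 6 would have changed the outcome.

A real justification expands each of those sentences into the detail the numbered steps call for.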

Bottom line

Agent evaluation requires explicit decoupling of plan, execution, recovery, and outcome. Senior agent evaluators score each dimension independently and document side effects systematically. The skill is learnable but takes 4–6 weeks of focused practice. Pay reflects the specialty: $95–$130/hr at senior tier vs $75–$95 at senior generalist coding.


Frequently asked questions

What is AI agent evaluation?
Scoring AI agents (models that take multi-step actions like browsing, coding, calling APIs) on dimensions including goal completion, planning quality, tool use, error recovery, and side effects. Tasks are typically 30–90 minutes each.
How much does AI agent evaluation pay?
Entry tier: $55–$70/hr. Mid: $75–$95/hr. Senior: $100–$130/hr. Specialty (long-horizon, complex tool-use): $120–$170/hr. Pays roughly 25% more than generalist coding eval at equivalent tiers.
What's the most common mistake in agent evaluation?
Outcome bias — scoring the agent based on whether it succeeded rather than whether it acted correctly. An agent that succeeded by luck should score lower than an agent that failed but planned well.
How long are agent evaluation tasks?
Typically 30–90 minutes per task. Longer than coding eval because you need to read the agent's full execution trace (every action, tool call, observation) and score multiple dimensions independently.