
Reading code under time pressure: skills for AI evaluators.

How senior AI training evaluators read 100+ lines of unfamiliar code in 5 minutes and find bugs accurately. The mental model and the specific patterns to look for.

The single most important skill for senior AI coding evaluators is reading unfamiliar code quickly and accurately. Here's how the contractors who hit 0.92+ scores actually do it.

The four-pass approach

Don't read code line-by-line. Senior evaluators do four passes, each looking for different things:

Pass 1: Shape (15 seconds)

Without reading any line carefully — what's the structure? Functions, classes, file count. Where are the entry points? What's the obvious data flow?

This pass tells you what kind of bug to look for next. Algorithmic code, I/O code, and concurrent code each have their own characteristic bug shapes.

Pass 2: Names and types (30 seconds)

Read function signatures and variable names. What is each function supposed to do based on its name? Do the names match the parameters?

Half of "model produced wrong code" cases involve a mismatch between what a function is named and what it does. Catching these takes 30 seconds and is high-leverage.
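
A minimal Python sketch of the kind of mismatch this pass catches (the function and its name are hypothetical, not from a real task):

    # Pass 2 red flag: the name promises a median, the body computes a mean.
    def median_latency(samples: list[float]) -> float:
        # Reading the signature and name alone primes the suspicion;
        # the body confirms it: this returns the arithmetic mean, not the median.
        return sum(samples) / len(samples)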

Pass 3: Control flow (60–120 seconds)

Walk the happy-path execution mentally. What does the function do when called with normal inputs?

This pass catches: missing branches, dead code, infinite loops, missed exits.
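
As a hypothetical illustration (Python, names invented), this is the kind of missed exit a happy-path walk-through surfaces:

    def classify(score: float) -> str:
        if score >= 0.9:
            return "excellent"
        elif score >= 0.5:
            label = "acceptable"   # dead assignment: no return, execution falls through
        else:
            return "poor"
        # Mid-range scores reach here, and the caller silently gets None instead of a str.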

Pass 4: Edge cases (120–240 seconds)

Now hunt specifically for failure modes. Empty inputs. Null/undefined. Off-by-one. Resource leaks. Type confusion. Concurrency races.

This is where most points are won or lost.
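
For example, a helper like this (hypothetical, Python) passes obvious happy-path tests but fails exactly the edge cases listed above:

    def last_n_average(values: list[float], n: int) -> float:
        window = values[-n:]              # n == 0 silently selects the whole list
        return sum(window) / len(window)  # empty input raises ZeroDivisionError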

Speed × accuracy: senior-tier evaluators read 80–120 lines in 5 minutes. Practice the four-pass approach.

Common bug shapes by language

Python:

  • Mutable default arguments (def f(x, y=[])).
  • Late binding in closures.
  • Integer division vs float division mismatches.
  • Generators consumed twice.
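
Minimal sketches of the first two bullets (illustrative code, not from a real task):

    # Mutable default argument: one shared list persists across calls.
    def append_tag(tag, tags=[]):
        tags.append(tag)
        return tags

    append_tag("a")   # ['a']
    append_tag("b")   # ['a', 'b'] -- the previous call's state leaks in

    # Late binding in closures: every lambda sees the final value of i.
    callbacks = [lambda: i for i in range(3)]
    print([f() for f in callbacks])   # prints [2, 2, 2], not [0, 1, 2]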

JavaScript / TypeScript:

  • this rebinding in callbacks.
  • Async without await.
  • Type narrowing failures.
  • Coercion-based comparisons (== vs ===).

Rust:

  • Unbounded recursion in async.
  • unwrap() on potentially-None values.
  • Lifetime annotations that are technically correct but unsound at scale.

Go:

  • Goroutine leaks.
  • Loop variable capture.
  • Returned slices sharing memory.

Mental hygiene during code review

  • Don't accept that code "looks right" without verifying it. Models produce plausible-looking wrong code constantly.
  • Distrust comments. Comments and code drift; trust the code, verify the comment (see the sketch after this list).
  • Treat unfamiliar libraries like black boxes. Don't assume their behavior; read the docs or flag the assumption.
  • Walk through hidden tests mentally. The grader runs more tests than the visible ones.
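
A hypothetical Python example of the comment/code drift flagged in the second point:

    def retry_delay(attempt: int) -> float:
        """Exponential backoff starting at 1 second."""  # the docstring claims exponential
        return 1.0 * attempt                              # the code actually implements linear backoff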

What separates fast readers from slow

The fastest evaluators we've observed:

  • Don't move the mouse cursor unnecessarily. Eyes track faster than mouse navigation.
  • Use keyboard shortcuts. Ctrl+F to search for variable usage. Ctrl+G for line jump.
  • Take handwritten notes. Constraints, edge cases to verify, suspicious lines.
  • Read aloud quietly. Verbalizing code catches issues silent reading misses.

Bottom line

Reading code under time pressure is a learnable skill. The four-pass approach (shape → names → control flow → edge cases) is more accurate and faster than line-by-line reading. Combined with knowing common bug shapes by language, senior evaluators read 80–120 lines in 5 minutes and consistently find the bugs hidden in model-generated code.


Frequently asked questions

How do AI evaluators read code quickly?
Four passes: shape (15s), names and types (30s), control flow (60–120s), edge cases (120–240s). Don't read line-by-line — read in passes targeting different bug categories.
How long should code review take?
Senior evaluators read 80–120 lines in about 5 minutes for routine bug-hunt tasks. Long-form tasks (1000+ lines, multi-file) take 30–60 minutes including detailed verification.
What are the most common bug patterns by language?
Python: mutable default arguments, late binding closures. JavaScript: this rebinding, async without await. Rust: unwrap on None, unbounded async recursion. Go: goroutine leaks, loop variable capture.
Should AI evaluators trust code comments?
No. Comments and code drift. Verify comments against the actual code behavior, especially in evaluation contexts where the model may have produced misleading comments.