The single most important skill for senior AI coding evaluators is reading unfamiliar code quickly and accurately. Here's how the contractors who hit 0.92+ scores actually do it.
The four-pass approach
Don't read code line-by-line. Senior evaluators do four passes, each looking for different things:
Pass 1: Shape (15 seconds)
Without reading any line carefully — what's the structure? Functions, classes, file count. Where are the entry points? What's the obvious data flow?
This pass tells you what kind of bug to look for next. Algorithm code, IO code, and concurrent code each have their own characteristic bug shapes.
Pass 2: Names and types (30 seconds)
Read function signatures and variable names. What is each function supposed to do based on its name? Do the names match the parameters?
Half of "model produced wrong code" cases involve a mismatch between what a function is named and what it actually does. Catching these takes 30 seconds and is high-leverage.
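A minimal sketch of the kind of mismatch Pass 2 catches. The function names here are hypothetical, invented for illustration:

```python
# Hypothetical example: the name promises deduplication, and the code does
# deduplicate — but set() also discards input order, which the name never
# warned callers about. A name/behavior mismatch worth flagging.
def remove_duplicates(items):
    return list(set(items))  # order of the result is unspecified


# A version whose behavior matches what the name implies: dict keys
# preserve insertion order (Python 3.7+), so order survives.
def remove_duplicates_ordered(items):
    return list(dict.fromkeys(items))
```

Reading just the signature and name took seconds; verifying that the body actually honors them is the whole point of this pass.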
Pass 3: Control flow (60–120 seconds)
Walk the happy-path execution mentally. What does the function do when called with normal inputs?
This pass catches: missing branches, dead code, infinite loops, missed exits.
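As a sketch of a missed-exit bug this pass surfaces (function name and behavior are hypothetical):

```python
# Tracing the happy path: found -> return i. Then ask what happens on a
# miss — without the final return, the loop falls through and the function
# implicitly returns None, breaking callers that compare against -1.
def find_index(items, target):
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1  # the exit the mental walk-through checks for
```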
Pass 4: Edge cases (120–240 seconds)
Now hunt specifically for failure modes. Empty inputs. Null/undefined. Off-by-one. Resource leaks. Type confusion. Concurrency races.
This is where most points are won or lost.
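A sketch of the edge-case probing Pass 4 does, using a hypothetical helper. On normal input it looks fine; the evaluator's job is to push the empty and too-small inputs through it:

```python
# Probe: what if values is empty? What if window > len(values)?
# Here range(len(values) - window + 1) is empty in both cases, so the
# function returns [] instead of dividing by zero or slicing past the end.
def moving_average(values, window):
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

Empty inputs, nulls, off-by-ones: each gets a deliberate mental test call, not a glance.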
Common bug shapes by language
Python:
- Mutable default arguments (`def f(x, y=[])`).
- Late binding in closures.
- Integer division vs float division mismatches.
- Generators consumed twice.
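The first two Python bug shapes above, sketched side by side (function names are illustrative):

```python
# Mutable default argument: the list is evaluated once, at definition
# time, and shared across every call that omits acc.
def append_bad(x, acc=[]):
    acc.append(x)
    return acc          # second call returns [first, second] — state leaks

# The idiomatic fix: sentinel default, fresh list per call.
def append_ok(x, acc=None):
    if acc is None:
        acc = []
    acc.append(x)
    return acc

# Late binding in closures: each lambda looks up i when called,
# after the loop has finished, so all three see the final value.
fns = [lambda: i for i in range(3)]
late = [f() for f in fns]                       # [2, 2, 2]

# Binding i as a default argument captures it per iteration.
bound = [(lambda i=i: i)() for i in range(3)]   # [0, 1, 2]
```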
JavaScript / TypeScript:
- `this` rebinding in callbacks.
- Async without `await`.
- Type narrowing failures.
- Coercion-based comparisons (`==` vs `===`).
Rust:
- Unbounded recursion in async.
- `unwrap()` on potentially-`None` values.
- Lifetime annotations that are technically correct but unsound at scale.
Go:
- Goroutine leaks.
- Loop variable capture.
- Returned slices sharing memory.
Mental hygiene during code review
- Don't accept code "looks right" without verifying. Models produce plausible-looking wrong code constantly.
- Distrust comments. Comments and code drift; trust the code, verify the comment.
- Treat unfamiliar libraries like black boxes. Don't assume their behavior; read the docs or flag the assumption.
- Walk through hidden tests mentally. The grader runs more tests than the visible ones.
What separates fast readers from slow
The fastest evaluators we've observed:
- Don't move the mouse cursor unnecessarily. Eyes track faster than mouse navigation.
- Use keyboard shortcuts. Ctrl+F to search for variable usage. Ctrl+G for line jump.
- Take handwritten notes. Constraints, edge cases to verify, suspicious lines.
- Read aloud quietly. Verbalizing code catches issues silent reading misses.
Bottom line
Reading code under time pressure is a learnable skill. The four-pass approach (shape → names → control flow → edge cases) is both faster and more accurate than line-by-line reading. Combined with knowing the common bug shapes of each language, senior evaluators read 80–120 lines in 5 minutes and consistently find the bugs the task was designed to surface.