Multi-step reasoning evaluation is the work of judging whether an AI's chain of thought through a complex problem is correct, or whether it gets the right answer through bad reasoning. The role pays well because it's mentally exhausting and requires careful logical attention. Here's how it works in 2026.
The problem these tasks solve
Modern language models can produce correct final answers through wrong reasoning: they reach the right number even though the step-by-step work is incoherent. For high-stakes applications (math, science, business reasoning), a model that's "right by coincidence" is dangerous. Frontier labs need contractors who can read a chain step by step and judge each step independently of the final answer.
What you actually do
You're shown:
- A problem (math, science, logic, complex reasoning).
- The model's chain-of-thought, step by step.
- The model's final answer.
Your job: score each step independently. Was step 1 correct? Step 2? Step 3? Did any step contain a logical leap that the model got away with because the final answer happened to be right? Tasks take 15–45 minutes depending on chain length.
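The workflow above can be sketched as a small data model. This is a hypothetical schema for illustration only; every platform has its own task format, and all field and class names here are invented:

```python
from dataclasses import dataclass


@dataclass
class StepVerdict:
    """One evaluator judgment on a single step of the chain."""
    step_index: int
    correct: bool
    rationale: str  # specific, e.g. "assumes commutativity, but the operation is non-commutative"


@dataclass
class ReasoningTask:
    """One task: a problem, the model's chain of thought, and its final answer."""
    problem: str
    steps: list[str]
    final_answer: str
    reference_answer: str

    def summarize(self, verdicts: list[StepVerdict]) -> dict:
        # Step correctness and final-answer match are tracked separately,
        # so a chain that is "right by coincidence" is still visible.
        return {
            "steps_correct": sum(v.correct for v in verdicts),
            "steps_total": len(self.steps),
            "answer_matches": self.final_answer == self.reference_answer,
            "sound": all(v.correct for v in verdicts),
        }


task = ReasoningTask(
    problem="What is the sum of the first 10 positive integers?",
    steps=["Use the formula n*(n+1)/2 with n=10.", "10*11/2 = 55."],
    final_answer="55",
    reference_answer="55",
)
verdicts = [
    StepVerdict(0, True, "correct closed form for this sum"),
    StepVerdict(1, True, "arithmetic checks out"),
]
summary = task.summarize(verdicts)
```

The key design point is that `sound` depends only on the per-step verdicts, never on whether the answer happened to match.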
What scores high quality
- Catching subtle errors that produce correct outputs: a division by zero handled silently, or an off-by-one in step 4 that cancels with another error in step 7.
- Distinguishing valid alternative reasoning from incorrect reasoning. Sometimes the model takes a different path than the rubric expects but arrives at the answer validly. Recognizing this without misclassifying it is hard.
- Documenting why each step is correct or incorrect. Vague flags ("step 3 looks wrong") score lower than specific flags ("step 3 assumes commutativity but the operation is non-commutative").
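The canceling-errors case is the one that trips up final-answer-only grading. A hypothetical, simplified chain (the problem and steps are invented for illustration): step 1 uses the wrong formula, step 3 applies an unjustified fix-up that happens to cancel it, and the final answer comes out right anyway:

```python
# Hypothetical model chain for: "What is the sum of the first 10 positive integers?"
# Correct value: 10 * 11 / 2 = 55.
chain = [
    # (model's step, evaluator's verdict)
    ("Apply the formula n*(n-1)/2.", False),        # wrong formula: off by one in the numerator
    ("With n = 10: 10*9/2 = 45.", True),            # arithmetic is valid given step 1
    ("Adjust by adding n: 45 + 10 = 55.", False),   # unjustified fix-up that cancels step 1's error
]
final_answer = 55

step_verdicts = [verdict for _, verdict in chain]
answer_correct = final_answer == 55

print(step_verdicts)   # [False, True, False]
print(answer_correct)  # True: right answer, unsound chain
```

An answer-only grader scores this chain perfect; a step-level evaluator flags two of three steps and documents why each fails.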
Pay ranges
- Entry tier: $60–$80/hr.
- Mid tier: $80–$110/hr.
- Senior tier: $110–$140/hr.
- Specialty (math reasoning, complex science, multi-domain): $130–$180/hr.
Hours are lower than generalist coding (10–18/week typical) because the tasks are slower and more cognitively expensive.
Who qualifies
The role requires structured logical attention. Backgrounds that translate well:
- Math, physics, theoretical CS PhDs or strong undergrads from competitive programs.
- Senior software engineers with experience in correctness-critical systems (compilers, databases, formal verification).
- Researchers or analysts trained to read other people's work critically.
- Olympiad-grade problem solvers (math, programming, philosophy).
The role does not require an advanced credential by default; the screen is task-based, not paperwork-based, so candidates qualify on test performance rather than resumes.
How to apply
- Mercor: Has a "reasoning specialist" intake. Lead your application with relevant background (math, CS, research).
- Outlier specialty pools: Unlocked after reaching senior tier in the standard tracks.
- Direct lab engagements: Anthropic, OpenAI, and DeepMind run reasoning evaluation programs through contractor agencies.
Bottom line
Multi-step reasoning evaluation is among the most demanding and best-compensated specialty roles in 2026. If you have a strong logical-reasoning background and can sustain careful attention for 15–45 minute task chunks, it's one of the fastest paths to $100/hr+ AI training income.