
How AI training quality scores actually work.

What goes into the rolling weighted average, how disagreements with consensus affect your score, and the leverage points that move scores fastest.

Your quality score is the single biggest determinant of your tier, your hour allocation, and your specialty access on AI training platforms. Most contractors don't understand exactly how it's computed. Here are the actual mechanics.

The rolling weighted average

Every major platform uses a rolling weighted average over your last N tasks, where N is typically 80–120 depending on platform:

  • Outlier: Last 80 tasks, weighted by task complexity
  • Mercor: Last 100 tasks, weighted by task type and length
  • Surge AI: Last 120 tasks, weighted by program tier

This means a single bad week doesn't tank your career — bad scores age out of the window. But sustained low scores compound, and recovery takes time proportional to the window size.
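The window mechanics above can be sketched in a few lines. This is an illustrative model only: the window size and the per-task weighting scheme (complexity, type/length, program tier) are assumptions, not any platform's published formula.

```python
from collections import deque

def rolling_weighted_average(scores, weights, window=80):
    """Weighted mean over the last `window` (score, weight) pairs.

    The weight per task might reflect complexity, task type/length,
    or program tier depending on the platform; the exact scheme here
    is an illustrative assumption.
    """
    recent_s = deque(scores, maxlen=window)   # keep only the last `window` tasks
    recent_w = deque(weights, maxlen=window)
    total = sum(recent_w)
    return sum(s * w for s, w in zip(recent_s, recent_w)) / total

# One bad week (10 tasks at 0.6) inside an 80-task window of 0.9s
# drops the average only to 0.8625 -- and ages out as new tasks arrive.
avg = rolling_weighted_average([0.9] * 70 + [0.6] * 10, [1.0] * 80)
```

The `maxlen` deque is what gives the window its "bad scores age out" behavior: once the 81st task arrives, the oldest score simply falls off the back.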

What goes into a single task score

Each task generates a score between 0 and 1. The components:

  • Rubric agreement (60–70% of weight): How closely your evaluation matches the platform's reference answer or consensus.
  • Justification quality (15–25% of weight): How well-reasoned your written explanation is.
  • Edge-case identification (10–15% of weight): Did you flag the cases the model commonly fails on.
  • Completion (5% of weight): Did you address every sub-question.

Most contractors over-focus on rubric agreement and underestimate justification quality. The fastest path to a 0.92+ score isn't more rubric-correct answers — it's better justifications on the answers you already give.
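A quick sketch shows why justifications are the fastest lever. The specific weights below are assumptions chosen from inside the ranges stated above, not any platform's actual coefficients:

```python
# Illustrative component weights, picked within the stated ranges
# (60-70% rubric, 15-25% justification, 10-15% edge cases, 5% completion).
WEIGHTS = {
    "rubric_agreement": 0.65,
    "justification_quality": 0.20,
    "edge_case_identification": 0.10,
    "completion": 0.05,
}

def task_score(components):
    """Weighted sum of component scores, each in [0, 1]."""
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

# Identical rubric accuracy, different justification quality:
weak = task_score({"rubric_agreement": 0.95, "justification_quality": 0.5,
                   "edge_case_identification": 0.8, "completion": 1.0})
strong = task_score({"rubric_agreement": 0.95, "justification_quality": 0.95,
                     "edge_case_identification": 0.8, "completion": 1.0})
```

With these assumed weights, lifting justification quality from 0.5 to 0.95 moves the task score from roughly 0.85 to roughly 0.94 without changing a single rubric answer.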

Why your score might be dropping

Common causes, in order of frequency:

  1. Skimping on the justification field. A 5-word justification scores 0.6; a 100-word reasoned justification scores 0.9.
  2. Going too fast. Tasks under your platform's recommended completion time score lower; the system penalizes apparent skimming.
  3. Disagreement without explanation. Marking "I disagree" on a task where you differ from consensus, without saying why, drops the score significantly.
  4. Cross-language drift. Working in several languages at once dilutes your score in each.
  5. Calibration drift. Your judgment slowly shifts away from the platform consensus over time. Periodic re-calibration corrects this.

The score impact on monthly income is substantial: moving from mid-tier (0.85) to senior (0.91) is worth roughly +60% per month.

The disagreement paradox

Counterintuitively, disagreeing with consensus can raise your score if you do it well. The system tracks two metrics:

  • Agreement rate: How often you match consensus.
  • Disagreement quality: When you disagree, how well-reasoned is your case.

A rater who agrees with consensus 90% of the time and provides excellent justifications on the 10% disagreements scores higher than a rater who agrees 95% of the time but provides poor justifications on disagreements.
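The arithmetic behind that comparison can be sketched with a hypothetical blend, where agreement tasks and disagreement tasks each earn an average score. Every number here is invented for illustration; only the direction of the result matters:

```python
def blended_score(agreement_rate, agree_score, disagree_score):
    """Average task score when agreements earn `agree_score` and
    disagreements earn `disagree_score` (a hypothetical blend)."""
    return agreement_rate * agree_score + (1 - agreement_rate) * disagree_score

# Rater A: agrees 90% of the time, excellent justifications when disagreeing.
a = blended_score(0.90, agree_score=0.95, disagree_score=0.90)
# Rater B: agrees 95% of the time, poor justifications when disagreeing.
b = blended_score(0.95, agree_score=0.95, disagree_score=0.40)
```

Rater A comes out ahead despite the lower agreement rate, because well-reasoned disagreements score nearly as well as agreements, while poorly justified ones crater.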

The lesson: don't avoid disagreement to game your agreement rate. Express disagreement clearly when you have it, and back it up.

How to recover from a low score

If you've drifted below your previous tier:

  1. Pause and diagnose. Don't keep grinding tasks at low quality — you're embedding the bad pattern in your rolling window.
  2. Take 5 calibration tasks. Most platforms let you opt into calibration tasks. Score these carefully.
  3. Restart with longer tasks. Long-form tasks carry more weight in the average, so a few high-quality long-form tasks pull your score up faster than many quick ones.
  4. Use the justification field consistently. Aim for 80–120 words on every disagreement.

Realistic recovery time from 0.78 → 0.88: about 4 weeks of focused work.
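The recovery timeline falls straight out of the window math. As a sketch, assuming a 100-task window with equal weights and a sustained 0.95 per-task score during recovery (all assumptions):

```python
def tasks_to_recover(current, target, new_task_score, window=100):
    """How many tasks at `new_task_score` it takes to lift a rolling
    average of `window` tasks from `current` to `target`, assuming
    equal task weights (real windows are weighted)."""
    k, avg = 0, current
    while avg < target and k < window:
        k += 1
        # k new tasks at new_task_score displace k old tasks at `current`
        avg = ((window - k) * current + k * new_task_score) / window
    return k

needed = tasks_to_recover(0.78, 0.88, 0.95)  # 59 tasks under these assumptions
```

Fifty-nine tasks at roughly 15 tasks a week lands on about four weeks, consistent with the estimate above.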

The senior-tier mindset

Contractors at the top of the score distribution share two habits:

  • They write to the grader. Justifications are written as if explaining to a senior reviewer who'll second-guess them. This naturally produces 80–150 word justifications with concrete reasoning.
  • They re-read tasks before submitting. Thirty seconds of re-reading before you hit submit catches half the careless mistakes that drop scores.

Bottom line

Quality scores are mostly justification-driven, not rubric-agreement-driven. The contractors who tier up fastest don't have better judgment than mid-tier contractors — they explain their judgment better. Spend the extra two minutes on the justification field; the rolling-average math rewards it heavily.


Frequently asked questions

How are AI training quality scores calculated?
Each task generates a 0–1 score based on rubric agreement (60–70%), justification quality (15–25%), edge-case identification (10–15%), and completion (5%). Your overall score is a rolling weighted average of your last 80–120 tasks.
What's a good AI training quality score?
Above 0.85 is mid-tier on most platforms. Above 0.91 qualifies for senior tier. Above 0.94 unlocks specialty pools and priority routing.
How long does it take to recover from a low quality score?
About 4 weeks of focused work to recover from 0.78 to 0.88. Bad scores age out of your rolling window over the same period the window covers (80–120 tasks).
Should I avoid disagreeing with the platform's reference answers?
No. Well-reasoned disagreements with strong justifications score higher than rubric-agreement with weak justifications. Express disagreement clearly when you have it, and back it up with 80–120 words of reasoning.