Your quality score is the single biggest determinant of your tier, your hour allocation, and your specialty access on AI training platforms. Most contractors don't understand exactly how it's computed. Here are the actual mechanics.
The rolling weighted average
Every major platform uses a rolling weighted average over your last N tasks, where N is typically 80–120, depending on the platform:
- Outlier: Last 80 tasks, weighted by task complexity
- Mercor: Last 100 tasks, weighted by task type and length
- Surge AI: Last 120 tasks, weighted by program tier
This means a single bad week doesn't tank your career — bad scores age out of the window. But sustained low scores compound, and recovery takes time proportional to the window size.
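To make the aging-out concrete, here is a minimal sketch of a rolling weighted average, assuming a 100-task window and per-task weights (the window sizes are the platform-specific values listed above; the weight values and function name are illustrative, not any platform's published formula):

```python
from collections import deque

def rolling_quality_score(task_history, window=100):
    """Weighted average over the most recent `window` tasks.

    task_history: list of (score, weight) pairs, oldest first. The weight
    stands in for whatever multiplier the platform applies (complexity,
    task type, length, or program tier).
    """
    recent = deque(task_history, maxlen=window)  # older tasks age out of the window
    total_weight = sum(w for _, w in recent)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in recent) / total_weight

# A bad week (5 tasks at 0.6) barely dents a window full of 0.9s...
history = [(0.9, 1.0)] * 95 + [(0.6, 1.0)] * 5
print(round(rolling_quality_score(history), 3))  # 0.885
# ...but sustained 0.6s take proportionally long to age back out.
```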
What goes into a single task score
Each task generates a score between 0 and 1. The components:
- Rubric agreement (60–70% of weight): How closely your evaluation matches the platform's reference answer or consensus.
- Justification quality (15–25% of weight): How well-reasoned your written explanation is.
- Edge-case identification (10–15% of weight): Whether you flagged the cases the model commonly fails on.
- Completion (5% of weight): Whether you addressed every sub-question.
Most contractors over-focus on rubric agreement and underestimate justification quality. The fastest path to a 0.92+ score isn't more rubric-correct answers — it's better justifications on the answers you already give.
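To see why, treat a task score as a weighted sum of those components. The weights below are illustrative midpoints of the ranges above, not published values:

```python
# Illustrative midpoints of the weight ranges above (assumptions, not published values).
WEIGHTS = {
    "rubric_agreement": 0.65,
    "justification_quality": 0.20,
    "edge_cases": 0.10,
    "completion": 0.05,
}

def task_score(components):
    """Each component is a 0-1 rating; the task score is their weighted sum."""
    return sum(WEIGHTS[name] * value for name, value in components.items())

# Perfect rubric agreement with a thin justification...
print(task_score({"rubric_agreement": 1.0, "justification_quality": 0.5,
                  "edge_cases": 0.8, "completion": 1.0}))   # 0.88
# ...loses to near-perfect agreement backed by a strong justification.
print(task_score({"rubric_agreement": 0.95, "justification_quality": 0.95,
                  "edge_cases": 0.9, "completion": 1.0}))   # 0.9475
```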
Why your score might be dropping
Common causes, in order of frequency:
- Skipping the justification field. A 5-word justification scores 0.6; a 100-word reasoned justification scores 0.9.
- Going too fast. Tasks finished in well under your platform's recommended completion time score lower; the system penalizes apparent skimming.
- Disagreement without explanation. Recording a disagreement with consensus without explaining why drops the score significantly.
- Cross-language drift. Splitting your work across several languages at once dilutes your score in each.
- Calibration drift. Your judgment slowly shifts away from the platform consensus over time. Periodic re-calibration corrects this.
The disagreement paradox
Counterintuitively, disagreeing with consensus can raise your score if you do it well. The system tracks two metrics:
- Agreement rate: How often you match consensus.
- Disagreement quality: When you disagree, how well-reasoned is your case.
A rater who agrees with consensus 90% of the time and provides excellent justifications on the 10% disagreements scores higher than a rater who agrees 95% of the time but provides poor justifications on disagreements.
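A hedged sketch of that comparison, assuming the two metrics blend linearly (the real blend and its weights are not published; the 70/30 split below is made up for illustration):

```python
def blended_score(agreement_rate, disagreement_quality,
                  w_agree=0.7, w_disagree=0.3):
    """Hypothetical linear blend of the two tracked metrics."""
    return w_agree * agreement_rate + w_disagree * disagreement_quality

# 90% agreement with excellent justifications on the 10% of disagreements...
print(blended_score(0.90, 0.95))  # 0.915
# ...beats 95% agreement with poorly justified disagreements.
print(blended_score(0.95, 0.40))  # 0.785
```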
The lesson: don't avoid disagreement to game your agreement rate. Express disagreement clearly when you have it, and back it up.
How to recover from a low score
If you've drifted below your previous tier:
- Pause and diagnose. Don't keep grinding tasks at low quality — you're embedding the bad pattern in your rolling window.
- Take 5 calibration tasks. Most platforms let you opt into calibration tasks; work through them carefully.
- Restart with longer tasks. Long-form tasks carry more weight in the average, so a few high-quality long-form tasks pull your score up faster than many quick ones (see the sketch below).
- Use the justification field consistently. Aim for 80–120 words on every disagreement.
Realistic recovery time from 0.78 → 0.88: about 4 weeks of focused work.
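A rough sketch of why long-form tasks pull the average up faster, assuming they carry roughly three times the weight of quick tasks (the actual multipliers vary by platform and are not published):

```python
def weighted_average(history):
    """history: list of (score, weight) pairs inside the rolling window."""
    total_weight = sum(w for _, w in history)
    return sum(s * w for s, w in history) / total_weight

# A 100-task window stuck at 0.78, then 20 new tasks at 0.92 replace the oldest 20.
stale = [(0.78, 1.0)] * 80
quick_recovery = stale + [(0.92, 1.0)] * 20      # 20 quick tasks, weight 1.0
long_form_recovery = stale + [(0.92, 3.0)] * 20  # 20 long-form tasks, assumed weight 3.0

print(round(weighted_average(quick_recovery), 3))      # 0.808
print(round(weighted_average(long_form_recovery), 3))  # 0.84
```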
The senior-tier mindset
Contractors at the top of the score distribution share two habits:
- They write to the grader. Justifications are written as if explaining to a senior reviewer who'll second-guess them. This naturally produces 80–150 word justifications with concrete reasoning.
- They re-read tasks before submitting. Thirty seconds of review before hitting submit catches half the careless mistakes that drop scores.
Bottom line
Quality scores are mostly justification-driven, not rubric-agreement-driven. The contractors who tier up fastest don't have better judgment than mid-tier contractors — they explain their judgment better. Spend the extra two minutes on the justification field; the rolling-average math rewards it heavily.