Senior AI evaluators are occasionally asked to help platforms design rubrics. Knowing what separates a useful rubric from a confusing one is the bridge between contracting work and AI evaluation research roles. Here's the framework.
What makes a good rubric
Three criteria:
- Inter-rater agreement. Two careful raters using the rubric independently arrive at similar scores most of the time.
- Discrimination. The rubric produces meaningful differences between good and bad outputs, not just "all 4s."
- Coverage. The rubric captures the dimensions of quality that matter to the downstream use case.
Most production rubrics fail at least one of these. Rubric improvement is a tractable, high-leverage activity.
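The first criterion can be measured directly. A minimal sketch, assuming two raters have independently scored the same items on a 1–5 scale (the scores below are invented), using scikit-learn's cohen_kappa_score:

```python
# Check inter-rater agreement between two raters who scored the same items.
# The scores are invented; quadratic weighting suits ordinal rubric scales
# because it penalizes large disagreements more than near-misses.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 2, 3, 5, 1, 3, 4, 2]
rater_b = [5, 4, 3, 2, 3, 4, 1, 3, 4, 1]

kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # values above ~0.6 are commonly read as acceptable
```

If agreement stays low even after raters are calibrated, the problem is usually the rubric rather than the raters.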
The standard rubric failures
1. Vague criteria
"Score helpfulness from 1–5." Without anchors, raters interpret differently. Score variance is high; signal is weak.
Fix: Anchor each score level with a specific example. "5 = answers the user's question completely with no errors. 4 = answers but with minor inaccuracies. 3 = partially answers. 2 = mostly fails to answer. 1 = no useful information."
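Anchors are easier to keep consistent when they live in one place as data rather than being restated in prose. A minimal sketch of the scale above; the dictionary and function names are illustrative:

```python
# Anchored 5-level helpfulness scale, mirroring the anchors described above.
HELPFULNESS_ANCHORS = {
    5: "Answers the user's question completely with no errors.",
    4: "Answers the question, but with minor inaccuracies.",
    3: "Partially answers the question.",
    2: "Mostly fails to answer the question.",
    1: "Provides no useful information.",
}

def anchor_for(score: int) -> str:
    """Return the description a rater should compare the output against."""
    return HELPFULNESS_ANCHORS[score]
```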
2. Over-fine-grained scoring
1–10 scales sound precise but raters can't reliably distinguish 7 from 8. The actual signal lives in 4–5 buckets.
Fix: Use 4–5 levels, each clearly distinguished.
3. Mixing dimensions in one score
"Rate the response on overall quality." If a response is accurate but rude, what's the score? Different raters weight differently. High variance.
Fix: Score dimensions separately. Accuracy, helpfulness, tone, format. Compute composite if needed.
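If a composite is needed, make the weighting explicit rather than leaving it to each rater's judgment. A minimal sketch; the dimension names come from the fix above, and the weights are invented:

```python
# Combine separately scored dimensions into one composite with explicit weights.
# The weights are illustrative, not prescriptive; every dimension uses the same 1-5 scale.
WEIGHTS = {"accuracy": 0.4, "helpfulness": 0.3, "tone": 0.2, "format": 0.1}

def composite(scores: dict[str, int]) -> float:
    """Weighted average of per-dimension scores."""
    assert set(scores) == set(WEIGHTS), "score every dimension exactly once"
    return sum(WEIGHTS[dim] * score for dim, score in scores.items())

print(composite({"accuracy": 5, "helpfulness": 4, "tone": 2, "format": 5}))  # 4.1
```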
4. Missing edge cases
The rubric was tested on typical examples but not on edge cases. When edge cases appear in real evaluation, raters disagree.
Fix: Include edge-case examples in the rubric documentation and show specifically how to score them.
5. Implicit knowledge
The rubric assumes raters have specific domain knowledge. New raters lack it, and calibration scores plummet.
Fix: Document the required domain knowledge explicitly. Either link to references or specify that the rater pool must already have that knowledge.
The good rubric structure
A well-designed rubric has these sections:
- Scoring overview: What you're measuring, why, in 2–3 sentences.
- Score levels with anchors: 4–5 levels, each with a specific example.
- Edge case guidance: 5–10 worked examples of tricky cases.
- Required knowledge: What raters need to know.
- Out-of-scope: What this rubric does NOT measure (so raters don't penalize for unrelated factors).
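One way to keep those sections from drifting apart is to store the whole rubric as a single structured object that documentation and tooling both read from. A minimal sketch; the field contents are invented:

```python
# A structural sketch of a rubric document; the example contents are invented.
from dataclasses import dataclass, field

@dataclass
class Rubric:
    overview: str                        # what is measured and why, 2-3 sentences
    anchors: dict[int, str]              # score level -> anchored description
    edge_cases: list[str] = field(default_factory=list)        # worked tricky examples
    required_knowledge: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)       # what NOT to penalize

helpfulness = Rubric(
    overview="Measures whether the response resolves the user's request.",
    anchors={5: "Complete, no errors", 4: "Minor inaccuracies", 3: "Partial answer",
             2: "Mostly fails to answer", 1: "No useful information"},
    edge_cases=["Complete answer to a misread question: score 2 and note the misread."],
    required_knowledge=["Basic familiarity with the product's domain."],
    out_of_scope=["Response length, unless it prevents the answer from being usable."],
)
```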
Calibration design
Good rubrics come with calibration sets — 5–10 reference examples with consensus scores. New raters score these, get feedback, and align before working on real tasks.
The calibration set should include:
- Examples at each score level.
- Edge cases the rubric specifically calls out.
- Common confusions (similar items at different score levels).
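A minimal sketch of the calibration gate, assuming consensus scores already exist for the set; the tolerance and pass bar are invented thresholds, not platform policy:

```python
# Compare a new rater's calibration scores against consensus before assigning real tasks.
# Consensus scores, the off-by-one tolerance, and the 80% pass bar are all illustrative.
consensus = {"task_01": 5, "task_02": 3, "task_03": 1, "task_04": 4, "task_05": 2}

def passes_calibration(rater_scores: dict[str, int], max_off_by: int = 1,
                       required_rate: float = 0.8) -> bool:
    """True if the rater lands within max_off_by of consensus on enough items."""
    hits = sum(abs(rater_scores[task] - score) <= max_off_by
               for task, score in consensus.items())
    return hits / len(consensus) >= required_rate

print(passes_calibration({"task_01": 5, "task_02": 4, "task_03": 1,
                          "task_04": 4, "task_05": 4}))  # True: 4 of 5 within one point
```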
How to provide rubric feedback
If you spot rubric failures during your work, document them:
- Specific task ID or example where the rubric was unclear.
- Which dimension or level was ambiguous.
- What you scored and why.
- What you'd recommend changing.
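Structured feedback is easier to act on than a free-text complaint. A minimal sketch of one feedback record covering those four points; the task ID and wording are placeholders:

```python
# One rubric-feedback entry covering the four points above; all values are placeholders.
feedback = {
    "task_id": "example-task-id",
    "ambiguity": "Level 3 vs 4 on helpfulness when the answer is complete but outdated.",
    "score_given": 3,
    "reasoning": "Treated outdated-but-complete as 'partially answers'.",
    "recommendation": "Add an anchor example covering outdated-but-complete answers.",
}
```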
Submit it through the platform's rubric feedback channel (most platforms have one, though it may not be prominently advertised). Senior raters who consistently provide thoughtful rubric feedback often get invited into rubric design or program manager roles.
Bottom line
Good rubrics have clear criteria, 4–5 score levels with anchors, separated dimensions, edge-case guidance, and calibration sets. Bad rubrics are vague, over-fine-grained, mix dimensions, miss edge cases, or assume implicit knowledge. Senior evaluators who learn to spot rubric failures and provide structured feedback build a path toward AI evaluation research roles.