Senior AI training evaluators who can produce structured documentation alongside their tasks have a meaningful career advantage. Here's what to write and why it matters.
Why documentation matters
Three reasons:
- Visible skills for career transitions. Public-facing documentation demonstrates evaluation thinking that contractor-only resumes don't show.
- Internal influence. Platforms surface evaluators who write thoughtfully — for rubric design roles, program manager promotions, or direct frontier-lab connections.
- Personal compounding. Documenting your own work patterns surfaces issues you'd otherwise miss.
What to write
1. Personal failure-mode log
Maintain a private document tracking every interesting model failure you observe. Format: task type, what the model did wrong, what should have happened, why it failed. One possible way to structure this is sketched below.
After 100 entries, you have a personal taxonomy of model failure patterns. After 500 entries, that taxonomy is publishable insight.
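As an illustration only, here is a minimal sketch of such a log as an append-only JSONL file. The field names, file path, and helper functions are hypothetical choices, not a prescribed format.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from datetime import date
from pathlib import Path


@dataclass
class FailureEntry:
    """One observed model failure; field names are illustrative, not prescribed."""
    task_type: str   # e.g. "math reasoning", "code review", "summarization"
    observed: str    # what the model did wrong
    expected: str    # what should have happened
    why: str         # your best hypothesis for why it failed
    logged_on: str = field(default_factory=lambda: date.today().isoformat())


LOG_PATH = Path("failure_log.jsonl")  # one JSON object per line; easy to grep and tally


def log_failure(entry: FailureEntry) -> None:
    """Append a single entry to the private log."""
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry)) + "\n")


def tally_by_task_type() -> Counter:
    """First pass at a personal taxonomy: which task types fail most often."""
    counts: Counter = Counter()
    if LOG_PATH.exists():
        with LOG_PATH.open(encoding="utf-8") as f:
            for line in f:
                counts[json.loads(line)["task_type"]] += 1
    return counts


if __name__ == "__main__":
    log_failure(FailureEntry(
        task_type="math reasoning",
        observed="confident final answer built on a wrong intermediate sum",
        expected="correct arithmetic at each step, or an explicit uncertainty flag",
        why="rubric scored only the final answer, so mid-chain errors went unpenalized",
    ))
    print(tally_by_task_type())
```

The tally helper is the seed of the taxonomy described above: once entries accumulate, grouping them by task type (or by failure mechanism) shows which patterns recur.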
2. Rubric improvement notes
When you encounter a rubric that's unclear or fails to discriminate well, document the specific failure. Most platforms have rubric feedback channels but few contractors use them. Senior raters who consistently provide thoughtful rubric feedback get noticed.
3. Public AI evaluation writing
If you can write publicly without violating NDAs (most platforms have specific guidelines):
- General-purpose AI evaluation methodology essays.
- Failure mode taxonomy posts.
- Rubric design critiques (anonymized).
- "Why this benchmark is misleading" pieces.
The format that works
Strong AI evaluation documentation:
- Concrete examples. Specific failures, not generic patterns.
- Mechanism-level explanation. Why the failure happened, not just that it did.
- Implications. What this means for evaluation methodology or model improvement.
- Tight prose. 600–1500 words is usually optimal; anything longer reads as padding.
Where to publish
- LessWrong / Alignment Forum: AI safety community engagement.
- Personal blog or Substack: Searchable, demonstrates ownership.
- Twitter/X threads: Lower commitment, broader reach.
- Hugging Face papers / blog: ML community engagement.
NDA navigation
Most AI training platforms have NDAs that restrict:
- Specific task content from the platform.
- Identifying information about clients.
- Internal scoring details.
- Specific model behavior on specific tasks.
You can typically write about: general methodology, failure-mode patterns at the category level, rubric design principles, and your own evaluation approach. When in doubt, ask the platform what you're allowed to publish.
What writing produces
Long-term, contractors who maintain consistent public AI evaluation writing accumulate:
- Direct outreach from AI labs and consultancies.
- Speaking opportunities at AI safety / evaluation events.
- Peer network of AI evaluators and researchers.
- Job offers in AI evaluation roles without traditional applications.
Bottom line
Senior AI training evaluators who maintain personal documentation (failure-mode logs, rubric notes) and write publicly accumulate career capital that pure contracting work doesn't generate. The skill is straightforward: track patterns, write tight prose, publish consistently. Returns compound over 12–24 months.