AI training contracting in 2030 will look different from 2026. Some roles will scale up; others will shrink as automation eats them. Here's the realistic outlook based on what we're already seeing.
Roles likely to grow
Agent and orchestration evaluation
The fastest-growing category. Agents get more capable; evaluation gets harder, not easier. By 2030 we expect agent evaluation to be the largest single specialty in AI training.
Domain-specialty evaluation
Medical, legal, scientific, and quantitative finance evaluation will continue to grow. Frontier labs invest more in safety-critical domains; the qualified evaluator pool grows slowly.
Long-horizon reasoning evaluation
Models keep improving at long context, but performance still falls off a cliff at very long horizons. Evaluating multi-day agent runs and complex business workflows requires sustained human judgment.
Constitutional / safety evaluation
Safety remains undersolved. Edge-case evaluation, refusal calibration, and helpfulness vs harmlessness trade-offs require careful human judgment indefinitely.
Roles likely to shrink
Generic English RLHF
The pool is saturated, rates are flat, and automation tools (constitutional AI, model judges) handle more of the work. Entry-tier generalist English RLHF will likely shrink as a category.
Basic image annotation
Computer vision is fast — automated annotation handles much of the volume. Specialty (medical imaging, lidar) will hold; basic bounding boxes will shrink.
Multiple-choice evaluation
Easily automated. The work that requires explicit reasoning will remain; pure multiple-choice work will gradually decline.
How AI changes AI training
The recursion: AI itself is increasingly used to generate, evaluate, and grade AI training data. This doesn't eliminate human evaluators — but it changes the shape of the work.
Model-judge calibration
Frontier labs use AI judges (constitutional AI) for first-pass evaluation. Humans then evaluate the judges' output. The work shifts from primary evaluation to meta-evaluation.
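One concrete way to do this meta-evaluation is to measure how closely a judge's verdicts track human labels on a shared sample, corrected for chance agreement. A minimal sketch (the labels and the pass/fail scheme are hypothetical, not any lab's actual pipeline):

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Agreement between an AI judge and human raters, corrected for chance."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of each label's marginal frequencies.
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[k] * hc[k] for k in jc.keys() & hc.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from one judge and one human on the same 8 samples.
judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(judge, human), 3))  # → 0.467
```

A kappa near 1.0 means the judge can be trusted for first-pass triage; a low kappa means humans need to stay in the primary loop for that task type.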
Synthetic data + human verification
Models generate training data that humans verify. This is faster than humans generating from scratch but still requires the human verification step. The hours per task drop; the rate per hour holds or rises.
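The reason hours per task drop is that the human step becomes selective: cheap automated checks filter the generated data first, and humans review only the failures plus a spot-check sample of the passes. A minimal sketch of that routing, with a made-up check function:

```python
import random

def route_for_review(samples, auto_check, spot_rate=0.1, seed=0):
    """Split generated samples: failures of a cheap automated check always go
    to human review; a random fraction of passes is spot-checked too."""
    rng = random.Random(seed)
    to_human, auto_accepted = [], []
    for s in samples:
        if not auto_check(s) or rng.random() < spot_rate:
            to_human.append(s)
        else:
            auto_accepted.append(s)
    return to_human, auto_accepted

# Hypothetical check: a synthetic sample must be non-empty and under 500 chars.
check = lambda s: 0 < len(s) < 500
samples = ["ok answer", "", "x" * 600, "another fine answer"]
human, accepted = route_for_review(samples, check, spot_rate=0.0)
print(len(human), len(accepted))  # the two malformed samples go to humans
```

The spot-check rate is the lever: verification hours scale with `spot_rate` plus the automated failure rate, not with total volume.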
Agent-on-agent evaluation
Specialized evaluator agents grade other agents' work, and humans evaluate the evaluator agents. By 2028 this stack will likely run two to three levels deep.
Skills that hold their value
- Domain depth. Medical, legal, scientific specialties remain valuable indefinitely.
- Calibrated judgment. The skill of making correct calls under structured criteria translates across role types.
- Hallucination detection and verification. Models won't reliably self-correct; humans remain the ground-truth check.
- Edge-case generation. Creating cases models will fail on requires creative human thinking.
- Rubric design. Defining what good looks like remains human work.
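To make "rubric design" concrete: a rubric is essentially a set of weighted criteria plus a scoring rule, and designing one means choosing the criteria, the weights, and the wording that other evaluators can apply consistently. A minimal sketch (criteria names and weights are illustrative, not a real lab's rubric):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; weights should sum to 1.0
    description: str

# Hypothetical rubric for grading a model's answer.
RUBRIC = [
    Criterion("accuracy", 0.5, "Claims are factually correct and verifiable."),
    Criterion("completeness", 0.3, "All parts of the prompt are addressed."),
    Criterion("clarity", 0.2, "The answer is well organized and readable."),
]

def score(ratings: dict[str, float]) -> float:
    """Weighted overall score from per-criterion ratings on a 0-1 scale."""
    return sum(c.weight * ratings[c.name] for c in RUBRIC)

print(round(score({"accuracy": 1.0, "completeness": 0.5, "clarity": 1.0}), 2))
```

The hard part is not the arithmetic but the descriptions: they have to be specific enough that two evaluators rating the same answer land on the same numbers, which is exactly the judgment skill the bullets above describe.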
Skills likely to commodify
- Generic rubric application (simple cases).
- Basic code review (syntax-level).
- Standard translation evaluation for high-resource languages.
- Multiple-choice eval grading.
The career arc most likely to compound
Senior contractor → specialty depth (one specific domain) → contributing to rubric design → AI evaluation researcher or AI safety researcher → senior research role.
Each step builds on prior work. The specialty + rubric experience translates directly into research roles. Many AI evaluation researchers in 2030 will have started as senior contractors in 2025–2026.
Bottom line
AI training contracting will still exist in 2030, but its shape will shift. Specialty work and agent evaluation will grow; generalist work will commodify. The career arc that compounds is specialty depth → rubric design → research role. Position for the future by investing in one domain or one specialty now.