Listings for AI training contracts read like every other tech job ad: "strong programming fundamentals," "attention to detail," "communication skills." Useless. Here's what platforms actually screen for, based on what gets candidates rejected and what doesn't.
1. Reading carefully — really carefully
By far the most common reason for early rejection across every platform: the candidate missed a stated constraint in the prompt.
"Limit changes to auth.py" — they touch four files. "Don't use external libraries" — they import requests. "Return JSON, not a Python dict" — they hand back a dict.
This isn't tested directly. It's tested indirectly — every coding sample has 1–3 constraints buried in it, and missing any of them is an automatic fail. Read the prompt twice. Then read it a third time after you finish.
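To make that concrete, here's a minimal sketch of the "return JSON, not a Python dict" constraint above. The task and function name are invented for illustration; the point is how similar the wrong answer looks to the right one.

```python
import json

def summarize_user(user: dict) -> str:
    # Constraint: "Return JSON, not a Python dict."
    result = {"name": user["name"], "active": bool(user.get("active"))}
    return json.dumps(result)  # not `return result`, which hands back a dict

print(summarize_user({"name": "Ada", "active": 1}))
# {"name": "Ada", "active": true}   <- double quotes, lowercase true: JSON
```

A printed dict and its JSON serialization look nearly identical, which is exactly why this constraint catches people.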
2. Calibrated judgment, not maximalism
RLHF (reinforcement learning from human feedback) and evaluation tasks ask "which answer is better?" Most candidates answer based on which is longer or more impressive-sounding. Trained raters answer based on which is actually correct and useful for the task.
Calibration means: when the prompt asks for a 1-sentence answer, you prefer the 1-sentence answer over the 5-paragraph answer, even if the 5-paragraph answer "sounds smarter." This is harder than it sounds. Practice it deliberately.
3. Concrete examples beat abstract claims
"I'm familiar with Python" → useless. "I've shipped a Django backend serving ~5M requests/day" → strong signal. The screener — human or AI — assigns weight to specifics.
Comb through your resume and replace every adjective ("strong," "deep," "extensive") with a number ("4 years," "2.3M users," "30% latency reduction"). If you don't have a number, the experience probably isn't strong enough to claim.
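If you want to automate the first pass, a few lines of Python can flag resume lines that lean on an adjective instead of a number. The word list here is a starting point, not a canonical set:

```python
import re

VAGUE = ["strong", "deep", "extensive", "excellent", "robust", "proven"]

def flag_vague_lines(resume_text: str) -> None:
    # Flag lines that use a vague adjective and contain no digits.
    pattern = re.compile(r"\b(" + "|".join(VAGUE) + r")\b", re.IGNORECASE)
    for n, line in enumerate(resume_text.splitlines(), 1):
        if pattern.search(line) and not re.search(r"\d", line):
            print(f"line {n}: {line.strip()}")

flag_vague_lines("Strong Python skills\nCut p95 latency 30% across 4 services")
# line 1: Strong Python skills
```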
4. Honest failure analysis
Every AI interview asks some version of "tell me about a project that failed" or "describe a mistake you made." Most candidates fail this question by:
- Picking a fake failure ("I'm too detail-oriented sometimes" — useless).
- Blaming external factors ("the team didn't follow my design").
- Skipping the lesson ("we just moved on to the next project").
What scores well: a real failure, with you as the cause, and a specific lesson that changed how you work afterward. The screener is looking for evidence you can think about your own work without ego.
5. Stack-naming over framework-naming
"Python developer" → weak. "Python · FastAPI · Pydantic · pytest · Postgres · Redis · Docker" → strong.
Most platforms use vector embeddings to match your profile against active job listings. Specific framework + tooling names match more listings than generic role descriptions. List the stack like you'd list it on a take-home assignment, not on a recruiter-facing summary.
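You can see the mechanism for yourself with an open-source embedding model. Which model and matching pipeline each platform actually runs isn't public, and the listing text below is invented, so treat this as a sketch of the idea, not a replica of anyone's system:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

listing  = "Contract: backend engineer for API eval tasks. FastAPI, pytest, Postgres."
generic  = "Python developer"
specific = "Python · FastAPI · Pydantic · pytest · Postgres · Redis · Docker"

emb = model.encode([listing, generic, specific])
print("generic :", util.cos_sim(emb[0], emb[1]).item())
print("specific:", util.cos_sim(emb[0], emb[2]).item())
# The stack-list profile typically lands closer to the listing in
# embedding space, so it surfaces in more matches.
```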
6. Long-form coherence
Coding samples are short: 60–90 minutes. Written work samples (especially at Mercor and Surge) run longer, 1–3 hours of writing. The single most common rejection reason on long-form samples: the candidate's argument falls apart after the first 600 words.
Strong written samples have:
- A clear thesis stated in the first paragraph.
- Each subsequent paragraph supporting that thesis with concrete evidence.
- A failure-mode section where you stress-test your own argument.
- A short conclusion that doesn't introduce new ideas.
Most candidates' samples ramble. The screener notices.
7. Specialty depth in one domain
Generalists struggle to tier up on every platform. The contractors who reach top-tier rates have one specialty — Rust, medical reasoning, quantitative finance, multi-step agent eval, niche language pairs.
You don't need 10 years of specialty depth. You need enough depth that you can talk about edge cases nobody else can. That's typically 6–12 months of focused work in the specialty.
If you're a generalist applying broadly, pick the specialty that's closest to your existing skills and lean into it on your profile. "Python developer with strong async + concurrency experience" beats "full-stack engineer."
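For example, if async + concurrency is your angle, specialty depth means being able to explain gotchas like the one below without looking anything up. This is a hypothetical interview-style snippet, not taken from any platform's test:

```python
import asyncio

async def flaky(i: int) -> int:
    if i == 2:
        raise ValueError("boom")
    await asyncio.sleep(0.01)
    return i

async def main() -> None:
    # Default gather: the first exception propagates, but the other
    # coroutines keep running in the background. A classic leak.
    try:
        await asyncio.gather(*(flaky(i) for i in range(4)))
    except ValueError as exc:
        print("gather raised:", exc)

    # return_exceptions=True keeps every result, errors included,
    # so nothing is silently abandoned.
    results = await asyncio.gather(
        *(flaky(i) for i in range(4)), return_exceptions=True
    )
    print(results)  # [0, 1, ValueError('boom'), 3]

asyncio.run(main())
```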
What's not actually tested
A few things candidates worry about that don't matter much:
- School name. Tier-1 universities help marginally on initial screening. Once you're past the profile filter, your work output is what matters.
- Years of experience. Mercor in particular pays based on demonstrated quality, not seniority. A strong 2-year contractor will out-earn a weak 8-year contractor.
- Country. Most platforms pay the same rate globally for the same role. (Mercor and Outlier do; Surge does for senior roles, less consistently for entry.)
- Resume formatting. Plain text, well-structured. Designer-quality formatting buys you nothing.
Bottom line
Read carefully, calibrate your judgment, give specific examples, own your failures, list your stack, write coherent long-form work, and develop one specialty. Those seven things, in our data, separate contractors who get hired from contractors who don't, across every major platform.
The good news: none of these skills require you to be a 10x engineer. They mostly require you to be a careful, honest, specific communicator. Most candidates aren't, which is why these skills are the ones being screened.