Do Coding Assessments Actually Predict Job Performance?
What does predictive validity mean for technical assessments?
Predictive validity measures whether assessment scores correlate with actual job performance outcomes. It is the only type of validity that directly answers the question “does this test predict how well someone will do the job?” Most assessment vendors use the phrase without publishing the statistical evidence required to support it.
Predictive validity is not a feeling. It is a number. Specifically, it is a correlation coefficient between assessment scores and some measure of on-the-job performance, calculated across a sample of people who were assessed and then observed in the role. When an assessment vendor says their platform “predicts job performance,” this is the claim they are making. The question is whether they can show you the data.
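To make that concrete, here is a minimal sketch of the calculation behind a criterion validity claim. The scores and ratings below are invented placeholders, not Codility data; a real study would use matched pre-hire scores and post-hire performance ratings for the same people, across a far larger sample.

```python
# Minimal sketch of a criterion validity calculation.
# assessment_scores and performance_ratings are invented placeholder data;
# a real study would pair each hire's pre-hire score with a later
# performance measure (e.g. manager rating) for a much larger sample.
from scipy.stats import pearsonr

assessment_scores = [62, 78, 85, 54, 91, 73, 66, 88, 70, 59]              # pre-hire test scores
performance_ratings = [3.1, 3.8, 4.2, 2.9, 4.5, 3.6, 3.3, 4.0, 3.5, 3.0]  # post-hire ratings

r, p_value = pearsonr(assessment_scores, performance_ratings)
print(f"validity coefficient r = {r:.2f}, p = {p_value:.3f}")
```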
Three types of validity, and why the distinction matters
Assessment science recognises three types of validity, each answering a different question.
Content validity asks: does this test measure skills that are relevant to the job? This is established through job analysis and expert review. If your coding assessment tests skills your engineers actually use, it has content validity. This is the minimum standard for legal defensibility, and most reputable assessment platforms can demonstrate it.
Construct validity asks: does this test measure the skill it claims to measure? If a test is labelled “Python proficiency,” construct validity means the test actually measures Python proficiency rather than, say, speed-reading comprehension or abstract reasoning under time pressure.
Criterion validity asks: do scores on this test correlate with actual job performance? This requires collecting performance data after hire and running statistical analysis against assessment scores. It is the hardest type of validity to establish, the most expensive to study, and the only type that directly supports the claim “this assessment predicts job success.” Most assessment vendors demonstrate content validity. Some can demonstrate construct validity. Almost none publish criterion validity data for their own platform. When a vendor says their assessments are “validated,” ask which type they mean.
Codility’s approach to assessment validity, including the evidence documented in our technical manual and the customer-specific validation services we offer, is covered in detail below.
What does the research say about whether coding tests predict job performance?
Meta-analytic research shows that work sample tests (the category that includes well-designed coding assessments) correlate with job performance at approximately .33. This is meaningful but moderate. No published peer-reviewed study has validated coding assessments specifically as predictors of software engineer job performance.
The most cited research on selection method validity comes from two landmark meta-analyses, both of which have implications for how you evaluate coding assessments.
The original hierarchy (Schmidt and Hunter, 1998)
Schmidt and Hunter’s 1998 meta-analysis of 85 years of research established a validity hierarchy for selection methods. Work sample tests topped the list with a corrected validity coefficient of .54, followed by general mental ability tests (.51) and structured interviews (.51). This hierarchy became the foundation for the claim that skills-based assessments are the best predictor of job performance.
The revised estimates (Sackett et al., 2022)
In 2022, Sackett, Zhang, Berry and Lievens published a major revision in the Journal of Applied Psychology. Their central finding: prior meta-analyses had systematically overcorrected for range restriction, inflating validity estimates across nearly all selection methods.
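Range restriction corrections matter because the people you hire are a narrower slice of the applicant pool, which shrinks the correlation you can observe. The standard textbook correction for direct range restriction (often called Thorndike's Case II) adjusts the observed value upwards based on an assumed ratio of applicant-pool to hired-sample variability:

$$ r_{\text{corrected}} = \frac{u \, r_{\text{observed}}}{\sqrt{1 + r_{\text{observed}}^{2}\,(u^{2} - 1)}}, \qquad u = \frac{SD_{\text{applicant pool}}}{SD_{\text{hired sample}}} $$

The larger the assumed value of u, the bigger the upward correction. Sackett and colleagues argued that earlier meta-analyses assumed values of u that were too large, which is the main reason the revised coefficients are lower.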
The revised hierarchy looks different. Structured interviews now sit at the top (.42). Work sample tests drop to .33 (incorporating the revised estimate from Roth, Bobko and McFarland, 2005). General mental ability drops to .31. A validity coefficient of .33 is still meaningful (work sample tests explain roughly 11% of the variance in job performance), but it is considerably more modest than the .54 figure that still circulates in marketing materials across the industry.
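The "variance explained" figure is simply the square of the correlation coefficient, which makes the practical size of the revision easy to see:

$$ 0.33^{2} \approx 0.11 \;(\text{roughly } 11\% \text{ of variance}) \qquad \text{vs.} \qquad 0.54^{2} \approx 0.29 \;(\text{roughly } 29\%) $$

Under the older estimate, work sample tests appeared to explain nearly three times as much of the variance in job performance as the revised figure suggests.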
What this means for coding assessments
Whether a coding assessment qualifies as a “work sample test” in the I/O psychology sense depends on how closely it resembles actual job tasks. An assessment that asks candidates to build features in a realistic IDE, work with existing codebases, debug real issues, and use the tools they would use on the job has a strong claim to work sample classification. An assessment that asks candidates to solve algorithm puzzles from memory, without documentation or tooling, is closer to a cognitive ability test.
This distinction matters because the validity coefficients are different. If your coding assessment is genuinely a work sample, the .33 coefficient applies. If it is closer to an algorithm quiz, you are relying on cognitive ability validity (.31) with additional noise from the artificial testing environment.
No published peer-reviewed study has tested this directly for coding assessments. The evidence is extrapolated from broader work sample research, not demonstrated for software engineering roles specifically. This is an industry-wide gap, not a weakness unique to any single vendor.
Why has no assessment vendor published predictive validity data?
Publishing criterion validity data requires collecting post-hire job performance ratings, matching them to assessment scores across a statistically meaningful sample, and obtaining customer permission to share the results. Most assessment vendors sell one-off tests to hiring teams. They never see what happens after the hire, so they cannot measure the outcome.
This is the gap at the centre of the technical assessment market. Every vendor talks about predicting job performance. No vendor publishes the evidence.
The reasons are practical, not conspiratorial.
The data access problem
Criterion validity requires two data points for each individual: an assessment score and a subsequent measure of job performance. Assessment vendors have the first. They almost never have the second. Job performance data lives inside your organisation, in performance reviews, manager ratings, promotion decisions, and project outcomes. It does not flow back to the assessment vendor unless you share it.
The sample size problem
A meaningful criterion validity study requires a sample large enough to detect a real correlation. For the effect sizes typical in selection research (.20 to .40), you need at minimum 50 to 100 individuals who were assessed, hired, and then rated on job performance. For a single customer hiring for a specific role, this can take years to accumulate.
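As a rough illustration of where figures like that come from, the standard sample-size calculation for detecting a correlation (via the Fisher z approximation) can be sketched in a few lines of Python. The effect sizes, alpha, and power target below are assumptions chosen to match the range quoted above.

```python
# Rough power-analysis sketch: minimum sample size needed to detect a
# true correlation r with 80% power at alpha = .05 (two-sided), using the
# standard Fisher z approximation. Values are illustrative assumptions.
import math
from scipy.stats import norm

def min_sample_size(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)              # critical value, two-sided test
    z_beta = norm.ppf(power)                       # value for desired power
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher z transform of r
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)

for r in (0.20, 0.30, 0.40):
    print(f"r = {r:.2f}: need about {min_sample_size(r)} assessed-and-rated hires")
```

At the lower end of that effect-size range the requirement climbs well beyond 100 assessed-and-rated hires, which is why a single customer hiring for one role can take years to reach a usable sample.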
The permission problem
Even when an assessment vendor conducts criterion validation for an individual customer (some competitors offer this as a professional service), the results belong to the customer. Publishing aggregated findings requires explicit permission, and most organisations treat hiring data as confidential.
The incentive problem
If your marketing already claims predictive validity and buyers are not asking for the evidence, the incentive to invest in expensive validation studies is low. Publishing data also carries risk: a study might show modest correlations, which is harder to market than unquantified claims of prediction.
None of this means coding assessments do not work. It means the industry has a gap between what is claimed and what is demonstrated. Understanding this gap makes you a more informed buyer.
How does Codility design assessments to predict engineering performance?
Codility assessments are designed by an assessment science team that includes an I/O psychologist, using established assessment science methodology. Tasks are built around job-relevant engineering skills, administered in a production-grade VS Code environment, and scored using consistent, auditable criteria. The design follows the same principles that underpin the work sample validity research.
Codility’s approach to assessment design is grounded in three principles drawn from I/O psychology and the work sample validity literature.
Job-relevant task design
Codility’s assessment tasks are designed to reflect the skills and activities engineers perform in their actual roles. This is content validity in practice: the connection between what the test measures and what the job requires is established through structured analysis, not guesswork.
Tasks are not algorithm puzzles or trivia questions. They are engineering problems that require candidates to read existing code, build on it, debug issues, and make decisions about approach: the same activities your team performs daily.
A realistic engineering environment
Codility’s VS Code assessment environment gives candidates access to the same tooling they would use on the job: a full IDE with syntax highlighting, terminal access, debugging tools, file trees, and package managers. This is not cosmetic. Environmental fidelity is a core component of work sample validity.
Research consistently shows that the closer an assessment environment resembles the actual work environment, the better it predicts performance. An assessment completed in a stripped-down text box with no tooling tells you how someone codes under artificial constraints. An assessment completed in a production-grade IDE tells you how they actually work.
Consistent, auditable scoring
Assessment scoring follows defined rubrics applied consistently across all candidates. Automated code analysis evaluates solutions against predetermined criteria: correctness, efficiency, code quality, and test coverage. The scoring is transparent, reproducible, and auditable.
This matters for two reasons. First, consistent scoring is a prerequisite for any validity claim. If the same performance produces different scores depending on who reviews it, you cannot establish a correlation with anything. Second, auditable scoring provides the documentation required for legal defensibility under EEOC guidelines, EU non-discrimination directives, and the UK Equality Act.
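As a deliberately simplified, hypothetical sketch of what rubric-based scoring looks like in principle: the criteria names echo those above, but the weights, sub-scores, and function are invented for illustration and do not describe Codility's actual scoring engine.

```python
# Hypothetical rubric-based scoring sketch. Weights and sub-scores are
# invented for illustration; they do not reflect Codility's real rubric.
RUBRIC_WEIGHTS = {
    "correctness": 0.40,
    "efficiency": 0.25,
    "code_quality": 0.20,
    "test_coverage": 0.15,
}

def score_submission(sub_scores: dict[str, float]) -> float:
    """Combine per-criterion sub-scores (0..1) into a weighted total (0..100).

    Applying the same rubric to every candidate is what makes scores
    comparable, reproducible, and auditable.
    """
    return round(100 * sum(RUBRIC_WEIGHTS[c] * sub_scores[c] for c in RUBRIC_WEIGHTS), 1)

print(score_submission({"correctness": 0.9, "efficiency": 0.7,
                        "code_quality": 0.8, "test_coverage": 0.6}))  # -> 78.5
```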
Assessment science expertise
Codility’s assessment science team includes Tony Mellek, Ph.D., an industrial-organisational psychologist specialising in assessment design and validation, and James Meaden, an assessment scientist whose work spans assessment methodology and AI code quality research (including the COMPASS benchmark for evaluating AI-generated code). Assessment development follows I/O psychology methodology, including structured content review, difficulty calibration, and bias analysis.
This is not a single consultant brought in for a review. Assessment science is embedded across both the product and go-to-market organisations at Codility, with dedicated practitioners working on task design, scoring methodology, and validation processes. James Meaden leads Codility’s assessment research and development team and is a recognised leader in the field of talent assessment. With over a decade of experience applying machine learning and AI to psychometric assessment at companies including Accenture and Revolut, his focus is ensuring that Codility’s assessments are valid, fair, and predictive of real-world performance.
Codility applies a structured, evidence-based assessment development and review methodology. This includes item-level analysis during task development and pilot testing, where subject matter experts and assessment scientists evaluate task clarity, relevance, difficulty, and potential sources of bias. Difficulty calibration is performed through SME review, pilot performance data, and iterative refinement.
Fairness and bias reviews are integrated into task development and ongoing quality assurance processes, including linguistic accessibility reviews and periodic content audits. At the assessment level, Codility conducts and supports EEOC-aligned adverse impact analyses using customer-specific data, with monitoring frequency determined by data volume and customer needs (e.g., annually, quarterly, or ongoing through Professional Services).
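For readers unfamiliar with what an adverse impact analysis involves in practice, the EEOC's four-fifths rule of thumb compares each group's pass rate to the highest group's pass rate. The sketch below uses invented group labels and counts purely for illustration.

```python
# Sketch of the EEOC four-fifths (80%) rule of thumb for adverse impact.
# Group names and counts are invented placeholders for illustration only.
def selection_rate(passed: int, assessed: int) -> float:
    return passed / assessed

groups = {
    "group_a": selection_rate(120, 200),   # 60% pass rate
    "group_b": selection_rate(45, 100),    # 45% pass rate
}

highest = max(groups.values())
for name, rate in groups.items():
    impact_ratio = rate / highest
    flag = "review" if impact_ratio < 0.8 else "ok"
    print(f"{name}: pass rate {rate:.0%}, impact ratio {impact_ratio:.2f} ({flag})")
```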
How is Codility working to demonstrate assessment outcomes?
Codility is building the infrastructure to connect assessment scores to post-hire engineering performance. This requires partnership with customers over time, not a one-off, test-and-forget model. Outcome tracking is the direction of travel for the platform, not a claim about where it is today.
The gap between assessment scores and job performance data exists because most assessment relationships end at the hiring decision. Codility’s approach is different.
Assessment as a programme, not a transaction
Codility works with engineering organisations to build assessment programmes that evolve over time. This is not a philosophical distinction. It is the structural prerequisite for outcome tracking. When you use the same platform for screening, interviewing, and skills development (through Skills Intelligence), you create a continuous data set that connects pre-hire assessment to post-hire performance.
What outcome tracking requires
Connecting assessment scores to job outcomes means building four things: a consistent assessment methodology applied across enough candidates to generate meaningful data; a mechanism for capturing post-hire performance signals (retention, promotion, performance review data, project contribution metrics); a customer partnership model where this data is shared and analysed collaboratively; and a statistical framework for analysing the correlation whilst controlling for confounding variables.
This is not simple. It takes time, trust, and a sample size that only comes from long-term customer relationships. It is also the only honest path to demonstrating criterion validity.
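As a minimal sketch of that last ingredient, the statistical framework: a regression of a post-hire performance measure on assessment score while holding a confounder constant. Column names and data are invented for illustration; a real study would also need to address criterion reliability, range restriction, and missing data.

```python
# Minimal sketch of a criterion validity analysis that controls for a
# confounder. All column names and data are invented for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "assessment_score": [62, 78, 85, 54, 91, 73, 66, 88, 70, 59],
    "years_experience": [2, 5, 7, 1, 8, 4, 3, 6, 4, 2],             # example confounder
    "perf_rating": [3.1, 3.8, 4.2, 2.9, 4.5, 3.6, 3.3, 4.0, 3.5, 3.0],
})

# OLS: does assessment score still predict performance once experience is held constant?
model = smf.ols("perf_rating ~ assessment_score + years_experience", data=df).fit()
print(model.summary().tables[1])  # coefficient table
```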
Some vendors take shortcuts, syncing data from applicant tracking systems or HRIS platforms without making the methodology transparent to the customer. This may generate numbers, but it does not generate defensible validity evidence. Criterion validation requires a deliberate research design with defined performance criteria, controlled timing, and statistical rigour. Quietly pulling post-hire data and correlating it with assessment scores is not the same thing.
Why Codility is positioned to close this gap
Most assessment vendors sell to talent acquisition teams for one-off hiring needs. They never see what happens after the offer letter. Codility’s expanding focus on Skills Intelligence (continuous assessment for existing engineering teams) creates a fundamentally different data relationship. When your assessment platform is used for both hiring and development, the assessment-to-outcome pipeline exists by design.
This is where the platform is heading. Codility is not claiming to have solved outcome tracking today. It is building the infrastructure to solve it, and being transparent about the journey.
What should you ask any assessment vendor about predictive validity?
Ask for published criterion validity data: correlation coefficients, sample sizes, and the populations studied. Ask which validity types they can demonstrate with documentation. Ask whether their “validation” refers to content validity (test relevance) or criterion validity (performance prediction). The specificity of the answer tells you everything.
If an assessment vendor claims their platform predicts job performance, these are the questions that separate evidence from assertion.
The evidence checklist
Published validity data
Ask for correlation coefficients between assessment scores and job performance measures. Ask for the sample size, the performance criterion used (manager ratings, promotion data, retention), and the time period studied. If the answer is “we conduct validation studies for individual customers under NDA,” ask why no aggregated or anonymised results have been published.
Which validity type
Ask specifically whether they can demonstrate criterion validity (scores predict performance) or only content validity (test measures job-relevant skills). Both have value. Only one supports a predictive claim. If the answer blurs the two, the vendor is relying on content validity whilst implying criterion validity.
Academic citations
Ask which research supports their predictive claims. If they cite Schmidt and Hunter (1998), ask whether they are aware of the Sackett et al. (2022) revisions that reduced validity estimates across most selection methods. If they cite no academic research at all, the claims are marketing assertions.
I/O psychology credentials
Ask who designed the assessments and what their qualifications are. Assessment science is a real discipline with trained practitioners. Assessments designed by I/O psychologists using established methodology have a fundamentally different evidential basis than assessments assembled by product teams or engineering managers.
Adverse impact analysis
Ask whether they monitor pass rates across demographic groups and how they handle disparate impact. This is not directly about predictive validity, but it is a strong signal of assessment science rigour. Vendors who conduct adverse impact analysis typically also invest in broader validation work.
Outcome data from case studies
Ask whether any published case study includes post-hire performance data. If every case study reports only process metrics (time saved, candidates screened, hires made), the vendor has never measured the outcome they claim to predict.
Frequently asked questions
Is Codility scientifically validated?
Yes. Codility assessments are scientifically validated by I/O psychologists and grounded in established assessment science. Validation evidence is documented in Codility’s Technical Manual and aligns with the Standards for Educational and Psychological Testing. This includes a clear theoretical rationale, systematic job and skills analysis, SME-developed high-fidelity work-sample tasks, psychometric reliability evidence, criterion-related validity studies linking assessment performance to job outcomes, and fairness and adverse impact analyses aligned with EEOC guidance.