
How we hold ourselves to the SIOP standard for AI-based assessment

Tony Mellek, June 5, 2026

We recently attended the 41st annual SIOP conference. AI dominated almost every session we attended, from how it is changing assessment design to how the field is starting to measure AI as a job-relevant skill in its own right. What we found ourselves thinking about on the way home was the gap between how confidently AI is being marketed across our industry and how few of those tools could actually answer the questions SIOP itself asks vendors to be able to answer.

This piece sets out what those questions are, how Codility scores itself against them, and where we think the field is heading next.

The SIOP standard for AI-based assessment

In 2023, SIOP published its principles for the validation and use of AI-based assessments in employee selection. Those principles remain the reference point in 2026, and they were extended in April this year by a joint SIOP Foundation and CHRO Association guide for HR practitioners evaluating AI vendors.

Together they set out five things any AI-based assessment vendor should be able to demonstrate.

1. Validity. Scores should accurately predict future job performance or another relevant outcome. The 2026 guide adds an important point: the validation sample should reflect actual job contexts, not convenience samples drawn from paid survey panels with no link to real hiring.

2. Reliability. Scores should be consistent on retest and should measure job-related characteristics rather than artefacts of the model.

3. Fairness. Assessments should produce fair and unbiased scores across subgroups, with predictive bias and measurement bias identified and mitigated.

4. Appropriate use. Tools should be used as designed, with operational considerations in mind, and not stretched beyond the use case they were validated for.

5. Documentation for audit. Decision-making related to AI-driven assessments should be documented in enough detail to support verification and external auditing. In practice this means a technical manual covering design, development, deployment, and validation.

The 2026 guide adds emphasis on two things: ongoing evaluation rather than a one-time review, and the need to look past vendor claims and peer benchmarking to evidence drawn from the tool’s actual use.

How Codility scores itself against the standard

Here is where we stand against each of the five.

On validity. Codility assessments are validated against on-the-job performance for engineering roles, with task-level evidence built from real work product. Our scoring is anchored to what the candidate actually produces, not to a model’s inference about them. We do not use AI to score candidates. Scoring is deterministic and rule-based, which is what makes it defensible.

On reliability. Codility scoring is reproducible. The same submission produces the same score. This is the bar we hold ourselves to, and it is one of the areas where our assessment science and engineering teams spend the most time.

On fairness. We run adverse impact analysis for clients on request, with subgroup performance reviewed against the relevant comparison group. Fairness is not a launch step. It is an ongoing process, and the analysis is repeated as candidate populations and item pools change.
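As an illustration only, the comparison step in an adverse impact analysis can be sketched in a few lines. The example below computes each subgroup's selection rate relative to a comparison group and flags ratios under 0.8, the four-fifths screening heuristic from the US Uniform Guidelines on Employee Selection Procedures. The class, function names, and counts are hypothetical and are not Codility's implementation. A real analysis would add significance testing and attention to sample size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupOutcome:
    """Pass counts for one subgroup (hypothetical structure)."""
    name: str
    passed: int
    total: int

    @property
    def selection_rate(self) -> float:
        if self.total <= 0:
            raise ValueError(f"group {self.name!r} has no candidates")
        return self.passed / self.total


def impact_ratios(groups, comparison_name):
    """Each group's selection rate divided by the comparison group's rate."""
    by_name = {g.name: g for g in groups}
    base = by_name[comparison_name].selection_rate
    if base == 0:
        raise ValueError("comparison group has a zero selection rate")
    return {g.name: g.selection_rate / base for g in groups}


def flag_below_threshold(ratios, threshold=0.8):
    """Names of groups whose impact ratio falls below the threshold."""
    return sorted(name for name, ratio in ratios.items() if ratio < threshold)


# Made-up counts, for illustration only.
groups = [
    GroupOutcome("comparison", passed=60, total=100),
    GroupOutcome("group_b", passed=40, total=100),
    GroupOutcome("group_c", passed=50, total=100),
]
ratios = impact_ratios(groups, "comparison")
print(flag_below_threshold(ratios))  # ['group_b']
```

A flag from a check like this is a prompt for closer review, not a finding in itself, which is why the analysis is worth rerunning whenever the candidate population or item pool changes.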

On appropriate use. We work with customers to match the assessment to the role and the stage of the process, and we are clear in our documentation about what each product is and is not designed to do. Getting this right matters because most validity failures we see in the wild come from a tool being used outside its intended scope.

On documentation for audit. Our technical manual covers design, development, deployment, and validation in enough depth to support external audit. It is updated on a six-month cycle, owned by our Assessment Science team, and shared with customers who need it for their own governance review.

We are not finished. The honest position on AI-enabled scoring in particular is that the industry, including us, is still working out how to score AI interactions in a way that is fully deterministic, explainable, and fair. We say more about that below. But the standard SIOP sets is the right one, and the discipline of measuring ourselves against it is how we keep our work credible.

The integrity question is different for technical hiring

One point we keep returning to internally is that the cheating conversation in talent assessment does not generalize cleanly across assessment types. Personality and judgement-based assessments are structurally harder to game. A language model cannot easily infer what an employer is looking for in a forced-choice item and prompt its way to a better answer. The construct is, in effect, protected by its own design.

Technical assessment works differently. A coding candidate using AI is not gaming an inference about themselves. They are using a tool that can produce the work product the assessment is measuring. That is a different kind of integrity problem, and it needs a different response.

Ted Irland, on our solutions engineering team, put it well in an internal conversation.

“Using AI tools is not, on its own, cheating. Engineers use AI every day on the job. The integrity question is whether the candidate can replicate the answer, explain the reasoning, and show the judgement that produced it.”
— Ted Irland, Solutions Engineering, Codility

If the candidate cannot do those things, the assessment is no longer measuring what it claims to measure. That is a more useful line than a binary AI-allowed-or-not framing. It also points to where the work needs to go: not in detecting AI use as such, but in measuring whether the candidate’s contribution stands on its own when AI is part of the workflow.

Where assessment is heading: from blocking AI to measuring how candidates use it

Our prediction is that technical assessment will move through four stages over the next two years, and that most enterprise hiring programmes will land somewhere between stages three and four by the end of 2027.

Stage one: prohibited. AI is blocked. The candidate completes the assessment unaided. This is still the right approach for foundational cognitive screens and for skills where the construct is the unaided ability itself.

Stage two: monitored. AI is permitted, and its use is logged and surfaced to the reviewer. Many enterprise hiring programmes work this way today.

Stage three: required. The assessment assumes the candidate will use AI because the job assumes they will. The measurement target shifts from “can you produce this output unaided” to “can you produce this output the way the job actually produces it.”

Stage four: scored directly. AI fluency itself becomes part of the construct. Prompt quality, evaluation of model output, debugging of AI-generated work, and judgement about when not to use the tool all become measurable dimensions in their own right.

Most of our enterprise customers sit between stages two and three today. The fourth stage is still being worked out across the industry. Scoring an AI interaction in a way that is deterministic, explainable, and fair is genuinely hard, and we would not claim to have fully solved it. Anyone who does claim that is probably ahead of the evidence.

A small but important caveat sits underneath all of this. At least one major employer presented work at SIOP suggesting that AI interviewers, when used as the primary assessment instrument, can quietly bias scores upward by helping candidates with prompts, clarifications, and framings. That is not an argument against AI in assessment. It is an argument for measuring carefully, validating against on-the-job performance, and being clear-eyed about what each design decision does to the signal.

AI readiness still needs a measurement model

A related question came up across many of the sessions we attended. What does it actually mean to be “AI ready” in a given role, and how should it be measured?

The answer is that the construct is unsettled. Several competing models were presented at SIOP this year. None of them is yet the accepted standard. We think that is the right state of affairs for now. AI use is changing too quickly for any one model to claim authority.

Our working list of dimensions, drawn from our own ESM (v2.1), is this.

AI Fluency — Working effectively, efficiently, ethically, and safely within emerging modalities of human-AI interaction by understanding what AI is, how it works, what it can and cannot do, and developing the critical awareness to engage with AI systems appropriately.

AI Collaboration — Working effectively with AI as a collaborative partner across the complete interaction loop — from task delegation through output refinement, and supervising autonomous AI agents.

AI Responsibility — Practicing the values, principles, and skills required to ensure AI is developed and used fairly, transparently, and accountably, treating ethics as an active competency rather than abstract knowledge.

AI Governance — Operating within regulatory frameworks, organizational policies, and industry standards governing AI use, ensuring AI readiness includes legal and procedural competence.

AI Engineering — Designing, developing, deploying, and maintaining AI and ML systems — primarily for technical roles but including skills relevant to technical product managers and AI-adjacent engineering roles.

The harder question is how to separate three things when you measure performance on an AI-assisted task: how much of the result comes from the candidate’s raw skill, how much from AI augmentation, and how much from the quality of the interaction between the two. Each of these is a different measurement target. Treating them as one signal is how you end up with assessment scores that look good in a demo but do not predict on-the-job performance.

This is one of the questions we are working on, alongside the field.

Token economics will become a measurement question

One area that got relatively little airtime this year but will probably matter more by 2027 is token economics.

As AI becomes a default tool in knowledge work, the question shifts from “did the candidate get the right answer” to “how efficiently did they get there.” Two candidates can arrive at the same solution, but if one uses minimal tokens and the other burns through a budget, the cost difference compounds across a career. CFOs will start asking about it, and assessment will be expected to give a meaningful answer.

The raw data is already available. Most major model APIs return token usage with each response. What is missing is the measurement framework: how to combine quality of output with cost of output into a single signal of efficiency, how to compare candidates fairly across different model choices, and how to surface that signal to a reviewer in a way that supports rather than replaces human judgement.
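To make the shape of the problem concrete, here is a minimal sketch of one possible efficiency signal: rubric quality divided by the dollar cost of the tokens used. Everything in it is an assumption for illustration, including the field names, the 0 to 1 quality scale, and the per-million-token prices. Converting tokens to cost is one simple way to compare sessions run on different models. It is not a validated measurement framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One candidate's AI-assisted task session (hypothetical fields)."""
    candidate: str
    quality: float                 # 0.0 to 1.0, from an existing rubric (assumed)
    input_tokens: int
    output_tokens: int
    input_price_per_mtok: float    # USD per million input tokens (assumed)
    output_price_per_mtok: float   # USD per million output tokens (assumed)


def session_cost(s: Session) -> float:
    """Dollar cost of the tokens consumed in the session."""
    return (
        s.input_tokens * s.input_price_per_mtok
        + s.output_tokens * s.output_price_per_mtok
    ) / 1_000_000


def efficiency_report(s: Session) -> dict:
    """Report quality and cost separately, plus their ratio.

    Keeping the components visible lets a reviewer see why a ratio is
    high or low instead of relying on the single number.
    """
    cost = session_cost(s)
    return {
        "candidate": s.candidate,
        "quality": s.quality,
        "cost_usd": cost,
        "quality_per_usd": s.quality / cost if cost > 0 else None,
    }


# Same quality, ten times the token spend (made-up numbers).
lean = Session("candidate_a", 0.9, 20_000, 5_000, 3.0, 15.0)
heavy = Session("candidate_b", 0.9, 200_000, 50_000, 3.0, 15.0)
for session in (lean, heavy):
    print(efficiency_report(session))
```

A plain ratio like this rewards cheap sessions regardless of task difficulty, so any real framework would need to normalise by task and be validated against on-the-job outcomes before it informed a decision.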

This is one of the areas we are actively working on.

What we are working on

Four areas of work have moved up our list as a result of what we saw and heard in New Orleans.

Cut-score guidance in assessment setup. Banding rather than single-point thresholds, with use-case labelling so that customers can distinguish between a hard gate, a recommendation, and a rank-ordering.

Our AI-fluency scoring framework. Mapped to the ESM v2.1 dimensions above: Fluency, Collaboration, Responsibility, Governance, Engineering.

Item-behaviour monitoring in the platform. So that customers can see how their items are performing over time rather than discovering drift after the fact.

Better evidence for technical-assessment cheating mitigation. The integrity problem in technical hiring is structurally different from the one most of the industry’s data addresses. We want to close that gap with evidence drawn from the work product itself.
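The first item above, banding with use-case labelling, can be illustrated with a short sketch. The band edges, labels, and function names below are hypothetical placeholders, not shipped product behaviour. The point is only that the same band can lead to different actions depending on whether the score is used as a hard gate, a recommendation, or a rank-ordering.

```python
from bisect import bisect_right
from enum import Enum


class UseCase(Enum):
    HARD_GATE = "hard gate"
    RECOMMENDATION = "recommendation"
    RANK_ORDERING = "rank-ordering"


# Hypothetical band edges on a 0-100 scale; lower edges are inclusive.
BAND_EDGES = (40, 60, 80)
BAND_LABELS = ("D", "C", "B", "A")


def band_for(score: float) -> str:
    """Map a score to a band, so nearby scores are treated alike."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    return BAND_LABELS[bisect_right(BAND_EDGES, score)]


def interpret(score: float, use_case: UseCase, minimum_band: str = "C") -> dict:
    """Turn a score into a band and an action that depends on the use case."""
    band = band_for(score)
    meets = BAND_LABELS.index(band) >= BAND_LABELS.index(minimum_band)
    if use_case is UseCase.HARD_GATE:
        action = "advance" if meets else "reject"
    elif use_case is UseCase.RECOMMENDATION:
        action = "suggest advance" if meets else "suggest human review"
    else:
        action = "rank within band"
    return {"band": band, "use_case": use_case.value, "action": action}


print(interpret(62, UseCase.HARD_GATE))
print(interpret(38, UseCase.RECOMMENDATION))
print(interpret(78, UseCase.RANK_ORDERING))
```

Because scores of 62 and 78 fall in the same band here, a two-point difference near the middle of a band no longer decides an outcome on its own, which is the main argument for banding over a single-point threshold.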

Some of this is shipped. Some is in flight. Some is still an open question, and our thinking will move as the field’s thinking moves.

Our conclusion from SIOP 2026 is that AI capability is moving faster than the science is settling. The measurement frameworks, the governance standards, and the validation evidence are still catching up. The temptation for vendors is to overclaim. The more useful response is to measure ourselves against the standards the field has set, show our working, and keep doing the science.

Written by

Tony Mellek, Head of Assessment Science, Codility, June 2026