How AI Fits Into Technical Assessment

AI has fundamentally changed how engineers write code, and technical assessments need to keep pace. The most effective approach combines three layers: assessment design that resists AI gaming by testing judgment rather than recall, behavioural monitoring that analyses how candidates solve problems rather than scanning for AI-generated output, and configurable AI access that lets you observe how engineers actually work with AI tools. Detection alone is not enough. The evidence shows that AI-generated code detection is unreliable, and the platforms that acknowledge this honestly are the ones building assessments that will still work in two years.

The core problem is straightforward: the tools engineers use every day can now solve most traditional coding assessments without human input. This is not a future concern. It is the current reality, and it has exposed a fundamental weakness in how the industry has approached technical hiring for the past decade.

The scale of the problem

The statistics circulating across the industry paint a consistent picture, even if the exact figures vary by source. CodeSignal’s 2025 fraud data showed that 35% of proctored assessments were flagged for suspicious behaviour, up from 16% the previous year. Entry-level roles saw flag rates approaching 40%. Fabric’s analysis of over 19,000 technical interviews found that 38.5% of candidates exhibited cheating behaviour, with the rate climbing to 48% in technical roles. Most concerning: 61% of flagged candidates scored above passing thresholds, meaning they would have advanced through the process undetected.

Why traditional assessments are vulnerable

The real issue is not that candidates are cheating. It is that most coding assessments were designed for a world where looking up a solution required effort. Algorithmic puzzles, syntax recall questions, and isolated function-writing tasks were reasonable when the alternative was memorising documentation. They are not reasonable when a candidate can paste the problem into ChatGPT and receive working code in seconds.

If an AI tool can solve your assessment without human judgment, the assessment is testing the wrong thing. The question is not how to stop candidates using AI. It is how to design assessments where AI assistance does not eliminate the need for genuine engineering skill.

What engineering leaders actually worry about

Surveys of engineering leaders reveal concerns that go deeper than cheating statistics. Karat’s 2025-2026 AI Workforce Transformation Report found that confidence among engineering leaders that the right candidates receive offers dropped from 68% to 47% in a single year. Meanwhile, 71% of U.S. engineering leaders said it is difficult to accurately assess the skills needed for AI-era roles. The problem is not just integrity. It is relevance. Engineering leaders are losing faith that current assessment methods measure what actually matters.

Can you reliably detect AI-generated code?

The honest answer is: not with the consistency that most vendors imply. This matters because detection is the foundation of nearly every competitor’s integrity pitch, and the evidence base for those claims is weaker than the marketing suggests.

What the academic research says

The academic literature on AI-generated code detection is sobering for anyone selling detection as a primary solution. A 2026 study published by Springer (Cuellar Argotty and Manrique) evaluated seven detection tools across 1,644 code samples and found critical precision-recall trade-offs with significant performance drops when code was even slightly modified. Their conclusion: current AI detection tools are unreliable for practical use. A separate 2025 paper in Frontiers in Computer Science stated directly that reliable methods for detecting AI-generated code do not currently exist.

The core technical challenge is that AI-generated code is non-deterministic. The same prompt produces different outputs each time, and those outputs increasingly resemble competent human code. Traditional plagiarism tools like MOSS and JPlag rely on code similarity, which fails when every AI output is unique. Newer ML-based detectors can achieve reasonable accuracy on unmodified AI output, but performance degrades sharply when candidates make even minor edits to the generated code.
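
To make the similarity problem concrete, here is a toy sketch, not MOSS or JPlag, of the token-overlap comparison that similarity-based tools rely on. The function names and the reference solution are invented for illustration; the point is that when every AI generation is phrased differently, overlap with known solutions stays low and the check never fires.

```python
# Toy illustration (not MOSS or JPlag): similarity-based checks compare a
# submission against known solutions. When every AI generation is unique,
# pairwise overlap stays low and the check never fires.
import re


def token_set(code: str) -> set:
    """Normalise code into a set of identifier and keyword tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code.lower()))


def jaccard_similarity(code_a: str, code_b: str) -> float:
    """Jaccard overlap between two token sets, in [0, 1]."""
    a, b = token_set(code_a), token_set(code_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Hypothetical reference pool of previously seen solutions.
known_solutions = [
    "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s",
]
# A freshly generated AI solution to the same task, phrased differently.
submission = "def sum_values(values):\n    return sum(values)"

best_match = max(jaccard_similarity(submission, ref) for ref in known_solutions)
print(f"best similarity: {best_match:.2f}")  # low, so a similarity check stays silent
```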

The accuracy claims do not hold up to scrutiny

Across the market, detection accuracy claims are inconsistent and poorly documented. One major competitor publishes a 93% accuracy figure in marketing materials while their own technical documentation states 85% overall precision. Neither figure is accompanied by a methodology paper, an explanation of what “accuracy” means in context (precision, recall, F1 score, or something else), or disclosure of the test dataset. No vendor in this category has published a peer-reviewed validation study for their AI detection system.

Codility takes a different approach. Rather than publishing a single accuracy figure, Codility’s integrity system combines three signal categories (identity and network verification, behavioural signals, and similarity checks) into a structured risk assessment developed in collaboration with I/O psychologists. The output is a four-tier risk level, not a binary cheating verdict. Hiring teams see exactly what contributed to each assessment and interpret the evidence in context. The philosophy is signals, not conclusions.

This does not mean detection is worthless. It means detection should be one signal among many, not the primary line of defence. The platforms making the boldest detection claims are also the ones least likely to tell you about false positive rates, because those numbers would undermine the confidence they are trying to sell.

Behavioural signals are more promising than content analysis

The more defensible approach to integrity monitoring focuses on how candidates work rather than what they produce. Research on “para data” (keystroke dynamics, typing patterns, pause analysis, edit sequences, and browser behaviour) shows stronger predictive validity for identifying problematic behaviour than attempting to classify whether finished code was AI-generated.

This is the approach grounded in assessment science: observe the process, not just the output. When you record the full sequence of how a candidate arrives at a solution, including their typing rhythm, their debugging approach, and their revision patterns, you build a richer picture than any binary “AI or human” classifier can provide. This is the principle behind Codility’s integrity approach, which surfaces behavioural signals for human review rather than rendering automated verdicts.
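
As a rough illustration of what process-level signals look like in practice, the sketch below derives a few of them from a hypothetical session event log. The schema, field names, and thresholds are assumptions made for this example; they are not Codility's internal format.

```python
# A minimal sketch of deriving behavioural ("para data") signals from a
# hypothetical session event log. The schema and thresholds are assumptions
# for this example, not Codility's internal format: each event carries a
# timestamp in seconds, an event kind, and the number of characters inserted.
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Event:
    t: float        # seconds since the session started
    kind: str       # "keystroke", "paste", or "defocus"
    chars: int = 1  # characters inserted by this event


def behavioural_signals(events):
    key_times = [e.t for e in events if e.kind == "keystroke"]
    gaps = [later - earlier for earlier, later in zip(key_times, key_times[1:])]
    pastes = [e for e in events if e.kind == "paste"]
    typed_chars = sum(e.chars for e in events if e.kind == "keystroke")
    pasted_chars = sum(e.chars for e in pastes)
    return {
        "mean_keystroke_gap_s": mean(gaps) if gaps else 0.0,
        "keystroke_gap_stdev_s": pstdev(gaps) if len(gaps) > 1 else 0.0,
        "large_paste_count": sum(1 for p in pastes if p.chars > 200),
        "pasted_char_ratio": pasted_chars / max(1, typed_chars + pasted_chars),
        "defocus_count": sum(1 for e in events if e.kind == "defocus"),
    }


session = [Event(0.0, "keystroke"), Event(0.4, "keystroke"), Event(0.9, "keystroke"),
           Event(12.0, "defocus", 0), Event(25.0, "paste", 640)]
print(behavioural_signals(session))
```

None of these numbers is incriminating on its own: a long pause can be thinking and a paste can be the candidate's own scratch code, which is exactly why such signals belong in front of a human reviewer rather than inside an automated verdict.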

If detection is unreliable, what actually works?

The most effective defence against AI-assisted cheating is not better detection. It is better assessment design. This is not a novel insight in assessment science, but it is one the industry has been reluctant to embrace because selling detection features is easier than redesigning assessments from the ground up.

Assessment design is the primary defence

The principle is simple: design tasks where AI assistance does not eliminate the need for genuine engineering judgment. In practice, this means several things.

Multi-file, realistic environments

Multi-file, realistic environments force candidates to navigate complexity that AI tools handle poorly. Configuring a build system, debugging across file boundaries, understanding how components interact, and making architectural decisions in context are tasks where AI can assist but cannot replace human understanding. An engineer who knows what they are doing will use AI to accelerate their work. An engineer who does not will produce code that technically runs but reveals fundamental misunderstanding when you examine the approach.

Process-oriented evaluation

Process-oriented evaluation shifts the focus from “does the code work?” to “how did the candidate get there?” When you can observe the full problem-solving trajectory, including false starts, debugging strategies, and how candidates respond when their first approach fails, you learn far more about engineering capability than any final-output analysis can reveal.

Architectural and design judgment tasks

Architectural and design judgment tasks target skills that AI tools consistently struggle with. System design, trade-off analysis, code review exercises, and tasks requiring candidates to evaluate and improve existing code test the kind of judgment that distinguishes senior engineers from junior ones, and that AI tools cannot yet replicate.

The I/O psychology perspective

From an I/O psychology standpoint, the shift toward process-based assessment aligns with decades of research on work sample tests. The Sackett et al. (2022) meta-analysis confirmed that work samples, where candidates perform tasks representative of actual job duties, remain among the strongest predictors of job performance. The question for technical assessment is whether the “work sample” reflects how engineers actually work today, which increasingly includes AI tools.

Louis Hickman’s research at Virginia Tech provides the most direct evidence for the technical assessment context. His 2025 paper in the International Journal of Selection and Assessment demonstrated that advanced language models can now solve unproctored cognitive tests at the 95th percentile, a dramatic increase from below the 20th percentile just one year earlier. His conclusion: few, if any, unproctored tests will remain safe from AI-assisted completion. His proposed solutions include proctoring, strict time limits, trace data analysis, and notably, allowing AI use and assessing collaboration skills instead.

Layered integrity, not a single solution

The most robust approach combines multiple layers, each addressing a different dimension of the problem:

Layer 1: Assessment design

Tasks that require genuine understanding even when AI tools are available. This is the foundation that makes every other layer more effective.

Layer 2: Behavioural monitoring

Keystroke analysis, session recording, paste detection, defocus tracking, and performance anomaly detection. Codility combines these signals into a four-tier integrity risk level developed with I/O psychologists, giving reviewers a structured starting point rather than a list of isolated flags.
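
To make the idea of structured aggregation concrete, here is a deliberately simplified sketch of combining normalised signals into a four-tier risk level. The weights, thresholds, and tier labels are invented for illustration and are not Codility's production model; the point is that reviewers get a starting point plus the signals that drove it, not a binary verdict.

```python
# Deliberately simplified: weighted aggregation of normalised signals into a
# four-tier risk level. Weights, thresholds, and tier labels are invented for
# illustration and are not Codility's production model.
SIGNAL_WEIGHTS = {
    "large_paste_score": 0.35,
    "defocus_score": 0.20,
    "identity_mismatch_score": 0.30,
    "similarity_score": 0.15,
}

TIER_THRESHOLDS = [(0.25, "low"), (0.50, "moderate"), (0.75, "elevated")]


def risk_assessment(signals):
    """Each signal is normalised to [0, 1]; returns (tier, score, contributions)."""
    contributions = {
        name: SIGNAL_WEIGHTS[name] * min(1.0, max(0.0, signals.get(name, 0.0)))
        for name in SIGNAL_WEIGHTS
    }
    score = sum(contributions.values())
    for threshold, label in TIER_THRESHOLDS:
        if score < threshold:
            return label, score, contributions
    return "high", score, contributions


tier, score, why = risk_assessment({"large_paste_score": 0.8, "defocus_score": 0.3})
print(tier, round(score, 2), why)  # reviewers see the tier and what drove it
```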

Layer 3: Configurable AI access

Rather than trying to prevent all AI use, provide controlled environments where you choose whether candidates work with AI tools enabled, restricted, or fully open, and observe the results. Codility captures every AI interaction during both take-home assessments and live interviews, with full transcripts available for review.

Layer 4: Verification

Follow-up interviews where candidates explain and extend their assessment work. Codility supports this with AI-generated follow-up questions after submission that probe understanding and originality, plus live interview environments where interviewers can dig deeper. No amount of AI assistance can substitute for a candidate who genuinely understands what they built.

Should assessments ban AI tools, allow them, or monitor them?

This is the question every engineering leader is asking, and there is no single correct answer. The right approach depends on what you are trying to measure, what role you are hiring for, and how your own engineering team works day to day.

The case for banning AI

Banning AI tools makes sense when you need to isolate foundational skills. If you are hiring junior engineers and need to verify they can write basic code independently, or if you are assessing core computer science knowledge for roles where fundamentals matter more than tooling fluency, removing AI tools gives you a cleaner signal on baseline capability.

The risk is relevance. If your engineers use Copilot and ChatGPT every day, assessing candidates without those tools measures a skill set that does not match the job. You may filter out candidates who are highly productive with AI assistance but slower without it, and let through candidates who can solve puzzles but struggle in modern development environments.

The case for allowing AI

Anthropic’s 2025 study of junior software engineers found meaningful differences in how people use AI: some engineers use AI to build comprehension (asking for explanations, requesting follow-up detail, testing their understanding), while others simply generate code without engaging with what it does. The first group develops stronger skills over time. The second group produces working code but accumulates understanding debt.

Observing these patterns during an assessment tells you something genuinely useful about how a candidate will perform on the job. An engineer who uses AI to accelerate work they understand is different from an engineer who uses AI to mask work they do not understand. You cannot see this distinction if you ban the tools entirely.

The case for enabling and monitoring

The monitored approach, where candidates have access to AI tools but every interaction is recorded and reviewable, gives you the most information. You see what they ask the AI, how they evaluate the response, what they modify, and where they override the AI’s suggestions. This is the richest signal available, and it directly addresses the question engineering leaders increasingly care about: can this person work effectively with AI tools?

The trade-off is evaluation complexity. Reviewing AI interaction logs takes time and requires reviewers who understand what good AI collaboration looks like. Without clear rubrics for what constitutes effective AI use versus over-reliance, the data can be hard to interpret consistently.
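
One way to make that review more consistent is a simple rubric over the interaction transcript. The sketch below assumes each AI exchange has already been labelled, by a reviewer or a classifier, as comprehension-seeking or generate-only, echoing the usage patterns in the Anthropic study above; the labels and the ratio threshold are illustrative, not a validated scoring scheme.

```python
# A sketch of a transcript-level rubric, assuming each AI exchange has already
# been labelled as "comprehension" (asking for explanations, testing
# understanding) or "generate" (requesting code without engaging with it).
# Labels and the 0.4 threshold are illustrative, not a validated scheme.
from collections import Counter


def ai_collaboration_summary(turn_labels):
    counts = Counter(turn_labels)
    total = sum(counts.values()) or 1
    comprehension_ratio = counts["comprehension"] / total
    return {
        "turns": total,
        "comprehension_ratio": round(comprehension_ratio, 2),
        "pattern": ("builds understanding" if comprehension_ratio >= 0.4
                    else "generates without engaging"),
    }


transcript = ["generate", "comprehension", "comprehension", "generate", "comprehension"]
print(ai_collaboration_summary(transcript))
```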

Codility’s approach: you choose

Codility’s VS Code assessment environment supports configurable AI access across both take-home assessments and live interviews. When AI is enabled, every interaction is monitored by default, so there is no trade-off between access and oversight.

For take-home assessments, you can enable a built-in AI assistant and review the full interaction transcript after submission, including AI-generated follow-up questions that test whether the candidate genuinely understands their solution. For live interviews, interviewers control which AI capabilities and models candidates can access, choosing from multiple providers, and can observe the candidate’s problem-solving process in real time, with every AI interaction recorded in the post-interview report. You can also run assessments with AI tools fully disabled. Rather than prescribing a single philosophy, we give engineering teams the tools to implement the approach that matches their hiring goals and their team’s working practices.

This matters because the “right” answer varies by role, by team, and by what stage of the hiring process you are in. A screening assessment might restrict AI to isolate fundamentals. A final-round project might enable full AI access to see how candidates work in realistic conditions. The platform should support both without forcing a one-size-fits-all approach.
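
As a concrete illustration of what configuring AI access per stage can mean, the sketch below encodes a hypothetical hiring pipeline's AI policy. The structure and field names are invented for this example; they are not Codility's configuration API.

```python
# Hypothetical per-stage AI policy for a hiring pipeline. The structure and
# field names are invented for this example; they are not Codility's
# configuration API.
from dataclasses import dataclass


@dataclass(frozen=True)
class AIPolicy:
    mode: str                  # "disabled", "restricted", or "open"
    log_interactions: bool     # keep a reviewable transcript when AI is enabled
    follow_up_questions: bool  # probe understanding after submission


PIPELINE = {
    "screening": AIPolicy(mode="disabled", log_interactions=False, follow_up_questions=True),
    "take_home": AIPolicy(mode="restricted", log_interactions=True, follow_up_questions=True),
    "final_project": AIPolicy(mode="open", log_interactions=True, follow_up_questions=False),
}

for stage, policy in PIPELINE.items():
    transcript = "yes" if policy.log_interactions else "n/a"
    print(f"{stage}: AI {policy.mode}, transcript reviewed: {transcript}")
```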

What does it actually mean to assess AI collaboration skills?

This is the frontier, and honesty requires acknowledging that nobody has definitively solved it yet. The industry is full of platforms claiming to assess “AI skills,” but the construct validity evidence, the proof that what they measure actually predicts job performance, does not exist for any of them.

The gap between claims and evidence

HackerRank offers prompt engineering questions and RAG assessment templates. CodeSignal launched an “AI Collection” with assessments for AI Literacy, Prompt Engineering, and AI Research skills. CoderPad proposes a four-dimension AI Proficiency Framework covering prompting skill, workflow integration, judgment and verification, and creativity. None of these has published criterion-related validity data showing that scores predict on-the-job performance. None has been peer-reviewed. All are less than two years old.

This is not a criticism specific to any one vendor. Assessing how well someone works with AI is a genuinely hard psychometric problem. The tools are evolving rapidly, the skills required change as models improve, and the relationship between “good at prompting” and “good at engineering with AI” is not yet well understood.

What we do know

Despite the uncertainty, research points toward several dimensions that matter:

Judgment and verification: evaluating AI output critically, checking whether it is correct and appropriate for the context rather than accepting it at face value.

Problem decomposition for AI: breaking a task into pieces where AI assistance genuinely helps and specifying those pieces clearly.

Iterative refinement: improving prompts and generated code over successive passes rather than settling for the first answer.

Knowing when not to use AI: recognising problems where generated code adds risk or masks a lack of understanding rather than saving time.

How Codility approaches AI skills assessment

Codility’s COMPASS benchmark evaluates AI-generated code quality across multiple dimensions, providing a research foundation for understanding what “good” looks like when engineers work with AI tools.

COMPASS (COdility’s Multi-dimensional Programming ASSessment) is Codility’s research benchmark for evaluating AI-generated code using criteria that reflect real software engineering standards. Most existing code-generation benchmarks evaluate models primarily on functional correctness. COMPASS expands this approach by measuring three complementary dimensions of performance: correctness (functional accuracy across test cases), efficiency (algorithmic performance and scalability under realistic input constraints), and code quality / maintainability (structural clarity, modularity, and adherence to engineering best practices). 

The benchmark is built from 50 programming challenges drawn from real Codility coding competitions, supported by 393,150 historical human submissions that provide empirical performance baselines. This allows model performance to be interpreted relative to large-scale human developer behavior rather than evaluated in isolation. COMPASS has been released publicly as an arXiv paper and accepted for publication and presentation at the Association for Computing Machinery (ACM) International Conference on AI Foundation Models and Software Engineering (FORGE 2026). 

Early results illustrate an important limitation of correctness-only evaluation: models that achieve high correctness scores can still produce code that is inefficient or difficult to maintain. By evaluating code generation across multiple engineering-relevant dimensions, COMPASS provides a more realistic picture of how modern AI systems perform in real development contexts, where correctness, efficiency, and maintainability all matter.
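
To illustrate why multi-dimensional evaluation changes the picture, here is a toy scoring sketch in the spirit of COMPASS. The equal weighting and the percentile helper are assumptions made for this example, not the published COMPASS methodology; they simply show how a model that aces correctness can still land mid-pack once efficiency and maintainability count.

```python
# Toy scoring in the spirit of COMPASS: correctness alone versus a composite
# that also counts efficiency and maintainability. Equal weights and the
# percentile helper are assumptions for this example, not the published
# COMPASS methodology.
from bisect import bisect_left


def composite(correctness, efficiency, quality):
    """Equal-weight composite across the three dimensions (each in [0, 100])."""
    return (correctness + efficiency + quality) / 3.0


def percentile_vs_humans(score, human_scores):
    """Share of historical human submissions that this score meets or exceeds."""
    ordered = sorted(human_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# A model that nearly aces correctness but produces slow, hard-to-maintain code.
model_scores = {"correctness": 98.0, "efficiency": 55.0, "quality": 60.0}
human_baseline_composites = [40.0, 55.0, 62.0, 70.0, 78.0, 85.0, 91.0]

overall = composite(**model_scores)
print(f"correctness-only: {model_scores['correctness']:.0f}")
print(f"composite: {overall:.1f} "
      f"({percentile_vs_humans(overall, human_baseline_composites):.0f}th percentile vs humans)")
```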


Codility’s task library already includes AI-native assessment tasks that require candidates to work directly with LLM tools or build AI-powered solutions, alongside broader AI-readiness assessments covering foundational concepts. The library is expanding, with additional AI-native tasks in active development.

Our assessment science team, led by Tony Mellek (occupational psychologist) and James Meaden (assessment scientist), is actively researching the construct validity of AI collaboration assessment. We are transparent about what we know and what we are still learning, because the organisations that will build the strongest engineering teams are the ones making decisions based on evidence rather than vendor marketing.


What should you look for when evaluating platforms?

When assessing how a technical assessment platform handles AI, look past the feature lists and ask questions that reveal whether the vendor understands the problem or is simply selling a reaction to it.

Questions that reveal depth versus marketing

Ask about detection accuracy and insist on specifics. What is the false positive rate? What is the recall rate? How were these figures derived? What dataset was used? If the vendor cannot answer these questions, their accuracy claims are marketing, not science. Any vendor quoting a single “accuracy” number without specifying whether it refers to precision, recall, or F1 score is being imprecise at best.
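
These metrics are standard, and a short worked example shows why the distinction matters. With made-up counts for a pool where genuine misuse is rare, a detector can report impressive accuracy while missing most of the cases you care about:

```python
# Worked example with made-up counts: 1,000 candidates, 50 of whom genuinely
# used AI against the rules. A detector can post 96% accuracy while catching
# fewer than half of them, which is why precision and recall must be asked
# for separately.
def detection_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": round(precision, 2), "recall": round(recall, 2),
            "f1": round(f1, 2), "accuracy": round(accuracy, 2)}


print(detection_metrics(tp=20, fp=10, fn=30, tn=940))
# {'precision': 0.67, 'recall': 0.4, 'f1': 0.5, 'accuracy': 0.96}
```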

Ask how assessments are designed to resist AI gaming. A platform that relies primarily on detection is fighting an arms race it will eventually lose. A platform that designs assessments where AI assistance does not eliminate the need for genuine skill is building something more durable. Multi-file environments, architectural judgment tasks, and process-oriented evaluation are signals of this approach.

Ask about published validation research. Has the vendor published peer-reviewed studies demonstrating that their AI detection works? That their AI skills assessments predict job performance? That their integrity measures do not produce adverse impact? As of early 2026, no major vendor in this category has published peer-reviewed validation for their AI-specific tools. The vendor that publishes first earns the most credibility.

Ask who builds the assessments and what their credentials are. I/O psychologists, assessment scientists, and engineers with published research represent genuine investment in assessment quality. A vendor that cannot name the people behind their assessment methodology is outsourcing credibility to job titles rather than demonstrating it through expertise.

Ask about configurability. Your needs will change as AI tools evolve and as your organisation develops its own policies on AI use. A platform that locks you into a single philosophy (AI banned, AI allowed, or AI monitored) limits your ability to adapt. Look for platforms that let you configure AI access per assessment, per role, and per stage of the hiring process.

Why Codility takes this approach

Codility’s position is grounded in assessment science rather than feature marketing. Our VS Code environment provides the realistic, multi-file assessment context that resists AI gaming by design. Our configurable AI settings let you choose the approach that matches your hiring goals. Our assessment science team, including an occupational psychologist and assessment scientists, brings the I/O psychology rigour that most platforms lack.

We also believe in transparency. We will tell you what we know, what we are still researching, and where the evidence is inconclusive. In a market where vendors routinely overclaim, we think honest expertise builds stronger partnerships than inflated promises.

Frequently asked questions

How do you prevent candidates from using ChatGPT during coding assessments?

The most effective approach is not to focus solely on preventing AI tool use, but to design assessments where AI tools alone cannot produce a passing result. Multi-file tasks with realistic constraints, architectural judgment questions, and process-oriented evaluation ensure that candidates need genuine engineering skill even when AI tools are available. Behavioural monitoring (keystroke analysis, session recording, paste detection) provides an additional layer of integrity. Codility supports configurable AI access, allowing you to ban, restrict, or monitor AI tool use depending on your goals.