Why Your Coding Assessment Needs a Real IDE
Why do stripped-down code editors produce bad signal?
Most coding assessment platforms still default to basic browser editors because they are simpler to build and maintain, not because they produce better hiring outcomes. These environments strip away the tools engineers rely on every day: terminal access, package managers, debugging, documentation, file navigation. The result is an assessment that measures adaptation to artificial constraints rather than engineering capability. I/O psychology research calls this “construct-irrelevant variance” — noise in your signal that comes from the test conditions rather than the skills you are trying to measure.
The gap between assessment and reality
Think about how your engineers actually work. They have a full IDE with IntelliSense, a terminal for running commands, access to package managers for pulling in dependencies, debugging tools for stepping through code, and the ability to structure work across multiple files. Nobody writes production code in a single file with no syntax help and no way to run anything except a test suite they cannot see.
Now think about what most coding assessments provide: a single-file editor, basic syntax highlighting, and a “Run Tests” button. The candidate writes code in conditions that bear no resemblance to the job they are being assessed for.
Two problems follow. First, you lose signal. Candidates who are strong engineers but unfamiliar with the specific constraints of your assessment platform underperform. Second, you gain noise. Candidates who are skilled at competitive programming and LeetCode-style puzzles overperform relative to their actual engineering ability.
The Behroozi et al. (2020) study from NC State University demonstrated this experimentally. Researchers split 48 computer science students into two conditions: a traditional whiteboard interview with an observer, and a private setting where candidates solved identical problems alone. Candidates in the traditional setting performed half as well. Most strikingly, all women in the public interview condition failed, while all women in the private condition passed. The researchers concluded that companies were filtering out capable engineers because the assessment conditions, not the problems themselves, were producing the failures.
What developers actually say about this
Developer communities consistently identify artificial assessment constraints as a fundamental problem with technical hiring.
A widely shared blog post that generated nearly 500 upvotes and 600 comments on Hacker News captured the core complaint: the author argued that LeetCode-style assessments bear no resemblance to the actual responsibilities of software engineering, and that most companies use them simply because other companies do.
On Team Blind, developers note the absurdity of disabling even basic features like autocomplete, with several pointing out that companies like Netflix already allow Google access during interviews and Stripe lets candidates use their own IDE. One developer describing a competing platform captured it well: they felt the basic editor was designed to make the task harder, not to evaluate their engineering ability.
Source: Team Blind discussion
The CoderPad/CodinGame State of Tech Hiring surveys (2024 and 2025) found that developers consistently rate whiteboard-style assessments as the worst recruitment method, while practical coding challenges in realistic environments are the most preferred. HackerRank’s 2025 Developer Skills Report (13,732 respondents) found that 78% of developers feel their assessments do not align with real-world tasks.
Engineering leaders are reaching the same conclusion from the hiring side. A CTO Craft article on assessment completion found that the realism of the environment directly affects whether strong candidates finish the process at all, with one CDO arguing that if your assessment can be answered with a single word like “hashmap,” it is the wrong assessment.
Source: CTO Craft
What does a real coding assessment environment look like?
A real coding assessment environment replicates the conditions under which engineers actually build software: a professional IDE (VS Code) with terminal access, package managers, debugging tools, Git integration, extensions, multi-file project support, and access to documentation. The goal is not to make the assessment easier. It is to make the signal cleaner. When the environment matches real work, you see engineering behaviour rather than test-taking behaviour.
The full tooling stack
Codility’s Interview product provides a VS Code assessment environment with the full tooling stack engineers expect:
TERMINAL ACCESS
Candidates can run commands, install dependencies, execute scripts, and interact with the runtime environment the same way they would in their local development setup. This is not a sandboxed “output only” console. It is a working terminal.
PACKAGE MANAGERS
npm, pip, Maven, NuGet, and other package managers are available. Candidates can pull in libraries, manage dependencies, and structure their solutions using the same tools they use daily. If your team uses lodash or pandas or Spring Boot, candidates can use them too.
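To make that concrete, here is a minimal, hypothetical sketch of the kind of dependency-backed solution a candidate might write after running `pip install pandas` in the integrated terminal. The function, file, and column names are illustrative assumptions, not taken from a real Codility task.

```python
# Hypothetical example: using an installed library instead of hand-rolling CSV parsing.
# In the integrated terminal, the candidate would first run:  pip install pandas
import pandas as pd

def top_customers(csv_path: str, n: int = 5) -> list[str]:
    """Return the n customer IDs with the highest total order value."""
    orders = pd.read_csv(csv_path)                               # load the input data
    totals = orders.groupby("customer_id")["order_value"].sum()  # aggregate per customer
    return totals.nlargest(n).index.tolist()                     # pick the top n IDs
```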
DEBUGGING TOOLS
Breakpoints, variable inspection, call stack navigation, and step-through debugging. When something goes wrong, candidates can diagnose it the way a working engineer would, rather than sprinkling console.log statements into a single-file editor and hoping for the best.
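As an illustration, here is a hypothetical Python defect of the kind that a breakpoint and variable inspection pin down in seconds, and that scattered print statements tend to find much more slowly. It is a generic sketch, not an assessment task.

```python
# Hypothetical bug: a mutable default argument shared across calls.
def add_tag(item: str, tags: list[str] = []) -> list[str]:
    tags.append(item)          # a breakpoint here, plus inspecting `tags`,
    return tags                # shows the same list persisting between calls

first = add_tag("urgent")      # ["urgent"]
second = add_tag("archived")   # ["urgent", "archived"] -- the unexpected carry-over
```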
MULTI-FILE PROJECTS
Real engineering work spans multiple files. Candidates can create file structures, separate concerns, and demonstrate architectural thinking that is invisible in a single-file assessment. This is where you see the difference between someone who can solve a problem and someone who can build software.
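Here is a hypothetical sketch of what that separation looks like in a small multi-file Python task, with file boundaries marked in comments. The module and field names are illustrative assumptions.

```python
# Hypothetical two-module layout: data model and pricing rules in separate files.

# --- orders/models.py ---
from dataclasses import dataclass

@dataclass
class OrderLine:
    unit_price: float
    quantity: int

# --- orders/pricing.py ---
def order_total(lines: list[OrderLine], tax_rate: float = 0.2) -> float:
    """Pricing policy kept separate from the data model."""
    subtotal = sum(line.unit_price * line.quantity for line in lines)
    return round(subtotal * (1 + tax_rate), 2)
```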
EXTENSIONS
VS Code extensions for language support, linting, formatting, and productivity are available. The environment feels familiar to any engineer who uses VS Code, which, according to Stack Overflow's developer surveys, describes the majority of developers.
GIT INTEGRATION
Version control is available within the environment, supporting realistic workflows.
DOCUMENTATION ACCESS
Candidates can reference documentation, because nobody memorises API signatures in their day job and your assessment should not pretend otherwise.
What you learn that a basic editor cannot show you
A full tooling environment gives you a richer set of signals beyond just “did they solve it”:
HOW THEY STRUCTURE WORK
Do they plan before they code? Do they separate concerns across files? Do they create a logical project structure? None of this is visible in a single-file editor.
HOW THEY DEBUG
When something fails, do they use systematic debugging (breakpoints, step-through, variable inspection) or trial-and-error? Debugging approach is one of the strongest indicators of engineering maturity, and you cannot observe it without debugging tools.
HOW THEY USE TOOLING
Do they leverage the terminal effectively? Do they install appropriate packages rather than reinventing the wheel? Do they use linting and formatting? These are daily engineering habits that predict how someone will work in your codebase.
HOW THEY MANAGE COMPLEXITY
Multi-file projects reveal whether candidates can navigate increasing complexity, manage dependencies between modules, and maintain readability as the codebase grows. A single-function algorithmic puzzle cannot measure any of this.
Product scope: The VS Code assessment environment is available in Codility’s Interview product for live and asynchronous technical interviews. The Screen product uses a code editor optimised for high-volume screening assessments.
How does the assessment environment affect completion rates?
Assessment completion rates swing dramatically based on environment quality. Traditional take-home assignments in basic environments see completion rates around 30%. Standard online assessments typically achieve 60-70%. Platforms offering realistic IDE environments report completion rates between 85% and 95%. These are not marginal differences. For a company assessing 1,000 candidates, moving from a 60% to a 90% completion environment means 300 additional completed assessments and a significantly larger pool of qualified candidates to evaluate.
The data on environment and completion
Completion rate data from multiple sources consistently points in the same direction.
DevSkiller reports a 94% completion rate with its “RealLifeTesting” approach, which provides candidates with VS Code or their own local IDE. Woven Teams reports 95% for senior engineers using realistic project-based assessments. CodeSubmit reports 92% completion when candidates use their own IDE.
Source: DevSkiller
Source: CodeSubmit
A McAfee case study showed a 35% boost in assessment completion rates after migrating to CoderPad’s collaborative IDE environment, alongside 22% faster time-to-accept.
Source: CoderPad (McAfee case study)
The SmithSpektrum 2026 analysis found that completion rates above 85% are achievable when companies combine clear tooling requirements with warm process communication.
Source: SmithSpektrum
What Codility’s own candidate data shows
Codility’s post-interview candidate survey data confirms the pattern. Across the dataset, candidates rate the assessment environment 6.7 out of 10 for similarity to their own IDE, with an overall recommendation score of 8.05 out of 10. Candidates who explicitly mention VS Code rate similarity higher (7.5) and recommendation higher (8.25), driven by familiarity with the UI, keyboard shortcuts, and editor behaviour.
The sharpest signal comes from whether IDE features helped or hindered the candidate’s work:
- When candidates describe IDE features (autocomplete, IntelliSense, syntax highlighting, terminal, debugging) as useful, scores are very strong: 8.73 for IDE similarity, 9.18 for recommendation, and 25 out of 33 responses rated the experience as excellent.
- When candidates describe IDE features as missing, scores collapse: 2.08 for IDE similarity, 6.24 for recommendation, and 30.1% of responses rate the experience as so-so or poor (compared to 14.2% overall). Candidates in this group explicitly say missing IDE support “made it harder to code” and “took up a lot of time”.
- Autocomplete and IntelliSense are the dominant theme by far (29 mentions as missing, 26 mentions as useful), followed by imports, documentation lookup, syntax checking, debugging, terminal access, and multi-file navigation.
The takeaway is clear: the presence or absence of standard IDE tooling is the single largest driver of candidate assessment experience. Candidates who get the tooling they expect rate the experience 4x higher on environment similarity and significantly higher on recommendation. Candidates who do not get it report friction that directly impairs their performance.
Source: Codility candidate post-interview survey data, analysed March 2026
Why completion rates matter more than you think
Low completion rates are not just an inconvenience. They represent a systematic bias in your hiring pipeline. The candidates most likely to abandon an assessment that feels artificial or disrespectful of their time are often the ones with the most options: experienced engineers who are already employed and evaluating your company as much as you are evaluating them.
The 2024 ERE/Talent Board CandE Benchmark Research, based on over 230,000 candidate responses, found that candidate resentment hit an all-time high, with technology and finance sectors showing 25% resentment rates. Organisations with strong candidate experience scores saw 38% higher NPS and 36% higher fairness perception.
Source: ERE
Ashby’s data from 67,400 survey requests shows that after assessment rejection, candidate NPS drops to between -23 and -26. The assessment stage is where most damage to your employer reputation occurs.
Source: Ashby
The CoderPad 2024 survey found that 78% of developers say the assessment experience directly drives their accept/decline decision. When you give engineers a credible assessment environment, you signal that you understand how they work, and that matters.
Does a realistic assessment environment produce better hiring signal?
Yes. I/O psychology research consistently shows that assessment fidelity, the degree to which the test environment matches real working conditions, is a significant driver of predictive validity. Higher-fidelity work samples produce better predictions of job performance than lower-fidelity alternatives, and this effect has been demonstrated experimentally. The theoretical basis is well established: Wernimont and Campbell’s (1968) behavioural consistency model, Asher and Sciarrino’s (1974) point-to-point correspondence principle, and Lievens and colleagues’ (2011, 2015) experimental work on stimulus and response fidelity all converge on the same conclusion. When the assessment matches the job, the prediction improves.
What the research shows about fidelity and validity
Wernimont and Campbell (1968) established that the best predictor of future performance is a sample of demonstrated behaviour, not an abstract aptitude measure. Asher and Sciarrino (1974) formalised this as “point-to-point correspondence”: the more closely a predictor resembles the actual job, the higher its validity.
Sackett, Zhang, Berry, and Lievens (2022) produced the most current comprehensive meta-analysis of selection methods, placing work sample tests at a corrected validity of .33, within the top five selection methods across all of personnel psychology. While this is lower than Schmidt and Hunter's (1998) earlier estimate of .54, largely because Sackett and colleagues applied more conservative corrections for range restriction, work samples remain among the strongest predictors of job performance.
Lievens and Patterson (2011) demonstrated that high-fidelity simulations had incremental validity over low-fidelity simulations, which in turn had incremental validity over knowledge tests alone. Each step up in fidelity captures predictive power that lower levels cannot.
Source: Lievens and Patterson (2011)
Lievens, De Corte, and Westerveld (2015) isolated the mechanism experimentally: behavioural response modes (actually performing the task) predicted job performance significantly better than written or verbal descriptions of how one would perform it. The authors concluded that lowering response fidelity results in lower predictive validity, and that higher-fidelity responses were also less correlated with general cognitive ability and more reflective of personality and work style.
What this means for coding assessments
Applied to technical hiring:
A whiteboard interview where a candidate talks through a solution is a low-fidelity, verbal-response assessment. A basic browser editor where a candidate writes code in a single file is a moderate-fidelity simulation. A VS Code environment with full tooling where a candidate builds a multi-file solution with real dependencies, terminal access, and debugging is a high-fidelity work sample.
Each step up the fidelity ladder captures signal that the previous level misses: debugging behaviour, architectural decisions, tooling fluency, complexity management, code organisation.
The Roth, Bobko, and McFarland (2005) meta-analysis of work sample tests found that validity varied substantially by design quality. Poorly designed or mismatched work samples yielded much lower validity than well-constructed ones. This means the environment itself likely moderates predictive power: a work sample administered in a basic text editor may fall short of the .33 meta-analytic average, while one administered in realistic conditions may exceed it.
Caveat
No published randomised experiment has directly compared identical coding problems administered in a basic editor versus a full IDE while measuring differential hiring outcomes. The theoretical foundation strongly predicts that higher-fidelity environments will yield better results, and the Behroozi study provides strong experimental evidence on environment effects, but the definitive controlled trial specific to IDE fidelity in coding assessment has not yet been published. We believe the convergent evidence is compelling, and we are transparent about where the direct experimental gap exists.
What about code quality beyond pass/fail?
Most assessments stop at “does it work?” That is the wrong finish line. Two candidates can solve the same problem correctly: one writes code your team would be proud to maintain, the other creates technical debt from day one. Automated code quality analysis evaluates maintainability, complexity, naming conventions, structure, and patterns, giving you insight into whether a candidate writes code your team would merge, not just code that compiles.
Why pass/fail is not enough
A passing test suite tells you the code produces the correct output for the test cases. It tells you nothing about whether the code is readable by another engineer, whether the solution handles edge cases beyond the test suite, whether the approach will scale, whether the naming conventions and structure follow maintainable patterns, or whether the code introduces unnecessary complexity.
For senior hires especially, this distinction is critical. You are not hiring for today’s ticket. You are hiring for the codebase a year from now. A candidate who writes clean, well-structured, maintainable code is worth significantly more than one who produces clever but opaque solutions.
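To make the distinction concrete, here is a hypothetical pair of Python solutions that pass exactly the same tests. One is the kind of code a reviewer merges without comment; the other is correct but creates maintenance cost immediately. Neither comes from a real assessment.

```python
# Both functions return the same results for the same inputs.

def f(x):  # correct, but opaque names, nested conditionals, and a quadratic membership check
    r = []
    for i in x:
        if i[1] > 0:
            if i[0] not in [j[0] for j in r]:
                r.append(i)
    return r

def first_positive_event_per_user(events: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep each user's first event with a positive activity count."""
    seen: set[str] = set()
    result = []
    for user_id, activity in events:
        if activity > 0 and user_id not in seen:
            seen.add(user_id)
            result.append((user_id, activity))
    return result
```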
Automated code analysis in practice
Codility evaluates code across multiple dimensions beyond test results. Rather than relying solely on whether tests pass, the platform analyses the quality of the code itself:
MAINTAINABILITY
Is the code structured in a way that other engineers can understand, modify, and extend?
COMPLEXITY
Does the solution use an appropriate level of abstraction, or is it over-engineered? Does it handle branching logic cleanly?
PATTERNS AND CONVENTIONS
Does the candidate follow idiomatic patterns for the language? Do naming conventions communicate intent?
STRUCTURE
In multi-file projects, does the candidate organise code logically? Are responsibilities separated appropriately?
The VS Code environment compounds this value. In a single-file editor, there is no structure to evaluate, no file organisation to observe, and limited complexity to manage. The full tooling environment creates the conditions for richer code quality analysis.
Not just code that compiles. Code your team would merge.
How does a VS Code environment work with AI tools in assessment?
A realistic IDE environment is the natural context for evaluating how engineers work with AI tools. Engineers use AI assistants (GitHub Copilot, ChatGPT, Claude) daily in VS Code. Assessing candidates in the same environment, with configurable AI access, lets you observe authentic AI collaboration patterns rather than either banning AI entirely (which no longer reflects the job) or testing in conditions where AI usage cannot be meaningfully observed. The VS Code environment with full tooling also supports AI-resilient assessment design: multi-file tasks with realistic project complexity are inherently harder for AI to solve alone than single-function algorithmic puzzles.
The environment enables AI-resilient tasks
Single-file algorithmic puzzles are the most vulnerable assessment format to AI generation. An LLM can solve a typical “reverse a linked list” or “find the shortest path” problem in seconds. Multi-file project tasks in a full IDE environment are inherently more resistant because they require navigating genuine project complexity, understanding relationships between modules, and making architectural decisions that depend on context a prompt cannot fully capture.
Multi-file project tasks in a full IDE environment test the skills that actually matter when AI tools are part of the daily workflow: judgment about when to use AI and when to think independently, ability to evaluate and modify AI-generated code, architectural decision-making that requires understanding the broader system, and debugging skills when AI-generated code does not work as expected.
For a deeper exploration of how AI fits into technical assessment, including configurable AI access and integrity monitoring, see How AI fits into technical assessment.
Configurable AI access
Codility supports configurable AI access in the VS Code Interview environment, allowing you to choose the approach that matches your evaluation goals:
AI DISABLED
For roles where you need to verify foundational coding ability without AI assistance.
AI MONITORED
The most revealing option for many roles. Candidates can use AI tools, and you see exactly how they interact with them: what they prompt, what they accept, what they modify, and what they reject. This shows whether someone can collaborate effectively with AI, which is increasingly the actual job.
AI ENABLED
For roles where AI fluency is the skill being assessed, with full AI access and evaluation focused on the quality of the output and the candidate’s decision-making process.
This connects directly to our position on assessment integrity: integrity means knowing who is at the keyboard and understanding how they work, not policing which tools they use.
How should you evaluate assessment environments?
When evaluating coding assessment platforms, the environment should be assessed on six dimensions: IDE foundation (is it a real IDE or a basic editor?), tooling depth (terminal, debugging, package managers, extensions), project complexity support (single-file or multi-file), AI integration strategy (ban, monitor, or enable), code quality analysis (beyond pass/fail), and candidate experience (completion rates, satisfaction, accessibility). The strongest environments replicate real working conditions as closely as possible, because the assessment science is clear: fidelity drives validity.
Six questions to ask any platform
1. WHAT IDE DOES THE CANDIDATE ACTUALLY SEE?
There is a meaningful difference between a Monaco Editor instance in a browser panel (which most platforms use) and a full VS Code environment with all native features. Ask for a candidate-view demo, not a marketing screenshot.
2. CAN CANDIDATES USE A TERMINAL?
Not a “Run Tests” button. An actual terminal where they can execute commands, install packages, and interact with the runtime. This is baseline tooling for any engineer.
3. CAN CANDIDATES WORK ACROSS MULTIPLE FILES?
Single-file assessments cap the complexity of what you can evaluate. Multi-file project support is essential for senior roles and any assessment of architectural thinking.
4. WHAT DEBUGGING TOOLS ARE AVAILABLE?
Breakpoints and step-through debugging, or just “Run and see what happens”? Debugging is one of the most important engineering skills, and you cannot assess it without debugging tools.
5. HOW DOES THE PLATFORM HANDLE AI TOOLS?
Does it ban AI entirely (increasingly unrealistic), detect AI use (unreliable), or provide configurable access with monitoring (most informative)? Your platform should support your philosophy, not dictate it.
6. WHAT HAPPENS BEYOND PASS/FAIL?
Does the platform evaluate code quality, maintainability, and patterns, or just whether tests pass? The difference determines whether you can distinguish a good engineer from a good problem-solver.
The competitive landscape
Most major assessment platforms now use Monaco Editor (the editor component that VS Code is built on) as their foundation. This represents a meaningful convergence, but it also means that simply claiming “we use VS Code” is not enough. The differentiation lies in how deep the tooling goes.
Some platforms provide Monaco with basic syntax highlighting and a run button. Others add a terminal for their project-based assessment mode only. A smaller number offer the full VS Code experience with debugging, extensions, and package managers. The gap between the marketing claim (“VS Code-powered IDE”) and the actual candidate experience varies significantly across the market.
Assessment environment comparison (as of March 2026)
| Platform | Editor | Full VS Code | Terminal | Debugger | AI Assistant | Own IDE |
|---|---|---|---|---|---|---|
| Codility | Monaco / VS Code | Yes (Interview) | Yes | Yes (VS Code) | Cody + AI Copilot | No |
| CoderPad | Monaco / VS Code | Yes (Projects) | Yes | Yes (breakpoints) | GPT-5, Claude | No |
| HackerRank | Monaco | No | Projects only | Linting only | Dual-mode AI | No |
| CodeSignal | Monaco | No | Yes | Syntax only | Cosmo + AI Interviewer | No |
| DevSkiller | VS Code + Local | Yes + Local | Yes | Local IDE | No | Yes (Git clone) |
| Karat | Proprietary | No | No | Output-based | No | No |
Source: vendor documentation and product pages, reviewed March 2026. Capabilities may have changed since review.
Why Codility built a developer-first assessment environment
Codility’s VS Code assessment environment exists because we believe the environment is part of the assessment. Every design decision, from full terminal access to multi-file project support to automated code quality analysis, is grounded in the I/O psychology principle that higher-fidelity work samples produce better predictions of job performance. We built the environment engineers actually work in because that is the environment that produces the signal you actually need.
Built on assessment science
The assessment environment is not a UX decision. It is a psychometric one. Callinan and Robertson (2000) identified six dimensions of work sample quality, with fidelity (the degree of realism) as a primary driver of validity. Codility’s assessment science team, including an occupational psychologist and assessment scientists, applies this research to every design decision in the platform.
Source: Callinan and Robertson (2000)
If the assessment environment does not match real working conditions, you are measuring the wrong thing. The VS Code environment with full tooling is how we close the gap between assessment and reality.
Transparent about what we know and what we are still learning
We are honest about the evidence landscape. The I/O psychology research strongly supports higher-fidelity environments. The Behroozi study provides experimental evidence of environment effects. Multiple vendor case studies show completion rate improvements. But the definitive controlled trial comparing IDE fidelity levels in coding assessment specifically has not been published.
We believe the convergent evidence is compelling. We also believe that transparency about evidence gaps builds stronger partnerships than overclaiming. If you want to discuss the research behind our approach, our assessment science team is available.
Frequently asked questions
What is a VS Code coding assessment?
A VS Code coding assessment runs in a full Visual Studio Code environment with access to the same tools engineers use daily: terminal, package managers, debugging, extensions, Git, and multi-file project support. Rather than writing code in a basic browser editor, candidates work in conditions that match real software development. This produces more realistic signal about how they actually engineer software.