
RESEARCH

COMPASS: A Better Way to Evaluate AI Code Generation

COMPASS is a benchmark that evaluates AI code generation across three dimensions: correctness, efficiency, and quality. Unlike existing benchmarks that only test whether code works, COMPASS measures whether it scales under production load and whether it’s maintainable. It uses 50 competitive programming problems with 393,150 human submissions as an empirical baseline.

  • Current benchmarks only measure correctness, missing efficiency and code quality entirely
  • COMPASS evaluates three dimensions: correctness, efficiency, and quality
  • Testing three leading models, we found high correctness doesn’t guarantee efficient or maintainable code
  • O4-Mini-High scored highest overall (92.3%); Claude Opus 4 struggled with efficiency (35.4%) despite strong quality scores
  • Code quality is statistically independent of correctness (r = .089)

Assessing AI Code with COMPASS

Large language models can write code that passes tests. But does that code run efficiently? Is it maintainable? Current benchmarks can’t answer these questions because they only measure correctness. COMPASS changes that by evaluating AI-generated code across three dimensions: whether it works, whether it scales, and whether it’s well-written.

What’s Wrong with Current Code Generation Benchmarks?

Most AI code generation benchmarks, including HumanEval, MBPP, and HackerRank-ASTRA, evaluate models on a single criterion: functional correctness. If the code passes the test cases, it scores well.

This creates two blind spots:

Efficiency is ignored. A brute-force O(n³) solution scores the same as an optimal O(n log n) algorithm, provided both produce correct output. In production, that distinction is the difference between software that scales and software that crashes under load.
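To make that concrete, here is a minimal illustration (a hypothetical task, not one drawn from the COMPASS problem set): two Python solutions that count pairs summing to a target. Both produce identical output, so a correctness-only benchmark treats them as equal, but only the second scales to inputs near realistic constraint bounds.

```python
from collections import Counter

def count_pairs_bruteforce(nums, target):
    """O(n^2): checks every pair explicitly -- correct, but slow on large inputs."""
    count = 0
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                count += 1
    return count

def count_pairs_fast(nums, target):
    """O(n): counts complements seen so far with a hash map."""
    seen = Counter()
    count = 0
    for x in nums:
        count += seen[target - x]  # pairs formed with earlier elements
        seen[x] += 1
    return count

# Both agree on every input, so both score 100% on correctness...
assert count_pairs_bruteforce([1, 2, 3, 4, 5], 6) == count_pairs_fast([1, 2, 3, 4, 5], 6) == 2
# ...but only the O(n) version finishes within a strict runtime threshold at scale.
```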

Quality is invisible. Code that’s difficult to read, maintain, or extend receives no penalty. Nested complexity, poor naming, and lack of modularity go unmeasured, even though these factors directly impact long-term engineering productivity.
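Again a hypothetical illustration rather than a benchmark task: both functions below pass the same tests, but a static analysis tool such as CodeScene would flag the first for nesting depth and opaque naming, while the second is the kind of code a quality dimension rewards.

```python
def f(d):
    # Correct, but deeply nested with single-letter names.
    r = []
    for k in d:
        if d[k]:
            if isinstance(d[k], list):
                for x in d[k]:
                    if x > 0:
                        r.append((k, x))
    return r

def positive_entries(data: dict) -> list:
    """Return (key, value) pairs for every positive value in each list-valued entry."""
    return [
        (key, value)
        for key, values in data.items()
        if isinstance(values, list)
        for value in values
        if value > 0
    ]
```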

How Does COMPASS Work?

COMPASS (COdility’s Multi-dimensional Programming ASSessment), developed by Ph.D.-level assessment scientists, evaluates code generation models across three independent dimensions:

  1. Correctness: Does the code produce the right output? Scored as percentage of test cases passed, including edge cases and corner cases.
  2. Efficiency: Does the code scale? Tested against large inputs near upper constraint bounds, with strict runtime thresholds derived from expert reference solutions.
  3. Quality: Is the code maintainable? Analysed using CodeScene for complexity, readability, modularity, and adherence to best practices. Scored 1–100.

The benchmark consists of 50 competitive programming problems from real Codility competitions held between 2011 and 2021, with 393,150 human submissions providing empirical baselines.
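The paper specifies the exact scoring and aggregation; as a rough sketch of how per-task scores might roll up into the median / mean figures reported below, assume each dimension is scored 0–100 per task and the per-task composite is an unweighted mean of the three (an assumption made here for illustration, not the paper’s formula):

```python
from statistics import mean, median

# Illustrative per-task scores for one model (invented numbers, not from the paper).
# Each entry: (correctness, efficiency, quality), all on a 0-100 scale.
tasks = [
    (100.0, 100.0, 94.0),
    (100.0,  22.2, 91.5),
    ( 72.7,   0.0, 88.0),
]

# Assumption: composite = unweighted mean of the three dimensions for each task.
composites = [mean(dims) for dims in tasks]

columns = list(zip(*tasks)) + [tuple(composites)]
for name, scores in zip(("correctness", "efficiency", "quality", "composite"), columns):
    print(f"{name:>11}: median = {median(scores):6.1f}, mean = {mean(scores):6.1f}")
```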

What Did the Research Find?

We evaluated three leading reasoning-enhanced models: Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI O4-Mini-High.

Key finding: Models that excel at correctness don’t necessarily produce efficient or maintainable code.

Table VI. Overall model performance scores (median / mean ± SD for each model on the four COMPASS metrics).

Model          | Correctness       | Efficiency         | Quality            | Composite
Claude-Opus-4  | 100 / 72.2 ± 36.5 | 22.2 / 35.4 ± 39.1 | 93.8 / 92.3 ± 6.2  | 66.7 / 66.1 ± 23.8
Gemini-2.5-Pro | 100 / 93.8 ± 20.3 | 100 / 85.4 ± 30.3  | 94.5 / 93.2 ± 5.9  | 97.9 / 90.4 ± 17.1
O4-Mini-High   | 100 / 95.6 ± 17.4 | 100 / 93.0 ± 21.5  | 93.7 / 89.2 ± 12.9 | 97.5 / 92.3 ± 15.5

Claude Opus 4 achieved high code quality scores but struggled significantly with efficiency, producing solutions that frequently timed out on large inputs. O4-Mini-High demonstrated both the highest average performance and the most consistent results across repeated runs.

Statistical analysis confirmed that the three dimensions capture genuinely different aspects of model capability. Code quality, in particular, was found to be largely independent of both correctness (r = .089) and efficiency (r = .022).
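Those r values are Pearson correlations computed over per-task scores. Below is a minimal sketch of that kind of check, using scipy and invented numbers purely for illustration (the paper reports the actual analysis):

```python
from scipy.stats import pearsonr

# Invented per-task scores: correctness swings widely while quality stays stable,
# which is the pattern a near-zero correlation captures.
correctness = [100, 100, 40, 100, 0, 100, 60, 100]
quality     = [ 95,  80, 92,  88, 90,  97, 85,  91]

r, p_value = pearsonr(correctness, quality)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```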

Why Does This Matter for Engineering Teams?

If you’re using AI tools for code generation, or evaluating which models to adopt, correctness-only benchmarks give you an incomplete picture.

A model that scores 95% on HumanEval might still produce:

  • solutions that pass every test case yet time out on large, production-scale inputs
  • code that is difficult to read, maintain, or extend

COMPASS helps you understand these trade-offs before they appear in your codebase.

What’s Next for COMPASS?

This initial release focuses on Python solutions to algorithmic problems. Future development will expand to additional programming languages, more realistic multi-file project scenarios, and a wider range of models including open-source alternatives.

The full research paper, including detailed methodology, per-task breakdowns, and appendices with all 50 problem baselines, is available for download.

👉 Download the full COMPASS research paper (PDF)

For questions about the methodology or partnership enquiries, contact [email protected].

What is COMPASS?

COMPASS (COdility’s Multi-dimensional Programming ASSessment) is a benchmark for evaluating AI code generation across three dimensions: correctness, efficiency, and quality. It uses 50 competitive programming problems and 393,150 human submissions as an empirical baseline.