
RESEARCH

COMPASS: A Better Way to Evaluate AI Code Generation

COMPASS is a benchmark that evaluates AI code generation across three dimensions: correctness, efficiency, and quality. Unlike existing benchmarks that only test whether code works, COMPASS measures whether it scales under production load and whether it’s maintainable. It uses 50 competitive programming problems with 393,150 human submissions as an empirical baseline.

  • Current benchmarks only measure correctness, missing efficiency and code quality entirely
  • COMPASS evaluates three dimensions: correctness, efficiency, and quality
  • Testing three leading models, we found high correctness doesn’t guarantee efficient or maintainable code
  • O4-Mini-High scored highest overall (92.3%); Claude Opus 4 struggled with efficiency (35.4%) despite strong quality scores
  • Code quality is statistically independent of correctness (r = .089)

Assessing AI Code with COMPASS

Large language models can write code that passes tests. But does that code run efficiently? Is it maintainable? Current benchmarks can’t answer these questions because they only measure correctness. COMPASS changes that by evaluating AI-generated code across three dimensions: whether it works, whether it scales, and whether it’s well-written.

What’s Wrong with Current Code Generation Benchmarks?

Most AI code generation benchmarks, including HumanEval, MBPP, and HackerRank-ASTRA, evaluate models on a single criterion: functional correctness. If the code passes the test cases, it scores well.

This creates two blind spots:

Efficiency is ignored. A brute-force O(n³) solution scores the same as an optimal O(n log n) algorithm, provided both produce correct output. In production, that distinction is the difference between software that scales and software that crashes under load.
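To make that concrete, here is a minimal illustration (a hypothetical task, not one drawn from the COMPASS problem set): two Python solutions that count pairs summing to a target. Both produce identical output, so a correctness-only benchmark treats them as equal, but only the second scales to inputs near realistic constraint bounds.

```python
from collections import Counter

def count_pairs_bruteforce(nums, target):
    """O(n^2): checks every pair explicitly -- correct, but slow on large inputs."""
    count = 0
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                count += 1
    return count

def count_pairs_fast(nums, target):
    """O(n): counts complements seen so far with a hash map."""
    seen = Counter()
    count = 0
    for x in nums:
        count += seen[target - x]  # pairs formed with earlier elements
        seen[x] += 1
    return count

# Both agree on every input, so both score 100% on correctness...
assert count_pairs_bruteforce([1, 2, 3, 4, 5], 6) == count_pairs_fast([1, 2, 3, 4, 5], 6) == 2
# ...but only the O(n) version finishes within a strict runtime threshold at scale.
```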

Quality is invisible. Code that’s difficult to read, maintain, or extend receives no penalty. Nested complexity, poor naming, and lack of modularity go unmeasured, even though these factors directly impact long-term engineering productivity.
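Again a hypothetical illustration rather than a benchmark task: both functions below pass the same tests, but a static analysis tool such as CodeScene would flag the first for nesting depth and opaque naming, while the second is the kind of code a quality dimension rewards.

```python
def f(d):
    # Correct, but deeply nested with single-letter names.
    r = []
    for k in d:
        if d[k]:
            if isinstance(d[k], list):
                for x in d[k]:
                    if x > 0:
                        r.append((k, x))
    return r

def positive_entries(data: dict) -> list:
    """Return (key, value) pairs for every positive value in each list-valued entry."""
    return [
        (key, value)
        for key, values in data.items()
        if isinstance(values, list)
        for value in values
        if value > 0
    ]
```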

How Does COMPASS Work?

COMPASS (COdility’s Multi-dimensional Programming ASSessment), developed by Ph.D.-level assessment scientists, evaluates code generation models across three independent dimensions:

  1. Correctness: Does the code produce the right output? Scored as percentage of test cases passed, including edge cases and corner cases.
  2. Efficiency: Does the code scale? Tested against large inputs near upper constraint bounds, with strict runtime thresholds derived from expert reference solutions.
  3. Quality: Is the code maintainable? Analysed using CodeScene for complexity, readability, modularity, and adherence to best practices. Scored 1–100.

The benchmark consists of 50 competitive programming problems from real Codility competitions held between 2011 and 2021, with 393,150 human submissions providing empirical baselines.
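The paper specifies the exact scoring and aggregation; as a rough sketch of how per-task scores might roll up into the median / mean figures reported below, assume each dimension is scored 0–100 per task and the per-task composite is an unweighted mean of the three (an assumption made here for illustration, not the paper’s formula):

```python
from statistics import mean, median

# Illustrative per-task scores for one model (invented numbers, not from the paper).
# Each entry: (correctness, efficiency, quality), all on a 0-100 scale.
tasks = [
    (100.0, 100.0, 94.0),
    (100.0,  22.2, 91.5),
    ( 72.7,   0.0, 88.0),
]

# Assumption: composite = unweighted mean of the three dimensions for each task.
composites = [mean(dims) for dims in tasks]

columns = list(zip(*tasks)) + [tuple(composites)]
for name, scores in zip(("correctness", "efficiency", "quality", "composite"), columns):
    print(f"{name:>11}: median = {median(scores):6.1f}, mean = {mean(scores):6.1f}")
```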

What Did the Research Find?

We evaluated three leading reasoning-enhanced models: Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI O4-Mini-High.

Key finding: Models that excel at correctness don’t necessarily produce efficient or maintainable code.

Table VI. Overall model performance scores (median / mean ± SD for each model on the four COMPASS metrics).

Model          | Correctness       | Efficiency         | Quality            | Composite
Claude-Opus-4  | 100 / 72.2 ± 36.5 | 22.2 / 35.4 ± 39.1 | 93.8 / 92.3 ± 6.2  | 66.7 / 66.1 ± 23.8
Gemini-2.5-Pro | 100 / 93.8 ± 20.3 | 100 / 85.4 ± 30.3  | 94.5 / 93.2 ± 5.9  | 97.9 / 90.4 ± 17.1
O4-Mini-High   | 100 / 95.6 ± 17.4 | 100 / 93.0 ± 21.5  | 93.7 / 89.2 ± 12.9 | 97.5 / 92.3 ± 15.5

Claude Opus 4 achieved high code quality scores but struggled significantly with efficiency, producing solutions that frequently timed out on large inputs. O4-Mini-High demonstrated both the highest average performance and the most consistent results across repeated runs.

Statistical analysis confirmed that the three dimensions capture genuinely different aspects of model capability. Code quality, in particular, was found to be largely independent of both correctness (r = .089) and efficiency (r = .022).
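Those r values are Pearson correlations computed over per-task scores. Below is a minimal sketch of that kind of check, using scipy and invented numbers purely for illustration (the paper reports the actual analysis):

```python
from scipy.stats import pearsonr

# Invented per-task scores: correctness swings widely while quality stays stable,
# which is the pattern a near-zero correlation captures.
correctness = [100, 100, 40, 100, 0, 100, 60, 100]
quality     = [ 95,  80, 92,  88, 90,  97, 85,  91]

r, p_value = pearsonr(correctness, quality)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```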

Why Does This Matter for Engineering Teams?

If you’re using AI tools for code generation, or evaluating which models to adopt, correctness-only benchmarks give you an incomplete picture.

A model that scores 95% on HumanEval might still produce:

  • solutions that pass every test case yet time out on large, production-scale inputs
  • code that is difficult to read, maintain, or extend

COMPASS helps you understand these trade-offs before they appear in your codebase.

What’s Next for COMPASS?

This initial release focuses on Python solutions to algorithmic problems. Future development will expand to additional programming languages, more realistic multi-file project scenarios, and a wider range of models including open-source alternatives.

The full research paper, including detailed methodology, per-task breakdowns, and appendices with all 50 problem baselines, is available for download.

👉 Download the full COMPASS research paper (PDF)

For questions about the methodology or partnership enquiries, contact [email protected].

What is COMPASS?

COMPASS (COdility’s Multi-dimensional Programming ASSessment) is a benchmark for evaluating AI code generation across three dimensions: correctness, efficiency, and quality. It uses 50 competitive programming problems and 393,150 human submissions as an empirical baseline.