Why “Correct” Code Isn’t Enough: Rethinking AI Benchmarks with COMPASS

Most code generation benchmarks give AI too much credit. They score models as “perfect” if code passes functional tests, even when that code is slow or nearly impossible to maintain. For engineering leaders, this creates a dangerous illusion of readiness.

Codility developed COMPASS (COdility’s Multi-Dimensional Programming ASSessment) to close this gap. Built on 50 real programming problems and anchored by 393,150 human submissions, COMPASS evaluates AI-generated code across three dimensions: correctness, efficiency, and quality.

The findings are clear: correctness is not enough. Some leading models that appear strong on traditional benchmarks fail on scalability and maintainability — exactly where it matters most in production. COMPASS sets a new standard for understanding which models (and engineers) are truly ready for real-world development.

The Benchmarking Gap

For years, benchmarks like HumanEval, MBPP, and HackerRank-ASTRA have measured one thing: correctness. If the code produces the right output, the model “passes.”
This misses two realities every software engineer knows:

  • Efficiency matters. A brute-force algorithm may pass tests but grind to a halt on real data (the sketch after this list shows how).
  • Quality matters. Unreadable, unmaintainable code generates technical debt and slows teams down.

By overlooking these realities, existing benchmarks present a misleading picture of AI’s capability. They risk convincing organizations that models are production-ready when, in practice, they are not.
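
To make the efficiency point concrete, here is a minimal, hypothetical illustration (a toy problem, not an actual COMPASS task): two solutions that both pass a small functional test, while only one remains usable at production scale.

```python
import random
import time

def has_duplicate_bruteforce(values):
    # Correct, but O(n^2): compares every pair of elements.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False

def has_duplicate_linear(values):
    # Also correct, and O(n): tracks values already seen in a set.
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False

# Both implementations pass a small functional test...
sample = [3, 1, 4, 1, 5]
assert has_duplicate_bruteforce(sample) and has_duplicate_linear(sample)

# ...but only one stays practical on production-sized input.
# 20,000 distinct values is the worst case (a full scan) for both.
big = random.sample(range(10**8), 20_000)
for fn in (has_duplicate_linear, has_duplicate_bruteforce):
    start = time.perf_counter()
    fn(big)
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```

On the small sample both return the same answer; on tens of thousands of elements the quadratic version takes orders of magnitude longer. That is exactly the failure mode a correctness-only benchmark never surfaces.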

The COMPASS Framework

COMPASS addresses this by measuring performance across three independent axes:

  • Correctness — Does the solution run and handle edge cases?
  • Efficiency — Does it scale to large inputs under strict runtime thresholds?
  • Quality — Is it maintainable, modular, and aligned with best practices?

Unlike synthetic test sets, COMPASS is grounded in reality. It draws on 50 problems from Codility competitions and a human baseline of nearly 400,000 real submissions. Each AI solution is scored not only against ideal outputs but also against patterns of human success and failure.
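
The research paper describes the actual scoring procedure; the sketch below only illustrates the general idea of comparing a model’s per-dimension scores against a distribution of human scores on the same problem. The field names and numbers are invented for illustration, not COMPASS’s real schema or data.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class SolutionScores:
    # Hypothetical per-dimension scores in [0, 100]; illustrative only.
    correctness: float
    efficiency: float
    quality: float

def percentile_vs_humans(model_score: float, human_scores: list[float]) -> float:
    """Fraction of human submissions that the model's score meets or beats."""
    ranked = sorted(human_scores)
    return bisect_right(ranked, model_score) / len(ranked)

# Toy human baseline for one problem (the real baseline spans ~400,000 submissions).
human_correctness = [100, 80, 60, 100, 40, 100, 90, 20, 100, 70]
model = SolutionScores(correctness=90, efficiency=55, quality=75)

print(f"Correctness percentile vs humans: "
      f"{percentile_vs_humans(model.correctness, human_correctness):.0%}")
```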

Statistical validation shows that correctness, efficiency, and quality capture distinct dimensions of performance — not redundant signals. This means COMPASS reveals strengths and weaknesses hidden by single-score benchmarks.
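
As a rough intuition for what “distinct dimensions” means: if two axes were redundant, their scores would be highly correlated across solutions. The toy sketch below, using invented numbers rather than COMPASS data, computes the pairwise Pearson correlations one might inspect to check for redundancy.

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-solution scores, purely to illustrate the idea:
correctness = [100, 100, 90, 100, 80, 100]
efficiency  = [ 40, 100, 30,  60, 80,  50]
quality     = [ 70,  40, 90,  50, 60,  80]

print("correctness vs efficiency:", round(pearson(correctness, efficiency), 2))
print("correctness vs quality:   ", round(pearson(correctness, quality), 2))
print("efficiency vs quality:    ", round(pearson(efficiency, quality), 2))
```

Low correlations between axes mean each dimension adds information a single pass/fail score cannot capture.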

Key Findings

Our evaluation of leading reasoning-enhanced models revealed:

  • Correctness is table stakes. Most top models can produce working code.
  • Efficiency separates contenders from pretenders. Some models frequently produced code that was correct but impractically slow.
  • Quality is independent. Models could generate code that was correct and fast, yet poorly structured — or vice versa.

These results highlight an essential truth: a model’s readiness for production cannot be judged by correctness alone.

Why Codility?

Codility brings a unique combination of assets to AI benchmarking:

  • Depth of data. 393,150 human submissions provide a robust baseline.
  • Scientific rigor. Our methods mirror the psychometric foundations of human skills assessment.
  • Practical focus. Every metric — correctness, efficiency, quality — maps to real-world engineering concerns.

This is why enterprises trust Codility to evaluate not only developers, but also the AI tools those developers use.

The Bottom Line

Real-world development is multi-dimensional. Our benchmarks should be too.

COMPASS is not just another scorecard — it is a roadmap for evaluating what matters: code that is correct, efficient, and maintainable. By raising the bar, Codility helps organizations see through the hype and make confident choices about AI adoption and engineering skills.

👉 Read the full COMPASS research paper here.

👉 To learn more about how Codility’s research can help you evaluate both your engineering team and the AI tools they use, book a demo today.