APEX

About APEX Testing

What is APEX Testing?

APEX Testing is an automated benchmark for agentic AI coding models. Models receive real codebases with real bugs, real feature requests, and real refactoring tasks — then must read, understand, and edit code to solve them. No toy puzzles. No multiple choice. Just real engineering work.

ELO Rating System

Models are rated using a Bradley-Terry model with Item Response Theory (IRT) adjustments. When two models attempt the same task, the higher-scoring model wins the matchup. ELO updates account for task difficulty — beating a hard task contributes more than an easy one.

All models start at 1500 ELO. Category-specific ratings are tracked independently, so a model can be strong at debugging but weaker at frontend work.
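The rating update described above can be sketched as a standard Elo update with a difficulty multiplier. The K-factor of 32 and the linear difficulty scaling are assumptions for illustration; APEX's exact update rule and IRT adjustment are not specified here.

```python
import math

BASE_RATING = 1500  # starting rating for all models
K = 32              # hypothetical K-factor (assumed, not from APEX)

def expected_score(rating_a: float, rating_b: float) -> float:
    # Bradley-Terry / Elo win probability for model A against model B
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool,
           difficulty: float = 1.0) -> tuple[float, float]:
    # difficulty > 1.0 scales the update on harder tasks, so beating
    # a hard task moves ratings more than beating an easy one
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = K * difficulty * (s_a - e_a)
    return rating_a + delta, rating_b - delta
```

For example, two fresh models at 1500 each have a 50% expected win rate, so the winner of their first matchup on a normal-difficulty task gains 16 points and the loser drops 16.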

Scoring Weights

Criterion       Weight
Correctness     40%
Completeness    25%
Code Quality    20%
Efficiency      15%

Overall score = correctness × 0.40 + completeness × 0.25 + code_quality × 0.20 + efficiency × 0.15
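The weighted sum above translates directly into code. This is a minimal sketch assuming each criterion is scored on a 0-1 scale; the criterion names are taken from the table, but the dictionary shape is illustrative.

```python
# Weights from the Scoring Weights table
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.25,
    "code_quality": 0.20,
    "efficiency": 0.15,
}

def overall_score(scores: dict[str, float]) -> float:
    # Weighted sum of per-criterion scores (each assumed in [0, 1])
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
```

A run scoring 0.8 on correctness and 1.0 elsewhere comes out to 0.32 + 0.25 + 0.20 + 0.15 = 0.92.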

Multi-Judge Evaluation

Each run is evaluated by multiple LLM judges. Judges score independently against a task-specific rubric, and scores are aggregated to reduce individual judge bias.
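One way to aggregate independent judge scores is a per-criterion median, which limits the pull of a single outlier judge. The median is an assumption here; APEX does not state which aggregation function it uses.

```python
from statistics import median

def aggregate(judge_scores: list[dict[str, float]]) -> dict[str, float]:
    # Per-criterion median across judges; robust to one biased judge
    criteria = judge_scores[0].keys()
    return {c: median(j[c] for j in judge_scores) for c in criteria}
```

With three judges scoring correctness as 0.7, 0.9, and 0.8, the aggregated correctness score is 0.8 regardless of which judge skews high or low.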

Task Categories

Frontend

React, Next.js, CSS, accessibility, performance

Backend

APIs, databases, queues, caching, auth

Full-Stack

End-to-end features spanning client and server

Debugging

Race conditions, memory leaks, security vulns

Refactoring

Code cleanup, modularization, pattern migration

Code Review

Finding bugs, writing tests, security audits

From Scratch

Building new projects from requirements

Multi-Language

Cross-language ports and polyglot tasks

Built by

APEX Testing is built and maintained by HauhauCS. Got questions, feedback, or want to contribute? Reach out on Discord.

Discord: hauhau