About APEX Testing
What is APEX Testing?
APEX Testing is an automated benchmark for agentic AI coding models. Models receive real codebases with real bugs, real feature requests, and real refactoring tasks — then must read, understand, and edit code to solve them. No toy puzzles. No multiple choice. Just real engineering work.
ELO Rating System
Models are rated using a Bradley-Terry model with Item Response Theory (IRT) adjustments. When two models attempt the same task, the higher-scoring model wins the matchup. ELO updates account for task difficulty: a win on a hard task moves ratings more than a win on an easy one.
All models start at 1500 ELO. Category-specific ratings are tracked independently, so a model can be strong at debugging but weaker at frontend work.
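For intuition, here is a minimal sketch of what a difficulty-weighted ELO update can look like. The logistic expected-score formula is the standard Bradley-Terry form; the K factor and the way difficulty scales the update are illustrative assumptions, not APEX Testing's actual parameters.

```typescript
// Minimal sketch of a difficulty-weighted ELO update (illustrative only).
// The expected-score formula is standard Bradley-Terry/ELO; the K factor
// and difficulty scaling below are assumptions, not APEX's real values.

const K = 32; // hypothetical base update size

// Expected score for a model rated `a` against a model rated `b`.
function expectedScore(a: number, b: number): number {
  return 1 / (1 + Math.pow(10, (b - a) / 400));
}

// Update both ratings after a matchup. `difficulty` in [0, 1] stands in
// for an IRT-derived task parameter (assumed), so a hard task moves
// ratings more than an easy one.
function updateRatings(
  ratingA: number,
  ratingB: number,
  winnerIsA: boolean,
  difficulty: number,
): [number, number] {
  const expectedA = expectedScore(ratingA, ratingB);
  const actualA = winnerIsA ? 1 : 0;
  const k = K * (0.5 + difficulty); // difficulty-scaled update (assumption)
  const delta = k * (actualA - expectedA);
  return [ratingA + delta, ratingB - delta];
}

// Two models at the starting rating of 1500 contest a hard task:
const [newA, newB] = updateRatings(1500, 1500, true, 0.9);
console.log(newA, newB); // ~1522.4 and ~1477.6
```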
Scoring Weights
| Criterion | Weight |
|---|---|
| Correctness | 40% |
| Completeness | 25% |
| Code Quality | 20% |
| Efficiency | 15% |
Overall score = correctness × 0.40 + completeness × 0.25 + code_quality × 0.20 + efficiency × 0.15
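To make the weighting concrete, here is a small sketch of the overall-score computation. The weights come from the table above; scoring each criterion on a 0–1 scale is an assumption for illustration.

```typescript
// Weighted overall score using the published weights. Per-criterion
// scores on a 0-1 scale are an assumption for illustration.
interface CriterionScores {
  correctness: number;
  completeness: number;
  codeQuality: number;
  efficiency: number;
}

function overallScore(s: CriterionScores): number {
  return (
    s.correctness * 0.4 +
    s.completeness * 0.25 +
    s.codeQuality * 0.2 +
    s.efficiency * 0.15
  );
}

// Example: a run scoring 0.9 / 0.8 / 0.7 / 0.6 on the four criteria:
console.log(
  overallScore({ correctness: 0.9, completeness: 0.8, codeQuality: 0.7, efficiency: 0.6 }),
); // 0.36 + 0.20 + 0.14 + 0.09 = 0.79
```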
Multi-Judge Evaluation
Each run is evaluated by multiple LLM judges. Judges score independently against a task-specific rubric, and scores are aggregated to reduce individual judge bias.
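As a sketch of what aggregation could look like, the example below averages independent judge scores per rubric criterion. Using the mean is an assumption; the benchmark only states that scores are aggregated.

```typescript
// Aggregate independent judge scores per criterion. Averaging is an
// assumption; APEX only states that scores are aggregated to reduce bias.
// Assumes at least one judge and a shared rubric across judges.
type Rubric = Record<string, number>; // criterion name -> score

function aggregateJudges(judgeScores: Rubric[]): Rubric {
  const result: Rubric = {};
  for (const criterion of Object.keys(judgeScores[0])) {
    const sum = judgeScores.reduce((acc, s) => acc + s[criterion], 0);
    result[criterion] = sum / judgeScores.length;
  }
  return result;
}

// Three judges score the same run against the rubric:
console.log(
  aggregateJudges([
    { correctness: 0.9, completeness: 0.8 },
    { correctness: 0.85, completeness: 0.75 },
    { correctness: 0.95, completeness: 0.7 },
  ]),
); // { correctness: 0.9, completeness: 0.75 }
```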
Task Categories
| Category | Focus |
|---|---|
| Frontend | React, Next.js, CSS, accessibility, performance |
| Backend | APIs, databases, queues, caching, auth |
| Full-Stack | End-to-end features spanning client and server |
| Debugging | Race conditions, memory leaks, security vulnerabilities |
| Refactoring | Code cleanup, modularization, pattern migration |
| Code Review | Finding bugs, writing tests, security audits |
| From Scratch | Building new projects from requirements |
| Multi-Language | Cross-language ports and polyglot tasks |
Built by
APEX Testing is built and maintained by HauhauCS. Got questions, feedback, or want to contribute? Reach out on Discord.
Discord: hauhau