The ladder for AI operators.
1v1 ranked matches. Same model, same tools, hard token cap, hidden pytest as judge. Glicko-2 underneath, Bronze through Challenger on top. The first real way to find out who's actually good at using AI.
How a match works.
Match start
You queue up. A random task — a real Python bug — comes off the hidden pool. You see the symptom. Claude Haiku 4.5 is your only tool.
Direct the agent
You write prompts. Claude reads files, writes patches, lists the workspace, asks questions. Every token is counted live.
40K + 10:00
You have 40,000 tokens and ten minutes. Burn them on bad prompts and the buzzer cuts you off. Discipline is the meta.
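The two caps can be pictured as a small budget tracker. This is a minimal sketch, not the platform's harness: the class name `MatchBudget`, its methods, and the injectable `clock` parameter are all hypothetical, and only the 40,000-token and ten-minute limits come from the text above.

```python
import time


class MatchBudget:
    """Tracks the two caps named above: 40,000 tokens and ten minutes.

    Hypothetical helper for illustration; not the real harness API.
    """

    def __init__(self, token_cap=40_000, time_cap_s=600.0, clock=time.monotonic):
        self.token_cap = token_cap
        self.time_cap_s = time_cap_s
        self._clock = clock          # injectable so the timer can be tested
        self._start = clock()
        self.tokens_used = 0

    def charge(self, tokens):
        """Record the tokens spent by one prompt/response round trip."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.tokens_used += tokens

    @property
    def tokens_left(self):
        return max(0, self.token_cap - self.tokens_used)

    @property
    def seconds_left(self):
        elapsed = self._clock() - self._start
        return max(0.0, self.time_cap_s - elapsed)

    def buzzer(self):
        """True once either cap is exhausted."""
        return self.tokens_left == 0 or self.seconds_left == 0.0
```

A match loop would call `charge()` after each exchange and stop as soon as `buzzer()` returns true.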
Buzzer + score
When you submit (or the clock runs out), a hidden pytest suite runs against your final state. The player with more tests passed wins. Fewer tokens used breaks the tie.
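The scoring rule can be sketched as a two-key comparison. This assumes that more tests passed is better and that fewer tokens wins a tie; the `MatchResult` type, the `decide` function, and the draw outcome when both keys are equal are illustrative assumptions, not the platform's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchResult:
    player: str
    tests_passed: int   # hidden pytest tests that passed on the final state
    tokens_used: int    # total tokens spent during the match


def decide(a: MatchResult, b: MatchResult) -> Optional[str]:
    """Return the winner's name, or None if both keys are equal.

    Primary key: tests passed (higher is better).
    Tie-break: tokens used (lower is better, assumed).
    """
    key_a = (a.tests_passed, -a.tokens_used)
    key_b = (b.tests_passed, -b.tokens_used)
    if key_a == key_b:
        return None  # assumption: treated as a draw
    return a.player if key_a > key_b else b.player
```

Negating the token count lets a single tuple comparison handle both rules in order.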
Glicko-2 underneath.
Bronze through Challenger on top.
Every match updates your rating. Win and the number climbs. Lose to a stronger opponent and the hit is small. Beat a stronger opponent and the gain is large. Wins and losses against equally rated opponents move the number by a moderate amount.
Glicko-2 tracks how *certain* the system is about your rating. New players have wide error bars — your first few matches swing the number hard. Veterans have tight error bars and move in smaller steps.
At the top tiers, the number gets fierce. At the bottom, it's encouraging. The system is built to be honest about uncertainty.
starting rating: 1500 · placement window: first 10 matches
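For readers who want to see the mechanics, here is a compact sketch of the published Glicko-2 update for one rating period. It is not taken from the project's source. The starting rating of 1500 comes from the text above; the default rating deviation of 350, volatility of 0.06, and system constant `tau=0.5` are conventional Glicko-2 values assumed here, and the names `Rating` and `update` are hypothetical.

```python
import math
from dataclasses import dataclass

SCALE = 173.7178  # Glicko-2 conversion factor between the two rating scales


@dataclass
class Rating:
    r: float = 1500.0   # rating (starting value stated above)
    rd: float = 350.0   # rating deviation, the "error bar" (assumed default)
    vol: float = 0.06   # volatility (assumed default)


def _g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)


def update(player, results, tau=0.5, eps=1e-6):
    """results: list of (opponent Rating, score), score in {1, 0.5, 0}."""
    mu, phi = (player.r - 1500.0) / SCALE, player.rd / SCALE
    if not results:  # no games played: only the uncertainty grows
        return Rating(player.r, SCALE * math.sqrt(phi ** 2 + player.vol ** 2), player.vol)

    v_inv, diff = 0.0, 0.0
    for opp, score in results:
        mu_j, phi_j = (opp.r - 1500.0) / SCALE, opp.rd / SCALE
        g = _g(phi_j)
        e = 1.0 / (1.0 + math.exp(-g * (mu - mu_j)))  # expected score
        v_inv += g * g * e * (1.0 - e)
        diff += g * (score - e)
    v = 1.0 / v_inv
    delta = v * diff

    # Solve for the new volatility with the iterative procedure from the spec.
    a = math.log(player.vol ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        return num / (2.0 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    vol_new = math.exp(A / 2.0)

    # Inflate the deviation by the new volatility, then shrink it with the evidence.
    phi_star = math.sqrt(phi ** 2 + vol_new ** 2)
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    mu_new = mu + phi_new ** 2 * diff
    return Rating(SCALE * mu_new + 1500.0, SCALE * phi_new, vol_new)
```

With this sketch, a new player at the assumed deviation of 350 gains more from beating an equal than a veteran at a deviation of 50 does, and beating a higher-rated opponent moves the number more than beating a lower-rated one, which matches the behavior described above.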
What we're testing.
The first cohort
20 friends and early supporters. Real Anthropic inference, real hidden pytest, real Glicko-2. We pay the bill.
Two weekends
Each player plays ~30 matches per weekend. Different tasks, different opponents, different rating swings.
The honest test
We measure the correlation r between the rankings produced by the two weekends. If r ≥ 0.6, skill is real and the ladder opens to the public. If r < 0.4, we publish the result and move on.
The whole platform exists to answer one question: is AI orchestration a stable skill, or vibes? Either answer is useful. We just want a number.
Built in the open.
The full source is on GitHub. The task pool, the agent harness, the rating math, the Next.js app, the Python validators — everything.
The best way to make ranked AI real is to make the platform a public good. Build new tasks. Improve the harness. Fix bugs. Disagree with our design and propose better.
Build new tasks
Add bug-fix challenges to condition-a/tasks/. Validate locally, then open a PR. This is the most-needed contribution.
Tooling + UI
Streaming chat responses, replay viewer, faster prompt caching. The web app is in web/.
Rating-system research
Better matchmaking, multi-axis ratings, anti-collapse design. Start in src/llm_ranked/analysis.py.