RANKEDLLM v0.3
cohort 1 — closed beta open

The ladder for
AI operators.

1v1 ranked matches. Same model, same tools, hard token cap, hidden pytest as judge. Glicko-2 underneath, Bronze through Challenger on top. The first real way to find out who's actually good at using AI.

40,000 · token cap per match
10:00 · wall clock
200+ · hidden tasks
Haiku 4.5 · fixed substrate
up to ten minutes per match

How a match works.

01

Match start

You queue up. A random task — a real Python bug — comes off the hidden pool. You see the symptom. Claude Haiku 4.5 is your only tool.

02

Direct the agent

You write prompts. Claude reads files, writes patches, lists the workspace, asks questions. Every token is counted live.

03

40K + 10:00

You have 40,000 tokens and ten minutes. Burn them on bad prompts and the buzzer cuts you off. Discipline is the meta.
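
A minimal sketch of how the two caps could be tracked together. The 40,000-token and ten-minute limits come from the rules above; the MatchBudget class and its method names are hypothetical and are not taken from the actual harness.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

TOKEN_CAP = 40_000       # token cap per match, from the rules above
TIME_CAP_SECONDS = 600   # ten-minute wall clock, from the rules above


@dataclass
class MatchBudget:
    """Hypothetical budget tracker; illustrative only."""
    tokens_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        # Record tokens spent by one prompt/response round trip.
        self.tokens_used += tokens

    def tokens_left(self) -> int:
        return max(0, TOKEN_CAP - self.tokens_used)

    def seconds_left(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        return max(0.0, TIME_CAP_SECONDS - (now - self.started_at))

    def is_over(self, now: Optional[float] = None) -> bool:
        # The buzzer fires when either budget is exhausted.
        return self.tokens_left() == 0 or self.seconds_left(now) == 0.0
```

Whichever limit runs out first ends the match, so a player who spends tokens quickly can hit the buzzer long before the clock does.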

04

Buzzer + score

When you submit (or the clock runs out), a hidden pytest suite runs against your final state. The player with more tests passed wins. If both pass the same number, the player who used fewer tokens wins.
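
The scoring rule can be sketched as a two-part comparison. The Result type and decide function are hypothetical names, and treating an exact tie on both tests and tokens as a draw is an assumption; the rules above do not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    player: str
    tests_passed: int   # hidden pytest tests that passed
    tokens_used: int    # total tokens spent in the match


def decide(a: Result, b: Result) -> Optional[str]:
    """Return the winner's name, or None for a draw (assumed behavior)."""
    # More tests passed is better; fewer tokens is better, hence the minus sign.
    key_a = (a.tests_passed, -a.tokens_used)
    key_b = (b.tests_passed, -b.tokens_used)
    if key_a == key_b:
        return None
    return a.player if key_a > key_b else b.player
```

Because tests passed is compared first, token thrift only matters between players who fixed the same amount of the bug.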

the ladder

Glicko-2 underneath.
Bronze through Challenger on top.

Every match updates your rating. Win and the number climbs. Lose to a stronger opponent and the hit is small. Beat a stronger opponent and the gain is large. Against an equally rated opponent, a win or a loss moves the number by a moderate amount.

Glicko-2 tracks how *certain* the system is about your rating. New players have wide error bars — your first few matches swing the number hard. Veterans have tight error bars and move in smaller steps.
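
For readers who want to see the mechanics, here is a sketch of the standard Glicko-2 update as published by Mark Glickman. It is not the platform's rating code. The system constant TAU = 0.5, the default deviation of 350 and volatility of 0.06 used in the usage note, and the idea of treating a batch of matches as one rating period are all assumptions.

```python
import math

SCALE = 173.7178   # conversion factor between the rating scale and the Glicko-2 scale
TAU = 0.5          # system constant; an assumed value, not taken from the platform


def g(phi):
    # Discounts an opponent's impact according to how uncertain their rating is.
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected(mu, mu_j, phi_j):
    # Expected score against opponent j on the Glicko-2 scale.
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


def update(rating, rd, vol, results, tau=TAU, eps=1e-6):
    """results: list of (opp_rating, opp_rd, score), score in {1, 0.5, 0}.
    Returns (new_rating, new_rd, new_volatility)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    if not results:
        # No games played: only the uncertainty grows.
        return rating, SCALE * math.sqrt(phi ** 2 + vol ** 2), vol

    v_inv, s = 0.0, 0.0
    for r_j, rd_j, score in results:
        mu_j, phi_j = (r_j - 1500.0) / SCALE, rd_j / SCALE
        e, gj = expected(mu, mu_j, phi_j), g(phi_j)
        v_inv += gj ** 2 * e * (1.0 - e)
        s += gj * (score - e)
    v = 1.0 / v_inv        # estimated variance from game outcomes
    delta = v * s          # estimated rating improvement

    # Solve for the new volatility iteratively (Illinois method).
    a = math.log(vol ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        return num / (2.0 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    new_vol = math.exp(A / 2.0)

    # Fold the new volatility into the deviation, then update the rating.
    phi_star = math.sqrt(phi ** 2 + new_vol ** 2)
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    new_mu = mu + new_phi ** 2 * s
    return SCALE * new_mu + 1500.0, SCALE * new_phi, new_vol
```

Calling update(1500, 350, 0.06, [(1500, 50, 1)]) and update(1500, 50, 0.06, [(1500, 50, 1)]) for the same win shows the error-bar effect: the first call, with a wide deviation of 350, moves the rating much further than the second, with a tight deviation of 50.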

At the top tiers, the number gets fierce. At the bottom, it's encouraging. The system is built to be honest about uncertainty.

Challenger · 2200+
Master · 2000 – 2199
Diamond · 1800 – 1999
Platinum · 1600 – 1799
Gold · 1400 – 1599
Silver · 1200 – 1399
Bronze · Below 1200

starting rating: 1500 · placement window: first 10 matches
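
The tier bands map onto a simple lookup. The thresholds are the ones listed above; the tier_for function name is hypothetical, and placing fractional ratings (for example 2199.5) by their lower bound is an assumption.

```python
# (minimum rating, tier name), highest first; values from the tier list above.
TIERS = [
    (2200, "Challenger"),
    (2000, "Master"),
    (1800, "Diamond"),
    (1600, "Platinum"),
    (1400, "Gold"),
    (1200, "Silver"),
]


def tier_for(rating: float) -> str:
    """Return the tier whose floor the rating meets; Bronze below 1200."""
    for floor, name in TIERS:
        if rating >= floor:
            return name
    return "Bronze"
```

Under this mapping the starting rating of 1500 falls in Gold, so a new player can move in either direction during placement.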

cohort 1 · closed beta

What we're testing.

20
seed players

The first cohort

20 friends and early supporters. Real Anthropic inference, real hidden pytest, real Glicko-2. We pay the bill.

60
matches each

Two weekends

Each player plays ~30 matches per weekend. Different tasks, different opponents, different rating swings.

r
spearman

The honest test

We measure the Spearman correlation between the two ranking outputs. If r ≥ 0.6, skill is real and the ladder opens to the public. If r < 0.4, we publish the result and move on.
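
A sketch of the test in plain Python, with no external libraries. It assumes the two ranking outputs are two lists of per-player scores in the same player order (for example, ratings from each weekend, which is a guess). The function names are hypothetical. The 0.6 and 0.4 thresholds come from the text above; the text does not say what happens between them, so the sketch returns "unspecified" there.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def verdict(r):
    if r >= 0.6:
        return "open ladder"
    if r < 0.4:
        return "publish and move on"
    return "unspecified"  # the 0.4 to 0.6 band is not defined in the text
```

With 20 players the coefficient is computed over 20 paired values, so a single player swapping many places can shift r noticeably.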

The whole platform exists to answer one question: is AI orchestration a stable skill, or vibes? Either answer is useful. We just want a number.

MIT · open source

Built in the open.

The full source is on GitHub. The task pool, the agent harness, the rating math, the Next.js app, the Python validators — everything.

The best way to make ranked AI real is to make the platform a public good. Build new tasks. Improve the harness. Fix bugs. Disagree with our design and propose better.

Build new tasks

Add bug-fix challenges to condition-a/tasks/. Validate locally, open a PR. Most-needed contribution.

Tooling + UI

Streaming chat responses, replay viewer, faster prompt caching. The web app is in web/.

Rating-system research

Better matchmaking, multi-axis ratings, anti-collapse design. Start in src/llm_ranked/analysis.py.

Queue up.

20 spots. Two weekends. Inference on us. Find out where you actually rank.

Join the beta