OpenAI and Anthropic dropped their latest coding models practically at the same time: Codex 5.3 and Opus 4.6. So I did the obvious thing: made them fight.
This is how it went down: I pulled a few key sections from a real npm package's README (~1,500 chars) and used them as a spec. Each agent got the same prompt: implement this spec as a complete, publishable TypeScript repo. The spec describes monocrate, a monorepo publishing CLI we recently open-sourced.
I then fed the implementations produced by each agent, along with the existing monocrate codebase as a baseline, into a judging process. Seven LLMs judged every pairwise matchup, and each judge saw every pair twice, once in each presentation order, to reduce position bias. The question was deliberately simple: "which repo is a better starting point?", not "does it work?" A win means a judge thought the code was a stronger foundation. This keeps comparisons clean. Full methodology here.
Although simple, this setting — the task and its evaluation scheme — is a reliable yardstick for assessing the overall coding capabilities of the participating models.
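To make the scheme concrete, here's a minimal sketch of that judging loop in TypeScript. It's an illustration rather than the actual harness; the types and helper names (`Judge`, `runPairwiseJudging`, `winRate`) are hypothetical, and a judge call is assumed to resolve to the name of the repo the model picked.

```typescript
// Hypothetical sketch of the judging scheme, not the real harness.
// Every pair of candidate repos goes to every judge twice, once in each
// presentation order, to reduce position bias.

type Judge = (repoA: string, repoB: string) => Promise<string>; // resolves to the winner's name

interface Judgment {
  judge: string;
  pair: [string, string];
  winner: string;
}

async function runPairwiseJudging(
  repos: string[],
  judges: Record<string, Judge>
): Promise<Judgment[]> {
  const judgments: Judgment[] = [];
  for (let i = 0; i < repos.length; i++) {
    for (let j = i + 1; j < repos.length; j++) {
      for (const [name, judge] of Object.entries(judges)) {
        // Pass 1: original presentation order.
        judgments.push({
          judge: name,
          pair: [repos[i], repos[j]],
          winner: await judge(repos[i], repos[j]),
        });
        // Pass 2: same pair, order swapped.
        judgments.push({
          judge: name,
          pair: [repos[j], repos[i]],
          winner: await judge(repos[j], repos[i]),
        });
      }
    }
  }
  return judgments;
}

// Win rate = share of the judgments a repo appeared in that it won.
function winRate(repo: string, judgments: Judgment[]): number {
  const appeared = judgments.filter((x) => x.pair.includes(repo));
  const won = appeared.filter((x) => x.winner === repo).length;
  return won / appeared.length;
}
```

With five entrants and seven judges, each repo appears in 4 × 7 × 2 = 56 judgments, which is where the "out of 56" counts below come from.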
Leaderboard
| Rank | Agent | Win % |
|---|---|---|
| 1 | Baseline (human + Opus 4.5, iterative) | 79% |
| 2 | Codex 5.3 | 63% |
| 3 | Opus 4.6 | 43% |
| 4 | Codex 5.2 | 41% |
| 5 | Opus 4.5 | 25% |
Takeaways
Codex 5.3 takes it. It was declared the winner in 35 of its 56 judgments (four opponents, seven judges, two orderings apiece), more than any competitor besides the baseline. In the direct matchup against Opus 4.6 it won 10-4, and it wasn't just Opus 4.6: Codex 5.3 beat every competitor head-to-head. A clear winner. Here's the full head-to-head breakdown:
| Matchup | Result |
|---|---|
| Codex 5.3 vs. Opus 4.6 | Codex 5.3 wins 10-4 |
| Codex 5.3 vs. Codex 5.2 | Codex 5.3 wins 9-5 |
| Codex 5.3 vs. Opus 4.5 | Codex 5.3 wins 12-2 |
| Opus 4.6 vs. Codex 5.2 | Tie 7-7 |
| Opus 4.6 vs. Opus 4.5 | Opus 4.6 wins 9-5 |
| Codex 5.2 vs. Opus 4.5 | Codex 5.2 wins 8-6 |
Opus 4.6 and Codex 5.2 are practically tied. Opus 4.6's overall win rate is 2 percentage points higher (43% vs. 41%), but their direct matchup split 7-7: Anthropic's latest model landed dead even with OpenAI's previous generation.
Within each vendor, the generational jump is clear: Codex 5.3 beat Codex 5.2 (9-5), and Opus 4.6 beat Opus 4.5 (9-5).
Bottom line
So we have a winner, and it's Codex 5.3. Opus 4.6 trails it by a wide margin: 43% vs. 63% overall, and 4-10 in the head-to-head.
And while this looks like a one-shot benchmark, the judging question ("which repo is a better starting point?") means the results apply more broadly. We're measuring the quality of the head start you get, which matters whether you're shipping on the first try or settling in for a many-shot session.
monocrate is MIT licensed. Judged by GPT-5 Mini, Claude Sonnet 4.5, DeepSeek v3.2, Gemini 2.5 Flash, Devstral 2512, Sonar Pro, and Qwen3 Coder 30B.