OpenAI and Anthropic dropped their latest coding models practically at the same time: Codex 5.3 and Opus 4.6. So I did the obvious thing: made them fight.
This is how it went down: I pulled a few key sections from a real npm package's README (~1,500 chars) and used them as a spec. Each agent got the same prompt: implement this spec as a complete, publishable TypeScript repo. The spec describes monocrate, a monorepo publishing CLI we recently open-sourced.
I then fed the implementations produced by each agent, along with the existing monocrate codebase as a baseline, into a judging process. Seven LLMs judged every pairwise matchup, and each judge saw every pair twice, once in each presentation order, to reduce position bias. The question was deliberately simple: "which repo is a better starting point?", not "does it work?" A win means a judge thought the code was a stronger foundation. This keeps comparisons clean. Full methodology here.
Although simple, this setting — the task and its evaluation scheme — is a reliable yardstick for assessing the overall coding capabilities of the participating models.
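To make the scheme concrete, here's a minimal sketch of that judging loop in TypeScript. It's an illustration rather than the actual harness; the types and helper names (`Judge`, `runPairwiseJudging`, `winRate`) are hypothetical, and a judge call is assumed to resolve to the name of the repo the model picked.

```typescript
// Hypothetical sketch of the judging scheme, not the real harness.
// Every pair of candidate repos goes to every judge twice, once in each
// presentation order, to reduce position bias.

type Judge = (repoA: string, repoB: string) => Promise<string>; // resolves to the winner's name

interface Judgment {
  judge: string;
  pair: [string, string];
  winner: string;
}

async function runPairwiseJudging(
  repos: string[],
  judges: Record<string, Judge>
): Promise<Judgment[]> {
  const judgments: Judgment[] = [];
  for (let i = 0; i < repos.length; i++) {
    for (let j = i + 1; j < repos.length; j++) {
      for (const [name, judge] of Object.entries(judges)) {
        // Pass 1: original presentation order.
        judgments.push({
          judge: name,
          pair: [repos[i], repos[j]],
          winner: await judge(repos[i], repos[j]),
        });
        // Pass 2: same pair, order swapped.
        judgments.push({
          judge: name,
          pair: [repos[j], repos[i]],
          winner: await judge(repos[j], repos[i]),
        });
      }
    }
  }
  return judgments;
}

// Win rate = share of the judgments a repo appeared in that it won.
function winRate(repo: string, judgments: Judgment[]): number {
  const appeared = judgments.filter((x) => x.pair.includes(repo));
  const won = appeared.filter((x) => x.winner === repo).length;
  return won / appeared.length;
}
```

With five entrants and seven judges, each repo appears in 4 × 7 × 2 = 56 judgments, which is where the "out of 56" counts below come from.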
Leaderboard
| Rank | Agent | Win % |
|---|---|---|
| 1 | Baseline (human + Opus 4.5, iterative) | 79% |
| 2 | Codex 5.3 | 63% |
| 3 | Opus 4.6 | 43% |
| 4 | Codex 5.2 | 41% |
| 5 | Opus 4.5 | 25% |
Takeaways
Codex 5.3 takes it. It was declared the winner in 35 of its 56 judgments (four opponents, seven judges, two orderings apiece), more than any competitor besides the baseline. In the direct matchup against Opus 4.6 it won 10-4, and it wasn't just Opus 4.6: Codex 5.3 beat every competitor head-to-head. A clear winner. Here's the full head-to-head breakdown:
| Matchup | Result |
|---|---|
| Codex 5.3 vs. Opus 4.6 | Codex 5.3 wins 10-4 |
| Codex 5.3 vs. Codex 5.2 | Codex 5.3 wins 9-5 |
| Codex 5.3 vs. Opus 4.5 | Codex 5.3 wins 12-2 |
| Opus 4.6 vs. Codex 5.2 | Tie 7-7 |
| Opus 4.6 vs. Opus 4.5 | Opus 4.6 wins 9-5 |
| Codex 5.2 vs. Opus 4.5 | Codex 5.2 wins 8-6 |
Opus 4.6 and Codex 5.2 are practically tied. Opus 4.6's overall win rate is 2 percentage points higher (43% vs. 41%), but their direct matchup split 7-7: Anthropic's latest model landed dead even with OpenAI's previous generation.
Within each vendor, the generational jump is clear: Codex 5.3 beat Codex 5.2 (9-5), and Opus 4.6 beat Opus 4.5 (9-5).
Bottom line
So we have a winner, and it's Codex 5.3. Opus 4.6 trails it by a wide margin: 43% vs. 63% overall, and 4-10 in the head-to-head.
And while this looks like a one-shot benchmark, the judging question ("which repo is a better starting point?") means the results apply more broadly. We're measuring the quality of the head start you get, which matters whether you're shipping on the first try or settling in for a many-shot session.
monocrate is MIT licensed. Judged by GPT-5 Mini, Claude Sonnet 4.5, DeepSeek v3.2, Gemini 2.5 Flash, Devstral 2512, Sonar Pro, and Qwen3 Coder 30B.