After nearly 2 years of AI-assisted development — from ChatGPT 3.5 to Claude Code — I kept hitting the same problem: every model makes mistakes it ...
The observation that each model has consistent failure patterns rather than random bugs is underappreciated. Claude getting sloppy on error handling in long contexts, Codex over-engineering abstractions: these are predictable weaknesses, which means a second agent can be specifically tuned to watch for them.
We run a multi-agent setup ourselves and the "grading your own exam" problem is real. The manual copy-paste phase you describe is painfully familiar. Automating the review loop is the obvious next step but the hard part is knowing when to stop iterating. Two agents can get into an infinite refinement cycle if you're not careful. Did you find a good heuristic for convergence?
Great question. Our current approach has a few layers, with a rough sketch of the loop after them:
Tag-based convergence. Each round is tagged — [CODE], [PASS], [NEEDS_WORK], [CONSENSUS]. The loop stops when the reviewer issues [PASS] with an explicit reason. No rubber-stamp passes allowed — the policy requires at least one concrete justification.
Round cap. After 5 rounds without consensus, the system flags it for human intervention. You can [OVERRIDE] or [HANDOFF]. In practice most tasks converge in 2-3 rounds.
The real heuristic is upfront alignment. Honestly, with Claude Code + Codex the ping-pong refinement loop is rare. The more common failure mode is direction being wrong from the start. So for complex tasks, I align on goals and approach with the author agent first — before the loop even begins. If the direction is right, convergence comes naturally. If it's wrong, no amount of review rounds will fix it.
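To make the mechanics concrete, here's a minimal TypeScript sketch of that tag-based loop with the round cap. It's an illustration, not the package's actual code; the author and reviewer adapters are hypothetical placeholders for the Claude Code / Codex calls.

```typescript
// Minimal sketch of the tag-based convergence loop (illustrative only).
// `author` and `reviewer` are hypothetical adapters around the two agents.

type Tag = "CODE" | "PASS" | "NEEDS_WORK" | "CONSENSUS";

interface Turn {
  tag: Tag;
  body: string;      // generated code or review feedback
  reason?: string;   // a [PASS] must carry a concrete justification
}

type AuthorFn = (task: string, feedback?: string) => Promise<Turn>;
type ReviewerFn = (code: string) => Promise<Turn>;

const MAX_ROUNDS = 5; // past this, flag for human intervention ([OVERRIDE] / [HANDOFF])

async function converge(
  task: string,
  author: AuthorFn,
  reviewer: ReviewerFn,
): Promise<Turn | "NEEDS_HUMAN"> {
  let feedback: string | undefined;

  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const draft = await author(task, feedback); // author emits [CODE]
    const review = await reviewer(draft.body);  // reviewer emits [PASS] or [NEEDS_WORK]

    // No rubber-stamp passes: a [PASS] without an explicit reason does not end the loop.
    if (review.tag === "PASS" && review.reason) {
      return draft;
    }
    feedback = review.body; // carry the critique into the next round
  }
  return "NEEDS_HUMAN"; // round cap hit without consensus
}
```

Making the escalation path an explicit return value is what keeps the two agents from drifting into an endless refinement cycle.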
Curious about your setup — are you using different models for each role, or the same model in different contexts?
The AI-reviews-AI loop is underrated. Biggest issue I've hit with single-model generation is that the model gets anchored on its own assumptions — a second model asking "why did you do it this way?" catches stuff that linting never will.
Curious about the cost side though. Running two models per commit adds up fast if you're shipping multiple times a day. Have you found the quality improvement offsets the token spend, or do you gate it to only run on certain file types?
We've been automating a lot of our content pipeline the same way — one AI generates, another critiques. Built postammo.com around that idea for social media content specifically. The adversarial review step made the output way less generic.
I'm on max plans for both models, and token usage looks comfortable, so I'd say the cost is manageable. Since I don't come from a Node.js background, the dual-agent review loop is really my main way of catching deep issues I wouldn't spot just by reading the code.
Great question though — I'll add a token tracking module to the package so we can get accurate numbers instead of guessing. Thanks for raising it.
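Roughly the shape I'm picturing for it, sketched below; the names and the cost formula are illustrative, nothing final:

```typescript
// Hypothetical sketch of a per-round token tracker -- not the package's actual API.

interface TokenUsage {
  round: number;
  role: "author" | "reviewer";
  inputTokens: number;
  outputTokens: number;
}

class TokenTracker {
  private entries: TokenUsage[] = [];

  record(entry: TokenUsage): void {
    this.entries.push(entry);
  }

  totals(): { input: number; output: number } {
    return this.entries.reduce(
      (acc, e) => ({
        input: acc.input + e.inputTokens,
        output: acc.output + e.outputTokens,
      }),
      { input: 0, output: 0 },
    );
  }

  // Rough cost estimate given per-million-token prices supplied by the caller.
  estimateCost(inputPricePerM: number, outputPricePerM: number): number {
    const { input, output } = this.totals();
    return (input / 1e6) * inputPricePerM + (output / 1e6) * outputPricePerM;
  }
}
```

Recording entries per round and per role should also make it easy to see how much the review rounds cost relative to the initial generation.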
And applying the dual-agent pattern to content creation is a great idea. The same principle should work well there.
Good point! I also see that AI bots stick to their initial assumptions, doing effective work (sometimes) while still consistently moving in the wrong direction. The few cases where "vibe coding" worked for me as a senior went like this:
Claude successfully set up working MVP code for a classic WordPress plugin and a similar Chrome browser extension. Copilot did a code review and found two potential security issues in the extension code. The initial concept had been drafted by another AI agent from a random idea that popped up in a casual conversation. In the end it turned out it wasn't worth any more effort to finish and publish the browser extension, because the requirements were already based on flawed assumptions.
Thus, pair programming at the code level isn't enough, but maybe it's still better than naive vibe coding without any external challenger at all.
This is a really important point and honestly something I’m still figuring out. Code-level review catches bugs, but it can’t fix flawed assumptions — and we’ve run into that too.
What’s helped so far is aligning on goals and approach with the author agent before the review loop starts. The loop handles code quality, but the direction-setting has to happen upfront. When I skip that step, I get exactly what you described — polished code built on flawed premises.
Your Chrome extension story resonates. A reviewer agent would’ve caught the security issues, but wouldn’t have questioned whether the extension should exist at all. That kind of judgment is still on us.
So yeah — pair programming at the code level is necessary but not sufficient. Still learning where the boundaries are.
This resonates a lot. I've been building side projects with Claude Code for months now and honestly the biggest lesson was exactly this - one agent reviewing its own output is like proofreading your own essay, you just gloss over things. The characteristic failure patterns you describe are spot on too, Claude getting sloppy on error handling in long contexts is something I hit constantly.

I ended up building a security scanning step into my workflow for similar reasons - not a second agent exactly, but a dedicated pass that only looks for vulnerabilities and missed edge cases. Caught stuff I would have shipped otherwise.

Curious though - do you find the reviewer agent sometimes introduces new issues? Like overcorrecting or suggesting refactors that break the original intent? That's been my experience when I tried having a second model do full rewrites instead of just flagging problems.
Yeah, we ran into this early on too. Our solution was to let the author agent challenge the reviewer’s feedback before changing any code. So instead of blindly applying every suggestion, they discuss it first — the reviewer flags an issue, the author can push back with reasoning, and they go back and forth until they reach consensus. Only then does the code get modified.
In practice, most of the time the author accepts the reviewer’s feedback and makes the fix. But sometimes it pushes back and holds its ground — and often it’s right to. That back-and-forth filters out the overcorrections before they ever touch the code.
We have a 5-round cap on any single disagreement — if they can’t reach consensus, it escalates to human judgment. Hasn’t happened yet though. Turns out when both agents have to justify their position, they converge pretty quickly.
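For the curious, here's a hedged sketch of how that debate step could be structured; the verdict names and agent adapters are made up for illustration, not the actual implementation:

```typescript
// Sketch of the challenge step: a reviewer finding is debated before any code changes.
// The author can accept or push back; the reviewer can hold its ground or concede.

type Verdict = "ACCEPT" | "PUSH_BACK" | "CONCEDE";

interface Exchange {
  verdict: Verdict;
  reasoning: string;
}

type AuthorChallenge = (finding: string, history: Exchange[]) => Promise<Exchange>;
type ReviewerReply = (challenge: string, history: Exchange[]) => Promise<Exchange>;

const MAX_DEBATE_ROUNDS = 5; // unresolved disagreements escalate to a human

async function resolveFinding(
  finding: string,
  author: AuthorChallenge,
  reviewer: ReviewerReply,
): Promise<"APPLY_FIX" | "DROP_FINDING" | "ESCALATE"> {
  const history: Exchange[] = [];

  for (let round = 1; round <= MAX_DEBATE_ROUNDS; round++) {
    const authorTurn = await author(finding, history);
    history.push(authorTurn);
    if (authorTurn.verdict === "ACCEPT") return "APPLY_FIX"; // author agrees, fix the code

    const reviewerTurn = await reviewer(authorTurn.reasoning, history);
    history.push(reviewerTurn);
    if (reviewerTurn.verdict === "CONCEDE") return "DROP_FINDING"; // reviewer withdraws the finding
  }
  return "ESCALATE"; // no consensus after the round cap: human judgment
}
```

The key design choice is that "APPLY_FIX" is the only path that touches code; everything else is just conversation.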
Your security scanning step is smart though — a focused pass for a specific concern. General review + specialized scan is probably the strongest combination. Something I’d like to explore adding to the loop as well.
That author-challenges-reviewer step is a really elegant fix. Way cleaner than trying to tune the reviewer to be less aggressive upfront - you're basically adding a negotiation layer before any code changes happen, which is a smarter place to put the friction. The 5-round cap with human escalation is a nice touch too, love systems that have a clear fallback rather than looping forever. Definitely stealing that idea.
Really resonates with this. I've gone through a similar journey — started with AI writing code, then realized the real leverage is AI validating code.
We built a TDD pipeline where the AI writes tests first, implements, then a separate agent reviews. The meta-insight: the most impactful automation isn't generating code faster, it's catching bad code before it ships.
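In sketch form, the ordering is basically this (placeholder stage names and agent calls, not our actual code):

```typescript
// Rough sketch of a test-first pipeline ordering -- illustrative placeholders only.

type Stage = "WRITE_TESTS" | "IMPLEMENT" | "REVIEW";

async function tddPipeline(
  spec: string,
  agents: Record<Stage, (input: string) => Promise<string>>,
): Promise<{ tests: string; code: string; review: string }> {
  const tests = await agents.WRITE_TESTS(spec); // tests come first, derived from the spec
  const code = await agents.IMPLEMENT(tests);   // implementation targets the tests
  const review = await agents.REVIEW(code);     // a separate agent reviews the result
  return { tests, code, review };
}
```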
What's your approach to preventing AI-generated false confidence? (e.g., code that looks right but has subtle logic bugs)
Your TDD pipeline sounds really solid — tests first is the right instinct.
For false confidence, our main defense is the challenge mechanism. The reviewer doesn’t just say “looks good” — it has to give a concrete reason for passing. And the author can challenge the reviewer’s feedback too, so they actually debate before code gets modified. That back-and-forth surfaces the subtle stuff that a single pass misses.
But honestly, the biggest guard against false confidence is the human in the loop. Both agents can confidently agree on code that solves the wrong problem. I align on goals and approach with the author agent before the loop starts — that’s where the most dangerous false confidence gets caught. Code-level review handles bugs. Direction-setting handles “looks right but fundamentally wrong.”
Still an unsolved problem though. Would be curious how your TDD approach handles cases where the tests themselves encode the wrong assumptions.
Really resonates with this. I went down a similar rabbit hole — built 80+ automation scripts in a week, then realized the real win wasn't the scripts themselves, but having an AI agent that could compose them together.
The quality angle is interesting. What metrics do you use to measure if the automation actually improved code quality vs just speed?
No formal metrics yet. I recall seeing a multi-agent coding framework on GitHub that improved pass rates from around 80% to over 90% on standard benchmarks by adding specialized review and testing agents. But those are algorithmic benchmarks, not real-world development tasks. I'd like to build a test suite based on actual project work and measure the difference properly. Great question, and thanks for pushing me to think about this.
Codex is getting very close to autonomous development.