Sakiharu
After 2 years of AI-assisted coding, I automated the one thing that actually improved quality: AI Pair Programming

After nearly 2 years of AI-assisted development — from ChatGPT 3.5 to Claude Code — I kept hitting the same problem: every model makes mistakes it can't catch. Inspired by pair programming and the Ralph Loop, I built a dual-agent workflow where one agent writes and another reviews. Last week, a PR written entirely by the two agents got merged into a 15k-star open source Electron project after 3 rounds of maintainer feedback. I don't write TypeScript.

The problems I kept finding

I've been doing AI-assisted programming for almost 2 years now. Started with ChatGPT 3.5 generating snippets, moved through Claude, Cursor, TRAE, and eventually fell in love with Claude Code.
From the very beginning, I noticed every model and every agent has its own characteristic problems. Not random bugs — consistent patterns of failure:

Claude Code skips error handling when context gets long. It's brilliant at architecture but gets sloppy on defensive code in later turns.
Codex over-engineers abstractions but catches edge cases Claude misses.
Gemini struggles with complex multi-file changes.
Cursor has context dependency issues — works great in small scope, gets confused across files.

The severity varies, but the pattern is the same: a single agent can't reliably catch its own mistakes. It writes code AND judges whether that code is good — like grading your own exam.
Every developer knows this problem has a name. It's called "why we do code review."

Pair programming, but with AIs

Pair programming was formalized by Kent Beck as part of Extreme Programming (XP) in the late 1990s — one of the most influential practices to come out of the agile movement. The core idea is simple: two developers at one workstation, one drives, one navigates. The navigator catches mistakes in real time, questions design decisions, and keeps the big picture in focus. Research has consistently shown it produces fewer defects and better designs, despite appearing to "waste" half your developers.
The same principle applies to AI agents. If one agent writes and another watches, you catch more bugs.
So that's what I started doing — manually. Way back when I was using Claude (the chat version, before Claude Code), I would take Claude's output, paste it into ChatGPT, ask ChatGPT to review it, then bring the feedback back. Primitive, but it worked better than trusting either one alone.
When Claude Code and Codex CLI came along, the workflow got more serious. Claude Code writes code, I copy the diffs to Codex, Codex reviews and flags issues, I bring the feedback back to Claude Code. Rinse and repeat.
This manual cross-agent coordination worked. But it was slow, repetitive, and cognitively draining. The worst part: it was easy to skip when tired. You tell yourself "this change looks fine, I'll skip the review step" — and that's always the change that bites you.

Automating the loop

Then I discovered the Ralph Loop (by Geoffrey Huntley) — the concept of wrapping a coding agent in an external loop so it keeps iterating. Powerful idea, and it gave me the push to automate my dual-agent workflow.
But the Ralph Loop team has been transparent about some limitations. It works great for greenfield projects with clear completion criteria. It's harder with legacy codebases, complex refactoring, or multi-step tasks where you need checkpoints along the way.
That matched my experience. I wasn't building new projects from scratch — I was forking and deeply modifying an existing large Electron app. I needed something that could handle ambiguity, maintainer feedback, and incremental consensus.
So I built a structured loop: one agent (Claude Code) writes, another (Codex) reviews, they take turns, and neither moves forward until both agree. I sit in the middle as tech lead — setting scope, making architecture calls, breaking ties.
The efficiency jumped immediately. Not because the agents got smarter, but because the review discipline became automatic instead of depending on my willpower at 2am.
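To make that concrete, here is a minimal TypeScript sketch of the loop's shape. It is not the actual implementation; author and reviewer are placeholders for however you invoke Claude Code and Codex, and the consensus rule and round cap are simplified.

// Minimal sketch of the dual-agent loop; not the real implementation.
// "author" and "reviewer" stand in for whatever wraps the two CLI agents.
type Verdict = { pass: boolean; reason: string; issues: string[] };
type Author = (task: string, feedback: string[]) => Promise<string>; // returns a diff
type Reviewer = (diff: string) => Promise<Verdict>;

async function dualAgentLoop(
  task: string,
  author: Author,
  reviewer: Reviewer,
  maxRounds = 5,
): Promise<string> {
  let feedback: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const diff = await author(task, feedback);   // author writes or revises
    const verdict = await reviewer(diff);        // reviewer must justify a pass
    if (verdict.pass && verdict.reason.trim().length > 0) return diff;
    feedback = verdict.issues;                   // concrete issues feed the next round
  }
  throw new Error("No consensus within the round cap: escalate to the human tech lead");
}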

The real test: my first open source PR

I'd been using this workflow to fork AionUI (~15k ⭐ Electron + React app) into an internal AI assistant for my company. 30 commits, zero manual code. Full rebrand, core engine rewrite, database migration, CI/CD rebuild — the whole thing done through the dual-agent loop.
During that work, the agents found a real upstream bug: orphan CLI processes that linger when you kill a conversation using ACP agents. I submitted a PR back to AionUI.
The maintainer reviewed it and came back with 3 issues (the first two are sketched in code after the list):

Double super.kill() race condition — needed an idempotent guard
Swallowed errors — .catch(() => {}) should log warnings
treeKill discrepancy — the PR description didn't match upstream's actual implementation
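
For context, here is a hypothetical TypeScript sketch of the shape of the first two fixes. None of these names (BaseAgent, AcpConnection, cleanup) are AionUI's real identifiers; only the pattern reflects the review feedback: an idempotent guard around kill(), and a logged warning instead of an empty catch.

// Hypothetical names throughout; this is not the code from the actual PR.
class BaseAgent {
  kill(): void {
    // terminate the underlying CLI child process
  }
}

class AcpConnection extends BaseAgent {
  private killed = false;

  // Fix 1: idempotent guard, so a racing second kill() is a no-op
  // instead of running super.kill() twice.
  kill(): void {
    if (this.killed) return;
    this.killed = true;
    super.kill();
  }

  // Fix 2: don't swallow errors with .catch(() => {}); log a warning instead.
  dispose(): void {
    this.cleanup().catch((err) =>
      console.warn("[acp] cleanup after kill failed:", err),
    );
  }

  private async cleanup(): Promise<void> {
    // release resources tied to the conversation
  }
}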

I pointed the two agents at the maintainer's feedback and let them work. The author agent analyzed the issues, wrote the fixes, ran tests (133/133 passing). The reviewer agent reviewed the diffs, verified correctness, confirmed types were clean. A few rounds of back-and-forth. I watched but didn't write code.
Merged. "LGTM — all three review feedback items properly addressed."
This was my first ever PR submitted and merged into someone else's project. I'm a 30-year software veteran — but I spent the last 25 years on product and business, not writing code. I don't write TypeScript. AI tools pulled me back into development, and the dual-agent loop made it possible for me to contribute real fixes to a real project.

Independent convergence

After I posted about this, another developer (Hwee-Boon Yar, indie dev, also 30 years experience) shared a similar approach — a skill that shells out to a second agent for review, loops until the reviewer has nothing left to flag. Lighter than mine, works within a single session. Different trade-off, same core insight.
Multiple people are independently arriving at this: one agent is not enough. You need a second pair of eyes.

Limitations

This is not a magic solution. Here's what doesn't work:
Agent crashes have no auto-recovery. When an agent dies mid-session, the loop stops. You restart manually. No self-healing yet.
Wasted rounds. Sometimes the agents ping-pong — a fix introduces a new issue, review catches it, the next fix introduces another issue. You have to step in and reset scope.
Context window — but with a twist. Quality degrades in long sessions, and when an agent compresses its context, information gets lost. But here's where the dual-agent setup actually helps: when one agent's context is compressed and loses details, the other agent still remembers. They don't share the same context window, so they don't lose the same information at the same time. This is an unexpected architectural advantage. I'm thinking about building shared memory management across agents in future versions — so they can explicitly share what each has forgotten.
Two AIs can happily agree on a bad design. Without domain judgment from a human, this is just two agents rubber-stamping each other. The human arbiter is not optional.
This is not autonomous development. It is structured AI-assisted development. The distinction matters.

The deeper question

The AI coding conversation is too focused on generation and not enough on review. Everyone's benchmarking how fast and how much code models can produce. Nobody's asking: who checks it?
If AI code needs structured critique — the same way human code has always needed code review — then the question is: how do you build review discipline into AI workflows?

Just shipped v0.3.0

I've incorporated what I learned from the AionUI PR process and released a new version. Key stuff:

npm i -g ralph-lisa-loop
Works with Claude Code (Ralph) + Codex CLI (Lisa)
Turn control, tag system, consensus protocol, policy checks
Auto mode via tmux (experimental)
Agent-agnostic in principle — any two CLI agents can fill the roles

Early stage. I'm using it daily for real work, not demos.
Repo:
If you've been doing AI coding and hitting that frustrating "almost right, but not quite" problem — you're not alone. This might help, or at least give you ideas for your own approach.
Happy to discuss. The failure modes are more interesting than the successes.

Top comments (14)

choutos

The observation that each model has consistent failure patterns rather than random bugs is underappreciated. Claude getting sloppy on error handling in long contexts, Codex over-engineering abstractions: these are predictable weaknesses, which means a second agent can be specifically tuned to watch for them.

We run a multi-agent setup ourselves and the "grading your own exam" problem is real. The manual copy-paste phase you describe is painfully familiar. Automating the review loop is the obvious next step but the hard part is knowing when to stop iterating. Two agents can get into an infinite refinement cycle if you're not careful. Did you find a good heuristic for convergence?

Sakiharu

Great question. Our current approach has a few layers (there's a simplified sketch of the convergence check after the list):

Tag-based convergence. Each round is tagged — [CODE], [PASS], [NEEDS_WORK], [CONSENSUS]. The loop stops when the reviewer issues [PASS] with an explicit reason. No rubber-stamp passes allowed — the policy requires at least one concrete justification.
Round cap. After 5 rounds without consensus, the system flags it for human intervention. You can [OVERRIDE] or [HANDOFF]. In practice most tasks converge in 2-3 rounds.
The real heuristic is upfront alignment. Honestly, with Claude Code + Codex the ping-pong refinement loop is rare. The more common failure mode is direction being wrong from the start. So for complex tasks, I align on goals and approach with the author agent first — before the loop even begins. If the direction is right, convergence comes naturally. If it's wrong, no amount of review rounds will fix it.
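
As a rough illustration (a simplified sketch, not the package's actual code), the convergence check boils down to something like this in TypeScript:

// Simplified sketch of the tag-based convergence check; not the real implementation.
type Tag = "CODE" | "PASS" | "NEEDS_WORK" | "CONSENSUS";

interface ReviewTurn {
  tag: Tag;
  justification: string; // a [PASS] with no concrete reason is rejected as a rubber stamp
}

function nextStep(turn: ReviewTurn, round: number, roundCap = 5): "done" | "continue" | "escalate" {
  if (turn.tag === "PASS" && turn.justification.trim().length > 0) return "done";
  if (round >= roundCap) return "escalate"; // human decides: [OVERRIDE] or [HANDOFF]
  return "continue";
}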

Curious about your setup — are you using different models for each role, or the same model in different contexts?

chovy

The AI-reviews-AI loop is underrated. Biggest issue I've hit with single-model generation is that the model gets anchored on its own assumptions — a second model asking "why did you do it this way?" catches stuff that linting never will.

Curious about the cost side though. Running two models per commit adds up fast if you're shipping multiple times a day. Have you found the quality improvement offsets the token spend, or do you gate it to only run on certain file types?

We've been automating a lot of our content pipeline the same way — one AI generates, another critiques. Built postammo.com around that idea for social media content specifically. The adversarial review step made the output way less generic.

Sakiharu

I'm on max plans for both models, and token usage looks pretty comfortable so I'd say the cost is manageable. As someone who doesn't come from a Node.js background, the dual-agent review loop is really my main way of catching deep issues that I wouldn't spot from reading the code alone.
Great question though — I'll add a token tracking module to the package so we can get accurate numbers instead of guessing. Thanks for raising it.
And applying the dual-agent pattern to content creation is a great idea. The same principle should work well there.

Ingo Steinke, web developer

Good point! I also see that AI bots stick to their initial assumptions, thus doing effective work (sometimes) but still consistently moving in the wrong direction. The few cases where "vibe coding" worked for me as a senior were either

  • boilerplate code commonly documented in numerous tutorials
  • greenfield code in strongly typed languages
  • things that I should have known...
  • ...but search engines failed to reveal for some reason.

Claude successfully set up working MVP code for a classic WordPress plugin and a similar Chrome browser extension. Copilot did a code review and found two potential security issues in the extension code. The initial concept had been drafted by another AI agent based on a random idea that popped up in a casual conversation. In the end it turned out we'd better not waste any more effort finishing and publishing the browser extension, since the requirements were already based on flawed assumptions.

Thus, pair programming on a coding level isn't enough, but it's maybe better than naive vibe coding without any external challenger at all.

Sakiharu

This is a really important point and honestly something I’m still figuring out. Code-level review catches bugs, but it can’t fix flawed assumptions — and we’ve run into that too.
What’s helped so far is aligning on goals and approach with the author agent before the review loop starts. The loop handles code quality, but the direction-setting has to happen upfront. When I skip that step, I get exactly what you described — polished code built on flawed premises.
Your Chrome extension story resonates. A reviewer agent would’ve caught the security issues, but wouldn’t have questioned whether the extension should exist at all. That kind of judgment is still on us.
So yeah — pair programming at the code level is necessary but not sufficient. Still learning where the boundaries are.

Mykola Kondratiuk

This resonates a lot. I've been building side projects with Claude Code for months now and honestly the biggest lesson was exactly this - one agent reviewing its own output is like proofreading your own essay, you just gloss over things. The characteristic failure patterns you describe are spot on too, Claude getting sloppy on error handling in long contexts is something I hit constantly. I ended up building a security scanning step into my workflow for similar reasons - not a second agent exactly, but a dedicated pass that only looks for vulnerabilities and missed edge cases. Caught stuff I would have shipped otherwise. Curious though - do you find the reviewer agent sometimes introduces new issues? Like overcorrecting or suggesting refactors that break the original intent? That's been my experience when I tried having a second model do full rewrites instead of just flagging problems.

Sakiharu

Yeah, we ran into this early on too. Our solution was to let the author agent challenge the reviewer’s feedback before changing any code. So instead of blindly applying every suggestion, they discuss it first — the reviewer flags an issue, the author can push back with reasoning, and they go back and forth until they reach consensus. Only then does the code get modified.
In practice, most of the time the author accepts the reviewer’s feedback and makes the fix. But sometimes it pushes back and holds its ground — and often it’s right to. That back-and-forth filters out the overcorrections before they ever touch the code.
We have a 5-round cap on any single disagreement — if they can’t reach consensus, it escalates to human judgment. Hasn’t happened yet though. Turns out when both agents have to justify their position, they converge pretty quickly.
Your security scanning step is smart though — a focused pass for a specific concern. General review + specialized scan is probably the strongest combination. Something I'd like to explore adding to the loop as well.

Mykola Kondratiuk

That author-challenges-reviewer step is a really elegant fix. Way cleaner than trying to tune the reviewer to be less aggressive upfront - you're basically adding a negotiation layer before any code changes happen, which is a smarter place to put the friction. The 5-round cap with human escalation is a nice touch too, love systems that have a clear fallback rather than looping forever. Definitely stealing that idea.

MaxxMini

Really resonates with this. I've gone through a similar journey — started with AI writing code, then realized the real leverage is AI validating code.

We built a TDD pipeline where the AI writes tests first, implements, then a separate agent reviews. The meta-insight: the most impactful automation isn't generating code faster, it's catching bad code before it ships.

What's your approach to preventing AI-generated false confidence? (e.g., code that looks right but has subtle logic bugs)

Sakiharu

Your TDD pipeline sounds really solid — tests first is the right instinct.
For false confidence, our main defense is the challenge mechanism. The reviewer doesn’t just say “looks good” — it has to give a concrete reason for passing. And the author can challenge the reviewer’s feedback too, so they actually debate before code gets modified. That back-and-forth surfaces the subtle stuff that a single pass misses.
But honestly, the biggest guard against false confidence is the human in the loop. Both agents can confidently agree on code that solves the wrong problem. I align on goals and approach with the author agent before the loop starts — that’s where the most dangerous false confidence gets caught. Code-level review handles bugs. Direction-setting handles “looks right but fundamentally wrong.”
Still an unsolved problem though. Would be curious how your TDD approach handles cases where the tests themselves encode the wrong assumptions.

MaxxMini

Really resonates with this. I went down a similar rabbit hole — built 80+ automation scripts in a week, then realized the real win wasn't the scripts themselves, but having an AI agent that could compose them together.

The quality angle is interesting. What metrics do you use to measure if the automation actually improved code quality vs just speed?

Sakiharu

No formal metrics yet. I recall seeing a multi-agent coding framework on GitHub that improved pass rates from around 80% to over 90% on standard benchmarks by adding specialized review and testing agents. But those are algorithmic benchmarks — not real-world development tasks. I'd like to build a test suite based on actual project work and measure the difference properly. Great question — thanks for pushing me to think about this.

Ali Farhat

Codex becomes very close to autonomous development