We let AI agents loose on a payment platform. They crushed the boring stuff. Then they silently broke the stuff that matters.
A survey came out last week. 54% of all code is now AI-generated. Up from 28% last year.
I read that number and thought: yeah, that tracks. We're probably in that range too.
But here's the thing nobody's asking — which 54%?
Not all code carries equal weight. A CRUD endpoint for fetching merchant details? Low risk. The webhook handler that transitions a payment from pending to complete? That's someone's rent. Someone's payroll. Get that wrong and money moves where it shouldn't, or worse, money doesn't move at all.
I'm the CTO of a payment platform. FCA-authorised, processing real money, real merchants, real consequences. We run NestJS microservices, Docker, Traefik — the usual stack. And we've been using AI agents aggressively for over a year now.
I'm not here to tell you AI is dangerous. It's not.
I'm here to tell you it's dangerous when you forget what it's actually good at.
The 80% Where AI Agents Are Genuinely Brilliant
Let me give credit where it's due. AI agents have made our team faster in ways that would have seemed absurd two years ago.
API scaffolding. Generating service boilerplate. Writing Zod validation schemas. Spinning up new endpoints. Creating test stubs. Refactoring imports. Migrating patterns across repos.
We run multiple microservices. When we need a new service, an agent can scaffold the entire thing — module structure, base configuration, Docker setup, Traefik labels — in minutes. What used to be a half-day of copy-paste-and-tweak is now a conversation.
When we overhauled our env management across all repos, AI agents did the grunt work. They mapped every .env file, found naming conflicts, identified common variables, and generated a unified Zod schema. What would have taken a team days of grep-and-spreadsheet work took hours.
For this 80% of the codebase — the predictable, pattern-following, structurally repetitive code — AI agents are the best junior developers money can buy. Tireless. Cheap. No ego. Almost never make a mistake on the stuff they're good at.
An army of juniors sitting at your terminal.
Then You Hit the Other 20%
Here's where it gets interesting.
We had an agent build out a webhook handler. Webhooks in payments are critical — they're how you know a payment succeeded, failed, or needs attention. The agent wrote the handler. It looked clean. Tests passed.
But it silently ignored the edge cases.
Status transitions have rules. A payment can go from pending to complete. It cannot go from complete back to pending. When a human developer builds this, they think about the illegal transitions because they've seen what happens when money moves backwards. They build the guard because they've felt the pain of not having it.
The agent didn't care about that. It built the happy path beautifully and treated the edge cases like they didn't exist.
When we do this work manually, this type of error almost never happens. A senior developer who has worked in payments for years doesn't forget the impossible transitions. That knowledge isn't written down in their code. It's in their bones.
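For illustration, here is a minimal sketch of the kind of guard described above, written as an allow-list of legal moves. Only pending and complete come from the example in this section. The failed status and the names `ALLOWED_TRANSITIONS`, `IllegalTransitionError` and `assertTransition` are hypothetical, chosen for the sketch rather than taken from any real codebase.

```typescript
// Hypothetical status set: only "pending" and "complete" come from the article.
type PaymentStatus = "pending" | "complete" | "failed";

// Allow-list of legal moves. Anything not listed is rejected by default,
// so a forgotten case fails closed instead of silently succeeding.
const ALLOWED_TRANSITIONS: Record<PaymentStatus, readonly PaymentStatus[]> = {
  pending: ["complete", "failed"],
  complete: [], // terminal: a completed payment may never be reopened
  failed: [], // terminal in this sketch
};

class IllegalTransitionError extends Error {
  constructor(from: PaymentStatus, to: PaymentStatus) {
    super(`Illegal payment transition: ${from} -> ${to}`);
    this.name = "IllegalTransitionError";
  }
}

// Call this before persisting any status change.
function assertTransition(from: PaymentStatus, to: PaymentStatus): void {
  if (!ALLOWED_TRANSITIONS[from].includes(to)) {
    throw new IllegalTransitionError(from, to);
  }
}
```

The design choice that matters is the default: a handler that only implements the happy path accepts anything it was never told to reject, while an allow-list rejects anything it was never told to accept.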
The Pattern I Keep Seeing
This isn't a one-off. After months of working with AI agents on a regulated payment stack, one pattern is consistent:
AI agents optimise for completion, not correctness.
They want to finish the feature. Get to the green checkmark. And to get there efficiently, they take shortcuts that look reasonable on the surface.
The agent builds what should happen. It rarely builds what should not happen. In payments, the negative cases are where all the real risk lives. What happens when a webhook arrives twice? What happens when a refund is requested on an already-refunded transaction? What happens when the bank returns an unexpected status code? The agent doesn't think about any of that unless you explicitly tell it to.
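To make those negative cases concrete, here is a rough sketch of a handler that answers three of the questions above: the same webhook arriving twice, an event for a payment that is already settled, and a status the code does not recognise. Everything in it is an assumption for illustration: the event shape, the `eventId` field, the outcome names and the in-memory `Set` and `Map`, which stand in for durable storage with a unique constraint.

```typescript
type PaymentStatus = "pending" | "complete" | "failed";

type Outcome =
  | "applied"
  | "duplicate_ignored"
  | "unknown_payment"
  | "unrecognised_status"
  | "conflict";

// Hypothetical payload: real providers name and shape these fields differently.
interface WebhookEvent {
  eventId: string; // assumed to be unique per notification
  paymentId: string;
  status: string; // untrusted input, so deliberately not typed as PaymentStatus
}

function isFinalStatus(status: string): status is "complete" | "failed" {
  return status === "complete" || status === "failed";
}

class WebhookProcessor {
  // In-memory stand-ins for durable storage.
  private readonly seenEventIds = new Set<string>();
  private readonly payments = new Map<string, PaymentStatus>();

  registerPayment(paymentId: string): void {
    this.payments.set(paymentId, "pending");
  }

  statusOf(paymentId: string): PaymentStatus | undefined {
    return this.payments.get(paymentId);
  }

  handle(event: WebhookEvent): Outcome {
    // Same notification delivered again: acknowledge it, change nothing.
    if (this.seenEventIds.has(event.eventId)) return "duplicate_ignored";

    const current = this.payments.get(event.paymentId);
    if (current === undefined) return "unknown_payment";

    // Unexpected status from the sender: refuse to guess what it means.
    if (!isFinalStatus(event.status)) return "unrecognised_status";

    // A different event trying to move an already-settled payment.
    if (current !== "pending") return "conflict";

    this.payments.set(event.paymentId, event.status);
    this.seenEventIds.add(event.eventId);
    return "applied";
  }
}
```

Only the first branch is safe to acknowledge quietly. The other three non-applied outcomes are the ones a human should hear about, which is why the sketch returns them as distinct values instead of swallowing them.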
Then there's the reusability problem. We have shared utility packages. Helper functions. Common patterns that the team has standardised on over years. The agent doesn't care. It writes its own version from scratch. It works, but now you have two implementations of the same logic — one tested and trusted in production, one freshly generated and untested. The agent is focused on completing this feature, not maintaining the architecture.
And the subtlest one — agents seem to optimise for fewer back-and-forth turns. It looks like they're saving cost, saving context. Complex validation? Skip it, the basic case works. Error handling for a rare edge case? Not worth the tokens. The result is code that passes every test you wrote but fails on the scenarios you didn't think to test — because those are exactly the scenarios the agent also didn't think about.
Juniors Don't Ship Products. They Write Code.
Here's the frame that made this click for me.
Claude — or any coding agent — is the best junior developer money can buy. An army of juniors. Tireless, cheap, no ego, near-zero error rate on routine work.
But juniors don't ship products. They write code.
The difference between code and a product is judgment. Knowing which transitions are illegal. Knowing that the retry logic has a specific backoff curve because you've been burned by what happens when it doesn't. Knowing that the webhook handler needs idempotency because banks sometimes send the same notification three times.
That knowledge doesn't come from training data. It comes from years of operating a system, debugging at 2am, explaining to a merchant why their settlement was delayed.
The most dangerous mistake a CTO can make in 2026 is buying AI to replace senior engineers. The right move is buying AI to enable them.
Replace your senior with AI? You get speed plus silent disasters.
Enable your senior with AI? You get an architect with an army.
What We Actually Do About It
I'm not writing this to complain about AI. I'm writing this because we've built a system that works, and it might help you too.
The first thing we did was make our architecture machine-readable. We extract design patterns and architecture rules into formats that agents can consume. When an agent works on our codebase, it doesn't just see code — it sees boundaries, patterns, rules about what belongs where. Not documentation nobody reads. Lints and constraints that the agent can't ignore.
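As a rough illustration of a rule the agent can't ignore, the sketch below checks a file's imports against boundary rules. It is not the tooling described here: the `BoundaryRule` format, the service paths and the function names are all hypothetical, and a regex stands in for real parsing.

```typescript
// Hypothetical rule format: code under `area` must not import from these prefixes.
interface BoundaryRule {
  area: string;
  forbiddenImports: string[];
  reason: string;
}

interface Violation {
  file: string;
  importPath: string;
  reason: string;
}

// Collect the module paths named in `import ... from "x"` statements.
// A regex is enough for a sketch; a real lint would walk the parsed syntax tree.
function importedPaths(source: string): string[] {
  const pattern = /from\s+["']([^"']+)["']/g; // new regex per call, since /g keeps state
  const paths: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const path = match[1];
    if (path !== undefined) paths.push(path);
  }
  return paths;
}

// Return every import in `source` that breaks a rule applying to `file`.
function checkBoundaries(file: string, source: string, rules: BoundaryRule[]): Violation[] {
  const violations: Violation[] = [];
  for (const rule of rules) {
    if (!file.startsWith(rule.area)) continue;
    for (const importPath of importedPaths(source)) {
      if (rule.forbiddenImports.some((prefix) => importPath.startsWith(prefix))) {
        violations.push({ file, importPath, reason: rule.reason });
      }
    }
  }
  return violations;
}

// Example rule with made-up paths.
const exampleRules: BoundaryRule[] = [
  {
    area: "services/webhooks/",
    forbiddenImports: ["services/ledger/internal"],
    reason: "Webhooks must go through the ledger's public API",
  },
];
```

A check like this fails the build with the reason attached, so the agent sees the rule in the error output. In an existing setup, ESLint's `no-restricted-imports` rule covers the simple form of this without custom code.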
Then we invested heavily in testing the negative cases. Every PR — human or AI — runs through the same suite. But we specifically built tests for the stuff agents skip: illegal state transitions, duplicate webhook handling, idempotency checks. If the agent silently drops a negative case, the tests catch it before it ships.
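One way to write such a test is to check every possible pair of statuses against a list of legal moves that a human wrote out, instead of testing only the transitions someone remembered. The sketch below assumes a four-status model and a `canTransition` function as the code under test. Both are hypothetical, and no test framework is used so that the example stays self-contained.

```typescript
type Status = "pending" | "complete" | "failed" | "refunded";

const ALL_STATUSES: readonly Status[] = ["pending", "complete", "failed", "refunded"];

// Stand-in for the code under test.
function canTransition(from: Status, to: Status): boolean {
  if (from === "pending") return to === "complete" || to === "failed";
  if (from === "complete") return to === "refunded";
  return false;
}

// The only legal moves, written out by hand. Every other pair must be rejected.
const LEGAL_MOVES = new Set<string>([
  "pending->complete",
  "pending->failed",
  "complete->refunded",
]);

// Check all 16 pairs and describe each one where the code disagrees with the list.
function findTransitionMismatches(
  check: (from: Status, to: Status) => boolean,
): string[] {
  const mismatches: string[] = [];
  for (const from of ALL_STATUSES) {
    for (const to of ALL_STATUSES) {
      const key = `${from}->${to}`;
      const expected = LEGAL_MOVES.has(key);
      if (check(from, to) !== expected) {
        mismatches.push(`${key}: expected ${expected ? "allowed" : "rejected"}`);
      }
    }
  }
  return mismatches;
}
```

With these definitions, `findTransitionMismatches(canTransition)` returns an empty list, while a happy-path-only implementation that accepts every transition is reported 13 times, once for each illegal pair, including refunded to refunded.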
And seniors still review everything that touches money. No AI-generated payment logic ships without a senior looking at it. Not because we don't trust AI — because we know exactly where it's blind. The review isn't checking syntax. It's checking judgment. Did the agent handle the ambiguous bank status? Did it respect our existing retry logic? Did it use the shared utility or reinvent the wheel?
This problem bothered me enough that I started building Bodhi Orchard — an open-source agentic development framework. The core idea: don't just let agents write code. Feed them the full context — architecture, design patterns, test plans, existing utilities — so they stop making the same blind-spot mistakes. Human decisions over human busywork, with guardrails that actually enforce quality.
The Real Question for 2026
The survey says 54% of code is AI-generated. I believe it.
But here's my question: what percentage of bugs in 2026 will be AI-generated?
And more importantly — who's going to find them?
Not the agents. They wrote the bugs in the first place. Not the juniors — they won't know enough to spot what's missing.
It's going to be the seniors. The architects. The people who've operated these systems long enough to know where the bodies are buried.
The 80% is solved. AI won. Celebrate that.
Now invest in the humans who understand the other 20%. Because that's where your product lives or dies.
I'm Arun, CTO & Co-Founder of Atoa — a UK open banking payment platform. I write about what it's actually like to build fintech with AI, not what the conference slides say it's like. If this resonated, follow me here or on X @mickyarun.
And if you're curious about building AI-native development with proper guardrails, check out Bodhi Orchard.
Top comments (57)
the right split isn't complexity - it's blast radius. AI fails on the paths where wrong code has externally visible consequences. your webhook handler nails it: same to write, completely different stakes if broken.
Blast radius is the better framing, you're right. We've actually started using exactly that language internally when routing work — not "is this complex?" but "what breaks if this is wrong?" A CRUD endpoint and a webhook handler are the same complexity to write. The difference is that one quietly corrupts payment state and the other doesn't. That asymmetry is what makes the 80/20 split so deceptive.
the CRUD-vs-webhook example is exactly it — same complexity, different blast radius. once you start routing by what breaks externally, you also notice that AI failures cluster on those external-consequence paths, not the complex internal ones. that asymmetry is worth building into your review criteria explicitly.
Yes — and the clustering is the useful bit: AI failures don't spread evenly, they pile up on the external-consequence paths, exactly where you can least afford them. That's the argument for routing review by blast radius instead of diff size. Anything that touches money gets a senior's eyes regardless of how "small" the change looks. Good addition.
the clustering pattern is what finally convinced me to retire the 'review every AI change' rule - if failures aren't random, blanket review is the wrong tool. route to where the risk actually pools.
A lot of you asked the same question in the comments: how do you actually measure that 20% when you're hiring?
I wrote the sequel. It covers how we flipped our interview, why we stopped asking candidates to write code from scratch, and a design thinking challenge I'd love your take on.
How We Hire for the 20% AI Can't Do (And Why We Stopped Asking Candidates to Code From Scratch)
That 20% is where the real engineering judgment sits. AI can generate a lot of code, but seniors are still needed for tradeoffs, architecture, edge cases, security, and knowing when the “working” solution will become a future problem.
Spot on. The part that catches most teams off guard is your last point — knowing when a working solution becomes a future problem. AI agents will happily generate a solution that passes every test today but creates a coupling that makes the next feature impossible. That's the judgment call that still needs a human with context.
Exactly. Technical debt rarely looks like debt when it's created. Most of the time it looks like a fast win, which is why experience matters. Someone has to think about the second and third order effects, not just whether the code works today.
"Which 54%?" is the question the headline number always hides. A CRUD endpoint and a payment-state webhook are not the same risk, but the stat treats them as one. The 20% that needs a senior is exactly the part where a confident wrong answer moves money the wrong way.
Exactly. The headline number is seductive but meaningless without weighting by consequence. We could probably get to 90% AI-generated if we counted by lines. But the 10% that handles payment state transitions, retry logic, and settlement timing is worth more than the other 90% combined. The stat treats a login form and a refund handler as equal. They're not.
Weighting by consequence is the only honest way to read that number. Lines of code makes a settlement webhook look the same as a tooltip, and that webhook is the part you can't hand off. I'd rather see it reported as percent of risk automated than percent of code.
"Percent of risk automated" instead of "percent of code" — I'm stealing that. A settlement webhook and a tooltip are one line each and worlds apart in blast radius, and every "54% of code is now AI" headline flattens exactly that distinction. The number that would actually mean something is how much of the risky surface you've automated and still sleep at night. Spot on.
Risk-weighted is the only honest read. A settlement webhook and a tooltip are one line each on the diff and worlds apart at 2am when one of them is down. The number I actually trust is how much of the scary surface you handed off and can still sleep through.
"Surface you handed off and can still sleep through" — that's the metric. We talk about it as blast radius, not line count: a diff that can't move money or leak data can ship on a junior's say-so; a diff that touches settlement gets a senior even if it's three characters. The honest org chart isn't seniority by years, it's who's allowed near the scary surface. The trap is teams that measure AI adoption by % of code merged and never look at which 20% it was.
Blast radius over line count is exactly right. We ended up baking it in: anything that can move money or touch user data goes to a stricter agent tier even when the diff is three lines. Percent-merged is a vanity number that hides which 20% actually shipped.
The line about illegal transitions sitting in the senior's bones is the one I keep coming back to. What's worked for us is treating those exact rules as the highest-value tests - the failing case that proves the impossible transition still throws, the contract test that catches the duplicate webhook. The senior still reviews, but the same blind spot doesn't slip past twice. The catch is that negative cases catch nothing day-to-day, so you only find out the agent skipped them when something goes wrong in prod, which on a payments stack is too late.
This is exactly the approach we've landed on too. We call them "scar tests" internally — every time a senior catches something an agent missed, that specific scenario becomes a permanent test. The agent still does the bulk work, but the test suite encodes the team's institutional memory. Over time, the blind spots shrink. Not because the agent gets smarter, but because the guardrails get sharper.
"Scar tests" - I might steal that :)
The human would still check and find issues, but the agent would catch the regression the next time around. Over time you'd end up with a test suite that's basically a record of every mistake the team has ever had to fix, which is one of the best things you can hand a new agent or a new joiner.
prickles.org/tenet/living-document...
“Scar tests” is a great phrase, but I wonder if the unit should be a little broader than tests.
Every scar probably needs to become part of the repo’s memory, but not every scar should become another test. Some mistakes should become tests, yes. Others are better captured as boundary rules, diagnostic checks, ownership constraints, repair patterns, or notes about what the agent must not normalize as baseline.
Otherwise the test suite itself can become a drift surface: every past mistake gets encoded as another assertion, the agent starts optimizing around the tests, and the repo slowly accumulates verification bloat.
The deeper idea, to me, is that scars should become governed signals. The repo should remember what hurt it before, but it should choose the right enforcement surface instead of turning every wound into another test.
Fair point. A test is the easiest thing to add so it ends up doing too much of the work. A lint rule for the kind of thing the agent keeps proposing does the same job without making the suite bigger. The bit where you catch it is the same either way, someone spots it and the team agrees it shouldn't happen again, but the fix doesn't have to be a test.
prickles.org/tenet/linter-as-law/TA1
The 80/20 split is real — and the hard part isn't the 20%, it's knowing which 20% you're in before you ship. We've started routing every AI-generated diff through a cheap local model review gate that flags "suspicious confidence" (clean code that subtly breaks edge cases). Caught 3 leaks and 2 race conditions last sprint alone. Do you run any automated review on the AI-generated parts or just eyeball them?
We do both. Automated: every PR runs through our standard test suite plus what we call "scar tests" — specific edge cases we've caught before. But we also have architecture lints that check whether the agent used existing shared utilities or reinvented them, and schema validation that catches impossible state transitions at compile time. Manual: any code that touches money movement gets a senior review, non-negotiable. The automated layer catches about 80% of agent mistakes. The senior review catches the 20% that requires judgment about intent, not just correctness.
scar tests + architecture lints is a solid combo — especially catching when the agent reinvents existing shared utilities instead of reusing them. We tried something similar internally and it worked well. And the non-negotiable senior review for money-touching code is something we've been sticking to as well.
The 20% is defined by consequence, not difficulty, which is exactly why it doesn't shrink as the models get better. You're FCA-authorised, so you live this: the risky code isn't the hard code, it's the code nobody can explain. AI output that works but that no one can defend to an auditor is still a liability, correct or not. So the senior's real job there isn't writing that 20%, it's being able to stand behind it when someone asks why it made the call it did.
This is the FCA angle that doesn't get enough airtime. "The risky code isn't the hard code, it's the code nobody can explain" — that's exactly it. We've had auditors ask why a specific retry backoff was chosen, and the answer can't be "the AI picked it." Someone has to own the reasoning. AI-generated code that works but has no defensible rationale is a compliance risk in regulated fintech, full stop. The senior's real value isn't writing that 20% — it's being the person who can explain it under questioning.
"The AI picked it" as the answer to an auditor. That image should scare every team shipping AI-generated code in regulated environments. The senior's value isn't the code. It's the defensible rationale attached to it.
You said it better than my whole article did — the senior's value is the defensible rationale, not the code. An auditor won't accept "the AI picked it," and neither should a CTO. The code is cheap now; the why behind it is the thing you're actually paying a senior for. Thanks for reading.
Your article framed the problem, I just sharpened one edge. Most people still think the gap is complexity. It's not. It's who signs off on this when a regulator asks why.
The 80%/20% split is the right framing, and the failure mode worth naming is that the 20% has fundamentally different shape from the 80%. Execution rewarded consistency, pattern recognition, and accumulated templates — exactly what AI is best at. The remaining work rewards judgment, taste, knowing when to stop, and recognizing when the agent's confident output is structurally wrong.
Most engineers never had to develop those skills explicitly because execution filled the day. AI didn't degrade them — it surfaced a latent skill gap that was always there, just hidden under volume. The seniors who still matter are the ones who built that judgment over years and can now apply it without being slowed by execution. The juniors who'll become seniors are the ones who realize the 20% is where the career compounds.
"AI didn't degrade them — it surfaced a latent skill gap that was always there, just hidden under volume." That's the most precise way I've seen this framed. We had seniors who were genuinely good at judgment but spent 70% of their time on execution. AI freed them to focus entirely on the 20% — and the quality of their architectural decisions went up because they weren't context-switching between boilerplate and boundary design. The career compounding point is real too. The juniors who lean into the hard 20% now will be rare and extremely valuable in 3 years.
It's a popular moot meme... In answer to your title I would say, "This year, yeah. But next year? I'm not so sure."
It would be more accurate to say, "We will always need experienced people." That is just about the only future for everyone here.
Architecting and Directing and Problem Solving.
Anything else is just circle jerk.
Fair pushback. I'd reframe it slightly: the title of "senior" might become less meaningful. What stays permanently valuable is the judgment that comes from operating a system under real constraints — regulatory, financial, human. Even if models get dramatically better at code generation, someone still needs to decide what to build, what not to build, and what the system should refuse to do. In payments, that's not a coding problem. It's a domain judgment problem. And I'd bet that's still human territory in 2030.
This really lands for me, especially the point that the “other 20%” isn’t just harder code — it’s judgment, memory, scars, and knowing which paths should never be allowed in the first place.
The slightly different angle I’ve been thinking about is that maybe the senior shouldn’t only be the final human checkpoint. Some of that senior judgment needs to become part of the repo’s operating environment.
Not in the sense of replacing the senior, but in the sense of making the repo’s rules, architecture, constraints, and hard-won assumptions continuously inspectable while the agent is working.
Because I agree with you: agents are great at producing the happy path. But the deeper issue is that they don’t always know when they’ve drifted away from the repo’s truth. They can pass tests, finish the task, and still quietly make the system noisier or less coherent.
So yes, we still need seniors. But I think the next layer is tooling that helps preserve senior judgment inside the repo itself — so the agent is not just generating code, but being supervised against the architecture and constraints the team already knows matter.
You've nailed what I think is the next evolution. We're actually building towards exactly this — making the repo itself aware of its own constraints so the agent can't silently drift.
Concretely, that means things like: MCP architecture feeds that tell the agent which service owns which domain, typed schemas that reject impossible state transitions at compile time, and automated linting for design patterns the team has agreed on.
The senior still decides what the rules are. But the repo enforces them continuously, not just at PR review time. That way when an agent finishes a task, it hasn't just passed tests — it's stayed coherent with the system's actual truth.
Thank you. Comments like this are actually one of the reasons I've become increasingly convinced there's a real category forming here.
I've been building a diagnostic suite around this general problem space, and one of the things that keeps surprising me is how often the same underlying issue shows up in completely different conversations. Memory, observability, verification, architecture, agent reliability — the terminology changes, but the pattern feels remarkably similar.
The more I work on it, the more I find myself thinking about these as different drift surfaces rather than completely separate problems.
The implementation details are obviously different, but the recurring question seems to be: how does a system preserve its own truth while work is being performed inside it?
That's why I like your phrase "stayed coherent with the system's actual truth." It feels like it gets at something deeper than whether the code passed tests or the task was completed. It gets closer to whether the system remained aligned with itself.