DEV Community

AI Agents Are Great at 80% of Our Code. The Other 20% Is Why We Still Need Seniors.

arun rajkumar on May 28, 2026

We let AI agents loose on a payment platform. They crushed the boring stuff. Then they silently broke the stuff that matters. A survey came out...

Mykola Kondratiuk • May 29

the right split isn't complexity - it's blast radius. AI fails on the paths where wrong code has externally visible consequences. your webhook handler nails it: same to write, completely different stakes if broken.

arun rajkumar • Jun 3

Blast radius is the better framing, you're right. We've actually started using exactly that language internally when routing work — not "is this complex?" but "what breaks if this is wrong?" A CRUD endpoint and a webhook handler are the same complexity to write. The difference is that one quietly corrupts payment state and the other doesn't. That asymmetry is what makes the 80/20 split so deceptive.

Mykola Kondratiuk • Jun 3

the CRUD-vs-webhook example is exactly it — same complexity, different blast radius. once you start routing by what breaks externally, you also notice that AI failures cluster on those external-consequence paths, not the complex internal ones. that asymmetry is worth building into your review criteria explicitly.

arun rajkumar • Jun 6

Yes — and the clustering is the useful bit: AI failures don't spread evenly, they pile up on the external-consequence paths, exactly where you can least afford them. That's the argument for routing review by blast radius instead of diff size. Anything that touches money gets a senior's eyes regardless of how "small" the change looks. Good addition.
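The routing rule described here can be sketched in a few lines. This is only an illustration under stated assumptions: the path prefixes, the `Diff` shape, and the function names are hypothetical, and a real team would maintain its own list of money-touching paths.

```typescript
type Reviewer = "junior" | "senior";

interface Diff {
  files: string[];
  linesChanged: number;
}

// Hypothetical path prefixes marking code that can move money or leak data.
const HIGH_BLAST_RADIUS_PREFIXES: string[] = [
  "src/settlement/",
  "src/payments/",
  "src/pii/",
];

// True if any touched file lives under a high-blast-radius prefix.
function touchesHighBlastRadius(diff: Diff): boolean {
  for (const file of diff.files) {
    for (const prefix of HIGH_BLAST_RADIUS_PREFIXES) {
      if (file.indexOf(prefix) === 0) {
        return true;
      }
    }
  }
  return false;
}

// Diff size is deliberately ignored: only what the change can break matters.
function requiredReviewer(diff: Diff): Reviewer {
  return touchesHighBlastRadius(diff) ? "senior" : "junior";
}

const tinySettlementFix: Diff = {
  files: ["src/settlement/rounding.ts"],
  linesChanged: 1,
};
const largeUiRefactor: Diff = {
  files: ["src/ui/tooltip.tsx", "src/ui/theme.ts"],
  linesChanged: 900,
};

console.log(requiredReviewer(tinySettlementFix)); // senior
console.log(requiredReviewer(largeUiRefactor)); // junior
```

The one-line settlement change is routed to a senior while the 900-line UI refactor is not, which is the opposite of what routing by diff size would do.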

Mykola Kondratiuk • Jun 6

the clustering pattern is what finally convinced me to retire the 'review every AI change' rule - if failures aren't random, blanket review is the wrong tool. route to where the risk actually pools.

arun rajkumar • Jun 2

A lot of you asked the same question in the comments: how do you actually measure that 20% when you're hiring?

I wrote the sequel. It covers how we flipped our interview, why we stopped asking candidates to write code from scratch, and a design thinking challenge I'd love your take on.

How We Hire for the 20% AI Can't Do (And Why We Stopped Asking Candidates to Code From Scratch)

#career #ai #hiring #webdev

Varsha Ojha • May 28

That 20% is where the real engineering judgment sits. AI can generate a lot of code, but seniors are still needed for tradeoffs, architecture, edge cases, security, and knowing when the “working” solution will become a future problem.

arun rajkumar • May 29

Spot on. The part that catches most teams off guard is your last point — knowing when a working solution becomes a future problem. AI agents will happily generate a solution that passes every test today but creates a coupling that makes the next feature impossible. That's the judgment call that still needs a human with context.

Varsha Ojha • May 29

Exactly. Technical debt rarely looks like debt when it's created. Most of the time it looks like a fast win, which is why experience matters. Someone has to think about the second and third order effects, not just whether the code works today.

Andrii Krugliak • May 29

"Which 54%?" is the question the headline number always hides. A CRUD endpoint and a payment-state webhook are not the same risk, but the stat treats them as one. The 20% that needs a senior is exactly the part where a confident wrong answer moves money the wrong way.

arun rajkumar • Jun 3

Exactly. The headline number is seductive but meaningless without weighting by consequence. We could probably get to 90% AI-generated if we counted by lines. But the 10% that handles payment state transitions, retry logic, and settlement timing is worth more than the other 90% combined. The stat treats a login form and a refund handler as equal. They're not.

Andrii Krugliak • Jun 4

Weighting by consequence is the only honest way to read that number. Lines of code makes a settlement webhook look the same as a tooltip, and that webhook is the part you can't hand off. I'd rather see it reported as percent of risk automated than percent of code.
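To make the difference between the two numbers concrete, here is a small worked example. The `Change` shape and the risk weights are assumptions invented for illustration; nobody in the thread specifies how risk would be scored.

```typescript
// Hypothetical change record; riskWeight is a team-assigned score
// (for example 1 = cosmetic, 99 = moves money).
interface Change {
  name: string;
  linesChanged: number;
  riskWeight: number;
  aiGenerated: boolean;
}

// Headline-style number: share of changed lines written by AI.
function percentOfCode(changes: Change[]): number {
  let total = 0;
  let ai = 0;
  for (const c of changes) {
    total += c.linesChanged;
    if (c.aiGenerated) {
      ai += c.linesChanged;
    }
  }
  return total === 0 ? 0 : (100 * ai) / total;
}

// Risk-weighted number: share of total risk weight written by AI.
function percentOfRisk(changes: Change[]): number {
  let total = 0;
  let ai = 0;
  for (const c of changes) {
    total += c.riskWeight;
    if (c.aiGenerated) {
      ai += c.riskWeight;
    }
  }
  return total === 0 ? 0 : (100 * ai) / total;
}

const changes: Change[] = [
  { name: "tooltip copy", linesChanged: 1, riskWeight: 1, aiGenerated: true },
  { name: "settlement webhook", linesChanged: 1, riskWeight: 99, aiGenerated: false },
];

console.log(percentOfCode(changes)); // 50
console.log(percentOfRisk(changes)); // 1
```

With these made-up weights, the same two one-line changes read as 50% of code automated but only 1% of risk automated.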

arun rajkumar • Jun 6

"Percent of risk automated" instead of "percent of code" — I'm stealing that. A settlement webhook and a tooltip are one line each and worlds apart in blast radius, and every "54% of code is now AI" headline flattens exactly that distinction. The number that would actually mean something is how much of the risky surface you've automated and still sleep at night. Spot on.

Andrii Krugliak • Jun 9

Risk-weighted is the only honest read. A settlement webhook and a tooltip are one line each on the diff and worlds apart at 2am when one of them is down. The number I actually trust is how much of the scary surface you handed off and can still sleep through.

arun rajkumar • Jun 10

"Surface you handed off and can still sleep through" — that's the metric. We talk about it as blast radius, not line count: a diff that can't move money or leak data can ship on a junior's say-so; a diff that touches settlement gets a senior even if it's three characters. The honest org chart isn't seniority by years, it's who's allowed near the scary surface. The trap is teams that measure AI adoption by % of code merged and never look at which 20% it was.

Adam Lewis • May 28

The line about illegal transitions sitting in the senior's bones is the one I keep coming back to. What's worked for us is treating those exact rules as the highest-value tests - the failing case that proves the impossible transition still throws, the contract test that catches the duplicate webhook. The senior still reviews, but the same blind spot doesn't slip past twice. The catch is that negative cases catch nothing day-to-day, so you only find out the agent skipped them when something goes wrong in prod, which on a payments stack is too late.
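The two negative cases named here (the impossible transition that must still throw, and the duplicate webhook) might look like the sketch below. The statuses, the transition table, and the in-memory `WebhookHandler` are assumptions for illustration only; a production handler would persist seen event IDs rather than hold them in memory.

```typescript
type PaymentStatus = "pending" | "settled" | "refunded";

// Assumed legal transitions; anything not listed is illegal.
const LEGAL_TRANSITIONS: Record<PaymentStatus, PaymentStatus[]> = {
  pending: ["settled"],
  settled: ["refunded"],
  refunded: [],
};

function transition(from: PaymentStatus, to: PaymentStatus): PaymentStatus {
  if (LEGAL_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error("Illegal transition: " + from + " -> " + to);
  }
  return to;
}

interface WebhookEvent {
  id: string;
  paymentId: string;
  status: PaymentStatus;
}

class WebhookHandler {
  private seenEventIds: Record<string, boolean> = {};
  private statuses: Record<string, PaymentStatus> = {};

  // Returns true if the event was applied, false if it was a duplicate delivery.
  handle(event: WebhookEvent): boolean {
    if (this.seenEventIds[event.id]) {
      return false;
    }
    // Payments not seen before are assumed to start as "pending".
    const current: PaymentStatus = this.statuses[event.paymentId] || "pending";
    this.statuses[event.paymentId] = transition(current, event.status);
    this.seenEventIds[event.id] = true;
    return true;
  }

  statusOf(paymentId: string): PaymentStatus | undefined {
    return this.statuses[paymentId];
  }
}

// Helper for negative cases: reports whether the callback threw.
function throws(fn: () => void): boolean {
  try {
    fn();
    return false;
  } catch {
    return true;
  }
}

const handler = new WebhookHandler();
const settledEvent: WebhookEvent = { id: "evt_1", paymentId: "pay_1", status: "settled" };

console.log(handler.handle(settledEvent)); // true (applied)
console.log(handler.handle(settledEvent)); // false (duplicate ignored)
console.log(throws(() => transition("refunded", "pending"))); // true
```

The last three lines are the negative cases: they pass quietly every day, and their value only shows when a later change removes the guard.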

arun rajkumar • May 29

This is exactly the approach we've landed on too. We call them "scar tests" internally — every time a senior catches something an agent missed, that specific scenario becomes a permanent test. The agent still does the bulk work, but the test suite encodes the team's institutional memory. Over time, the blind spots shrink. Not because the agent gets smarter, but because the guardrails get sharper.

Adam Lewis • May 29

"Scar tests" - I might steal that :)

The human would still check and find issues, but the agent would catch the regression the next time around. Over time you'd end up with a test suite that's basically a record of every mistake the team has ever had to fix, which is one of the best things you can hand a new agent or a new joiner.

prickles.org/tenet/living-document...

Scarab Systems • May 30

“Scar tests” is a great phrase, but I wonder if the unit should be a little broader than tests.

Every scar probably needs to become part of the repo’s memory, but not every scar should become another test. Some mistakes should become tests, yes. Others are better captured as boundary rules, diagnostic checks, ownership constraints, repair patterns, or notes about what the agent must not normalize as baseline.

Otherwise the test suite itself can become a drift surface: every past mistake gets encoded as another assertion, the agent starts optimizing around the tests, and the repo slowly accumulates verification bloat.

The deeper idea, to me, is that scars should become governed signals. The repo should remember what hurt it before, but it should choose the right enforcement surface instead of turning every wound into another test.

Adam Lewis • May 31

Fair point. A test is the easiest thing to add so it ends up doing too much of the work. A lint rule for the kind of thing the agent keeps proposing does the same job without making the suite bigger. The bit where you catch it is the same either way, someone spots it and the team agrees it shouldn't happen again, but the fix doesn't have to be a test.

prickles.org/tenet/linter-as-law/TA1

xulingfeng • May 30

The 80/20 split is real — and the hard part isn't the 20%, it's knowing which 20% you're in before you ship. We've started routing every AI-generated diff through a cheap local model review gate that flags "suspicious confidence" (clean code that subtly breaks edge cases). Caught 3 leaks and 2 race conditions last sprint alone. Do you run any automated review on the AI-generated parts or just eyeball them?

arun rajkumar • Jun 2

We do both. Automated: every PR runs through our standard test suite plus what we call "scar tests" — specific edge cases we've caught before. But we also have architecture lints that check whether the agent used existing shared utilities or reinvented them, and schema validation that catches impossible state transitions at compile time. Manual: any code that touches money movement gets a senior review, non-negotiable. The automated layer catches about 80% of agent mistakes. The senior review catches the 20% that requires judgment about intent, not just correctness.

xulingfeng • Jun 2

scar tests + architecture lints is a solid combo — especially catching when the agent reinvents existing shared utilities instead of reusing them. We tried something similar internally and it worked well. And the non-negotiable senior review for money-touching code is something we've been sticking to as well.

Valentin Monteiro • May 30

The 20% is defined by consequence, not difficulty, which is exactly why it doesn't shrink as the models get better. You're FCA-authorised, so you live this: the risky code isn't the hard code, it's the code nobody can explain. AI output that works but that no one can defend to an auditor is still a liability, correct or not. So the senior's real job there isn't writing that 20%, it's being able to stand behind it when someone asks why it made the call it did.

arun rajkumar • Jun 3

This is the FCA angle that doesn't get enough airtime. "The risky code isn't the hard code, it's the code nobody can explain" — that's exactly it. We've had auditors ask why a specific retry backoff was chosen, and the answer can't be "the AI picked it." Someone has to own the reasoning. AI-generated code that works but has no defensible rationale is a compliance risk in regulated fintech, full stop. The senior's real value isn't writing that 20% — it's being the person who can explain it under questioning.

Valentin Monteiro • Jun 4

"The AI picked it" as the answer to an auditor. That image should scare every team shipping AI-generated code in regulated environments. The senior's value isn't the code. It's the defensible rationale attached to it.

arun rajkumar • Jun 6

You said it better than my whole article did — the senior's value is the defensible rationale, not the code. An auditor won't accept "the AI picked it," and neither should a CTO. The code is cheap now; the why behind it is the thing you're actually paying a senior for. Thanks for reading.

Valentin Monteiro • Jun 6

Your article framed the problem, I just sharpened one edge. Most people still think the gap is complexity. It's not. It's who signs off on this when a regulator asks why.

Theo Valmis • Jun 1

The 80%/20% split is the right framing, and the failure mode worth naming is that the 20% has a fundamentally different shape from the 80%. Execution rewarded consistency, pattern recognition, and accumulated templates, which is exactly what AI is best at. The remaining work rewards judgment, taste, knowing when to stop, and recognizing when the agent's confident output is structurally wrong.

Most engineers never had to develop those skills explicitly because execution filled the day. AI didn't degrade them — it surfaced a latent skill gap that was always there, just hidden under volume. The seniors who still matter are the ones who built that judgment over years and can now apply it without being slowed by execution. The juniors who'll become seniors are the ones who realize the 20% is where the career compounds.

arun rajkumar • Jun 3

"AI didn't degrade them — it surfaced a latent skill gap that was always there, just hidden under volume." That's the most precise way I've seen this framed. We had seniors who were genuinely good at judgment but spent 70% of their time on execution. AI freed them to focus entirely on the 20% — and the quality of their architectural decisions went up because they weren't context-switching between boilerplate and boundary design. The career compounding point is real too. The juniors who lean into the hard 20% now will be rare and extremely valuable in 3 years.

BlackwellJohnL • Jun 2

It's a popular but moot meme. In answer to your title I would say, "This year, yeah. But next year? I'm not so sure."
It would be more accurate to say, "We will always need experienced people." That is just about the only future for everyone here.
Architecting, directing, and problem solving.
Anything else is just a circle jerk.

arun rajkumar • Jun 2

Fair pushback. I'd reframe it slightly: the title of "senior" might become less meaningful. What stays permanently valuable is the judgment that comes from operating a system under real constraints — regulatory, financial, human. Even if models get dramatically better at code generation, someone still needs to decide what to build, what not to build, and what the system should refuse to do. In payments, that's not a coding problem. It's a domain judgment problem. And I'd bet that's still human territory in 2030.

Scarab Systems • May 28

This really lands for me, especially the point that the “other 20%” isn’t just harder code — it’s judgment, memory, scars, and knowing which paths should never be allowed in the first place.

The slightly different angle I’ve been thinking about is that maybe the senior shouldn’t only be the final human checkpoint. Some of that senior judgment needs to become part of the repo’s operating environment.

Not in the sense of replacing the senior, but in the sense of making the repo’s rules, architecture, constraints, and hard-won assumptions continuously inspectable while the agent is working.

Because I agree with you: agents are great at producing the happy path. But the deeper issue is that they don’t always know when they’ve drifted away from the repo’s truth. They can pass tests, finish the task, and still quietly make the system noisier or less coherent.

So yes, we still need seniors. But I think the next layer is tooling that helps preserve senior judgment inside the repo itself — so the agent is not just generating code, but being supervised against the architecture and constraints the team already knows matter.

arun rajkumar • May 29

You've nailed what I think is the next evolution. We're actually building towards exactly this — making the repo itself aware of its own constraints so the agent can't silently drift.

Concretely, that means things like: MCP architecture feeds that tell the agent which service owns which domain, typed schemas that reject impossible state transitions at compile time, and automated linting for design patterns the team has agreed on.

The senior still decides what the rules are. But the repo enforces them continuously, not just at PR review time. That way when an agent finishes a task, it hasn't just passed tests — it's stayed coherent with the system's actual truth.
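One way to picture "typed schemas that reject impossible state transitions at compile time" is a type-level transition table. This is a minimal sketch, not the team's actual schema: the status names, the `NextStatus` table, `advance`, and `parseStatus` are all assumptions made for illustration.

```typescript
type Status = "pending" | "settled" | "refunded";

// Type-level transition table (assumed): each status maps to the only
// status it may move to. "refunded" is terminal, so it maps to never.
interface NextStatus {
  pending: "settled";
  settled: "refunded";
  refunded: never;
}

interface Payment<S extends Status> {
  id: string;
  status: S;
}

// The compiler only accepts a target status that the table allows.
function advance<S extends Status>(
  payment: Payment<S>,
  to: NextStatus[S]
): Payment<NextStatus[S]> {
  return { id: payment.id, status: to };
}

// Types only cover code inside the repo. Data arriving from outside is
// still an arbitrary string and needs a runtime check at the boundary.
function parseStatus(raw: string): Status {
  if (raw === "pending" || raw === "settled" || raw === "refunded") {
    return raw;
  }
  throw new Error("Unknown status: " + raw);
}

const pendingPayment: Payment<"pending"> = { id: "pay_1", status: "pending" };
const settledPayment = advance(pendingPayment, "settled");
const refundedPayment = advance(settledPayment, "refunded");

// The line below would not compile, because NextStatus["refunded"] is never:
// advance(refundedPayment, "pending");

console.log(settledPayment.status); // settled
console.log(refundedPayment.status); // refunded
```

The design choice is that the illegal call fails in the editor and in CI, so an agent that proposes it never gets a green build in the first place.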

Scarab Systems • May 29

Thank you. Comments like this are actually one of the reasons I've become increasingly convinced there's a real category forming here.

I've been building a diagnostic suite around this general problem space, and one of the things that keeps surprising me is how often the same underlying issue shows up in completely different conversations. Memory, observability, verification, architecture, agent reliability — the terminology changes, but the pattern feels remarkably similar.

The more I work on it, the more I find myself thinking about these as different drift surfaces rather than completely separate problems.

The implementation details are obviously different, but the recurring question seems to be: how does a system preserve its own truth while work is being performed inside it?

That's why I like your phrase "stayed coherent with the system's actual truth." It feels like it gets at something deeper than whether the code passed tests or the task was completed. It gets closer to whether the system remained aligned with itself.

Quentin Merle • May 29

Your article hits exactly where it hurts. This perfectly aligns with my own experience building agent architectures: AI is fundamentally 'lazy' by design.

A model always looks for the path of least resistance toward completion. It doesn't aim for architectural robustness, just the immediate happy path. And there's another very subtle trap: the natural 'agreeableness' of Cloud models. AI is trained to be helpful and to agree with us. If a developer isn't extremely vigilant, this docility completely biases their own engineering judgment and lulls them into a false sense of security regarding edge cases.

Which brings me to your crucial point about seniority. If AI is now replacing the tasks of Juniors (the repetitive 80%), how will those same Juniors ever acquire those 'painful' years of experience (debugging prod at 2 AM) that forge the judgment and vigilance a Senior needs to handle the critical 20%?

This is exactly why it's vital for the next generation to keep learning how to code 'the hard way' to build that critical thinking. AI is a tool to be piloted, not magic. Brilliant analysis!

arun rajkumar • Jun 2

You've identified maybe the most important second-order problem in AI-assisted engineering. If juniors never debug at 2am, how do they build the judgment that makes seniors valuable? Our approach: we still put juniors on the hard problems, but paired with a senior who's watching the process, not the output. The junior uses AI to scaffold, but the senior asks "what did you skip?" and "what happens when this fails?" The painful lessons still happen, but they happen faster and with guardrails. The worst outcome would be a generation of engineers who can prompt but can't diagnose.

Theo Valmis • May 29

The 80/20 split is real but the framing makes it sound like the 20% is just 'harder code.' In practice the 20% is where architectural intent, security boundaries, and cross-system invariants live, things the agent can't infer from the file it's editing. Seniors aren't writing harder code, they're enforcing constraints the codebase never made explicit.

arun rajkumar • Jun 3

"Enforcing constraints the codebase never made explicit" — that's the line. This is exactly why we started extracting architecture rules into machine-readable formats. The constraints existed as tribal knowledge in senior heads. Now they're lints, typed schemas, and boundary definitions the agent can actually see. The 20% doesn't get easier. But it becomes explicit instead of implicit, which means at least the agent knows when it's crossing a line instead of blundering through it.

Daniel Stolf • May 29

The "scar tests" frame from your reply to @adam_lewis_427616cbc93f0b is the right destination. The underrated part is when in the lifecycle you get there.

Most teams pick up the negative cases reactively: the agent ships the happy path, the senior reviews, finds the missing impossible-transition guard, the test gets added after the fix. That works, but it puts the senior in the role of "the thing that catches what the agent skipped". That doesn't scale, and burns the most expensive person on the team on deterministic checks.

The shift that's worked for us: list the negative cases before any code exists. A spec for "webhook handler" doesn't reach implementation until someone has answered, in writing, what transitions are illegal, what's the behaviour on duplicate delivery, what happens when the bank returns an unknown status.

Each answer becomes a failing test before the agent is prompted. Then the agent has to satisfy them, and the senior reviews the spec (ten minutes, scan-level) instead of hunting omissions in the diff.

The 20% doesn't disappear. It just stops being something a senior discovers after the fact and becomes something the team commits to before the keystrokes happen. Same judgment, applied earlier, where it's cheaper to enforce and harder for the agent to route around.

Scar tests still matter, they're the upgrade path. The first time a negative case bites in prod, it goes into the spec template for that class of feature, and the next webhook handler is born with the guard already required. The institutional memory compounds at the spec layer, not just the test suite.
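The "answer the negative cases in writing before any code exists" gate could be expressed as data plus one check. A rough sketch under assumptions: the `FeatureSpec` shape, the template function, and the sample answers are hypothetical, and only the three questions come from the comment above.

```typescript
interface NegativeCase {
  question: string;
  answer: string; // stays empty until someone answers it in writing
}

interface FeatureSpec {
  feature: string;
  negativeCases: NegativeCase[];
}

// Hypothetical spec template for the "webhook handler" class of feature.
function webhookHandlerTemplate(feature: string): FeatureSpec {
  return {
    feature,
    negativeCases: [
      { question: "Which state transitions are illegal?", answer: "" },
      { question: "What is the behaviour on duplicate delivery?", answer: "" },
      { question: "What happens when the bank returns an unknown status?", answer: "" },
    ],
  };
}

// Returns a copy of the spec with answers filled in, in question order.
function withAnswers(spec: FeatureSpec, answers: string[]): FeatureSpec {
  return {
    feature: spec.feature,
    negativeCases: spec.negativeCases.map((c, i) => ({
      question: c.question,
      answer: answers[i] || "",
    })),
  };
}

function unansweredQuestions(spec: FeatureSpec): string[] {
  const missing: string[] = [];
  for (const c of spec.negativeCases) {
    if (c.answer.trim() === "") {
      missing.push(c.question);
    }
  }
  return missing;
}

// The gate: no implementation (and no agent prompt) until every case is answered.
function readyForImplementation(spec: FeatureSpec): boolean {
  return unansweredQuestions(spec).length === 0;
}

// Each answered case becomes the title of a test written before the code exists.
function failingTestTitles(spec: FeatureSpec): string[] {
  return spec.negativeCases.map((c) => c.question + " => " + c.answer);
}

const draft = webhookHandlerTemplate("refund webhook");
const answered = withAnswers(draft, [
  "refunded -> pending must throw",
  "second delivery of the same event id is a no-op",
  "reject the event and alert, never guess a status",
]);

console.log(readyForImplementation(draft)); // false
console.log(readyForImplementation(answered)); // true
```

A new scar would then be one more question added to the template, so the next feature of that class starts with the guard already required.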

arun rajkumar • Jun 2

This is one of the sharpest comments on this thread. The insight about when in the lifecycle you capture the negative cases changes everything. We've been moving towards exactly this — writing the impossible transitions, the idempotency requirements, and the failure modes into a spec before the agent gets prompted. The spec becomes the acceptance criteria. The agent has to satisfy it. The senior reviews ten lines of spec instead of hunting through hundreds of lines of diff. You're right that this doesn't make the 20% disappear — it makes it cheaper to enforce.

Adam Lewis • May 31

Daniel, the ordering is right. I really like the idea of a spec template per type of feature. Writing the acceptance criteria up front means the agent has something to check itself against, and a senior can review the spec instead of looking for what's missing in the diff. The other thing is that the spec ends up being what the agent has to satisfy. If it's in the repo the agent reads it each time, and the same thing stops getting missed.

prickles.org/tenet/spec-first-exec...

Harjot Singh • May 30

The 80/20 split is the most useful frame for this whole debate. Agents crush the well-trodden 80% (CRUD, boilerplate, glue, standard patterns) because that's where training data is dense, and they faceplant on the 20% that needs system-level judgment, novel tradeoffs, and knowing what NOT to build.

The practical consequence people miss: you should route by that split. The 80% genuinely doesn't need your most expensive model or your most senior human - cheap model, light review. The 20% is where you spend both the premium model AND the senior's attention. Treating all code as equally hard is what makes AI coding feel either too expensive or too risky depending on which half you're looking at. Really well-argued piece - the "why we still need seniors" conclusion is the honest one.

arun rajkumar • Jun 2

This is exactly our approach. We use Sonnet for the routine 80% — API scaffolding, test stubs, boilerplate — and escalate to Opus with full architectural context for anything that touches payment logic. The cost difference is significant but the reliability difference is bigger. The routing isn't just about model choice though — it's about context. The 20% needs structured context files that describe service boundaries, shared schemas, and constraint rules. Without that, even the best model drifts.

aim dcap • May 30

The scary part is that the missing 20% is rarely about syntax.

It's usually about security, architecture, trust, governance, and long-term maintainability.

I've seen AI-generated code pass tests while still containing eval(), shell=True, and SQL injection patterns.

The code worked.

The risk was hidden.

That's why senior engineers are still essential.
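A check for the three patterns mentioned above can be as simple as a line scanner run on every generated diff. This is a deliberately naive sketch: the rule names and regular expressions are assumptions, and regex matching is no substitute for a real static analyzer (it will miss f-strings, aliased calls, and much else).

```typescript
interface Finding {
  line: number;
  rule: string;
}

// Illustrative rules only; each pattern is a rough heuristic.
const RULES: { rule: string; pattern: RegExp }[] = [
  { rule: "eval-call", pattern: /\beval\s*\(/ },
  { rule: "shell-true", pattern: /shell\s*=\s*True/ },
  // SQL keyword followed later on the same line by a quote and a "+":
  // a crude signal of query building by string concatenation.
  { rule: "sql-string-concat", pattern: /(SELECT|INSERT|UPDATE|DELETE)\b[^\n]*["']\s*\+/i },
];

// Returns one finding per (line, rule) match, in source order.
function scan(source: string): Finding[] {
  const findings: Finding[] = [];
  const lines = source.split("\n");
  lines.forEach((text, index) => {
    for (const r of RULES) {
      if (r.pattern.test(text)) {
        findings.push({ line: index + 1, rule: r.rule });
      }
    }
  });
  return findings;
}

// Made-up sample of generated code that would pass a happy-path test.
const generated = [
  "import subprocess",
  "subprocess.run(cmd, shell=True)",
  "result = eval(user_input)",
  'query = "SELECT * FROM users WHERE id = " + user_id',
].join("\n");

for (const f of scan(generated)) {
  console.log("line " + f.line + ": " + f.rule);
}
// line 2: shell-true
// line 3: eval-call
// line 4: sql-string-concat
```

The point is where the check runs: on every diff, before review, so green tests alone cannot carry these patterns to production.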

arun rajkumar • Jun 3

"The code worked. The risk was hidden." That's the scariest sentence in AI-assisted development. In payments, we use Zod schemas and typed state machines specifically because they reject impossible transitions at compile time — before the code even runs. Tests tell you if something broke. Types and schemas tell you if something should never have been allowed in the first place. The security surface you're describing is real and underappreciated. Green tests are not the same as safe code.

Dk Bk • May 31 • Edited

I use AI for coding and don't understand any of it at first. But then I ask another AI platform to explain it to me as if I were 12 years old, and then I understand it thoroughly, edit it, and make it wonderful. After that I let yet another AI refine it and get confirmation from a different one that it's OK. Here is what I've realised at 43: you have to understand what you are building, specifically, and apply real-life experience to it. I have built code this way that people download from GitHub. You need experience and understanding of the subject, and you need to question why. Well, that's how I am doing it. In 3 weeks I have learnt so much that I question all the code these AI platforms produce. I don't really know how to read it, but I treat it as the story of a problem that needs serious thinking and solving. It has to have a spiritual side to it, like music, art, and culture. It's just like reading a novel. Anyway, coding jobs will go away within 5 years. What will remain is the experience that shapes code.

Eugene Maiorov • May 29

I totally agree with this. That last 20% is always where the AI messes up, because it doesn't actually understand how your whole system works together. It just guesses.

As a senior dev, I realized my job now is setting strict boundaries. Instead of letting the AI write that tricky 20% from scratch, I write the core logic myself and turn it into a strict tool for the AI to use.

Managing the servers for all those little tools got annoying fast, though. I started routing them through Vectoralix so I could just spin up secure tool endpoints without doing all the boring setup work. When you force the AI to use your exact tools instead of letting it guess the logic, that missing 20% becomes a lot easier to manage.

Dhruv Patil • May 29

This 80% vs 20% framing is what I’ve been trying to articulate.

AI can generate the routine parts insanely fast, but the dangerous part is that the remaining 20% often still looks “done” from the outside. Clean code, green tests, confident explanation — but missing the domain judgment around illegal transitions, idempotency, edge cases, and blast radius.

I’ve been thinking about this from the hiring/student side: if engineers are becoming AI orchestrators, how do we prove someone has that judgment before they join a team?

GitHub shows output. LeetCode shows DSA. Kaggle shows ML. But what shows that someone can guide AI, catch hallucinations, and review generated code properly?

I wrote a related post from that angle here: dev.to/dhurv_in_space/are-we-actually-learning-to-code-with-ai-or-just-generating-more-code-faster-ce0

Would be curious how you’d measure that 20% in hiring. Scar tests? Walkthroughs? Prompt logs? Something else?

arun rajkumar • Jun 2

Great question. We've actually changed our hiring process because of this exact problem. We stopped asking candidates to write code from scratch in interviews. Instead, we give them AI-generated code with deliberate blind spots — a webhook handler missing idempotency, a payment flow with an illegal state transition — and ask them to review it. The ones who spot what's missing are the ones who have the judgment. We also do walkthroughs where the candidate explains why they'd reject an AI-generated PR, not just what they'd change. The "why" reveals whether they understand the downstream consequences or are just pattern-matching.

João Gabriel Sabedra Vieira • May 28

Interesting perspective. As someone learning to code right now, this actually motivates me to focus on the fundamentals instead of relying only on AI. Out of curiosity, what kind of problems usually fall into that 20% that still needs a senior?

arun rajkumar • May 29

Great question. The 20% that still needs a senior usually falls into a few buckets:

State transitions that should be illegal — like a payment going from "refunded" back to "pending". AI will happily write the code because it looks syntactically fine.
Failure handling with real consequences — what happens when a bank API times out mid-transaction? The retry logic, idempotency, and partial rollback paths need someone who's been woken up at 3am by that exact scenario.
Architecture decisions that compound — choosing the wrong data model or service boundary today means pain for the next 2 years.
Security and compliance — AI doesn't understand regulatory context. It'll store things it shouldn't or expose endpoints it shouldn't.

The good news for you: learning these fundamentals now is the best investment. AI makes the easy stuff free. The hard stuff becomes even more valuable.

Aditya Mitra • Jun 3 • Edited

This is it. I still treat Opus 4.8 as a junior engineer.
AI is sub-optimal.
We need to use our 🧠

arun rajkumar • Jun 6

Junior engineer is the right mental model — brilliant on the routine, needs a senior on anything with consequences. I'd push back on "sub-optimal" though: it's optimal at exactly the 80% you point it at. The mistake is handing it the 20% it can't own. Aim it well and it's the best junior you'll ever hire.