Everyone keeps saying AI will replace developers.
Meanwhile I was sitting at 3:17 AM staring at a bug created by code that looked perfectly correct.
Classic.
The weird thing about AI coding tools is this:
They are insanely good at getting you from zero to eighty percent.
The remaining twenty percent?
That part turns into a psychological thriller.
The Illusion of Fast Development
You ask the AI for a feature.
It generates:
- components
- hooks
- API calls
- utility functions
- TypeScript types
- enough confidence to destroy your weekend
And for a moment, you feel unstoppable.
Then reality walks in wearing steel toe boots.
Because the code works.
Until it doesn't.
And when it breaks, you're debugging logic written by someone who technically does not exist.
Which is honestly rude.
The Real Problem Isn't The AI
The issue is not bad syntax.
The issue is missing context.
AI understands patterns.
Senior developers understand consequences.
That difference matters a lot.
Especially when:
- state management gets messy
- async logic starts fighting itself
- edge cases crawl out of the sewer
- one innocent refactor nukes three unrelated features
The AI-generated code looked clean.
Too clean.
Like a serial killer's apartment.
The Most Dangerous Thing AI Produces
Confident nonsense.
Not broken nonsense.
Not obvious nonsense.
Confident nonsense.
The kind where you read it and think:
"Damn this looks smart."
Then four hours later you discover the function has been emotionally gaslighting your database the entire time.
What I Learned After Fighting AI Generated Code
1. Fast code is not maintainable code
AI optimizes for completion.
You still have to optimize for:
- readability
- architecture
- scalability
- future you not having a breakdown
Because future you will absolutely file complaints against present you.
2. Tests matter more now
If AI writes code faster, you need validation faster.
Otherwise you're basically accepting pull requests from a caffeinated intern that never sleeps.
Which sounds impressive until production catches fire.
3. Senior thinking matters more than ever
AI can generate solutions.
It cannot reliably judge tradeoffs.
It does not understand:
- business risk
- performance bottlenecks
- system boundaries
- long term technical debt
It just vibes aggressively.
The Funny Part?
I still use AI every single day.
Because despite all the chaos, it is useful.
It removes boring work.
It speeds up experimentation.
It helps unblock momentum.
But I stopped treating it like an engineer.
Now I treat it like:
an extremely fast junior developer with infinite confidence and occasional hallucinations
Which honestly makes the experience much healthier.
Final Thought
AI is not replacing developers.
It's exposing who actually understands software engineering.
Because generating code was never the hard part.
Understanding why the code should exist in the first place?
That's the real game.
And unfortunately, the AI still can't survive a production bug at 2 AM with three stakeholders breathing down its neck.
Lucky us.
Top comments (69)
ran into this building my PM agent stack - AI gets you to 80% in minutes, that last 20% is brutal. I now treat AI-written code like an intern PR: approve it, but budget the review time.
"Approve it, but budget the review time" might be the most practical AI coding advice I've heard. The speed is incredible, but the review phase is where you find out whether you saved time or just borrowed it from your future self.
the borrowed-from-future-self framing is exactly right - and the debt lands in the places the AI cannot see: state mutations, side effects, anything that depends on context outside the file. the 80% is clean, the 20% is where it was quietly making assumptions about everything it did not ask about.
Couldn't agree more. The scary part is that the code often looks completely reasonable at first glance. Then you start tracing actual application state, external dependencies, and edge cases, and you realize it was making a bunch of silent assumptions the whole time. AI is great at filling in blanks, but software bugs usually live in the blanks that should never have been filled in without asking. That's why the final stretch isn't really debugging code, it's debugging context.
'blanks that should never be filled without asking' is the exact line. what makes it worse is the confidence level — the model writes the silent assumption the same way it writes the obvious thing, no hedging, no flag. the review pass that catches it has to be specifically hunting hidden state dependencies, not just reading for syntax. most review isn't running at that level of skepticism.
Exactly. The confidence is what makes it dangerous. A human developer will often leave clues when they're unsure: a comment, a TODO, a weird variable name, or they'll simply ask a question. The model doesn't really have that instinct. An assumption about a critical state transition gets written with the same confidence as a string formatting function.
That's why AI code review can't just be "does this code look correct?" It has to be "what assumptions is this code making that aren't stated anywhere?" The failure mode isn't usually bad syntax or broken logic. It's unstated dependencies, missing business rules, race conditions, and state that exists outside the current context window.
What's interesting is that AI is pushing reviews toward a different skill set. The valuable reviewer isn't the person spotting a missing semicolon anymore. It's the person asking, "What happens if this service returns stale data?", "Who else mutates this state?", or "Where did this assumption come from?" That's the level where most of the expensive bugs are hiding.
the asymmetry matters: a human's 'weird variable name' is a free signal that something's uncertain. model output has no such tell — every line reads equally confident. review can't be passive reading, it has to hunt state dependencies specifically.
That's exactly the trap. Human code often leaks uncertainty in useful ways. You see a sketchy abstraction, an awkward comment, or a function that's suspiciously longer than it should be, and your reviewer instincts start firing. Those imperfections are signals.
AI tends to smooth all of those signals away. The questionable assumption and the obviously correct logic arrive with the same clean formatting, naming, and confidence. The code looks finished before it's understood.
That's why reviewing AI-generated code is closer to an investigation than a reading exercise. The question isn't "Does this look right?" It's "What context would have to be true for this to be right?" Once you start reviewing through that lens, you end up tracing ownership of state, side effects, concurrency boundaries, and business rules rather than just reading line by line. That's where the hidden assumptions usually surface.
yeah the smooth-to-uniform thing is the core problem. review instincts built on texture have nothing to grip. i've been experimenting with asking agents to flag their own uncertainty in inline comments - helps a bit, but you're trusting the same model to know what it does not know. circular.
Exactly. The moment the model is responsible for declaring its own uncertainty, you've created a self-auditing system. Sometimes that's useful, but it's fundamentally limited because the most dangerous failures are the ones the model doesn't recognize as uncertain in the first place.
A human says, "I'm not sure about this part." A model says, "This pattern usually works," and may never realize the current situation is the exception.
What's interesting is that uncertainty markers still have value, just not as a trust signal. They're more like a heatmap. If the model flags something, inspect it. If it doesn't flag something, inspect it anyway. I've started thinking that the better signal isn't model-reported uncertainty but context exposure. How many files did it see? What assumptions did it import from prior prompts? Which external systems does this change touch? The less context available, the more skepticism the output deserves, regardless of how confident or uncertain it claims to be.
The irony is that humans are bad at implementation but relatively good at recognizing uncertainty. Models are often good at implementation but relatively bad at recognizing the boundaries of their own knowledge. Put those together carelessly and you get overconfidence. Put them together with diagnostics and you get leverage.
yeah, but predictable failure modes are still catchable. the model doesn't recognize uncertainty, but teams who've run enough cycles can build a lookup table for where that model's confidence is systematically miscalibrated. the real gap isn't detection, it's that most teams never build the table in the first place.
Exactly. The strongest teams do not really trust or distrust the model; they learn its failure profile. After enough iterations, they build a mental lookup table for where confidence tends to diverge from correctness. State ownership, async workflows, security boundaries, business logic with hidden exceptions, and cross file refactors all receive extra scrutiny. The interesting part is that these failure modes are often consistent across repositories, which means they can be turned into review habits instead of being rediscovered every time. The real issue is not that AI failures are impossible to detect. It is that most teams never invest the time to systematically map where the model is predictably wrong, so they end up reviewing each change as if they have learned nothing from the previous hundred.
the hidden-exceptions category is the hardest to keep current — the others have test coverage proxies or static analysis you can lean on. undocumented business logic shifts as the product changes, so your lookup table for that column goes stale fastest.
That's exactly why undocumented business logic is such a persistent source of bugs. State boundaries, concurrency issues, and security rules tend to have recognizable patterns, so you can build tooling, tests, and review checklists around them. Business logic is different because it's effectively institutional memory encoded in behavior. The rule isn't "if X then Y"; it's "if X then Y, except for these six customers, this legacy workflow, and the thing we changed last quarter but never documented." AI struggles with it because the context usually isn't in the code, and humans struggle with it because the context keeps moving. The lookup table never converges. Every product decision can invalidate part of it. That's why the most valuable artifact often isn't the implementation or even the test suite, it's a living record of the assumptions behind the rules. Once those assumptions disappear, both humans and models start confidently preserving behavior that nobody actually intended to keep.
agree on the tooling gap, and the handover problem makes it worse — business logic exceptions often live in one person’s head, and when they leave, the exception becomes undocumented overnight. at least a concurrency bug can be reproduced in staging. a business rule exception that was added for a specific client edge case in 2019 might not surface until an agent touches that workflow five years later with no idea to check for it.
That's what makes business logic drift uniquely dangerous. Most technical bugs leave evidence. A race condition throws intermittent failures. A performance regression shows up in metrics. A memory leak eventually trips an alarm. But an undocumented business exception can sit quietly for years because, from the system's perspective, everything is working exactly as coded. The only place the rule exists is in organizational memory. Then someone leaves, the context disappears, and what was once an intentional exception starts looking like dead code or unnecessary complexity. Five years later an engineer or an AI agent "cleans it up," all the tests pass, and suddenly a single client workflow breaks because the only documentation was a conversation that happened in 2019. In a lot of legacy systems, the biggest source of truth isn't the codebase or the docs. It's the set of assumptions that haven't been forgotten yet.
yeah the 'no evidence' part is the real killer. with a tech bug you at least have a stack trace to start from. with business logic drift you're just waiting for the one customer who hits the edge case that the original dev 'just knew' to avoid. i've found the only reliable surface for these is exit interviews, which is a pretty grim last line of defense.
That's what makes business logic drift feel more like archaeology than debugging. A technical failure leaves artifacts: logs, traces, metrics, exceptions. An undocumented business rule can disappear without leaving any evidence that it ever existed. The system keeps working, the tests stay green, and everyone assumes the behavior is intentional until the one customer who depends on that forgotten exception suddenly proves otherwise. The fact that exit interviews end up being one of the best sources for recovering this knowledge says a lot about the gap. We're often better at documenting code than documenting why the code exists. In a strange way, the highest risk asset in many systems isn't the source code, it's the unwritten context sitting in someone's head. Once that person leaves, the organization starts running on inherited assumptions, and neither humans nor AI can protect rules they don't know are there.
the "tests stay green" part is the most dangerous — it means CI validated the drift, not caught it. a test written from code behavior instead of requirement intent is a faithful witness to the wrong thing. writing tests from intent requires the intent to be written down somewhere first, which is exactly what drift erases.
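A minimal Python sketch of that difference. The discount rule, the `apply_discount` function, and the 100.00 threshold are all hypothetical, invented only to illustrate the point; they do not come from anyone's real codebase.

```python
from decimal import Decimal

# Hypothetical requirement (assumed for illustration): orders of 100.00 or more
# get 10% off. The implementation uses ">" instead of ">=", so an order of
# exactly 100.00 is charged full price.
def apply_discount(total: Decimal) -> Decimal:
    if total > Decimal("100.00"):  # bug: the boundary value is excluded
        return total * Decimal("0.90")
    return total

def behavior_derived_check() -> bool:
    # Written by observing what the code returns, so it records the bug as "expected".
    return apply_discount(Decimal("100.00")) == Decimal("100.00")

def intent_derived_check() -> bool:
    # Written from the stated requirement: 100.00 qualifies for the discount.
    return apply_discount(Decimal("100.00")) == Decimal("90.00")

print("behavior-derived check passes:", behavior_derived_check())
print("intent-derived check passes:", intent_derived_check())
```

The first check stays green forever and validates the drift. The second can only be written if the requirement was recorded somewhere other than the code.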
"Confident nonsense" is the perfect name for it. We see the exact same pattern in test automation — AI-generated test code that looks complete, has solid coverage numbers, but is testing something completely different from what it thinks it's testing.
The scariest part is that these tests pass. They pass cleanly. So you feel safe. Then the bug hits production, and when you trace back, you find that assert never actually touched the real edge case — it just ran the happy path and called it a day.
We started enforcing a rule: any AI-generated test has to include a one-line comment explaining what assumption it's validating, not what function it's testing. If even the model can't articulate the assumption, the test doesn't make it into the PR.
I like that rule because it shifts the focus from coverage to intent. A passing test only proves that reality matched the assumptions encoded in the test. If those assumptions are wrong, you get a green checkmark attached to a false sense of security. That's one of the nastier AI failure modes: it can generate tests that perfectly validate the implementation it just invented. The code and the test agree with each other, but neither agrees with the actual requirement. Requiring an explicit statement of the assumption forces the conversation up a level from "does this function return the expected value?" to "what property of the system are we claiming remains true?" In a way, that comment becomes more valuable than the test itself because it's the only place where intent is made visible instead of inferred.
And here's the part that makes it worse — the model generating the tests and the model writing the code are the same brain. They share the same blind spots. If the code misses an edge case, the test misses the exact same one, because both came from the same understanding of the problem. It's not "right code, wrong test." It's two things agreeing with each other on a shared misunderstanding.
Your point about the comment being more valuable than the test — I'd push it one step further. If the model that wrote the test can't clearly state what assumption it's validating, that test shouldn't exist. No test is better than a bad test, because at least you won't be fooled by the green checkmark.
I think that's the deeper failure mode. We often treat tests as an independent verification layer, but when the same model generates both the implementation and the test, they're not independent at all. They're two artifacts derived from the same mental model. If that model misunderstood a requirement, the code and test can reinforce each other perfectly while both being wrong. The result is a green build that measures consistency, not correctness. That's why I like your rule. Requiring the test to state the assumption it's validating forces it to expose the underlying model of the system. Once the assumption is visible, a human can challenge it. Without that step, you're effectively letting the model grade its own homework. And in practice, a bad test is often more dangerous than no test because it replaces uncertainty with misplaced confidence, which means the next reviewer is less likely to go looking for the bug in the first place.
Exactly. And the organizational side amplifies it — once the build is green, the incentive to keep looking vanishes. A failing test forces a conversation. A passing test closes it. The model didn't just generate the test. It generated the permission to stop thinking. That's the real danger — not the bug itself, but the false all-clear that follows it.
That's a really important distinction. Bugs are recoverable. False confidence is what lets them spread. A failing test creates friction, discussion, and investigation. A passing test creates closure. The danger isn't that the model wrote incorrect code or even an incorrect test. It's that it produced enough evidence to satisfy the process without actually validating the assumption. Once the dashboard is green, the PR is approved, and the deployment succeeds, the organization shifts from "prove this is correct" to "assume this is correct." The test becomes less of a verification tool and more of a social signal that the thinking has already been done. In that sense, the most expensive AI failure mode isn't hallucinated code. It's hallucinated certainty.
"Hallucinated certainty" — I'm stealing that one, heads up 😂
You're right though. A red test is noisy but honest. A green test is silent, and that silence is the dangerous part — it convinces you the thinking is done. The most expensive bugs I've seen weren't wrong code. They were the kind that passed every check and surfaced two weeks into production. By then it's not about fixing a line anymore. It's about rebuilding the entire chain of trust.
"Prove this is correct" → "Assume this is correct." That shift is going into something I'm writing. When it does, you'll know where it came from 👀
Steal away. 😂 I think "hallucinated certainty" describes the failure mode better than "hallucinated code" ever did. Most production incidents aren't caused by code that obviously looks wrong. They're caused by code, tests, reviews, and deployment checks all agreeing with the same flawed assumption. The bug survives because every layer validated the artifact instead of the premise. That's why a green pipeline can sometimes be more dangerous than a red one. A red pipeline demands attention. A green pipeline grants permission. Once that permission is given, the organization naturally stops asking questions and starts spending trust. Then two weeks later a customer finds the edge case and suddenly you're not debugging a function, you're auditing every decision that convinced people the function was correct in the first place. That's the part AI has made more important, not less: verification of assumptions, not verification of code.
"A green pipeline grants permission" — that line deserves to be printed on every CI machine.
The worst incidents I've seen all follow the same pattern: every gate went green. Green on "code passes tests" — not green on "the problem is solved." The most dangerous line in any code review is LGTM. Not because the code is fine, but because the reviewer and the author fell into the same cognitive trap.
AI makes that trap deeper. AI-generated code tends to look cleaner. No mixed tabs and spaces. No magic numbers. No obvious mistakes. So a reviewer is more likely to say LGTM. But clean errors are still errors — and harder to spot.
You're right: the problem isn't verifying code. It's verifying assumptions. And there's no green pipeline for that yet.
I think that's the uncomfortable truth underneath all of this. Our tooling is incredibly good at verifying artifacts and surprisingly bad at verifying intent. CI can tell you whether the code compiles, tests pass, types match, dependencies resolve, and performance stays within a threshold. What it can't tell you is whether everyone involved is operating from the same understanding of the problem. AI amplifies that gap because it produces code that is syntactically clean and structurally familiar, which makes it easier for reviewers to substitute pattern recognition for actual validation. A green pipeline proves consistency with the checks you wrote. It does not prove correctness with respect to reality. In a lot of postmortems, the root cause isn't that a safeguard failed. It's that nobody realized the critical assumption was never being checked in the first place. The industry has spent decades building systems to catch implementation errors. The next challenge might be building systems that continuously surface and challenge assumptions before they harden into accepted truth.
Man, that phrase 'debugging logic written by someone who technically does not exist' hits way too close to home.
You perfectly captured the great illusion of modern development. AI is like a hyper-caffeinated junior dev who has memorized every textbook but has never actually survived a production outage. It gives us that massive dopamine hit of writing 80% of the feature in 5 minutes, only to quietly trap us in a 6-hour psychological thriller over a single missing edge case.
Your point about missing context vs. senior consequences is spot on. 'Vibe coding' is fun until the vibe turns into an unintended 3 AM shift. Definitely treating my AI tools like an eager intern from now on. Great write-up!
That's exactly the tradeoff I've been noticing. The speed boost is real, but so is the cost of understanding what was generated. The best results I've had come from treating AI as a collaborator rather than an authority. The moment I start trusting it blindly, that's usually when the psychological thriller begins.
Yeah, AI does code alright, but it also produces CVEs and issues in code expanding and things. We know that AI CANNOT THINK ON ITS OWN; it can requery itself, but it CANNOT THINK like human beings, so we expect some errors from it, and we are more than happy to tell it to:
FIX THIS NOW! And overall, AI is making us dull in coding!
That's a fair concern. AI is great at generating code, but it's not great at owning the consequences of that code. Security flaws, hidden bugs, and design tradeoffs still require human judgment. I think the challenge is making sure AI amplifies our skills rather than replacing the need to use them.
This is true. I just hope more people find this post and become more proactive when it comes to using AI. Not challenging it or using your own brain will make debugging harder in the future.
Exactly. AI is at its best when it's challenged, not blindly trusted. The faster it generates code, the more important it becomes to understand the assumptions behind that code. Otherwise we're just trading time spent writing bugs for time spent debugging them.
"Confident nonsense" – that's the perfect phrase for it
I felt this entire post in my bones. Especially the part about debugging AI‑generated code at 3 AM, wondering if you're being gaslit by a language model.
The thing that caught me off guard was exactly what you described: the code looks beautiful. Clean types, sensible variable names, good structure. Then somewhere deep in an async callback, there's an off‑by‑one error or a race condition that only appears in production under specific load. Good luck finding that.
Your point about tests being more important now is spot on. I've started treating AI‑generated code like I treat code from a new junior dev: review everything, test thoroughly, and never trust it blindly.
The shift from "AI as engineer" to "AI as fast junior with infinite confidence" is exactly right. It's a tool, not a teammate. And like any powerful tool, it can hurt you if you forget how it works under the hood.
Thanks for writing this – it's a good reality check for anyone who thinks the "80% in 2 minutes" means the last 20% will also be fast.
Cheers,
Jack
DEV.to/ggle.in
Appreciate that, Jack. The "fast junior with infinite confidence" analogy keeps feeling more accurate the longer I use these tools. What worries me now isn't the obvious bug anymore, it's the polished bug. The code is clean enough that your review brain relaxes, and that's exactly when hidden assumptions slip through. I've started noticing that the most expensive failures aren't syntax errors or broken logic, they're cases where the model made a reasonable assumption that nobody challenged because everything looked professional. The real skill seems to be shifting from reading code to interrogating assumptions. Not "does this work?" but "what would have to be true for this to work?" That's usually where the interesting bugs are hiding.
That line hit me: "what would have to be true for this to work?"
You just articulated the shift I couldn't name. Reading code asks "does this work?" — which the AI usually passes because it looks like it works. Interrogating assumptions asks "what hidden conditions is this code silently depending on?" That's where the real bugs live.
The polished bug is indeed scarier than the obvious one. Obvious bugs get caught in code review. Polished bugs get merged, deployed, and then surface at 2 AM when a specific edge case finally triggers that reasonable-but-wrong assumption.
I've started keeping a small "assumption log" during PR reviews — not just what the code does, but what the code believes about the world (timing, state, data shape, ordering). It's been surprisingly helpful.
Thanks again for this thread. One of the most valuable conversations I've had on here.
Cheers,
Jack
dev.to/ggle_in
This lands hard.
The “zero to eighty percent” part is exactly the trap. AI can produce a lot of plausible code very quickly, but plausible code is not the same as a coherent repo.
The expensive part usually shows up later, when you have to ask:
Why does this file exist?
What contract was this function supposed to preserve?
Which layer owns this state?
Did this test prove behavior, or just prove the patch?
Did the AI fix the symptom while quietly moving the drift somewhere else?
That’s the part I’ve been working on with Scarab Diagnostic Suite: not replacing the coding agent, but putting a diagnostic layer around it so the repo can prove what is still true after the agent is done.
AI can absolutely speed up implementation. But without diagnostics, it can also speed up entropy.
I think that's the distinction a lot of people are missing. The bottleneck is no longer generating code, it's preserving understanding.
A repo can survive mediocre code. It struggles to survive lost intent.
The questions you listed are exactly the ones AI is weakest at because they're not local code questions, they're system questions. The model can tell you how a function works. It usually can't tell you why that function exists, what invariant it's protecting, or whether the change shifted complexity into a different part of the system.
That's why I like the idea of a diagnostic layer. The real value isn't checking whether the agent wrote valid code, it's checking whether the repository still obeys the contracts and assumptions that existed before the change. In a world where code generation is cheap, proof becomes expensive.
AI doesn't just accelerate implementation. It accelerates change. Diagnostics are what stop accelerated change from becoming accelerated decay.
The 6-hour-on-one-line pattern shows up almost everywhere AI-generated code lands in legacy systems. The expensive part isn't the line. It's that you can't read intent from variable names the agent picked from training data instead of from a deliberate design choice. Debugging time scales with how much intent you have to reverse-engineer.
That's a great observation. The bug itself is often the easy part. The hard part is reconstructing the reasoning behind the code when there was never any real reasoning to begin with. When intent isn't obvious, every debugging session turns into an archaeology project.
the "debugging logic written by someone who technically does not exist" part is the real experience. we hit this hardest on async flows — AI would generate code that looked correct, handled the happy path fine, and silently swallowed errors on retry logic. no exception, no obvious failure. just incorrect final state.
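A minimal sketch of that failure mode, written as synchronous Python for brevity even though the case described above was async. `TransientError`, `retry_swallowing`, `retry_surfacing`, and `always_fails` are hypothetical names made up for the example.

```python
import time

class TransientError(Exception):
    """Hypothetical error type standing in for a flaky upstream call."""

def retry_swallowing(operation, attempts: int = 3):
    # Anti-pattern: every failure is caught and discarded. The caller gets None
    # and cannot tell "no result" apart from "all attempts failed".
    for _ in range(attempts):
        try:
            return operation()
        except Exception:
            pass
    return None

def retry_surfacing(operation, attempts: int = 3, delay: float = 0.0):
    # Retries only the error it expects and raises once attempts run out,
    # so the failure stays visible to the caller.
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error

def always_fails():
    raise TransientError("upstream unavailable")

print(retry_swallowing(always_fails))  # None, with no hint that anything went wrong
try:
    retry_surfacing(always_fails)
except RuntimeError as err:
    print("surfaced:", err)
```

Both versions look tidy in a diff. Only the second one leaves the caller a failure to act on.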
mental shift that helped: stop reading AI code top to bottom and start asking "what does this fail to handle." forces you to think about consequences instead of patterns.
started requiring AI to generate test cases before the implementation. catches a lot of the confident nonsense before it makes it into a PR.
what's the most expensive AI generated bug you've had to find?
I really like the shift from "what does this do?" to "what does this fail to handle?" That feels much closer to how experienced engineers review code in general.
The async/retry examples are especially painful because everything appears to work until some edge case quietly corrupts the final state. Those are the bugs that consume entire afternoons.
As for the most expensive one, it wasn't a crash—it was a piece of generated logic that looked perfectly reasonable but made an incorrect assumption about state updates. Nothing failed, no errors were thrown, the data was just subtly wrong. Those are always the worst because you spend hours proving everything else isn't broken first.
the "data just wrong, no error thrown" category ages badly — you only find it when something downstream breaks in a way that does not point back. we added assertions after key state mutations not to catch bugs, just to make wrong state visible before it travels 3 hops and becomes untraceable.
state assumption bugs in AI code trace to one root: model saw the happy path shape and wrote for that. do you write assertions defensively now, or still mostly at review time?
More defensively now. The bugs that worry me most are not crashes, they're silent state corruption. If something throws immediately, at least you have a starting point. If invalid state survives three service boundaries and shows up as a weird analytics discrepancy a week later, you've basically started a forensic investigation. Assertions after critical state transitions have become less about catching programmer mistakes and more about enforcing invariants while the causal chain is still visible. That's especially true with AI-generated code because it tends to optimize for the happy path shape of the problem. The implementation often looks reasonable, but hidden assumptions about state validity, ordering, or ownership slip through. I've found that the highest ROI assertions aren't around inputs and outputs, they're around the moments where state changes hands. That's where "this should never happen" becomes "this happened six hours ago and now nobody knows why."
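A rough Python sketch of an invariant check placed at a state handoff. The `Order` fields, the `mark_shipped` transition, and the two invariants are assumptions chosen for illustration, not rules from a real system.

```python
from dataclasses import dataclass

@dataclass
class Order:
    # Hypothetical state object; the fields are assumptions for illustration.
    status: str
    paid_cents: int
    total_cents: int

class InvariantViolation(Exception):
    pass

def check_order_invariants(order: Order) -> None:
    # Raise explicitly instead of using `assert`, which Python strips under -O.
    if order.status == "shipped" and order.paid_cents < order.total_cents:
        raise InvariantViolation(f"shipped order is underpaid: {order}")
    if order.paid_cents < 0:
        raise InvariantViolation(f"negative payment recorded: {order}")

def mark_shipped(order: Order) -> Order:
    updated = Order("shipped", order.paid_cents, order.total_cents)
    check_order_invariants(updated)  # fail here, before the state is handed on
    return updated

mark_shipped(Order("paid", 5000, 5000))  # valid transition, no error
try:
    mark_shipped(Order("pending", 0, 5000))
except InvariantViolation as err:
    print("caught at the handoff:", err)
```

The check runs at the moment the state changes hands, so a violation points at the transition that caused it and not at a symptom three services later.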
AI wrote me a button component last week. Four state managers fighting each other inside 80 lines of code. I don't even know what a state manager is. Spent three hours staring at it before just deleting everything and writing the world's ugliest button from scratch. It works though.
"The world's ugliest button" is exactly how half of software engineering breakthroughs happen. Sometimes deleting 80 lines of clever code and replacing it with 10 lines you actually understand is the senior-engineer move.