DEV Community

I Spent 10x Longer Debugging AI Code Than Writing It

Harsh on May 28, 2026

AI wrote the code in 30 seconds. Three lines. A simple function. I prompted, it generated, I copied. It looked fine. Clean syntax. Good variable names. No ...

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • Edited

I do hate that sometimes. There were cases where I had to undo everything and re-prompt, and for some reason it then got it right. On rare occasions I debug AI-generated code when the fix is small, but when it's big, I just undo what the AI generated and re-prompt.

Inefficient, yea. Gonna change that habit since I don't want to lose my skills.

Good work Harsh! Thanks for sharing :)

Harsh

Francis, "undo and re-prompt": I've been there too many times. It feels faster. But you're right, it's inefficient. And worse, you're not learning, just gambling. "Gonna change that habit": that's the part that matters. Not debugging faster. Debugging yourself.

What helped me: 30 seconds of reading before re-prompting. It catches what a blind re-prompt misses.

Thanks for the honest share. 🙌

Daniel Balcarek

"It worked 99% of the time, the 1% crashed in production" is not just an AI problem. We’ve all seen bugs caused by developers implementing only the happy path long before AI existed.

Covering edge cases and exception scenarios usually comes with experience, often after seeing production crashes yourself.

But I totally agree with the main message: be careful with AI-generated code. Treat it as a draft, review it critically and always understand what you are pushing to production.

Alex • Edited

A large part of discovering edge cases and exceptions happens during implementation. There is no way that you can see all those details upfront when prompting.

You could argue that reviewing solves this. But understanding (~= justifying) and criticizing execution paths is not nearly the same as modeling them. So it seems almost impossible to be careful with AI-generated code at scale.

Daniel Balcarek

Thanks for the comment!

I partially disagree. A lot of issues are discovered during implementation, debugging, and code review, but many edge cases are only found during testing, and some unfortunately only appear in production. By edge cases, I mean those rare scenarios that can slip past both developers and testers.

Also, we usually don't review an entire application or multiple features at once. We review smaller changes, whether they were written by a colleague or generated by AI. In that sense, AI-generated code can be reviewed the same way.

That said, I agree with your broader concern. If developers start accepting large amounts of AI-generated code without fully understanding it, ownership becomes a real problem.

Reviewing code is one thing; being able to maintain and debug it six months later is another.

Alex • Edited

I got your point, it can be simplified to "humans make mistakes, AI makes mistakes, you need to review both anyway". But from my perspective, code review was never a very effective measure against bugs (understanding code is not enough), and with generated code it's arguably even worse because of the elevated mental overhead.

I make like 5x more mistakes than AI, but mostly it's something stupid and easy to fix once found (btw AI saves hours of debugging those). When working with generated code, it takes so much effort to ensure that the whole thing is not a mistake.

"Many edge cases are only found during testing."

To test the edge case, you often need to know that it exists. And to know that it exists, you often need to write the code yourself.

Maybe you are right: if your processes are aligned, the negative effects can be mostly mitigated. But probably with a modest (compared to industry expectations) overall productivity boost.

Daniel Balcarek

That's a fair point.

I think where we differ is that you see implementation itself as an important part of discovering edge cases, while I see developers as being primarily focused on solving the main problem and delivering the feature. Chasing perfect code is expensive and there always has to be a trade-off between development time and covering every possible scenario.

Many edge cases are subtle enough that they can be overlooked during implementation. Testers, on the other hand, often approach the system with a different mindset and are more likely to uncover them. In reality, it's probably a combination of implementation, testing, and production usage that reveals most edge cases.

And I absolutely agree that AI-generated code introduces additional mental overhead.

Alex • Edited

It's expensive to write code that is cheap to maintain. Obviously, there is a balance for each particular case. I don't think we differ here.

Sometimes it makes sense to rush a feature as fast as possible. And sometimes it's better for core functionality to operate flawlessly. There is no universal practice.

I didn't say btw that code should be perfect. But there are a lot of critical systems, where implementation matters. There are domains where testing is more expensive than development. Not everything is measured in features.

p4nd3m1c

Yes, you are right. Many wannabe coders just buy Codex or opencode tokens and start slamming the keyboard; they don't even know what memory is! And since AI is made FREE to use, many startups are being launched that are built totally by AI, and people who used to CODE THEMSELVES are being replaced by AI SLOP. We need to learn how to use AI better, not make it our god and Hallelujah!

AI will not take our jobs. It will destroy the world as we know it!

Harsh

I share the concern, not the "destroy the world" conclusion. "Wannabe coders don't know what memory is": real problem, and AI makes it dangerous. "People who coded themselves replaced by AI slop": valid worry. Senior devs use AI to ship more of what matters. They know what good looks like. "We need to use AI better, not make it our god": 100% agree. The problem isn't the tool. It's treating the tool as a replacement for thinking.

AI won't destroy the world. Over-trusting it without understanding will destroy systems, trust, and careers.

Thanks for the passionate take. 🙌

p4nd3m1c

btw, which AI agentic env do you use? and what would you tier as WORST of them? just a question.

Harsh

Used: Cursor, Cline, Continue, and vanilla Claude/GPT. Best: Cursor. It integrates well and has good context awareness. Worst? Not a specific tool. Any tool that hides so much that you can't see what assumptions it's making.

The worst is when you stop asking "why did it do that?"

What's been your experience? 🙌

p4nd3m1c • Edited

Oh, I see. You have tried many things, yeah.
I feel like every AI agentic env is good, unless it thinks for 30 straight minutes every time I ask for optimizations in my code!
Currently, I think opencode is best, but a few minutes ago I discovered InfoStealer logic in it, so I think I will not use it again. You see, opencode was committing my code to a private repo somewhere. I might be mistaken, but I am waiting for a reply on the issue I posted on GitHub (the opencode anomaly something account)! I think I am being stupid, but let's see!

TxDesk

Worst story: 13 gates of clean AI-generated code, 107 passing tests, one bug that survived all of it.

Last night I was shipping a frontend feature with per-edit review on every AI-generated file. Plan locked upfront. Strict review gates. Tests required green before each gate closed. By gate 13, we had a custom hook, 6 components, full component test coverage, typecheck clean. 107/107 tests passing. Looked like a clean ship.

Gate 14 was live verification. Page never resolved. Stuck on skeleton forever.

The hook was double-unwrapping a { data: T } envelope. The HTTP client already strips .data universally before returning. The hook then re-stripped, got undefined, page sat on if (!query.data) return <Skeleton />. Three components downstream, undefined propagating quietly, no exception.

Here's the part that fits your post but extends it: the bug wasn't AI making a bad assumption. The AI had executed the plan faithfully. The PLAN had a bad assumption. Plan section 0 said "the hook unwraps .data in queryFn", which was a misread of how the underlying HTTP client worked. AI implemented the plan correctly. Tests passed because the test mocks ALSO returned the wire-format envelope, the hook re-unwrapped them in tests, and the assertions held. Both layers had the same bug, so the closed loop verified itself.

Your four practices catch AI-introduced assumptions. They wouldn't have caught this one because the assumption wasn't in the AI's output. It was in the spec the AI was executing against.

The extension I'd add to your list: review the spec for invisible assumptions before letting AI execute on it. Specifically check that any "the layer below already does X" claim in the spec is verified, not asserted. In my case, "api.ts already unwraps .data" was the kind of claim that needed an actual one-line read of api.ts to verify, not a confident-sounding sentence in the plan.

Cost ratio: ~6 hours of correct AI-driven work to produce something that would have failed in production. Saved by 30 minutes of live verification on the real browser. Live verification is the version of your debugging-tax practice that scales when the AI is good enough that the bug isn't in the code anymore. It's in the contract between layers.

Fix was two-part: remove the double-unwrap in the hook, and strip the wire envelope from every test mock so they faithfully simulate the production layer. Both had to land together, neither alone would have worked.
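
The double-unwrap and the self-verifying mock can be reproduced in a few lines. This is only a sketch in Python rather than the original frontend code, and every name in it (`http_get`, `buggy_fetch_user`, `unfaithful_mock`, `faithful_mock`) is hypothetical, made up to mirror the story above:

```python
from typing import Any, Callable

USER = {"id": 7, "name": "Ada"}


def http_get(url: str) -> Any:
    """Hypothetical HTTP client that already strips the {"data": ...} envelope."""
    wire_response = {"data": dict(USER)}  # what the server sends over the wire
    return wire_response["data"]


def buggy_fetch_user(client: Callable[[str], Any]) -> Any:
    # Follows the flawed plan: unwrap "data" again, although the client did it.
    payload = client("/users/7")
    return payload.get("data")  # None against the real client, no exception


def fixed_fetch_user(client: Callable[[str], Any]) -> Any:
    # Trusts the client's contract: the payload is already unwrapped.
    return client("/users/7")


def unfaithful_mock(url: str) -> Any:
    # Returns the wire format, so it shares the plan's wrong assumption.
    return {"data": dict(USER)}


def faithful_mock(url: str) -> Any:
    # Returns what the real client returns: the already-unwrapped payload.
    return dict(USER)


# Closed loop: buggy code plus unfaithful mock looks correct.
mock_result = buggy_fetch_user(unfaithful_mock)
# Contact with the real layer: the same code quietly yields None.
live_result = buggy_fetch_user(http_get)
```

Here `mock_result` is the user dict while `live_result` is `None`, which is the "stuck on skeleton" symptom. Both halves of the fix show up as well: `fixed_fetch_user` only agrees with the test double once the mock is the faithful one.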

Harsh

TxDesk, this is the most important comment in the thread. Thank you for writing it up. "The bug wasn't AI making a bad assumption. The AI had executed the plan faithfully. The PLAN had a bad assumption": this is the next layer. Level 5 isn't just reviewing AI output. It's reviewing the instructions you gave the AI. The AI can be perfect and still fail if your plan was wrong. "Tests passed because the test mocks ALSO returned the wire-format envelope. Both layers had the same bug, so the closed loop verified itself": this is terrifying. The tests didn't catch the bug because the tests shared the bug. The closed loop validated itself. No error, because no difference. The extension: "review the spec for invisible assumptions before letting AI execute on it."

Yes. Not just review the code. Review the plan. The AI will do what you said, not what you meant. So the human's job isn't just code review, it's spec review. "Live verification is the version of your debugging-tax practice that scales when the bug isn't in the code anymore. It's in the contract between layers": this is the key insight. When the AI becomes good enough, the bugs won't be syntax or logic. They'll be mismatched assumptions between layers. And the only way to catch those is to run the code in the real environment. 6 hours of correct AI-driven work, saved by 30 minutes of live verification.

That's the ratio. The tax doesn't disappear. It just moves up the stack.

Thank you for this. It's the most valuable comment in the thread. 🙌

TxDesk

Glad it resonated. The deeper failure mode I keep seeing: the spec and the tests are usually written by the same person at the same time, so they encode the same mental model. The mock is the spec, the test is the mock, the code matches both. Three layers of consistency, zero contact with reality.

The only thing that breaks it is running against the actual external surface - a real wallet, a real RPC, a real protocol contract. Once the loop has a foreign node that doesn't share your mental model, mismatched assumptions surface immediately. Until then, you're just verifying that you're consistent with yourself.

Harsh

"Three layers of consistency, zero contact with reality": that's the line. The mock is the spec. The test is the mock. The code matches both. Perfect consistency. Perfect irrelevance. The only thing that breaks it is a foreign node that doesn't share your mental model.

Beautifully said. Thank you for this thread, TxDesk. 🙌

Ofri Peretz

The naming trap resonates more than the logic bugs do for me. A variable called data isn't just a readability annoyance — in security-sensitive paths it's a liability, because vague names let unsafe values flow further before anyone questions what they actually contain. I've started treating AI-generated variable names as a linting signal: if the name doesn't encode the domain (raw vs. validated, user-supplied vs. internal), I treat the code as unreviewed regardless of how clean the logic looks. The 30-second write / 5-hour debug ratio is real, but I'd frame the fix differently than "slow down at write time" — it's more about what static analysis you run before you trust the output, because the AI isn't going to tell you what it assumed.
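
One way to make names "encode the domain" is to give raw and validated values distinct types, so that a static checker can flag raw input flowing into code that expects validated input. A minimal sketch using `typing.NewType` follows. The type names, the validation rule and both functions are illustrative assumptions, not something from the comment, and the distinction is only enforced by a type checker such as mypy, not at runtime:

```python
from typing import NewType

# Distinct static types for the same underlying str (names are hypothetical).
RawUserInput = NewType("RawUserInput", str)
ValidatedUsername = NewType("ValidatedUsername", str)


def validate_username(raw_username: RawUserInput) -> ValidatedUsername:
    """Only this function is allowed to turn raw input into a validated name."""
    cleaned = raw_username.strip()
    # Assumed rule for the sketch: 1-32 alphanumeric characters.
    if not cleaned.isalnum() or len(cleaned) > 32:
        raise ValueError("username must be 1-32 alphanumeric characters")
    return ValidatedUsername(cleaned)


def build_greeting(username: ValidatedUsername) -> str:
    # The signature documents the state: callers must validate first.
    return f"Hello, {username}!"


greeting = build_greeting(validate_username(RawUserInput("  ada42 ")))
```

Passing a `RawUserInput` straight to `build_greeting` still runs, but a type checker reports it, which turns the naming rule into something a tool can enforce.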

Harsh

Ofri, "vague names let unsafe values flow further before anyone questions them": this is a security insight most people miss. A variable called data isn't just annoying. It's dangerous, because no one can tell at a glance whether it's been sanitized, validated, or still contains user input. The vagueness hides the risk. "If the name doesn't encode the domain (raw vs. validated, user-supplied vs. internal), treat the code as unreviewed."

That's a concrete rule. Not "use better names" but "if the name doesn't tell you the state, reject the code." And then: "what static analysis you run before you trust the output."

This is the frame shift. The article said slow down at write time. You're saying automate the checking so you don't have to slow down manually. Different approach, same goal: catch assumptions before they become production bugs.

The AI won't tell you what it assumed. So you need a system that checks for you.

Thank you for this: security lens, actionable rule, frame shift. Three wins in one comment. 🙌

Stoyan Minchev • Edited

I usually ask an AI to do a code review, in the best case in a new session with a different model. There are different code review approaches that can be used as well. Things like that happen not only with AI-generated code. How many times have you been in such a situation, but with code written by a human? The problem might not be in the AI, but in the process. ;)
And when things like that happen, this knowledge must be kept so that the AI doesn't do it again in the next session. ;)

Harsh

Stoyan, fair point. Humans cause these problems too. The difference isn't frequency, it's recovery. Human code has fingerprints: intent. AI code is smooth, with no intent to recover. "The problem might be in the process": agreed. The generate, copy, ship process skips the assumption check.

"Knowledge must be kept so AI doesn't repeat it": the hardest part. Humans learn from mistakes. AI doesn't, unless you explicitly save the lesson.

Thanks for this layer. 🙌

Stoyan Minchev

This hit me. Totally agree with you. AI can't learn by itself.

In all cases, we can't leave it unsupervised.

Sephyi • Edited

Are you doing TDD — having it write the test case first, then the implementation? I find that alone often makes things noticeably better. There's also the repeated-run technique (I forget what it's called) where you basically rerun the same prompt several times. I'd guarantee that even on the 10th run you'll still surface findings worth fixing.

Beyond that, I personally lean on dialectic verification: at least three separate models (e.g. Claude Opus, GPT-5.5, Gemini 3.1 Pro) each perform an independent, detailed review and produce a standardized report, which all get passed to a model of your choosing with fresh context that then synthesizes a unified final review. That said, I reserve this almost exclusively for large review runs at milestones, or after I've let an agent implement a plan. Lastly, always plan ahead. And after the review, instruct the agent to implement the plan accordingly.

If you're not confident reviewing the final report yourself, I'd suggest running each model 3× — giving you nine reports total. And then rating each finding by how many agents flagged it, so the unified report's findings table carries a confidence rating.
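
The rating step described here, counting how many reports flagged each finding, could look roughly like this. It is a sketch under a strong assumption: that findings from different models have already been normalized to identical labels, which in practice is the hard part. The function name and the sample labels are made up for illustration:

```python
from collections import Counter


def rate_findings(reports: list[list[str]]) -> list[tuple[str, int, float]]:
    """Return (finding, votes, confidence), most-flagged findings first."""
    if not reports:
        return []
    # set(report) makes each report count at most once per finding.
    votes = Counter(finding for report in reports for finding in set(report))
    total_reports = len(reports)
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [(finding, count, count / total_reports) for finding, count in ranked]


# Three hypothetical reports; a real run would have nine (3 models x 3 runs).
sample_reports = [
    ["missing-empty-check", "unvalidated-input"],
    ["missing-empty-check"],
    ["missing-empty-check", "race-on-retry", "race-on-retry"],
]
rated = rate_findings(sample_reports)
```

With these sample reports, `missing-empty-check` comes out on top with 3 of 3 votes, and the two single-report findings share the lower rating that marks them for manual investigation.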

Another thing I just remembered is to always establish concrete coding standards and architecture at the beginning of a project.

The best results with AI-generated content were definitely achieved when I followed all these practices.

Harsh

Sephyi, the repeated-run technique is underrated. Even the 10th run surfaces something new. The model isn't deterministic: same prompt, different output, different flaws. Dialectic verification (three separate models review independently, then a fourth synthesizes a unified review) is Level 3 (cross-model) on steroids. Not just "run another model": have them disagree, then synthesize the disagreement into signal. Nine reports total (3 models × 3 runs), rating findings by how many agents flagged them: this is the meta-layer. Consensus = confidence. Disagreement = investigation needed. The trade-off: this is expensive (time, tokens, attention). You reserve it for milestones, which is the right call. Not every PR needs nine reports.

"If you're not confident reviewing the final report yourself": this is the honest admission. The human still has to be the one who can review it. The system helps, but doesn't replace judgment.

Thanks for sharing this. It's the most sophisticated workflow in the thread. 🙌

Sephyi

No problem. ❤️ If you’d like, I can happily look up the Skill I use for this later in case you could use it. Essentially, it instructs Claude Code to use other tools besides its own native models, including Codex and Gemini CLI.

Andrii Krugliak

The 30-seconds-to-write, 5-hours-to-debug split is the number nobody puts on the slide. The quiet assumption that crashes at 1% is worse than a loud failure, because it ships looking fine. I've started treating "time until I trust it in prod" as the real cost, not write time.

Harsh

Andrii, "time until I trust it in prod": the real metric. Not write time. Trust time. "The quiet assumption that crashes at 1% is worse than a loud failure, because it ships looking fine": a loud failure gets caught. A quiet assumption looks like success until it doesn't. You shifted from speed to confidence. Not how fast to generate. How fast to trust.

Smartest reframe here. 🙌

Andrii Krugliak

Trust-time is measurable in a way write-time never forced us to be: how many times did it ship something that passed every check and still broke at 1 percent. We started logging that as its own number, separate from velocity, and it quietly changed which agents we let run unattended. The loud failures were never the ones that hurt.

Backrun • Edited

There's a step nobody talks about that happens right after the code is "done."

For non-technical users like marketers and solo founders, the debugging problem you described actually starts before debugging. It starts at deploy. AI gives them the HTML in 30 seconds. Then they spend the next 2 hours trying to figure out Netlify, GitHub Pages, FTP, or just... giving up and leaving the HTML sitting in the chat window.

That's literally why I built HTML Deployer, a Chrome extension that lets you deploy AI-generated HTML directly from your ChatGPT or Claude tab without touching a terminal. The "deploy tax" is as real as your debugging tax, it just hits a different audience.
Great post by the way. The empty list bug story is painfully relatable.

Harsh

Backrun, "deploy tax" is a real thing, and you're right, nobody talks about it. For developers, deploy is an afterthought. For non-technical users, it's the wall. They get the code but they can't get it live. The chat window becomes a graveyard of working HTML that never saw the light of day.

"The debugging problem starts before debugging, at deploy": this is the insight. Different audience, same hidden cost. Speed at generation, friction at delivery. HTML Deployer sounds genuinely useful for this gap. No terminal, no Netlify config, just publish.

"The empty list bug": glad it landed. And thanks for the kind words. 🙌

xulingfeng

30 seconds to write, 5 hours to debug — that ratio hit hard. We run Hermes (an agent framework) for automated testing and I've noticed the same pattern: AI generates tests quickly but misses the edge cases that a human with context would catch naturally.

We ended up building a validation layer that forces the agent to explicitly state its assumptions before generating code. It hasn't eliminated the debugging time entirely, but it's cut it by about 60%.

Curious — did you end up with a systematic approach to catch those quiet assumptions, or is it still a per-case thing?

Harsh

xulingfeng, a validation layer before generation is the best pattern in the thread. A 60% reduction is huge. And it's honest to say it didn't eliminate the debugging entirely. AI can't know assumptions it doesn't know.

Systematic vs per-case? Still per-case. But patterns are emerging:
Empty pattern (lists, inputs, results)
Wrong type pattern (raw vs validated)
Edge case combinations

I keep a mental checklist. After enough 5-hour sessions, you see the same shapes.

What does your validation layer check? Would love to learn more. 🙌

xulingfeng

Our validation layer checks three things before generation:
Structure — Does the output match the expected schema/type? (List vs dict, required fields present.)
Constraints — Business rules models tend to ignore. (ID can't be empty, result set can't be null.)
Consistency — Cross-field sanity checks. (Sum of parts equals the total.)
Can't catch everything, but it intercepts the most expensive failures.
Followed! Great discussion 🙌
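
The three checks could be sketched as a single validator that returns a list of problems. This is not the Hermes validation layer itself, whose details aren't given here. The order shape (`id`, `line_totals`, `total`), the function name and the tolerance are all assumptions chosen for illustration:

```python
from typing import Any


def validate_order(order: Any) -> list[str]:
    """Return a list of problems; an empty list means all three checks passed."""
    # 1. Structure: right container type, required fields, numeric values.
    if not isinstance(order, dict):
        return ["structure: expected a dict"]
    missing = [name for name in ("id", "line_totals", "total") if name not in order]
    if missing:
        return [f"structure: missing field '{name}'" for name in missing]
    line_totals = order["line_totals"]
    if not isinstance(line_totals, list) or not all(
        isinstance(value, (int, float)) for value in line_totals
    ):
        return ["structure: 'line_totals' must be a list of numbers"]
    if not isinstance(order["total"], (int, float)):
        return ["structure: 'total' must be a number"]

    problems: list[str] = []
    # 2. Constraints: business rules (ID can't be empty, result can't be empty).
    if not order["id"]:
        problems.append("constraint: id must not be empty")
    if not line_totals:
        problems.append("constraint: line_totals must not be empty")
    # 3. Consistency: the sum of the parts equals the total.
    if abs(sum(line_totals) - order["total"]) > 1e-9:
        problems.append("consistency: line totals do not sum to total")
    return problems


good_order = {"id": "A1", "line_totals": [10.0, 5.5], "total": 15.5}
bad_order = {"id": "", "line_totals": [1.0], "total": 2.0}
```

`validate_order(good_order)` returns an empty list, while `bad_order` trips both a constraint and the consistency check. Structure problems return early because the later checks can't run safely on a malformed shape.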

Harsh

xulingfeng, Structure, Constraints, Consistency. Three simple checks. 60% reduction. "Can't catch everything, but it intercepts the most expensive failures."

That's the 60% right there.

Followed back. Great discussion. 🏅

xulingfeng

Appreciate the follow back! 🙌 60% on simple structural checks is a solid baseline — we see similar numbers. The remaining 40% is where the interesting edge cases live. Glad the discussion resonated!

Adam Lewis

The empty-list bug is the one that lands hardest because the test that would have caught it is the one nobody writes - the one for the negative case. Plausible wrongness, as Urmila put it in the thread, is exactly what slips past a happy-path suite. What's helped me with the same pattern is writing the negative-case test before letting the agent generate the implementation: 'empty input returns empty' and 'illegal transition throws'. If the agent can grade itself against those, the 99% trap stops being a 99% trap. The hard part is the half of your point that doesn't go away: the human still has to know which negative cases exist, because the agent won't imagine them on its own. prickles.org/tenet/verifiable-spec...
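
The two negative-case tests named above could look like this in plain Python. The workflow states, `transition` and `open_items` are hypothetical stand-ins for whatever the agent would be asked to implement. In the workflow described, the two test functions would exist first and the implementation would be generated to satisfy them:

```python
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected"},
    "approved": set(),
    "rejected": set(),
}


def transition(current: str, target: str) -> str:
    """Move to the target state, refusing anything not explicitly allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


def open_items(items: list[dict]) -> list[dict]:
    """Keep items that are not yet in a final state."""
    return [item for item in items if item.get("state") not in ("approved", "rejected")]


def test_empty_input_returns_empty() -> None:
    assert open_items([]) == []


def test_illegal_transition_throws() -> None:
    try:
        transition("approved", "draft")
    except ValueError:
        return
    raise AssertionError("expected ValueError for approved -> draft")


# Run the negative cases directly; a test runner such as pytest would also find them.
test_empty_input_returns_empty()
test_illegal_transition_throws()
```

The tests say nothing about the happy path. Their only job is to pin down what must not happen, which is the part a generated implementation tends to leave out.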

Harsh

Adam, most actionable advice in the thread. Write the negative-case test before the agent generates: test for absence, not just success. "If the agent can grade itself against those, the 99% trap stops being a 99% trap." But the human still has to know which negative cases exist. The agent won't imagine them.

AI can execute tests. It can't invent them. That's still on us.

Thank you for this. 🙌

leob • Edited

"That's a trade-off nobody is talking about" - well, I do :-)

I'm pretty often commenting that sometimes it's just better to write some code yourself - for various reasons ... not all the time - sometimes :-)

P.S. good write-up!

Harsh

Leob, fair. You've been saying this for a while. I'm not claiming discovery, I just finally felt it enough to write about it. "Not all the time, sometimes": that's the honest middle ground.

And thanks for the "good write-up". It means a lot coming from you. 🙌

leob

Cool - yeah that's what I like, a balanced and grounded approach to AI coding tools!

Harsh

That's the goal: balanced, not extreme. Thanks for the conversation, Leob. 🙌

Self-Correcting Systems

This is the trade-off I keep coming back to too: AI makes code faster to produce, but not automatically cheaper to own.

The line that stood out to me was:

The fast code isn't free. It's borrowed time.

That feels exactly right.

The failure mode is rarely “the AI wrote obviously bad code.” If it were obvious, we’d catch it immediately. The expensive failures are the quiet assumptions: the list is never empty, the API always returns this shape, the user already has state, the timezone is harmless, the retry is safe, the data is clean.

Those assumptions become invisible debt because the generated code looks finished.

One habit that has helped me is treating AI code like a junior engineer’s PR from someone who moved very fast but didn’t know the production history. Not insulting, just realistic. The syntax may be clean, but the missing context is where the risk lives.

I’ve started asking three questions before trusting generated code:

  1. What input shape would break this?
  2. What production state did this assume exists?
  3. If this fails at 2 AM, will the next person understand why?

That last one matters a lot. A variable named data is not just a style issue. It is future debugging cost.

The bigger metric probably is not “time to generate code.” It is “time to accepted, explainable, maintainable code.”

That is where AI still needs human ownership.

Harsh

Most articulate comment here. "The expensive failures are the quiet assumptions": list never empty, API always returns this shape, user has state. That's the hidden syllabus of the 5-hour debugging class. "Treat AI code like a junior's PR from someone who moved fast but didn't know production history": right mental model. Not bad. Just junior.
Three questions:

  1. What input breaks this?
  2. What state did it assume?
  3. Will next person understand why?

The last one is the real metric. Not "does it work?" but "will it stay working when someone else touches it?"

Time to accepted, explainable, maintainable code, not time to generate.

Most complete response here. 🙌

Self-Correcting Systems

Appreciate that.

“Hidden syllabus of the 5-hour debugging class” is a perfect way to put it.

That is the part AI code hides well: not the syntax, not even always the logic, but the unstated curriculum the production system eventually forces you to learn.

The junior PR framing helps me because it keeps the posture balanced. I do not want to treat generated code as worthless, but I also do not want to treat clean formatting as proof of production judgment.

A fast junior can produce useful structure. The senior work is asking what the code assumed without saying.

And yes, that third question is the one I keep coming back to:

Will the next person understand why?

If the answer is no, the code may run, but it is not done yet.

That is where AI-assisted development needs a better metric. Not generation speed. Not even first-pass correctness.

Accepted, explainable, maintainable code is the actual finish line.

Tommy Leonhardsen

Well,
humans are at least as likely as AI to make bugs of that kind.
So: write unit tests and, more important, have another AI do an adversarial code review.

For my code written by claude-code/Opus I use copilot-cli/GPT 5.5 for the adversarial review. They "think" very differently, so this works quite well.

Harsh

Tommy, fair point. Humans make these mistakes too. Probably more often, honestly. Adversarial AI review is smart, and using two models that think differently is a great pattern. I've used it myself and wrote about it in my "5 Levels of AI Code Review" article.

Here's where I'd gently push back: the adversarial model helps catch different bugs. But the empty list assumption, the one that cost me 5 hours, would likely slip past both models. Why? Because neither model has been burned by it. Neither has the scar tissue.

So yes, use adversarial review. It helps. But don't trust it to catch everything. The real safety net is still human judgment, not because humans are smarter but because humans have felt the pain.

Thanks for the thoughtful comment and for the practical suggestion. 🙌

Quentin Merle

Spot on. The absolute worst part is what I call the "hell loop".

You let the AI run, you lose your mental context of the current development thread, and suddenly the agent gets completely lost. It starts going in circles, desperately trying to fix a bug... that it created in the first place.

At that point, you have no choice but to step in and take over. But because you dropped the architectural thread, reverse-engineering the AI's "logic" to figure out what it broke ends up costing you hours of wasted time. It's the ultimate proof that we are still the only ones who can hold the true context!

Harsh

Quentin, "hell loop" is perfect. The agent gets lost. It tries to fix a bug it created. You watch it spin so long you forget what the code looked like. Reverse-engineering the AI's logic means debugging a system that doesn't have a thought process.

"We are still the only ones who can hold the true context": AI has code. It doesn't have history, trade-offs, or the decisions that were never written down.

Hell loop: letting AI drive long enough to forget where you were going.

Thanks for this. 🙌

Daniel Stolf • Edited

Building on @gramli's point, the invisible-assumption bug isn't AI-specific. It's an artifact of skipping "what are the failure modes" before writing code. A hurried human writes the same bug. AI just makes the writing fast enough that the skip becomes the default.

Inverting the order narrows that 10 or 3x ratio. Before any function exists, write the acceptance criteria, and treat "what should this do when the input is empty / null / malformed" as non-optional in that list. Then turn each one into a failing test. Only then prompt for the implementation.

The empty list crash is the canonical example: if "function returns X when input is []" is a test that has to pass, the AI literally cannot generate code that crashes on empty input without failing that test. The guard becomes a structural requirement, not a thing you remember to add in review.
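
As a rough sketch of that ordering, the acceptance criteria below exist before any implementation and are then run against two candidates. `run_criteria`, `naive_mean` and `guarded_mean` are hypothetical names, and the specific criteria (empty input returns `None`, `None` raises `TypeError`, malformed input raises `ValueError`) are assumptions picked for the example:

```python
from typing import Callable, Optional, Sequence


def run_criteria(func: Callable) -> list[str]:
    """Run the acceptance criteria; return a description of each failure."""
    failures: list[str] = []
    try:
        if func([]) is not None:
            failures.append("empty input should return None")
    except Exception as exc:
        failures.append(f"empty input crashed: {type(exc).__name__}")
    for bad_input, expected in ((None, TypeError), ([1.0, "oops"], ValueError)):
        try:
            func(bad_input)
            failures.append(f"{bad_input!r} should raise {expected.__name__}")
        except expected:
            pass
        except Exception as exc:
            failures.append(f"{bad_input!r} raised {type(exc).__name__}")
    if func([2.0, 4.0]) != 3.0:
        failures.append("happy path should return the mean")
    return failures


def naive_mean(values):
    # The kind of three-line function that looks fine and assumes non-empty input.
    return sum(values) / len(values)


def guarded_mean(values: Optional[Sequence[float]]) -> Optional[float]:
    if values is None:
        raise TypeError("values must be a sequence, not None")
    if len(values) == 0:
        return None
    if not all(isinstance(v, (int, float)) for v in values):
        raise ValueError("values must contain only numbers")
    return sum(values) / len(values)


naive_failures = run_criteria(naive_mean)
guarded_failures = run_criteria(guarded_mean)
```

`naive_failures` reports the empty-input crash (a `ZeroDivisionError`) and the wrong exception type for malformed input, while `guarded_failures` is empty. The guard is no longer something to remember in review, because the criteria fail without it.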

The naming trap is so true, though. And the most insidious one, because it doesn't surface as a bug. It just makes the code slower to read forever. The AI defaults to data, result, temp, obj, info because they're frequent in its training (which incidentally tells a lot about the average developer). The cost of decoding them three months later lands on you, not on the model. Cheap habit: one pass after the AI returns code where any of those names gets replaced with something specific, before the PR opens.
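
The "one pass" over names can be partly automated. A small sketch using Python's standard `ast` module flags assignments and parameters that use the vague names listed above. The function name and the sample source are made up, and a real project would more likely configure an existing linter than hand-roll this:

```python
import ast

VAGUE_NAMES = {"data", "result", "temp", "obj", "info"}


def find_vague_names(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for each vague assignment target or parameter."""
    hits: set[tuple[int, str]] = set()
    for node in ast.walk(ast.parse(source)):
        # Names being assigned to (ctx is Store), e.g. `result = ...`
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id in VAGUE_NAMES:
                hits.add((node.lineno, node.id))
        # Function parameters, e.g. `def load(data):`
        elif isinstance(node, ast.arg) and node.arg in VAGUE_NAMES:
            hits.add((node.lineno, node.arg))
    return sorted(hits)


sample = """
def load(data):
    result = [row for row in data if row]
    user_rows = result
    return user_rows
"""

vague_hits = find_vague_names(sample)  # [(2, 'data'), (3, 'result')]
```

It only reports where the names are introduced, not every later use. That is usually enough to decide what to rename before the PR opens.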

None of this makes the AI think like a senior engineer. It just forces the discipline that experienced engineers apply on instinct.
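The renaming pass described above could be partly automated. The sketch below is one possible approach, not an established tool: it uses Python's standard `ast` module to list places where a vague name is assigned or used as a parameter. The name list is taken from the comment; `find_vague_names` is a hypothetical helper.

```python
import ast

# Assumed list, copied from the names mentioned in the comment above.
VAGUE_NAMES = {"data", "result", "temp", "obj", "info"}


def find_vague_names(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, name) pairs where a vague name is assigned or is a parameter."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Assignment targets such as `result = ...`
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if node.id in VAGUE_NAMES:
                hits.append((node.lineno, node.id))
        # Function parameters such as `def load(data):`
        elif isinstance(node, ast.arg) and node.arg in VAGUE_NAMES:
            hits.append((node.lineno, node.arg))
    return sorted(hits)
```

Running it over a generated file before opening the PR gives a short list of names to replace by hand; choosing the specific replacement still needs a human who knows what the value represents.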

Daniel Balcarek

Yes, Test Driven Development is a strong answer to many of these problems.

And regarding "which incidentally tells a lot about the average developer": I think it's also a consequence of the internet being full of tutorials, code snippets, examples and showcase projects where names like data, result, obj and temp are used heavily. Some of those patterns are also inherited from code written 10–20 years ago, when naming conventions and code quality standards were often different.

Harsh

Daniel, inverting the order is the most actionable advice here. Before any function exists, write the acceptance criteria. Treat empty/null/malformed as non-optional. The guard becomes a structural requirement, not a "remember to add". "AI defaults to data, result, temp, which tells a lot about the average developer." Oof. Cheap habit: one pass to replace vague names with specific ones before the PR opens.

None of this makes AI think like a senior; it just forces the discipline experienced engineers apply on instinct.

Most practical comment here. 🙌

urmila sharma

Really loved this article; it perfectly captures the hidden cost of AI coding assistants. We often focus on how fast we can generate code with AI, but debugging that code (especially when it looks right but behaves wrong) can take 10x longer. Your point about false confidence is so true. Thanks for sharing this reality check.

Harsh

Thank you, Urmila. "Looks right but behaves wrong" is the perfect phrase. Plausible wrongness: clean syntax, hidden assumptions. That's the real trap.

Thanks for articulating it so well. 🙌

Cartone

Interesting perspective from a developer's side. I'm coming from the opposite direction — I can't code at all, and I run a project with three Claude instances (one as CEO, one writes code, one automates daily tasks). After 90 sessions I've noticed a related pattern: the AI doesn't just miss edge cases in code — it misses them in reasoning too. It proposes solutions that are technically correct but flat. It took me, with zero technical background, to ask "but what about X?" before the AI connected two pieces of context that were already in the conversation. Your "debugging tax" is the code version of what I'd call the "common sense tax" — the time a human spends adding the thinking the AI skipped.

Harsh

Cartone, this is the most unique perspective here. Three Claude instances (CEO, coder, automation): you're doing what devs talk about but few actually do. "AI misses edge cases in reasoning too" is the deeper problem. The logic works. The context doesn't hold.

"Common sense tax" is the perfect name. The info is there, but the AI doesn't know which pieces matter unless you tell it. "The time a human spends adding the thinking the AI skipped": that's the hidden cost of both code and reasoning.

Thank you for this. 🙌

Sean Markwei

I have been there. Lost so much money too, honestly.

Paid for the highest tiers thinking I'd get better work done, ended up bleeding dollars on garbage code. In my country's currency that's a lot... someone's monthly salary. The undo and re-prompt cycle feels faster but it's true... you're just gambling, not learning. The real issue is that debugging the code costs way more than generating it saved. So I stopped blind re-prompting and started actually reading the code first. That catches some of the assumptions before production does.

Thanks for shedding light on this, Harsh. It's the honest conversation we all need to be having.

Harsh

Sean, "paid for the highest tiers thinking I'd get better work": real pain. You didn't just lose time. You lost real money, someone's monthly salary. "Debugging costs way more than generating saved" is the whole trade-off. Generation is cheap. Debugging is expensive. "Started actually reading the code first": that's not stopping AI. Just pause. Question.

Thank you for naming the financial cost most skip. 🙌

Sean Markwei

Exactly.

The financial hit is what really makes you stop and think. Most discussions around AI tooling happen from places where a few dollars don't matter. But when you're in Ghana and that's rent money, the trade-off math changes completely. You can't afford to lose time on garbage code. It forces you to actually be intentional about what you're shipping, which honestly makes you a better engineer. But yeah, the pause and question approach is the only thing that worked for me.

Thanks for this.💫

Harsh

"When you're in Ghana and that's rent money, the trade-off math changes completely." That's the line. The same numbers, different weight. Thank you for sharing this perspective, Sean. It's grounding. 🙌

Rondo

That's why I think domain knowledge is important.
Some edge cases are really hard to consider even for experts before shipping. So not surprisingly AI misses them.
I suppose it would be helpful to reduce debugging hours by combining 'general' edge cases that AI knows well with your domain knowledge.

Harsh

Rondo, "some edge cases are hard to consider even for experts before shipping" is the humbling truth. If experts miss them, of course AI misses them. The AI doesn't know which corners of the problem space are dangerous because it hasn't been burned there before.

"Combine general edge cases AI knows well with your domain knowledge" is the practical path. Let the AI catch what it can: the "list might be empty" class of problems. You handle the ones that require knowing why empty is a problem in this specific context.

The AI can list edge cases. It can't prioritize them. That's where domain knowledge steps in.

Thanks for saying it simply. Sometimes the shortest comments hold the most truth. 🙌

arun rajkumar

The debugging tax is real, but the ratio depends massively on how your codebase is structured before you ever touch AI. We run a payment platform — FCA-regulated, real money, the kind of code where a missed edge case isn't a bug report, it's someone's rent payment failing.

Our ratio is closer to 2-3x, not 10x. Not because we're smarter, but because we invested years in typed schemas (Zod everywhere), strict service boundaries, and explicit validation rules. When an AI agent generates code against a Zod schema, the schema itself catches the invisible assumptions — empty lists, null inputs, malformed data — at compile time. The AI doesn't need to "know" about edge cases if the type system refuses to let them through.

Daniel's comment above nails it: this isn't an AI problem, it's a missing-edge-cases problem. AI just delivers that same blind spot faster and more confidently. The fix isn't avoiding AI — it's making your codebase hostile to silent assumptions regardless of who writes the code.
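To illustrate the "hostile to silent assumptions" idea, here is a rough sketch. The comment describes Zod, a TypeScript library whose schemas are checked when data is parsed; the code below is only a plain-Python stand-in built on the standard `dataclasses` module, with checks that run when the object is constructed. `PaymentBatch`, its fields, and `total_pence` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentBatch:
    """Hypothetical payload; the fields are illustrative, not from a real platform."""

    account_id: str
    amounts_pence: tuple[int, ...]

    def __post_init__(self):
        # Every assumption downstream code relies on is stated and enforced here.
        if not self.account_id:
            raise ValueError("account_id must be non-empty")
        if not self.amounts_pence:
            raise ValueError("amounts_pence must contain at least one amount")
        if any(type(a) is not int or a <= 0 for a in self.amounts_pence):
            raise ValueError("every amount must be a positive integer")


def total_pence(batch: PaymentBatch) -> int:
    # Safe to assume a non-empty, validated batch: an invalid one cannot be constructed.
    return sum(batch.amounts_pence)
```

Code written against `PaymentBatch`, by a person or by an agent, never sees an empty or malformed batch, because construction fails loudly at the boundary instead.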

Harsh

arun, this is the most advanced comment here. 2-3x, not 10x: not because you're smarter, but because you invested in typed schemas and strict boundaries. The schema catches invisible assumptions at compile time. The AI doesn't need to know edge cases if the type system refuses them. "The fix isn't avoiding AI, it's making your codebase hostile to silent assumptions regardless of who writes the code." That's the line. Not "trust less": design the system so assumptions can't stay silent.

You engineered your way to a lower tax rate. That's Level 5.

Thank you for this. 🙌

Theo Valmis

Debugging AI code is harder because the failure mode is plausible-looking wrong, not obviously wrong. With human code you can read intent in the structure. With agent-generated code the structure is often coherent but the intent is missing, so you end up reconstructing what the agent thought you wanted before you can diagnose anything.

Harsh

Theo, "plausible-looking wrong" is the phrase I've been searching for. Thank you. With human code, wrong often looks wrong: messy indentation, weird variable names, obvious red flags. The intent might be buried, but it's there and you can excavate it. With AI code, wrong looks clean. The structure is coherent. The names are fine. Everything feels right. But the intent is missing because intent was never there to begin with.

"You end up reconstructing what the agent thought you wanted before you can diagnose anything": this is the extra step nobody bills for. You're not debugging the code, you're debugging the assumptions behind the code. Reverse-engineering a system that doesn't have a thought process.

You've named the invisible cost. Thank you. 🙌

Frank von Schrenk

Interestingly, I've been letting Claude (Anthropic) handle my programming for months now. Whenever a logical or technical error arises, I just have it fix itself—and it has worked every time. So, the real question is: which AI are you using for coding?

Harsh

Frank, fair question. The 10x tax isn't model-specific. Even with Claude, the empty-list assumption bug would still happen. Not because Claude is bad, but because no model automatically checks assumptions you didn't ask it to check. AI writes code that works 99% of the time. The 1% crash happens because the AI assumed something that wasn't true.

Letting AI fix itself works for logic errors. It doesn't work for assumption errors, because the AI doesn't know it made an assumption.

Happy to learn if you've found a way to make Claude check its own assumptions. Genuinely curious. 🙌

Frank von Schrenk

If you say the AI silently assumed that a list could never be empty, I would honestly see the mistake more on my side.
The fact that an AI can write code does not mean it can independently make architectural or domain-specific decisions. It does not automatically understand a business flow, nor does it inherently know what exists inside databases or systems.

While working on Onisin OS, I realized that an AI consistently performs better when it has access to more context and operational information. Because of that, I developed a tool called Bench that allows the AI to work directly on my machine (or multiple machines): exploring databases, reading and pushing GitHub repositories, and interacting with development infrastructure in a controlled way.

This significantly improved the quality of the AI’s assumptions and decisions.
I also developed oosmem, a system that allows the AI to retain concepts such as space, time, and actions as persistent contextual memory.

Of course, an AI can make mistakes, just like humans do.
The important part is not expecting perfection, but giving AI systems the right tools and feedback loops so they do not repeat the same mistake twice.

Ruben Alvarado

Thanks for sharing your experience. I think that we need to transmit this cost to all stakeholders to make them understand the dev cycle better.
Code generation makes it more relevant to think about extensive tests to avoid checking only happy paths.

Harsh

Ruben, "transmit this cost to all stakeholders" is the part nobody does. Managers see the 30 seconds. They don't see the 5 hours. The dashboard shows velocity going up. It doesn't show the debugging tax. "Code generation makes extensive tests more relevant to avoid checking only happy paths": yes. The AI optimizes for the happy path, so the human needs to optimize for the unhappy path. That's not the AI's job, it's the team's.

Good testing isn't overhead. It's the receipt for the time AI saved you.

Thanks for adding the business lens. 🙌

Valentin Monteiro

The tax gets worse one level up. You at least had the prompt in your head for the afternoon you generated it. Whoever inherits the code six months later has neither the logic nor the "why" that lived in the prompt, so they pay your 5-hour bug as a 5-day one. That's the part that never shows up in the velocity number: AI doesn't remove the debugging cost, it moves it forward in time and onto someone else's desk. Works great for the demo, less great for the codebase.

Harsh

Valentin, this is the most important comment in the thread. "Whoever inherits the code six months later has neither the logic nor the 'why' that lived in the prompt": this is the hidden tax within the hidden tax. You paid 5 hours. The next person pays 5 days. The total cost isn't 5 hours, it's 5 hours + 5 days + the frustration of decoding a stranger's assumptions. "AI doesn't remove the debugging cost, it moves it forward in time and onto someone else's desk": that's the structural deception. The velocity number goes up. The work doesn't disappear; it gets deferred and shifted. Works great for the demo. Less great for the codebase.

This is the line. The demo is a moment. The codebase is years. Optimize for the wrong one, and you're building a time bomb.

Thank you for this. It's the perspective every tech lead needs to read. 🙌

SalimFlowStack

Never happened to me.

When you're very specific about how you want the AI to code something, and you verify the feature through unit tests and real frontend testing, check your database to make sure it did the job correctly, etc., it's rare to spend hours debugging something.

The AI sometimes produces some spaghetti code, but by making it work on only one feature at a time, being very specific with your requirements, and always verifying the results directly, I've never had to spend hours debugging a problem.

Harsh

Salim, glad it hasn't hit you. Your workflow sounds solid. Very specific requirements + unit tests + real testing + DB verification is more discipline than most teams have. Here's the nuance: the 5-hour bug wasn't a logic bug, it was an assumption bug (the AI assumed the list was never empty). No amount of prompt specificity catches that, because the assumption wasn't in the prompt. It was in the AI's training.

The tax isn't inevitable. But it's also not avoidable by discipline alone, because the AI's blind spots are invisible to both the AI and the prompter.

Thanks for sharing your workflow. 🙌

Theo Valmis

The 10x debugging time isn't because AI code is uniquely bad. It's because AI code is uniquely opaque to the person who has to maintain it. Human-written code carries the author's mental model — variable choices reveal what the author was thinking, comments mark where they were uncertain, structure reflects how they reasoned about edge cases. AI code has none of that. The output looks plausible at every level, but there's no record of which trade-offs were considered.

The fix isn't better debugging tools. It's specifying constraints before generation so the AI is operating inside a narrower space — fewer plausible-but-wrong outputs to debug because the agent never had the freedom to produce them. Specification work upstream pays back at a much higher rate than debugging work downstream.

Carl Ward

I had similar issues - but I figured it was actually me. My prompts said what I wanted, but they didn't say what should happen for edge cases, so I got what I asked for. My learning - it's like working with a very smart, very capable, but extremely naive and forgetful developer. So I fixed it with a tool - a knowledge graph that demands rigour in architecture, edge conditions, use-case specifications, test coverage - everything. The result - the system is a lot more robust, reliable, and correct - I still get defects - but the trick is to leverage its strengths and teach it how to work smarter. See my post for more info - the basic Claude Code skill and plugin is free.

Harsh

Carl, "extremely capable, but extremely naive and forgetful developer" is the perfect description of working with current AI. Not stupid. Just no memory of past mistakes. No scars. Every session is a fresh start. "My prompts said what I wanted, but didn't say what happens for edge cases, so I got what I asked for": that's the lesson. The AI gives you exactly what you ask for, not what you meant. The gap between what I said and what I intended is where the 5 hours live. A knowledge graph that demands rigor in architecture, edge conditions, use-case specifications, and test coverage is the structural fix. Not "prompt better": design the system so the AI can't skip the rigor. The knowledge graph becomes the guardrail.

Thanks for sharing and for making the basic skill free. 🙌

Carl Ward

If you find it useful let me know - I am continuing to update.

Theo Valmis

This matches almost every team I've talked to. The debugging cost compounds when agents generate code that compiles and passes shallow tests but violates architectural invariants the AI can't see. Most of the 10x is unwinding decisions that should have been blocked at generation time.

arun rajkumar

The 10x ratio is real, but the fix isn't just individual discipline — it's systemic. We push AI-generated code through design pattern lints and mandatory integration tests so the empty-list assumptions get caught before prod, not after a 2 AM page. The bigger shift for us: giving the AI actual context about our system's invariants instead of letting it guess from training data. Your rule #1 is the right instinct. If you can't walk through it, it shouldn't ship. Same principle we apply — AI handles the routine 80%, but the edge cases and architecture decisions still need humans who understand the full system.

E Lion Reigns

10x debug time feels accurate for AI-assisted glue code. I now ask what happens when this gets nothing before every ship. Building solo in Hawaii — always looking for dev friends who get production integration pain.

Vic Chen

This is a sharp framing. The "30 seconds of generation vs. 5 hours of debugging" contrast really lands, especially the point about reverse-engineering assumptions later. I've seen the same thing in ML/data workflows: AI is great for drafts, but production edge cases still demand human judgment and a clear mental model.

Harsh

Vic, "AI is great for drafts, but production edge cases still demand human judgment and a clear mental model" is the summary. The AI can do the 80%. The 20% (the edge cases, the assumptions, the "what happens when things go wrong") is still on us. ML/data workflows are a great example. A draft model is fine. A production pipeline that fails silently? That's the 5-hour tax.

Thanks for adding the ML lens and for the kind words. 🙌

Tyler N

When I was younger (a few years ago, which was a completely different lifetime because I definitely wouldn't consider myself very old), I remember that I used to use AI to try to generate full applications for me (one I remember being the snake game). I did not know how to code, at all. I would copy error logs into the chatbot and paste its output back in. I was so bad I couldn't even paste snippets of code into the file, only the full file. And, at that time, AI was very incapable of remembering 100 or 200 lines of code for more than a few responses (I could have just been using a bad model, it was mainly ChatGPT). I was debugging (AI'ing) forever, and it never generated anything that actually worked, outside of very small things. And, when it worked, it was usually very bad, and looked like an ape styled it.

Now that Artificial Intelligence is more intelligent, it could definitely code the snake game in the blink of an eye. But the errors in code generated by Artificial Intelligence can now be very subtle and small. While a subtle or small issue looks a lot better before you notice something is wrong, compared to a large "FATAL ERROR: FIX ME!!!" error, it is the opposite once you figure out something is wrong and you have no clue what it is (like when the AI model you used forgot to check if the list is empty).

But now that AI is advanced enough to appear to be usable in production environments, its errors, which were once small due to its lack of intelligence and therefore lack of usage, are now large due to its higher intelligence and therefore more important usage.

I agree that AI should be used more responsibly to reduce errors in production environments. As Spider Man said (don't ask me a lot about him, I really do not know much about him), "with great power comes great responsibility". But I would also like to add that AI should be used second, and human thinking should be used first. The MIT Media Lab study "Your Brain on ChatGPT" studied 54 participants aged 18 to 39, split them into 3 groups, and had them write an essay. The study concluded that the individuals who used AI first had lower brain activity and felt that the essay did not belong to them as much as the group who drafted the essay first and then used AI. That second group showed more brain activity, felt more ownership of the essay, and were able to quote their own essay much better than the AI-first group.

In my scenario, the rate between my time coding and my time debugging was infinite. So, I guess that's my answer to your question.

Harsh

Tyler, this is a whole journey in one comment, from "couldn't code" to citing MIT studies. Now that AI is more intelligent, the errors are subtle and small. A subtle bug looks better before you notice something is wrong, but once you notice, you have no clue what it is. That's the shift. Loud errors are easier to debug. Quiet errors (the empty-list assumption, the off-by-one, the invisible misalignment) are the 5-hour traps. AI's errors that were once small due to its lack of intelligence (and therefore lack of usage) are now large due to its higher intelligence (and therefore more important usage).

This is the paradox. As AI gets better, the stakes get higher. The bugs become rarer, but more subtle. And when they hit, they hit harder. The MIT study, AI-first vs. human-first, is the data behind the "one hour, no AI" rule. Human-first keeps the brain engaged. Ownership stays with the writer/coder.

"In my scenario, the rate between time coding and time debugging was infinite": because you couldn't code at all. Now you can. That's progress. The tax went from infinite to finite.

Thank you for this. It's one of the most thoughtful comments in the thread. 🙌

MW

This post was written by AI - ignored

Scarab Systems

This is such a real thread.

I don’t think the problem is simply “AI writes bad code.” Humans write bad code too. The sharper problem is that AI can generate plausible code without carrying the repo’s actual memory, architecture, constraints, or accumulated scars.

That’s the part I keep running into: the agent isn’t always failing because it can’t code. It’s failing because it doesn’t know when it has drifted away from the truth of the repo.

I’ve been working on a diagnostic layer around this exact problem — not another coding agent, but something that sits beside the agent and keeps checking whether the work still matches the repo’s governance, structure, assumptions, and intended direction.

Basically: less “trust the agent,” more “give the agent a supervisor that can tell when it’s making the repo noisier instead of cleaner.”

This post nails the pain point. The next evolution, I think, is not faster code generation. It’s better truth-maintenance around the code being generated.

Harsh

This is the most insightful comment in the thread. Thank you. "The agent isn't failing because it can't code. It's failing because it doesn't know when it has drifted from the truth of the repo."

That's it. The AI doesn't have a sense of "this code belongs here" vs. "this code is making things worse". It only has local correctness, not global coherence. Not another coding agent: a diagnostic layer. A supervisor that checks whether the work still matches the repo's governance, structure, and assumptions. This is the next evolution you're describing. Not better generation. Better guardianship. Something that watches the agent and says, "You've gone off course. Rewind." Less "trust the agent", more "give the agent a supervisor".

That's the frame shift. We've been treating AI as autonomous. What you're describing is treating it as capable-but-needs-oversight: a junior with a senior looking over their shoulder.

"Better truth-maintenance around the code being generated": this is the phrase. Not better code. Better truth. The code isn't wrong, it's just not true to the repo's history, patterns, and unwritten rules.

You've articulated the next layer beyond my article. Thank you, this is genuinely valuable. 🙌

Eugene Maiorov

This hits the nail on the head. The "cognitive load" section is what resonated most with me - reading AI code is literally like reverse-engineering a stranger's handwriting. It falls into the "works on my machine" trap because the LLM will always optimize for the happy path and completely ignore real-world edge cases.

I actually changed my entire workflow for core logic because of this exact 10x debugging tax. Instead of letting the AI write open-ended functions where it can invent assumptions, I force it to interact with my systems strictly through MCP tools. When you restrict it to tightly typed schemas, it physically can't hallucinate missing if statements or bad variable names—it just uses the tool exactly as defined or it fails.

The only headache with that approach is managing the infrastructure for all those little tool servers. I eventually moved my orchestration to Vectoralix so I could just deploy sandboxed endpoints directly from my repos without dealing with the JSON-RPC boilerplate every time. Fencing the AI in with strict, pre-written tool definitions - rather than letting it write freeform logic - is the only way I've found to actually trust what it builds.
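The "menu of pre-written tools" idea can be sketched without any particular framework. The code below is not the MCP API and not Vectoralix; it is a hypothetical in-process registry (`register_tool`, `call_tool`, and the `refund` tool are invented names) that only shows the principle: the caller picks a defined tool and supplies exactly the declared arguments, or the call is rejected.

```python
from typing import Callable

# Hypothetical registry: tool name -> (declared argument types, handler).
TOOLS: dict[str, tuple[dict[str, type], Callable[..., object]]] = {}


def register_tool(name: str, arg_types: dict[str, type], handler: Callable[..., object]) -> None:
    TOOLS[name] = (arg_types, handler)


def call_tool(name: str, args: dict[str, object]) -> object:
    """Run a registered tool, rejecting unknown tools and malformed arguments."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    arg_types, handler = TOOLS[name]
    if set(args) != set(arg_types):  # no missing and no extra arguments
        raise TypeError(f"{name} expects exactly {sorted(arg_types)}")
    for key, expected in arg_types.items():
        if type(args[key]) is not expected:
            raise TypeError(f"{name}.{key} must be {expected.__name__}")
    return handler(**args)


def refund(order_id: str, amount_cents: int) -> str:
    # Pre-written, human-reviewed logic; the agent can only invoke it.
    return f"refunded {amount_cents} cents on {order_id}"


register_tool("refund", {"order_id": str, "amount_cents": int}, refund)
```

The trade-off is that every capability has to be written and reviewed up front, which is the infrastructure cost mentioned above.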

Harsh

Eugene, "reading AI code is like reverse-engineering a stranger's handwriting" is the perfect description. The AI writes code that's coherent but not yours. No fingerprints. No intent to recover. "Instead of letting the AI write open-ended functions where it can invent assumptions, I force it to interact through MCP tools with tightly typed schemas": this is the pattern. Don't trust the AI to make decisions. Give it a menu of safe, pre-defined actions. It can choose from the menu, but it can't cook up something new.

"When you restrict it to tightly typed schemas, it physically can't hallucinate missing if statements": the AI's creativity is powerful. It's also dangerous. You're not killing creativity, you're fencing it into safe zones. "Fencing the AI in with strict, pre-written tool definitions, not freeform logic, is the only way I've found to actually trust what it builds": this is the answer to the 10x tax. Not better AI. Better guardrails. The AI is powerful but untrustworthy, so you build a container around it that limits what it can do.

Vectoralix sounds like exactly that: infrastructure for the container.

Thank you for this: practical, actionable, and honest about the infrastructure cost. 🙌

Yogesh VK

Agree 💯. Been there so many times 😂 Things get exaggerated for DevOps/Platform folks, where a lot of code ends up in infrastructure, and the impact can be manyfold.

Harsh

Yogesh, "impact can be manyfold" is the DevOps multiplier. An app bug breaks a feature. An infra bug breaks everything. Same 10x tax, higher stakes.

Thanks for this lens. 🙌

Amit Kochman

Exactly what we are trying to solve at Pandorian.ai: enforcing coding standards at scale for large software orgs. Especially now that velocity is 10x because of AI.

Harsh

Amit, enforcing coding standards at scale is exactly the kind of structural fix the article points toward. Individual discipline helps. Organizational guardrails help more. If Pandorian.ai is building that, enforcing standards so teams don't have to rediscover the 10x tax themselves, that's valuable.

Thanks for reading and for building in that direction. 🙌

Amit Kochman

Yes, we are building for leadership, so teams can keep on running without worrying about it!
Great article, thanks :)

Rachel lu

'Fix this bug' rewrites half the file. I diff every AI edit and revert anything over 3 lines.
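That "diff and revert" rule can be approximated in a few lines. This is a sketch under assumptions: it counts every removed line plus every added line as a change (so a one-line modification counts as 2), which may be stricter than the rule intended above, and `changed_line_count` and `should_revert` are hypothetical helper names. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib

MAX_CHANGED_LINES = 3  # threshold taken from the comment above


def changed_line_count(before: str, after: str) -> int:
    """Count lines removed from `before` plus lines added in `after`."""
    old_lines, new_lines = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def should_revert(before: str, after: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when an edit touches more lines than the allowed limit."""
    return changed_line_count(before, after) > limit
```

In practice the same check could run on the output of your version control diff; the point is only that "how much did this fix touch?" is cheap to measure before reading the change in detail.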