Harsh

Posted on May 28

I Spent 10x Longer Debugging AI Code Than Writing It

#ai #programming #productivity #discuss

Reverse-engineering code you didn't write

AI wrote the code in 30 seconds.

Three lines. A simple function. I prompted, it generated, I copied. It looked fine. Clean syntax. Good variable names. No obvious errors.

I spent the next 5 hours debugging it.

The bug wasn't in the logic. The AI had made a quiet assumption - that a list would never be empty. It worked 99% of the time. The 1% crashed in production. A real user. A real failure. A very real 5 hours of my life.

30 seconds of generation. 5 hours of debugging.

That's not efficiency. That's a trade-off nobody is talking about.

This isn't an anti-AI article. I use AI every single day. It has genuinely changed how I work. But I've stopped pretending that speed at write time is the only metric that matters.

Here's what I've learned about the hidden cost of AI-generated code after paying that cost enough times to notice the pattern.

The Myth of Fast Code

We've been sold a story: AI makes you faster. Prompt, copy, ship. Repeat. It's true, the writing is faster. Dramatically faster. What used to take an hour now takes minutes. That part is real.

But the story always stops there. It doesn't mention what happens after.

The AI writes the code in seconds. You ship it. You move on. Weeks later, a bug surfaces - subtle, hard to reproduce, buried in code you didn't write and don't fully own.

Now you're not debugging logic you understand. You're reverse-engineering code from a system that can't explain its own assumptions. You're reading it like a stranger's handwriting, trying to figure out what they meant.

The fast code isn't free. It's borrowed time.

The debt shows up later - and by then you've completely forgotten what the AI assumed when it wrote it.

Three Times AI Code Cost Me More Than It Saved

1. The Invisible Assumption (5 hours)

The AI assumed a list would never be empty. Didn't check. Didn't add a guard. Why would it? It only knows what I asked - not what real users actually do.

The bug showed up in production two weeks after I shipped. A user with zero data hit the flow. The whole thing crashed.

The fix? One line. A simple "if not list" check.

The debugging? Five hours of a confused, increasingly frustrated me tracing through logs, adding print statements, and questioning my own sanity before I found the single unchecked assumption.

	Time
⚡ Saved at write time	5 minutes
🔥 Cost at debug time	5 hours

Ratio: 60x.
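For illustration, here is roughly what that one-line guard looks like in Python. The function name and the averaging logic are hypothetical stand-ins (the original function isn't shown here); only the empty-list check reflects the story above.

```python
from statistics import mean


def average_order_value(order_totals: list[float]) -> float:
    """Return the mean order total, or 0.0 when there are no orders.

    Hypothetical example: the name and the 0.0 fallback are assumptions.
    """
    # The guard the generated code never had: a user with zero data.
    # Without it, mean([]) raises statistics.StatisticsError.
    if not order_totals:
        return 0.0
    return mean(order_totals)


print(average_order_value([20.0, 40.0]))  # 30.0
print(average_order_value([]))            # 0.0
```

Whether an empty list should return a default, return None, or raise a clear error depends on the caller. The point is that the decision gets made on purpose instead of by a crash.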

2. The "Works on My Machine" Trap (1 full day)

AI code passed all my tests. Ran perfectly locally. I was confident. I shipped it.

In production? Different story entirely.

The AI had optimized for my test environment - the clean inputs I'd been testing with, the neat data shapes in my fixtures, the happy paths I'd written. It hadn't thought about real data. It hadn't thought about the weird edge cases real users create.

I spent a full day chasing a bug that only existed in the wild.

	Time
⚡ Saved at write time	10 minutes
🔥 Cost at debug time	1 full day

3. The Naming Trap (3 hours)

The AI named a variable data.

Generic. Vague. Technically acceptable. And a completely reasonable thing for an AI to do: it didn't know what mattered.

Three months later, I had no idea what data contained. Was it the raw user input? The transformed output? The cached result from the database? Something I'd filtered?

I spent 3 hours tracing through code that should have taken 10 minutes to understand, because the AI chose convenience over clarity and I didn't catch it.

	Time
⚡ Saved at write time	0 minutes (I would've named it better)
🔥 Cost at debug time	3 hours
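A small sketch of the difference, assuming a made-up email-cleaning function (none of these names come from the original code). Each variable says which stage of the pipeline it holds, so nobody has to trace backwards to find out what "data" means.

```python
def normalize_emails(raw_user_input: list[str]) -> list[str]:
    """Strip and lowercase addresses, dropping blank entries.

    Hypothetical example: the function and variable names are illustrative.
    """
    # Each name encodes the stage: raw -> stripped -> normalized.
    stripped_entries = [entry.strip() for entry in raw_user_input]
    normalized_emails = [entry.lower() for entry in stripped_entries if entry]
    return normalized_emails


print(normalize_emails([" A@x.com ", "", "b@Y.org"]))  # ['a@x.com', 'b@y.org']
```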

What AI Code Actually Costs

Beyond the hours there are costs that don't show up on any stopwatch.

Cognitive load. You didn't write the code, so you don't have the mental model. Every time you touch it, you have to rebuild your understanding from scratch. It's like returning to a codebase you've never seen, except you supposedly wrote it.

Confidence erosion. After enough "works on my machine" moments, you stop trusting your own testing. You start shipping with low-grade anxiety. You add logs just in case. You write extra tests, not because the code needs them but because you don't trust code you didn't write.

The "just in case" spiral. Extra checks. Extra validation. Extra error handling, not because the requirements demand it but because you're compensating for uncertainty about code you can't fully vouch for. This eats time quietly, in small pieces.

Opportunity cost. Every hour you spend debugging AI-generated code is an hour you're not spending on the work that actually requires your judgment, your context, your experience.

These costs are invisible. No ticket tracks them. No dashboard measures them. No retrospective surfaces them.

But they're real. And they add up slowly, silently, until one day you realize debugging has started to feel like the actual job.

What I'm Doing Differently

I'm not quitting AI. That ship has long since sailed, and I don't want it back.

But I've made a few small changes that have quietly shifted the ratio:

1. I don't ship code I can't explain.
If I can't walk through the logic (not skim, actually walk through it line by line), I don't ship it. Even if it works in testing. This catches invisible assumptions before production does.

2. I treat AI output as a first draft.
The AI writes the structure. I rewrite the parts that matter: the edge cases, the error handling, the variable names, the things that someone will need to read at 2 AM when something breaks. It's slower. It's also code I actually own.

3. I add the missing assumptions explicitly.
The AI always optimizes for the happy path. So I've made it a habit to immediately ask: what does this break if the input is empty? Null? Malformed? Unexpected? I add those checks myself, every time.

4. I budget a debugging tax.
Every AI-generated function gets an extra 30 minutes in my time estimate for review and hardening. Not pessimism, pattern recognition. The tax pays for itself in the first incident it prevents.
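To make habit 3 concrete, here is a minimal sketch of what "adding the missing assumptions explicitly" can look like. The parse_quantity function is hypothetical, invented for this example; the four bad inputs map to the four questions: empty, null, malformed, unexpected.

```python
def parse_quantity(raw_value: object) -> int:
    """Parse a positive whole-number quantity, rejecting everything else.

    Hypothetical example: the function and its rules are assumptions.
    """
    if raw_value is None:
        raise ValueError("quantity is missing")
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(raw_value, bool) or not isinstance(raw_value, (str, int)):
        raise ValueError(f"unexpected type: {type(raw_value).__name__}")
    text = str(raw_value).strip()
    if not text:
        raise ValueError("quantity is empty")
    try:
        quantity = int(text)
    except ValueError:
        raise ValueError(f"quantity is not a whole number: {text!r}") from None
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return quantity


# Empty, null, malformed, unexpected: each one must be rejected.
for bad_input in ["", None, "three", 2.5]:
    try:
        parse_quantity(bad_input)
    except ValueError as error:
        print(f"{bad_input!r} rejected: {error}")
    else:
        raise AssertionError(f"{bad_input!r} was accepted")

print(parse_quantity(" 3 "))  # 3
```

Writing the bad inputs down as a loop (or as test cases) turns the mental checklist into something that fails loudly when a later edit removes a guard.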

None of these eliminate the problem. But they've meaningfully reduced my personal 10x ratio. Some weeks it's closer to 3x now.

That's progress.

The Honest Trade-Off

AI code is faster to write. Slower to debug. The ratio varies: sometimes 2x, sometimes 20x, and that one time it was 60x and I questioned my life choices.

The question was never "is AI good or bad?" That's a pointless debate.

The real question is: what's the ratio for your work, on your codebase, for your team?

For throwaway scripts? Use AI, don't look back.

For core logic that someone will need to debug at 2 AM six months from now? Be careful. Be deliberate. Be present.

The trade-off is real. It's not going away. And pretending it doesn't exist doesn't make it disappear; it just means you'll discover it in production instead of before it.

One Question

What's the worst "AI wrote it fast, I debugged it slow" story you have?

How long did the bug take to find? What was the assumption you missed?

I'll go first in the comments - the empty list crash, 5 hours, a single missing if statement.

Your turn. 👇

Top comments (121)

FrancisTRᴅᴇᴠ (っ◔◡◔)っ • May 28 • Edited

I do hate that sometimes. There were cases where I had to undo everything and re-prompt, and for some reason it then got it right. I rarely debug AI-generated code except for small fixes; when it comes to big changes, I just undo what the AI generated and re-prompt.

Inefficient, yea. Gonna change that habit since I don't want to lose my skills.

Good work Harsh! Thanks for sharing :)

Harsh • May 28

Francis, "undo and re-prompt": I've been there too many times. Feels faster. But you're right, it's inefficient. And worse, you're not learning. Just gambling. "Gonna change that habit": that's the part that matters. Not debugging faster. Debugging yourself.

What helped me: 30 seconds of reading before the re-prompt. It catches what a blind re-prompt misses.

Thanks for the honest share. 🙌

Daniel Balcarek • May 28

"It worked 99% of the time, the 1% crashed in production" is not just an AI problem. We've all seen bugs caused by developers implementing only the happy path long before AI existed.

Covering edge cases and exception scenarios usually comes with experience, often after seeing production crashes yourself.

But I totally agree with the main message: be careful with AI-generated code. Treat it as a draft, review it critically and always understand what you are pushing to production.

Alex • May 29 • Edited

A large part of discovering edge cases and exceptions happens during implementation. There is no way that you can see all those details upfront when prompting.

You could argue that reviewing solves this. But understanding (~= justifying) and criticizing execution paths is not nearly the same as modeling them. So it seems almost impossible to be careful with AI-generated code at scale.

Daniel Balcarek • May 29

Thanks for the comment!

I partially disagree. A lot of issues are discovered during implementation, debugging, and code review, but many edge cases are only found during testing, and some unfortunately only appear in production. By edge cases, I mean those rare scenarios that can slip past both developers and testers.

Also, we usually don't review an entire application or multiple features at once. We review smaller changes, whether they were written by a colleague or generated by AI. In that sense, AI-generated code can be reviewed the same way.

That said, I agree with your broader concern. If developers start accepting large amounts of AI-generated code without fully understanding it, ownership becomes a real problem.

Reviewing code is one thing; being able to maintain and debug it six months later is another.

Alex • May 29 • Edited

I got your point, it can be simplified to "human makes mistake, AI makes mistakes, you need to review both anyway". But from my perspective, code review was never a very effective measure against bugs (understanding code is not enough) and with generated code it's arguably even worse because of elevated mental overhead.

I make like 5x more mistakes than AI, but mostly it's something stupid and easy to fix once found (btw AI saves hours of debugging those). When working with generated code, it takes so much effort to ensure that the whole thing is not a mistake.

"Many edge cases are only found during testing."

To test the edge case, you often need to know that it exists. And to know that it exists, you often need to write the code yourself.

Maybe you are right: if your processes are aligned, the negative effects can be mostly mitigated. But probably with a modest (compared to industry expectations) overall productivity boost.

Daniel Balcarek • May 29

That's a fair point.

I think where we differ is that you see implementation itself as an important part of discovering edge cases, while I see developers as being primarily focused on solving the main problem and delivering the feature. Chasing perfect code is expensive and there always has to be a trade-off between development time and covering every possible scenario.

Many edge cases are subtle enough that they can be overlooked during implementation. Testers, on the other hand, often approach the system with a different mindset and are more likely to uncover them. In reality, it's probably a combination of implementation, testing, and production usage that reveals most edge cases.

And I absolutely agree that AI-generated code introduces additional mental overhead.

Alex • May 29 • Edited

It's expensive to write code that is cheap to maintain. Obviously, there is a balance for each particular case. I don't think we differ here.

Sometimes it makes sense to rush a feature as fast as possible. And sometimes it's better for core functionality to operate flawlessly. There is no universal practice.

I didn't say btw that code should be perfect. But there are a lot of critical systems, where implementation matters. There are domains where testing is more expensive than development. Not everything is measured in features.

Harsh • May 28 • Edited

Daniel, this is the fairest comment in the thread. Thank you. You are absolutely right, this is not an AI problem. It's a missing-edge-cases problem. Humans have been shipping happy-path-only code forever. AI just makes it easier to ship that same blind spot faster. "Covering edge cases usually comes after seeing production crashes yourself." That's the hard-earned part. AI hasn't been burned yet. It doesn't have the scar tissue. It will write the happy path every time unless you explicitly tell it not to.

"Treat it as a draft, review it critically." Yes. The source doesn't change the responsibility. AI or human, the code you ship is yours.

Thanks for adding the nuance. This is the most balanced take in the thread. 🙌

p4nd3m1c • Jun 1

Yes, you are right. Many wannabe coders just buy Codex or opencode tokens and start slamming the keyboard; they don't even know what memory is! And since AI is made FREE to use, many startups are being launched, made totally by AI, and people who used to CODE THEMSELVES are being replaced by AI SLOP. We need to learn how to use AI better, not make it our god and Hallelujah!

AI will not take our jobs. It will destroy the world as we know it!

Harsh • Jun 1

I share the concern, not the "destroy the world" conclusion. "Wannabe coders don't know what memory is": real problem. AI makes it dangerous. "People who coded themselves replaced by AI slop": valid worry. Senior devs use AI to ship more of what matters. They know what good looks like. "We need to use AI better, not make it our god": 100% agree. The problem isn't the tool. It's treating the tool as a replacement for thinking.

AI won't destroy the world. Over-trusting it without understanding will destroy systems, trust, and careers.

Thanks for the passionate take. 🙌

p4nd3m1c • Jun 2

btw, which AI agentic env do you use? And which would you rank as the WORST of them? Just a question.

Harsh • Jun 2

Used: Cursor, Cline, Continue, vanilla Claude/GPT. Best: Cursor. Integrates well, good context awareness. Worst? Not a specific tool. Any tool that hides too much, where you can't see what assumptions it's making.

The worst is when you stop asking "why did it do that?"

What's been your experience? 🙌

p4nd3m1c • Jun 2 • Edited

Oh, I see. You have tried many things, yeah.
I feel like every AI agentic env is good, unless it thinks for 30 straight minutes every time I ask for optimizations in my code!
Currently, I think opencode is best, but a few minutes ago I discovered InfoStealer logic in it, so I think I will not use it again. You see, opencode was committing my code to a private repo somewhere. I think I might be mistaken, but I am waiting for a reply on the issue I posted on the GitHub opencode anomaly something account! I think I am being stupid, but let's see!

TxDesk • May 30

Worst story: 13 gates of clean AI-generated code, 107 passing tests, one bug that survived all of it.

Last night I was shipping a frontend feature with per-edit review on every AI-generated file. Plan locked upfront. Strict review gates. Tests required green before each gate closed. By gate 13, we had a custom hook, 6 components, full component test coverage, typecheck clean. 107/107 tests passing. Looked like a clean ship.

Gate 14 was live verification. Page never resolved. Stuck on skeleton forever.

The hook was double-unwrapping a { data: T } envelope. The HTTP client already strips .data universally before returning. The hook then re-stripped, got undefined, page sat on if (!query.data) return <Skeleton />. Three components downstream, undefined propagating quietly, no exception.

Here's the part that fits your post but extends it: the bug wasn't AI making a bad assumption. The AI had executed the plan faithfully. The PLAN had a bad assumption. Plan section 0 said "the hook unwraps .data in queryFn", which was a misread of how the underlying HTTP client worked. AI implemented the plan correctly. Tests passed because the test mocks ALSO returned the wire-format envelope, the hook re-unwrapped them in tests, and the assertions held. Both layers had the same bug, so the closed loop verified itself.

Your four practices catch AI-introduced assumptions. They wouldn't have caught this one because the assumption wasn't in the AI's output. It was in the spec the AI was executing against.

The extension I'd add to your list: review the spec for invisible assumptions before letting AI execute on it. Specifically check that any "the layer below already does X" claim in the spec is verified, not asserted. In my case, "api.ts already unwraps .data" was the kind of claim that needed an actual one-line read of api.ts to verify, not a confident-sounding sentence in the plan.

Cost ratio: ~6 hours of correct AI-driven work to produce something that would have failed in production. Saved by 30 minutes of live verification on the real browser. Live verification is the version of your debugging-tax practice that scales when the AI is good enough that the bug isn't in the code anymore. It's in the contract between layers.

Fix was two-part: remove the double-unwrap in the hook, and strip the wire envelope from every test mock so they faithfully simulate the production layer. Both had to land together, neither alone would have worked.
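A minimal Python sketch of the failure mode described above, under stated assumptions: the original was a frontend hook and HTTP client, and every name below (http_get, load_balance_buggy, the mocks) is hypothetical. It shows how a mock that shares the spec's wrong assumption lets the buggy code pass while the real client exposes it.

```python
def http_get(url: str) -> dict:
    """Stand-in for the real client: it already strips the envelope."""
    wire_response = {"data": {"balance": 42}}  # what the server sends
    return wire_response["data"]


def load_balance_buggy(fetch=http_get):
    payload = fetch("/balance")
    return payload.get("data")  # second unwrap: the key is already gone


def load_balance_fixed(fetch=http_get):
    return fetch("/balance")  # trust the client's single unwrap


def mock_with_envelope(url: str) -> dict:
    # Mock written from the spec: returns the wire format, same wrong assumption.
    return {"data": {"balance": 42}}


def mock_faithful(url: str) -> dict:
    # Mock that mirrors what the real client actually returns.
    return {"balance": 42}


# Closed loop: the buggy code and the envelope mock agree with each other.
print(load_balance_buggy(mock_with_envelope))  # {'balance': 42}
# Contact with the real layer: the value silently becomes None.
print(load_balance_buggy(http_get))            # None
# Both halves of the fix together.
print(load_balance_fixed(mock_faithful))       # {'balance': 42}
print(load_balance_fixed(http_get))            # {'balance': 42}
```

Note that the fixed code run against the old envelope mock would now return the wrapped dict, which is why the two changes have to land together.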

Harsh • May 30

TxDesk, this is the most important comment in the thread. Thank you for writing it up. "The bug wasn't AI making a bad assumption. The AI had executed the plan faithfully. The PLAN had a bad assumption." This is the next layer. Level 5 isn't just reviewing AI output. It's reviewing the instructions you gave the AI. The AI can be perfect and still fail if your plan was wrong. "Tests passed because the test mocks ALSO returned the wire-format envelope. Both layers had the same bug, so the closed loop verified itself." This is terrifying. The tests didn't catch the bug because the tests shared the bug. The closed loop validated itself. No error, because no difference. The extension: "review the spec for invisible assumptions before letting AI execute on it."

Yes. Not just review the code. Review the plan. The AI will do what you said, not what you meant. So the human's job isn't just code review, it's spec review. "Live verification is the version of your debugging-tax practice that scales when the bug isn't in the code anymore. It's in the contract between layers." This is the key insight. When the AI becomes good enough, the bugs won't be syntax or logic. They'll be mismatched assumptions between layers. And the only way to catch those is to run the code in the real environment. 6 hours of correct AI-driven work. Saved by 30 minutes of live verification.

The ratio? The tax doesn't disappear. It just moves up the stack.

Thank you for this. It's the most valuable comment in the thread. 🙌

TxDesk • May 31

Glad it resonated. The deeper failure mode I keep seeing: the spec and the tests are usually written by the same person at the same time, so they encode the same mental model. The mock is the spec, the test is the mock, the code matches both. Three layers of consistency, zero contact with reality.

The only thing that breaks it is running against the actual external surface - a real wallet, a real RPC, a real protocol contract. Once the loop has a foreign node that doesn't share your mental model, mismatched assumptions surface immediately. Until then, you're just verifying that you're consistent with yourself.

Harsh • May 31

"Three layers of consistency, zero contact with reality." That's the line. The mock is the spec. The test is the mock. The code matches both. Perfect consistency. Perfect irrelevance. The only thing that breaks it is a foreign node that doesn't share your mental model.

Beautifully said. Thank you for this thread, TxDesk. 🙌

Ofri Peretz • May 29

The naming trap resonates more than the logic bugs do for me. A variable called data isn't just a readability annoyance — in security-sensitive paths it's a liability, because vague names let unsafe values flow further before anyone questions what they actually contain. I've started treating AI-generated variable names as a linting signal: if the name doesn't encode the domain (raw vs. validated, user-supplied vs. internal), I treat the code as unreviewed regardless of how clean the logic looks. The 30-second write / 5-hour debug ratio is real, but I'd frame the fix differently than "slow down at write time" — it's more about what static analysis you run before you trust the output, because the AI isn't going to tell you what it assumed.

Harsh • May 29

Ofri, "vague names let unsafe values flow further before anyone questions" them: this is a security insight most people miss. A variable called data isn't just annoying. It's dangerous. Because no one can tell at a glance whether it's been sanitized, validated, or still contains user input. The vagueness hides the risk. If the name doesn't encode the domain (raw vs. validated, user-supplied vs. internal), treat the code as unreviewed.

That's a concrete rule. Not "use better names". If the name doesn't tell you the state, reject the code. "What static analysis you run before you trust the output."

This is the frame shift. The article said slow down at write time. You're saying automate the checking so you don't have to slow down manually. Different approach, same goal: catch assumptions before they become production bugs.

The AI won't tell you what it assumed. So you need a system that checks for you.

Thank you for this: security lens, actionable rule, frame shift. Three wins in one comment. 🙌

Stoyan Minchev • May 28 • Edited

I usually ask an AI to do a code review.
In the best case, new session, different model. There are different code review approaches that can be used as well. Things like that happen not only with AI-generated code. How many times have you been in such a situation, but with code written by a human? The problem might not be in the AI, but in the process. ;)
And when things like that happen, this knowledge must be kept so that the AI doesn't do it again in the next session. ;)

Harsh • May 28

Stoyan, fair point. Humans cause these problems too. The difference isn't frequency, it's recovery. Human code has fingerprints. Intent. AI code is smooth. No intent to recover. "The problem might be in the process": agreed. The "generate, copy, ship" process skips the assumption check.

"Knowledge must be kept so AI doesn't repeat it": the hardest part. Humans learn from mistakes. AI doesn't, unless you explicitly save the lesson.

Thanks for this layer. 🙌

Stoyan Minchev • May 29

This hit me. Totally agree with you. AI can't learn by itself.

In all cases, we can't leave it unsupervised.

Sephyi • Jun 1 • Edited

Are you doing TDD — having it write the test case first, then the implementation? I find that alone often makes things noticeably better. There's also the repeated-run technique (I forget what it's called) where you basically rerun the same prompt several times. I'd guarantee that even on the 10th run you'll still surface findings worth fixing.

Beyond that, I personally lean on dialectic verification: at least three separate models (e.g. Claude Opus, GPT-5.5, Gemini 3.1 Pro) each perform an independent, detailed review and produce a standardized report, which all get passed to a model of your choosing with fresh context that then synthesizes a unified final review. That said, I reserve this almost exclusively for large review runs at milestones, or after I've let an agent implement a plan. Lastly, always plan ahead. And after the review, instruct the agent to implement the plan accordingly.

If you're not confident reviewing the final report yourself, I'd suggest running each model 3× — giving you nine reports total. And then rating each finding by how many agents flagged it, so the unified report's findings table carries a confidence rating.

Another thing I just remembered is to always establish concrete coding standards and architecture at the beginning of a project.

The best results with AI-generated content were definitely achieved when I followed all these practices.
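The "rate each finding by how many agents flagged it" step can be sketched in a few lines of Python. This is only an illustration: it assumes each report has already been reduced to a set of normalized finding labels (the hard part in practice), and the function name and labels are made up.

```python
from collections import Counter


def rate_findings(reports: list[set[str]]) -> list[tuple[str, float]]:
    """Return (finding, share of reports that flagged it), highest first.

    Hypothetical helper: assumes findings are already normalized labels.
    """
    counts = Counter(finding for report in reports for finding in report)
    total_reports = len(reports)
    rated = [(finding, count / total_reports) for finding, count in counts.items()]
    # Sort by confidence descending, then by name for a stable order.
    return sorted(rated, key=lambda item: (-item[1], item[0]))


reports = [
    {"unchecked empty list", "vague variable name"},
    {"unchecked empty list"},
    {"unchecked empty list", "missing timeout"},
]
for finding, confidence in rate_findings(reports):
    print(f"{confidence:.0%}  {finding}")
```

A finding flagged by every report is a strong signal; one flagged by a single report is a prompt to go and look, not a verdict.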

Harsh • Jun 1

Sephyi, the repeated-run technique is underrated. "Even the 10th run surfaces something new." The model isn't deterministic: same prompt, different output, different flaws. Dialectic verification (three separate models review independently, then a fourth synthesizes a unified review): this is Level 3 (cross-model) on steroids. Not just "run another model". Have them disagree, then synthesize the disagreement into signal. Nine reports total (3 models × 3 runs), rating findings by how many agents flagged them: this is the meta-layer. Consensus = confidence. Disagreement = investigation needed. The trade-off: this is expensive (time, tokens, attention). You reserve it for milestones, which is the right call. Not every PR needs nine reports.

"If you're not confident reviewing the final report yourself": this is the honest admission. The human still has to be the one who can review it. The system helps, but doesn't replace judgment.

Thanks for sharing this. It's the most sophisticated workflow in the thread. 🙌

Sephyi • Jun 1

No problem. ❤️ If you'd like, I can happily look up the Skill I use for this later, in case you could use it. Essentially, it instructs Claude Code to use other models besides its native ones, including Codex and Gemini CLI.

Andrii Krugliak • May 29

The 30-seconds-to-write, 5-hours-to-debug split is the number nobody puts on the slide. The quiet assumption that crashes at 1% is worse than a loud failure, because it ships looking fine. I've started treating "time until I trust it in prod" as the real cost, not write time.

Harsh • May 29

Andrii, "time until I trust it in prod": the real metric. Not write time. Trust time. "Quiet assumption that crashes at 1% is worse than loud failure, because it ships looking fine." Loud failure gets caught. Quiet assumption looks like success until it doesn't. You shifted from speed to confidence. Not how fast to generate. How fast to trust.

Smartest reframe here. 🙌

Andrii Krugliak • Jun 4

Trust-time is measurable in a way write-time never forced us to be: how many times did it ship something that passed every check and still broke at 1 percent. We started logging that as its own number, separate from velocity, and it quietly changed which agents we let run unattended. The loud failures were never the ones that hurt.

Backrun • May 29 • Edited

There's a step nobody talks about that happens right after the code is "done."

For non-technical users like marketers and solo founders, the debugging problem you described actually starts before debugging. It starts at deploy. AI gives them the HTML in 30 seconds. Then they spend the next 2 hours trying to figure out Netlify, GitHub Pages, FTP, or just... giving up and leaving the HTML sitting in the chat window.

That's literally why I built HTML Deployer, a Chrome extension that lets you deploy AI-generated HTML directly from your ChatGPT or Claude tab without touching a terminal. The "deploy tax" is as real as your debugging tax, it just hits a different audience.
Great post by the way. The empty list bug story is painfully relatable.

Harsh • May 29

Backrun, "deploy tax" is a real thing, and you're right, nobody talks about it. For developers, deploy is an afterthought. For non-technical users, it's the wall. They get the code but they can't get it live. The chat window becomes a graveyard of working HTML that never saw the light of day.

"The debugging problem starts before debugging, at deploy": this is the insight. Different audience, same hidden cost. Speed at generation, friction at delivery. HTML Deployer sounds genuinely useful for this gap. No terminal, no Netlify config, just publish.

The empty list bug: glad it landed. And thanks for the kind words. 🙌

xulingfeng • May 29

30 seconds to write, 5 hours to debug — that ratio hit hard. We run Hermes (an agent framework) for automated testing and I've noticed the same pattern: AI generates tests quickly but misses the edge cases that a human with context would catch naturally.

We ended up building a validation layer that forces the agent to explicitly state its assumptions before generating code. It hasn't eliminated the debugging time entirely, but it's cut it by about 60%.

Curious — did you end up with a systematic approach to catch those quiet assumptions, or is it still a per-case thing?

Harsh • May 29

xulingfeng, a validation layer before generation: best pattern in the thread. A 60% reduction is huge. Honest that it didn't eliminate it entirely. AI can't know assumptions it doesn't know.

Systematic vs per-case? Still per-case. But patterns are emerging:
Empty pattern (lists, inputs, results)
Wrong type pattern (raw vs validated)
Edge case combinations

I keep a mental checklist. After enough 5-hour sessions, you see the same shapes.

What does your validation layer check? Would love to learn more. 🙌

xulingfeng • May 29

Our validation layer checks three things before generation:
Structure — Does the output match the expected schema/type? (List vs dict, required fields present.)
Constraints — Business rules models tend to ignore. (ID can't be empty, result set can't be null.)
Consistency — Cross-field sanity checks. (Sum of parts equals the total.)
Can't catch everything, but it intercepts the most expensive failures.
Followed! Great discussion 🙌
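For readers who want to picture the three checks, here is a small Python sketch. It is not the validation layer described above, just an illustration under invented assumptions: a hypothetical invoice record with an id, a list of line items in cents, and a total.

```python
def validate_invoice(invoice: object) -> list[str]:
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical example: the invoice shape and rules are assumptions.
    """
    # Structure: right container type, required fields present.
    if not isinstance(invoice, dict):
        return ["structure: expected a dict"]
    missing = [field for field in ("id", "line_items", "total") if field not in invoice]
    if missing:
        return [f"structure: missing fields {missing}"]

    problems = []
    # Constraints: business rules a generator tends to ignore.
    if not invoice["id"]:
        problems.append("constraint: id is empty")
    if invoice["line_items"] is None:
        problems.append("constraint: line_items is null")
        return problems  # cannot run the consistency check without items

    # Consistency: the sum of the parts must equal the total.
    if sum(invoice["line_items"]) != invoice["total"]:
        problems.append("consistency: line items do not sum to total")
    return problems


print(validate_invoice({"id": "A1", "line_items": [500, 250], "total": 750}))  # []
print(validate_invoice({"id": "", "line_items": [500], "total": 900}))
```

Amounts are integers (cents) so the consistency check is not thrown off by floating-point rounding.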

Harsh • May 29

xulingfeng, Structure, Constraints, Consistency. Three simple checks. 60% reduction. "Can't catch everything, but it intercepts the most expensive failures."

That's the 60% right there.

Followed back. Great discussion. 🏅

xulingfeng • May 30

Appreciate the follow back! 🙌 60% on simple structural checks is a solid baseline — we see similar numbers. The remaining 40% is where the interesting edge cases live. Glad the discussion resonated!
