Bala Paranj

Posted on Jun 3

Fallacies of GenAI Development #6: AI-Generated Code Is an Asset

#ai #softwaredevelopment #architecture #engineering

This is the sixth in a series of eight posts on the false assumptions teams make when building with generative AI. Fallacy #1 covered the generation-engineering gap. Fallacy #2 covered plausible vs. correct. Fallacy #3 covered AI verifying AI. Fallacy #4 covered removing review. Fallacy #5 covered context vs. verification. This post covers the assumption that generated code is value — when it's actually cost.

The Fallacy

"The AI generated 10,000 lines this week. We're 10x more productive."

Why it's tempting

Productivity has always been hard to measure in software. Lines of code was a bad metric, but it was a metric. When AI made code generation fast, the metric became irresistible again. The team generated more code. The PRs are larger. The features ship faster. The graphs go up and to the right.

Leadership loves it. More output. More features. More velocity. The investment in AI tooling is paying off — look at the numbers. The team is producing more than ever.

And it feels good to the engineers too. You describe a feature. The AI generates the implementation. You ship it. The dopamine hit of productivity is real. You built many things today. The backlog is shrinking. The sprint velocity is through the roof.

Why it's wrong

Jeff Atwood said it plainly: code is a liability, not an asset. The concept originates in Lean Manufacturing, applied to software by Mary and Tom Poppendieck (2003): unshipped code is Work in Progress waste, and shipped code is a maintenance burden. AI didn't change the economics — it made it faster to create the liability.

Every line of code is something you have to:

Compile. More code means longer build times. Bigger binaries.
Test. More code means more test cases needed. Dependencies grow quadratically — 10x code may mean 100x test compute.
Debug. More code means more potential failure points. More places for bugs to hide.
Secure. More code means more attack surface. More code paths to audit.
Understand. More code means more cognitive load. More context to hold when making changes.
Maintain. More code means more things that break when dependencies change. More migration work when frameworks evolve.

Studies by the Consortium for Information & Software Quality (CISQ) put numbers to this: the cost of developing code is typically only 20–30% of its total lifecycle cost. The remaining 70–80% is maintenance. If AI makes development 10x faster but doesn't change the maintenance profile, the savings are capped: a 90% cut to a 30% development slice removes at most 27% of the total lifecycle cost, and the maintenance share is untouched. If the AI also increases code volume, maintenance grows with it, and the total cost of ownership can rise even if development were free.
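A minimal sketch of the lifecycle arithmetic. It assumes a 30/70 development-to-maintenance split and that maintenance cost scales linearly with code volume; `lifecycle_cost` and its parameters are illustrative names, not a CISQ model.

```python
def lifecycle_cost(dev_share=0.30, dev_speedup=1.0, volume_multiplier=1.0):
    """Relative lifecycle cost, where 1.0 is the baseline without AI.

    Assumptions (illustrative, not from CISQ): development cost divides by
    dev_speedup, and maintenance cost scales linearly with code volume.
    """
    maintenance_share = 1.0 - dev_share
    development = dev_share * volume_multiplier / dev_speedup
    maintenance = maintenance_share * volume_multiplier
    return development + maintenance


# 10x faster development, same code volume: about 27% cheaper overall.
same_volume = lifecycle_cost(dev_speedup=10)

# 10x faster development, but twice as much code to maintain: more expensive.
double_volume = lifecycle_cost(dev_speedup=10, volume_multiplier=2)

# Free development (infinite speedup) with 1.5x the code still costs more.
free_dev = lifecycle_cost(dev_speedup=float("inf"), volume_multiplier=1.5)
```

Under these assumptions the best case is 0.73 of the baseline, and any meaningful growth in volume pushes the total above 1.0.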

The team that generated 10,000 lines this week didn't create 10,000 lines of value. They created 10,000 lines of ongoing cost. The value was in the feature. The cost is in the code. The feature could have been delivered with 1,000 lines if the right abstractions existed. The 9,000 extra lines are pure liability.

The boom

Month 1-3: The output surge. The team generates more code than ever. Features ship. The backlog shrinks. Sprint velocity doubles, then triples. Leadership presents the metrics at the quarterly review. AI is working.

Month 4-6: The build slows down. Compilation time increases. CI pipeline takes longer. Developers wait for builds. The wait time is small at first — 10%, 15% longer. Nobody notices because the generation is so fast. But the trend is upward.

Month 7-9: The test burden compounds. New code depends on existing code. Existing code depends on other existing code. The dependency graph grows quadratically. A change to one module triggers tests in 50 other modules. The test suite that ran in 8 minutes now takes 45 minutes. Developers start skipping test runs locally and relying on CI. CI queues back up.

Winters, Manshreck, and Wright document this in Software Engineering at Google (2020). Adam Bender names it precisely: "If your codebase is 10 times larger and you're trying to test all the dependencies so that you're sure nothing will break, you may have upwards of 100 times as many tests running. Maybe 1,000 times as many tests. That's going to be a line item in your budget."
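The growth in test load can be illustrated with a toy dependency graph. This sketch assumes a worst-case chain where each module imports the one before it; the function names are hypothetical and the graph is not taken from any real codebase.

```python
from collections import defaultdict, deque


def affected_modules(deps, changed):
    """Return every module whose tests must rerun when `changed` is edited.

    `deps` maps each module to the modules it imports. The result includes
    `changed` itself plus all direct and transitive dependents.
    """
    dependents = defaultdict(set)
    for module, imports in deps.items():
        for imported in imports:
            dependents[imported].add(module)

    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def total_test_runs(deps):
    """Module test runs needed if every module is changed exactly once."""
    return sum(len(affected_modules(deps, module)) for module in deps)


def chain(n):
    """Worst-case toy graph: module i imports module i - 1."""
    return {i: ([i - 1] if i > 0 else []) for i in range(n)}


small = total_test_runs(chain(10))    # 55
large = total_test_runs(chain(100))   # 5050
```

Ten times the modules yields roughly 92 times the test runs in this chain. Real graphs are less extreme, but the direction is the same.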

Month 10-12: The maintenance cliff. A dependency needs updating. A security patch requires changes. A framework is deprecated. Each change touches thousands of generated lines that nobody fully understands. The team that generated code in minutes spends weeks migrating it. The code that was cheap to produce is expensive to maintain. Lehman's Second Law of Software Evolution (1980) predicts this: as a program is continually changed, its complexity increases unless active work is done to reduce it. AI is purely additive — it adds structure-deteriorating code at high speed. Without deliberate refactoring and deletion, the system reaches the maintenance cliff faster than human teams can recover.

There's a subtler rot underneath. Five developers each generated their own version of formatDate(), parseUserInput(), and retryWithBackoff(). Each used a different prompt. Each produced a slightly different implementation. A bug is found in one version of parseUserInput() — but nobody knows four other versions exist scattered across the codebase. The fix patches one. The other four remain broken. This is inconsistency debt — the invisible cost of generation without reuse. A library would have had one implementation. Five generations created five liabilities. Google's "One Version Rule" (documented in Software Engineering at Google) exists precisely to prevent this: every library has a single version across the monorepo. AI generation is the ultimate violator of the One Version Rule — each prompt produces a "new version" of a common utility, creating dependency hell within a single codebase.
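One way to surface this kind of duplication is to scan for function names defined in more than one file. The sketch below uses Python's `ast` module on hypothetical in-memory sources; matching by name alone is an assumption that flags candidates for review rather than proving two functions are copies.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(sources):
    """Map each function name defined in more than one file to those files.

    `sources` maps a file path to its Python source text. Matching is by
    name only, so this flags candidates for review, not proven duplicates.
    """
    locations = defaultdict(set)
    for path, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                locations[node.name].add(path)
    return {name: sorted(paths) for name, paths in locations.items()
            if len(paths) > 1}


# Hypothetical files, each with its own generated copy of the same helper.
sources = {
    "billing/utils.py": "def parse_user_input(raw):\n    return raw.strip()\n",
    "signup/helpers.py": "def parse_user_input(raw):\n    return raw.strip().lower()\n",
    "reports/dates.py": "def format_date(d):\n    return d.isoformat()\n",
}
duplicates = find_duplicate_functions(sources)
```

Here `duplicates` lists `parse_user_input` with both files that define it, which is the list a bug fix would need to visit.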

And then the quiet realization: the team isn't building new features anymore. They're maintaining old code. The AI generates new code faster than ever. But the team spends most of their time on the code that already exists — patching, migrating, debugging, securing. The generation speed is irrelevant when the maintenance burden consumes all available time.

The ratio inverts. In month 1, the team spent 90% of their time building new features and 10% maintaining existing code. By month 12, it's 30% building and 70% maintaining. The AI made the 30% faster. Nobody made the 70% faster. The overall velocity DECREASES even as the generation speed INCREASES.
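The inversion is Amdahl's law applied to team time. A minimal sketch, assuming the AI speeds up only the building share and that the speedup is 10x (a figure chosen for illustration):

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: whole-job speedup when only part of the work gets faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


month_1 = overall_speedup(0.90, 10)   # about 5.3x
month_12 = overall_speedup(0.30, 10)  # about 1.4x
ceiling = 1.0 / (1.0 - 0.30)          # about 1.43x, even with instant generation
```

The same tool that delivered about 5.3x in month 1 delivers about 1.4x in month 12, and no improvement in generation speed can lift it past the ceiling set by the maintenance share.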

The unit of progress is wrong

The software industry has been optimizing for the wrong unit of progress for decades. AI made the mistake faster.

Code generation:     More lines → more volume → more liability
Function composition: Same functions, composed differently → more capability → near-zero new liability

Unix proved the right unit 50 years ago. Doug McIlroy, the inventor of Unix pipes, summarized the philosophy in 1978: "Expect the output of every program to become the input to another, as yet unknown, program." grep, sort, uniq, awk, sed — each is a function with a stable interface (stdin/stdout). They've survived:

The transition from mainframes to minicomputers to PCs to servers to cloud
Six generations of operating systems
Dozens of programming language fashions
Every deployment paradigm from tape to containers

They survived because the unit of reuse is the function, not the code. Nobody regenerates grep for each project. Nobody writes a new sorting algorithm for each application. You compose existing functions through stable interfaces. The value is in the composition. The functions are reusable. The liability is near zero because you're not maintaining them; that is the job of the upstream maintainers. AI generation is an anti-Unix force: it generates custom, non-standard implementations every time instead of reusing standard tools.
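The same idea can be sketched outside the shell. The example below is a rough Python analog of `grep ERROR | awk '{print $3}' | sort | uniq -c`; the stage names and the log lines are hypothetical, and each stage shares one interface: an iterable of lines in, an iterable out.

```python
from collections import Counter
from functools import reduce


def grep(pattern):
    """Keep lines containing `pattern` (a plain substring, not a regex)."""
    return lambda lines: (line for line in lines if pattern in line)


def field(index):
    """Pick one whitespace-separated field per line (zero-based index)."""
    return lambda lines: (line.split()[index] for line in lines)


def count_unique(lines):
    """Count occurrences of each line, like `sort | uniq -c`."""
    return sorted(Counter(lines).items())


def pipeline(*stages):
    """Compose stages left to right; each consumes the previous output."""
    return lambda lines: reduce(lambda data, stage: stage(data), stages, lines)


log = [
    "10:01 ERROR db timeout",
    "10:02 INFO  ok",
    "10:03 ERROR db timeout",
    "10:04 ERROR cache miss",
]

errors_by_component = pipeline(grep("ERROR"), field(2), count_unique)
report = errors_by_component(log)   # [("cache", 1), ("db", 2)]
```

A new report is a new call to `pipeline` with the same stages in a different order, not a new implementation.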

Google proved the same principle at organizational scale. When a team needs an RSS feed generator, they reuse the existing RSS function. They don't generate new code that reimplements RSS generation. Google's internal codebase has millions of reusable functions with stable interfaces. The productivity comes from knowing which function to use, not from generating a new implementation.

The metric inversion

The fallacy persists because teams measure the wrong thing:

What teams measure:          What they should measure:
─────────────────────        ───────────────────────────
Lines of code generated      Properties verified per module
PRs merged per sprint        Specification coverage (% of behaviors governed)
Features shipped             Functions reused vs. code generated
Sprint velocity points       Maintenance burden (% time on existing vs. new)

The left column goes up when you generate more code. Leadership celebrates. The right column would show that more generated code means LOWER specification coverage (more ungoverned behavior), LOWER reuse ratios (more reimplemented functions), and HIGHER maintenance burden (more time on existing code).

The team that generated 10,000 lines and verified zero properties is more productive on the left-column metrics and more indebted on the right-column metrics. The left column measures output speed. The right column measures engineering health. When these diverge — output going up, health going down — the team is building debt, not value.

The resolution: compose, don't generate

The alternative to generating more code is composing fewer, better-verified units.

GENERATE (current model):
    Need an RSS parser    → AI generates 200 lines
    Need a date formatter → AI generates 80 lines
    Need an HTTP client   → AI generates 150 lines
    Need JSON validation  → AI generates 120 lines

    Total: 550 lines of new code to maintain
    Properties verified: 0
    Reuse potential: 0 (tightly coupled to this project)

COMPOSE (alternative model):
    Need an RSS parser    → import rss-parser (maintained upstream)
    Need a date formatter → import date-fns (maintained upstream)
    Need an HTTP client   → import http-client (maintained upstream)
    Need JSON validation  → validate against JSON Schema (maintained upstream)

    Total: 4 import statements + 20 lines of composition
    Properties verified: each library has its own test suite
    Reuse potential: 100% (same libraries across every project)

The composed version has 96% less code. 96% less to compile, test, debug, secure, understand, and maintain. The libraries are maintained by their upstream communities. The composition is the only new code — and it's 20 lines, reviewable in minutes.
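A sketch of what the composition might look like, using only Python's standard library in place of the npm packages named above. The function name `feed_to_json` and the sample feed are hypothetical; the parsing, date handling, and serialization are all delegated to maintained modules.

```python
import json
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def feed_to_json(rss_text):
    """Turn RSS 2.0 text into a JSON list of {title, link, published}."""
    channel = ET.fromstring(rss_text).find("channel")
    items = []
    for item in channel.findall("item"):
        # RSS dates use the RFC 822 style that email.utils already parses.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": published.date().isoformat(),
        })
    return json.dumps(items)


SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Hello</title><link>https://example.com/1</link>
<pubDate>Tue, 03 Jun 2025 09:00:00 GMT</pubDate></item>
</channel></rss>"""

result = json.loads(feed_to_json(SAMPLE))
```

The only new code is the wiring. For untrusted feeds, a hardened XML parser would be a safer choice than the default one used in this sketch.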

AI's best role isn't generating the 550 lines. It's helping you FIND the right four libraries and COMPOSE them correctly. The generation is the least valuable part. The selection and composition are the most valuable parts.

The expert AI-assisted engineer isn't a faster typist. They're a better librarian. Their value isn't in generating a custom implementation of a common pattern. It's in knowing that an optimized, well-tested, actively maintained library already exists, and using the AI to write the 20 lines of glue code to connect it. The librarian ships less code and more capability. The typist ships more code and more debt. Caldiera and Basili (1991) demonstrated this in their research on software reuse: successful software evolution depends on prioritizing reuse over creation. AI makes creation so cheap that it crowds out the impulse to search for existing libraries, a phenomenon engineers recognize as "Not Invented Here Syndrome," now running at machine speed.

When generation is appropriate

Generation is appropriate when:

No reusable function exists. Genuinely novel logic with no upstream library. Generate it — but immediately extract it as a reusable function with a stable interface, tests, and a specification. Don't leave it inline.

The code is disposable. Prototypes, experiments, one-off scripts. Generate freely. But don't let disposable code become production code. Bender's isolation principle: "You don't want that cool prototype code to find its way into production."

The composition IS the code. Glue code that connects well-tested components. This is appropriate for generation because the value is in the wiring, the individual components are already verified, and the glue code can be verified against the interface contracts of the components it connects.

In each case, the generated code should be the MINIMUM necessary — not the maximum the AI can produce. The AI that generates the least code to solve the problem is more valuable than the AI that generates the most. Less code = less liability = less maintenance = more time for new capabilities.
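As an illustration of "extract it with a stable interface, tests, and a specification", here is one shared retry helper instead of five generated copies. The specification in the docstring is hypothetical, written for this sketch; the injected `sleep` parameter is a design choice that lets the properties be checked without waiting.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted.

    Specification (hypothetical, for illustration):
      - `operation` is called at most `attempts` times.
      - After the k-th failure (k = 1, 2, ...) the wait is base_delay * 2**(k-1).
      - The last exception propagates unchanged; no sleep follows it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Checking the specification with a fake sleep, so the check runs instantly.
calls, delays = [], []


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"


outcome = retry_with_backoff(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
```

Each line of the specification maps to an assertion on `calls`, `delays`, or `outcome`, so a bug fix lands in one place and is verified once.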

The Delete Code Test

If your team generated 10,000 lines last week and you deleted 9,000 of them, replacing them with imports and 200 lines of composition — would the system work the same way?

If yes, you generated 9,000 lines of pure liability.

If you're not sure, you don't have enough specifications to know — which is Fallacy #2 (plausible ≠ correct) and Fallacy #5 (context ≠ verification) revisiting you.
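The test can be made mechanical: run the generated code and its library replacement over the inputs your specification covers and list where they disagree. In this sketch `generated_parse_date` is a hypothetical stand-in for generated code, and `SPEC_INPUTS` is an assumed specification.

```python
from datetime import date


def generated_parse_date(text):
    """Stand-in for a hand-generated parser (hypothetical)."""
    year, month, day = text.split("-")
    return date(int(year), int(month), int(day))


def library_parse_date(text):
    """The candidate replacement: one call into the standard library."""
    return date.fromisoformat(text)


def find_disagreements(old, new, inputs):
    """Return the inputs where `old` and `new` produce different outcomes.

    Raising ValueError counts as an outcome, so both must reject the same inputs.
    """
    def outcome(fn, value):
        try:
            return fn(value)
        except ValueError:
            return ValueError

    return [value for value in inputs if outcome(old, value) != outcome(new, value)]


SPEC_INPUTS = ["2025-06-03", "1999-12-31", "2024-02-29", "2025-02-30", "not-a-date"]
disagreements = find_disagreements(generated_parse_date, library_parse_date, SPEC_INPUTS)
```

An empty list means the generated version can be deleted as far as the specification is concerned. Inputs the specification does not cover, such as unpadded months, are where the two may still differ, and that gap is the missing specification.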

The goal isn't more code. It's more CAPABILITY with LESS CODE. Every line you didn't generate is a line you don't maintain. Every function you reused instead of regenerated is a function someone else maintains. Every specification you verified is a property that holds regardless of how much code is behind it.

Code is a liability. Less of it is better. The most productive team isn't the one that generates the most. It's the one that ships the most capability while maintaining the least code.

What you can do this week

1. Measure your generation-to-reuse ratio. For your last 10 AI-generated files, count: how many imported existing libraries vs. how many reimplemented functionality that libraries already provide? If reimplementation exceeds reuse, the AI is generating liability.

2. Before generating, ask: "Does this function already exist?" Better yet, ask the AI: "Is there a popular, well-maintained open-source library that handles [Task X]?" BEFORE asking it: "Write a function that handles [Task X]." If a library exists, the AI's job is to write the integration code, not the logic. The 5-minute search saves weeks of future maintenance.

3. Track maintenance burden monthly. What percentage of your team's time goes to maintaining existing code vs. building new capabilities? If that percentage is rising while code generation is also rising, the generation is feeding the maintenance burden. The metric tells you whether your AI investment is producing value or producing debt.

4. Measure your additions-to-deletions ratio. A high-performing AI-assisted team should be deleting redundant code and replacing it with better abstractions as often as they add new code. If your ratio is 10:1 additions to deletions, the codebase is growing without consolidation. If it's closer to 2:1 or even 1:1, the team is composing: replacing generated sprawl with verified, reusable units. The ratio tells you whether the AI is building capability or building debt.
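The ratio of added to deleted lines can be read from version control. A minimal sketch, assuming input in the tab-separated format that `git log --numstat` prints; the sample data and file paths are hypothetical.

```python
def additions_to_deletions(numstat_text):
    """Compute (added, deleted, ratio) from `git log --numstat` style lines.

    Each counted line has three tab-separated fields: added, deleted, path.
    Binary files report "-" for both counts and are skipped, as are any
    lines that do not match the format (commit headers, blank lines).
    """
    added = deleted = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        added += int(parts[0])
        deleted += int(parts[1])
    ratio = added / deleted if deleted else float("inf")
    return added, deleted, ratio


# Hypothetical sample; real input could come from:
#   git log --since="1 month ago" --numstat --format=
SAMPLE = "120\t10\tsrc/feed.py\n-\t-\tdocs/logo.png\n80\t90\tsrc/dates.py\n"
added, deleted, ratio = additions_to_deletions(SAMPLE)
```

The sample works out to 200 lines added against 100 deleted, a 2:1 ratio. Tracking the same number month over month shows whether consolidation is keeping pace with generation.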

The AI is a remarkable code generator. That's the problem. The skill isn't generating more code. It's knowing when NOT to generate — when to compose, when to reuse, when to import, and when to write the 20 lines of composition instead of the 550 lines of reimplementation.

Next in the series: Fallacy #7, "Specifications Are a New Artifact You Have to Create." Why the specifications already exist in your codebase, why Parnas told you to create them in 1972, and why the only new thing is mechanical enforcement.

The Fallacies of GenAI Development: eight assumptions every team is making. Each one leads to an architectural failure. Each one has already been solved.

References

Caldiera, G. and Basili, V. (1991). "Identifying and Qualifying Reusable Software Components." IEEE Computer, 24(2).
CISQ (Consortium for Information & Software Quality). "The Cost of Poor Software Quality in the US."
Lehman, M.M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." Proceedings of the IEEE, 68(9).
McIlroy, M.D. (1978). "Unix Time-Sharing System: Foreword." The Bell System Technical Journal, 57(6).
Poppendieck, M. and Poppendieck, T. (2003). Lean Software Development: An Agile Toolkit. Addison-Wesley.
Winters, T., Manshreck, T., and Wright, H. (2020). Software Engineering at Google: Lessons Learned from Programming Over Time. O'Reilly Media.
