DEV Community

Adam - The Developer

Minimal Code Doesn’t Mean Stable Code

Failure modes in distributed systems

The argument sounds reasonable: fewer lines of code mean fewer bugs. Simpler to review, easier to reason about, less surface area for defects. Sounds great. It's true. But it's also incomplete.

The problem starts when backend developers treat production systems like homework assignments. In a single-process app, you control execution. You know the order. Threads might race, but at least they share the same memory and clock.

Once you have APIs talking to databases, webhooks firing at midnight, async jobs on a queue, and three replicas behind a load balancer, the failure modes multiply: connections drop, messages arrive out of order, clocks disagree, and partial failures show up at 3 AM on Tuesdays.

Trimming code doesn't make any of that go away. It just hides the complexity until something breaks.

When Minimal Code Meets Production

Consider what happens when your minimalist masterpiece meets reality:

Your service temporarily loses connection to the database for 30 seconds. Your code has no timeout logic. Requests hang. Users refresh. More requests queue up. Eventually something breaks.

Two instances process the same webhook because you thought "that probably won't happen." No idempotency key, so the charge runs twice. Your balance sheet now has an extra $50,000 in it. Your accountant is confused. Your manager is less confused.

A worker crashes mid-operation. There's no recovery mechanism. The operation is abandoned halfway through, and your data now violates every assumption you made about how it should look.

A retry storm after a downstream blip hammers your API because nothing backs off or deduplicates. Rate limits trip. Legitimate traffic gets dropped. You're debugging an outage caused by code that "handled errors" by logging and returning.

None of these are prevented by writing less. They're prevented by writing the boring safeguards you skipped because they looked redundant.
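To make the retry-storm case concrete, here is a minimal sketch of a retry wrapper that backs off exponentially and adds jitter so replicas don't all retry at the same instant. The name `call_with_retry` and its default values are made up for illustration, and it assumes the wrapped operation signals a transient failure with `ConnectionError`.

```python
import random
import time


def call_with_retry(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with jittered backoff.

    Hypothetical helper: the delay ceiling doubles on each attempt (up to
    `cap` seconds) and the actual delay is a random point below that
    ceiling, so many clients recovering together spread out their retries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                # Out of attempts: surface the failure to the caller
                # instead of logging and returning as if nothing happened.
                raise
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

Backoff only limits how hard you hit a struggling dependency. It does nothing about duplicates, so the retried operation still has to be safe to run twice.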

What Production Actually Requires

Modern backend systems need safeguards that simple applications never had to think about:

Idempotency. Every operation must be safe to retry. A payment webhook redelivered, a queue message processed twice, a client that retries on timeout—all of these need a way to recognize "already done." Operation IDs, version numbers, dedupe keys. Not glamorous. Required.
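As a rough sketch of what "already done" can look like, the class below remembers the result for each operation ID and returns it on a repeat instead of running the handler again. `IdempotentProcessor` and the `evt_123` ID are hypothetical, and the dictionary is an in-memory stand-in: with several replicas, the record of processed IDs would have to live in shared storage, for example a table with a unique constraint on the ID.

```python
import threading


class IdempotentProcessor:
    """Runs a handler at most once per operation ID (in-memory sketch)."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # operation_id -> result of the first run
        self._lock = threading.Lock()

    def process(self, operation_id, payload):
        # The lock makes "check, run, record" one atomic step within this
        # process. It also serializes handlers, which is fine for a sketch
        # but too coarse for real traffic.
        with self._lock:
            if operation_id in self._results:
                return self._results[operation_id]  # duplicate delivery
            result = self._handler(payload)
            self._results[operation_id] = result
            return result


charges = []


def charge(payload):
    charges.append(payload["amount"])
    return {"status": "charged", "amount": payload["amount"]}


processor = IdempotentProcessor(charge)
processor.process("evt_123", {"amount": 50})
processor.process("evt_123", {"amount": 50})  # redelivered webhook
# `charges` now holds a single entry: the second call was recognized.
```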

Timeouts. Requests to other services need deadlines. Without them, one slow dependency quietly ties up threads and connections until the caller runs out too, and the failure cascades. Your code will just sit there, waiting, like a phone call that never connects.
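One way to put a deadline on a call, sketched with `asyncio.wait_for` from the standard library. `call_with_deadline` and `DownstreamTimeout` are names invented for this example, and the half-second budget in the usage note is an arbitrary assumption.

```python
import asyncio


class DownstreamTimeout(Exception):
    """Raised when a dependency does not answer before its deadline."""


async def call_with_deadline(awaitable, seconds):
    """Await `awaitable`, giving up after `seconds`.

    asyncio.wait_for cancels the pending call when the deadline passes,
    so the caller gets its resources back instead of waiting forever.
    """
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError as exc:
        raise DownstreamTimeout(f"no response within {seconds}s") from exc


async def slow_dependency():
    # Stand-in for a service that has stopped responding in time.
    await asyncio.sleep(1)
    return "late"
```

Calling `asyncio.run(call_with_deadline(slow_dependency(), 0.5))` raises `DownstreamTimeout` after roughly half a second rather than holding the request open. What the caller does next (fail fast, serve a cached value, retry with backoff) is a separate decision.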

Compensation Logic. When a multi-step operation fails partway through, something has to undo the work already committed. You can't abandon a half-finished saga and hope nobody notices. That's more code than assuming success. People skip it anyway.
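A bare-bones sketch of the idea: each step is paired with the action that undoes it, and a failure triggers the undo actions of every completed step, newest first. `run_saga` and the order-flow step names are hypothetical. A real saga would also persist its progress, so that a crash in the middle of compensation can be picked up later.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order.

    If an action raises, the compensations of the steps that already
    completed run in reverse order, then the original error is re-raised.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)


log = []


def ship_order():
    raise RuntimeError("shipping unavailable")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (ship_order, lambda: log.append("cancel shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    pass  # the caller still sees the failure, but nothing is left half-done
```

After this runs, `log` shows the two completed steps followed by their compensations in reverse: the refund first, then the stock release. The failed step's own compensation never runs because that step never completed.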

Conflict Detection. When two writers touch the same record—two API instances, a retry overlapping with the original request—you need version checks, timestamps, or optimistic locking. Pretending conflicts don't exist works until two updates land in the wrong order.
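Optimistic locking in miniature: every write names the version it was based on, and a stale version is rejected instead of silently overwriting newer data. `VersionedStore` is an in-memory stand-in invented for this sketch. Against a SQL database the same check is usually an `UPDATE ... WHERE id = ? AND version = ?` followed by a look at the affected row count.

```python
import threading


class VersionConflict(Exception):
    """Raised when a write is based on an outdated version."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def read(self, key):
        """Return (version, value); a missing key reads as version 0."""
        with self._lock:
            return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Store `value` only if nobody wrote since `expected_version`."""
        with self._lock:
            current_version, _ = self._rows.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            new_version = current_version + 1
            self._rows[key] = (new_version, value)
            return new_version


store = VersionedStore()
store.write("account:1", {"balance": 100}, expected_version=0)

# Two writers read the same version, then both try to update.
version_a, _ = store.read("account:1")
version_b, _ = store.read("account:1")
store.write("account:1", {"balance": 90}, expected_version=version_a)
# A second write using version_b would now raise VersionConflict.
```

The losing writer has to re-read and decide again, which is the point: the conflict becomes an explicit event rather than a lost update.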

Observability. Logging, metrics, and traces that let you reconstruct what happened when something fails. Without them, a 3 AM incident is guesswork: you know something broke, but not which request, on which instance, or in what order.
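A small sketch of the logging part: one JSON object per event, always carrying a request ID so that lines from different instances can be stitched back together. `log_event` and the field names are assumptions for illustration, not a standard schema.

```python
import json
import logging
import time

logger = logging.getLogger("payments")  # hypothetical service name


def log_event(event, request_id, **fields):
    """Emit one machine-parseable log line and return it.

    The request ID is the piece that lets you follow a single operation
    across retries, workers, and services.
    """
    record = {
        "event": event,
        "request_id": request_id,
        "ts": time.time(),
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


log_event("charge.attempt", "req-42", amount=50, attempt=1)
log_event("charge.timeout", "req-42", amount=50, attempt=1, waited_s=2.0)
```

Searching for `req-42` then shows the attempt and the timeout side by side, whichever replica handled them.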

You can't delete these and call it simplification. You're just moving complexity from your editor into your on-call rotation.

Less Code vs. Less Noise

Kill redundant abstractions, dead logic, and speculative frameworks. That's good discipline.

Deleting retry wrappers, validation, circuit breakers, or idempotency checks because they "add noise" is a different move.

You're betting stability on dependencies you don't control. When the database hiccups, the partner API times out, or Kubernetes reschedules a pod mid-request, the system doesn't get simpler. It gets wrong.

The Test

If your app runs more than one instance, talks to other services, or processes work asynchronously, these questions will eventually matter:

  1. If a process dies mid-operation, can the system detect it and recover correctly?
  2. If a message is delayed several seconds, what actually happens?
  3. If two workers attempt the same operation at once, is the result deterministic or a coin flip?

If you can't answer all three with specific mechanisms—not vibes, not "we'll fix it in prod"—the codebase isn't simple. It's fragile.
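For the first question, one common mechanism is a lease: a worker claims a job for a limited time and has to keep renewing it, so a crashed worker's claim simply expires and another worker can take over. The sketch below is an in-memory illustration with made-up names (`LeaseTable`, `claim`). A real version would keep leases in shared storage and would still need idempotent job handlers, because a slow worker can outlive its lease.

```python
import time


class LeaseTable:
    """Tracks which worker holds a job and until when (in-memory sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._leases = {}  # job_id -> (worker_id, expires_at)

    def claim(self, job_id, worker_id):
        """Take or renew the lease on `job_id`.

        Returns False while a different worker holds an unexpired lease.
        A healthy worker calls this periodically as a heartbeat; a dead
        worker stops calling, and its lease runs out on its own.
        """
        now = self._clock()
        holder = self._leases.get(job_id)
        if holder is not None:
            held_by, expires_at = holder
            if held_by != worker_id and expires_at > now:
                return False
        self._leases[job_id] = (worker_id, now + self._ttl)
        return True
```

With a 10-second TTL, a second worker's `claim` on the same job fails while the first keeps renewing, and succeeds once the first has been silent for more than 10 seconds.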

Write the safeguards. Handle the failure modes. The goal isn't more lines for their own sake; it's making hidden complexity visible before production does it for you.

Top comments (21)

david duymelinck

It looks like the post mainly targets high volume systems and event driven systems.

I agree that simple code should not mean abandoning safeguards, but I wonder if there are people that think that way once scaling the system horizontally is not an option anymore?

Adam - The Developer

That's a fair point. The examples intentionally focus on horizontally scaled and async systems because those failure modes are becoming increasingly normal in everyday backend development.

A lot of modern applications are distributed long before teams consciously think of them that way. Running multiple replicas behind a load balancer, background workers, queues, retries, webhooks, caches, autoscaling — this is becoming standard infrastructure even for relatively small products.

So this wasn't really aimed only at "massive scale" systems. The point was more that once your application runs across multiple processes, instances, or services, some of the "boring" safeguards stop being optional.

The interesting part is that modern tooling makes distribution feel deceptively invisible. You can deploy three replicas to Kubernetes in minutes, but the moment you do that, problems like duplicate execution, retries, partial failures, ordering issues, and race conditions become real whether you explicitly designed for them or not.

That's really the gap the article was trying to highlight.

david duymelinck

"A lot of modern applications are distributed long before teams consciously think of them that way"

It is true distribution is very easy with today's tools. But that doesn't mean when it happens the consequences should be ignored.

For me, high-volume systems and event-driven systems are two separate things. Most of the time high-volume systems are event driven, but event-driven systems can be low volume.
The safeguards you mention have a lot of reach. For example the 30 second database connection loss should never happen. To fix that as soon as possible monitoring should be set up to bring the database back up. That is not in the scope of the application code, that is operations.

Adam - The Developer

yup, agreed.

I think the key distinction is just that infra/ops and application correctness sit in different layers - both matter, but they solve different parts of the failure story.

And also fair point on event-driven vs. high-volume: they’re orthogonal, but they tend to overlap in practice, which is probably why they get mixed in discussions like this.

Yahaya Oyinkansola

Makes a lot of sense what you've said here. This is why most teams need to understand the amount of technical debt they are adding when going for so-called "simple solutions". Sometimes a simple solution can look complex or ugly, and that's fine as long as issues like race conditions and duplicate requests are solved properly.

Adam - The Developer

Exactly, that’s pretty much the tradeoff.

“Simple” at the surface can still carry hidden complexity if those failure modes aren’t handled explicitly. And yeah, sometimes the correct solution looks a bit ugly precisely because it’s accounting for things like retries, races, and duplication.

The real technical debt usually isn’t in the extra safeguards themselves, it’s in pretending those problems won’t exist.

Elmar Chavez

Wow, I just started exploring backend more and I wasn't expecting to learn great real-world topics here. But yeah, I agree. Less code is often good, but we must also write what is required and what makes our codebase more robust. Thanks for this!

Adam - The Developer

uh huh, “less code” is only good when it’s removing noise, not when it’s removing safeguards.

Once you start dealing with retries, failures, and multiple instances, robustness becomes part of the design, not an optional extra. And sometimes that naturally makes the code a bit heavier, even if the system is actually better engineered.

I'm glad it helped. This is one of those things that only really clicks once you’ve seen a few real production failures.

CapeStart

Minimal code reduces syntax. Stable systems survive reality.

Adam - The Developer

Well said. Syntax is cheap. Reality has retries, timeouts, and partial failures while you're asleep.

Rondo

So true. There's no guarantee that short code is always stable. It's not a matter of length of code but stability itself.
I think readability also comes before the length of code, just in case issues happen. Some people (even myself sometimes) think short code is more readable than long code, but that's not always true.
Thank you for the insight.

Adam - The Developer

Exactly this. Readability > length every time. Those "extra" lines often tell future-you what could go wrong. Appreciate you 🙏

Brian Munz

Years ago I remember being very proud of myself for writing a recursive function which performed some complicated data sorting. It replaced maybe 400 lines of code with 30 lines, but unfortunately for devs to make changes to it, they had to spend frustrating time trying to figure out what was going on. Any savings in lines of code was lost in time and anger 😅

Adam - The Developer

Yeah, I’ve definitely seen this too 😄

Less code feels great at first, but if nobody else can safely touch it later, the “win” disappears pretty fast.

Tyler N

I think that you’ve made a really important point. I think that this topic of developers removing or simply not writing code to handle edge cases (like two users doing the same operation at the same time, messaging delays, operations happening multiple times, like your post said) is a large issue. I think there are large and concerning similarities between what you talked about in your post and the recent massive movement to rewrite as many things as you can in Rust. Often times, developers remove edge cases just to rewrite something in Rust, and many times just completely AI generate the rewrite without time for human review (like the bun.js rewrite). In your opinion, do you think the strictness of the rustc compiler outweighs the often times decades of edge cases in large distributed programs, and do you think it outweighs it enough to justify total rewrites?

Adam - The Developer

Man, that's a thoughtful take. The Rust angle is interesting.

I think the compiler catches memory bugs, not production failure modes. Rust won't save you from idempotency, clock skew, or a database going away for 30 seconds. Those are design problems, not language problems.

Rewrites are tempting because greenfield feels clean. But like you said — you're often throwing out years of edge case fixes that were paid for in pager duty time.

The Bun example is spot on. AI-generated rewrites with no human review? That's just moving the complexity somewhere else, not removing it.

So no — rustc's strictness doesn't justify most rewrites of large distributed systems. Would much rather see teams invest in observability and proper failure handling in whatever language they already have.

Appreciate you adding that layer to the discussion 🙏

Tyler N

Thanks for replying to my comment! When code is being blindly regenerated with AI without human review, the observability and error handling that existed before can be removed during the rewrite.

I think that before a group or team goes ahead and rewrites everything in Rust, they usually first try to incorporate Rust in less dramatic ways than a full rewrite. The Linux Kernel is definitely in this stage now, and I hope that if they start to encourage people to rewrite things in Rust, they at least ensure that no edge cases are being removed, or simply forgotten. As a Linux user myself, this is very important for me.
