Alok Ranjan Daftuar

Posted on May 29 • Originally published at aloknecessary.github.io

Why Lift-and-Shift Fails Quietly: Architectural Smells That Appear After Migration

#cloud #architecture #devops #migration

Every cloud migration starts with a promise: "We'll get onto cloud first, optimize later." That sentence is where the trouble begins.

Lift-and-shift leaves on-premises assumptions baked into a system operating in a fundamentally different environment. The failure doesn't arrive on day one. It arrives three months later, in a Slack alert at 2am, or in an invoice that makes a VP ask uncomfortable questions.

1. Latency Amplification

On a physical LAN, a service call is sub-millisecond. In a cloud VPC, even same-AZ calls incur 1-3ms. A service making 40 synchronous downstream calls goes from ~20ms of network overhead to ~160ms, without any code change.

Same call graph. Same code. 8x more latency — purely from network topology.

Fix: consolidate reads with batch APIs, introduce async messaging for non-critical paths, add caching for hot reference data.
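The arithmetic behind the amplification, and the effect of batching, can be sketched in a few lines. This is a rough model under stated assumptions: the per-call latencies (0.5ms on LAN, 2ms in a VPC) are illustrative values picked from the ranges above, the function names are hypothetical, and the model ignores the extra payload time a larger batch response would add.

```python
def sequential_overhead_ms(calls: int, per_call_ms: float) -> float:
    """Network overhead when every downstream call waits for the previous one."""
    return calls * per_call_ms


def batched_overhead_ms(calls: int, per_call_ms: float, batch_size: int) -> float:
    """Network overhead when calls are grouped into batch requests.

    Assumes one round trip per batch; larger payloads are not modelled.
    """
    round_trips = -(-calls // batch_size)  # ceiling division
    return round_trips * per_call_ms


if __name__ == "__main__":
    calls = 40
    lan_ms = 0.5    # assumed sub-millisecond LAN call
    cloud_ms = 2.0  # assumed same-AZ call, midpoint of 1-3ms

    print("LAN, sequential:  ", sequential_overhead_ms(calls, lan_ms), "ms")
    print("Cloud, sequential:", sequential_overhead_ms(calls, cloud_ms), "ms")
    print("Cloud, batches of 10:", batched_overhead_ms(calls, cloud_ms, 10), "ms")
```

The point of the model is that overhead scales with the number of round trips, not with the amount of data requested, which is why batching pays off more than tuning individual calls.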

2. Chatty Services

The N+1 problem at infrastructure scale. A service making 60 per-entity HTTP calls to render a dashboard is annoying on LAN. In cloud, it's a 300-600ms tax on every page load.

Chatty patterns also exhaust connection pools faster — each call traverses the network and holds an open connection during transit.

Fix: batch endpoints on all internal APIs, DataLoader pattern, connection pool profiling under realistic concurrency.
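A minimal sketch of the DataLoader idea, assuming an asyncio-based service: keys requested during the same event-loop tick are collected and sent to a batch endpoint in one round trip. `BatchLoader`, `fetch_many`, and the widget payload are hypothetical names for illustration, and the batch endpoint is simulated in memory rather than called over HTTP.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class BatchLoader:
    """Coalesces load() calls made in the same event-loop tick into one batch."""

    def __init__(self, batch_fn: Callable[[List[int]], Awaitable[Dict[int, dict]]]):
        self._batch_fn = batch_fn
        self._pending: Dict[int, asyncio.Future] = {}
        self._task = None  # keeps a reference to the in-flight dispatch task

    def load(self, key: int) -> asyncio.Future:
        loop = asyncio.get_running_loop()
        if not self._pending:
            # First key of this tick: schedule one dispatch. It only runs
            # after the caller yields, so later load() calls join the batch.
            self._task = loop.create_task(self._dispatch())
        if key not in self._pending:
            self._pending[key] = loop.create_future()
        return self._pending[key]

    async def _dispatch(self) -> None:
        pending, self._pending = self._pending, {}
        try:
            results = await self._batch_fn(list(pending))
            for key, fut in pending.items():
                fut.set_result(results[key])
        except Exception as exc:
            # Propagate the failure to every caller still waiting.
            for fut in pending.values():
                if not fut.done():
                    fut.set_exception(exc)


async def demo() -> tuple:
    round_trips = []  # one entry per simulated network call

    async def fetch_many(ids: List[int]) -> Dict[int, dict]:
        # Stand-in for a hypothetical batch endpoint such as GET /widgets?ids=...
        round_trips.append(list(ids))
        await asyncio.sleep(0)
        return {i: {"id": i, "name": f"widget-{i}"} for i in ids}

    loader = BatchLoader(fetch_many)
    widgets = await asyncio.gather(*(loader.load(i) for i in range(60)))
    return len(widgets), len(round_trips)


if __name__ == "__main__":
    loaded, trips = asyncio.run(demo())
    print(f"{loaded} widgets loaded in {trips} round trip(s)")
```

Call sites keep asking for one entity at a time, so the rendering code does not need to change; only the loader decides how many network calls are actually made.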

3. Cost Surprises

The PoC cost $340. The first production month is $8,200. Nobody changed the architecture.

Data egress: free on-prem, metered in cloud. Cross-AZ, cross-region, and internet egress all bill.
Over-provisioning: on-prem sizing instincts (buy for 3-5 years) don't translate. Cloud bills for provisioned capacity every hour, whether it is busy or idle.
Idle infrastructure: dev/staging environments left running 24/7.
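Two of these line items are easy to estimate before the invoice arrives. The sketch below is back-of-envelope arithmetic only: the hourly rate, per-GB rate, and usage figures in the example are made-up placeholders, not real provider prices, and the function names are hypothetical.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def idle_fraction(hours_needed_per_week: float) -> float:
    """Share of an always-on environment's bill spent while nobody uses it."""
    return 1 - hours_needed_per_week / HOURS_PER_WEEK


def weekly_compute_cost(hourly_rate: float, hours: float) -> float:
    """Cost of keeping an environment up for the given number of hours."""
    return hourly_rate * hours


def monthly_egress_cost(gb_transferred: float, rate_per_gb: float) -> float:
    """Metered data transfer cost; the rate depends on provider and path."""
    return gb_transferred * rate_per_gb


if __name__ == "__main__":
    # Assumption: a staging environment is needed 10 hours a day, 5 days a week.
    needed = 10 * 5
    rate = 1.20  # placeholder hourly rate for the whole environment

    print(f"Idle share of an always-on bill: {idle_fraction(needed):.0%}")
    print(f"Always-on: {weekly_compute_cost(rate, HOURS_PER_WEEK):.2f} per week")
    print(f"Scheduled: {weekly_compute_cost(rate, needed):.2f} per week")
    print(f"Egress, 5000 GB at a placeholder 0.02/GB: {monthly_egress_cost(5000, 0.02):.2f}")
```

Plugging in real rates from the provider's pricing pages turns this into a quick sanity check for each environment before it is left running.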

4. Stateful Assumptions

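In-memory session state works with a single server. The moment you auto-scale to several instances without sticky sessions, most requests land on an instance that has never seen the session: roughly half with two instances, two thirds with three. Filesystem dependencies break when containers reschedule or pods restart.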

Fix: externalize session to Redis. Replace local filesystem writes with object storage at the upload boundary.
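The effect of externalizing session state can be shown with a small simulation. This is a sketch under simplifying assumptions: requests are routed round-robin with no sticky sessions, and a shared Python dict stands in for Redis so the example runs without any external service. `Instance` and `miss_rate` are hypothetical names.

```python
from itertools import cycle
from typing import Optional


class Instance:
    """One application instance handling requests for logged-in users."""

    def __init__(self, shared_store: Optional[dict] = None):
        # A private dict models in-memory session state. Passing a shared
        # dict models an external store (Redis in a real deployment).
        self.sessions = shared_store if shared_store is not None else {}

    def login(self, session_id: str, user: str) -> None:
        self.sessions[session_id] = {"user": user}

    def has_session(self, session_id: str) -> bool:
        return session_id in self.sessions


def miss_rate(instance_count: int, requests: int, shared: bool) -> float:
    """Fraction of requests that reach an instance without the session."""
    store = {} if shared else None
    instances = [Instance(store) for _ in range(instance_count)]
    instances[0].login("s1", "alice")  # the login lands on one instance

    round_robin = cycle(instances)
    misses = sum(
        0 if next(round_robin).has_session("s1") else 1 for _ in range(requests)
    )
    return misses / requests


if __name__ == "__main__":
    print("in-memory, 1 instance: ", miss_rate(1, 300, shared=False))
    print("in-memory, 3 instances:", miss_rate(3, 300, shared=False))
    print("shared store, 3 instances:", miss_rate(3, 300, shared=True))
```

With in-memory state the miss rate grows with every instance the autoscaler adds; with a shared store it stays at zero regardless of instance count.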

5. The Observability Void

On-prem monitoring (Nagios, Zabbix) watches hardware metrics that mean nothing in cloud. What you need to observe is different: cold start times, managed service throttling, connection pool utilization, cost-per-request.

The danger window is immediately after migration when legacy monitoring reports "all green" while user-facing metrics degrade invisibly.
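Some of the signals listed above can be derived from data most services already have. The following is a minimal sketch, assuming you can export request latencies, connection pool counters, and a monthly bill; the function names and sample numbers are hypothetical, and a real setup would compute these in the monitoring stack rather than in application code.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pool_utilization(in_use: int, pool_size: int) -> float:
    """Fraction of pooled connections currently checked out."""
    return in_use / pool_size


def cost_per_request(monthly_cost: float, monthly_requests: int) -> float:
    """Blended infrastructure cost attributed to each request."""
    return monthly_cost / monthly_requests


if __name__ == "__main__":
    latencies_ms = [12, 15, 14, 13, 220, 16, 15, 14, 13, 12]  # made-up samples
    print("p95 latency (ms):", percentile(latencies_ms, 95))
    print("median latency (ms):", percentile(latencies_ms, 50))
    print("pool utilization:", pool_utilization(in_use=45, pool_size=50))
    print("cost per request:", cost_per_request(8200, 1_000_000))
```

A host-level dashboard would show none of these: the median looks healthy while the tail latency and pool utilization show where users are actually waiting.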

6. The Monolith in Microservice Clothing

Containerized and deployed to Kubernetes with separate deployments per service. On the surface: microservices. Underneath: shared database schemas, synchronous HTTP chains, coordinated deployments. A distributed monolith you think is clean is a production incident waiting to happen.
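One quick diagnostic for the shared-schema smell is to list which services write to which tables and flag every table with more than one writer. The sketch below assumes you can produce that mapping by hand or from query logs; the service and table names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def shared_tables(writes_by_service: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return tables written by more than one service, with their writers."""
    writers = defaultdict(set)
    for service, tables in writes_by_service.items():
        for table in tables:
            writers[table].add(service)
    return {
        table: sorted(services)
        for table, services in writers.items()
        if len(services) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory of table writes per service.
    inventory = {
        "orders-service": ["orders", "order_items", "customers"],
        "billing-service": ["invoices", "customers"],
        "shipping-service": ["shipments", "orders"],
    }
    for table, services in sorted(shared_tables(inventory).items()):
        print(f"{table}: written by {', '.join(services)}")
```

Every table in the output is a place where two deployments cannot change independently, which makes it a natural first entry in the architectural debt list.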

A Realistic Migration Philosophy

Lift-and-shift is not a failure state. It's a phase. The mistake is treating it as a destination. Every migrated workload should have a documented list of known architectural debts, an owner for each, and a timeline to address them — agreed before the migration.

Moving to cloud does not modernize your architecture. It gives you a new environment in which your existing architectural decisions — good and bad — will be amplified.

Read the Full Article

This is a summary of my deep dive into post-migration architectural smells. The full article covers all six patterns with diagnostics, mitigations, and a pre-migration review checklist:

👉 Why Lift-and-Shift Fails Quietly — Full Article

The full article includes:

Latency amplification with SVG architecture diagram (on-prem vs cloud)
Chatty services with before/after code examples and connection pool diagnostics
Cost surprise breakdown with egress pricing tables
Stateful assumptions with session externalization code (Node.js/Redis)
Observability void with Prometheus recording rules for post-migration signals
Distributed monolith diagnostic patterns
Complete pre-migration architecture review checklist

Top comments (3)

Adam Lewis • May 30

The "monolith in microservice clothing" is the one that does most of the damage in my experience. Separate Kubernetes deployments fool the org chart, but a shared database schema and synchronous HTTP chains mean every release still needs a coordination meeting. The seam was never cut. The pods just made the wiring harder to see.

The pre-migration debt list with an owner and a timeline is the right call and almost nobody does it. Once the workload runs in cloud the pressure to address it drops because the original promise ("we'll optimise later") has been technically kept. A dated owner per debt entry is what gives you something to point at in six months when nothing has moved.

Alok Ranjan Daftuar • Jun 1

Spot on, and you've named the mechanism more precisely than I did in the post. The org chart gets restructured, separate repos get created, Kubernetes deployments get separate namespaces, and yet the seam was never actually cut. The shared schema is the tell. If two "services" can't be deployed independently without a migration script or a coordination call, they're not services. They're a monolith with a larger blast radius and a much harder debugging story.

The Kubernetes layer makes it worse, not better, for exactly the reason you describe: the coupling is now invisible. On-prem, the wiring was at least physically traceable. In a distributed deployment, you need a service map from your APM just to reconstruct the call graph that any senior engineer would have drawn on a whiteboard six months ago.

On the debt registry point: completely agree, and the psychology of it is the harder problem. The pressure to address pre-migration debt doesn't just drop after cutover; it inverts. The original promise was "we'll optimise after we're on cloud," and technically, you are on cloud. Promise kept. The dated owner per item is precisely what breaks that logic: it converts a vague intention into an accountable commitment. Without it, the backlog just grows quietly until someone's on-call at 2am tracing a cascade through a dependency chain that everyone knew was fragile but nobody owned.

The architecture can absolutely stand tall through a lift, but only if the team is honest about what "lifted" actually means. Running in Kubernetes is not the same as being cloud-native. That distinction is the whole point of the post.

Adam Lewis • Jun 9 • Edited

Yeah, the shared schema is usually the giveaway. If two services both read and write the same tables they aren't really separate, whatever the deployment diagram says, because neither one can change the shape without checking with the other first. What helped us was making one service the owner of each schema and having the others go through it instead of touching the tables directly. It doesn't decouple everything on its own, but it at least makes the coupling visible, and you stop getting the surprise where a migration on one side quietly breaks the other. prickles.org/tenet/schema-sovereig...