Executive Summary
Project Overview
Successfully migrated a legacy monolithic PHP e-commerce application from a single dedicated server to a modern AWS cloud architecture utilizing Amazon ECS (Fargate), RDS MySQL, and supporting services. The project was completed within the 6-week timeline and under the $2,000/month infrastructure budget, delivering immediate business value and establishing a foundation for future growth.
Business Challenge
The client, a mid-sized e-commerce company serving 50,000 daily active users, faced critical limitations with their legacy infrastructure:
- Scalability crisis: Black Friday 2023 caused 4 hours of downtime due to the inability to scale horizontally
- Deployment risk: Manual deployments with no rollback strategy created operational anxiety
- Cost inefficiency: $800/month dedicated server running at 15% average utilization but unable to handle traffic spikes
- Security exposure: Running end-of-life PHP 7.2 with no security patches, creating compliance risks
- Availability concerns: Single server architecture with no redundancy or disaster recovery capability
The Challenge
I was approached by a mid-sized e-commerce company in late 2024 that was running a legacy PHP application on a single dedicated server—one of those "grew organically over 8 years" situations. Their entire stack lived on one beefy machine: Apache, PHP 7.2, MySQL, Redis for sessions, and about 200GB of product images scattered across the local filesystem. The application itself was a classic monolith built with a custom PHP framework (pre-Laravel days), serving around 50,000 daily active users during normal periods.
The pain points were mounting:
- Scalability nightmare: Black Friday 2023 took the site down for 4 hours because they couldn't scale horizontally. Manual vertical scaling meant scheduling downtime, which their business couldn't afford anymore.
- Deployment anxiety: Every code deployment required SSH-ing into the production server, running git pull, and hoping nothing broke. No rollback strategy existed beyond "restore from backup and pray".
- Cost inefficiency: They were paying $800/month for a dedicated server that sat at 15% CPU utilization most of the time, but would spike to 100% during promotional campaigns.
- Security concerns: Running PHP 7.2 meant no security patches, and the compliance team was breathing down their necks about PCI-DSS requirements.
- Database bottleneck: A single MySQL instance with no read replicas meant every analytics query slowed down customer-facing transactions.
The CEO gave me a budget of $2,000/month for infrastructure and a 6-week timeline to migrate without disrupting their upcoming summer sale. The constraints were tight: they couldn't afford more than 15 minutes of total downtime, and rollback capability was non-negotiable.
Initial Assessment
I spent the first week doing a deep dive into their architecture. Here's what I discovered during the analysis:
Application architecture findings:
- The codebase was about 120,000 lines of custom PHP, with heavy coupling between the presentation layer, business logic, and data access
- Session management was handled by Redis, but it was running on the same server (single point of failure)
- File uploads went directly to the local disk, creating state that made horizontal scaling impossible
- Database queries were scattered throughout the codebase with no ORM—just raw mysqli calls
Performance bottlenecks identified:
- Average page load time: 2.8 seconds (ouch)
- Database queries per page: averaging 47 queries, with some pages hitting 200+ (classic N+1 problem)
- Peak traffic: 850 requests/minute during flash sales
- Memory usage: PHP processes were averaging 128MB each, with occasional memory leaks pushing some to 512MB
Stakeholder interviews revealed:
- The development team was small (3 developers) and had minimal DevOps experience
- They deployed about 15 times per month, always during off-peak hours (2 AM deployments were the norm)
- No automated testing existed, so every deployment felt like Russian roulette
- The marketing team wanted the ability to scale up predictably for campaigns without involving engineering
Risk factors I flagged:
- The MySQL database had no recent performance baseline—I found queries taking 30+ seconds during peak hours
- They had backups, but had never actually tested a restore (spoiler: the first test restore failed)
- The PHP codebase used deprecated functions that wouldn't work on PHP 8 without modifications
- About 15% of the application logic existed as stored procedures in MySQL, creating tight coupling
After running MySQL's slow query log for 48 hours and analyzing their access patterns with New Relic, I realized this wasn't just about lifting and shifting—we needed thoughtful service decoupling even within a containerized monolith approach.
Solution Design
Given the constraints, I decided against a full microservices rewrite. Instead, I designed a "monolith-first" containerization strategy that would give them immediate benefits while creating a foundation for future decomposition.
Architecture decisions and rationale
Compute: AWS ECS with Fargate
I chose ECS over EC2-based containers for several reasons:
- The workload had highly variable traffic patterns (70% idle, 30% burst), making Fargate's pay-per-use model more cost-effective than maintaining EC2 instances
- Fargate eliminated the operational overhead of managing container hosts—critical given their small team
- Built-in integration with ALB and AWS service mesh simplified the networking layer
- For their workload (2 vCPU, 4GB RAM per task), Fargate cost ~$67/month per continuously running task versus ~$30/month for a comparable t3.medium EC2 instance, but the roughly 40% of time tasks weren't needed made Fargate about 25% cheaper overall
However, I designed the VPC and Task Definitions to be launch-type agnostic. If they grew to need 24/7 high-density workloads, switching to EC2 launch type later would be straightforward.
Database: Amazon RDS MySQL 8.0
Moving from self-managed MySQL to RDS was non-negotiable:
- Automated backups with point-in-time recovery (because their backup strategy was... optimistic)
- Multi-AZ deployment for 99.95% availability
- Read replicas to offload their heavy analytics queries
- Automated minor version patching during maintenance windows
- Performance Insights to identify query bottlenecks without third-party APM tools
I chose a db.r6g.xlarge instance (4 vCPUs, 32GB RAM) as the primary, costing about $350/month, with two db.r6g.large read replicas at $175/month each for analytics workloads.
Storage: Amazon EFS for shared files
The 200GB of product images needed to be accessible from multiple container tasks. Options I considered:
- S3 with CloudFront: Best practice but required application code changes (300+ file operation calls)
- EFS: NFS-compatible, could be mounted as a volume in ECS tasks with zero code changes
I went with EFS for the initial migration to minimize risk, with a plan to migrate to S3 in Phase 2. EFS cost about $60/month for their 200GB with the infrequent access storage class.
Networking architecture
I designed a VPC with:
- Two availability zones for redundancy
- Public subnets (for ALB and NAT Gateway)
- Private subnets (for ECS tasks and RDS)
- Three subnet tiers carved out of the 10.0.0.0/16 VPC: public, private application, and isolated data (the specific /24 allocations are listed under Phase 1 below)
Security layering:
- Application Load Balancer in public subnets terminating SSL
- ECS tasks in private subnets with no direct internet access
- RDS in isolated data subnets with security groups allowing only ECS task traffic
- Secrets Manager for database credentials (no more hardcoded passwords)
- IAM task roles following the least privilege principle
Cost vs. performance trade-offs:
The original $800/month dedicated server would be replaced with:
- ECS Fargate: ~$480/month (assuming 20% average utilization, 5 tasks during peak)
- RDS with read replicas: ~$700/month
- ALB: ~$25/month
- EFS: ~$60/month
- NAT Gateway: ~$45/month
- CloudWatch and other services: ~$90/month
Total: ~$1,400/month baseline, with room to scale to $2,000 during peak periods. More expensive than the single server, yes, but with 99.95% uptime, auto-scaling, zero maintenance windows, and the ability to handle 10x traffic spikes.
Implementation Journey
Phase 1: Foundation (Week 1)
The first step was creating a rock-solid network foundation. I used Terraform for all infrastructure provisioning because repeatability and disaster recovery were critical.
VPC and networking setup:
I created a VPC with CIDR 10.0.0.0/16, spanning us-east-1a and us-east-1b. The subnet design followed AWS best practices:
# Public subnets for ALB and NAT
public_subnet_a: 10.0.1.0/24
public_subnet_b: 10.0.2.0/24
# Private subnets for ECS tasks
private_app_subnet_a: 10.0.11.0/24
private_app_subnet_b: 10.0.12.0/24
# Isolated subnets for RDS
private_data_subnet_a: 10.0.21.0/24
private_data_subnet_b: 10.0.22.0/24
I deployed NAT Gateways in both AZs for redundancy, though this doubled the cost. In hindsight, a single NAT Gateway would have been fine for their traffic patterns—a lesson learned that cost an extra $45/month.
IAM roles and security baseline:
I created three primary IAM roles:
- ECS Task Execution Role: Allowed ECS to pull images from ECR, fetch secrets from Secrets Manager, and write logs to CloudWatch
- ECS Task Role: Granted the application permission to access S3 (for future migration), write to CloudWatch Logs, and nothing else
- RDS Enhanced Monitoring Role: Enabled Performance Insights
The security group architecture was restrictive (sketched in Terraform after this list):
- ALB security group: Allow inbound 443 from 0.0.0.0/0, outbound to ECS security group on port 80
- ECS security group: Allow inbound from ALB only, outbound to RDS on 3306 and internet via NAT
- RDS security group: Allow inbound from ECS security group only on port 3306
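A minimal Terraform sketch of that layering (group names are illustrative; the ALB's egress rule to the task security group lives in a separate aws_security_group_rule in the real module, since referencing both groups inline would create a dependency cycle):
resource "aws_security_group" "alb" {
  name   = "ecommerce-alb"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  # Egress to the ECS task security group on port 80 is defined separately.
}

resource "aws_security_group" "ecs_tasks" {
  name   = "ecommerce-ecs-tasks"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "HTTP from the ALB only"
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    description = "Outbound to RDS and the internet via NAT"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "rds" {
  name   = "ecommerce-rds"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "MySQL from ECS tasks only"
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
  }
}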
Initial challenge I faced:
During initial testing, ECS tasks couldn't pull images from ECR. After 30 minutes of head-scratching, I realized the VPC endpoints for ECR weren't configured, forcing traffic through the NAT Gateway. Once I added VPC endpoints for ecr.api, ecr.dkr, and s3 (ECR uses S3 behind the scenes), image pulls became both faster and cheaper. This saved about $15/month in NAT Gateway data processing charges.
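The fix, sketched in Terraform (aws_route_table.private and aws_security_group.vpc_endpoints are assumed supporting resources not shown in this post; the endpoint security group needs to allow HTTPS from the app subnets):
# Gateway endpoint for S3; ECR stores image layers in S3
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

# Interface endpoints for the ECR control plane and Docker registry APIs
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private_app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private_app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}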
Phase 2: Core Services (Week 2-3)
Database migration approach:
Zero-downtime migration was the make-or-break requirement. I used this approach:
- Baseline export: Took a snapshot of the production MySQL database during low-traffic hours (Sunday 3 AM)
- RDS provisioning: Restored snapshot to a new RDS instance, upgraded from MySQL 5.7 to 8.0
- Replication setup: Configured binary log replication from on-prem MySQL to RDS using MySQL native replication
- Validation period: Ran replication for 5 days, monitoring lag (stayed under 2 seconds)
- Cutover: During a brief read-only maintenance window, I:
- Put application in read-only mode
- Verified replication lag was zero
- Updated database connection string to RDS endpoint
- Enabled writes on new database
- Monitored for 15 minutes before declaring success
The entire cutover took 12 minutes of read-only mode, well within the 15-minute downtime budget.
Pro tip: I set up RDS Performance Insights immediately and discovered three queries consuming 60% of database time. A couple of missing indexes later, average query time dropped from 340ms to 45ms.
Application containerization strategy:
Creating the Docker image was straightforward but had nuances:
FROM php:8.1-apache
# Install PHP extensions the app needed
RUN docker-php-ext-install mysqli pdo pdo_mysql opcache
# Copy the Apache virtual host configuration
COPY apache-config.conf /etc/apache2/sites-available/000-default.conf
# Copy application code
COPY src/ /var/www/html/
# Set proper permissions
RUN chown -R www-data:www-data /var/www/html
# Enable Apache modules
RUN a2enmod rewrite
EXPOSE 80
The gotcha: Their application used session.save_path pointing to /tmp, which wasn't persistent across container restarts. I updated the PHP configuration to use their existing Redis instance (which I'd already migrated to ElastiCache).
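For reference, a minimal Terraform sketch of an ElastiCache setup along those lines (a single cache.t4g.micro node, matching Challenge 1 below; the cache security group and the Redis/parameter-group versions are assumptions):
resource "aws_elasticache_subnet_group" "sessions" {
  name       = "ecommerce-sessions"
  subnet_ids = aws_subnet.private_app[*].id
}

resource "aws_elasticache_cluster" "sessions" {
  cluster_id           = "ecommerce-sessions"
  engine               = "redis"
  node_type            = "cache.t4g.micro"
  num_cache_nodes      = 1
  port                 = 6379
  parameter_group_name = "default.redis7"
  subnet_group_name    = aws_elasticache_subnet_group.sessions.name
  security_group_ids   = [aws_security_group.cache.id]
}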
ECS cluster and task definition:
I created an ECS cluster with Container Insights enabled for monitoring:
aws ecs create-cluster \
--cluster-name production-ecommerce \
--settings name=containerInsights,value=enabled
The task definition specified:
- CPU: 2048 units (2 vCPU)
- Memory: 4096 MB (4GB)
- Network mode: awsvpc (required for Fargate)
- Log driver: awslogs, streaming to CloudWatch Logs
- EFS mount: Product images directory mounted at /var/www/html/uploads
- Environment variables: Pulled from Secrets Manager for database credentials (a Terraform sketch of the full task definition follows this list)
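A condensed Terraform sketch of that task definition (the image URI reuses the placeholder account ID from the spec section; the IAM roles, EFS file system, Secrets Manager secret, and log group name are assumed resources):
resource "aws_ecs_task_definition" "app" {
  family                   = "ecommerce-web"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048"   # later right-sized to 1024 (see Phase 4)
  memory                   = "4096"   # later right-sized to 2048
  execution_role_arn       = aws_iam_role.task_execution.arn
  task_role_arn            = aws_iam_role.task.arn

  volume {
    name = "uploads"
    efs_volume_configuration {
      file_system_id = aws_efs_file_system.uploads.id
    }
  }

  container_definitions = jsonencode([{
    name         = "php-app"
    image        = "637-account-id.dkr.ecr.us-east-1.amazonaws.com/ecommerce-app:latest"
    portMappings = [{ containerPort = 80, protocol = "tcp" }]
    mountPoints  = [{ sourceVolume = "uploads", containerPath = "/var/www/html/uploads" }]
    secrets = [
      { name = "DB_PASSWORD", valueFrom = aws_secretsmanager_secret.db_password.arn }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/ecommerce-web"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "web"
      }
    }
  }])
}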
Integration points:
The Application Load Balancer was configured with:
- HTTPS listener (port 443) with their SSL certificate from ACM
- Target group pointing to the ECS service, health check on /health.php
- Redirect from HTTP (port 80) to HTTPS
- Deregistration delay of 30 seconds (important for graceful shutdowns)
One issue I hit: The default health check interval was 30 seconds, which caused false positives during deployments. I tuned it to 10-second intervals with a 3-second timeout and 2 consecutive healthy checks required.
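In Terraform, the tuned target group looked roughly like this (the resource name and unhealthy threshold are assumptions; the other values mirror the tuning described above):
resource "aws_lb_target_group" "app" {
  name        = "ecommerce-web"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"  # Fargate tasks in awsvpc mode register by IP

  deregistration_delay = 30  # later raised to 60 for graceful draining (see Challenge 5)

  health_check {
    path                = "/health.php"
    interval            = 10   # seconds between checks
    timeout             = 3
    healthy_threshold   = 2
    unhealthy_threshold = 3    # assumption, not stated above
    matcher             = "200"
  }
}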
Phase 3: Advanced Features (Week 4)
Monitoring and logging setup:
With Container Insights enabled, I immediately got visibility into cluster, service, and task-level metrics. I created CloudWatch dashboards showing:
- ECS CPU and memory utilization per task
- ALB request count, latency (p50, p95, p99), and HTTP status codes
- RDS connections, CPU, IOPS, and query performance
I set up CloudWatch Alarms for:
- ECS CPU > 70% for 5 minutes (triggers auto-scaling)
- ALB target response time > 1 second (alerts development team)
- RDS CPU > 80% (pages on-call engineer)
- ECS task count < 2 (ensures at least two tasks always running)
The application logs were already going to CloudWatch Logs, so I created metric filters to count PHP errors and alert on spikes.
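A hedged sketch of one such metric filter and its alarm in Terraform (the filter pattern, threshold, log group name, and SNS topic are illustrative and depend on the app's actual log format):
resource "aws_cloudwatch_log_metric_filter" "php_errors" {
  name           = "php-error-count"
  log_group_name = "/ecs/ecommerce-web"   # assumed log group name
  pattern        = "?\"PHP Fatal error\" ?\"PHP Warning\""

  metric_transformation {
    name      = "PhpErrorCount"
    namespace = "Ecommerce/Application"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "php_error_spike" {
  alarm_name          = "php-error-spike"
  namespace           = "Ecommerce/Application"
  metric_name         = "PhpErrorCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 50                     # illustrative threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.alerts.arn]  # assumed SNS topic
}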
Auto-scaling configuration:
I implemented target tracking scaling policies for the ECS service:
- CPU-based scaling: Target 60% average CPU utilization
- Scale out when average CPU > 60% for 3 minutes
- Scale in when average CPU < 60% for 10 minutes (longer cooldown to prevent flapping)
- ALB request count scaling: Target 1,000 requests per task per minute
- Ensured no single task got overwhelmed during traffic spikes
The scaling policy configuration:
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/production-ecommerce/web-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 10
aws application-autoscaling put-scaling-policy \
--policy-name cpu-target-tracking \
--service-namespace ecs \
--resource-id service/production-ecommerce/web-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://scaling-policy.json
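For anyone managing the same thing in Terraform rather than the CLI, a roughly equivalent target-tracking setup looks like this (cooldowns reflect the 3-minute scale-out and 10-minute scale-in behavior described above):
resource "aws_appautoscaling_target" "web" {
  service_namespace  = "ecs"
  resource_id        = "service/production-ecommerce/web-service"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu_target_tracking" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.web.service_namespace
  resource_id        = aws_appautoscaling_target.web.resource_id
  scalable_dimension = aws_appautoscaling_target.web.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0   # target 60% average CPU
    scale_out_cooldown = 180    # respond to bursts within ~3 minutes
    scale_in_cooldown  = 600    # scale in slowly to avoid flapping

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}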
During the first simulated traffic spike (using Apache Bench to generate 10x normal load), the service scaled from 2 tasks to 7 tasks within 4 minutes. Beautiful.
Disaster recovery implementation:
RDS automated backups ran daily with 7-day retention. I also configured:
- Manual snapshots before major deployments
- Read replica promotion procedure documented (RTO: 5 minutes; RPO: near-zero, since asynchronous replica lag consistently stayed under a few seconds)
- Cross-region snapshot copies to us-west-2 for true disaster recovery
For the application layer, the ECS task definition was version-controlled in Git. Rolling back a bad deployment was as simple as updating the service to use the previous task definition revision—typically completed in under 3 minutes.
Phase 4: Optimization (Week 5-6)
Performance tuning steps I took:
After running in production for a week, I analyzed the CloudWatch metrics and made several optimizations:
- Rightsized task resources: Initial 2vCPU/4GB was overkill. Average CPU was 25%, memory at 1.2GB. I reduced to 1vCPU/2GB, cutting Fargate costs by 50%.
- Enabled OPcache aggressively: Modified PHP configuration to cache compiled code for 1 hour. This alone reduced CPU usage by another 20%.
- Tuned Apache MaxRequestWorkers: Set to 50 (from default 150) based on actual concurrent connection patterns, reducing memory footprint.
- Implemented database connection pooling: Modified the application to reuse database connections across requests instead of creating new connections, reducing RDS connection count from 200 to 40.
Cost optimization measures:
Beyond right-sizing, I implemented:
- RDS Reserved Instances: Committed to a 1-year reserved instance for the primary database, saving 35% (~$120/month)
- CloudWatch Logs retention: Set to 30 days instead of indefinite, reducing storage costs
- EFS Intelligent-Tiering: Moved to lifecycle policies that transitioned files not accessed in 30 days to the Infrequent Access storage class, cutting EFS costs by 40% (sketched after this list)
- Removed redundant NAT Gateway: Consolidated to a single NAT Gateway after proving traffic patterns didn't justify the redundancy cost
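The EFS lifecycle change, sketched in Terraform (the resource name is assumed; the provisioned-throughput episode from Challenge 3 below is noted in the comments):
resource "aws_efs_file_system" "uploads" {
  encrypted = true

  # Files untouched for 30 days move to the cheaper Infrequent Access class
  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"
  }

  # Bursting mode was enough once static assets moved to S3 + CloudFront;
  # during the Challenge 3 bottleneck this was temporarily set to
  # throughput_mode = "provisioned" with provisioned_throughput_in_mibps = 100.
  throughput_mode = "bursting"

  tags = {
    Name = "ecommerce-uploads"
  }
}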
Final optimized monthly cost: $1,250, well under the $2,000 budget.
Security hardening:
Post-launch security audit revealed a few items to tighten:
- Enabled VPC Flow Logs: Started logging all network traffic for security auditing
- Implemented AWS WAF: Added basic rate limiting and SQL injection protection rules on the ALB (a Terraform sketch follows this list)
- Restricted IAM policies: Removed broad S3 permissions from task role that weren't being used
- Enabled RDS encryption at rest: Took a snapshot, created new encrypted instance, migrated with zero downtime using the same replication strategy
- Configured AWS Config rules: Automated compliance checks for security group rules and public subnet configurations
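A trimmed Terraform sketch of the WAF piece (rule priorities, the rate limit value, and the ALB resource name aws_lb.app are assumptions):
resource "aws_wafv2_web_acl" "app" {
  name  = "ecommerce-web-acl"
  scope = "REGIONAL"   # REGIONAL is required for ALB associations

  default_action {
    allow {}
  }

  # Basic per-IP rate limiting
  rule {
    name     = "rate-limit"
    priority = 1

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000   # requests per 5-minute window per IP (assumed value)
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit"
      sampled_requests_enabled   = true
    }
  }

  # AWS managed SQL injection rule group
  rule {
    name     = "sqli"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesSQLiRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "sqli"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "ecommerce-web-acl"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "alb" {
  resource_arn = aws_lb.app.arn
  web_acl_arn  = aws_wafv2_web_acl.app.arn
}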
Technical Specifications
Compute Configuration
ECS Cluster:
- Name: production-ecommerce
- Launch type: Fargate
- Platform version: 1.4.0 (latest)
- Container Insights: Enabled
- Capacity providers: FARGATE and FARGATE_SPOT (80/20 split for cost optimization)
ECS Task Definition:
- Task CPU: 1024 units (1 vCPU)
- Task Memory: 2048 MB (2GB)
- Network mode: awsvpc
- Container image: 637-account-id.dkr.ecr.us-east-1.amazonaws.com/ecommerce-app:latest
- Health check: curl -f http://localhost/health.php || exit 1
- EFS volume mount: /var/www/html/uploads
ECS Service:
- Desired count: 2 (minimum), 10 (maximum)
- Deployment type: Rolling update
- Deployment circuit breaker: Enabled (automatic rollback on failure)
- Load balancer: Application Load Balancer target group
- Service auto-scaling: Enabled with target tracking policies
Database Configuration
RDS Primary Instance:
- Engine: MySQL 8.0.35
- Instance class: db.r6g.xlarge (4 vCPU, 32GB RAM)
- Storage: 500GB gp3 (16,000 IOPS, 1000 MB/s throughput)
- Multi-AZ: Enabled
- Backup retention: 7 days, automated snapshots at 3 AM UTC
- Encryption: Enabled (KMS)
- Parameter group: Custom with optimized InnoDB settings
RDS Read Replicas (2):
- Instance class: db.r6g.large (2 vCPU, 16GB RAM)
- Purpose: Analytics and reporting queries
- Replication lag monitoring: CloudWatch alarm if lag > 5 seconds
Network Architecture
VPC Design:
- CIDR: 10.0.0.0/16
- DNS hostnames: Enabled
- Availability zones: us-east-1a, us-east-1b
Subnets:
- 2 public subnets (10.0.1.0/24, 10.0.2.0/24)
- 2 private application subnets (10.0.11.0/24, 10.0.12.0/24)
- 2 isolated database subnets (10.0.21.0/24, 10.0.22.0/24)
Routing:
- Public subnets: Route to Internet Gateway
- Private subnets: Route to NAT Gateway (single, in us-east-1a)
- Database subnets: No internet route
Terraform Snippet for VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "ecommerce-vpc"
Environment = "production"
}
}
resource "aws_subnet" "private_app" {
count = 2
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${10 + count.index}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "private-app-${count.index + 1}"
Tier = "application"
}
}
resource "aws_ecs_cluster" "main" {
name = "production-ecommerce"
setting {
name = "containerInsights"
value = "enabled"
}
}
resource "aws_ecs_service" "web" {
name = "web-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = aws_subnet.private_app[*].id
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = "php-app"
container_port = 80
}
deployment_circuit_breaker {
enable = true
rollback = true
}
}
Challenges and Solutions
Challenge 1: Session Management Across Multiple Containers
Problem: The application stored PHP sessions in local /tmp directory. When ECS spun up multiple tasks, users would randomly lose their sessions when requests hit different containers. Shopping carts were disappearing, and the CEO was getting angry customer emails.
Troubleshooting process:
- Initially thought it was a load balancer sticky session issue, spent 2 hours configuring session affinity
- Realized through CloudWatch Logs that session IDs were valid but session data was missing
- SSH'd into a running container (via ECS Exec feature) and discovered sessions were stored locally
Solution implemented:
- Migrated session storage to ElastiCache Redis (t4g.micro, $15/month)
- Updated PHP configuration: session.save_handler = redis and session.save_path = "tcp://cache-endpoint:6379"
- Tested by deliberately killing containers mid-session; sessions persisted perfectly
Lesson learned: Always externalize state. What seems like a quick fix (local storage) becomes a blocker for horizontal scaling. Now I audit state management in every migration discovery phase.
Challenge 2: Database Connection Exhaustion
Problem: Two weeks post-launch, during a flash sale, the application started throwing "Too many connections" errors. RDS was limited to 400 concurrent connections, and we were hitting that limit despite only 5 ECS tasks running.
How I troubleshot it:
- Used RDS Performance Insights to see connection count spiking to 380+ during peak traffic
- Ran SHOW PROCESSLIST on the database and found hundreds of sleeping connections
- Reviewed application code: discovered database connections weren't being closed properly, and PHP was creating new connections for every request
Solution implemented:
- Implemented persistent database connections in PHP (mysqli_connect with the p: host prefix)
- Added connection pooling logic in the application bootstrap
- Set MySQL wait_timeout to 300 seconds (down from 28,800) to kill idle connections faster; the parameter group change is sketched after this list
- Monitored RDS connection count over a week; it stabilized at 40-60 connections
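A minimal Terraform sketch of that parameter group change (the custom InnoDB settings mentioned in the Technical Specifications section are omitted here):
resource "aws_db_parameter_group" "mysql80" {
  name   = "ecommerce-mysql80"
  family = "mysql8.0"

  # Kill idle connections after 5 minutes instead of the 8-hour default
  parameter {
    name  = "wait_timeout"
    value = "300"
  }

  # The optimized InnoDB settings referenced in the Technical Specifications
  # section also live in this group; attach it to the primary instance via
  # the instance's parameter_group_name argument.
}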
Lesson learned: Connection management is often overlooked in monolithic PHP applications because a single server masks the issue. Containerization exposes these inefficiencies. Always benchmark connection behavior under load before going live.
Challenge 3: EFS Performance Bottleneck
Problem: After migration, page load times for product pages with images were averaging 4.5 seconds—worse than the original dedicated server. CloudWatch showed EFS I/O wait times spiking.
Troubleshooting:
- Used CloudWatch EFS metrics to see BurstCreditBalance was at zero during peak hours
- Realized EFS throughput in bursting mode was insufficient for their access patterns (200GB meant baseline throughput of only 10 MB/s)
- Profiled the application and found it was making thousands of file_exists() checks on EFS for every request
Solution implemented:
- Short-term fix: Enabled EFS Provisioned Throughput (100 MB/s), adding $300/month in cost
- Long-term solution: Migrated static assets to S3 + CloudFront over next sprint
- Modified upload handler to push to S3 instead of EFS
- Updated image URLs to use CloudFront distribution
- Reduced EFS to only cache and temporary files
- Result: Page load times dropped to 1.2 seconds, EFS costs reduced by switching back to bursting mode
Lesson learned: EFS is convenient for lift-and-shift but not optimized for web-facing static content. Always consider the right storage service for the access pattern. S3 + CloudFront is almost always better for static assets in production.
Challenge 4: Auto-Scaling Overreaction
Problem: The auto-scaling configuration initially caused instability. During traffic spikes, ECS would scale from 2 to 10 tasks in 2 minutes, then scale back down to 2 tasks 5 minutes later when load decreased. This thrashing caused customer-facing errors during task spin-up/shutdown.
How I troubleshot it:
- Reviewed CloudWatch metrics and saw rapid desired count changes every few minutes
- Realized the default scale-in cooldown was too aggressive (60 seconds)
- Also discovered the target metric (CPU percentage) was too sensitive to temporary spikes
Solution implemented:
- Increased scale-in cooldown to 600 seconds (10 minutes), giving time for sustained load patterns
- Changed scale-out cooldown to 180 seconds (3 minutes) to respond quickly to traffic
- Added a secondary scaling metric: ALB RequestCountPerTarget, providing more stable signal
- Set minimum task count to 3 (instead of 2) during business hours using scheduled scaling (sketched below)
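A hedged Terraform sketch of that scheduled floor (the exact cron hours are assumptions; it reuses the aws_appautoscaling_target shown earlier):
resource "aws_appautoscaling_scheduled_action" "business_hours_floor" {
  name               = "business-hours-min-3"
  service_namespace  = aws_appautoscaling_target.web.service_namespace
  resource_id        = aws_appautoscaling_target.web.resource_id
  scalable_dimension = aws_appautoscaling_target.web.scalable_dimension

  # Raise the floor on weekday mornings; cron is evaluated in UTC
  schedule = "cron(0 12 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 10
  }
}

resource "aws_appautoscaling_scheduled_action" "overnight_floor" {
  name               = "overnight-min-2"
  service_namespace  = aws_appautoscaling_target.web.service_namespace
  resource_id        = aws_appautoscaling_target.web.resource_id
  scalable_dimension = aws_appautoscaling_target.web.scalable_dimension

  # Drop the floor back to 2 overnight
  schedule = "cron(0 2 ? * * *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 10
  }
}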
Lesson learned: Auto-scaling is not "set it and forget it." Proper configuration requires understanding traffic patterns and testing under realistic loads. Conservative scale-in policies prevent thrashing while aggressive scale-out policies handle bursts.
Challenge 5: Deployment-Induced Downtime
Problem: Our first production deployment after go-live caused 30 seconds of 502 errors. Users on the checkout flow abandoned carts, and I had to explain the incident to stakeholders.
Troubleshooting process:
- Reviewed ALB access logs—saw 502s occurred during task replacement
- Discovered the issue: ECS was terminating old tasks before new tasks passed health checks
- Application was also not handling SIGTERM gracefully, cutting off in-flight requests
Solution implemented (key settings sketched in Terraform after this list):
- Enabled deployment circuit breaker in ECS service (automatically rolls back failed deployments)
- Configured rolling deployment with minimum healthy percent of 100%, maximum percent of 200%
- Updated application to handle SIGTERM: gracefully finish in-flight requests before shutdown (max 30 seconds)
- Increased ALB deregistration delay to 60 seconds, allowing tasks to drain connections
- Increased the container stop timeout (stopTimeout) in the task definition so the application could flush logs before termination
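The deployment-related arguments, sketched in Terraform against the service and target group resources shown elsewhere in this post (a sketch of the added arguments, not complete standalone definitions):
resource "aws_ecs_service" "web" {
  # ...cluster, task_definition, networking, and load_balancer as shown earlier...

  deployment_minimum_healthy_percent = 100  # never drop below the desired count mid-deploy
  deployment_maximum_percent         = 200  # start replacement tasks before stopping old ones

  deployment_circuit_breaker {
    enable   = true
    rollback = true  # automatically roll back a deployment whose tasks never go healthy
  }
}

resource "aws_lb_target_group" "app" {
  # ...port, protocol, and health_check as shown earlier...

  deregistration_delay = 60  # give draining tasks a minute to finish in-flight requests
}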
Lesson learned: Zero-downtime deployments require coordination between application signal handling, load balancer deregistration timing, and ECS deployment configuration. The defaults assume stateless, fast-starting applications—most PHP apps need tuning.
Results and Metrics
After 3 months of operation on the new AWS architecture, here are the quantified outcomes:
Performance improvements:
- Average page load time: Reduced from 2.8 seconds to 1.2 seconds (57% improvement)
- Time to first byte (TTFB): Improved from 890ms to 240ms (73% improvement)
- Database query response time: Average dropped from 340ms to 45ms (87% improvement)
- Peak traffic handling: System now handles 8,500 requests/minute (10x original capacity) without degradation
Cost comparison:
- Before: $800/month (dedicated server) + $150/month (CDN) + $100/month (monitoring) = $1,050/month baseline
- After: $1,250/month AWS infrastructure (all-inclusive)
- Cost increase: 19% higher baseline cost
- Value delivered: 99.95% uptime vs. 98.2% previously, eliminated $500 in emergency scaling costs per event (happened 4 times/year)
- True savings: Reduced operational overhead by 15 hours/month (no more manual scaling, patching, or backup management)
Uptime and reliability gains:
- Uptime: Improved from 98.2% to 99.94% over 90-day period
- Failed deployments: Zero production-impacting incidents (circuit breaker prevented 3 bad deployments from affecting users)
- Recovery time: RTO improved from 4 hours to 8 minutes for application issues, 30 minutes for database issues
- Zero unplanned downtime in 90 days vs. 3 incidents in previous 90 days
Time-to-market improvements:
- Deployment frequency: Increased from 15 deploys/month to 45 deploys/month (3x)
- Deployment duration: Reduced from 25 minutes (manual) to 8 minutes (automated rolling update)
- Rollback time: Improved from 35 minutes to 3 minutes
- Developer confidence: Team now deploys during business hours instead of 2 AM maintenance windows
Team productivity improvements:
- On-call incidents: Reduced from 8/month to 2/month (monitoring and auto-healing eliminated most alerts)
- Time spent on infrastructure: Reduced from 40 hours/month to 10 hours/month
- Mean time to investigate (MTTI): Reduced from 25 minutes to 7 minutes (thanks to CloudWatch Container Insights and centralized logging)
Business impact:
- Successfully handled Black Friday 2025 with zero downtime (peak traffic was 12,000 req/min, system auto-scaled to 9 tasks)
- Marketing team now schedules flash sales without engineering involvement (auto-scaling handles it)
- PCI-DSS compliance achieved (AWS shared responsibility model simplified audit requirements)
Key Takeaways
After delivering this migration, here's what I'd share with anyone doing similar work:
- Start with network and security fundamentals: VPC design, security groups, and IAM roles are unglamorous but will save you countless hours later. Get them right before touching application code.
- Externalize all state early: Sessions, file uploads, caches—anything that lives on disk must be identified and migrated to external services before containerization. This was our biggest gotcha.
- Database migration deserves 30% of project time: We spent 2 weeks on a database migration strategy for a 500GB database. Worth every hour—zero downtime and zero data loss is non-negotiable for production systems.
- Right-sizing is iterative: Start with generous resource allocations, monitor for a week, then optimize. We cut our Fargate costs by 50% after the initial week by right-sizing task definitions based on real usage patterns.
- Container Insights is worth enabling from day one: The visibility into task-level metrics, correlated with application logs, reduced our mean time to resolution by 70%. It costs about $30/month but saves hours of troubleshooting.
- Auto-scaling needs real-world testing: Synthetic load tests don't capture actual traffic patterns. We had to tune auto-scaling policies three times based on production traffic before getting it right.
- Deployment strategies matter more than you think: Graceful shutdowns, health check tuning, and deregistration delays prevented customer-facing errors during deployments. This took 3 failed attempts to get right.
- Cost optimization is ongoing: Monthly reviews of CloudWatch metrics revealed opportunities to save 30% through reserved instances, intelligent tiering, and removing over-provisioned resources.
What I'd do differently next time:
- Migrate to S3 earlier: We should have moved static assets to S3 + CloudFront during the initial migration rather than using EFS as a crutch. The EFS performance issues cost us 2 weeks of firefighting.
- Implement feature flags: We deployed the entire migration as a big-bang cutover. Feature flags would have allowed gradual traffic shifting and faster rollback if issues arose.
- Load test more aggressively: Our pre-launch load tests were at 2x expected peak traffic. Real Black Friday traffic hit 3x, causing brief issues. Test at 5x to be safe.
- Document runbooks earlier: We created operational runbooks after the first incident. Should have documented "what to do when X happens" scenarios before launch.
Tech Stack Summary
Compute:
- AWS ECS (Fargate launch type) for container orchestration
- Application Load Balancer for traffic distribution and SSL termination
- Amazon ECR for Docker image registry
Storage:
- Amazon S3 for static assets and backups
- Amazon EFS for shared file storage (temporary, migrating to S3)
- CloudFront CDN for global content delivery
Database:
- Amazon RDS MySQL 8.0 (Multi-AZ) for primary transactional database
- RDS Read Replicas (2x) for analytics and reporting queries
- Amazon ElastiCache Redis for session storage and application caching
Networking:
- Amazon VPC with public/private/isolated subnet architecture
- NAT Gateway for outbound internet access from private subnets
- VPC Flow Logs for network traffic auditing
- AWS WAF for application-layer security (rate limiting, SQL injection protection)
Security:
- AWS IAM for access control and task permissions
- AWS Secrets Manager for database credentials and API keys
- AWS Certificate Manager for SSL/TLS certificates
- AWS KMS for encryption key management
Monitoring:
- Amazon CloudWatch Container Insights for ECS metrics
- CloudWatch Logs for centralized application and infrastructure logging
- CloudWatch Alarms for proactive alerting
- RDS Performance Insights for database query analysis
IaC Tools:
- Terraform for all infrastructure provisioning (VPC, ECS, RDS, ALB)
- AWS CLI for operational tasks and troubleshooting
- GitHub Actions for CI/CD pipeline (Docker build, ECR push, ECS deployment)
- Docker for containerization
This migration transformed a fragile, monolithic application into a resilient, scalable cloud-native architecture while maintaining business continuity. The key wasn't using the fanciest AWS services—it was understanding the constraints, making pragmatic architecture decisions, and executing a phased migration strategy that minimized risk at every step. Three months in, the team is shipping faster, sleeping better, and the business is growing without infrastructure anxiety.
Conclusion
This migration represents a successful transformation from legacy infrastructure to modern cloud-native architecture. The project delivered immediate business value through improved reliability, performance, and operational efficiency while establishing a scalable foundation for future growth. The containerized monolith approach proved to be the right strategy—achieving 99.94% uptime and 10x capacity without the complexity and risk of a full microservices rewrite.
The investment in proper planning, phased execution, and post-launch optimization resulted in a system that not only meets current business needs but provides the architectural flexibility to support the company's growth trajectory over the next 3-5 years.
Project Status: ✅ Complete and in production with ongoing optimization
Business Owner Satisfaction: High - exceeded uptime and performance targets while staying under budget
ROI Timeline: 12 months (accounting for operational efficiency gains and eliminated incident costs)

