Executive Summary
Project Overview
Successfully migrated a legacy monolithic PHP e-commerce application from a single dedicated server to a modern AWS cloud architecture utilizing Amazon ECS (Fargate), RDS MySQL, and supporting services. The project was completed within the 6-week timeline and under the $2,000/month infrastructure budget, delivering immediate business value and establishing a foundation for future growth.
Business Challenge
The client, a mid-sized e-commerce company serving 50,000 daily active users, faced critical limitations with their legacy infrastructure:
- Scalability crisis: Black Friday 2023 caused 4 hours of downtime due to the inability to scale horizontally
- Deployment risk: Manual deployments with no rollback strategy created operational anxiety
- Cost inefficiency: $800/month dedicated server running at 15% average utilization but unable to handle traffic spikes
- Security exposure: Running end-of-life PHP 7.2 with no security patches, creating compliance risks
- Availability concerns: Single server architecture with no redundancy or disaster recovery capability
The Challenge
I was approached by a mid-sized e-commerce company in late 2024 that was running a legacy PHP application on a single dedicated server—one of those "grew organically over 8 years" situations. Their entire stack lived on one beefy machine: Apache, PHP 7.2, MySQL, Redis for sessions, and about 200GB of product images scattered across the local filesystem. The application itself was a classic monolith built with a custom PHP framework (pre-Laravel days), serving around 50,000 daily active users during normal periods.
The pain points were mounting:
- Scalability nightmare: Black Friday 2023 took the site down for 4 hours because they couldn't scale horizontally. Manual vertical scaling meant scheduling downtime, which their business couldn't afford anymore.
- Deployment anxiety: Every code deployment required SSH-ing into the production server, running git pull, and hoping nothing broke. No rollback strategy existed beyond "restore from backup and pray".
- Cost inefficiency: They were paying $800/month for a dedicated server that sat at 15% CPU utilization most of the time, but would spike to 100% during promotional campaigns.
- Security concerns: Running PHP 7.2 meant no security patches, and the compliance team was breathing down their necks about PCI-DSS requirements.
- Database bottleneck: A single MySQL instance with no read replicas meant every analytics query slowed down customer-facing transactions.
The CEO gave me a budget of $2,000/month for infrastructure and a 6-week timeline to migrate without disrupting their upcoming summer sale. The constraints were tight: they couldn't afford more than 15 minutes of total downtime, and rollback capability was non-negotiable.
Initial Assessment
I spent the first week doing a deep dive into their architecture. Here's what I discovered during the analysis:
Application architecture findings:
- The codebase was about 120,000 lines of custom PHP, with heavy coupling between the presentation layer, business logic, and data access
- Session management was handled by Redis, but it was running on the same server (single point of failure)
- File uploads went directly to the local disk, creating state that made horizontal scaling impossible
- Database queries were scattered throughout the codebase with no ORM—just raw mysqli calls
Performance bottlenecks identified:
- Average page load time: 2.8 seconds (ouch)
- Database queries per page: averaging 47 queries, with some pages hitting 200+ (classic N+1 problem)
- Peak traffic: 850 requests/minute during flash sales
- Memory usage: PHP processes were averaging 128MB each, with occasional memory leaks pushing some to 512MB
Stakeholder interviews revealed:
- The development team was small (3 developers) and had minimal DevOps experience
- They deployed about 15 times per month, always during off-peak hours (2 AM deployments were the norm)
- No automated testing existed, so every deployment felt like Russian roulette
- The marketing team wanted the ability to scale up predictably for campaigns without involving engineering
Risk factors I flagged:
- The MySQL database had no recent performance baseline—I found queries taking 30+ seconds during peak hours
- They had backups, but had never actually tested a restore (spoiler: the first test restore failed)
- The PHP codebase used deprecated functions that wouldn't work on PHP 8 without modifications
- About 15% of the application logic existed as stored procedures in MySQL, creating tight coupling
After running MySQL's slow query log for 48 hours and analyzing their access patterns with New Relic, I realized this wasn't just about lifting and shifting—we needed thoughtful service decoupling even within a containerized monolith approach.
Solution Design
Given the constraints, I decided against a full microservices rewrite. Instead, I designed a "monolith-first" containerization strategy that would give them immediate benefits while creating a foundation for future decomposition.
Architecture decisions and rationale
Compute: AWS ECS with Fargate
I chose ECS over EC2-based containers for several reasons:
- The workload had highly variable traffic patterns (70% idle, 30% burst), making Fargate's pay-per-use model more cost-effective than maintaining EC2 instances
- Fargate eliminated the operational overhead of managing container hosts—critical given their small team
- Built-in integration with ALB and AWS service mesh simplified the networking layer
- For their workload (2 vCPU, 4GB RAM per task), Fargate cost ~$67/month per continuously running task versus ~$30/month for a comparable t3.medium EC2 instance, but the roughly 40% of time tasks weren't needed made Fargate about 25% cheaper overall
However, I designed the VPC and Task Definitions to be launch-type agnostic. If they grew to need 24/7 high-density workloads, switching to EC2 launch type later would be straightforward.
Database: Amazon RDS MySQL 8.0
Moving from self-managed MySQL to RDS was non-negotiable:
- Automated backups with point-in-time recovery (because their backup strategy was... optimistic)
- Multi-AZ deployment for 99.95% availability
- Read replicas to offload their heavy analytics queries
- Automated minor version patching during maintenance windows
- Performance Insights to identify query bottlenecks without third-party APM tools
I chose a db.r6g.xlarge instance (4 vCPUs, 32GB RAM) as the primary, costing about $350/month, with two db.r6g.large read replicas at $175/month each for analytics workloads.
Storage: Amazon EFS for shared files
The 200GB of product images needed to be accessible from multiple container tasks. Options I considered:
- S3 with CloudFront: Best practice but required application code changes (300+ file operation calls)
- EFS: NFS-compatible, could be mounted as a volume in ECS tasks with zero code changes
I went with EFS for the initial migration to minimize risk, with a plan to migrate to S3 in Phase 2. EFS cost about $60/month for their 200GB with the infrequent access storage class.
Networking architecture
I designed a VPC with:
- Two availability zones for redundancy
- Public subnets (for ALB and NAT Gateway)
- Private subnets (for ECS tasks and RDS)
- Three subnet tiers carved out of the 10.0.0.0/16 VPC: public, private application, and isolated data (the specific /24 allocations are listed under Phase 1 below)
Security layering:
- Application Load Balancer in public subnets terminating SSL
- ECS tasks in private subnets with no direct internet access
- RDS in isolated data subnets with security groups allowing only ECS task traffic
- Secrets Manager for database credentials (no more hardcoded passwords)
- IAM task roles following the least privilege principle
Cost vs. performance trade-offs:
The original $800/month dedicated server would be replaced with:
- ECS Fargate: ~$480/month (assuming 20% average utilization, 5 tasks during peak)
- RDS with read replicas: ~$700/month
- ALB: ~$25/month
- EFS: ~$60/month
- NAT Gateway: ~$45/month
- CloudWatch and other services: ~$90/month
Total: ~$1,400/month baseline, with room to scale to $2,000 during peak periods. More expensive than the single server, yes, but with 99.95% uptime, auto-scaling, zero maintenance windows, and the ability to handle 10x traffic spikes.
Implementation Journey
Phase 1: Foundation (Week 1)
The first step was creating a rock-solid network foundation. I used Terraform for all infrastructure provisioning because repeatability and disaster recovery were critical.
VPC and networking setup:
I created a VPC with CIDR 10.0.0.0/16, spanning us-east-1a and us-east-1b. The subnet design followed AWS best practices:
# Public subnets for ALB and NAT
public_subnet_a: 10.0.1.0/24
public_subnet_b: 10.0.2.0/24
# Private subnets for ECS tasks
private_app_subnet_a: 10.0.11.0/24
private_app_subnet_b: 10.0.12.0/24
# Isolated subnets for RDS
private_data_subnet_a: 10.0.21.0/24
private_data_subnet_b: 10.0.22.0/24
I deployed NAT Gateways in both AZs for redundancy, though this doubled the cost. In hindsight, a single NAT Gateway would have been fine for their traffic patterns—a lesson learned that cost an extra $45/month.
IAM roles and security baseline:
I created three primary IAM roles:
- ECS Task Execution Role: Allowed ECS to pull images from ECR, fetch secrets from Secrets Manager, and write logs to CloudWatch
- ECS Task Role: Granted the application permission to access S3 (for future migration), write to CloudWatch Logs, and nothing else
- RDS Enhanced Monitoring Role: Enabled Performance Insights
The security group architecture was restrictive (sketched in Terraform after this list):
- ALB security group: Allow inbound 443 from 0.0.0.0/0, outbound to ECS security group on port 80
- ECS security group: Allow inbound from ALB only, outbound to RDS on 3306 and internet via NAT
- RDS security group: Allow inbound from ECS security group only on port 3306
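A minimal Terraform sketch of that layering (group names are illustrative; the ALB's egress rule to the task security group lives in a separate aws_security_group_rule in the real module, since referencing both groups inline would create a dependency cycle):
resource "aws_security_group" "alb" {
  name   = "ecommerce-alb"
  vpc_id = aws_vpc.main.id

  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  # Egress to the ECS task security group on port 80 is defined separately.
}

resource "aws_security_group" "ecs_tasks" {
  name   = "ecommerce-ecs-tasks"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "HTTP from the ALB only"
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    description = "Outbound to RDS and the internet via NAT"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "rds" {
  name   = "ecommerce-rds"
  vpc_id = aws_vpc.main.id

  ingress {
    description     = "MySQL from ECS tasks only"
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.ecs_tasks.id]
  }
}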
Initial challenge I faced:
During initial testing, ECS tasks couldn't pull images from ECR. After 30 minutes of head-scratching, I realized the VPC endpoints for ECR weren't configured, forcing traffic through the NAT Gateway. Once I added VPC endpoints for ecr.api, ecr.dkr, and s3 (ECR uses S3 behind the scenes), image pulls became both faster and cheaper. This saved about $15/month in NAT Gateway data processing charges.
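The fix, sketched in Terraform (aws_route_table.private and aws_security_group.vpc_endpoints are assumed supporting resources not shown in this post; the endpoint security group needs to allow HTTPS from the app subnets):
# Gateway endpoint for S3; ECR stores image layers in S3
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

# Interface endpoints for the ECR control plane and Docker registry APIs
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private_app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.us-east-1.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private_app[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}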
Phase 2: Core Services (Week 2-3)
Database migration approach:
Zero-downtime migration was the make-or-break requirement. I used this approach:
- Baseline export: Took a snapshot of the production MySQL database during low-traffic hours (Sunday 3 AM)
- RDS provisioning: Restored snapshot to a new RDS instance, upgraded from MySQL 5.7 to 8.0
- Replication setup: Configured binary log replication from on-prem MySQL to RDS using MySQL native replication
- Validation period: Ran replication for 5 days, monitoring lag (stayed under 2 seconds)
- Cutover: During a brief read-only maintenance window, I:
- Put application in read-only mode
- Verified replication lag was zero
- Updated database connection string to RDS endpoint
- Enabled writes on new database
- Monitored for 15 minutes before declaring success
The entire cutover took 12 minutes of read-only mode, well within the 15-minute downtime budget.
Pro tip: I set up RDS Performance Insights immediately and discovered three queries consuming 60% of database time. A couple of missing indexes later, average query time dropped from 340ms to 45ms.
Application containerization strategy:
Creating the Docker image was straightforward but had nuances:
FROM php:8.1-apache
# Install PHP extensions the app needed
RUN docker-php-ext-install mysqli pdo pdo_mysql opcache
# Copy the Apache virtual host configuration
COPY apache-config.conf /etc/apache2/sites-available/000-default.conf
# Copy application code
COPY src/ /var/www/html/
# Set proper permissions
RUN chown -R www-data:www-data /var/www/html
# Enable Apache modules
RUN a2enmod rewrite
EXPOSE 80
The gotcha: Their application used session.save_path pointing to /tmp, which wasn't persistent across container restarts. I updated the PHP configuration to use their existing Redis instance (which I'd already migrated to ElastiCache).
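For reference, a minimal Terraform sketch of an ElastiCache setup along those lines (a single cache.t4g.micro node, matching Challenge 1 below; the cache security group and the Redis/parameter-group versions are assumptions):
resource "aws_elasticache_subnet_group" "sessions" {
  name       = "ecommerce-sessions"
  subnet_ids = aws_subnet.private_app[*].id
}

resource "aws_elasticache_cluster" "sessions" {
  cluster_id           = "ecommerce-sessions"
  engine               = "redis"
  node_type            = "cache.t4g.micro"
  num_cache_nodes      = 1
  port                 = 6379
  parameter_group_name = "default.redis7"
  subnet_group_name    = aws_elasticache_subnet_group.sessions.name
  security_group_ids   = [aws_security_group.cache.id]
}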
ECS cluster and task definition:
I created an ECS cluster with Container Insights enabled for monitoring:
aws ecs create-cluster \
--cluster-name production-ecommerce \
--settings name=containerInsights,value=enabled
The task definition specified:
- CPU: 2048 units (2 vCPU)
- Memory: 4096 MB (4GB)
- Network mode: awsvpc (required for Fargate)
- Log driver: awslogs, streaming to CloudWatch Logs
- EFS mount: Product images directory mounted at /var/www/html/uploads
- Environment variables: Pulled from Secrets Manager for database credentials (a Terraform sketch of the full task definition follows this list)
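A condensed Terraform sketch of that task definition (the image URI reuses the placeholder account ID from the spec section; the IAM roles, EFS file system, Secrets Manager secret, and log group name are assumed resources):
resource "aws_ecs_task_definition" "app" {
  family                   = "ecommerce-web"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "2048"   # later right-sized to 1024 (see Phase 4)
  memory                   = "4096"   # later right-sized to 2048
  execution_role_arn       = aws_iam_role.task_execution.arn
  task_role_arn            = aws_iam_role.task.arn

  volume {
    name = "uploads"
    efs_volume_configuration {
      file_system_id = aws_efs_file_system.uploads.id
    }
  }

  container_definitions = jsonencode([{
    name         = "php-app"
    image        = "637-account-id.dkr.ecr.us-east-1.amazonaws.com/ecommerce-app:latest"
    portMappings = [{ containerPort = 80, protocol = "tcp" }]
    mountPoints  = [{ sourceVolume = "uploads", containerPath = "/var/www/html/uploads" }]
    secrets = [
      { name = "DB_PASSWORD", valueFrom = aws_secretsmanager_secret.db_password.arn }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/ecommerce-web"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "web"
      }
    }
  }])
}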
Integration points:
The Application Load Balancer was configured with:
- HTTPS listener (port 443) with their SSL certificate from ACM
- Target group pointing to the ECS service, health check on /health.php
- Redirect from HTTP (port 80) to HTTPS
- Deregistration delay of 30 seconds (important for graceful shutdowns)
One issue I hit: The default health check interval was 30 seconds, which caused false positives during deployments. I tuned it to 10-second intervals with a 3-second timeout and 2 consecutive healthy checks required.
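In Terraform, the tuned target group looked roughly like this (the resource name and unhealthy threshold are assumptions; the other values mirror the tuning described above):
resource "aws_lb_target_group" "app" {
  name        = "ecommerce-web"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"  # Fargate tasks in awsvpc mode register by IP

  deregistration_delay = 30  # later raised to 60 for graceful draining (see Challenge 5)

  health_check {
    path                = "/health.php"
    interval            = 10   # seconds between checks
    timeout             = 3
    healthy_threshold   = 2
    unhealthy_threshold = 3    # assumption, not stated above
    matcher             = "200"
  }
}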
Phase 3: Advanced Features (Week 4)
Monitoring and logging setup:
With Container Insights enabled, I immediately got visibility into cluster, service, and task-level metrics. I created CloudWatch dashboards showing:
- ECS CPU and memory utilization per task
- ALB request count, latency (p50, p95, p99), and HTTP status codes
- RDS connections, CPU, IOPS, and query performance
I set up CloudWatch Alarms for:
- ECS CPU > 70% for 5 minutes (triggers auto-scaling)
- ALB target response time > 1 second (alerts development team)
- RDS CPU > 80% (pages on-call engineer)
- ECS task count < 2 (ensures at least two tasks always running)
The application logs were already going to CloudWatch Logs, so I created metric filters to count PHP errors and alert on spikes.
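A hedged sketch of one such metric filter and its alarm in Terraform (the filter pattern, threshold, log group name, and SNS topic are illustrative and depend on the app's actual log format):
resource "aws_cloudwatch_log_metric_filter" "php_errors" {
  name           = "php-error-count"
  log_group_name = "/ecs/ecommerce-web"   # assumed log group name
  pattern        = "?\"PHP Fatal error\" ?\"PHP Warning\""

  metric_transformation {
    name      = "PhpErrorCount"
    namespace = "Ecommerce/Application"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "php_error_spike" {
  alarm_name          = "php-error-spike"
  namespace           = "Ecommerce/Application"
  metric_name         = "PhpErrorCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 50                     # illustrative threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.alerts.arn]  # assumed SNS topic
}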
Auto-scaling configuration:
I implemented target tracking scaling policies for the ECS service:
- CPU-based scaling: Target 60% average CPU utilization
- Scale out when average CPU > 60% for 3 minutes
- Scale in when average CPU < 60% for 10 minutes (longer cooldown to prevent flapping)
- ALB request count scaling: Target 1,000 requests per task per minute
- Ensured no single task got overwhelmed during traffic spikes
The scaling policy configuration:
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/production-ecommerce/web-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 10
aws application-autoscaling put-scaling-policy \
--policy-name cpu-target-tracking \
--service-namespace ecs \
--resource-id service/production-ecommerce/web-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://scaling-policy.json
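For anyone managing the same thing in Terraform rather than the CLI, a roughly equivalent target-tracking setup looks like this (cooldowns reflect the 3-minute scale-out and 10-minute scale-in behavior described above):
resource "aws_appautoscaling_target" "web" {
  service_namespace  = "ecs"
  resource_id        = "service/production-ecommerce/web-service"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu_target_tracking" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.web.service_namespace
  resource_id        = aws_appautoscaling_target.web.resource_id
  scalable_dimension = aws_appautoscaling_target.web.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0   # target 60% average CPU
    scale_out_cooldown = 180    # respond to bursts within ~3 minutes
    scale_in_cooldown  = 600    # scale in slowly to avoid flapping

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}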
During the first simulated traffic spike (using Apache Bench to generate 10x normal load), the service scaled from 2 tasks to 7 tasks within 4 minutes. Beautiful.
Disaster recovery implementation:
RDS automated backups ran daily with 7-day retention. I also configured:
- Manual snapshots before major deployments
- Read replica promotion procedure documented (RTO: 5 minutes; RPO: near-zero, since asynchronous replica lag consistently stayed under a few seconds)
- Cross-region snapshot copies to us-west-2 for true disaster recovery
For the application layer, the ECS task definition was version-controlled in Git. Rolling back a bad deployment was as simple as updating the service to use the previous task definition revision—typically completed in under 3 minutes.
Phase 4: Optimization (Week 5-6)
Performance tuning steps I took:
After running in production for a week, I analyzed the CloudWatch metrics and made several optimizations:
- Rightsized task resources: Initial 2vCPU/4GB was overkill. Average CPU was 25%, memory at 1.2GB. I reduced to 1vCPU/2GB, cutting Fargate costs by 50%.
- Enabled OPcache aggressively: Modified PHP configuration to cache compiled code for 1 hour. This alone reduced CPU usage by another 20%.
- Tuned Apache MaxRequestWorkers: Set to 50 (from default 150) based on actual concurrent connection patterns, reducing memory footprint.
- Implemented database connection pooling: Modified the application to reuse database connections across requests instead of creating new connections, reducing RDS connection count from 200 to 40.
Cost optimization measures:
Beyond right-sizing, I implemented:
- RDS Reserved Instances: Committed to a 1-year reserved instance for the primary database, saving 35% (~$120/month)
- CloudWatch Logs retention: Set to 30 days instead of indefinite, reducing storage costs
- EFS Intelligent-Tiering: Moved to lifecycle policies that transitioned files not accessed in 30 days to the Infrequent Access storage class, cutting EFS costs by 40% (sketched after this list)
- Removed redundant NAT Gateway: Consolidated to a single NAT Gateway after proving traffic patterns didn't justify the redundancy cost
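The EFS lifecycle change, sketched in Terraform (the resource name is assumed; the provisioned-throughput episode from Challenge 3 below is noted in the comments):
resource "aws_efs_file_system" "uploads" {
  encrypted = true

  # Files untouched for 30 days move to the cheaper Infrequent Access class
  lifecycle_policy {
    transition_to_ia = "AFTER_30_DAYS"
  }

  # Bursting mode was enough once static assets moved to S3 + CloudFront;
  # during the Challenge 3 bottleneck this was temporarily set to
  # throughput_mode = "provisioned" with provisioned_throughput_in_mibps = 100.
  throughput_mode = "bursting"

  tags = {
    Name = "ecommerce-uploads"
  }
}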
Final optimized monthly cost: $1,250, well under the $2,000 budget.
Security hardening:
Post-launch security audit revealed a few items to tighten:
- Enabled VPC Flow Logs: Started logging all network traffic for security auditing
- Implemented AWS WAF: Added basic rate limiting and SQL injection protection rules on the ALB (a Terraform sketch follows this list)
- Restricted IAM policies: Removed broad S3 permissions from task role that weren't being used
- Enabled RDS encryption at rest: Took a snapshot, created new encrypted instance, migrated with zero downtime using the same replication strategy
- Configured AWS Config rules: Automated compliance checks for security group rules and public subnet configurations
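A trimmed Terraform sketch of the WAF piece (rule priorities, the rate limit value, and the ALB resource name aws_lb.app are assumptions):
resource "aws_wafv2_web_acl" "app" {
  name  = "ecommerce-web-acl"
  scope = "REGIONAL"   # REGIONAL is required for ALB associations

  default_action {
    allow {}
  }

  # Basic per-IP rate limiting
  rule {
    name     = "rate-limit"
    priority = 1

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000   # requests per 5-minute window per IP (assumed value)
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit"
      sampled_requests_enabled   = true
    }
  }

  # AWS managed SQL injection rule group
  rule {
    name     = "sqli"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesSQLiRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "sqli"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "ecommerce-web-acl"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "alb" {
  resource_arn = aws_lb.app.arn
  web_acl_arn  = aws_wafv2_web_acl.app.arn
}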
Technical Specifications
Compute Configuration
ECS Cluster:
- Name: production-ecommerce
- Launch type: Fargate
- Platform version: 1.4.0 (latest)
- Container Insights: Enabled
- Capacity providers: FARGATE and FARGATE_SPOT (80/20 split for cost optimization)
ECS Task Definition:
- Task CPU: 1024 units (1 vCPU)
- Task Memory: 2048 MB (2GB)
- Network mode: awsvpc
- Container image: 637-account-id.dkr.ecr.us-east-1.amazonaws.com/ecommerce-app:latest
- Health check: curl -f http://localhost/health.php || exit 1
- EFS volume mount: /var/www/html/uploads
ECS Service:
- Desired count: 2 (minimum), 10 (maximum)
- Deployment type: Rolling update
- Deployment circuit breaker: Enabled (automatic rollback on failure)
- Load balancer: Application Load Balancer target group
- Service auto-scaling: Enabled with target tracking policies
Database Configuration
RDS Primary Instance:
- Engine: MySQL 8.0.35
- Instance class: db.r6g.xlarge (4 vCPU, 32GB RAM)
- Storage: 500GB gp3 (16,000 IOPS, 1000 MB/s throughput)
- Multi-AZ: Enabled
- Backup retention: 7 days, automated snapshots at 3 AM UTC
- Encryption: Enabled (KMS)
- Parameter group: Custom with optimized InnoDB settings
RDS Read Replicas (2):
- Instance class: db.r6g.large (2 vCPU, 16GB RAM)
- Purpose: Analytics and reporting queries
- Replication lag monitoring: CloudWatch alarm if lag > 5 seconds
Network Architecture
VPC Design:
- CIDR: 10.0.0.0/16
- DNS hostnames: Enabled
- Availability zones: us-east-1a, us-east-1b
Subnets:
- 2 public subnets (10.0.1.0/24, 10.0.2.0/24)
- 2 private application subnets (10.0.11.0/24, 10.0.12.0/24)
- 2 isolated database subnets (10.0.21.0/24, 10.0.22.0/24)
Routing:
- Public subnets: Route to Internet Gateway
- Private subnets: Route to NAT Gateway (single, in us-east-1a)
- Database subnets: No internet route
Terraform Snippet for VPC
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "ecommerce-vpc"
Environment = "production"
}
}
resource "aws_subnet" "private_app" {
count = 2
vpc_id = aws_vpc.main.id
cidr_block = "10.0.${10 + count.index}.0/24"
availability_zone = data.aws_availability_zones.available.names[count.index]
tags = {
Name = "private-app-${count.index + 1}"
Tier = "application"
}
}
resource "aws_ecs_cluster" "main" {
name = "production-ecommerce"
setting {
name = "containerInsights"
value = "enabled"
}
}
resource "aws_ecs_service" "web" {
name = "web-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = 2
launch_type = "FARGATE"
network_configuration {
subnets = aws_subnet.private_app[*].id
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = "php-app"
container_port = 80
}
deployment_circuit_breaker {
enable = true
rollback = true
}
}
Challenges and Solutions
Challenge 1: Session Management Across Multiple Containers
Problem: The application stored PHP sessions in local /tmp directory. When ECS spun up multiple tasks, users would randomly lose their sessions when requests hit different containers. Shopping carts were disappearing, and the CEO was getting angry customer emails.
Troubleshooting process:
- Initially thought it was a load balancer sticky session issue, spent 2 hours configuring session affinity
- Realized through CloudWatch Logs that session IDs were valid but session data was missing
- SSH'd into a running container (via ECS Exec feature) and discovered sessions were stored locally
Solution implemented:
- Migrated session storage to ElastiCache Redis (t4g.micro, $15/month)
- Updated PHP configuration: session.save_handler = redis and session.save_path = "tcp://cache-endpoint:6379"
- Tested by deliberately killing containers mid-session; sessions persisted perfectly
Lesson learned: Always externalize state. What seems like a quick fix (local storage) becomes a blocker for horizontal scaling. Now I audit state management in every migration discovery phase.
Challenge 2: Database Connection Exhaustion
Problem: Two weeks post-launch, during a flash sale, the application started throwing "Too many connections" errors. RDS was limited to 400 concurrent connections, and we were hitting that limit despite only 5 ECS tasks running.
How I troubleshot it:
- Used RDS Performance Insights to see connection count spiking to 380+ during peak traffic
- Ran SHOW PROCESSLIST on the database and found hundreds of sleeping connections
- Reviewed application code: discovered database connections weren't being closed properly, and PHP was creating new connections for every request
Solution implemented:
- Implemented persistent database connections in PHP (mysqli_connect with the p: host prefix)
- Added connection pooling logic in the application bootstrap
- Set MySQL wait_timeout to 300 seconds (down from 28,800) to kill idle connections faster; the parameter group change is sketched after this list
- Monitored RDS connection count over a week; it stabilized at 40-60 connections
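A minimal Terraform sketch of that parameter group change (the custom InnoDB settings mentioned in the Technical Specifications section are omitted here):
resource "aws_db_parameter_group" "mysql80" {
  name   = "ecommerce-mysql80"
  family = "mysql8.0"

  # Kill idle connections after 5 minutes instead of the 8-hour default
  parameter {
    name  = "wait_timeout"
    value = "300"
  }

  # The optimized InnoDB settings referenced in the Technical Specifications
  # section also live in this group; attach it to the primary instance via
  # the instance's parameter_group_name argument.
}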
Lesson learned: Connection management is often overlooked in monolithic PHP applications because a single server masks the issue. Containerization exposes these inefficiencies. Always benchmark connection behavior under load before going live.
Challenge 3: EFS Performance Bottleneck
Problem: After migration, page load times for product pages with images were averaging 4.5 seconds—worse than the original dedicated server. CloudWatch showed EFS I/O wait times spiking.
Troubleshooting:
- Used CloudWatch EFS metrics to see BurstCreditBalance was at zero during peak hours
- Realized EFS throughput in bursting mode was insufficient for their access patterns (200GB meant baseline throughput of only 10 MB/s)
- Profiled the application and found it was making thousands of file_exists() checks on EFS for every request
Solution implemented:
- Short-term fix: Enabled EFS Provisioned Throughput (100 MB/s), adding $300/month in cost
- Long-term solution: Migrated static assets to S3 + CloudFront over next sprint
- Modified upload handler to push to S3 instead of EFS
- Updated image URLs to use CloudFront distribution
- Reduced EFS to only cache and temporary files
- Result: Page load times dropped to 1.2 seconds, EFS costs reduced by switching back to bursting mode
Lesson learned: EFS is convenient for lift-and-shift but not optimized for web-facing static content. Always consider the right storage service for the access pattern. S3 + CloudFront is almost always better for static assets in production.
Challenge 4: Auto-Scaling Overreaction
Problem: The auto-scaling configuration initially caused instability. During traffic spikes, ECS would scale from 2 to 10 tasks in 2 minutes, then scale back down to 2 tasks 5 minutes later when load decreased. This thrashing caused customer-facing errors during task spin-up/shutdown.
How I troubleshot it:
- Reviewed CloudWatch metrics and saw rapid desired count changes every few minutes
- Realized the default scale-in cooldown was too aggressive (60 seconds)
- Also discovered the target metric (CPU percentage) was too sensitive to temporary spikes
Solution implemented:
- Increased scale-in cooldown to 600 seconds (10 minutes), giving time for sustained load patterns
- Changed scale-out cooldown to 180 seconds (3 minutes) to respond quickly to traffic
- Added a secondary scaling metric: ALB RequestCountPerTarget, providing more stable signal
- Set minimum task count to 3 (instead of 2) during business hours using scheduled scaling (sketched below)
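A hedged Terraform sketch of that scheduled floor (the exact cron hours are assumptions; it reuses the aws_appautoscaling_target shown earlier):
resource "aws_appautoscaling_scheduled_action" "business_hours_floor" {
  name               = "business-hours-min-3"
  service_namespace  = aws_appautoscaling_target.web.service_namespace
  resource_id        = aws_appautoscaling_target.web.resource_id
  scalable_dimension = aws_appautoscaling_target.web.scalable_dimension

  # Raise the floor on weekday mornings; cron is evaluated in UTC
  schedule = "cron(0 12 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 3
    max_capacity = 10
  }
}

resource "aws_appautoscaling_scheduled_action" "overnight_floor" {
  name               = "overnight-min-2"
  service_namespace  = aws_appautoscaling_target.web.service_namespace
  resource_id        = aws_appautoscaling_target.web.resource_id
  scalable_dimension = aws_appautoscaling_target.web.scalable_dimension

  # Drop the floor back to 2 overnight
  schedule = "cron(0 2 ? * * *)"

  scalable_target_action {
    min_capacity = 2
    max_capacity = 10
  }
}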
Lesson learned: Auto-scaling is not "set it and forget it." Proper configuration requires understanding traffic patterns and testing under realistic loads. Conservative scale-in policies prevent thrashing while aggressive scale-out policies handle bursts.
Challenge 5: Deployment-Induced Downtime
Problem: Our first production deployment after go-live caused 30 seconds of 502 errors. Users on the checkout flow abandoned carts, and I had to explain the incident to stakeholders.
Troubleshooting process:
- Reviewed ALB access logs—saw 502s occurred during task replacement
- Discovered the issue: ECS was terminating old tasks before new tasks passed health checks
- Application was also not handling SIGTERM gracefully, cutting off in-flight requests
Solution implemented (key settings sketched in Terraform after this list):
- Enabled deployment circuit breaker in ECS service (automatically rolls back failed deployments)
- Configured rolling deployment with minimum healthy percent of 100%, maximum percent of 200%
- Updated application to handle SIGTERM: gracefully finish in-flight requests before shutdown (max 30 seconds)
- Increased ALB deregistration delay to 60 seconds, allowing tasks to drain connections
- Increased the container stop timeout (stopTimeout) in the task definition so the application could flush logs before termination
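The deployment-related arguments, sketched in Terraform against the service and target group resources shown elsewhere in this post (a sketch of the added arguments, not complete standalone definitions):
resource "aws_ecs_service" "web" {
  # ...cluster, task_definition, networking, and load_balancer as shown earlier...

  deployment_minimum_healthy_percent = 100  # never drop below the desired count mid-deploy
  deployment_maximum_percent         = 200  # start replacement tasks before stopping old ones

  deployment_circuit_breaker {
    enable   = true
    rollback = true  # automatically roll back a deployment whose tasks never go healthy
  }
}

resource "aws_lb_target_group" "app" {
  # ...port, protocol, and health_check as shown earlier...

  deregistration_delay = 60  # give draining tasks a minute to finish in-flight requests
}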
Lesson learned: Zero-downtime deployments require coordination between application signal handling, load balancer deregistration timing, and ECS deployment configuration. The defaults assume stateless, fast-starting applications—most PHP apps need tuning.
Results and Metrics
After 3 months of operation on the new AWS architecture, here are the quantified outcomes:
Performance improvements:
- Average page load time: Reduced from 2.8 seconds to 1.2 seconds (57% improvement)
- Time to first byte (TTFB): Improved from 890ms to 240ms (73% improvement)
- Database query response time: Average dropped from 340ms to 45ms (87% improvement)
- Peak traffic handling: System now handles 8,500 requests/minute (10x original capacity) without degradation
Cost comparison:
- Before: $800/month (dedicated server) + $150/month (CDN) + $100/month (monitoring) = $1,050/month baseline
- After: $1,250/month AWS infrastructure (all-inclusive)
- Cost increase: 19% higher baseline cost
- Value delivered: 99.95% uptime vs. 98.2% previously, eliminated $500 in emergency scaling costs per event (happened 4 times/year)
- True savings: Reduced operational overhead by 15 hours/month (no more manual scaling, patching, or backup management)
Uptime and reliability gains:
- Uptime: Improved from 98.2% to 99.94% over 90-day period
- Failed deployments: Zero production-impacting incidents (circuit breaker prevented 3 bad deployments from affecting users)
- Recovery time: RTO improved from 4 hours to 8 minutes for application issues, 30 minutes for database issues
- Zero unplanned downtime in 90 days vs. 3 incidents in previous 90 days
Time-to-market improvements:
- Deployment frequency: Increased from 15 deploys/month to 45 deploys/month (3x)
- Deployment duration: Reduced from 25 minutes (manual) to 8 minutes (automated rolling update)
- Rollback time: Improved from 35 minutes to 3 minutes
- Developer confidence: Team now deploys during business hours instead of 2 AM maintenance windows
Team productivity improvements:
- On-call incidents: Reduced from 8/month to 2/month (monitoring and auto-healing eliminated most alerts)
- Time spent on infrastructure: Reduced from 40 hours/month to 10 hours/month
- Mean time to investigate (MTTI): Reduced from 25 minutes to 7 minutes (thanks to CloudWatch Container Insights and centralized logging)
Business impact:
- Successfully handled Black Friday 2025 with zero downtime (peak traffic was 12,000 req/min, system auto-scaled to 9 tasks)
- Marketing team now schedules flash sales without engineering involvement (auto-scaling handles it)
- PCI-DSS compliance achieved (AWS shared responsibility model simplified audit requirements)
Key Takeaways
After delivering this migration, here's what I'd share with anyone doing similar work:
- Start with network and security fundamentals: VPC design, security groups, and IAM roles are unglamorous but will save you countless hours later. Get them right before touching application code.
- Externalize all state early: Sessions, file uploads, caches—anything that lives on disk must be identified and migrated to external services before containerization. This was our biggest gotcha.
- Database migration deserves 30% of project time: We spent 2 weeks on a database migration strategy for a 500GB database. Worth every hour—zero downtime and zero data loss is non-negotiable for production systems.
- Right-sizing is iterative: Start with generous resource allocations, monitor for a week, then optimize. We cut our Fargate costs by 50% after the initial week by right-sizing task definitions based on real usage patterns.
- Container Insights is worth enabling from day one: The visibility into task-level metrics, correlated with application logs, reduced our mean time to resolution by 70%. It costs about $30/month but saves hours of troubleshooting.
- Auto-scaling needs real-world testing: Synthetic load tests don't capture actual traffic patterns. We had to tune auto-scaling policies three times based on production traffic before getting it right.
- Deployment strategies matter more than you think: Graceful shutdowns, health check tuning, and deregistration delays prevented customer-facing errors during deployments. This took 3 failed attempts to get right.
- Cost optimization is ongoing: Monthly reviews of CloudWatch metrics revealed opportunities to save 30% through reserved instances, intelligent tiering, and removing over-provisioned resources.
What I'd do differently next time:
- Migrate to S3 earlier: We should have moved static assets to S3 + CloudFront during the initial migration rather than using EFS as a crutch. The EFS performance issues cost us 2 weeks of firefighting.
- Implement feature flags: We deployed the entire migration as a big-bang cutover. Feature flags would have allowed gradual traffic shifting and faster rollback if issues arose.
- Load test more aggressively: Our pre-launch load tests were at 2x expected peak traffic. Real Black Friday traffic hit 3x, causing brief issues. Test at 5x to be safe.
- Document runbooks earlier: We created operational runbooks after the first incident. Should have documented "what to do when X happens" scenarios before launch.
Tech Stack Summary
Compute:
- AWS ECS (Fargate launch type) for container orchestration
- Application Load Balancer for traffic distribution and SSL termination
- Amazon ECR for Docker image registry
Storage:
- Amazon S3 for static assets and backups
- Amazon EFS for shared file storage (temporary, migrating to S3)
- CloudFront CDN for global content delivery
Database:
- Amazon RDS MySQL 8.0 (Multi-AZ) for primary transactional database
- RDS Read Replicas (2x) for analytics and reporting queries
- Amazon ElastiCache Redis for session storage and application caching
Networking:
- Amazon VPC with public/private/isolated subnet architecture
- NAT Gateway for outbound internet access from private subnets
- VPC Flow Logs for network traffic auditing
- AWS WAF for application-layer security (rate limiting, SQL injection protection)
Security:
- AWS IAM for access control and task permissions
- AWS Secrets Manager for database credentials and API keys
- AWS Certificate Manager for SSL/TLS certificates
- AWS KMS for encryption key management
Monitoring:
- Amazon CloudWatch Container Insights for ECS metrics
- CloudWatch Logs for centralized application and infrastructure logging
- CloudWatch Alarms for proactive alerting
- RDS Performance Insights for database query analysis
IaC Tools:
- Terraform for all infrastructure provisioning (VPC, ECS, RDS, ALB)
- AWS CLI for operational tasks and troubleshooting
- GitHub Actions for CI/CD pipeline (Docker build, ECR push, ECS deployment)
- Docker for containerization
This migration transformed a fragile, monolithic application into a resilient, scalable cloud-native architecture while maintaining business continuity. The key wasn't using the fanciest AWS services—it was understanding the constraints, making pragmatic architecture decisions, and executing a phased migration strategy that minimized risk at every step. Three months in, the team is shipping faster, sleeping better, and the business is growing without infrastructure anxiety.
Conclusion
This migration represents a successful transformation from legacy infrastructure to modern cloud-native architecture. The project delivered immediate business value through improved reliability, performance, and operational efficiency while establishing a scalable foundation for future growth. The containerized monolith approach proved to be the right strategy—achieving 99.94% uptime and 10x capacity without the complexity and risk of a full microservices rewrite.
The investment in proper planning, phased execution, and post-launch optimization resulted in a system that not only meets current business needs but provides the architectural flexibility to support the company's growth trajectory over the next 3-5 years.
Project Status: ✅ Complete and in production with ongoing optimization
Business Owner Satisfaction: High - exceeded uptime and performance targets while staying under budget
ROI Timeline: 12 months (accounting for operational efficiency gains and eliminated incident costs)

