Oluwagbade Odimayo

From Zero Downtime to Zero Intervention. How I Built a Self-Healing Deployment Pipeline on AWS.

This is Part 2 of a series. Part 1 covers building the core blue-green deployment pipeline on AWS EKS from scratch on Ubuntu. Read it here: Your Deployments Are Causing Downtime. Mine Do Not. Here Is Why. This post picks up exactly where that one left off.


After Part 1, I had a working blue-green pipeline. Push to main, GitHub Actions builds the image, deploys to the idle environment, verifies health, and switches traffic. Zero downtime. Proven with a curl loop.

But I kept thinking about what happens after the switch.

Someone has to be watching. If the new version has a subtle bug that only shows up under real production load, and it is 3am, and nobody is watching the dashboard, users are getting errors for however long it takes for someone to wake up, notice, and roll back manually. That could be 30 minutes. That is not zero downtime. That is delayed downtime.

I also kept thinking about the all-or-nothing nature of the switch. Blue-green is binary. You go from 0% on green to 100% on green in one command. If green has a bug, 100% of your users hit it simultaneously before the rollback fires.

So I went back and built four more things: Terraform to manage the infrastructure as code, Prometheus and Grafana to make the switch moment visible in real-time metrics, canary releases to limit the blast radius of a bad release, and automated rollback to make the system self-healing.

This post covers all four. Every command is real. Every screenshot was taken on a live AWS EKS cluster.


What I Added on Top of Part 1

The core system from Part 1 stays exactly the same. Two Kubernetes deployments (blue and green), a Service that routes traffic based on a label selector, NGINX Ingress for the public URL, and a GitHub Actions pipeline that automates the switch.

What I added:

  • Terraform to provision all AWS infrastructure as code
  • Prometheus to scrape HTTP request metrics from every pod
  • Grafana to visualise request rates per environment in real time
  • Canary releases using NGINX Ingress weight annotations
  • Automated rollback using a shell script that watches pod health after every switch

Each one builds on the previous. Canary needs the NGINX Ingress already in place, and automated rollback needs a health signal it can act on. The order matters.


Terraform: Infrastructure as Code

The first time I built this project, I used eksctl create cluster and a series of aws CLI commands to provision the infrastructure. That works once. But there is no record of what was created, it cannot be version-controlled, and rebuilding requires remembering every command in the correct order.

Terraform solves all three problems.

resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  role_arn = aws_iam_role.eks_cluster.arn
  version  = var.kubernetes_version

  vpc_config {
    subnet_ids             = aws_subnet.public[*].id
    endpoint_public_access = true
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
  ]
}

The entire AWS infrastructure (VPC, subnets, internet gateway, route table, EKS cluster, node group, ECR repository with lifecycle policy, and all IAM roles and policy attachments) is defined across three files. Rebuilding everything takes one command:

terraform apply

It completes in 10 to 15 minutes and prints every output you need:

cluster_name           = "bluegreen-cluster"
cluster_version        = "1.31"
ecr_repository_url     = "677276115158.dkr.ecr.us-east-1.amazonaws.com/bluegreen-app"
kubectl_config_command = "aws eks update-kubeconfig --region us-east-1 --name bluegreen-cluster"

Tearing down is equally clean: terraform destroy. Except for one important gotcha.

The Tear-Down Problem Nobody Warns You About

When you run terraform destroy, it fails with a VPC dependency error. The reason is that NGINX Ingress creates an AWS Elastic Load Balancer through Kubernetes, outside of Terraform's knowledge. That load balancer is still attached to the subnets when Terraform tries to delete them, blocking the entire VPC deletion.

The fix is to delete the load balancer manually before running destroy:

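# Assumes the ingress load balancer is the only Classic ELB in this region.
# With more than one, pick the right name by hand instead of taking the whole list.
LB_NAME=$(aws elb describe-load-balancers --region us-east-1 \
  --query "LoadBalancerDescriptions[*].LoadBalancerName" --output text)

aws elb delete-load-balancer --region us-east-1 --load-balancer-name "$LB_NAME"
# Give AWS time to release the load balancer's network interfaces.
sleep 30
terraform destroy -auto-approve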

This catches people off guard because Terraform only destroys what it has in its state. A load balancer that Kubernetes created on behalf of a Helm chart or a kubectl apply is invisible to Terraform, so you have to clean it up first.
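An alternative to deleting the load balancer through the AWS CLI is to let Kubernetes remove it: deleting the LoadBalancer Service that the ingress controller owns asks AWS to tear the ELB down. The sketch below assumes the controller was installed with Helm as a release named ingress-nginx in the ingress-nginx namespace. Both names are assumptions, not taken from the repository. It defaults to a dry run that only prints the commands.

```shell
#!/bin/bash
# Sketch: remove Kubernetes-created AWS resources before Terraform runs.
# RELEASE and NAMESPACE are assumed names; adjust them to your install.
RELEASE="${RELEASE:-ingress-nginx}"
NAMESPACE="${NAMESPACE:-ingress-nginx}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

teardown() {
  # Uninstalling the release deletes its LoadBalancer Service, which in turn
  # asks AWS to delete the load balancer attached to the subnets.
  run helm uninstall "$RELEASE" --namespace "$NAMESPACE" || return 1
  # Fixed pause, as in the manual procedure, so AWS can finish the cleanup.
  run sleep 30
  run terraform destroy -auto-approve
}

teardown
```

Running it as-is prints the three commands. Setting DRY_RUN=0 executes them, and the fixed 30-second pause may need to be longer if the load balancer is slow to disappear.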


Prometheus and Grafana: Making the Switch Visible

After Part 1, the only way to see the traffic switch happen was a terminal showing JSON responses. A proper observability stack answers questions a curl loop cannot: what was the request rate before the switch? How fast did green ramp up? Was there any degradation during the transition?

Adding Metrics to the Application

I added prom-client to the Node.js application and created two metrics: a counter tracking HTTP requests labelled by method, route, status code, color, and version, and a gauge identifying which environment each pod belongs to.

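const client = require("prom-client");
// Assumed setup: a dedicated registry that the /metrics route serializes.
const register = new client.Registry();

// Counts every HTTP request; color and version identify the environment.
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests by method, route, status, color and version",
  labelNames: ["method", "route", "status", "color", "version"],
  registers: [register],
});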

Every request to /health or / increments the counter with the correct color label. Blue pods increment with color="blue". Green pods increment with color="green". When traffic switches, the accumulation shifts from one color to the other and Prometheus captures that shift in its time-series data.
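To see the effect of the color label, the counter values exposed on /metrics can be summed per color. The sketch below parses Prometheus text-format lines with awk. The sample lines, including the route, status, and version values, are made up for illustration and were not captured from the cluster.

```shell
#!/bin/bash
# sum_by_color: read Prometheus text exposition on stdin and print the total
# of http_requests_total per color label, one "color total" pair per line.
sum_by_color() {
  awk 'index($0, "http_requests_total{") == 1 {
         if (match($0, /color="[^"]*"/)) {
           color = substr($0, RSTART + 7, RLENGTH - 8)
           sum[color] += $NF
         }
       }
       END { for (c in sum) printf "%s %d\n", c, sum[c] }' | sort
}

# Hypothetical sample: lines as they might appear on /metrics of one blue
# pod and one green pod, concatenated.
sum_by_color <<'EOF'
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/health",status="200",color="blue",version="v1"} 40
http_requests_total{method="GET",route="/",status="200",color="blue",version="v1"} 2
http_requests_total{method="GET",route="/health",status="200",color="green",version="v2"} 7
EOF
```

For this sample the helper prints blue 42 and green 7. Prometheus does the same aggregation continuously, which is what makes the shift between colors visible over time.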

The Metrics Endpoint Bug That Cost a Rebuild

I updated the code, rebuilt the image, deployed to the cluster, and tested the /metrics endpoint. It returned 404. The pods were running the old image without the metrics route despite a successful push to ECR.

The reason: for any tag other than latest, imagePullPolicy defaults to IfNotPresent. Since the image tag (blue) had not changed, the node reused its cached copy of the image instead of pulling the new one from ECR.

Two fixes, both required. First, make the node pull the image every time a pod starts:

# In deployment-blue.yaml and deployment-green.yaml
imagePullPolicy: Always

Second, rebuild without the Docker layer cache:

# When building
docker build --no-cache -t ... ./app

Always verify inside the pod before assuming the cluster has the right image:

POD=$(kubectl get pods -l version=blue -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -- wget -qO- http://localhost:3000/metrics | head -5

If that fails, the cluster has the wrong image regardless of what ECR shows.
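A complementary check that does not depend on any route existing is to compare the image digest the pod is running with the digest ECR holds for the tag. The helper below is a sketch that only does the string comparison. The digests in the example are placeholders, and it assumes a single-architecture image, where the two digests are expected to match.

```shell
#!/bin/bash
# digest_of: extract the sha256 digest from a pod imageID string such as
# "<registry>/<repo>@sha256:<hex>".
digest_of() {
  case "$1" in
    *@sha256:*) echo "sha256:${1##*@sha256:}" ;;
    *) return 1 ;;
  esac
}

# same_image: succeed only if the running digest equals the registry digest.
same_image() {
  [ -n "$1" ] && [ "$(digest_of "$1")" = "$2" ]
}

# Example with placeholder values (not real digests).
RUNNING="example.com/bluegreen-app@sha256:aaa111"
if same_image "$RUNNING" "sha256:bbb222"; then
  echo "pod runs the pushed image"
else
  echo "pod runs a stale image"
fi
```

On a real cluster the first value would come from kubectl get pod "$POD" -o jsonpath='{.status.containerStatuses[0].imageID}' and the second from aws ecr describe-images --repository-name bluegreen-app --image-ids imageTag=blue --query 'imageDetails[0].imageDigest' --output text.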

Connecting Prometheus to the Application

A Kubernetes ServiceMonitor tells Prometheus what to scrape:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bluegreen-monitor
  namespace: monitoring
  labels:
    release: monitoring
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: bluegreen-app
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

One important detail: the port: http references the port name in the Service, not the port number. If the Service port has no name, the scrape silently fails and you get no data. The Service must have a named port:

ports:
  - name: http
    port: 80
    targetPort: 3000

The Grafana Dashboard

The key PromQL queries for the Blue-Green Traffic Monitor dashboard:

rate(http_requests_total{color="blue"}[1m])
rate(http_requests_total{color="green"}[1m])

This is what the switch looked like on the dashboard during a live deployment:

Grafana showing blue dropping and green rising

At 21:49:00, the blue /health line dropped from 0.55 requests per second to 0.25. At the same moment, the green line appeared at 0.19 and climbed to match the previous blue rate. The crossover is visible as a V-shape. Both environments tracked simultaneously. The switch captured in data.

That is what a curl loop cannot show you.


Canary Releases: Limiting the Blast Radius

Blue-green is binary. You move from 0% to 100% in one command. If green has a bug, every user hits it. Canary releases add a middle step.

How Canary Works with NGINX

NGINX Ingress supports traffic splitting using weight annotations. Instead of one ingress pointing at one service, you create two:

# Stable ingress: routes majority traffic to blue
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bluegreen-ingress-stable
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bluegreen-service-blue
                port:
                  number: 80

The canary ingress repeats the same rule and adds two annotations:

# Canary ingress: routes weighted percentage to green
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: bluegreen-ingress-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: bluegreen-service-green
                port:
                  number: 80

With canary-weight: "5", approximately 5% of requests go to green and 95% go to blue. Users hit the same public URL. NGINX handles the split transparently.

The Canary Progression

Here is what a canary release looks like in practice. Send 20 requests and observe the distribution:

for i in $(seq 1 20); do
  curl -s http://$INGRESS_HOST/health | grep -o '"color":"[^"]*"'
done

Output at 5% canary:

"color":"blue"
"color":"blue"
"color":"blue"
"color":"green"
"color":"blue"
"color":"blue"
...

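18 blue, 2 green. The weight is applied per request at random, so a small sample only approximates it: at 5% the average is 1 green response in 20, and seeing 2 is ordinary variation.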
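Counting by eye gets unreliable once the sample grows. A small helper, sketched here with the same grep pattern as the loop above, tallies responses per color. The JSON shape of the sample input is a stand-in for real curl output.

```shell
#!/bin/bash
# tally_colors: read /health response bodies on stdin and print
# "color count share" for each color seen.
tally_colors() {
  grep -o '"color":"[^"]*"' | cut -d'"' -f4 | sort | uniq -c |
    awk '{ count[$2] = $1; total += $1 }
         END { for (c in count) printf "%s %d %.1f%%\n", c, count[c], 100 * count[c] / total }' |
    sort
}

# Stand-in for 20 responses: 18 from blue, 2 from green.
{
  for i in $(seq 1 18); do echo '{"status":"ok","color":"blue"}'; done
  for i in $(seq 1 2); do echo '{"status":"ok","color":"green"}'; done
} | tally_colors
```

Against the live ingress, the curl loop can be piped straight into tally_colors, and a few hundred requests give a share much closer to the configured weight than 20 do.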

Progress the canary by updating the weight annotation:

# Move to 50%
kubectl annotate ingress bluegreen-ingress-canary \
  nginx.ingress.kubernetes.io/canary-weight="50" --overwrite

# Full cutover
kubectl annotate ingress bluegreen-ingress-canary \
  nginx.ingress.kubernetes.io/canary-weight="100" --overwrite

At 100%, all 20 requests return green. The canary is complete.

Canary at 5% showing 2 green responses out of 20

Canary at 100% showing all green

The value of canary is simple: if green has a bug, only about 5% of requests hit it. You catch it on Grafana before it affects everyone. The other 95% never knew anything happened.
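The weight changes shown earlier can also be stepped through by a script instead of typed by hand. This is a minimal sketch: the stages (5, 25, 50, 100) and the pause length are illustrative choices, not values from the repository, and it defaults to a dry run that only prints the commands.

```shell
#!/bin/bash
# Sketch: step the canary weight up in stages with a pause between steps.
DRY_RUN="${DRY_RUN:-1}"              # 1 = print commands only, 0 = execute them
PAUSE_SECONDS="${PAUSE_SECONDS:-60}" # illustrative pause between stages

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# promote_canary WEIGHT...: apply each weight in order, pausing after each.
promote_canary() {
  local weight
  for weight in "$@"; do
    run kubectl annotate ingress bluegreen-ingress-canary \
      "nginx.ingress.kubernetes.io/canary-weight=${weight}" --overwrite || return 1
    # A real pipeline would check error rates here before moving on.
    run sleep "$PAUSE_SECONDS"
  done
}

promote_canary 5 25 50 100
```

The pause is where a health gate belongs: if the canary looks bad at any stage, setting the weight back to 0 sends all traffic to the stable ingress again.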


Automated Rollback: Zero Intervention

Canary limits the blast radius. Automated rollback removes the human from the recovery loop entirely.

The Rollback Script

After every traffic switch, a script checks pod health every 15 seconds for 2 minutes. If more than 5% of the pods in the live environment report unhealthy, it automatically patches the Service selector back to the previous environment. With only two pods per environment, that means a single unhealthy pod triggers the rollback.

#!/bin/bash

PREVIOUS_ENV=$1
THRESHOLD=5
WATCH_SECONDS=120
INTERVAL=15
ELAPSED=0

CURRENT=$(kubectl get service bluegreen-service \
  -o jsonpath='{.spec.selector.version}')

echo "Watching error rate for ${WATCH_SECONDS}s"
echo "Current live environment: ${CURRENT}"
echo "Will roll back to ${PREVIOUS_ENV} if error rate exceeds ${THRESHOLD}%"

while [ $ELAPSED -lt $WATCH_SECONDS ]; do
  sleep $INTERVAL
  ELAPSED=$((ELAPSED + INTERVAL))

  TOTAL=0
  ERRORS=0

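  for POD in $(kubectl get pods -l version="${CURRENT}" \
    -o jsonpath='{.items[*].metadata.name}'); do
    TOTAL=$((TOTAL + 1))
    # Count the pod as unhealthy if the request fails or the output looks like an error.
    if ! RESPONSE=$(kubectl exec "$POD" -- \
      wget -qO- http://localhost:3000/health 2>&1) ||
      echo "$RESPONSE" | grep -q "unhealthy\|500\|error"; then
      ERRORS=$((ERRORS + 1))
    fi
  done

  # No pods found means nothing is serving traffic: treat it as a full failure
  # instead of dividing by zero.
  if [ "$TOTAL" -eq 0 ]; then
    ERROR_RATE=100
  else
    ERROR_RATE=$(( (ERRORS * 100) / TOTAL ))
  fi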
  echo "Check at ${ELAPSED}s: ${ERRORS}/${TOTAL} pods unhealthy (${ERROR_RATE}%)"

  if [ $ERROR_RATE -gt $THRESHOLD ]; then
    echo "ERROR RATE ${ERROR_RATE}% EXCEEDS THRESHOLD ${THRESHOLD}%"
    echo "Rolling back to ${PREVIOUS_ENV}..."
    kubectl patch service bluegreen-service \
      -p "{\"spec\":{\"selector\":{\"app\":\"bluegreen-app\",\"version\":\"${PREVIOUS_ENV}\"}}}"
    echo "Rolled back to ${PREVIOUS_ENV} at ${ELAPSED}s"
    exit 1
  fi

  echo "Error rate ${ERROR_RATE}% is within threshold. Continuing."
done

echo "No issues detected after ${WATCH_SECONDS}s. Release is stable."
exit 0

Why It Checks Pods Directly Instead of the Ingress URL

The first version of this script sent HTTP requests to the ingress URL and watched for non-200 responses. It always reported 0% errors even when green was broken.

The reason: Kubernetes readiness probes prevented the broken green pods from ever receiving traffic through the Service. So the ingress was always routing to healthy pods regardless of what the Service selector said.

Checking pod health directly via kubectl exec bypasses the Service and Ingress entirely. It gives a direct signal from the pod itself, which is exactly the signal you need to decide whether to roll back.
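Because the signal is a count of pods, the percentage moves in coarse steps. The sketch below reproduces the script's integer arithmetic in two helper functions (the function names are illustrative) to show what a 5% threshold means at different pod counts. Reporting zero pods as 100% is an assumption made here, on the grounds that nothing is serving traffic.

```shell
#!/bin/bash
# unhealthy_percent ERRORS TOTAL: integer share of unhealthy pods.
# Zero pods is reported as 100 (assumption: nothing is serving traffic).
unhealthy_percent() {
  local errors=$1 total=$2
  if [ "$total" -eq 0 ]; then
    echo 100
  else
    echo $(( errors * 100 / total ))
  fi
}

# should_roll_back ERRORS TOTAL THRESHOLD: succeeds when the share of
# unhealthy pods is strictly greater than the threshold.
should_roll_back() {
  [ "$(unhealthy_percent "$1" "$2")" -gt "$3" ]
}

for total in 2 20 40; do
  if should_roll_back 1 "$total" 5; then verdict="roll back"; else verdict="keep"; fi
  echo "1 of ${total} pods unhealthy: $(unhealthy_percent 1 "$total")% -> ${verdict}"
done
```

With one unhealthy pod, this prints 50% and a rollback for 2 pods, 5% and keep for 20 pods, and 2% and keep for 40 pods. At the article's two pods per environment, the threshold behaves as "any pod unhealthy".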

Live Demonstration

Set green to return 500 errors using a FORCE_ERROR environment variable:

kubectl set env deployment/bluegreen-green FORCE_ERROR=true

Switch traffic to broken green, then start the watcher:

kubectl patch service bluegreen-service \
  -p '{"spec":{"selector":{"app":"bluegreen-app","version":"green"}}}'

bash k8s/auto-rollback.sh blue

Output:

Watching error rate for 120s
Current live environment: green
Will roll back to blue if error rate exceeds 5%
Check at 15s: 2/2 pods unhealthy (100%)
ERROR RATE 100% EXCEEDS THRESHOLD 5%
Rolling back to blue...
service/bluegreen-service patched
Rolled back to blue at 15s

Automated rollback firing at 15 seconds

15 seconds. No human intervention. No dashboard watching. No alert. The system detected the problem and fixed itself.


The Complete System

Developer
    |
    | git push to main
    v
GitHub Actions (29 seconds average)
    |
    +-- Build and push image to ECR
    +-- Deploy to idle environment
    +-- Health check
    +-- Switch traffic
    +-- Run auto-rollback watcher for 2 minutes
    |
    v
Amazon EKS (Terraform-provisioned, 17 resources)
    |
    +-- NGINX Ingress
    |       +-- 95% traffic to blue (stable)
    |       +-- 5% traffic to green (canary)
    |
    +-- Prometheus (scrapes /metrics every 15s)
    +-- Grafana (request rates per environment, live)
    |
    +-- Blue Deployment  (2 pods)
    +-- Green Deployment (2 pods)

Every component has a specific job. Terraform makes it reproducible. Prometheus makes it observable. Grafana makes it visible. Canary limits the risk. Automated rollback removes the human from recovery.


Key Takeaways

Infrastructure as code is not optional for anything you want to rebuild. The first time you use eksctl create cluster you think it is fine. The second time you rebuild, you realise you have forgotten half the steps.

imagePullPolicy: Always is required when you reuse image tags. Without it, Kubernetes keeps running the image cached on the node even after you push a new one. This wastes hours of debugging.

Check pod health directly, not through the ingress. Readiness probes protect the ingress from broken pods. That is good for users but bad for rollback scripts that rely on seeing errors through the ingress URL.

The V-shape on the Grafana graph is the proof. When blue drops and green rises simultaneously on the request rate dashboard, you can see exactly when the switch happened, how fast it was, and that both environments were healthy throughout.

Canary and automated rollback together close the loop. Canary limits how many users see a bad release. Automated rollback removes the human from fixing it. Together they move you from a pipeline that requires human vigilance to one that is genuinely self-managing.


The Repository

Everything in both parts of this series is at:

github.com/gbadedata/zero-downtime-bluegreen-eks

27 screenshots from the live deployment. All Kubernetes manifests, Terraform configuration, GitHub Actions workflows, ServiceMonitor, canary ingresses, and the automated rollback script.


Part 1: Your Deployments Are Causing Downtime. Mine Do Not. Here Is Why.

Top comments (2)

Harjot Singh

"But I kept thinking about what happens after the switch" is the question that separates a deploy pipeline from a deployment system. Zero-downtime is the easy half, anyone can prove it with a curl loop; the hard half is the subtle bug that only surfaces under real production load minutes later, when the curl loop is long green and a human has moved on. That's the gap where "zero downtime" quietly becomes "downtime, just delayed." Self-healing (watch the new version on real traffic, auto-rollback on a health/SLO regression) is the right answer because it removes the someone-has-to-be-watching dependency, and that dependency is exactly what fails at 2am. I built almost this same thing for my own deploys, the key lesson was that the rollback trigger has to be a real signal (error-rate or latency SLO breach) not just liveness, because a process can be up and serving garbage. What's your rollback criterion, a Prometheus alert threshold, or a comparison against the old version's baseline during a bake window? This whole self-healing-deploy problem is core to how I ship Moonshift.

Oluwagbade Odimayo

Thanks, this is exactly the point I was trying to get at. A curl loop proves the traffic switch worked, but it does not prove the release is safe after it starts receiving real traffic. That gap after the switch is where a lot of “zero-downtime” claims become weak.

In my current implementation, the rollback criterion is deliberately simple: after the switch, the script watches the newly live environment for two minutes and rolls back if more than 5% of the pods report unhealthy/error responses. I made it check the pods directly instead of the ingress because readiness probes can hide broken pods from external traffic, which would make the rollback signal look clean when the new version is actually bad.

That said, I agree with your point: the stronger production-grade version should not stop at liveness or basic health. The better criterion is an SLO-based trigger from Prometheus: elevated 5xx rate, latency degradation, or a comparison between the new version and the previous stable baseline during a bake window. That is the next logical improvement: move from “is the pod healthy?” to “is the new version behaving at least as well as the old version under real traffic?”

So the honest answer is: my current rollback is health/error-threshold based, but the production direction is Prometheus-driven SLO regression detection against the old baseline. That is where self-healing deployment becomes much more credible. Your point about a process being up while still serving garbage is dead right. My article already mentions direct pod-health rollback and the 5% threshold, but this comment exposes the next level the system needs.