Carlos Filho for AWS Community Builders

Posted on Jun 2

How I would put an SLA operations portal on AWS (and what I weighed along the way)

#aws #python #devops #observability

Context

I maintain an internal portal that tracks SLA across several operational queues in Jira. It reads tickets through the Jira API, hours through the Tempo API, calculates a health score per team, and shows everything in a dashboard. Today it runs as a simple Flask server, with an in memory cache and a few JSON files to keep history.

It works. But "runs on my machine" is not architecture. So I sat down to think: if I needed to ship this for real, with people depending on it, what would it look like on AWS?

This post is that exercise. It is not a step by step Terraform tutorial. It is more about the decisions and the trade-offs. I will try to be honest about what makes sense and what would be overkill.

Portal home screen: consolidated health score and per team.

What the portal does today

Before throwing AWS on top of it, it helps to understand the parts:

Backend: a Flask server in Python. It exposes a few REST endpoints (/api/queues/:id/data, /api/queues/overview, /api/trends, and so on).
Frontend: a single HTML file with a fair amount of JavaScript and Chart.js. No framework, no build step.
External integrations: the Jira REST API and the Tempo API. That is where all the data comes from.
Cache: an in memory dictionary with a 5 minute TTL. The first call to a large queue (like Service Desk, with around 2,500 tickets) takes about 30 seconds; the next ones come back instantly.
Persistence: two JSON files. One is the queue registry (configuration), the other keeps daily snapshots of the health score to build trend charts.
Daily job: once a day it takes a "snapshot" of all queues and saves it to history.
Secrets: Jira and Tempo tokens, today in environment variables.

Notice that it is a small system. That matters, because the temptation to throw ten AWS services at a project this size is strong and almost always wrong.

Operations page for one queue: first response and resolution SLA, multidimensional health score, and distribution by analyst.

The AWS design

I will split it by responsibility.

Where the backend runs

The first question is: Lambda or container?

The backend has one annoying detail. The call that fetches all tickets from a large queue can take more than 30 seconds because of Jira API pagination. Lambda has a maximum timeout of 15 minutes, so it would fit, but a cold start on top of a synchronous call that already takes 30 seconds makes me uncomfortable for a UI. Nobody likes clicking and waiting half a minute.
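To make that wait concrete, here is a minimal sketch of the pagination loop. It assumes offset style pagination (startAt, maxResults, total) and a hypothetical fetch_page callable wrapping the HTTP request; neither comes from the portal code, and the endpoint the portal really uses may paginate differently.

```python
from typing import Any, Callable, Dict, Iterator, List

# Assumed page shape, modelled on offset pagination:
# {"total": 2500, "issues": [...]}
Page = Dict[str, Any]


def iter_issues(fetch_page: Callable[[int, int], Page], page_size: int = 100) -> Iterator[dict]:
    """Yield every issue, requesting one page at a time.

    fetch_page(start_at, max_results) is a hypothetical wrapper around the
    HTTP call; it is injected so the loop can run offline.
    """
    start_at = 0
    while True:
        page = fetch_page(start_at, page_size)
        issues: List[dict] = page.get("issues", [])
        yield from issues
        start_at += len(issues)
        # Stop on an empty page or once the reported total is reached.
        if not issues or start_at >= int(page.get("total", 0)):
            break


def fake_fetch(start_at: int, max_results: int) -> Page:
    """Stand-in for the API: 250 made-up issues served in slices."""
    all_issues = [{"key": f"SD-{n}"} for n in range(250)]
    return {"total": len(all_issues), "issues": all_issues[start_at:start_at + max_results]}


issues = list(iter_issues(fake_fetch))
print(len(issues))
```

With around 2,500 tickets and an assumed page size of 100, that is 25 round trips made one after another, which is where the time goes.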

So I went with ECS Fargate: a container running Flask behind Gunicorn (not the Flask dev server), deployed as a service. I do not have to manage EC2, I pay only for the vCPU and memory the tasks reserve while they run, and I skip the cold start dance on the heavier requests. A long running task also keeps the in memory cache warm between requests.
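One thing worth checking when swapping the dev server for Gunicorn: its default worker timeout is 30 seconds, and with the default sync worker a request that runs past it gets its worker killed. That is uncomfortably close to the slow first call. A sketch of a gunicorn.conf.py (Gunicorn config files are plain Python); every value here is an assumption, not a tuned number.

```python
# gunicorn.conf.py: illustrative values, not tuned for the real portal
bind = "0.0.0.0:8080"      # same port as containerPort in the task definition
worker_class = "gthread"   # threaded workers, so one slow Jira call does not block the rest
workers = 2
threads = 4
timeout = 90               # default is 30 s; with sync workers a longer request gets its worker restarted
graceful_timeout = 30      # time given to in-flight requests on shutdown
accesslog = "-"            # access log to stdout, which the awslogs driver picks up
errorlog = "-"             # error log to stderr
```

The container would then start with something like gunicorn -c gunicorn.conf.py app:app, where app:app is a placeholder for the real module and Flask object.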

I put an Application Load Balancer in front. It handles HTTPS (with an ACM certificate) and distributes across the containers. If usage grows one day, I turn on ECS service auto scaling on CPU. Today, with the current team, one or two tasks are enough.

One alternative I considered was App Runner. It is simpler than plain Fargate and it abstracts away the load balancer. For a project this size, App Runner would probably be enough and would give me less to configure. I stayed with Fargate because I know it better and I want control over the networking, but if I were in a hurry, App Runner would be a defensible choice.

The frontend

The frontend is static HTML. It does not need a server for that.

I drop the file in an S3 bucket and put CloudFront in front. CloudFront caches the HTML close to the user and serves it over HTTPS. Flask stops having the job of serving the HTML and becomes only an API.

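A practical detail: since the frontend calls the API at /api/..., I configure CloudFront with two origins. A request that hits /api/* goes to the Load Balancer (the backend); everything else comes from S3. That way the user only knows one domain and I do not have to worry about CORS. One thing to watch: by default CloudFront waits 30 seconds for an origin to respond, which is right around what the slow first call to a large queue takes, so I would raise that timeout on the API origin.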

Persistence

This is where I had to hold back the most so I would not overcomplicate things.

The two JSON files have different natures:

Queue registry (queue_registry.json): it is configuration. It rarely changes, and only by hand. This can stay a file, just in an S3 bucket instead of local disk. The backend reads it at startup (or with a short cache). I do not need a database for this.

Snapshot history (trends.json): this grows every day and I run queries like "give me the last 30 days of this queue." A single JSON file that keeps growing is a bad idea in the medium term, because every read loads the whole file just to grab a slice. Here DynamoDB fits well. Each snapshot becomes an item, with the queue as the partition key and the date as the sort key. I can fetch a date range for one queue without reading the entire history. And it is serverless, so I do not have a database instance to babysit on a small project.

I could have gone with RDS, but I have no complex relationships and no JOINs. It would mean paying for something I do not use. DynamoDB with a simple access pattern (queue key plus date) does the job.
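The "queue key plus date" access pattern could look roughly like this. It is a sketch under assumptions: the table name, the attribute names (queue_id, snapshot_date, health_score) and the example date are all made up for illustration. The functions only build the request payloads in DynamoDB's low level attribute format, so they run without AWS credentials.

```python
from datetime import date, timedelta
from typing import Any, Dict

TABLE_NAME = "sla-portal-trends"  # hypothetical table name


def snapshot_item(queue_id: str, day: date, health_score: float) -> Dict[str, Any]:
    """One daily snapshot as a DynamoDB item (low-level attribute format)."""
    return {
        "queue_id": {"S": queue_id},                # partition key: the queue
        "snapshot_date": {"S": day.isoformat()},    # sort key: ISO dates sort chronologically
        "health_score": {"N": str(health_score)},   # numbers travel as strings in this format
    }


def last_days_query(queue_id: str, today: date, days: int = 30) -> Dict[str, Any]:
    """Keyword arguments for a Query that covers the last `days` days of one queue."""
    start = today - timedelta(days=days - 1)
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "queue_id = :q AND snapshot_date BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":q": {"S": queue_id},
            ":start": {"S": start.isoformat()},
            ":end": {"S": today.isoformat()},
        },
    }


item = snapshot_item("service-desk", date(2025, 6, 2), 87.5)
query = last_days_query("service-desk", date(2025, 6, 2))
print(query["ExpressionAttributeValues"][":start"]["S"])
```

With boto3 these would be passed as client.put_item(TableName=TABLE_NAME, Item=item) and client.query(**query). The query touches only the items of one queue inside the date range, which is the point of moving off a single JSON file.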

The cache

Today the cache lives in process memory. It works because it is a single process. The moment I have two tasks on Fargate, each one has its own cache, and that is where the inconsistency starts. One user gets data from 1 minute ago, another from 4 minutes ago.

The standard answer for this is ElastiCache (Redis). A cache shared across the tasks, with a TTL. The tasks write to and read from the same place.

But, to be honest, for this volume I might not even stand up Redis on day one. Keeping the cache in memory with two tasks means that, in the worst case, the data can be 5 minutes "out of phase" between one task and another. For an SLA dashboard that refreshes every 5 minutes anyway, that is tolerable. I would add Redis when the number of tasks grows or when the cost of re-fetching from Jira starts to hurt. It is one of those things you can put off without shame.
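Putting Redis off is easier if the request handlers only ever talk to a small get/set interface. A minimal sketch of the in memory version, assuming a 300 second TTL to match the 5 minute cache; the class and function names are illustrative, not the portal's.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process cache where every entry expires after a fixed time to live."""

    def __init__(self, ttl_seconds: float = 300, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def get_queue_data(cache: TTLCache, queue_id: str, fetch: Callable[[str], Any]) -> Any:
    """Return cached data for a queue, calling `fetch` only on a miss."""
    key = f"queue:{queue_id}"
    data = cache.get(key)
    if data is None:
        data = fetch(queue_id)
        cache.set(key, data)
    return data


# Demo with a fake clock and a fetch function that records its calls.
now = [0.0]
calls = []
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])


def fetch(queue_id: str) -> dict:
    calls.append(queue_id)
    return {"queue": queue_id}


get_queue_data(cache, "service-desk", fetch)   # miss: fetches
get_queue_data(cache, "service-desk", fetch)   # hit: served from memory
now[0] = 301.0
get_queue_data(cache, "service-desk", fetch)   # expired: fetches again
print(len(calls))
```

When the day for Redis comes, a class with the same get and set methods backed by a shared Redis instance could replace TTLCache without touching get_queue_data.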

The daily job

The daily snapshot is a cron today. On AWS that becomes EventBridge Scheduler firing once a day.

What does it fire? Here Lambda makes sense. The job does not have the UI latency problem: it runs in the background, it can take its 30 to 40 seconds fetching all the queues, and nobody is waiting on a screen. The one setting to remember is the function timeout, which defaults to 3 seconds and has to be raised. So: a Lambda function that takes the snapshot and writes to DynamoDB. Cheap, with no server running all day just to run once.
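The core of that function could be sketched like this. The health calculation and the DynamoDB write are passed in as callables, so compute_health and save below are placeholders for the portal's real functions, not their actual names.

```python
from datetime import date
from typing import Callable, Dict, Iterable, List


def take_snapshot(
    queue_ids: Iterable[str],
    compute_health: Callable[[str], float],
    save: Callable[[str, date, float], None],
    today: date,
) -> Dict[str, List[str]]:
    """Snapshot every queue; keep going if one queue fails, then report."""
    saved: List[str] = []
    failed: List[str] = []
    for queue_id in queue_ids:
        try:
            save(queue_id, today, compute_health(queue_id))
            saved.append(queue_id)
        except Exception as exc:  # one bad queue should not cost the others their snapshot
            print(f"snapshot failed for {queue_id}: {exc}")
            failed.append(queue_id)
    if failed:
        # Raising marks the invocation as failed, so the function's
        # Errors metric in CloudWatch goes up and an alarm can react.
        raise RuntimeError(f"snapshot incomplete, failed queues: {failed}")
    return {"saved": saved}


# Demo with stand-ins for the real health calculation and the database write.
store = {}


def fake_save(queue_id: str, day: date, score: float) -> None:
    store[(queue_id, day.isoformat())] = score


result = take_snapshot(["service-desk", "infra"], lambda q: 87.5, fake_save, date(2025, 6, 2))
print(result)
```

In the Lambda itself, a handler(event, context) function would call take_snapshot with the real Jira backed functions and date.today(). Letting the exception escape, instead of swallowing it, is what lets a missed snapshot show up as a failed invocation rather than as a silent gap in the chart.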

Secrets

The Jira and Tempo tokens cannot sit in a plain text environment variable in a serious setup. They go to Secrets Manager. On ECS, the task definition references each secret and the value is injected when the container starts, using the task execution role; the Lambda fetches its tokens at runtime through its own IAM role. Bonus: Secrets Manager does rotation if we configure it, although a third party API token does not always support automatic rotation.
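For the Lambda side, fetching a token at runtime could look like the sketch below. It assumes each secret is stored as a JSON object and uses a made-up secret name. The client is passed in, so a stub stands in for boto3.client("secretsmanager") here, and get_secret_value(SecretId=...) is the only call used.

```python
import json
from typing import Any, Dict


class SecretCache:
    """Fetch each secret once per process and reuse it afterwards.

    `client` is anything exposing get_secret_value(SecretId=...), such as
    boto3's Secrets Manager client; it is injected so a stub can replace it.
    """

    def __init__(self, client: Any):
        self._client = client
        self._values: Dict[str, Dict[str, str]] = {}

    def get(self, secret_id: str) -> Dict[str, str]:
        if secret_id not in self._values:
            response = self._client.get_secret_value(SecretId=secret_id)
            # Assumes the secret was stored as a JSON object of key/value pairs.
            self._values[secret_id] = json.loads(response["SecretString"])
        return self._values[secret_id]


class StubClient:
    """Offline stand-in for the real Secrets Manager client."""

    def __init__(self) -> None:
        self.calls = 0

    def get_secret_value(self, SecretId: str) -> Dict[str, str]:
        self.calls += 1
        return {"SecretString": json.dumps({"token": "not-a-real-token"})}


stub = StubClient()
secrets = SecretCache(stub)
token = secrets.get("sla-portal/jira")["token"]   # hypothetical secret name
secrets.get("sla-portal/jira")                    # second read comes from memory
print(stub.calls)
```

Caching per process avoids one API call per request. The trade-off is that a rotated token is only picked up when a new process or execution environment starts.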

Logs and the minimum of observability

Everything sends logs to CloudWatch Logs. That comes almost for free with Fargate and Lambda.

I would set at least two alarms in CloudWatch:

If the 5xx error rate on the Load Balancer rises, I want to know.
If the daily job fails (the Lambda errors out), I want to know, because otherwise the trend chart gets a gap and I only find out weeks later.
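The two alarms above can be sketched as plain parameter dictionaries for CloudWatch's PutMetricAlarm. Alarm names, thresholds and the SNS topic are assumptions. The first alarm uses a simple 5xx count as a stand-in for a true error rate, which would need metric math.

```python
from typing import Any, Dict


def alb_5xx_alarm(load_balancer: str, topic_arn: str, threshold: int = 10) -> Dict[str, Any]:
    """Alarm when the targets behind the ALB return too many 5xx in 5 minutes."""
    return {
        "AlarmName": "sla-portal-alb-5xx",           # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,                      # assumed value, tune to real traffic
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",          # no traffic means no data points, not an outage
        "AlarmActions": [topic_arn],
    }


def snapshot_failed_alarm(function_name: str, topic_arn: str) -> Dict[str, Any]:
    """Alarm as soon as the daily snapshot Lambda reports any error."""
    return {
        "AlarmName": "sla-portal-snapshot-failed",   # hypothetical name
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 3600,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


TOPIC = "arn:aws:sns:us-east-1:123456789012:sla-portal-alerts"  # placeholder ARN
alarms = [
    alb_5xx_alarm("app/sla-portal/0123456789abcdef", TOPIC),
    snapshot_failed_alarm("sla-portal-snapshot", TOPIC),
]
print([a["AlarmName"] for a in alarms])
```

Each dictionary could be passed to boto3 as cloudwatch.put_metric_alarm(**params), or translated one to one into Terraform aws_cloudwatch_metric_alarm resources. With notBreaching, a job that never runs at all would stay silent, so a check on the Invocations metric may be worth adding later.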

I am not going to build an elaborate observability dashboard for a project this size. The CloudWatch logs and two alarms cover what matters.

Putting it all together

In the AWS version, the user would reach the portal through a domain like this:

https://d1a2b3c4d5e6f7.cloudfront.net

(that is the domain CloudFront generates automatically; in production I would point a name like cldops.mycompany.com to it through Route 53 and attach an ACM certificate for that name, issued in us-east-1, to the distribution)

The flow looks like this:

User reaches https://d1a2b3c4d5e6f7.cloudfront.net, hits CloudFront.
CloudFront serves the HTML from S3.
The JavaScript calls /api/..., CloudFront routes it to the Load Balancer, then to Fargate (Flask/Gunicorn).
Flask checks the cache (memory or Redis), and if needed, fetches from Jira/Tempo.
Queue configuration comes from a JSON in S3; history comes from DynamoDB.
Once a day, EventBridge fires a Lambda that takes the snapshot and writes to DynamoDB.
Tokens live in Secrets Manager; logs and alarms in CloudWatch.

As a box diagram, it looks roughly like this:

                          ┌──────────────────┐
   user ────HTTPS───────► │   CloudFront     │
                          │ d1a2b3c4d5e6f7   │
                          │ .cloudfront.net  │
                          └────────┬─────────┘
                          /            \
                   /api/* │             │ rest
                          ▼             ▼
                  ┌──────────────┐   ┌───────────┐
                  │     ALB      │   │    S3     │
                  └──────┬───────┘   │ (HTML)    │
                         ▼           └───────────┘
                  ┌──────────────┐
                  │ ECS Fargate  │──► Jira API
                  │ Flask/Gunic. │──► Tempo API
                  └──────┬───────┘
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
        ┌─────────┐ ┌─────────┐ ┌──────────┐
        │DynamoDB │ │  S3     │ │ Secrets  │
        │(trends) │ │(config) │ │ Manager  │
        └─────────┘ └─────────┘ └──────────┘

   EventBridge (1x/day) ──► Lambda (snapshot) ──► DynamoDB

The cross queue comparison: you can see at a glance which queue is healthy and which one needs attention.

How this becomes IaC

A diagram is nice, but it does not deploy anything. I would write this in Terraform so I do not end up clicking around the console and forgetting what I did. I am not going to paste the whole module here (it would be huge), but I can show the skeleton so you get a sense of the size.

The file structure I would use:

infra/
├── main.tf            # providers and state backend
├── variables.tf       # region, project name, etc
├── network.tf         # VPC, subnets, security groups
├── ecs.tf             # cluster, task definition, service
├── alb.tf             # load balancer and target group
├── frontend.tf        # S3 bucket + CloudFront
├── data.tf            # DynamoDB + config bucket
├── scheduler.tf       # EventBridge + snapshot Lambda
├── secrets.tf         # Secrets Manager
└── outputs.tf         # CloudFront URL, etc

A piece of frontend.tf, which is what generates that CloudFront domain:

# Bucket that holds the dashboard HTML
resource "aws_s3_bucket" "frontend" {
  bucket = "${var.project}-frontend"
}

resource "aws_cloudfront_distribution" "portal" {
  enabled             = true
  default_root_object = "dashboard.html"

  # Origin 1: the static HTML in S3
  origin {
    domain_name              = aws_s3_bucket.frontend.bucket_regional_domain_name
    origin_id                = "s3-frontend"
    origin_access_control_id = aws_cloudfront_origin_access_control.frontend.id
  }

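  # Origin 2: the API on the load balancer
  origin {
    # With https-only, the certificate on the ALB has to match this name,
    # so in practice this would be a custom DNS name that points at the ALB.
    domain_name = aws_lb.api.dns_name
    origin_id   = "alb-api"
    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
      # Default is 30 s, which the slow first call to a large queue can exceed
      origin_read_timeout    = 60
    }
  }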

  # Everything that is NOT /api/* comes from S3
  default_cache_behavior {
    target_origin_id       = "s3-frontend"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = data.aws_cloudfront_cache_policy.optimized.id
  }

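  # /api/* goes to the backend, no cache (dynamic data)
  ordered_cache_behavior {
    path_pattern           = "/api/*"
    target_origin_id       = "alb-api"
    viewer_protocol_policy = "redirect-to-https"
    # CloudFront accepts only three method sets: GET/HEAD, GET/HEAD/OPTIONS, or all seven
    allowed_methods        = ["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = data.aws_cloudfront_cache_policy.disabled.id
    # Without an origin request policy, query strings and most headers never reach Flask.
    # Assumes a data source for the managed "Managed-AllViewerExceptHostHeader" policy,
    # declared the same way as the cache policy data sources.
    origin_request_policy_id = data.aws_cloudfront_origin_request_policy.all_viewer_except_host.id
  }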

  restrictions {
    geo_restriction { restriction_type = "none" }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}

output "portal_url" {
  value = "https://${aws_cloudfront_distribution.portal.domain_name}"
}

Notice the detail with the two cache_behavior blocks: I let CloudFront cache the static HTML freely, but I mark /api/* as no cache, because it is dynamic SLA data. If I cached the API at CloudFront, it would fight with the 5 minute cache that already exists in the backend and the data would go stale in ways that are hard to debug.

And outputs.tf prints the URL at the end of terraform apply:

# after the apply:
# portal_url = "https://d1a2b3c4d5e6f7.cloudfront.net"

A snippet of ecs.tf to give an idea of how the backend is declared:

resource "aws_ecs_task_definition" "api" {
  family                   = "${var.project}-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.ecs_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name  = "flask-api"
    image = "${aws_ecr_repository.api.repository_url}:latest"
    portMappings = [{ containerPort = 8080 }]
    # tokens come from Secrets Manager, not hardcoded
    secrets = [
      { name = "JIRA_API_TOKEN", valueFrom = aws_secretsmanager_secret.jira.arn },
      { name = "TEMPO_API_TOKEN", valueFrom = aws_secretsmanager_secret.tempo.arn }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.api.name
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "flask"
      }
    }
  }])
}

This is not the full file. It is missing the aws_ecs_service, the auto scaling, the IAM policies with the right permission for the secrets, the network.tf with the VPC. But you can see it is not magic: it is declaring each box from the diagram and wiring one to the other by their IDs.

If I were doing this for real, I probably would not even write all of it by hand. I would use a ready made community module for the VPC (terraform-aws-modules/vpc/aws is well tested) and spend my time on what is specific to the project.

What I left out on purpose

I think half the value of this kind of exercise is in what you do NOT include.

No Kubernetes (EKS). For a single container, EKS would be bringing a tractor to go to the bakery.
No separate API Gateway. The Load Balancer already handles the routing. API Gateway would make sense if I had several APIs, complex authorization, throttling per client. That is not the case.
No data lake, no Athena, no Glue. It is SLA history, not big data. DynamoDB handles it.
No multi region. It is an internal tool. If the region goes down for a few hours, the team looks at Jira directly. It is not worth the complexity.

Rough cost

I am not going to pretend I know the exact number, because it depends a lot on usage. But the order of magnitude for an internal system this size:

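Fargate: the biggest item, but with one or two small tasks it is manageable.
Load Balancer: a fixed hourly charge plus usage, so it costs something even with no traffic. Easy to forget next to Fargate.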
S3 + CloudFront: cents for this amount of traffic.
DynamoDB: on demand, paying per write/read, with one snapshot a day and occasional queries, it stays very low.
Lambda + EventBridge: runs once a day, basically free.
Secrets Manager: charges per secret per month, a few dollars.

The bulk of the cost is Fargate running all the time. If usage were very sporadic, then it would be worth reconsidering Lambda for the backend and accepting the cold start.

Migration, if I were to do it for real

I would not do everything at once. The order I would follow:

Containerize Flask and ship it to Fargate behind the Load Balancer. The system keeps using local JSON files inside the container (ephemeral, so history is lost on every redeploy, but with a single task it works to start).
Move the frontend to S3 + CloudFront.
Take the tokens out of the environment and put them in Secrets Manager.
Migrate the snapshot history from JSON to DynamoDB. This is the step that actually changes code.
Move the daily job to EventBridge + Lambda.
Only then, if needed, stand up Redis on ElastiCache.

Each step leaves the system working. No big bang.

The trends page uses the daily snapshots, which is exactly the data that would move to DynamoDB.

Wrapping up

The point I wanted to get across is that good architecture is not the one that uses the most services. It is the one that uses enough for the problem you have. This portal is small and will probably stay small. Its AWS version reflects that: Fargate for the backend, S3/CloudFront for the front, DynamoDB for the history, Lambda for the job, and the rest is plumbing (Secrets Manager, CloudWatch, IAM).

If it ever grows to dozens of teams and thousands of hits a day, I will revisit it. But designing for that hypothetical future now would mean spending time and money on a problem I do not have yet.

If you have a similar little project, an internal tool that "runs on your machine," I recommend doing this exercise. You learn more by designing the production version of something small than by copying an architecture diagram from a big company.

Comments and corrections are welcome. If you would have done something differently, tell me, especially on the cache part, which is where I was most unsure.
