
Landmines and Solutions in Self-Hosted CI/CD: 15 Runners x Shared Docker Environment [Part 7]

This article was originally published on Saru Blog.


What You Will Learn

  • Problem and solution patterns in self-hosted runner environments
  • How to prevent resource contention on a shared Docker daemon
  • How the birthday paradox broke RUN_ID-based port assignment
  • How to investigate when "CI that was working suddenly breaks"

Introduction

In Part 2, I wrote about automating E2E tests using WebAuthn and Mailpit. The tests themselves worked fine; the problem was the CI infrastructure.

Saru has 4 frontends x 4 backend APIs. E2E tests run independently for each portal, plus there are cross-portal tests (integration tests between portals). Running these in parallel means 7+ jobs executing simultaneously.

Initially, I used GitHub-hosted runners, but E2E tests require a database, Keycloak, and a mail server—lots of Docker containers. GitHub-hosted runners are slow to set up each time and cost-inefficient for parallel execution.

So I migrated to self-hosted runners. The decision was correct, but a flood of new problems emerged.

This article chronicles the problems encountered in a self-hosted CI environment and their solutions, in chronological order. I hope it helps anyone adopting a similar configuration.

1. Docker Desktop/WSL2 Was Too Unstable

The Initial Setup

I initially ran runners on Docker Desktop (WSL2 backend) on Windows. The setup was simple:

Windows Host
  └─ WSL2
      └─ Docker Desktop
          └─ GitHub Actions Runner x N

The problem: "containers randomly die." During E2E tests, the PostgreSQL container would suddenly vanish, or Keycloak would become unresponsive. docker inspect showed Exit Code: 137 (SIGKILL).

Tracing the cause led to Docker Desktop/WSL2's virtualization layer:

Container → Docker Engine → WSL2 → Hyper-V → Windows

WSL2 itself runs as a lightweight Hyper-V VM, with Docker stacked on top. When memory pressure rises inside that VM, the Linux OOM Killer fires and kills Docker containers indiscriminately.
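
If you suspect the OOM Killer, the kernel log on the Linux side (whether inside WSL2 or, later, the VM) usually shows the evidence. An illustrative check:

sudo dmesg -T | grep -iE "out of memory|killed process" | tail -n 5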

Migration to Hyper-V VM

The solution was to bypass WSL2 and create an Ubuntu VM directly on Hyper-V:

Container → Docker Engine → Ubuntu VM → Hyper-V → Windows

| Item | Value |
| --- | --- |
| VM Name | saru-ci-runner |
| OS | Ubuntu 24.04 |
| vCPU | 16 |
| Memory | 64GB |
| Disk | 200GB |
| Network | External Switch (bridged) |

Running Docker Engine directly in a Hyper-V VM gives a simpler stack and more stable memory management than Docker Desktop on WSL2. WSL2 allocates memory dynamically, shared with the host (by default 50% of host RAM or 8GB, whichever is less), while the Hyper-V VM gets a fixed allocation, reducing the risk of OOM Killer strikes.

On top of this, I deployed 15 GitHub Actions Runners as systemd services:

# saru-hyperv-1 through saru-hyperv-15
for i in $(seq 1 15); do
  sudo systemctl status actions.runner.ko-chan-saru.saru-hyperv-$i
done
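
Each runner instance is presumably registered with the standard config.sh / svc.sh flow that ships with the actions-runner package (the systemd unit names above match what svc.sh generates). A rough sketch of what the 15 registrations might look like; the install directories, tarball path, repository URL, and token variable are assumptions, not the exact commands used:

# Illustrative only: one directory per runner, each registered under a deterministic name
for i in $(seq 1 15); do
  mkdir -p "/opt/actions-runner-$i" && cd "/opt/actions-runner-$i"
  tar xzf /tmp/actions-runner-linux-x64.tar.gz
  ./config.sh --unattended \
    --url https://github.com/ko-chan/saru \
    --token "${RUNNER_REG_TOKEN}" \
    --name "saru-hyperv-$i"
  # Installs the systemd unit actions.runner.<scope>.<name> and starts it
  sudo ./svc.sh install && sudo ./svc.sh start
done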

15 runners sharing a single Docker daemon. This "sharing" would later cause many problems.

2. Port Collisions via the Birthday Paradox

Problem

E2E tests have each job start its own PostgreSQL, Keycloak, frontend, and backend. To avoid port collisions, I calculated port numbers from the GitHub Actions RUN_ID:

# Initial implementation (problematic)
PORT_OFFSET=$(( RUN_ID % 3000 ))
POSTGRES_PORT=$(( 10000 + PORT_OFFSET ))

This looks fine, but since 15 runners share a single Docker daemon, ports from simultaneously running jobs can collide.

This has the same structure as the birthday paradox. With 3000 possible port offsets and 5 concurrent jobs, the collision probability is about 0.33% (1 - 3000!/(3000^5 × 2995!)). Seems trivial, but when CI runs dozens of times per day, collisions happen several times a week. And when ports collide, you get the cryptic error "container started but service unreachable."
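
The figure is easy to verify by multiplying out the no-collision probabilities, for example:

# P(at least one collision) with 5 concurrent jobs drawing from 3000 offsets
awk 'BEGIN { p = 1; for (i = 0; i < 5; i++) p *= (3000 - i) / 3000; printf "%.2f%%\n", (1 - p) * 100 }'
# → 0.33%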

Solution: RUNNER_NAME-based allocation

Instead of the random RUN_ID, I switched to deterministic port assignment from the runner name:

# Extract runner number from name (e.g., saru-hyperv-7 → 7)
if [[ "${RUNNER_NAME}" =~ saru-hyperv-([0-9]+) ]]; then
  RUNNER_NUM=${BASH_REMATCH[1]}
else
  RUNNER_NUM=$(( (RUN_ID % 15) + 1 ))
fi

# Allocate 200-port blocks per runner
RUNNER_BLOCK=$((RUNNER_NUM * 200))
PORTAL_OFFSET=$((PORTAL_INDEX * 10))

# Frontend/Backend: 20000 + (RUNNER_NUM × 200) + (PORTAL_INDEX × 10) + {0,1,2,3}
BASE_PORT=20000
OFFSET=$((RUNNER_BLOCK + PORTAL_OFFSET))
PORTAL_PORT=$((BASE_PORT + OFFSET + 1))
API_PORT=$((BASE_PORT + OFFSET + 2))

# Infra: 30000 + (RUNNER_NUM × 1000) + {0,100,200,...} + PORTAL_INDEX
BASE_INFRA_PORT=30000
INFRA_RUNNER_BLOCK=$((RUNNER_NUM * 1000))
POSTGRES_PORT=$((BASE_INFRA_PORT + INFRA_RUNNER_BLOCK + PORTAL_INDEX))
KEYCLOAK_PORT=$((BASE_INFRA_PORT + INFRA_RUNNER_BLOCK + 100 + PORTAL_INDEX))

The key insight: "each runner executes only one job at a time." Since the runner number uniquely determines the port block, collisions are impossible by design:

| Runner | Frontend Range | Infra Range |
| --- | --- | --- |
| saru-hyperv-1 | 20200–20313 | 31000–31510 |
| saru-hyperv-2 | 20400–20513 | 32000–32510 |
| ... | ... | ... |
| saru-hyperv-15 | 23000–23113 | 45000–45510 |

All ports confirmed to fit within 65535.
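
That check is simple arithmetic; for the highest-numbered runner, using the upper offsets implied by the table above:

RUNNER_NUM=15
echo "max frontend/backend port: $(( 20000 + RUNNER_NUM * 200 + 110 + 3 ))"   # 23113
echo "max infra port:            $(( 30000 + RUNNER_NUM * 1000 + 500 + 10 ))" # 45510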

3. docker system prune Kills Other Jobs' Containers

Problem

I had Docker cleanup at the end of each CI job:

# ⚠️ This was the problem
- name: Cleanup
  run: docker system prune -f

docker system prune deletes all stopped containers. Since 15 runners share a single Docker daemon, one job's cleanup can destroy containers another job is actively using.

Especially tricky is the timing right after container startup. If prune runs in the window after docker compose (or docker run) has created a container but before it is running and healthy, the not-yet-started container counts as "stopped" and gets deleted.

Solution: Targeted cleanup

# ✅ Safe cleanup
# docker system prune and docker container prune are FORBIDDEN
# They destroy concurrent jobs' containers
# Only remove dangling images older than 24 hours
docker image prune -f --filter "until=24h" 2>/dev/null || true

Container deletion targets only those tied to your own RUN_ID. Name patterns filter out persistent containers (saru-postgres-integ, etc.):

# Delete only this job's containers (persistent ones are protected by name)
RUN_ID="${{ github.run_id }}"
for container in $(docker ps -a --format '{{.Names}}' \
  | { grep "^saru-" | grep -v "saru-postgres-integ\|saru-keycloak-dev\|saru-mailpit-dev" || true; }); do
  if [[ "$container" =~ "${RUN_ID}" ]]; then
    docker rm -f "$container" 2>/dev/null || true
  fi
done

Lesson: In shared Docker daemon environments, docker system prune is a banned weapon. Always use scoped deletion.

4. PostgreSQL Silently Crashes from Shared Memory Exhaustion

Problem

PostgreSQL containers in CI started crashing immediately after startup. pg_isready succeeds momentarily, but the following psql command returns "container is not running":

✓ pg_isready -U test → success
✗ psql -U test -c "CREATE DATABASE ..." → container is not running

This is extremely confusing. PostgreSQL internally restarts after initdb, so if pg_isready succeeds right before that restart, the process no longer exists when the next command runs.

But the real cause was different. Docker's default /dev/shm size (64MB) is insufficient for PostgreSQL.

Solution

# --shm-size=256m is the critical flag: Docker's default /dev/shm (64MB) is too small for PostgreSQL
docker run -d \
  --name saru-postgres-ci \
  --shm-size=256m \
  --restart=unless-stopped \
  -e POSTGRES_USER=test \
  -e POSTGRES_PASSWORD=test \
  -p 15432:5432 \
  --health-cmd "pg_isready -U test" \
  postgres:16-alpine \
  postgres -c max_connections=200

Specifying --shm-size=256m ensures adequate shared memory for PostgreSQL. Parallel test runs (-parallel 2 or higher) in particular need more shared memory, and the 64MB default is not enough.
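
Whether the setting actually took effect is easy to verify from inside the running container, for example:

docker exec saru-postgres-ci df -h /dev/shm
# should report a 256M tmpfs mounted on /dev/shm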

This was an intermittent issue, only reproducing under high test load. Root cause identification took a full day.

Whether OOM Killer is the cause can be verified with docker inspect:

docker inspect "${CONTAINER}" --format '{{.State.OOMKilled}}'
# true means memory exhaustion was the cause

5. The Persistent PostgreSQL Container Pattern

Problem

Initially, each job started and stopped its own PostgreSQL container. However:

  • Container startup takes 10–15 seconds each time
  • Ports are not released immediately on stop, causing collisions on next startup
  • Container lifecycle management becomes complex (forgotten stops, zombie containers, etc.)

Solution: Persistent Container + Per-Job Database

Combined with the --shm-size=256m from section 4, I switched to keeping the container running permanently and creating/dropping temporary databases per job:

# Start PostgreSQL container if not running (first time only)
- name: Start persistent PostgreSQL
  run: |
    POSTGRES_CONTAINER="saru-postgres-integ"
    if ! docker ps --format '{{.Names}}' | grep -qx "${POSTGRES_CONTAINER}"; then
      docker run -d \
        --name "${POSTGRES_CONTAINER}" \
        --shm-size=256m \
        --restart=unless-stopped \
        -e POSTGRES_USER=test \
        -e POSTGRES_PASSWORD=test \
        -p 15432:5432 \
        postgres:16-alpine
    fi

    # Wait until PostgreSQL accepts connections (a fresh container restarts once after initdb)
    for _ in $(seq 1 30); do
      docker exec "${POSTGRES_CONTAINER}" pg_isready -U test -h 127.0.0.1 >/dev/null 2>&1 && break
      sleep 1
    done

    # Create database for this job
    DB_NAME="integ_${{ github.run_id }}"
    docker exec "${POSTGRES_CONTAINER}" \
      psql -U test -h 127.0.0.1 -c "CREATE DATABASE \"${DB_NAME}\" OWNER test;"

# Delete only the database at job end
- name: Cleanup database
  if: always()
  run: |
    docker exec saru-postgres-integ \
      psql -U test -h 127.0.0.1 \
      -c "DROP DATABASE IF EXISTS \"integ_${{ github.run_id }}\";"

    # Also clean up stale databases from past failed jobs
    STALE_DBS=$(docker exec saru-postgres-integ psql -U test \
      -d postgres -h 127.0.0.1 -tAc \
      "SELECT datname FROM pg_database WHERE datname LIKE 'integ_%';")
    for DB in $STALE_DBS; do
      docker exec saru-postgres-integ psql -U test \
        -d postgres -h 127.0.0.1 \
        -c "DROP DATABASE IF EXISTS \"${DB}\";"
    done

Three key points:

  1. --restart=unless-stopped: Container auto-recovers even when VM restarts
  2. Job ID as database name: Prevents interference between concurrent jobs
  3. Stale database cleanup: Periodically removes garbage left by failed jobs
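
Each job then only needs a connection string pointing at the shared instance and its own private database. A minimal sketch, assuming the backend reads DATABASE_URL (as in section 10) and using the credentials from the container above:

# GITHUB_RUN_ID is provided by GitHub Actions; the URL format is illustrative
DB_NAME="integ_${GITHUB_RUN_ID}"
export DATABASE_URL="postgres://test:test@127.0.0.1:15432/${DB_NAME}?sslmode=disable"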

6. Why Force TCP Connections in psql

Problem

When executing psql inside a PostgreSQL container without specifying the connection method, Unix sockets are used by default. However, PostgreSQL has an initdb → restart cycle on first startup, during which the Unix socket briefly disappears:

# ⚠️ Unix socket (default): may fail during restart
docker exec postgres psql -U test -c "SELECT 1"

# ✅ TCP connection: retries work during restart
docker exec postgres psql -U test -h 127.0.0.1 -c "SELECT 1"

Adding -h 127.0.0.1 forces TCP connection, making connection failure errors clearer (a definitive "connection refused" rather than an ambiguous "socket file not found").

Lesson: Always add -h 127.0.0.1 to psql calls in CI scripts.
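
In practice this turns into a tiny retry wrapper, since the very first connection can still hit the initdb restart window. A sketch (not the article's actual helper, and assuming the persistent container from section 5):

pg_exec() {
  # Retry psql over TCP for up to 10 seconds while PostgreSQL settles
  local i
  for i in $(seq 1 10); do
    docker exec saru-postgres-integ psql -U test -h 127.0.0.1 "$@" && return 0
    sleep 1
  done
  return 1
}

pg_exec -c "SELECT 1"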

7. Docker Network Pool Exhaustion

Problem

One day, all E2E jobs suddenly started failing. Error message:

Error response from daemon: could not find an available,
non-overlapping IPv4 address pool among the defaults to
assign to the network

Docker Compose creates a bridge network per project. By default, Docker carves these networks out of a fixed set of address pools (the /16s from 172.17.0.0/16 through 172.31.0.0/16, plus /20 slices of 192.168.0.0/16), which caps the total at roughly 30 networks. When 15 runners simultaneously run E2E tests and each job creates multiple networks, this pool runs dry.
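
You can see how close the daemon is to that ceiling at any time, for example:

# Total bridge networks vs. networks created by CI jobs
docker network ls --filter driver=bridge --format '{{.Name}}' | wc -l
docker network ls --format '{{.Name}}' | { grep -c '^saru-ci-' || true; }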

Solution

# Delete old networks (protect current RUN_ID's)
RUN_ID="${{ github.run_id }}"
for net in $(docker network ls --format '{{.Name}}' | { grep "^saru-ci-" || true; }); do
  if [[ ! "$net" =~ "${RUN_ID}" ]]; then
    containers=$(docker network inspect "$net" \
      --format '{{len .Containers}}' 2>/dev/null || echo "in-use")
    if [ "$containers" = "0" ]; then
      docker network rm "$net" 2>/dev/null || true
    fi
  fi
done

Delete unused old networks at the start of each job. Instead of docker network prune, filter by name and only remove those with "0 connected containers."

8. Preventing OTP Contention

Problem

E2E tests include OTP (one-time password) authentication. All E2E jobs share a single Mailpit instance and retrieve OTP codes by querying Mailpit's API for the relevant email.

The problem: when multiple jobs log in simultaneously with the same email address (e.g., system-admin@saru.local), multiple OTP emails for the same recipient arrive in Mailpit. Timestamp filtering helps somewhat, but millisecond-level contention cannot be fully prevented.

Solution: Per-Job Email Addresses

matrix:
  portal:
    - name: system-auth
      system_email: "system-admin@saru.local"
    - name: system-entities
      system_email: "system-entities@saru.local"
    - name: system-products
      system_email: "system-products@saru.local"
    - name: system-misc
      system_email: "system-misc@saru.local"

Each job uses a different system admin account (email address), eliminating OTP email retrieval contention. Multiple system admin accounts are registered in the backend seed data.
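
For context, OTP retrieval goes through Mailpit's HTTP API. A rough sketch of fetching the newest code for one job's dedicated address (the port, jq paths, and six-digit format are assumptions based on Mailpit's search and message endpoints, not the article's actual code):

SYSTEM_EMAIL="system-entities@saru.local"

# Newest message addressed to this job's dedicated account
MSG_ID=$(curl -sG "http://localhost:8025/api/v1/search" \
  --data-urlencode "query=to:\"${SYSTEM_EMAIL}\"" \
  | jq -r '.messages[0].ID')

# Pull the plain-text body and extract a 6-digit code
OTP=$(curl -s "http://localhost:8025/api/v1/message/${MSG_ID}" \
  | jq -r '.Text' | grep -oE '[0-9]{6}' | head -n 1)
echo "OTP for ${SYSTEM_EMAIL}: ${OTP}"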

9. The hashFiles Syntax Gotcha

Problem

I got stuck trying to use dynamic paths with GitHub Actions' hashFiles():

# ⚠️ This doesn't work (string concatenation can't be used inside hashFiles)
key: ${{ runner.os }}-turbo-${{ hashFiles('apps/' + matrix.app + '/**') }}

The GitHub Actions expression language has no + operator for string concatenation, so the pattern cannot be assembled inline like this. hashFiles() itself will accept a computed string, as the fix below shows.

Solution

# ✅ Use the format() helper
key: ${{ runner.os }}-turbo-${{ matrix.app }}-${{ hashFiles(format('apps/{0}/**', matrix.app), 'packages/**', 'pnpm-lock.yaml', 'turbo.json') }}

Use format() to build the path first, then pass the result to hashFiles(). This is not in the GitHub Actions documentation—I found it in a community discussion (#25718).

10. Migration Round-Trip Testing

CI tests a "migration round-trip" every time:

# Up → Down 1 step → Up again
DATABASE_URL="..." go run ./cmd/migrate -action up
DATABASE_URL="..." go run ./cmd/migrate -action down -steps 1
DATABASE_URL="..." go run ./cmd/migrate -action up

This ensures:

  • Down migrations are not broken
  • The "can never Up again after Up→Down" pattern is detected
  • Safety for production rollbacks is guaranteed

11. Automatic Diagnostics on Failure

To make root cause identification easier when CI fails, diagnostic information is collected in if: failure() steps:

- name: Diagnose PostgreSQL on failure
  if: failure()
  run: |
    # Container state
    docker inspect "${POSTGRES_CONTAINER}" --format '{{.State.Status}}'

    # Active connections
    docker exec "${POSTGRES_CONTAINER}" psql -U test -h 127.0.0.1 \
      -c "SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;"

    # Container logs (last 30 lines)
    docker logs "${POSTGRES_CONTAINER}" --tail 30

    # Host memory status
    free -h

Just having this breaks the loop of "CI failed → check logs → can't find the cause → re-run and pray."

12. Disk Management

Since 15 runners share the same 200GB disk, disk management is critical:

strategy:
  max-parallel: 2  # Limit concurrency to prevent disk exhaustion

Each job consumes about 2GB for node_modules installation, frontend builds, Playwright browser cache, etc. 8 concurrent jobs means 16GB, plus Docker images and build cache—200GB fills up quickly.

A periodic cleanup script is also prepared:

| Target | Retention |
| --- | --- |
| CI artifacts | 7 days |
| Runner _temp | 1 day |
| .turbo cache | 7 days |
| node_modules | 3 days |
| Go build cache | 14 days |
| Docker images | 30 days |
| Playwright browsers | Keep (essential for E2E) |
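
The script itself is not shown here, but it boils down to a handful of find and prune commands keyed to those retention periods. A minimal sketch with illustrative paths (the actual directory layout is not given in the article):

# Runner _temp: older than 1 day
find /opt/actions-runner-*/_work/_temp -mindepth 1 -maxdepth 1 -mtime +1 -exec rm -rf {} + 2>/dev/null || true

# CI artifacts: older than 7 days
find /srv/ci-artifacts -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} + 2>/dev/null || true

# Unused Docker images older than 30 days (containers are handled per job)
docker image prune -af --filter "until=720h" >/dev/null 2>&1 || true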

Summary: Lessons Learned from Self-Hosted CI

| Lesson | Details |
| --- | --- |
| Shared resources collide | Docker daemon, ports, networks, disk |
| Determinism over randomness | Port assignment: RUNNER_NUM-based over RUN_ID % N |
| prune is a banned weapon | docker system prune is forbidden in shared environments |
| Persistent container + temporary DB | Simplifies container lifecycle management |
| Force TCP | psql with -h 127.0.0.1 avoids Unix socket traps |
| Leave diagnostics on failure | Automate root cause identification with if: failure() |
| Don't trust Docker defaults | Explicitly specify --shm-size, max_connections |

Self-hosted CI brings flexibility and speed that GitHub-hosted cannot match. But it also means accepting the complexity of "managing infrastructure yourself."

In solo development, when CI breaks, you are the only one who can fix it. That is precisely why it was important to pursue the "why" when problems occur and build systems that prevent recurrence. Every solution presented here was born from an actual incident.


