Yuri Tománek

Posted on • Originally published at blog.ahojmetrics.com

How We Run Lighthouse from 18 Regions in Under 2 Minutes

Most performance monitoring tools test your site from one location, or run tests sequentially across regions. That means testing from 18 locations can take 20+ minutes.

We needed something faster. Ahoj Metrics tests from 18 global regions simultaneously in about 2 minutes. Here's how.

The Architecture

The core idea is simple: don't keep workers running. Spawn them on demand, run the test, destroy them.

We use Fly.io's Machines API to create ephemeral containers in specific regions. Each container runs a single Lighthouse audit, sends the results back via webhook, and destroys itself.

A request flows through the system like this: the Rails app creates a ReportRequest, spawns one Fly.io machine per selected region, and then waits for each machine to report back via webhook.

The key design decision: one audit = one ReportRequest, regardless of how many regions you test. Test from 1 region or 18 - it's the same user action.

Spawning Machines with the Fly.io API

Here's the actual code that creates a machine in a specific region:

require "httparty"

class FlyMachinesService
  API_BASE_URL = "https://api.machines.dev/v1"

  # Lightweight result wrapper so callers never touch HTTParty directly.
  Response = Struct.new(:success, :error, :data, keyword_init: true)

  def self.create_machine(region:, env:, app_name:)
    url = "#{API_BASE_URL}/apps/#{app_name}/machines"

    body = {
      region: region,
      config: {
        image: ENV.fetch("WORKER_IMAGE", "registry.fly.io/am-worker:latest"),
        size: "performance-8x",
        auto_destroy: true,          # machine deletes itself when the process exits
        restart: { policy: "no" },   # a crashed audit is handled on the Rails side
        stop_config: {
          timeout: "30s",
          signal: "SIGTERM"
        },
        env: env,                    # target URL, callback URL, report ID
        services: []                 # no public ports; outbound traffic only
      }
    }

    response = HTTParty.post(
      url,
      headers: headers,
      body: body.to_json,
      timeout: 30
    )

    if response.success?
      Response.new(success: true, data: response.parsed_response)
    else
      Response.new(
        success: false,
        error: "API error: #{response.code} - #{response.body}"
      )
    end
  end

  # The Machines API authenticates with a bearer token.
  def self.headers
    {
      "Authorization" => "Bearer #{ENV.fetch('FLY_API_TOKEN')}",
      "Content-Type" => "application/json"
    }
  end
end

A few things worth noting:

auto_destroy: true is the magic. The machine cleans itself up after the process exits. No lingering containers, no zombie workers, no cleanup cron jobs.

performance-8x gives us 8 vCPU and 16GB RAM. Lighthouse is resource-hungry - it runs a full Chrome instance. Underpowered machines produce inconsistent scores because Chrome competes for CPU time. We tried smaller sizes and the variance was too high.

restart: { policy: "no" } means if Lighthouse crashes, the machine just dies. We handle the failure on the Rails side by checking for timed-out reports.

services: [] means no public ports. The worker doesn't need to accept incoming traffic. It runs Lighthouse and POSTs results back to our API. That's it.
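For context, here's roughly what the fan-out looks like on the Rails side. This is a simplified sketch - the job name, webhook URL, and env keys are illustrative, not our exact production code:

# Illustrative fan-out: one machine per region for a single ReportRequest.
class SpawnAuditMachinesJob < ApplicationJob
  queue_as :default

  def perform(report_request)
    report_request.reports.each do |report|
      result = FlyMachinesService.create_machine(
        region: report.region,
        app_name: "am-worker",
        env: {
          "TARGET_URL"   => report_request.url,
          "CALLBACK_URL" => webhook_url(report),
          "REPORT_ID"    => report.id.to_s
        }
      )
      # If the region can't spawn a machine, fail that report immediately.
      report.update!(status: "failed", error: result.error) unless result.success
    end
  end

  private

  def webhook_url(report)
    # The worker POSTs its Lighthouse JSON here when it finishes.
    "https://api.ahojmetrics.com/webhooks/reports/#{report.id}"
  end
end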

The Worker

Each Fly.io machine runs a Docker container that does roughly this:

  1. Read environment variables (target URL, callback URL, report ID)
  2. Launch headless Chrome
  3. Run Lighthouse audit
  4. POST the JSON results back to the Rails API
  5. Exit (machine auto-destroys)

The callback is a simple webhook. The worker doesn't need to know anything about our database, user accounts, or billing. It just runs a test and reports back.
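We won't paste the real worker here, but a minimal Ruby version of those five steps would look something like this (assuming the Docker image ships Chrome and the Lighthouse CLI; env names are illustrative):

#!/usr/bin/env ruby
# Minimal worker sketch: run one audit, report back, exit.
require "net/http"
require "json"
require "uri"
require "open3"

# Step 1: read configuration from the environment.
target_url   = ENV.fetch("TARGET_URL")
callback_url = ENV.fetch("CALLBACK_URL")
report_id    = ENV.fetch("REPORT_ID")

# Steps 2-3: Lighthouse launches its own headless Chrome.
stdout, stderr, status = Open3.capture3(
  "lighthouse", target_url,
  "--output=json", "--output-path=stdout", "--quiet",
  "--chrome-flags=--headless --no-sandbox"
)

payload = if status.success?
  { report_id: report_id, status: "completed", results: JSON.parse(stdout) }
else
  { report_id: report_id, status: "failed", error: stderr }
end

# Step 4: POST the results back to the Rails API.
Net::HTTP.post(URI(callback_url), payload.to_json, "Content-Type" => "application/json")

# Step 5: exit; auto_destroy tears down the machine.
exit(status.success? ? 0 : 1)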

Handling Results

On the Rails side, each Report record tracks its own status:

class ReportRequest < ApplicationRecord
  has_many :reports

  # Called after every worker webhook (and the timeout sweep) updates a Report.
  # "completed?" here means terminal: the report either succeeded or errored.
  def check_completion!
    return unless reports.all?(&:completed?)

    update!(status: "completed")
    update_cached_stats!
    check_monitor_alert if site_monitor.present?
  end
end

When a worker POSTs results, the corresponding Report is updated. After each update, we check if all reports for the request are done. If so, we aggregate the results, calculate averages, and update the dashboard.
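The receiving side is deliberately boring. A stripped-down sketch of the webhook endpoint (auth and validation omitted; controller name is illustrative):

# Illustrative endpoint for worker callbacks.
class ReportWebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def create
    payload = JSON.parse(request.raw_post)

    report = Report.find(payload["report_id"])
    report.update!(status: payload["status"], results: payload["results"])

    # Every callback is a chance for the parent request to finish.
    report.report_request.check_completion!
    head :ok
  end
end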

Each report is independent. If the Sydney worker fails but the other 17 succeed, you still get 17 results. The failed region shows as an error without blocking everything else.

Cost Math

This is the part that makes ephemeral workers compelling. Compare two approaches:

Persistent workers (18 regions, always-on):

  • 18 performance-8x machines running 24/7
  • Based on Fly.io's pricing calculator: ~$2,734/month
  • Mostly sitting idle waiting for audit requests

Ephemeral workers (our approach):

  • Machines run for ~2 minutes per audit
  • performance-8x costs roughly $0.0001344/second
  • One 18-region audit costs about $0.29
  • 100 audits/month = ~$29
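The per-audit figure is just the rate multiplied out:

# Back-of-the-envelope cost for one 18-region audit at ~2 minutes per machine
rate_per_second = 0.0001344   # performance-8x, per machine
audit_seconds   = 120
regions         = 18

rate_per_second * audit_seconds * regions  # => ~$0.29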

At low volume, ephemeral is dramatically cheaper. The crossover point where persistent workers become more cost-effective is roughly 9,400 18-region audits per month ($2,734 / $0.29) - well beyond our current scale.

The tradeoff is cold start time. Each machine takes a few seconds to boot. For our use case (users expect a 1-2 minute wait anyway), that's invisible.

The Background Job Layer

We use Solid Queue (Rails 8's built-in job backend) for everything. No Redis, no Sidekiq.

# config/recurring.yml
production:
  monitor_scheduler:
    class: MonitorSchedulerJob
    queue: default
    schedule: every minute

The MonitorSchedulerJob runs every minute, checks which monitors are due for testing, and kicks off the Fly.io machine spawning. Monitor runs are background operations - they don't count toward the user's audit quota.
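A simplified sketch of that job (field names like next_run_at and check_interval are illustrative):

# Illustrative scheduler: find monitors due for a run and kick off audits.
class MonitorSchedulerJob < ApplicationJob
  queue_as :default

  def perform
    SiteMonitor.where("next_run_at <= ?", Time.current).find_each do |monitor|
      # Monitor runs reuse the same fan-out path as user-initiated audits.
      report_request = monitor.report_requests.create!(
        url: monitor.url,
        status: "pending"
      )
      SpawnAuditMachinesJob.perform_later(report_request)
      monitor.update!(next_run_at: monitor.check_interval.from_now)
    end
  end
end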

This keeps the architecture simple. One PostgreSQL database handles the queue (via Solid Queue), the application data, and the cache. No Redis to manage, no separate queue infrastructure to monitor.

What We Learned

Lighthouse needs consistent resources. When we first used shared-cpu machines, scores would vary by 15-20 points between runs of the same URL. Bumping to performance-8x brought variance down to 2-3 points. The extra cost per audit is worth the consistency.

Timeouts need multiple layers. We set timeouts at the HTTP level (30s for API calls), the machine level (stop_config timeout), and the application level (mark reports as failed after 5 minutes). Belt and suspenders.
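That application-level layer is just another recurring job that sweeps for stuck reports. A sketch, mirroring the 5-minute deadline above:

# Illustrative sweep: fail any report still pending past the deadline.
class ReportTimeoutJob < ApplicationJob
  queue_as :default

  def perform
    Report.where(status: "pending")
          .where("created_at < ?", 5.minutes.ago)
          .find_each do |report|
      report.update!(status: "failed", error: "Timed out after 5 minutes")
      # A failed report is still terminal, so the parent request can complete.
      report.report_request.check_completion!
    end
  end
end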

Region availability isn't guaranteed. Sometimes a Fly.io region is temporarily unavailable. We handle this gracefully - the report for that region shows an error, but the rest of the audit completes normally.

Webhook delivery can fail. If our API is temporarily unreachable when the worker finishes, we lose the result. We're adding a retry mechanism and considering having workers write results to object storage as a fallback.

The Numbers

After running this in production since January 2026:

  • Average audit time: ~2 minutes (single region or all 18)
  • P95 audit time: ~3 minutes
  • Machine boot time: 3-8 seconds depending on region
  • Success rate: ~97% (3% are timeouts or region availability issues)
  • Cost per audit: $0.01-0.29 depending on regions selected

Try It

You can test this yourself at ahojmetrics.com. Free tier gives you 20 audits/month - enough to see how your site performs from Sydney, Tokyo, Sao Paulo, London, and more.

If you have questions about the architecture, ask in the comments. Happy to go deeper on any part of this.


Built with Rails 8.1, Solid Queue, Fly.io Machines API, and PostgreSQL. Frontend is React + TypeScript on Cloudflare Pages.
