Most performance monitoring tools test your site from one location, or run tests sequentially across regions. That means testing from 18 locations can take 20+ minutes.
We needed something faster. Ahoj Metrics tests from 18 global regions simultaneously in about 2 minutes. Here's how.
The Architecture
The core idea is simple: don't keep workers running. Spawn them on demand, run the test, destroy them.
We use Fly.io's Machines API to create ephemeral containers in specific regions. Each container runs a single Lighthouse audit, sends the results back via webhook, and destroys itself.
Here's how a request flows through the system:
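Sketched in plain Ruby (no Rails, and the names `start_audit`, `Report`, and the struct fields are illustrative, not the actual Ahoj Metrics code), the fan-out looks roughly like this:

```ruby
# Illustrative sketch only: one user action produces one report request
# plus one per-region report. Struct names and fields are assumptions.
ReportRequest = Struct.new(:url, :reports)
Report = Struct.new(:region, :status)

def start_audit(url, regions)
  request = ReportRequest.new(url, [])
  regions.each do |region|
    # Each region gets its own report; a background job would then spawn
    # an ephemeral Fly.io machine for that region.
    request.reports << Report.new(region, "pending")
  end
  request
end

request = start_audit("https://example.com", %w[syd nrt gru lhr])
request.reports.size # => 4
```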
The key design decision: one audit = one ReportRequest, regardless of how many regions you test. Test from 1 region or 18 - it's the same user action.
Spawning Machines with the Fly.io API
Here's the actual code that creates a machine in a specific region:
```ruby
class FlyMachinesService
  API_BASE_URL = "https://api.machines.dev/v1"

  def self.create_machine(region:, env:, app_name:)
    url = "#{API_BASE_URL}/apps/#{app_name}/machines"

    body = {
      region: region,
      config: {
        image: ENV.fetch("WORKER_IMAGE", "registry.fly.io/am-worker:latest"),
        size: "performance-8x",
        auto_destroy: true,
        restart: { policy: "no" },
        stop_config: {
          timeout: "30s",
          signal: "SIGTERM"
        },
        env: env,
        services: []
      }
    }

    response = HTTParty.post(
      url,
      headers: headers,
      body: body.to_json,
      timeout: 30
    )

    # Response is a simple result struct defined elsewhere in the service.
    if response.success?
      Response.new(success: true, data: response.parsed_response)
    else
      Response.new(
        success: false,
        error: "API error: #{response.code} - #{response.body}"
      )
    end
  end

  # The Machines API authenticates with a bearer token.
  def self.headers
    {
      "Authorization" => "Bearer #{ENV.fetch('FLY_API_TOKEN')}",
      "Content-Type" => "application/json"
    }
  end
end
```
A few things worth noting:
auto_destroy: true is the magic. The machine cleans itself up after the process exits. No lingering containers, no zombie workers, no cleanup cron jobs.
performance-8x gives us 8 vCPUs and 16GB RAM. Lighthouse is resource-hungry - it runs a full Chrome instance. Underpowered machines produce inconsistent scores because Chrome competes for CPU time. We tried smaller sizes and the variance was too high.
restart: { policy: "no" } means if Lighthouse crashes, the machine just dies. We handle the failure on the Rails side by checking for timed-out reports.
services: [] means no public ports. The worker doesn't need to accept incoming traffic. It runs Lighthouse and POSTs results back to our API. That's it.
The Worker
Each Fly.io machine runs a Docker container that does roughly this:
- Read environment variables (target URL, callback URL, report ID)
- Launch headless Chrome
- Run Lighthouse audit
- POST the JSON results back to the Rails API
- Exit (machine auto-destroys)
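The steps above can be sketched as a single Ruby entrypoint. This is a hedged sketch, not the actual worker: the env var names (`TARGET_URL`, `CALLBACK_URL`, `REPORT_ID`), the Lighthouse CLI flags, and `build_payload` are all assumptions for illustration.

```ruby
#!/usr/bin/env ruby
require "json"
require "net/http"
require "uri"

# Build the callback payload from a raw Lighthouse JSON report.
# (Field names here are assumptions, not the real webhook schema.)
def build_payload(report_id, lighthouse_json)
  result = JSON.parse(lighthouse_json)
  {
    report_id: report_id,
    performance_score: (result.dig("categories", "performance", "score").to_f * 100).round,
    raw: result
  }
end

if ENV["TARGET_URL"]
  target_url   = ENV.fetch("TARGET_URL")
  callback_url = ENV.fetch("CALLBACK_URL")
  report_id    = ENV.fetch("REPORT_ID")

  # Run the Lighthouse CLI against headless Chrome, capturing JSON output.
  json = `lighthouse #{target_url} --output=json --quiet --chrome-flags="--headless"`

  # POST results back; the machine auto-destroys when this process exits.
  Net::HTTP.post(URI(callback_url), build_payload(report_id, json).to_json,
                 "Content-Type" => "application/json")
end
```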
The callback is a simple webhook. The worker doesn't need to know anything about our database, user accounts, or billing. It just runs a test and reports back.
Handling Results
On the Rails side, each Report record tracks its own status:
```ruby
class ReportRequest < ApplicationRecord
  has_many :reports

  def check_completion!
    return unless reports.all?(&:completed?)

    update!(status: "completed")
    update_cached_stats!
    check_monitor_alert if site_monitor.present?
  end
end
```
When a worker POSTs results, the corresponding Report is updated. After each update, we check if all reports for the request are done. If so, we aggregate the results, calculate averages, and update the dashboard.
Each report is independent. If the Sydney worker fails but the other 17 succeed, you still get 17 results. The failed region shows as an error without blocking everything else.
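In plain Ruby, the result-handling logic described above looks roughly like this (a sketch under stated assumptions: the real app does this in a Rails controller plus the ReportRequest model, and `apply_result` is a name invented for illustration):

```ruby
# Apply one worker's webhook payload to its region's report, then check
# whether every report has reached a terminal state (completed or failed),
# so one failed region never blocks the other 17.
def apply_result(request, region, payload)
  report = request[:reports].find { |r| r[:region] == region }
  report[:status] = "completed"
  report[:score]  = payload[:performance_score]

  done = request[:reports].all? { |r| %w[completed failed].include?(r[:status]) }
  request[:status] = "completed" if done
  request
end
```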
Cost Math
This is the part that makes ephemeral workers compelling. Compare two approaches:
Persistent workers (18 regions, always-on):
- 18 performance-8x machines running 24/7
- Based on Fly.io's pricing calculator: ~$2,734/month
- Mostly sitting idle waiting for audit requests
Ephemeral workers (our approach):
- Machines run for ~2 minutes per audit
- performance-8x costs roughly $0.0001344/second
- One 18-region audit costs about $0.29
- 100 audits/month = ~$29
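The per-audit arithmetic checks out in a few lines of Ruby (using the approximate rate quoted above):

```ruby
RATE_PER_SECOND = 0.0001344 # performance-8x, approximate
AUDIT_SECONDS   = 120       # ~2 minutes per machine
REGIONS         = 18

cost_per_audit = RATE_PER_SECOND * AUDIT_SECONDS * REGIONS
puts cost_per_audit.round(2)       # => 0.29
puts (cost_per_audit * 100).round  # 100 audits/month => 29
```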
At low volume, ephemeral is dramatically cheaper. The crossover point where persistent workers become more cost-effective is well beyond our current scale.
The tradeoff is cold start time. Each machine takes a few seconds to boot. For our use case (users expect a 1-2 minute wait anyway), that's invisible.
The Background Job Layer
We use Solid Queue (Rails 8's built-in job backend) for everything. No Redis, no Sidekiq.
```yaml
# config/recurring.yml
production:
  monitor_scheduler:
    class: MonitorSchedulerJob
    queue: default
    schedule: every minute
```
The MonitorSchedulerJob runs every minute, checks which monitors are due for testing, and kicks off the Fly.io machine spawning. Monitor runs are background operations - they don't count toward the user's audit quota.
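The "which monitors are due" check reduces to a simple predicate. A plain-Ruby sketch (the real job would use ActiveRecord scopes and Solid Queue; `due_monitors` and the hash fields are illustrative):

```ruby
# A monitor is due if it has never run, or if its interval has elapsed
# since its last run.
def due_monitors(monitors, now)
  monitors.select do |m|
    m[:last_run_at].nil? || (now - m[:last_run_at]) >= m[:interval_seconds]
  end
end
```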
This keeps the architecture simple. One PostgreSQL database handles the queue (via Solid Queue), the application data, and the cache. No Redis to manage, no separate queue infrastructure to monitor.
What We Learned
Lighthouse needs consistent resources. When we first used shared-cpu machines, scores would vary by 15-20 points between runs of the same URL. Bumping to performance-8x brought variance down to 2-3 points. The extra cost per audit is worth the consistency.
Timeouts need multiple layers. We set timeouts at the HTTP level (30s for API calls), the machine level (stop_config timeout), and the application level (mark reports as failed after 5 minutes). Belt and suspenders.
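The application-level layer amounts to a periodic sweep. A minimal sketch, assuming a 5-minute cutoff as stated above (in the real app this would be an ActiveRecord query in a recurring job; the names here are invented):

```ruby
STALE_AFTER = 5 * 60 # seconds

# Mark any report stuck in "running" for more than 5 minutes as failed,
# so a crashed or vanished machine can't leave an audit hanging forever.
def sweep_stale(reports, now)
  reports.each do |r|
    next unless r[:status] == "running"
    r[:status] = "failed" if now - r[:started_at] > STALE_AFTER
  end
end
```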
Region availability isn't guaranteed. Sometimes a Fly.io region is temporarily unavailable. We handle this gracefully - the report for that region shows an error, but the rest of the audit completes normally.
Webhook delivery can fail. If our API is temporarily unreachable when the worker finishes, we lose the result. We're adding a retry mechanism and considering having workers write results to object storage as a fallback.
The Numbers
After running this in production since January 2026:
- Average audit time: ~2 minutes (single region or all 18)
- P95 audit time: ~3 minutes
- Machine boot time: 3-8 seconds depending on region
- Success rate: ~97% (3% are timeouts or region availability issues)
- Cost per audit: $0.01-0.29 depending on regions selected
Try It
You can test this yourself at ahojmetrics.com. The free tier gives you 20 audits/month - enough to see how your site performs from Sydney, Tokyo, São Paulo, London, and more.
If you have questions about the architecture, ask in the comments. Happy to go deeper on any part of this.
Built with Rails 8.1, Solid Queue, Fly.io Machines API, and PostgreSQL. Frontend is React + TypeScript on Cloudflare Pages.
