I Built a Cron Job Monitor Because Silence Kills Production
Three months ago, my client's daily database backup hadn't run in 11 days. The cron job was still scheduled. No errors in the logs. The monitoring dashboard was green. Everything looked fine.
Until someone tried to restore from a backup that didn't exist.
That's when I learned the hard way: traditional monitoring is terrible at catching things that don't happen.
The Problem with Traditional Monitoring
Most monitoring tools are great at telling you when something breaks:
- Server is down? Alert.
- API returns 500? Alert.
- Disk is full? Alert.
But what about when your nightly backup job silently stops running? Or your data sync task fails to start? Or your cleanup script never executes?
Silence.
Traditional monitoring watches for events. Cron jobs that don't run don't generate events. They just... don't happen. And you don't find out until it's too late.
The "Dead Man's Switch" Approach
After the backup incident, I started thinking differently about monitoring scheduled tasks. Instead of watching for failures, what if we watched for missing successes?
The concept is simple:
- Your cron job pings an endpoint when it runs successfully
- If the ping doesn't arrive within the expected window, you get alerted
- Silence = failure
It's like a dead man's switch on a train. If the operator stops pressing the button (the "heartbeat"), the train stops. If your job stops checking in, you get alerted.
Building CronGuard
I built CronGuard to solve this for myself and my clients. The core idea is dead simple:
Every monitor gets a unique ping URL. Your job hits that URL when it completes. If we don't get a ping within the expected schedule, we alert you.
Here's what a basic integration looks like:
#!/bin/bash
# Your backup script
set -euo pipefail   # stop at the first failed command, so the ping only fires on success

pg_dump mydb > backup.sql
tar -czf backup-$(date +%Y%m%d).tar.gz backup.sql
aws s3 cp backup-$(date +%Y%m%d).tar.gz s3://my-backups/

# Ping CronGuard when done
curl -fsS https://cronguard.app/api/ping/your-monitor-id
That's it. If the backup fails, the ping doesn't happen. If the cron job stops running, the ping doesn't happen. If the server dies, the ping doesn't happen.
In all cases: you get alerted.
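If you'd rather not touch the script itself, you can also chain the ping directly in the crontab entry. The && means the ping only fires if the backup exits successfully (the script path and monitor ID here are placeholders):
# Run the backup at 03:00; ping CronGuard only if it succeeds
0 3 * * * /usr/local/bin/backup.sh && curl -fsS https://cronguard.app/api/ping/your-monitor-id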
The Technical Choices That Mattered
1. Keep the Ping Endpoint Stupid Simple
The ping endpoint is just an HTTP GET. No authentication required (the URL itself is the secret). No JSON body. No headers. Just:
curl https://cronguard.app/api/ping/abc123
Why? Because I wanted it to work from anywhere:
- Bash scripts
- Python scripts
- Inside Docker containers
- Lambda functions
- GitHub Actions
- Cron jobs on a Raspberry Pi
If you can make an HTTP request, you can monitor your job. No SDK. No dependencies. No authentication dance.
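For instance, on a minimal container or a bare-bones device where curl isn't installed, plain wget does the job just as well:
# Same ping, using wget instead of curl; discard the response body
wget -q -O /dev/null https://cronguard.app/api/ping/abc123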
2. Grace Periods Are Critical
Here's a mistake I made early: treating cron schedules as exact.
If a job is scheduled for 03:00 and runs at 03:02, that's not a failure. Servers reboot. Tasks queue. Execution time varies.
CronGuard uses grace periods:
- Daily job at 03:00? Alert if no ping by 04:00.
- Hourly job? Alert if no ping after 70 minutes.
- Every 5 minutes? Alert after 7 minutes.
This eliminated false positives and made the system actually useful.
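Under the hood the check is conceptually tiny: alert only when the time since the last ping exceeds the expected period plus the grace period. Here's a minimal sketch in shell, with hypothetical names and a hypothetical state file (not CronGuard's actual code):
# Sketch of the missed-ping check for a daily job with a one-hour grace period
STATE_FILE=/var/lib/cronguard/last_ping    # stores the Unix timestamp of the last ping
PERIOD=86400                               # expected seconds between pings
GRACE=3600                                 # extra slack before alerting
now=$(date +%s)
last_ping=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
if (( now - last_ping > PERIOD + GRACE )); then
  echo "ALERT: no ping within the expected window"
fi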
3. The First Ping Problem
When you create a new monitor, you haven't sent your first ping yet. Should the system immediately alert that the job is "down"?
No. That's annoying.
Solution: monitors are in a "waiting" state until they receive their first ping. After that, the clock starts ticking.
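In the sketch above, that just means skipping the check entirely until a first ping has ever been recorded:
# Building on the sketch above: no state file means no first ping yet,
# so the monitor stays in "waiting" and nothing is alerted.
STATE_FILE=/var/lib/cronguard/last_ping
if [[ ! -f "$STATE_FILE" ]]; then
  echo "waiting for first ping"
  exit 0
fi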
4. Recovery Notifications Matter
Early version: alert when job stops checking in. Done.
Reality: you also want to know when it starts working again.
Now CronGuard sends recovery notifications too:
- "Backup job missed check-in (expected by 04:00)"
- "Backup job recovered (checked in at 04:15)"
This helps you confirm your fix actually worked.
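You can picture the recovery logic as a simple up/down flag that flips when a ping arrives. Again, hypothetical names, not CronGuard's internals:
# Sketch of handling an incoming ping: if the monitor was marked down, announce recovery
STATUS_FILE=/var/lib/cronguard/status      # holds "up" or "down"
if [[ "$(cat "$STATUS_FILE" 2>/dev/null)" == "down" ]]; then
  echo "Backup job recovered (checked in at $(date +%H:%M))"
fi
echo up > "$STATUS_FILE"
date +%s > /var/lib/cronguard/last_ping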
Lessons from Running It in Production
Lesson 1: Cron Jobs Fail More Than You Think
After deploying CronGuard for my own infrastructure and a handful of clients, I learned that cron jobs are fragile.
Things I've seen cause silent failures:
- Server rebooted, cron daemon didn't restart properly
- Environment variables missing in cron context
- Disk full, job can't write temp files
- Database credentials rotated, job can't connect
- Dependencies updated, script breaks
- Path issues (/usr/local/bin not in cron's PATH; see the snippet below)
None of these would trigger traditional monitoring. All of them stopped critical jobs from running.
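The PATH one deserves special mention: cron usually runs jobs with a minimal environment (often just /usr/bin:/bin), so a script that works fine from your shell can fail when cron runs it. Declaring the environment explicitly at the top of the crontab avoids that whole class of failure:
# At the top of the crontab: spell out what cron won't give you by default
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin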
Lesson 2: Most People Don't Monitor Their Cron Jobs At All
I thought everyone had sophisticated monitoring setups. Turns out, most developers and small teams just... hope their cron jobs work.
They schedule it once, see it run once, and assume it'll run forever. Until it doesn't.
Lesson 3: The Real Value Is Peace of Mind
The best feedback I got wasn't "this caught a bug." It was "I sleep better knowing I'll find out if something stops working."
That's the real value: confidence that silence won't kill production.
When Dead Man's Switch Monitoring Makes Sense
This approach isn't for everything. Here's where it works:
- Scheduled tasks: backups, data syncs, cleanup jobs, report generation
- Async workers: anything you expect to complete on a regular cadence
- Periodic data ingestion: RSS feeds, API polling, scraping
And where it doesn't:
- Real-time services: use traditional uptime monitoring instead
- Event-driven systems: if execution is unpredictable, there's no expected window to miss
Alternative Approaches (and Why I Didn't Use Them)
Option 1: Log parsing
Parse cron output for errors. Problem: no output = no detection.
Option 2: Process monitoring
Check if the process is running. Problem: cron jobs are short-lived; by the time you check, the process has already exited (or never started).
Option 3: File timestamps
Check modification time on output files. Problem: requires filesystem access, brittle.
Option 4: Traditional uptime monitoring
Ping the endpoint yourself. Problem: doesn't tell you if the job ran, just if the endpoint responds.
Dead man's switch monitoring is the only approach that directly answers: "Did my job complete successfully?"
Try It Yourself
I run CronGuard as a free service for basic monitoring (5 monitors, 5-minute checks). If you've got cron jobs, backups, or scheduled tasks you care about, give it a shot.
You can literally start monitoring in 30 seconds:
- Create a monitor
- Copy the ping URL
- Add curl -fsS <url> to the end of your script
That's it. Now you'll know if it stops working.
The Bottom Line
Traditional monitoring watches for things that happen. But some of the most critical failures are things that don't happen.
If you've got scheduled tasks keeping your infrastructure alive, you need to monitor for silence.
Because in production, silence kills.
Questions? Running into silent failures with your own infrastructure? Drop a comment. I'd love to hear your war stories about cron jobs that stopped working and how long it took to notice.
Built something similar? Using a different approach? Let me know. I'm always curious how other teams solve this.