I Built a Cron Job Monitor Because Silence Kills Production
Three months ago, my client's daily database backup hadn't run in 11 days. The cron job was still scheduled. No errors in the logs. The monitoring dashboard was green. Everything looked fine.
Until someone tried to restore from a backup that didn't exist.
That's when I learned the hard way: traditional monitoring is terrible at catching things that don't happen.
The Problem with Traditional Monitoring
Most monitoring tools are great at telling you when something breaks:
- Server is down? Alert.
- API returns 500? Alert.
- Disk is full? Alert.
But what about when your nightly backup job silently stops running? Or your data sync task fails to start? Or your cleanup script never executes?
Silence.
Traditional monitoring watches for events. Cron jobs that don't run don't generate events. They just... don't happen. And you don't find out until it's too late.
The "Dead Man's Switch" Approach
After the backup incident, I started thinking differently about monitoring scheduled tasks. Instead of watching for failures, what if we watched for missing successes?
The concept is simple:
- Your cron job pings an endpoint when it runs successfully
- If the ping doesn't arrive within the expected window, you get alerted
- Silence = failure
It's like a dead man's switch on a train. If the operator stops pressing the button (the "heartbeat"), the train stops. If your job stops checking in, you get alerted.
Building CronGuard
I built CronGuard to solve this for myself and my clients. The core idea is dead simple:
Every monitor gets a unique ping URL. Your job hits that URL when it completes. If we don't get a ping within the expected schedule, we alert you.
Here's what a basic integration looks like:
#!/bin/bash
# Your backup script
set -euo pipefail   # stop at the first failed command, so the ping only fires on success

pg_dump mydb > backup.sql
tar -czf backup-$(date +%Y%m%d).tar.gz backup.sql
aws s3 cp backup-$(date +%Y%m%d).tar.gz s3://my-backups/

# Ping CronGuard when done
curl -fsS https://cronguard.app/api/ping/your-monitor-id
That's it. If the backup fails, the ping doesn't happen. If the cron job stops running, the ping doesn't happen. If the server dies, the ping doesn't happen.
In all cases: you get alerted.
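If you'd rather not touch the script itself, you can also chain the ping directly in the crontab entry. The && means the ping only fires if the backup exits successfully (the script path and monitor ID here are placeholders):
# Run the backup at 03:00; ping CronGuard only if it succeeds
0 3 * * * /usr/local/bin/backup.sh && curl -fsS https://cronguard.app/api/ping/your-monitor-id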
The Technical Choices That Mattered
1. Keep the Ping Endpoint Stupid Simple
The ping endpoint is just an HTTP GET. No authentication required (the URL itself is the secret). No JSON body. No headers. Just:
curl https://cronguard.app/api/ping/abc123
Why? Because I wanted it to work from anywhere:
- Bash scripts
- Python scripts
- Inside Docker containers
- Lambda functions
- GitHub Actions
- Cron jobs on a Raspberry Pi
If you can make an HTTP request, you can monitor your job. No SDK. No dependencies. No authentication dance.
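For instance, on a minimal container or a bare-bones device where curl isn't installed, plain wget does the job just as well:
# Same ping, using wget instead of curl; discard the response body
wget -q -O /dev/null https://cronguard.app/api/ping/abc123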
2. Grace Periods Are Critical
Here's a mistake I made early: treating cron schedules as exact.
If a job is scheduled for 03:00 and runs at 03:02, that's not a failure. Servers reboot. Tasks queue. Execution time varies.
CronGuard uses grace periods:
- Daily job at 03:00? Alert if no ping by 04:00.
- Hourly job? Alert if no ping after 70 minutes.
- Every 5 minutes? Alert after 7 minutes.
This eliminated false positives and made the system actually useful.
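Under the hood the check is conceptually tiny: alert only when the time since the last ping exceeds the expected period plus the grace period. Here's a minimal sketch in shell, with hypothetical names and a hypothetical state file (not CronGuard's actual code):
# Sketch of the missed-ping check for a daily job with a one-hour grace period
STATE_FILE=/var/lib/cronguard/last_ping    # stores the Unix timestamp of the last ping
PERIOD=86400                               # expected seconds between pings
GRACE=3600                                 # extra slack before alerting
now=$(date +%s)
last_ping=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
if (( now - last_ping > PERIOD + GRACE )); then
  echo "ALERT: no ping within the expected window"
fi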
3. The First Ping Problem
When you create a new monitor, you haven't sent your first ping yet. Should the system immediately alert that the job is "down"?
No. That's annoying.
Solution: monitors are in a "waiting" state until they receive their first ping. After that, the clock starts ticking.
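In the sketch above, that just means skipping the check entirely until a first ping has ever been recorded:
# Building on the sketch above: no state file means no first ping yet,
# so the monitor stays in "waiting" and nothing is alerted.
STATE_FILE=/var/lib/cronguard/last_ping
if [[ ! -f "$STATE_FILE" ]]; then
  echo "waiting for first ping"
  exit 0
fi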
4. Recovery Notifications Matter
Early version: alert when job stops checking in. Done.
Reality: you also want to know when it starts working again.
Now CronGuard sends recovery notifications too:
- "Backup job missed check-in (expected by 04:00)"
- "Backup job recovered (checked in at 04:15)"
This helps you confirm your fix actually worked.
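You can picture the recovery logic as a simple up/down flag that flips when a ping arrives. Again, hypothetical names, not CronGuard's internals:
# Sketch of handling an incoming ping: if the monitor was marked down, announce recovery
STATUS_FILE=/var/lib/cronguard/status      # holds "up" or "down"
if [[ "$(cat "$STATUS_FILE" 2>/dev/null)" == "down" ]]; then
  echo "Backup job recovered (checked in at $(date +%H:%M))"
fi
echo up > "$STATUS_FILE"
date +%s > /var/lib/cronguard/last_ping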
Lessons from Running It in Production
Lesson 1: Cron Jobs Fail More Than You Think
After deploying CronGuard for my own infrastructure and a handful of clients, I learned that cron jobs are fragile.
Things I've seen cause silent failures:
- Server rebooted, cron daemon didn't restart properly
- Environment variables missing in cron context
- Disk full, job can't write temp files
- Database credentials rotated, job can't connect
- Dependencies updated, script breaks
- Path issues (/usr/local/bin not in cron's PATH; see the snippet below)
None of these would trigger traditional monitoring. All of them stopped critical jobs from running.
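The PATH one deserves special mention: cron usually runs jobs with a minimal environment (often just /usr/bin:/bin), so a script that works fine from your shell can fail when cron runs it. Declaring the environment explicitly at the top of the crontab avoids that whole class of failure:
# At the top of the crontab: spell out what cron won't give you by default
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin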
Lesson 2: Most People Don't Monitor Their Cron Jobs At All
I thought everyone had sophisticated monitoring setups. Turns out, most developers and small teams just... hope their cron jobs work.
They schedule it once, see it run once, and assume it'll run forever. Until it doesn't.
Lesson 3: The Real Value Is Peace of Mind
The best feedback I got wasn't "this caught a bug." It was "I sleep better knowing I'll find out if something stops working."
That's the real value: confidence that silence won't kill production.
When Dead Man's Switch Monitoring Makes Sense
This approach isn't for everything. Here's where it works:
- Scheduled tasks: backups, data syncs, cleanup jobs, report generation
- Async workers: anything you expect to complete on a regular cadence
- Periodic data ingestion: RSS feeds, API polling, scraping
And where it doesn't:
- Real-time services: use traditional uptime monitoring instead
- Event-driven systems: if execution is unpredictable, there's no expected window to miss
Alternative Approaches (and Why I Didn't Use Them)
Option 1: Log parsing
Parse cron output for errors. Problem: no output = no detection.
Option 2: Process monitoring
Check if the process is running. Problem: cron jobs are short-lived; by the time you check, the process has already exited (or never started).
Option 3: File timestamps
Check modification time on output files. Problem: requires filesystem access, brittle.
Option 4: Traditional uptime monitoring
Ping the endpoint yourself. Problem: doesn't tell you if the job ran, just if the endpoint responds.
Dead man's switch monitoring is the only approach that directly answers: "Did my job complete successfully?"
Try It Yourself
I run CronGuard as a free service for basic monitoring (5 monitors, 5-minute checks). If you've got cron jobs, backups, or scheduled tasks you care about, give it a shot.
You can literally start monitoring in 30 seconds:
- Create a monitor
- Copy the ping URL
- Add curl -fsS <url> to the end of your script
That's it. Now you'll know if it stops working.
The Bottom Line
Traditional monitoring watches for things that happen. But some of the most critical failures are things that don't happen.
If you've got scheduled tasks keeping your infrastructure alive, you need to monitor for silence.
Because in production, silence kills.
Questions? Running into silent failures with your own infrastructure? Drop a comment. I'd love to hear your war stories about cron jobs that stopped working and how long it took to notice.
Built something similar? Using a different approach? Let me know. I'm always curious how other teams solve this.