Building High-Availability Failover: 90-Second Auto-Takeover
After moving OpenClaw to a dedicated PC, one question immediately surfaced: what if PC-A goes down? All bots become unreachable, all agents stop. This post documents how I built a simple but effective failover mechanism using PC-B.
Design Philosophy
The core of high availability is simple: have a backup node that automatically takes over when the primary fails.
- PC-A (192.168.x.x): Primary node, running OpenClaw gateway normally
- PC-B (192.168.x.x): Backup node, running the monitoring script on standby and taking over on failure
Goal: auto-takeover within 90 seconds of primary failure; auto-release when primary recovers.
failover-monitor.sh
The core is a bash script running on PC-B:
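# Simplified main loop; check_primary and check_local_running are the helpers shown later in this post
fail_count=0
while true; do
    if check_primary; then          # PC-A port 18789 is reachable
        fail_count=0
        if check_local_running; then
            openclaw gateway stop   # Primary recovered, release
        fi
    else
        fail_count=$((fail_count + 1))
        # 3 consecutive failures: take over, unless the local gateway is already up
        if [ "$fail_count" -ge 3 ] && ! check_local_running; then
            openclaw gateway start
        fi
    fi
    sleep 30
done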
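The script checks every 30 seconds and triggers takeover only after 3 consecutive failures. Depending on where in the 30-second cycle the failure lands, that gives a confirmation window of roughly 60 to 90 seconds, so 90 seconds is the worst case rather than the minimum. This design prevents false switches from network jitter.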
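The timing can be worked through with a small sketch. The function name takeover_window and its parameters are illustrative and not part of the actual script. It assumes each failed check returns almost instantly (a refused connection); if the host is completely down, each check may instead take up to the 5-second timeout used by check_primary, which stretches the window a little.

```shell
#!/usr/bin/env bash
# Hypothetical helper: best- and worst-case delay between the primary
# failing and the backup deciding to take over.
#   $1 = seconds between checks, $2 = consecutive failures required
takeover_window() {
    local interval=$1 threshold=$2
    # Best case: the failure happens just before a check, so only
    # (threshold - 1) further sleeps are needed.
    local best=$(( (threshold - 1) * interval ))
    # Worst case: the failure happens just after a successful check,
    # so a full interval passes before the first failed check.
    local worst=$(( threshold * interval ))
    echo "$best $worst"
}

takeover_window 30 3   # prints "60 90"
```

Raising the threshold or the interval widens both bounds, which is the trade-off between fast takeover and tolerance for jitter.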
Pitfall: pgrep -f Matching Trap
Initially I used pgrep -f "openclaw gateway" to check if the local gateway was running. Seemed fine, right?
Dead wrong. OpenClaw agents can execute shell commands, and pgrep -f matches its pattern against the full command line of every process. When an agent runs a command whose command line contains "openclaw gateway", pgrep -f matches that temporary process and falsely reports the gateway as running.
Solution: Switch to ss -tlnp | grep :18789 and check whether the port is actually listening. A port check is far more reliable than process-name matching.
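# Succeeds if something on this machine is listening on TCP port 18789.
# The trailing space in the pattern avoids matching longer ports such as :187890.
check_local_running() {
    ss -tlnp | grep -q ":18789 "
}

# Succeeds if PC-A accepts a TCP connection on port 18789 within 5 seconds.
check_primary() {
    timeout 5 bash -c "echo > /dev/tcp/192.168.x.x/18789" 2>/dev/null
}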
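Since the whole failover decision rests on this probe, it can help to try it by hand before trusting it in the loop. The sketch below is a parameterised variant of check_primary; the name port_open, its arguments, and the choice of port 1 on localhost as a "known closed" target are assumptions made purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical generalisation of check_primary: succeeds if HOST accepts a
# TCP connection on PORT within WAIT_SECS seconds (default 5).
port_open() {
    local host=$1 port=$2 wait_secs=${3:-5}
    # /dev/tcp is a bash feature, so the probe runs in a child bash.
    # Host and port are passed as arguments rather than spliced into the command.
    timeout "$wait_secs" bash -c 'echo > "/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Port 1 on localhost is assumed to be closed here, for demonstration only.
if port_open 127.0.0.1 1 2; then
    echo "port open"
else
    echo "port closed or unreachable"
fi
```

Note that the probe cannot tell a refused connection from a timeout; both simply count as a failed check, which is all the monitor needs.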
Pitfall: Can't Stop Yourself
The script runs as a systemd service with Restart=always. When the primary recovers, the script needs to stop the local gateway. But if gateway and monitor have tangled systemd dependencies, stopping the gateway might cascade to the monitor.
Solution: Keep the monitor script and gateway service completely independent. Use openclaw gateway start/stop directly instead of managing through systemd.
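As a rough illustration of what "completely independent" can look like, the sketch below prints a unit file for the monitor. The description, script path, and RestartSec value are assumptions; only Restart=always comes from the setup described above. The point is what is absent: there is no Requires=, BindsTo=, or PartOf= line referring to a gateway unit, since those are the directives that propagate a stop or restart from one unit to another.

```shell
#!/usr/bin/env bash
# Prints a hypothetical unit file for the monitor service.
# Path and RestartSec are placeholders, not the author's actual values.
print_monitor_unit() {
    cat <<'EOF'
[Unit]
Description=OpenClaw failover monitor
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/failover-monitor.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

print_monitor_unit
```

The output could be redirected into a file under /etc/systemd/system/ and enabled with systemctl, with the unit name being whatever suits the machine.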
Test Results
| Scenario | Duration |
|---|---|
| Primary failure → Backup takeover | ~65 seconds |
| Primary recovery → Backup release | ~30 seconds |
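The ~65-second takeover is faster than the 90-second worst case because check intervals don't line up with failure timing. It is consistent with the failure landing about 5 seconds before the next check, followed by two more 30-second intervals (5 + 30 + 30).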
The ~30-second release follows from recovery needing only one successful check. Recovery is good news and can be handled optimistically; failure is bad news and needs pessimistic confirmation.
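That asymmetry can be written down as a small decision function, separate from the probing and the sleep. This is a sketch rather than the actual script: the name decide, its argument order, and the "action count" output format are all made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pure function: given the latest probe result and the current
# state, print "<action> <new_fail_count>".
#   $1 = primary_ok (1/0), $2 = local_running (1/0),
#   $3 = fail_count so far, $4 = threshold (default 3)
decide() {
    local primary_ok=$1 local_running=$2 fail_count=$3 threshold=${4:-3}
    if [ "$primary_ok" -eq 1 ]; then
        # Optimistic: a single success resets the counter and releases.
        if [ "$local_running" -eq 1 ]; then echo "stop 0"; else echo "none 0"; fi
    else
        fail_count=$((fail_count + 1))
        # Pessimistic: act only after enough failures in a row.
        if [ "$fail_count" -ge "$threshold" ] && [ "$local_running" -eq 0 ]; then
            echo "start $fail_count"
        else
            echo "none $fail_count"
        fi
    fi
}

decide 0 0 2   # third failure in a row: prints "start 3"
decide 1 1 3   # one successful check: prints "stop 0"
```

Keeping the decision free of side effects makes it easy to test the thresholds without taking a real gateway up or down.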
Imperfect but Sufficient
There are limitations: session context is lost during switchover, config sync is manual, and the health check only tests whether the port accepts connections, with no deeper verification.
But for a personal project, this is more than enough. The art of engineering lies in finding the balance between perfection and practicality.
This little monitor script taught me something fundamental: reliability comes not from single-point perfection, but from system-level redundancy.