DEV Community

linou518

Building High-Availability Failover: 90-Second Auto-Takeover

After moving OpenClaw to a dedicated PC, one question immediately surfaced: what if PC-A goes down? All bots become unreachable, all agents stop. This post documents how I built a simple but effective failover mechanism using PC-B.

Design Philosophy

The core of high availability is simple: have a backup node that automatically takes over when the primary fails.

  • PC-A (192.168.x.x): Primary node, running OpenClaw gateway normally
  • PC-B (192.168.x.x): Backup node, running the monitoring script on standby and taking over when the primary fails

Goal: auto-takeover within 90 seconds of primary failure; auto-release when primary recovers.

failover-monitor.sh

The core is a bash script running on PC-B:

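# Runs on PC-B. check_primary and check_local_running are defined later in this post.
fail_count=0
while true; do
    if check_primary; then
        fail_count=0
        if check_local_running; then
            openclaw gateway stop     # Primary recovered, release
        fi
    else
        fail_count=$((fail_count + 1))
        # 3 consecutive failures: take over, unless the local gateway is already up
        if [ "$fail_count" -ge 3 ] && ! check_local_running; then
            openclaw gateway start
        fi
    fi
    sleep 30
done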

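The script checks every 30 seconds and triggers takeover only after 3 consecutive failures. Depending on where the failure lands in the check cycle, that gives a confirmation window of roughly 60 to 90 seconds (plus any connection timeout), which prevents false switches caused by network jitter.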
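To make the timing concrete, here is a small sketch that computes the best-case and worst-case takeover delay for a given check interval and failure threshold. The helper name takeover_window is made up for illustration and is not part of the original script, and the arithmetic ignores the time each failed connection attempt takes.

```shell
#!/usr/bin/env bash
# Takeover delay bounds for a poll-based monitor (illustrative helper).
# $1 = seconds between checks, $2 = consecutive failures required.
takeover_window() {
    local interval=$1 threshold=$2
    # Best case: the primary dies just before a check, so the first
    # failure is seen almost immediately.
    local min=$(( (threshold - 1) * interval ))
    # Worst case: the primary dies just after a successful check.
    local max=$(( threshold * interval ))
    echo "$min $max"
}

takeover_window 30 3   # prints "60 90"
```

With a 30-second interval and a threshold of 3, any measured takeover between about 60 and 90 seconds is consistent with the design.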

Pitfall: pgrep -f Matching Trap

Initially I used pgrep -f "openclaw gateway" to check if the local gateway was running. Seemed fine, right?

Dead wrong. OpenClaw agents can execute shell commands. When an agent runs a command containing "openclaw" or "gateway," pgrep -f matches that temporary process and falsely reports the gateway as running.

Solution: Switch to ss -tlnp | grep :18789 to check port listening. Port checks are far more reliable than process name matching.

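# Succeeds if something on this machine is listening on the gateway port
check_local_running() {
    ss -tlnp | grep -q ":18789 "
}

# Succeeds if PC-A accepts a TCP connection on the gateway port within 5 seconds
check_primary() {
    timeout 5 bash -c "echo > /dev/tcp/192.168.x.x/18789" 2>/dev/null
}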
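If you want the port match to be a bit stricter than a plain grep, one option is to look only at the local address column. The sketch below assumes the usual `ss -tln` layout, where the fourth field is the local address and port. The function name port_in_listing is hypothetical, and it reads from stdin so it can be tried against a captured line without a running gateway.

```shell
#!/usr/bin/env bash
# port_in_listing PORT: read `ss -tln`-style lines on stdin and succeed only
# if the local address column (4th field) ends in :PORT.
port_in_listing() {
    awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Live usage on the backup node:
#   ss -tln | port_in_listing 18789 && echo "gateway is listening"

# Offline demonstration with a captured sample line:
sample='LISTEN 0 128 0.0.0.0:18789 0.0.0.0:*'
if printf '%s\n' "$sample" | port_in_listing 18789; then
    echo "listening"
fi
```

Like the grep version, this never looks at process names, so a short-lived agent command that mentions "openclaw" cannot produce a false positive.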

Pitfall: Can't Stop Yourself

The script runs as a systemd service with Restart=always. When the primary recovers, the script needs to stop the local gateway. But if the gateway and the monitor have tangled systemd dependencies, stopping the gateway might cascade to the monitor and take it down too.

Solution: Keep the monitor script and gateway service completely independent. Use openclaw gateway start/stop directly instead of managing through systemd.
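As a rough sketch of what "completely independent" can look like, the snippet below prints a unit file for the monitor that has no dependency on any gateway unit. The script path /usr/local/bin/failover-monitor.sh, the description text, and RestartSec=5 are assumptions for illustration; only Restart=always comes from the setup described above.

```shell
#!/usr/bin/env bash
# Print a unit file for the monitor. The install path is hypothetical.
render_monitor_unit() {
    cat <<'EOF'
[Unit]
Description=OpenClaw failover monitor (PC-B)
# Deliberately no Requires=, BindsTo= or PartOf= pointing at a gateway unit,
# so stopping the gateway cannot propagate to this service.
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/failover-monitor.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

render_monitor_unit
```

The gateway itself is then started and stopped by the script with openclaw gateway start/stop, outside systemd's dependency graph.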

Test Results

Scenario | Duration
Primary failure → Backup takeover | ~65 seconds
Primary recovery → Backup release | ~30 seconds

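The 65-second takeover is faster than the nominal 90 seconds because a failure rarely lines up with a check: the first failed check can come anywhere from 0 to 30 seconds after the primary goes down, and the third follows 60 seconds later.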

Release takes about 30 seconds because recovery needs only one successful check. Recovery is good news and can be handled optimistically; failure is bad news and needs pessimistic confirmation.
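The asymmetry can be written down as a small pure function, separate from the loop, which makes it easy to check by hand. This is a sketch: the name decide and its argument order are invented here, and the default threshold of 3 mirrors the script above.

```shell
#!/usr/bin/env bash
# decide PRIMARY_OK FAIL_COUNT LOCAL_RUNNING [THRESHOLD]
# Prints one of: takeover | release | wait
# PRIMARY_OK and LOCAL_RUNNING are 1 (true) or 0 (false);
# FAIL_COUNT already includes the current check.
decide() {
    local primary_ok=$1 fail_count=$2 local_running=$3 threshold=${4:-3}
    if [ "$primary_ok" -eq 1 ]; then
        # Optimistic: a single successful check is enough to hand back.
        if [ "$local_running" -eq 1 ]; then echo release; else echo wait; fi
    elif [ "$fail_count" -ge "$threshold" ] && [ "$local_running" -eq 0 ]; then
        # Pessimistic: act only after THRESHOLD consecutive failures.
        echo takeover
    else
        echo wait
    fi
}

decide 0 2 0   # wait
decide 0 3 0   # takeover
decide 1 0 1   # release
```

Two failures are not enough to act, the third triggers takeover, and a single success while the backup is active triggers release.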

Imperfect but Sufficient

There are limitations: session context is lost during switchover, config sync is manual, and the monitor only checks the port, with no deeper health verification.

But for a personal project, this is more than enough. The art of engineering lies in finding the balance between perfection and practicality.

This little monitor script taught me something fundamental: reliability comes not from single-point perfection, but from system-level redundancy.
