Chapter 8: Monitoring and Debugging
🎯 Learning Objective: Build a comprehensive OpenClaw monitoring system, master performance debugging techniques, and implement rapid fault diagnosis with automated operations
📊 Monitoring System Overview
A production-grade OpenClaw deployment needs monitoring in four areas:
- 🔍 Real-Time Monitoring: System status, performance metrics, error rates
- 📝 Log Management: Structured logging, centralized collection, intelligent analysis
- ⚠️ Alerting: Anomaly detection, tiered alerts, automated response
- 📈 Visualization: Dashboards, trend analysis, capacity planning
🏗️ Monitoring Architecture
8.1 Monitoring Layer Model
┌─────────────────────────────────────────┐
│         Application Monitoring          │
│   Agent perf | Session status | Tools   │
├─────────────────────────────────────────┤
│           Gateway Monitoring            │
│   Connections | Latency | Throughput    │
├─────────────────────────────────────────┤
│            System Monitoring            │
│      CPU | Memory | Disk | Network      │
├─────────────────────────────────────────┤
│        Infrastructure Monitoring        │
│       Servers | Network | Storage       │
└─────────────────────────────────────────┘
📋 Built-in OpenClaw Monitoring
8.3 Gateway Status Monitoring
Basic Status Queries
# Full system status
openclaw status
# Detailed monitoring info
openclaw status --all --deep
# JSON output (for scripting)
openclaw status --json
Key Metrics
{
  "gateway": {
    "uptime": "72h 15m",
    "version": "2026.2.9",
    "connections": 42,
    "requests_total": 15847,
    "requests_per_minute": 23.4,
    "memory_usage": "512MB",
    "cpu_usage": "15%"
  },
  "agents": {
    "total": 8,
    "active": 6,
    "sessions": 299,
    "avg_response_time": "1.2s"
  }
}
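The sample output reports several values as display strings ("72h 15m", "512MB", "15%"), so a script has to normalize them before comparing against thresholds. Below is a minimal Python sketch, assuming `openclaw status --json` returns the shape shown above. The helper names and the threshold values are hypothetical and not part of OpenClaw.

```python
import json
import re

# Trimmed copy of the sample status output shown above.
SAMPLE_STATUS = """
{
  "gateway": {"uptime": "72h 15m", "memory_usage": "512MB", "cpu_usage": "15%"},
  "agents": {"total": 8, "active": 6, "avg_response_time": "1.2s"}
}
"""


def parse_uptime_minutes(text):
    """Convert a string such as '72h 15m' into total minutes."""
    hours = re.search(r"(\d+)h", text)
    minutes = re.search(r"(\d+)m", text)
    total = int(hours.group(1)) * 60 if hours else 0
    return total + (int(minutes.group(1)) if minutes else 0)


def parse_megabytes(text):
    """Convert '512MB' or '2GB' into megabytes."""
    match = re.fullmatch(r"([\d.]+)\s*(MB|GB)", text.strip())
    if not match:
        raise ValueError(f"unrecognized memory value: {text!r}")
    value = float(match.group(1))
    return value * 1024 if match.group(2) == "GB" else value


def parse_percent(text):
    """Convert '15%' into 15.0."""
    return float(text.strip().rstrip("%"))


def evaluate(status, max_cpu=80.0, max_memory_mb=1024.0):
    """Return warnings for gateway values above the given (hypothetical) limits."""
    gateway = status["gateway"]
    warnings = []
    cpu = parse_percent(gateway["cpu_usage"])
    memory = parse_megabytes(gateway["memory_usage"])
    if cpu > max_cpu:
        warnings.append(f"CPU usage {cpu:.0f}% exceeds {max_cpu:.0f}%")
    if memory > max_memory_mb:
        warnings.append(f"Memory usage {memory:.0f}MB exceeds {max_memory_mb:.0f}MB")
    return warnings


status = json.loads(SAMPLE_STATUS)
print(evaluate(status))                # default limits
print(evaluate(status, max_cpu=10.0))  # stricter CPU limit
```

In a real script, the JSON would come from the CLI output rather than an embedded string.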
8.4 Health Checks
Automated Health Checks
# Run a full health check
openclaw doctor --non-interactive
# Check specific components
openclaw doctor --check-channels
openclaw doctor --check-models
openclaw doctor --check-security
📝 Log Management
8.5 Log Configuration
{
  "logging": {
    "level": "info",
    "format": "structured",
    "outputs": [
      {
        "type": "file",
        "path": "/var/log/openclaw/gateway.log",
        "rotation": {
          "maxSize": "100MB",
          "maxFiles": 10,
          "compress": true
        }
      }
    ]
  }
}
Log Viewing Commands
# Real-time log stream
openclaw logs --follow
# Filter error logs
openclaw logs --level error --since "1h"
# Agent-specific logs
openclaw logs --agent main --limit 100
# Filter by channel
openclaw logs --channel telegram --since "2026-02-13"
8.6 ELK Stack Integration
# docker-compose-logging.yml
version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      # Elasticsearch 8.x enables security by default; disabled here for local testing only
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
      - /var/log/openclaw:/logs:ro
    depends_on:
      - elasticsearch
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
8.7 Structured Logging Best Practices
{
  "timestamp": "2026-02-13T11:02:45.123Z",
  "level": "INFO",
  "component": "gateway",
  "agent_id": "main",
  "session_id": "abc123",
  "channel": "telegram",
  "action": "message_received",
  "duration_ms": 1250,
  "metadata": {
    "tool_calls": 3,
    "tokens_used": 1847,
    "model": "claude-sonnet-4"
  }
}
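If you write your own tools or plugins around OpenClaw, emitting records in the same shape keeps them searchable alongside the gateway logs. Here is a minimal Python sketch using only the standard `logging` module. The field names follow the example record above; the `JsonFormatter` class and the logger name are hypothetical, and mapping the log message to `action` is an assumption made for this sketch.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Optional context fields, passed through `extra=` on the logging call.
    CONTEXT_FIELDS = (
        "component", "agent_id", "session_id", "channel", "duration_ms", "metadata",
    )

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            # Assumption: the log message doubles as the "action" field.
            "action": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("openclaw.example")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "message_received",
        extra={
            "component": "gateway",
            "agent_id": "main",
            "session_id": "abc123",
            "channel": "telegram",
            "duration_ms": 1250,
            "metadata": {"tool_calls": 3, "tokens_used": 1847},
        },
    )
```

Keeping one JSON object per line makes the file easy to ship through Logstash or filter with `jq`.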
📈 Performance Monitoring and Debugging
8.8 Prometheus Integration
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'openclaw-gateway'
    static_configs:
      - targets: ['localhost:18789']
    metrics_path: '/metrics'
    scrape_interval: 10s
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
Custom Metrics Endpoint
# OpenClaw Prometheus metrics endpoint
curl http://localhost:18789/metrics
# Example metrics output
openclaw_gateway_requests_total{channel="telegram"} 15847
openclaw_gateway_response_time_seconds{quantile="0.5"} 1.2
openclaw_gateway_response_time_seconds{quantile="0.95"} 3.5
openclaw_agent_sessions_active{agent="main"} 12
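For quick ad hoc checks without a Prometheus server, the text format above is simple enough to parse directly. The following Python sketch handles the subset of the exposition format shown in the example (a name, optional labels, and a value). The function names are hypothetical, and a production scraper should use a proper Prometheus client library instead.

```python
import re

# Sample copied from the example metrics output above.
SAMPLE = '''\
openclaw_gateway_requests_total{channel="telegram"} 15847
openclaw_gateway_response_time_seconds{quantile="0.5"} 1.2
openclaw_gateway_response_time_seconds{quantile="0.95"} 3.5
openclaw_agent_sessions_active{agent="main"} 12
'''

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def find_value(samples, name, **labels):
    """Return the first value whose name and labels match, or None."""
    for sample_name, sample_labels, value in samples:
        if sample_name == name and all(sample_labels.get(k) == v for k, v in labels.items()):
            return value
    return None


samples = parse_metrics(SAMPLE)
p95 = find_value(samples, "openclaw_gateway_response_time_seconds", quantile="0.95")
print(f"p95 response time: {p95}s")
```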
8.9 Performance Debugging Tools
# Agent performance profiling
openclaw agent --profile --message "test query"
# Memory search performance test
openclaw memory benchmark --queries 1000
# Gateway load test
openclaw gateway --benchmark --duration 60s
🚨 Alerting and Notification
8.10 Alert Rule Configuration
Prometheus Alert Rules
# openclaw-alerts.yml (reference this file under rule_files in prometheus.yml)
groups:
  - name: openclaw.rules
    rules:
      - alert: HighErrorRate
        # Fires above 0.1 errors per second (per series), averaged over 5 minutes
        expr: rate(openclaw_gateway_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: HighResponseTime
        expr: openclaw_gateway_response_time_seconds{quantile="0.95"} > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
      - alert: AgentDown
        expr: openclaw_agent_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent is down"
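The `for:` clause is what keeps short spikes from paging anyone: an alert stays pending until its expression has been true continuously for the whole hold period, and any evaluation below the threshold resets the clock. The Python sketch below illustrates that behavior for the HighErrorRate rule (threshold 0.1, hold 2 minutes). The sample values and the function name are made up for illustration; this is a simplified model, not Prometheus's implementation.

```python
def alert_states(samples, threshold, for_seconds):
    """Classify each evaluation as 'inactive', 'pending', or 'firing'.

    samples: chronological list of (timestamp_seconds, value) pairs,
             one per rule evaluation.
    """
    states = []
    breach_started = None  # timestamp of the first evaluation in the current breach
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            held_for = timestamp - breach_started
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            breach_started = None  # condition cleared, so the hold timer resets
            states.append("inactive")
    return states


# Hypothetical error rates (errors per second), evaluated once a minute.
error_rates = [
    (0, 0.02), (60, 0.15), (120, 0.20), (180, 0.05),
    (240, 0.30), (300, 0.25), (360, 0.40),
]
for (timestamp, rate), state in zip(error_rates, alert_states(error_rates, 0.1, 120)):
    print(f"t={timestamp:>3}s rate={rate:.2f} -> {state}")
```

In this run the first breach clears before two minutes have passed, so only the second, sustained breach reaches the firing state.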
8.11 Intelligent Alerting Strategy
Tiered Alerting and Escalation
{
  "alerting": {
    "levels": [
      {
        "name": "info",
        "channels": ["log"],
        "escalation": false
      },
      {
        "name": "warning",
        "channels": ["slack", "email"],
        "escalation": {
          "after": "15m",
          "to": "critical"
        }
      },
      {
        "name": "critical",
        "channels": ["telegram", "phone", "pager"],
        "escalation": {
          "after": "5m",
          "to": "emergency"
        }
      }
    ]
  }
}
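To see how the escalation chain plays out, here is a small Python sketch that computes which level an unacknowledged alert has reached after a given time. It assumes the timers are cumulative (15 minutes at warning, then 5 more at critical), which the configuration above does not state explicitly. Note also that the config escalates to an "emergency" level that is not defined in the `levels` list, so the sketch treats it as a terminal level with no channels. All function names are hypothetical.

```python
import re

# Mirrors the "levels" list from the configuration above.
LEVELS = {
    "info": {"channels": ["log"], "escalation": False},
    "warning": {
        "channels": ["slack", "email"],
        "escalation": {"after": "15m", "to": "critical"},
    },
    "critical": {
        "channels": ["telegram", "phone", "pager"],
        "escalation": {"after": "5m", "to": "emergency"},
    },
}

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert '15m', '30s', or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def effective_level(start_level, unacknowledged_seconds):
    """Follow the escalation chain for an alert nobody has acknowledged."""
    level = start_level
    remaining = unacknowledged_seconds
    while True:
        rule = LEVELS.get(level, {}).get("escalation")
        if not rule:
            return level  # no further escalation configured
        wait = parse_duration(rule["after"])
        if remaining < wait:
            return level
        remaining -= wait
        level = rule["to"]


def channels_for(level):
    """Channels notified at a level; empty for levels missing from the config."""
    return LEVELS.get(level, {}).get("channels", [])


for minutes in (10, 15, 20):
    level = effective_level("warning", minutes * 60)
    print(f"{minutes} min unacknowledged -> {level} via {channels_for(level)}")
```

The empty channel list for "emergency" is a hint that a fourth level definition is worth adding before relying on this chain in production.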
📊 Visualization Dashboards
8.12 Grafana Dashboard
{
  "dashboard": {
    "title": "OpenClaw System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(openclaw_gateway_requests_total[5m])",
            "legendFormat": "{{channel}}"
          }
        ]
      },
      {
        "title": "Response Time Percentiles",
        "type": "graph",
        "targets": [
          {
            "expr": "openclaw_gateway_response_time_seconds{quantile=\"0.5\"}",
            "legendFormat": "p50"
          },
          {
            "expr": "openclaw_gateway_response_time_seconds{quantile=\"0.95\"}",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Active Sessions",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(openclaw_agent_sessions_active)"
          }
        ]
      }
    ]
  }
}
🔧 Fault Diagnosis and Debugging
8.14 Diagnostic Decision Tree
Fault Report →
├── User cannot access?
│   ├── Check Gateway status → openclaw status
│   ├── Check channel connections → openclaw status --channels
│   └── Check network connectivity → ping/traceroute
│
├── Slow response?
│   ├── Check system load → top/htop
│   ├── Check Agent performance → openclaw logs --level performance
│   └── Check memory usage → openclaw memory stats
│
└── Feature malfunction?
    ├── Check error logs → openclaw logs --level error
    ├── Check configuration → openclaw doctor
    └── Check model status → openclaw status --models
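The tree above can also be kept as data, which makes it easy to print a runbook for on-call staff or to drive a chat command. A minimal Python sketch follows; the symptom keys and function name are hypothetical, while the check descriptions and commands are copied from the tree as written.

```python
# Each symptom maps to an ordered list of (check, command) pairs.
DIAGNOSTIC_TREE = {
    "cannot_access": [
        ("Check Gateway status", "openclaw status"),
        ("Check channel connections", "openclaw status --channels"),
        ("Check network connectivity", "ping/traceroute"),
    ],
    "slow_response": [
        ("Check system load", "top/htop"),
        ("Check Agent performance", "openclaw logs --level performance"),
        ("Check memory usage", "openclaw memory stats"),
    ],
    "feature_malfunction": [
        ("Check error logs", "openclaw logs --level error"),
        ("Check configuration", "openclaw doctor"),
        ("Check model status", "openclaw status --models"),
    ],
}


def checks_for(symptom):
    """Return the ordered checks for a symptom, or raise ValueError if unknown."""
    try:
        return DIAGNOSTIC_TREE[symptom]
    except KeyError:
        known = ", ".join(sorted(DIAGNOSTIC_TREE))
        raise ValueError(f"unknown symptom {symptom!r}; expected one of: {known}") from None


for step, (check, command) in enumerate(checks_for("slow_response"), start=1):
    print(f"{step}. {check}: {command}")
```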
Automated Diagnostic Script
#!/bin/bash
# openclaw-troubleshoot.sh
echo "🔍 OpenClaw Automated Diagnostics"
echo "================================="

# 1. Basic connectivity check
echo "📡 Checking Gateway connectivity..."
# -f makes curl fail on HTTP errors; --max-time avoids hanging on a stuck gateway
if ! curl -sf --max-time 5 http://localhost:18789/health > /dev/null; then
    echo "❌ Gateway not responding, checking service status"
    systemctl --user status openclaw-gateway
    exit 1
fi
echo "✅ Gateway running normally"

# 2. Channel status check
echo "📱 Checking channel status..."
CHANNELS=$(openclaw status --json | jq -r '.channels | to_entries[] | select(.value.status != "OK") | .key')
if [[ -n "$CHANNELS" ]]; then
    echo "⚠️ Channel issues found: $CHANNELS"
else
    echo "✅ All channels healthy"
fi

# 3. Performance metrics check
echo "⚡ Checking performance metrics..."
echo " - System load: $(uptime)"
echo " - Memory usage: $(free -h | awk '/^Mem/ {print $3"/"$2}')"

# 4. Error log analysis
echo "📋 Checking recent errors..."
ERROR_COUNT=$(openclaw logs --level error --since "1h" --json | jq 'length')
# Fall back to 0 if the count could not be read
if [[ "${ERROR_COUNT:-0}" -gt 10 ]]; then
    echo "⚠️ Found $ERROR_COUNT errors. Recent errors:"
    openclaw logs --level error --limit 5
fi
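The `jq` filter in step 2 is the densest line of the script: it walks the `channels` object and keeps the name of every channel whose `status` is anything other than "OK". The same logic in Python may be easier to read and extend. Note that the `channels` key is what the script expects from `openclaw status --json`; the sample data below is invented for illustration, and the function name is hypothetical.

```python
import json

# Invented sample; only the "channels" -> name -> "status" shape is taken
# from the jq filter in the script above.
SAMPLE_STATUS = """
{
  "channels": {
    "telegram": {"status": "OK"},
    "slack": {"status": "DEGRADED"},
    "discord": {}
  }
}
"""


def unhealthy_channels(status):
    """Return names of channels whose status is missing or not 'OK'."""
    channels = status.get("channels") or {}
    return [name for name, info in channels.items() if info.get("status") != "OK"]


status = json.loads(SAMPLE_STATUS)
problems = unhealthy_channels(status)
if problems:
    print("Channel issues found:", ", ".join(problems))
else:
    print("All channels healthy")
```

Like the `jq` version, a channel with no `status` field at all is reported as unhealthy rather than silently skipped.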
🤖 Automated Operations
8.16 Self-Healing System
#!/bin/bash
# auto-heal.sh: OpenClaw self-healing script
HEALTH_CHECK_URL="http://localhost:18789/health"
# Alert endpoint must be provided via the environment; abort early if it is missing
ALERT_WEBHOOK="${ALERT_WEBHOOK:?Set ALERT_WEBHOOK to your alerting endpoint}"
MAX_RETRIES=3

check_health() {
    local response
    # --max-time keeps a hung gateway from blocking the loop
    response=$(curl -s --max-time 10 -w "%{http_code}" -o /dev/null "$HEALTH_CHECK_URL")
    [[ "$response" == "200" ]]
}

restart_gateway() {
    echo "$(date): Anomaly detected, preparing to restart..."
    openclaw gateway stop --graceful --timeout 30s
    sleep 5
    openclaw gateway start --background
    sleep 10
    if check_health; then
        echo "$(date): Gateway restart successful"
        return 0
    else
        echo "$(date): Gateway restart failed"
        return 1
    fi
}

# Main loop
while true; do
    if ! check_health; then
        for i in $(seq 1 "$MAX_RETRIES"); do
            echo "$(date): Restart attempt ($i/$MAX_RETRIES)"
            if restart_gateway; then break; fi
            if [[ $i -eq $MAX_RETRIES ]]; then
                echo "$(date): Restart failed, sending alert"
                curl -s -X POST "$ALERT_WEBHOOK" \
                    -H "Content-Type: application/json" \
                    -d '{"level":"critical","message":"OpenClaw Gateway restart failed"}'
            fi
            sleep 300
        done
    fi
    sleep 60
done
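The retry logic is the part of this script most worth testing before trusting it in production, and that is hard to do in a shell loop that sleeps for minutes. One option is to model the same flow in Python with the health check, restart, and alert passed in as functions, so they can be swapped for fakes. This is a sketch under that assumption, with hypothetical names. Unlike the shell script, it does not wait again after the final failed attempt.

```python
import time


def heal(check_health, restart, alert, max_retries=3, retry_delay=300, sleep=time.sleep):
    """Try to bring the gateway back; return True if it ends up healthy.

    check_health: callable returning True when the gateway is healthy
    restart:      callable that performs one restart attempt
    alert:        callable taking a message, used after the last failed attempt
    """
    if check_health():
        return True
    for attempt in range(1, max_retries + 1):
        restart()
        if check_health():
            return True
        if attempt < max_retries:
            sleep(retry_delay)  # back off before the next attempt
    alert("OpenClaw Gateway restart failed")
    return False


if __name__ == "__main__":
    # Simulated gateway that recovers after the second restart.
    state = {"restarts": 0}

    def fake_restart():
        state["restarts"] += 1

    def fake_health():
        return state["restarts"] >= 2

    recovered = heal(fake_health, fake_restart, alert=print, sleep=lambda seconds: None)
    print(f"recovered={recovered} after {state['restarts']} restarts")
```

In a real deployment the three callables would wrap the `curl` health check, the `openclaw gateway` commands, and the webhook call from the script above.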
📋 Chapter Summary
Key Takeaways
- Layered Monitoring: Application, Gateway, System, Infrastructure layers
- Full Observability: Metrics, Logs, Traces — the three pillars
- Intelligent Alerting: Tiered alerts, escalation, silence windows
- Automated Ops: Self-healing, auto-scaling, backup & recovery
Monitoring Checklist
- ✅ Basic Metrics: CPU, memory, disk, network
- ✅ Application Metrics: Request rate, response time, error rate, concurrency
- ✅ Business Metrics: User activity, token usage, channel distribution
- ✅ Security Metrics: Auth failures, anomalous access, permission changes
Practice Tips
- Start with basic monitoring, then progressively enhance
- Establish standardized alert response procedures
- Regularly conduct disaster recovery drills
- Continuously optimize monitoring strategies and alert thresholds