Lalit Mishra

Observability II – Server-Side Metrics with Prometheus and Grafana for SFU Health

Opening Context – Why Client Metrics Are Not Enough

In the previous installment of this series, we established that the client is the ultimate arbiter of quality. Through getStats(), we can detect freezing video, robotic audio, and rising jitter. However, client-side telemetry suffers from a critical limitation: it describes symptoms, not causes. A sudden spike in packet loss across 500 users in a specific region might be a local ISP outage, or it might be a Selective Forwarding Unit (SFU) undergoing a retransmission storm due to CPU saturation. To the client, the experience is identical: degraded media. To the platform engineer, the remediation paths are diametrically opposed.

Client metrics are effectively lagging indicators of infrastructure health. By the time a client reports a freeze, the degradation has already occurred. To achieve five-nines reliability (99.999%), we need leading indicators: signals that predict failure before users perceive it.

Server-side observability provides this causal visibility. While the client tells you that a call failed, the server tells you why. Was the signaling node out of file descriptors? Did the SFU’s event loop stall processing DTLS handshakes? Did a NACK storm saturate the egress bandwidth? This article focuses on instrumenting the backend and media tier to expose these hidden architectural stresses, enabling a shift from reactive firefighting to proactive capacity management.


Observability Pillars in Real-Time Systems

The standard observability triad—Metrics, Logs, and Traces—applies to WebRTC, but the implementation priority differs significantly from traditional REST APIs.

Metrics are the lifeblood of real-time systems. They are aggregatable, cheap to store, and queryable over long time horizons. In a high-throughput SFU environment where thousands of packets are processed per second, metrics provide the only viable way to visualize aggregate health (e.g., "Total NACKs per second" or "Average ICE negotiation time"). This article focuses exclusively on metrics-based observability.

Logs in WebRTC are prohibitively expensive for the media plane. Logging every dropped packet or retransmission request would saturate disk I/O faster than the network interface. Logs must be reserved for control-plane events: signaling errors, authentication failures, and ICE state transitions. They are reactive tools used for post-mortem analysis of specific session failures, not for real-time health monitoring.

Tracing, while powerful in microservices, faces unique challenges in UDP-based media paths. RTP packets lack the convenient header space for trace IDs found in HTTP, and the overhead of sampling distributed traces across a mesh of media servers often outweighs the benefit.

Therefore, for the operational architect, the primary monitoring surface is a robust time-series metric pipeline. We will implement this using Prometheus for ingestion and storage, and Grafana for visualization and correlation.


Instrumenting Python / Quart Backend with prometheus_client

The signaling plane is the orchestrator of WebRTC. If signaling latency increases, room joins become sluggish, and users abandon calls before media flows. We will instrument a Python Quart backend (an asyncio-based framework with a Flask-compatible API) using the official prometheus_client library.

The goal is to expose a /metrics endpoint that a Prometheus scraper can poll. We need to track three distinct categories of signals:

  1. Throughput: Active connections and message rates.
  2. Latency: Time taken to process joins, offers, and answers.
  3. Errors: Authentication failures and validation errors.

Production Instrumentation Example

We structure the instrumentation as a middleware layer to keep business logic clean. We define our metrics globally to ensure they persist across the application lifecycle.

import time
from quart import Quart, request, Response, websocket
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Quart(__name__)

# --- Metric Definitions ---

# Gauges: For values that go up and down (State)
ACTIVE_WEBSOCKETS = Gauge(
    'signaling_websockets_active',
    'Number of currently active WebSocket signaling connections',
    ['region']
)

ACTIVE_ROOMS = Gauge(
    'signaling_rooms_active',
    'Number of active rooms with at least one participant',
    ['region']
)

# Counters: For cumulative events (Throughput/Errors)
SIGNALING_MESSAGES_TOTAL = Counter(
    'signaling_messages_total',
    'Total number of signaling messages processed',
    ['msg_type', 'direction'] # direction: inbound/outbound
)

AUTH_FAILURES_TOTAL = Counter(
    'signaling_auth_failures_total',
    'Total number of authentication failures',
    ['reason']
)

# Histograms: For distribution of duration (Latency)
# Buckets optimized for sub-second signaling operations
ICE_NEGOTIATION_DURATION = Histogram(
    'signaling_ice_negotiation_seconds',
    'Time taken from Offer to ICE Connected state',
    ['status'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

REQUEST_PROCESSING_TIME = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0]
)

# --- Middleware & Endpoints ---

@app.before_request
async def start_timer():
    request.start_time = time.time()

@app.after_request
async def record_metrics(response):
    if request.endpoint == 'metrics':
        return response

    latency = time.time() - request.start_time
    REQUEST_PROCESSING_TIME.labels(
        method=request.method,
        endpoint=request.endpoint or 'unmatched'  # unmatched routes have no endpoint
    ).observe(latency)

    return response

@app.route('/metrics')
async def metrics():
    # Expose the standard Prometheus metrics endpoint
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

# --- WebSocket Handler Example ---

@app.websocket('/ws/signaling')
async def signaling_socket():
    ACTIVE_WEBSOCKETS.labels(region='us-east-1').inc()
    try:
        while True:
            data = await websocket.receive()
            msg_type = parse_message_type(data) # Hypothetical helper

            SIGNALING_MESSAGES_TOTAL.labels(
                msg_type=msg_type, 
                direction='inbound'
            ).inc()

            # Business logic processing (process_signal is a hypothetical handler)...
            await process_signal(data)

    except Exception:
        # Log error
        pass
    finally:
        ACTIVE_WEBSOCKETS.labels(region='us-east-1').dec()

# --- Simulation of ICE tracking ---

async def handle_ice_completion(session_id, start_time):
    duration = time.time() - start_time
    ICE_NEGOTIATION_DURATION.labels(status='success').observe(duration)

if __name__ == "__main__":
    app.run(port=5000)

Key Architectural Considerations:

  • Label Cardinality: Notice we use region and msg_type as labels. Never use user_id or room_id as a label. High-cardinality labels explode the time-series database (TSDB) index, causing Prometheus to consume excessive memory and crash. A defensive pattern for keeping label values bounded is sketched after this list.
  • Histogram Buckets: The default Prometheus buckets are optimized for general web traffic (up to 10s). For signaling, we care about the 50ms to 500ms range. We explicitly redefine buckets to capture the granularity of "fast" vs "slow" negotiations.
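
To make the cardinality rule concrete, here is a minimal sketch of one defensive pattern: client-supplied message types are clamped to a fixed whitelist before being used as a label value, so a misbehaving client cannot mint unbounded time series. The ALLOWED_MSG_TYPES set and the normalize_msg_type helper are illustrative names, not part of the example above.

from prometheus_client import Counter

# Hypothetical whitelist; anything outside it collapses into a single 'other' bucket
ALLOWED_MSG_TYPES = {'join', 'offer', 'answer', 'candidate', 'leave'}

# Same counter definition as in the signaling example above
SIGNALING_MESSAGES_TOTAL = Counter(
    'signaling_messages_total',
    'Total number of signaling messages processed',
    ['msg_type', 'direction']
)

def normalize_msg_type(raw_type: str) -> str:
    # Clamp arbitrary client input to a bounded label set
    return raw_type if raw_type in ALLOWED_MSG_TYPES else 'other'

def record_inbound(raw_type: str) -> None:
    SIGNALING_MESSAGES_TOTAL.labels(
        msg_type=normalize_msg_type(raw_type),
        direction='inbound'
    ).inc()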

Monitoring SFU Health – Janus and Mediasoup

The signaling server tells us about user intent; the SFU (Selective Forwarding Unit) tells us about media reality. Whether you use Janus, Mediasoup, or Jitsi, the SFU is a black box that consumes CPU to route UDP packets.

Most SFUs do not expose a /metrics endpoint natively in Prometheus format. They usually offer an Admin API (Janus) or an Observer/Stats API (Mediasoup) that returns JSON. We must implement the "Exporter Pattern"—a lightweight sidecar service that polls the SFU and translates JSON stats into Prometheus metrics.

The "Leading Indicators" of SFU Failure

Before an SFU crashes or drops packets, specific metrics spike:

  1. NACK Count (Negative Acknowledgement): A receiver requests a packet retransmission. A spike here indicates network congestion.
  2. PLI/FIR Count (Picture Loss Indication / Full Intra Request): A receiver requests a full keyframe. This is expensive: the publisher's encoder must produce a fresh keyframe, which inflates bitrate and increases the forwarding load on the SFU. A sustained spike means receivers are losing too much data to keep decoding.
  3. Event Loop Lag: If the SFU is single-threaded (like generic Janus plugins) or uses a worker-per-core model (Mediasoup), measuring the delay in the event loop reveals CPU saturation before top/htop does. A minimal Python sketch of this measurement follows this list.
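
The SFUs themselves are written in C or C++ with Node.js/Rust control layers, but the same measurement applies to any asyncio-based Python component in the path, such as the Quart signaling node. Below is a minimal sketch, assuming metrics are exposed with prometheus_client; the metric name and the 0.5-second probe interval are arbitrary choices, and the before_serving hook shown in the comment is one possible way to start it under Quart.

import asyncio
import time
from prometheus_client import Gauge

# How late the event loop actually wakes a task that asked to sleep for `interval`
EVENT_LOOP_LAG_SECONDS = Gauge(
    'event_loop_lag_seconds',
    'Observed scheduling delay of the asyncio event loop'
)

async def monitor_event_loop_lag(interval: float = 0.5):
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time spent waiting to be scheduled
        lag = max(0.0, time.monotonic() - start - interval)
        EVENT_LOOP_LAG_SECONDS.set(lag)

# In Quart, schedule the probe once the server starts:
# @app.before_serving
# async def start_lag_probe():
#     asyncio.create_task(monitor_event_loop_lag())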

Python Exporter for Janus (Conceptual Implementation)

This script runs alongside the Janus instance, polling its Admin API every 5 seconds.

import time
import requests
from prometheus_client import start_http_server, Gauge

# Define SFU Metrics
# Bitrate and CPU are instantaneous gauges; NACKs mirror Janus's own cumulative
# counter, so rate() still applies to sfu_nacks_total in PromQL.
SFU_BITRATE_IN = Gauge('sfu_bitrate_in_bits', 'Current ingress bitrate in bits per second')
SFU_BITRATE_OUT = Gauge('sfu_bitrate_out_bits', 'Current egress bitrate in bits per second')
SFU_NACKS_TOTAL = Gauge('sfu_nacks_total', 'Cumulative NACKs received, as reported by the SFU')
SFU_CPU_USAGE = Gauge('sfu_cpu_usage_percent', 'SFU process CPU usage')
SFU_HANDLES = Gauge('sfu_active_handles', 'Number of active WebRTC handles')

JANUS_ADMIN_URL = "http://localhost:7088/admin"
ADMIN_SECRET = "janusoverlord"

def fetch_janus_stats():
    payload = {
        "janus": "get_status",
        "transaction": "monitor_req",
        "admin_secret": ADMIN_SECRET
    }

    try:
        response = requests.post(JANUS_ADMIN_URL, json=payload, timeout=2)
        data = response.json()

        if "data" in data:
            # Janus exposes active handles (roughly equivalent to peer connections)
            SFU_HANDLES.set(data["data"].get("sessions", 0))

            # Note: Detailed stream stats often require iterating over sessions
            # In production, you might query specific loop stats or use the event handler

    except Exception as e:
        print(f"Error scraping Janus: {e}")

def main():
    # Start Prometheus exporter server on port 8000
    start_http_server(8000)
    print("SFU Exporter running on :8000")

    while True:
        fetch_janus_stats()
        # In a real exporter, we would also read /proc or use psutil 
        # to get the specific CPU usage of the SFU process.
        time.sleep(5)

if __name__ == "__main__":
    main()

For Mediasoup, the approach is slightly different. Since Mediasoup runs as a Node.js (or Rust) library controlling native worker processes, you typically hook into observer events such as worker.observer.on('newrouter') and transport.on('trace'), and push metrics to a central aggregator rather than polling.
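
The aggregator side of that push model can live anywhere. Here is a minimal Python sketch of one, assuming the Node.js workers POST a small JSON summary to it; the /push/mediasoup route and the payload fields (workerPid, consumers) are invented for illustration and are not part of the Mediasoup API.

from quart import Quart, Response, jsonify, request
from prometheus_client import Gauge, generate_latest, CONTENT_TYPE_LATEST

app = Quart(__name__)

# One label per worker process keeps cardinality bounded (one worker per core)
MEDIASOUP_CONSUMERS = Gauge(
    'mediasoup_consumers_active',
    'Active consumers per mediasoup worker',
    ['worker_pid']
)

@app.route('/push/mediasoup', methods=['POST'])
async def receive_worker_stats():
    # The Node.js side builds this payload from its observer events
    stats = await request.get_json()
    MEDIASOUP_CONSUMERS.labels(
        worker_pid=str(stats.get('workerPid', 'unknown'))
    ).set(stats.get('consumers', 0))
    return jsonify(ok=True)

@app.route('/metrics')
async def metrics():
    # Prometheus scrapes this aggregator like any other exporter
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)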


Prometheus Architecture and Scrape Strategy

Prometheus uses a pull model. It wakes up at a configured interval, reaches out to your targets (Signaling API, SFU Exporter), and retrieves the current state of all metrics.

Scrape Configuration (prometheus.yml)

The configuration defines who to scrape and how often. For WebRTC, a 15-second scrape interval is standard. 1 minute is too slow (media degrades in seconds); 1 second is too heavy for the TSDB.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # 1. Signaling Tier
  - job_name: 'signaling_backend'
    static_configs:
      - targets: ['10.0.1.5:5000', '10.0.1.6:5000']
        labels:
          service: 'signaling-quart'
          env: 'production'

  # 2. Media Tier (SFU Exporters)
  - job_name: 'sfu_nodes'
    static_configs:
      - targets: ['10.0.2.10:8000', '10.0.2.11:8000']
        labels:
          service: 'janus-gateway'
          region: 'us-east-1'

  # 3. Infrastructure (Node Exporter)
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['10.0.2.10:9100', '10.0.2.11:9100']

Federation Strategy:
In a global WebRTC deployment, you cannot have a single Prometheus instance scraping servers in Tokyo, Frankfurt, and Virginia. The latency is too high. The architecture requires a Prometheus instance in each region scraping local targets. A central "Federated Prometheus" or a solution like Thanos/Cortex then aggregates specific high-level metrics (e.g., "Total Users Global") from the regional instances for a single-pane-of-glass view.


Grafana Dashboards – Correlating Infrastructure and Media Health

Grafana is where metrics become insights. The power of Grafana lies in correlation—overlaying distinct metrics to find causality.

Dashboard Design Philosophy

A WebRTC Health Dashboard should not just list stats. It should visually group related failure domains.

Panel 1: The "Is it Broken?" Graph (The Symptom)

  • Metric: rate(signaling_auth_failures_total[1m]) vs signaling_rooms_active.
  • Insight: If failures spike while rooms drop, the platform is down.

Panel 2: The "Why is it Broken?" Correlation (The Cause)

  • Left Y-Axis: CPU Usage (%)
  • Right Y-Axis: Bitrate Throughput (Mbps)
  • If CPU spikes to 90% but bitrate stays flat, the SFU is burning cycles on control logic (perhaps DTLS handshakes or a runaway loop) rather than forwarding packets.
  • If bitrate and CPU spike together, it is a genuine capacity issue (autoscaling needed).

Panel 3: The "Network vs. Server" Check

  • Metric: rate(sfu_nacks_total[1m])
  • A NACK storm is a classic "death spiral." As the server struggles to retransmit packets, it consumes more CPU and bandwidth, causing more packets to drop, triggering more NACKs. Visualizing NACK rate alongside CPU usage confirms this diagnosis.

Essential PromQL Queries

1. 95th Percentile Signaling Latency
Calculating the "long tail" of user wait times.

histogram_quantile(0.95, sum(rate(signaling_ice_negotiation_seconds_bucket[5m])) by (le))

2. SFU Saturation Index
Detecting when specific SFU cores are overloaded (crucial for single-threaded SFUs).

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)

3. Bitrate per User (Quality Proxy)
If the total bitrate is stable but user count doubles, the average quality per user has halved.

sum(sfu_bitrate_out_bits) / sum(sfu_active_handles)

Alerting Strategy – From Reactive to Proactive

Alerts should wake you up for imminent failures, not minor fluctuations. We use Alertmanager to define rules based on the PromQL queries above.

Example Alert Rules (alerts.yml)

groups:
- name: sfu_health
  rules:
  # Rule 1: CPU Saturation Warning
  # Trigger if CPU > 80% for more than 2 minutes.
  - alert: SFUHighCPU
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "SFU instance {{ $labels.instance }} is under high load"

  # Rule 2: NACK Storm Detection (Leading Indicator)
  # Trigger if the current NACK rate exceeds 5x the baseline rate from 10 minutes ago.
  - alert: SFUNackStorm
    expr: rate(sfu_nacks_total[1m]) > 5 * (rate(sfu_nacks_total[5m] offset 10m))
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "NACK storm detected on {{ $labels.instance }}. Network congestion imminent."

  # Rule 3: Zero Bitrate (Lagging Indicator)
  # Trigger if users are present but no bits are flowing.
  - alert: SFUSilentFailure
    expr: sfu_active_handles > 0 and sfu_bitrate_out_bits == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "SFU {{ $labels.instance }} has users but zero throughput."

Noise Reduction:
Use the for clause liberally. Real-time networks are jittery. A 10-second spike in NACKs is common; a 1-minute sustained spike is an incident. The offset modifier in Rule 2 is a powerful technique for detecting anomalies relative to a baseline rather than setting arbitrary static thresholds.


Real Incident Walkthrough

Scenario: It is 2:00 AM. Alerts fire for SFUHighCPU on sfu-node-04.

Step 1: Triage (Grafana)
The on-call engineer opens the Grafana dashboard. They see sfu-node-04 CPU at 95%. Other nodes are at 20%. This immediately rules out a global code bug or a region-wide outage. It is a "hotspot" issue.

Step 2: Correlation
Looking at the "Active Rooms" panel, sfu-node-04 is hosting a "mega-room" with 500 participants, while other nodes have 50 rooms with 2 participants each. The load balancer was distributing by room count rather than participant count.

Step 3: The NACK Spiral
The "NACK Rate" graph for sfu-node-04 is vertical. The CPU is maxed out trying to retransmit packets for 500 users. Packet loss is increasing, causing more NACKs.

Step 4: Resolution
Observability identified the root cause (uneven distribution) in minutes. The engineer manually drains the node or triggers a script to migrate the room (if the architecture supports cascading SFUs).

Step 5: Post-Mortem
The fix is not just restarting the server. It is updating the load balancer logic to use the custom metric sfu_active_handles exposed by our exporter, rather than plain round-robin; a sketch of that selection logic follows. Observability closed the feedback loop.
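
To close the loop in code: a room-allocation service can ask Prometheus which SFU currently has the fewest handles before placing a new room. This is a minimal sketch using the standard Prometheus HTTP query API; the Prometheus address is a placeholder, and retries, caching, and the round-robin fallback are left out.

import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder address

def pick_least_loaded_sfu() -> str:
    # Instant query: one sample per SFU instance exposing sfu_active_handles
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": "sfu_active_handles"},
        timeout=2,
    )
    resp.raise_for_status()
    samples = resp.json()["data"]["result"]
    if not samples:
        raise RuntimeError("No SFU metrics available; fall back to round-robin")
    # Each sample looks like {'metric': {'instance': '10.0.2.10:8000'}, 'value': [ts, '42']}
    least = min(samples, key=lambda s: float(s["value"][1]))
    return least["metric"]["instance"]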


Architectural Reference Design

To build a production-grade WebRTC platform, your observability stack must be as robust as your media stack.

The reference architecture flows as follows:

  1. Metric Sources:
    • Quart Backend: Exposes /metrics via prometheus_client.
    • SFU Nodes: Sidecar Exporters translate internal JSON stats to Prometheus format.
    • Infrastructure: Standard node_exporter runs on every VM to capture raw CPU, Memory, and Disk I/O.
  2. Ingestion:
    • Regional Prometheus: Scrapes local targets every 15s.
  3. Storage & Aggregation:
    • Thanos / Cortex: Aggregates data from Regional Prometheus instances for long-term storage and global querying.
  4. Visualization:
    • Grafana: Connects to Thanos/Prometheus. Uses templated dashboards to switch between "Global View" and "Per-Node View."
  5. Alerting:
    • Alertmanager: Deduplicates alerts (don't send 50 emails for 50 failing nodes; send one "Cluster Critical" email) and routes them to PagerDuty or Slack.

This architecture ensures that when a user complains about video quality, you are not guessing. You are navigating a data-rich map of your system’s internal physics. In WebRTC, you cannot fix what you cannot measure. Without server-side metrics, you are flying blind; with them, you are navigating with instruments.


Conclusion

Server-side observability transforms WebRTC operations from reactive troubleshooting to predictive engineering. By instrumenting signaling layers, exporting SFU health, and correlating infrastructure metrics in Prometheus and Grafana, you expose leading indicators—CPU saturation, NACK storms, negotiation latency—that surface failure before users feel it. Reliability at five-nines is not luck; it is disciplined measurement, intelligent alerting, and architectural feedback loops. When metrics guide scaling, balancing, and capacity planning, incidents become data points instead of disasters. Build dashboards that explain causality, alerts that respect context, and systems that speak before they break.

For more in-depth system design insights, explore The Lalit Official on YouTube.
