Christopher

Making an Ollama Mesh with Nginx

Load Balancing Ollama with NGINX: Handling Long GPU Jobs and Dead Nodes Gracefully

Running Ollama on a single machine is easy. Running multiple Ollama instances across your LAN—and surviving GPU stalls, reboots, or long outages—is where things get interesting.

This post walks through a production-grade NGINX upstream configuration for Ollama, explains how it behaves under load, and shows how to tune it when one machine might be down for minutes or hours.


Why Ollama Needs Special Load Balancing

Ollama workloads are not typical web traffic:

  • Requests are long-lived
  • Execution time varies wildly (model + prompt dependent)
  • GPUs saturate before CPUs
  • A “slow” request is not a failure
  • Nodes can vanish mid-generation

Classic round-robin load balancing performs poorly here.

What you want instead:

  • Adaptive request distribution
  • Fast eviction of dead nodes
  • Minimal retry thrashing
  • Connection reuse

Baseline NGINX Upstream Configuration

upstream ollama_pool {
    least_conn;

    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;

    keepalive 64;
}

least_conn: The Right Algorithm for GPUs

least_conn routes new requests to the backend with the fewest active connections.

Why this works so well for Ollama:

  • LLM requests are long-running
  • Faster GPUs finish sooner
  • Finished nodes naturally get more work

This gives you implicit weighting without hardcoding values.
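For heterogeneous fleets you can also layer explicit weights on top of least_conn. A sketch (the IPs mirror the pool above; weight=2 is an illustrative value, not a measured one):

```nginx
upstream ollama_pool {
    least_conn;

    # Illustrative: bias more traffic toward a faster GPU node while
    # least_conn still adapts to in-flight request counts.
    server 192.168.0.169:11434 weight=2;
    server 192.168.0.156:11434;
    server 192.168.0.141:11434;
}
```

In practice least_conn alone is often enough, since faster nodes drain their connections sooner and pick up more work automatically.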


Failure Handling for Long Downtime Nodes

server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;

Meaning:

  • If a node fails 2 times within a 60-second window
  • It is marked unavailable for the next 60 seconds
  • No traffic is sent to it during that time
  • Afterward, NGINX automatically tries it again

For known long outages, consider fail_timeout=300s.
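For example, a node that disappears for lengthy maintenance windows could be configured like this (a sketch; 300s is a starting point, not a measured optimum):

```nginx
# Evict after 2 failures within the window, then keep the node out of
# rotation for 5 minutes before NGINX probes it again.
server 192.168.0.141:11434 max_fails=2 fail_timeout=300s;
```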


Connection Reuse Matters (keepalive)

keepalive 64;

This tells NGINX to keep up to 64 idle connections to the Ollama backends open (per worker process) and reuse them for subsequent requests.

Benefits:

  • Fewer handshakes
  • Lower latency
  • Better streaming stability

⚠️ Requires:

proxy_http_version 1.1;
proxy_set_header Connection "";
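Putting it together, a minimal location block with keepalive enabled might look like this sketch. The proxy_buffering directive is an addition beyond the snippets above, useful so streamed tokens are not held back by NGINX's response buffer:

```nginx
location / {
    proxy_pass http://ollama_pool;

    # Required for upstream keepalive connections to take effect
    proxy_http_version 1.1;
    proxy_set_header Connection "";

    # Pass streamed tokens through immediately instead of buffering
    proxy_buffering off;
}
```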

Detecting Failures Quickly (Timeouts)

proxy_connect_timeout 2s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;

Notes:

  • proxy_connect_timeout fails over from dead hosts quickly
  • proxy_read_timeout must be tuned to the usage pattern:
    • Streaming → a lower value is fine (the timer resets on each chunk received)
    • Blocking generations → must cover the entire generation time
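As a sketch, the two usage patterns could be tuned like this (the values are illustrative starting points):

```nginx
# Streaming (stream: true): the read timeout resets on every chunk,
# so it only needs to cover the gap between tokens.
proxy_read_timeout 60s;

# Blocking (stream: false): nothing arrives until generation finishes,
# so the timeout must cover the whole job, e.g.:
# proxy_read_timeout 600s;
```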

Retrying Failed Requests

proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;

NGINX will retry failed requests on other nodes.

⚠️ LLM generations are not idempotent: a retried request re-runs the generation from scratch on another node and may produce different output, so keep retries limited.
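Note that since NGINX 1.9.13, requests with non-idempotent methods — which includes the POSTs that Ollama's API uses — are not retried once they have been sent to a backend, unless you opt in explicitly:

```nginx
# Opt-in retry of POSTs that already reached a backend.
# Use with care: a retried generation re-runs from scratch and
# can duplicate work on another node.
proxy_next_upstream error timeout non_idempotent;
```

The default behavior (retrying only requests that never reached a backend) is usually the safer choice for generation endpoints.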


Planned Maintenance: Disable a Node

server 192.168.0.141:11434 down;

Reload NGINX (nginx -s reload) to remove the node from rotation; the reload is graceful, so in-flight requests on the other nodes are unaffected.


Recommended Production Baseline

Upstream

upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;
}

Proxy Location

proxy_http_version 1.1;
proxy_set_header Connection "";

proxy_connect_timeout 2s;

proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;

Tune proxy_read_timeout based on streaming vs blocking usage.
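Assembled into one server block, the baseline might look like this sketch (the listen port and the 300s read timeout are illustrative choices):

```nginx
server {
    listen 8080;

    location / {
        proxy_pass http://ollama_pool;

        # Required for upstream keepalive
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_connect_timeout 2s;
        proxy_read_timeout 300s;  # tune for streaming vs blocking

        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
    }
}
```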


What This Setup Does Not Do

This configuration does not:

  • Share model state
  • Provide session stickiness
  • Add authentication
  • Expose Ollama safely to the internet

Use it on trusted networks or behind TLS + auth.
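If you do need to expose the pool beyond a trusted LAN, a minimal TLS + basic-auth front might look like this sketch (the hostname, certificate paths, and htpasswd file are all hypothetical placeholders):

```nginx
server {
    listen 443 ssl;
    server_name ollama.example.internal;              # hypothetical

    ssl_certificate     /etc/nginx/certs/ollama.crt;  # hypothetical path
    ssl_certificate_key /etc/nginx/certs/ollama.key;  # hypothetical path

    auth_basic           "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;        # create with htpasswd

    location / {
        proxy_pass http://ollama_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```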


Final Thoughts

This setup provides:

  • Smart GPU-aware load balancing
  • Automatic failover
  • Graceful handling of dead machines
  • Minimal operational overhead

It’s a solid foundation for any serious Ollama deployment.
