Load Balancing Ollama with NGINX: Handling Long GPU Jobs and Dead Nodes Gracefully
Running Ollama on a single machine is easy. Running multiple Ollama instances across your LAN—and surviving GPU stalls, reboots, or long outages—is where things get interesting.
This post walks through a production-grade NGINX upstream configuration for Ollama, explains how it behaves under load, and shows how to tune it when one machine might be down for minutes or hours.
Why Ollama Needs Special Load Balancing
Ollama workloads are not typical web traffic:
- Requests are long-lived
- Execution time varies wildly (model + prompt dependent)
- GPUs saturate before CPUs
- A “slow” request is not a failure
- Nodes can vanish mid-generation
Classic round-robin load balancing performs poorly here.
What you want instead:
- Adaptive request distribution
- Fast eviction of dead nodes
- Minimal retry thrashing
- Connection reuse
Baseline NGINX Upstream Configuration
upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;
}
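The upstream block only defines the pool; to actually serve traffic it needs a server block pointing at it. A minimal sketch (the listen port 11435 is a placeholder, not from the original config):

server {
    listen 11435;

    location / {
        proxy_pass http://ollama_pool;   # all Ollama API traffic goes to the pool
        # keepalive and timeout settings are added further below
    }
}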
least_conn: The Right Algorithm for GPUs
least_conn routes new requests to the backend with the fewest active connections.
Why this works so well for Ollama:
- LLM requests are long-running
- Faster GPUs finish sooner
- Nodes that finish drop back to fewer active connections, so they naturally pick up more work
This gives you implicit weighting without hardcoding values.
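For contrast, a plain round-robin setup would need hand-tuned weights to reflect GPU speed; the sketch below is illustrative only, with made-up weight values:

upstream ollama_pool_static {
    # without least_conn you have to guess relative GPU throughput up front
    server 192.168.0.169:11434 weight=3;   # assumed fastest GPU
    server 192.168.0.156:11434 weight=2;
    server 192.168.0.141:11434 weight=1;   # assumed slowest GPU
}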
Failure Handling for Long Downtime Nodes
server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
Meaning:
- After 2 failures within 60 seconds
- The node is marked down for 60 seconds
- No traffic is sent during that time
- Afterward, NGINX retries automatically
For known long outages, consider fail_timeout=300s; note that the same value also defines the window in which the max_fails failures are counted.
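For example, if 192.168.0.141 is the machine that tends to stay offline for hours (which node that is in your setup is obviously an assumption), you can give just that entry the longer timeout:

upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=300s;   # known to disappear for long stretches
    keepalive 64;
}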
Connection Reuse Matters (keepalive)
keepalive 64;
This keeps up to 64 idle connections per worker process open to the upstream servers, so NGINX reuses TCP connections to Ollama instead of opening a new one for every request.
Benefits:
- Fewer handshakes
- Lower latency
- Better streaming stability
⚠️ Requires:
proxy_http_version 1.1;
proxy_set_header Connection "";
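In context, the location block looks roughly like this; proxy_buffering off is an extra, optional tweak (not in the original config) that flushes streamed tokens to the client as soon as they arrive:

location / {
    proxy_pass http://ollama_pool;
    proxy_http_version 1.1;           # upstream keepalive requires HTTP/1.1
    proxy_set_header Connection "";   # clear the hop-by-hop Connection header
    proxy_buffering off;              # optional: don't buffer streamed responses
}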
Detecting Failures Quickly (Timeouts)
proxy_connect_timeout 2s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
Notes:
- proxy_connect_timeout handles dead hosts fast: if a node is offline, NGINX gives up on it within 2 seconds instead of hanging
- proxy_read_timeout must be tuned to how you call Ollama:
  - Streaming → lower values are fine, because the timer resets every time a chunk of tokens arrives
  - Blocking generations → higher values are needed, since nothing is sent until the full response is ready
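As a rough sketch, a deployment that mostly serves blocking (non-streaming) generations might use something like the values below; 600s is an illustrative number, not a recommendation from the original post:

proxy_connect_timeout 2s;     # fail over quickly when a host is dead
proxy_send_timeout   30s;
proxy_read_timeout  600s;     # must exceed the longest expected blocking generation

For streaming, 30s is usually plenty, since every arriving chunk resets the read timer.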
Retrying Failed Requests
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
NGINX will retry failed requests on other nodes, as long as nothing has been sent to the client yet. Keep in mind that Ollama API calls are POSTs, and by default NGINX does not re-send a non-idempotent request to another server once it has already been transmitted to a backend; connection failures to a dead host are still retried.
⚠️ LLM responses are not guaranteed idempotent: a retried request re-runs the prompt and may return a different answer while burning GPU time twice.
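If you do want failed POSTs retried on another node, and you accept that the prompt will simply be run again, you can add the non_idempotent flag; the 60-second cap below is an optional, illustrative addition:

proxy_next_upstream error timeout http_500 http_502 http_503 http_504 non_idempotent;
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 60s;   # optional: stop retrying after 60s in total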
Planned Maintenance: Disable a Node
server 192.168.0.141:11434 down;
Reload NGINX to immediately remove the node from rotation.
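In the pool definition that looks like this (assuming 192.168.0.141 is the node going down for maintenance); apply it with nginx -s reload or systemctl reload nginx, both of which reload without dropping in-flight connections:

upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 down;   # temporarily removed from rotation
    keepalive 64;
}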
Recommended Production Baseline
Upstream
upstream ollama_pool {
    least_conn;                                                # route to the node with the fewest active requests
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;  # evict after 2 failures, probe again after 60s
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;                                              # reuse connections to the backends
}
Proxy Location
proxy_pass http://ollama_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 2s;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
Tune proxy_read_timeout based on streaming vs blocking usage.
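Assembled into one server block, the baseline looks roughly like this; the listen port and the 300s read timeout are illustrative placeholders:

server {
    listen 11435;

    location / {
        proxy_pass http://ollama_pool;

        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_connect_timeout 2s;
        proxy_send_timeout 30s;
        proxy_read_timeout 300s;   # raise for long blocking generations, lower is fine for streaming

        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
    }
}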
What This Setup Does Not Do
This configuration does not:
- Share model state
- Provide session stickiness
- Add authentication
- Expose Ollama safely to the internet
Use it on trusted networks or behind TLS + auth.
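If you do need to reach the pool from outside a trusted network, TLS termination plus basic auth at the NGINX layer is the simplest starting point; the hostname, certificate paths, and htpasswd file below are all placeholders:

server {
    listen 443 ssl;
    server_name ollama.example.internal;               # placeholder hostname

    ssl_certificate     /etc/nginx/certs/ollama.crt;   # placeholder cert paths
    ssl_certificate_key /etc/nginx/certs/ollama.key;

    auth_basic           "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;          # create with the htpasswd tool

    location / {
        proxy_pass http://ollama_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}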
Final Thoughts
This setup provides:
- Smart GPU-aware load balancing
- Automatic failover
- Graceful handling of dead machines
- Minimal operational overhead
It’s a solid foundation for any serious Ollama deployment.