Load Balancing Ollama with NGINX: Handling Long GPU Jobs and Dead Nodes Gracefully
Running Ollama on a single machine is easy. Running multiple Ollama instances across your LAN—and surviving GPU stalls, reboots, or long outages—is where things get interesting.
This post walks through a production-grade NGINX upstream configuration for Ollama, explains how it behaves under load, and shows how to tune it when one machine might be down for minutes or hours.
Why Ollama Needs Special Load Balancing
Ollama workloads are not typical web traffic:
- Requests are long-lived
- Execution time varies wildly (model + prompt dependent)
- GPUs saturate before CPUs
- A “slow” request is not a failure
- Nodes can vanish mid-generation
Classic round-robin load balancing performs poorly here.
What you want instead:
- Adaptive request distribution
- Fast eviction of dead nodes
- Minimal retry thrashing
- Connection reuse
Baseline NGINX Upstream Configuration
upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;
}
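The upstream block only defines the pool; to actually serve traffic it needs a server block pointing at it. A minimal sketch (the listen port 11435 is a placeholder, not from the original config):

server {
    listen 11435;

    location / {
        proxy_pass http://ollama_pool;   # all Ollama API traffic goes to the pool
        # keepalive and timeout settings are added further below
    }
}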
least_conn: The Right Algorithm for GPUs
least_conn routes new requests to the backend with the fewest active connections.
Why this works so well for Ollama:
- LLM requests are long-running
- Faster GPUs finish sooner
- Nodes that finish drop back to fewer active connections, so they naturally pick up more work
This gives you implicit weighting without hardcoding values.
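For contrast, a plain round-robin setup would need hand-tuned weights to reflect GPU speed; the sketch below is illustrative only, with made-up weight values:

upstream ollama_pool_static {
    # without least_conn you have to guess relative GPU throughput up front
    server 192.168.0.169:11434 weight=3;   # assumed fastest GPU
    server 192.168.0.156:11434 weight=2;
    server 192.168.0.141:11434 weight=1;   # assumed slowest GPU
}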
Failure Handling for Long Downtime Nodes
server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
Meaning:
- After 2 failures within 60 seconds
- The node is marked down for 60 seconds
- No traffic is sent during that time
- Afterward, NGINX retries automatically
For known long outages, consider fail_timeout=300s; note that the same value also defines the window in which the max_fails failures are counted.
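For example, if 192.168.0.141 is the machine that tends to stay offline for hours (which node that is in your setup is obviously an assumption), you can give just that entry the longer timeout:

upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=300s;   # known to disappear for long stretches
    keepalive 64;
}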
Connection Reuse Matters (keepalive)
keepalive 64;
This keeps up to 64 idle connections per worker process open to the upstream servers, so NGINX reuses TCP connections to Ollama instead of opening a new one for every request.
Benefits:
- Fewer handshakes
- Lower latency
- Better streaming stability
⚠️ Requires:
proxy_http_version 1.1;
proxy_set_header Connection "";
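In context, the location block looks roughly like this; proxy_buffering off is an extra, optional tweak (not in the original config) that flushes streamed tokens to the client as soon as they arrive:

location / {
    proxy_pass http://ollama_pool;
    proxy_http_version 1.1;           # upstream keepalive requires HTTP/1.1
    proxy_set_header Connection "";   # clear the hop-by-hop Connection header
    proxy_buffering off;              # optional: don't buffer streamed responses
}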
Detecting Failures Quickly (Timeouts)
proxy_connect_timeout 2s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
Notes:
- proxy_connect_timeout handles dead hosts fast: if a node is offline, NGINX gives up on it within 2 seconds instead of hanging
- proxy_read_timeout must be tuned to how you call Ollama:
  - Streaming → lower values are fine, because the timer resets every time a chunk of tokens arrives
  - Blocking generations → higher values are needed, since nothing is sent until the full response is ready
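As a rough sketch, a deployment that mostly serves blocking (non-streaming) generations might use something like the values below; 600s is an illustrative number, not a recommendation from the original post:

proxy_connect_timeout 2s;     # fail over quickly when a host is dead
proxy_send_timeout   30s;
proxy_read_timeout  600s;     # must exceed the longest expected blocking generation

For streaming, 30s is usually plenty, since every arriving chunk resets the read timer.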
Retrying Failed Requests
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
NGINX will retry failed requests on other nodes, as long as nothing has been sent to the client yet. Keep in mind that Ollama API calls are POSTs, and by default NGINX does not re-send a non-idempotent request to another server once it has already been transmitted to a backend; connection failures to a dead host are still retried.
⚠️ LLM responses are not guaranteed idempotent: a retried request re-runs the prompt and may return a different answer while burning GPU time twice.
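If you do want failed POSTs retried on another node, and you accept that the prompt will simply be run again, you can add the non_idempotent flag; the 60-second cap below is an optional, illustrative addition:

proxy_next_upstream error timeout http_500 http_502 http_503 http_504 non_idempotent;
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 60s;   # optional: stop retrying after 60s in total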
Planned Maintenance: Disable a Node
server 192.168.0.141:11434 down;
Reload NGINX to immediately remove the node from rotation.
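In the pool definition that looks like this (assuming 192.168.0.141 is the node going down for maintenance); apply it with nginx -s reload or systemctl reload nginx, both of which reload without dropping in-flight connections:

upstream ollama_pool {
    least_conn;
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 down;   # temporarily removed from rotation
    keepalive 64;
}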
Recommended Production Baseline
Upstream
upstream ollama_pool {
    least_conn;                                                # route to the node with the fewest active requests
    server 192.168.0.169:11434 max_fails=2 fail_timeout=60s;  # evict after 2 failures, probe again after 60s
    server 192.168.0.156:11434 max_fails=2 fail_timeout=60s;
    server 192.168.0.141:11434 max_fails=2 fail_timeout=60s;
    keepalive 64;                                              # reuse connections to the backends
}
Proxy Location
proxy_pass http://ollama_pool;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 2s;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;
Tune proxy_read_timeout based on streaming vs blocking usage.
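Assembled into one server block, the baseline looks roughly like this; the listen port and the 300s read timeout are illustrative placeholders:

server {
    listen 11435;

    location / {
        proxy_pass http://ollama_pool;

        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_connect_timeout 2s;
        proxy_send_timeout 30s;
        proxy_read_timeout 300s;   # raise for long blocking generations, lower is fine for streaming

        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
    }
}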
What This Setup Does Not Do
This configuration does not:
- Share model state
- Provide session stickiness
- Add authentication
- Expose Ollama safely to the internet
Use it on trusted networks or behind TLS + auth.
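If you do need to reach the pool from outside a trusted network, TLS termination plus basic auth at the NGINX layer is the simplest starting point; the hostname, certificate paths, and htpasswd file below are all placeholders:

server {
    listen 443 ssl;
    server_name ollama.example.internal;               # placeholder hostname

    ssl_certificate     /etc/nginx/certs/ollama.crt;   # placeholder cert paths
    ssl_certificate_key /etc/nginx/certs/ollama.key;

    auth_basic           "Ollama";
    auth_basic_user_file /etc/nginx/.htpasswd;          # create with the htpasswd tool

    location / {
        proxy_pass http://ollama_pool;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}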
Final Thoughts
This setup provides:
- Smart GPU-aware load balancing
- Automatic failover
- Graceful handling of dead machines
- Minimal operational overhead
It’s a solid foundation for any serious Ollama deployment.