Thurmon Demich

Posted on Jun 12 • Originally published at bestgpuforllm.com

How to Run Two RTX 3090s for LLM Inference in 2026

#multigpu #rtx3090 #dualgpu #llm

This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.

Two used RTX 3090s for $1,200 total. 48GB combined VRAM. Llama 70B at Q4 running at 18-22 tokens per second. That is the pitch — and it actually works. Dual 3090s are the cheapest way to run 70B-class models locally in 2026, and the setup is simpler than most people expect. No NVLink required. No exotic drivers. Just two cards, the right motherboard, and a beefy PSU.

Why dual 3090s?

The math is straightforward:

Setup	VRAM	Can run 70B Q4?	Cost
1x RTX 4090	24GB	No (~42GB needed)	~$1,600
1x RTX 5090	32GB	No (~42GB needed)	~$2,000
2x RTX 3090 (used)	48GB	Yes	~$1,200
2x RTX 4090	48GB	Yes	~$3,200

A single RTX 4090 maxes out at 24GB — short of the ~42GB needed for Llama 70B at Q4_K_M. The only way to fit 70B on consumer hardware is multiple GPUs. And two used 3090s at $600 each cost less than one new 4090.
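The ~42GB figure can be sanity-checked with a rough weights-only estimate: parameter count times average bits per weight, divided by 8. The sketch below assumes about 4.8 bits per weight for Q4_K_M, which is an approximation and not a number from this article. It ignores KV cache and runtime overhead, so treat the result as a lower bound. The function names are illustrative.

```shell
# Rough weights-only size in GB: params (billions) * bits per weight / 8.
# Assumption: Q4_K_M averages roughly 4.8 bits per weight.
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Exit status 0 if the needed GB fit in the available GB, 1 otherwise.
fits_in_vram() {
  awk -v need="$1" -v have="$2" 'BEGIN { exit !(need <= have) }'
}

need=$(estimate_weights_gb 70 4.8)
for vram in 24 32 48; do
  if fits_in_vram "$need" "$vram"; then
    echo "${vram}GB: fits (${need}GB of weights)"
  else
    echo "${vram}GB: too small (${need}GB of weights)"
  fi
done
```

Only the 48GB case passes, which matches the table above. Leave a few GB of headroom on top of the estimate for context.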

What you need

Hardware checklist

Component	Requirement	Why
GPUs	2x RTX 3090	24GB each = 48GB total
Motherboard	2 physical x16 PCIe slots	Both must run at x8 or x16
PSU	850W minimum, 1000W recommended	Each 3090 draws up to 350W
CPU	Any modern 6+ core	Not the bottleneck for inference
RAM	32GB minimum	64GB recommended for large context
Case	Full tower with good airflow	3090s are triple-slot cards — check clearance
PCIe risers	Optional	Can help with spacing if slots are too close
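As a rough check on the PSU numbers, you can add up component draw. This is a sketch under stated assumptions: the 150W CPU and 50W for drives, fans, and RAM are hypothetical placeholders, and the 300W "typical" per-GPU figure is simply the upper end of the 500-600W combined draw reported in the performance table later in this article.

```shell
# Sum of component power in watts: per-GPU draw * GPU count + CPU + everything else.
system_watts() {
  awk -v g="$1" -v n="$2" -v c="$3" -v o="$4" 'BEGIN { printf "%d\n", g * n + c + o }'
}

# Assumed (hypothetical) values: 150W CPU, 50W for the rest of the system.
echo "Typical inference: $(system_watts 300 2 150 50)W"
echo "Both GPUs at 350W: $(system_watts 350 2 150 50)W"
```

Under these assumptions, typical inference sits just under an 850W supply, while the worst case exceeds it. That is the reasoning behind treating 850W as a floor and 1000W as the comfortable choice.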

Motherboard notes

This is where most builds fail. Many consumer motherboards have two x16-length slots, but the second slot runs at x4 electrically. That works but costs ~15% performance. Look for boards where both slots run at x8/x8 minimum when populated. ATX boards with Intel Z690/Z790 or AMD X670 chipsets often support this, but lane allocation varies by model, so confirm it in the board's manual before buying.

Do NOT buy NVLink bridges. The RTX 3090 supports NVLink, but llama.cpp and Ollama do not need it for LLM inference. By default they split the model by layers across the GPUs and pass activations between them over PCIe, which works on any multi-GPU setup. NVLink is wasted money for this use case.

Software setup

Option 1: Ollama (easiest)

Ollama automatically detects multiple GPUs and splits the model across them. No configuration needed.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a 70B model — Ollama auto-splits across both GPUs
ollama run llama3.1:70b-instruct-q4_K_M

Verify both GPUs are being used:

nvidia-smi
# Both GPUs should show VRAM usage
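If you prefer a scripted check over reading the nvidia-smi table, the query mode can be piped into a small filter. The helper below is a sketch: count_loaded_gpus is a made-up name, and the 1000 MiB default threshold is an arbitrary assumption for "this GPU is holding part of a model".

```shell
# Reads "index, memory_used_MiB" lines on stdin and prints how many GPUs
# hold at least min_mib MiB (default 1000, an arbitrary threshold).
count_loaded_gpus() {
  awk -F', *' -v min="${1:-1000}" '$2 + 0 >= min { n++ } END { print n + 0 }'
}

# On a real system, while the model is loaded:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | count_loaded_gpus
# Sample data standing in for nvidia-smi output:
printf '0, 21034\n1, 20876\n' | count_loaded_gpus
```

A result of 2 means both cards are holding weights. A result of 1 with a 70B model loaded usually means the rest spilled to system RAM.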

Option 2: llama.cpp (more control)

llama.cpp gives you explicit control over layer splitting:

# Auto-split across GPUs
./llama-server -m llama-70b-Q4_K_M.gguf --n-gpu-layers 99

# Manual split: 40 layers on GPU 0, 40 on GPU 1
./llama-server -m llama-70b-Q4_K_M.gguf --n-gpu-layers 80 --tensor-split 0.5,0.5

The --tensor-split flag controls how layers are distributed. Equal split (0.5,0.5) is usually optimal for two identical GPUs. If one card is slightly faster or has more free VRAM, adjust the ratio.
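One way to pick an uneven ratio is to derive it from the free VRAM on each card. This is a minimal sketch, not something llama.cpp does for you; split_from_free is a hypothetical helper name, and the sample values are made up.

```shell
# Reads one free-VRAM value (MiB) per line and prints proportions
# in the comma-separated form that --tensor-split accepts.
split_from_free() {
  awk '{ free[NR] = $1; total += $1 }
       END {
         if (total <= 0) exit 1
         for (i = 1; i <= NR; i++) printf "%s%.2f", (i > 1 ? "," : ""), free[i] / total
         print ""
       }'
}

# On a real system:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | split_from_free
# Sample data: GPU 0 has 24000 MiB free, GPU 1 has 20000 MiB free.
printf '24000\n20000\n' | split_from_free
```

With the sample data this prints 0.55,0.45, which you could pass as --tensor-split 0.55,0.45. This is useful when one card also drives a display and has less memory available.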

Performance expectations

Tested with Llama 3.1 70B at Q4_K_M on dual RTX 3090s:

Metric	Value
Prompt processing	~350 tok/s
Token generation	~18-22 tok/s
VRAM usage (per GPU)	~21GB each
Total VRAM used	~42GB
Power draw (both GPUs)	~500-600W

18-22 tok/s on a 70B model is comfortable for interactive chat. It is not blazing fast, but responses stream smoothly and you will not feel like you are waiting.

For comparison:

Setup	70B Q4 tok/s	Cost
2x RTX 3090	~18-22 tok/s	~$1,200
2x RTX 4090	~30-35 tok/s	~$3,200
Cloud (RunPod A100)	~40-50 tok/s	~$2-4/hr

Dual 4090s are ~60% faster, but at about 2.7x the cost. The 3090 setup is the value play.
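The comparison can be worked through explicitly. The sketch below uses the midpoints of the ranges in the table (20 and 32.5 tok/s) and, as a simplifying assumption, ignores electricity and resale value when computing how many cloud hours equal the hardware price.

```shell
# Percentage speedup of b over a.
speedup_pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b / a - 1) * 100 }'; }

# How many times more expensive b is than a.
cost_ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'; }

# Hours of rental at the given hourly rate that cost as much as the hardware.
breakeven_hours() { awk -v cost="$1" -v rate="$2" 'BEGIN { printf "%d\n", cost / rate }'; }

echo "2x 4090 vs 2x 3090 speedup: $(speedup_pct 20 32.5)%"
echo "2x 4090 vs 2x 3090 cost:    $(cost_ratio 1200 3200)x"
echo "Cloud break-even: $(breakeven_hours 1200 4) to $(breakeven_hours 1200 2) hours"
```

By this estimate the dual-3090 build pays for itself after roughly 300 to 600 hours of use compared with renting at $2-4/hr.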

VRAM chart available at the original article

What models fit on 48GB?

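Model	Q4_K_M VRAM	Fits on 2x 3090?	tok/s
Llama 3.1 70B	~42GB	Yes	~18-22
Qwen 2.5 72B	~45GB	Tight	~15-18
Llama 4 Scout (109B MoE)	~65GB*	Not at Q4_K_M; needs a lower-bit quant or partial CPU offload	Depends on quant and offload
Mixtral 8x22B (141B MoE)	~85GB*	Not at Q4_K_M; needs a lower-bit quant or partial CPU offload	Depends on quant and offload
Any model under 34B	Under 24GB	Yes (single GPU)	Varies

*MoE models such as Llama 4 Scout activate only a subset of their parameters per token, which is why they generate faster than a dense model of the same total size. All experts still have to be held in memory, so memory needs track the total parameter count, not the active count.

The 48GB sweet spot opens up the entire 70B class of dense models. Larger MoE models are reachable only with more aggressive quantization or by offloading some weights to system RAM. This is the key advantage over single-GPU setups.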

GPU tier list available at the original article

Common issues and fixes

"Only one GPU is being used"

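Check that both GPUs are detected: nvidia-smi should show two devices. If Ollama only uses one, try setting CUDA_VISIBLE_DEVICES=0,1 in the environment of the Ollama server process before it starts (on Linux installs that run Ollama as a service, that means the service's environment, not your shell). Note that a model small enough for one card may be placed on a single GPU by design. In llama.cpp, explicitly set --n-gpu-layers 99 to force full GPU offloading.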

Thermal throttling

Two 3090s generate serious heat — up to 700W combined. Ensure your case has strong front-to-back airflow. Leave at least one slot gap between the cards if possible. Consider aftermarket GPU coolers or a case with 140mm fans if you see temperatures hitting 83C+ consistently.
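To see whether you are actually reaching that 83C mark, you can poll nvidia-smi and filter its output. The helper below is a sketch with a made-up name (flag_hot_gpus); the sample readings are invented for illustration.

```shell
# Reads "index, temperature_C, power_W" lines on stdin and prints one line
# per GPU at or above the limit (default 83, the threshold mentioned above).
flag_hot_gpus() {
  awk -F', *' -v limit="${1:-83}" '$2 + 0 >= limit { printf "GPU %s: %dC at %sW\n", $1, $2, $3 }'
}

# On a real system, under load:
#   nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits | flag_hot_gpus
# Sample data standing in for nvidia-smi output:
printf '0, 84, 342.10\n1, 71, 335.55\n' | flag_hot_gpus
```

No output means every card is below the limit. In a stacked layout the upper card typically runs hotter, so check during a long generation and not at idle.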

PCIe bandwidth bottleneck

If your second slot runs at x4, traffic to and from that GPU gets half the bandwidth of an x8 link. The impact is ~15% on overall throughput. Upgrading to a motherboard with proper x8/x8 bifurcation fixes this. For most users, the 15% loss is acceptable given the cost savings.
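You can confirm the negotiated link width without opening the case. The sketch below filters nvidia-smi's query output; flag_narrow_links is a hypothetical helper name and the sample data is invented.

```shell
# Reads "index, pcie_link_width" lines on stdin and prints one line
# per GPU whose current link is narrower than x8.
flag_narrow_links() {
  awk -F', *' '$2 + 0 < 8 { printf "GPU %s: x%d link\n", $1, $2 }'
}

# On a real system:
#   nvidia-smi --query-gpu=index,pcie.link.width.current --format=csv,noheader | flag_narrow_links
# Sample data: GPU 0 at x16, GPU 1 at x4.
printf '0, 16\n1, 4\n' | flag_narrow_links
```

No output means both cards are at x8 or wider. If a card reports x4 in a slot the manual lists as x8, reseating it or checking the BIOS slot settings is a reasonable first step.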

Who should NOT do this?

Gamers who occasionally run LLMs. Dual 3090s can draw up to 700W and generate significant heat. If you primarily game, a single RTX 4090 is a better all-rounder (though it cannot fit 70B at Q4).
Anyone who needs 70B at 30+ tok/s. Dual 3090s cap at ~22 tok/s. If speed is critical, dual 4090s or cloud are your options.
Small form factor builders. Two triple-slot 3090s need a full tower case with good airflow. Most mITX and mATX builds cannot accommodate this.

For used 3090 buying tips, see our used RTX 3090 buying guide. Planning to run Llama specifically? The best GPU for Llama 70B guide covers all options. PSU sizing for multi-GPU is covered in PSU for dual GPU LLM. And for motherboard compatibility, see best motherboard for dual GPU LLM.
