Jovan Chan

Posted on • Originally published at aifoss.dev

Ollama + Open WebUI + pgvector: Sovereign RAG Stack 2026

TL;DR: Three services, one Docker Compose file, zero data leaving your machine. This guide connects Ollama for inference, Open WebUI for the chat frontend, and PostgreSQL with pgvector for document embeddings. The trade-off vs. the simpler two-container setup: more initial config, but persistent RAG data, multi-process safety, and one database to back up.

What you'll have running after this guide:

  • Ollama 0.30.x serving LLMs locally (Qwen2.5, Llama 3.3, Mistral, or any model in the library)
  • Open WebUI 0.9.6 with knowledge bases backed by pgvector — all retrieval stays on-device
  • PostgreSQL 17 + pgvector 0.8.2 storing embeddings persistently, safe for multi-worker Open WebUI deployments

Honest take: If your documents are sensitive enough that they can't touch OpenAI or Anthropic's APIs, this stack is the right call. If you just want local chat with no document search, the basic two-container setup is simpler — stop reading here and follow the Ollama + Open WebUI Linux setup guide instead.

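All three tools are free to self-host. Ollama is MIT licensed, and pgvector uses the PostgreSQL License (BSD-equivalent). Open WebUI moved in 2025 from BSD-3-Clause to its own Open WebUI License, which adds a branding clause, so check its terms if you plan to rebrand or redistribute. No usage limits, no call-home telemetry, no per-query fees.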


Why swap the default vector database?

Open WebUI ships with ChromaDB as its vector store. It works for a single user on a single machine. The problem shows up when:

  • You run Open WebUI with multiple uvicorn workers — ChromaDB's PersistentClient uses SQLite under the hood, which isn't fork-safe. Workers inherit the same database connection and corrupt each other's state under concurrent writes.
  • You restart the container and lose RAG context because the Chroma data volume wasn't correctly mounted.
  • You want a single backup to cover everything — chat history, user accounts, and document embeddings — instead of backing up Chroma separately.

Switching to pgvector fixes all three. The extension runs inside the same PostgreSQL instance Open WebUI already needs for its application database. One service, one backup, no extra containers.

For a deeper look at how pgvector compares to Qdrant and ChromaDB at scale, see the vector database comparison.
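
To make the "one backup" point concrete, here is a minimal sketch of a dump script. It assumes the service, user, and database names from the Compose file in Step 1 (postgres, openwebui, openwebui) and that it is run from the directory holding compose.yaml. The function names and the backups/ directory are illustrative, not part of any of the three tools.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_dump_command(service="postgres", user="openwebui", database="openwebui"):
    """Return the argv for running pg_dump inside the Compose service."""
    return [
        "docker", "compose", "exec", "-T", service,  # -T: no pseudo-TTY, keeps binary output intact
        "pg_dump", "-U", user, "-d", database,
        "--format=custom",  # compressed archive, restorable with pg_restore
    ]


def backup(target_dir="backups"):
    """Write a timestamped dump file and return its path."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_file = out_dir / f"openwebui-{stamp}.dump"
    with out_file.open("wb") as fh:
        # check=True raises CalledProcessError if pg_dump exits non-zero
        subprocess.run(build_dump_command(), stdout=fh, check=True)
    return out_file


# Usage, with the stack running:
#   print(backup())
```

Because the application tables and the embeddings share one database, a single dump file captures both. Restoring would go through pg_restore against an empty database of the same name.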


Hardware floor

Setup              | RAM   | GPU            | What runs
-------------------|-------|----------------|--------------------------------------------------
Minimum (CPU only) | 16 GB | None           | 7B Q4_K_M at 4–8 tok/s; RAG adds 3–5 s retrieval
Comfortable        | 16 GB | RTX 3060 12 GB | 7B at 28–35 tok/s; 13B at 15–22 tok/s
Recommended        | 32 GB | RTX 4070 12 GB | 14B at 40–50 tok/s; 32B Q4 at 18–25 tok/s
Heavy RAG / 70B    | 64 GB | RTX 4090 24 GB | 70B Q4_K_M at 20–30 tok/s with fast embedding

The embedding model (nomic-embed-text, 274MB) runs alongside your inference model. On an 8GB VRAM card, both compete for VRAM and you'll see the inference model partially offloaded to CPU. 12GB+ keeps both fully on-GPU.

CPU-only setups work — expect 10–30s per response instead of 1–3s. If you occasionally need GPU scale for large document batches, RunPod rents A5000s (24GB VRAM) for under $0.30/hr without a long-term commitment.

For hardware build recommendations to pair with this stack, see the GPU server guides on runaihome.com.


Architecture

┌──────────────────────────────────────────────┐
│              Docker bridge network           │
│                                              │
│  ┌──────────────┐    ┌────────────────────┐  │
│  │    Ollama    │◄───│    Open WebUI      │  │
│  │   :11434     │    │      :8080         │  │
│  └──────────────┘    └─────────┬──────────┘  │
│                                │              │
│                  ┌─────────────▼───────────┐  │
│                  │  PostgreSQL 17          │  │
│                  │  + pgvector 0.8.2       │  │
│                  │  :5432                  │  │
│                  └─────────────────────────┘  │
└──────────────────────────────────────────────┘

Open WebUI talks to Ollama for inference and to PostgreSQL for two things: its own application data (users, sessions, settings) and the RAG vector store (embeddings). PostgreSQL handles both roles — no separate Chroma service, no additional volume to manage.


Step 1: Write the Docker Compose file

mkdir ai-stack && cd ai-stack
nano compose.yaml

Paste the following:

services:
  postgres:
    image: pgvector/pgvector:pg17
    restart: unless-stopped
    environment:
      POSTGRES_DB: openwebui
      POSTGRES_USER: openwebui
      POSTGRES_PASSWORD: changeme_strong_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U openwebui -d openwebui"]
      interval: 10s
      timeout: 5s
      retries: 5

  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Uncomment for NVIDIA GPU:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    restart: unless-stopped
    ports:
      - "3000:8080"
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      OLLAMA_BASE_URL: http://ollama:11434
      DATABASE_URL: postgresql://openwebui:changeme_strong_password@postgres:5432/openwebui
      PGVECTOR_DB_URL: postgresql://openwebui:changeme_strong_password@postgres:5432/openwebui
      VECTOR_DB: pgvector
      RAG_EMBEDDING_ENGINE: ollama
      RAG_EMBEDDING_MODEL: nomic-embed-text
    volumes:
      - open_webui_data:/app/backend/data

volumes:
  postgres_data:
  ollama_data:
  open_webui_data:

Three things worth calling out before you run it:

pgvector/pgvector:pg17 ships with the vector extension pre-installed. You don't need to run CREATE EXTENSION vector manually — Open WebUI runs that migration on first boot.

Ollama's port is published on 127.0.0.1:11434 only, so it is reachable from the host itself but not from your LAN. Other containers don't use that mapping at all: they reach Ollama over the Compose network at http://ollama:11434. This matters because unauthenticated Ollama instances have shown up in security research repeatedly. If you need LAN access, use a reverse proxy with auth rather than exposing port 11434 directly. See the Ollama security guide for the full explanation.

Change changeme_strong_password in all three places it appears (POSTGRES_PASSWORD, DATABASE_URL, PGVECTOR_DB_URL) before running. Use the same value in all three.
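
Since the same password has to appear in three places, two of them inside a URL, it can help to generate the values together. The sketch below is one way to do that. The helper names are hypothetical, and it assumes Open WebUI's database layer decodes percent-encoded passwords the way SQLAlchemy-style URLs normally do. Passwords from secrets.token_urlsafe contain only URL-safe characters, so they sidestep that question entirely.

```python
import secrets
from urllib.parse import quote


def make_password(nbytes=24):
    """Random password using only URL-safe characters (A-Z, a-z, 0-9, '-', '_')."""
    return secrets.token_urlsafe(nbytes)


def build_dsn(user, password, host="postgres", port=5432, database="openwebui"):
    """Build a postgresql:// URL, percent-encoding characters such as '@', ':' and '/'."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    password = make_password()
    dsn = build_dsn("openwebui", password)
    # Paste these three values into compose.yaml
    print(f"POSTGRES_PASSWORD: {password}")
    print(f"DATABASE_URL: {dsn}")
    print(f"PGVECTOR_DB_URL: {dsn}")
```

If you pick a password by hand instead, an unescaped @ or / in it would break the URL parsing, which is what the quote() call guards against.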


Step 2: Start the stack and pull models

docker compose up -d

Docker pulls the three images (roughly 2.5GB total on first run), then starts the services. After 30–60 seconds:

✔ Container ai-stack-postgres-1     Healthy
✔ Container ai-stack-ollama-1       Started
✔ Container ai-stack-open-webui-1   Started

The service_healthy condition in the compose file makes Open WebUI wait for PostgreSQL to accept connections before starting. If you skip the healthcheck and start all three simultaneously, you'll see Open WebUI crash-loop for 15–20 seconds while Postgres initializes — not a real problem, but noisy.
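
Once the containers report as started, a quick probe from the host confirms that both HTTP services answer. This is a minimal sketch, assuming the port mappings from the Compose file (11434 for Ollama, 3000 for Open WebUI) and using Ollama's /api/version and Open WebUI's /health endpoints. The function names are illustrative.

```python
import http.client
import urllib.request

# Host-side URLs, assuming the port mappings in compose.yaml
ENDPOINTS = {
    "ollama": "http://127.0.0.1:11434/api/version",
    "open-webui": "http://127.0.0.1:3000/health",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Refused connections, timeouts and HTTP errors are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def check_stack(endpoints=ENDPOINTS):
    """Probe every endpoint and return a name -> bool mapping."""
    return {name: is_up(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_stack().items():
        print(f"{name:<12} {'up' if ok else 'DOWN'}")
```

Open WebUI can take a little longer than the container status suggests on first boot, so a DOWN result right after startup is worth retrying before you start debugging.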

Now pull the inference and embedding models:

# Inference model — swap for any model that fits your VRAM
docker exec ai-stack-ollama-1 ollama pull qwen2.5:7b

# Embedding model Open WebUI will use for RAG
docker exec ai-stack-ollama-1 ollama pull nomic-embed-text
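
With both models pulled, you can check that the embedding endpoint works before uploading any documents. The sketch below calls Ollama's /api/embed endpoint from the host and compares two vectors by cosine similarity. pgvector's <=> operator computes the corresponding cosine distance (1 minus this value); whether Open WebUI ranks chunks with exactly that metric is an assumption here. The function names and sample sentences are illustrative.

```python
import json
import math
import urllib.request


def embed(text, model="nomic-embed-text", base_url="http://127.0.0.1:11434"):
    """Request one embedding vector from Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        # The response holds one vector per input under "embeddings"
        return json.load(resp)["embeddings"][0]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Usage, with the stack running:
#   query = embed("How do I rotate the database password?")
#   chunk = embed("Change POSTGRES_PASSWORD and both connection URLs.")
#   print(len(query), cosine_similarity(query, chunk))
```

For nomic-embed-text the printed length should be 768. A different length means a different embedding model answered, and vectors from two models should not be mixed in one knowledge base.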

Why nomic-embed-text? It's 274MB, produces 768-dimensional vectors, and scores well on MTEB English retrieval benchmarks. For multilingual documents, mxbai-embed-large (670MB, 1024-dim) outperforms it. For minimal footprint, all-minilm (46MB) works but recall quality