Beyond the Chatbot: Mastering Local RAG with Open WebUI and DeepSeek-R1
Setting up a local LLM with Ollama is a great first step, but the real power of self-hosted AI lies in Retrieval-Augmented Generation (RAG). Being able to chat with your own documents, technical manuals, or private research notes without sending a single byte of data to a third-party cloud is the ultimate "power user" move.
In this guide, we’ll go beyond the basic ollama run and build a production-grade local knowledge base using Open WebUI and DeepSeek-R1. We’ll focus on improving retrieval accuracy with hybrid search and re-ranking: the techniques that make the difference between a model that "hallucinates" and one that actually knows your data.
Why DeepSeek-R1 for Local RAG?
While Llama 3.1 and Gemma 2 are excellent, DeepSeek-R1 (specifically the distilled versions like 8B or 14B) has gained massive traction in early 2026 for its reasoning capabilities. In a RAG pipeline, a "reasoning" model is significantly better at:
- Synthesizing information from multiple retrieved snippets.
- Identifying when the retrieved context doesn't contain the answer (reducing hallucinations).
- Following complex instructions about how to format the output based on your private data.
Prerequisites
- OS: Linux (Ubuntu 24.04+ or Debian 12 recommended).
- Hardware: 16GB+ RAM (32GB preferred). A dedicated NVIDIA GPU (8GB+ VRAM) is highly recommended for performance.
- Software: Docker and Docker Compose installed.
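If you intend to use a GPU, it's worth confirming Docker, the Compose plugin, and the NVIDIA Container Toolkit are all working before deploying anything. A quick sanity check from the host (the last two commands only apply to NVIDIA setups):
docker --version
docker compose version
nvidia-smi
docker run --rm --gpus all ubuntu nvidia-smi
If that final command prints your GPU table from inside a container, you can safely uncomment the GPU section of the Compose file in Step 1.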
Step 1: Deploying the Stack
We’ll use Docker Compose to manage both Ollama and Open WebUI. This ensures they can communicate over a private network.
Create a docker-compose.yaml file:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ./ollama:/root/.ollama
    restart: unless-stopped
    # Uncomment the following lines if you have an NVIDIA GPU
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - ./open-webui:/app/backend/data
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - "OLLAMA_BASE_URL=http://ollama:11434"
      - "WEBUI_SECRET_KEY=yoursecretkeyhere" # Change this!
    restart: unless-stopped
Run the stack:
docker compose up -d
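First start can take a minute while Open WebUI initializes its database. Confirm both containers are up and watch the startup logs if anything looks off:
docker compose ps
docker compose logs -f open-webui
Note that only Open WebUI is published to the host (port 3000); Ollama's API on 11434 is reachable solely from inside the Compose network, which is exactly what we want for a private stack.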
Step 2: Pulling the Reasoning Model
Once the containers are up, pull the DeepSeek-R1 distilled model. I recommend the 8B version for most consumer hardware, as it offers a good balance of speed and reasoning quality.
docker exec -it ollama ollama pull deepseek-r1:8b
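You can confirm the download and run a quick smoke test without leaving the terminal (the prompt text is arbitrary):
docker exec -it ollama ollama list
docker exec -it ollama ollama run deepseek-r1:8b "Reply with one short sentence if you can read this."
You'll likely see a <think> reasoning block at the start of the reply; that's expected for R1-style models.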
Step 3: Configuring the RAG Pipeline
Access Open WebUI at http://localhost:3000. After creating your local admin account, follow these steps to optimize your RAG settings:
1. The Embedding Model
By default, Open WebUI embeds documents with a lightweight SentenceTransformers model (sentence-transformers/all-MiniLM-L6-v2). That works, but technical content benefits from a stronger embedder: go to Admin Settings > Documents and either choose a larger SentenceTransformers model or switch the embedding engine to Ollama and use a local model such as nomic-embed-text.
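If you go the Ollama route, pull the embedding model first so it appears in the dropdown:
docker exec -it ollama ollama pull nomic-embed-text
Keep in mind that embeddings are model-specific: any documents you uploaded before switching need to be re-uploaded so they're indexed with the new model.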
2. Enabling Hybrid Search
Hybrid search combines Vector Search (semantic meaning) with BM25 (keyword matching). This is crucial for technical documents where specific terms (like function names or error codes) matter.
- In Settings > Documents, toggle on Enable Hybrid Search.
- This will index your files using both methods, drastically improving retrieval for specific keywords.
3. Tuning the Reranking
Reranking is the "secret sauce." It takes the top N results from the search and uses a smaller, specialized model to re-evaluate which ones are actually most relevant to the query.
- Set a Reranking Model (BAAI/bge-reranker-v2-m3 is a solid choice).
- Set your Top K (the number of chunks retrieved) to 5 or 10.
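If you'd rather bake the retrieval settings into the deployment instead of clicking through the UI, Open WebUI reads most of them from environment variables. The snippet below is a sketch of what you'd add to the open-webui service from Step 1; the variable names match recent Open WebUI releases, but double-check the docs for your version:
  open-webui:
    environment:
      - "ENABLE_RAG_HYBRID_SEARCH=true"
      - "RAG_EMBEDDING_ENGINE=ollama"
      - "RAG_EMBEDDING_MODEL=nomic-embed-text"
      - "RAG_RERANKING_MODEL=BAAI/bge-reranker-v2-m3"
      - "RAG_TOP_K=10"
One caveat: many of these are persistent settings, so once you change a value in the UI, the stored copy takes precedence over the environment variable on later restarts.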
Step 4: Using Your Knowledge Base
- Click the Documents icon in the sidebar.
- Upload your files (PDFs, Markdown, Text).
- In a new chat, select DeepSeek-R1:8B.
- Type # followed by the name of your document or collection to tag it.
- Ask your question!
Example:
"Based on the #architecture-specs, what is the maximum concurrent connection limit for our load balancer?"
Practical Tip: Chunking Matters
If your documents are huge, the default chunk size of 500 tokens might be too small for complex reasoning. If the model feels like it's missing the "big picture," try increasing the Chunk Size to 1000 with a 200-token Overlap in the Document Settings.
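As with the retrieval settings, chunking can also be set at deploy time. Again, treat the variable names as a sketch based on recent Open WebUI releases and verify them against the docs:
  open-webui:
    environment:
      - "CHUNK_SIZE=1000"
      - "CHUNK_OVERLAP=200"
Documents that are already indexed keep their old chunks, so re-upload anything important after changing these values.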
Found this helpful? I’m exploring the intersection of Linux automation and self-hosted AI every week. Let’s chat in the comments!