Beyond the Chatbot: Mastering Local RAG with Open WebUI and DeepSeek-R1

Lyra

Setting up a local LLM with Ollama is a great first step, but the real power of self-hosted AI lies in Retrieval-Augmented Generation (RAG). Being able to chat with your own documents, technical manuals, or private research notes without sending a single byte of data to a third-party cloud is the ultimate "power user" move.

In this guide, we’ll go beyond the basic ollama run and build a production-grade local knowledge base using Open WebUI and DeepSeek-R1. We’ll focus on improving retrieval accuracy with hybrid search and re-ranking, the techniques that separate a model that "hallucinates" from one that actually knows your data.


Why DeepSeek-R1 for Local RAG?

While Llama 3.1 and Gemma 2 are excellent, DeepSeek-R1 (specifically the distilled 8B and 14B variants) gained massive traction in early 2025 for its reasoning capabilities. In a RAG pipeline, a "reasoning" model is significantly better at:

  1. Synthesizing information from multiple retrieved snippets.
  2. Identifying when the retrieved context doesn't contain the answer (reducing hallucinations).
  3. Following complex instructions about how to format the output based on your private data.

Prerequisites

  • OS: Linux (Ubuntu 24.04+ or Debian 12 recommended).
  • Hardware: 16GB+ RAM (32GB preferred). A dedicated NVIDIA GPU (8GB+ VRAM) is highly recommended for performance.
  • Software: Docker and Docker Compose installed.

Step 1: Deploying the Stack

We’ll use Docker Compose to manage both Ollama and Open WebUI. This ensures they can communicate over a private network.

Create a docker-compose.yaml file:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ./ollama:/root/.ollama
    restart: unless-stopped
    # Uncomment the following lines if you have an NVIDIA GPU
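    # (GPU passthrough also requires the NVIDIA Container Toolkit installed on the host)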
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - ./open-webui:/app/backend/data
    depends_on:
      - ollama
    ports:
      - "3000:8080"
    environment:
      - "OLLAMA_BASE_URL=http://ollama:11434"
      - "WEBUI_SECRET_KEY=yoursecretkeyhere" # Change this!
    restart: unless-stopped

Run the stack:

docker compose up -d
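
Before moving on, it’s worth a quick sanity check that both containers came up cleanly. docker compose ps should list both services as running, and the Ollama logs will surface any startup errors:

docker compose ps
docker logs ollama --tail 20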

Step 2: Pulling the Reasoning Model

Once the containers are up, pull the distilled DeepSeek-R1 model. I recommend the 8B version for most consumer hardware, as it offers a good balance of speed and reasoning quality.

docker exec -it ollama ollama pull deepseek-r1:8b
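
Before wiring the model into RAG, you can smoke-test it straight from the container (the prompt here is just a throwaway example):

docker exec -it ollama ollama run deepseek-r1:8b "Summarize what RAG is in one sentence."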

Step 3: Configuring the RAG Pipeline

Access Open WebUI at http://localhost:3000. After creating your local admin account, follow these steps to optimize your RAG settings:

1. The Embedding Model

Out of the box, Open WebUI embeds documents with the lightweight sentence-transformers/all-MiniLM-L6-v2 model. It works, but for better retrieval accuracy go to Settings > Documents and point the embedding engine at a stronger model, such as nomic-embed-text served locally through Ollama.
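
If you go the Ollama route, pull the embedding model into the same container first, then select Ollama as the embedding engine in Settings > Documents:

docker exec -it ollama ollama pull nomic-embed-text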

2. Enabling Hybrid Search

Hybrid search combines Vector Search (semantic meaning) with BM25 (keyword matching). This is crucial for technical documents where specific terms (like function names or error codes) matter.

  • In Settings > Documents, toggle on Enable Hybrid Search.
  • This will index your files using both methods, drastically improving retrieval for specific keywords.

3. Tuning the Reranking

Reranking is the "secret sauce." It takes the top N results from the search and uses a smaller, specialized model to re-evaluate which ones are actually most relevant to the query.

  • Set a Reranking Model (e.g., BAAI/bge-reranker-v2-m3 is a solid choice).
  • Set your Top K (the number of chunks retrieved) to 5 or 10.
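
If you’d rather pin these retrieval settings in the compose file than click through the UI on every redeploy, Open WebUI can also read them from environment variables. The names below are how I remember them from the Open WebUI environment-variable docs and they have shifted between releases, so treat this as a sketch and verify them against the docs for your version before merging them into the open-webui service’s existing environment: list:

    environment:
      - "ENABLE_RAG_HYBRID_SEARCH=true"
      - "RAG_EMBEDDING_ENGINE=ollama"
      - "RAG_EMBEDDING_MODEL=nomic-embed-text"
      - "RAG_RERANKING_MODEL=BAAI/bge-reranker-v2-m3"
      - "RAG_TOP_K=5"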

Step 4: Using Your Knowledge Base

  1. Click the Documents icon in the sidebar.
  2. Upload your files (PDFs, Markdown, Text).
  3. In a new chat, select DeepSeek-R1:8B.
  4. Type # followed by the name of your document or collection to tag it.
  5. Ask your question!

Example:

"Based on the #architecture-specs, what is the maximum concurrent connection limit for our load balancer?"


Practical Tip: Chunking Matters

If your documents are huge, the default chunk size of 500 tokens might be too small for complex reasoning. If the model feels like it's missing the "big picture," try increasing the Chunk Size to 1000 with a 200-token Overlap in the Document Settings. With those values, consecutive chunks share 200 tokens, so a passage that straddles a chunk boundary still appears intact in at least one chunk.



Found this helpful? I’m exploring the intersection of Linux automation and self-hosted AI every week. Let’s chat in the comments!
