Ekrem MUTLU

Posted on • Originally published at bilgestore.com

Building an Enterprise RAG System: Lessons from Production with Turkish Documents

RAG (Retrieval-Augmented Generation) is the most practical way to give LLMs access to your private documents. But most tutorials stop at "here's a LangChain hello world." Production RAG is a different beast.

I've been running a RAG system in production for Turkish documents. Here's what I learned - and why standard approaches fail for non-English text.

The Problem with Default RAG

If you follow a typical RAG tutorial:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

# This works fine for English
splitter = RecursiveCharacterTextSplitter(chunk_size=512)
chunks = splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()

This gives you ~90% retrieval accuracy on English documents. On Turkish? About 60%.

Why Turkish Breaks Standard RAG

1. Tokenization

Turkish is agglutinative. One word can express an entire English sentence:

  • "goruntuleyemeyebileceklerimizdenmissinizcesine" = "as if you were one of those whom we would not be able to view"

BPE tokenizers trained on English split this into 15+ meaningless subwords. Your 512-token chunk now covers much less text than expected.
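
You can see this directly with any BPE tokenizer. A minimal sketch, assuming the tiktoken package and the cl100k_base encoding (the one behind OpenAI's current embedding models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "görüntüleyemeyebileceklerimizdenmişsinizcesine"
tokens = enc.encode(word)

print(len(tokens))                        # one word, many subwords
print([enc.decode([t]) for t in tokens])  # mostly meaningless fragments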

2. Chunking

Token-count chunking cuts mid-word in Turkish because words are longer. A chunk boundary at token 512 might split "görüntüleyemeyebileceklerimizdenmişsinizcesine" in half.

3. Embeddings

Multilingual embeddings (e.g., multilingual-e5) help but still underperform. The embedding space doesn't fully capture Turkish morphological relationships.
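
One way to probe this on your own data: embed a bare stem and a heavily inflected form of the same word and compare similarities against an English baseline pair. A minimal sketch, assuming the sentence-transformers package (multilingual-e5 expects a "query: " prefix on inputs):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

pairs = [
    ("query: kitap", "query: kitaplarımızdan"),  # "book" vs "from our books"
    ("query: book", "query: books"),             # English baseline
]
for a, b in pairs:
    emb = model.encode([a, b], normalize_embeddings=True)
    print(a, "|", b, "->", float(util.cos_sim(emb[0], emb[1])))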

Our Solution: A 4-Step Pipeline

Step 1: Morphological Preprocessing

Before chunking, we analyze Turkish morphology:

from turkishnlp import detector

def preprocess_turkish(text):
    # Stem words for better embedding alignment
    # But keep original text for display
    words = text.split()
    stems = [detector.stem(w) for w in words]
    return {
        "original": text,
        "stemmed": " ".join(stems),
        "tokens": words
    }
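
If turkishnlp isn't an option in your environment, a lighter-weight alternative is the Snowball Turkish stemmer. A minimal sketch, assuming the snowballstemmer package (coarser than full morphological analysis, but dependency-free):

import snowballstemmer

# Rule-based Turkish stemmer; no model or word-set download required.
stemmer = snowballstemmer.stemmer("turkish")

def preprocess_turkish_snowball(text):
    words = text.split()
    return {
        "original": text,
        "stemmed": " ".join(stemmer.stemWords(words)),
        "tokens": words,
    }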

Step 2: Sentence-Boundary Chunking

Instead of splitting by token count, we split on sentence boundaries:

import re

def chunk_by_sentences(text, max_sentences=5, overlap=1):
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    # Step by (max_sentences - overlap) so consecutive chunks share
    # `overlap` sentences of context.
    for i in range(0, len(sentences), max_sentences - overlap):
        chunks.append(" ".join(sentences[i:i + max_sentences]))
    return chunks

This ensures no word is ever split mid-morpheme.
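
A quick check of the chunker's behavior (the sentences are placeholder text):

text = (
    "Birinci cümle. İkinci cümle! Üçüncü cümle? "
    "Dördüncü cümle. Beşinci cümle. Altıncı cümle."
)
for chunk in chunk_by_sentences(text, max_sentences=3, overlap=1):
    print(chunk)
# Each chunk shares one sentence with its neighbor, so context at the
# boundary survives retrieval.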

Step 3: Weaviate Hybrid Search

We use Weaviate's hybrid search combining BM25 (keyword) with vector search:

import weaviate
from weaviate.gql.get import HybridFusion  # v3 Python client

client = weaviate.Client("http://localhost:8080")

result = (
    client.query.get("Document", ["content", "metadata"])
    .with_hybrid(
        query="türkçe belge arama",  # "Turkish document search"
        alpha=0.5,  # 0 = pure BM25, 1 = pure vector; 0.5 weights both equally
        fusion_type=HybridFusion.RELATIVE_SCORE,
    )
    .with_autocut(2)  # keep result groups up to the second score jump
    .with_limit(5)
    .do()
)
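
The v3 client returns the raw GraphQL payload, so the hits sit a couple of levels deep. A minimal sketch of unpacking it, assuming the standard Get response layout:

for doc in result["data"]["Get"]["Document"]:
    print(doc["content"][:80])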

BM25 catches exact term matches (critical for proper nouns in Turkish), while vector search handles semantic similarity.

Step 4: Production Docker Stack

version: '3.8'
services:
  weaviate:
    image: semitechnologies/weaviate
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers,reranker-transformers
      # text2vec-transformers won't start without an inference endpoint;
      # reranker-transformers likewise needs its own companion container.
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      CLUSTER_HOSTNAME: node1

  # Inference container for text2vec-transformers; swap in whichever
  # sentence-transformers model image fits your language.
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2

  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      WEAVIATE_URL: http://weaviate:8080

  ingest:
    build: ./ingest
    volumes:
      - ./documents:/data

Benchmark Results

We tested on 500 Turkish documents with 2000 queries:

Approach                        Recall@5   Precision@5   Latency (p95)
LangChain defaults                61%         54%          450 ms
+ Sentence chunking               72%         65%          420 ms
+ Morphological preprocessing     84%         78%          480 ms
+ Hybrid search (Weaviate)        93%         88%          520 ms

Each step adds measurable improvement. The full pipeline reaches 93% recall, comparable to English-optimized systems.

Lessons Learned

  1. Don't trust default settings. Every RAG component assumes English text.
  2. Hybrid search is not optional for non-English. BM25 catches what embeddings miss.
  3. Preprocessing > better embeddings. Spending time on morphological analysis gave bigger gains than switching embedding models.
  4. Test with real queries. Our benchmark uses questions actual users asked, not synthetic queries.
  5. Monitor retrieval quality continuously. We log every query + retrieved chunks and review weekly; a minimal logging sketch follows this list.
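
For point 5, the logging doesn't need to be fancy. A minimal sketch (function and file names are illustrative, not from our stack):

import json
import time

def log_retrieval(query, chunks, path="retrieval.jsonl"):
    # One JSON line per query: enough to replay queries and review weekly.
    record = {
        "ts": time.time(),
        "query": query,
        "chunks": [c["content"][:200] for c in chunks],
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")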

Try It Yourself

I've packaged this entire pipeline as a product:

  • Starter ($79): Single project, up to 10K documents, Docker Compose stack, community support
  • Professional ($249): Unlimited projects, custom embedding training, white-label, priority support

Check it out: BilgeStore RAG System

Read more: Blog post with full technical details

Questions? I'm happy to discuss Turkish NLP challenges or RAG architecture decisions in the comments.
