# Building an Enterprise RAG System: Lessons from Production
RAG (Retrieval-Augmented Generation) is the most practical way to give LLMs access to your private documents. But most tutorials stop at "here's a LangChain hello world." Production RAG is a different beast.
I've been running a RAG system in production for Turkish documents. Here's what I learned - and why standard approaches fail for non-English text.
## The Problem with Default RAG
If you follow a typical RAG tutorial:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

# This works fine for English
# (note: chunk_size counts characters by default, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=512)
chunks = splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
```
This gives you ~90% retrieval accuracy on English documents. On Turkish? About 60%.
## Why Turkish Breaks Standard RAG

### 1. Tokenization
Turkish is agglutinative. One word can express an entire English sentence:
- "goruntuleyemeyebileceklerimizdenmissinizcesine" = "as if you were one of those whom we would not be able to view"
BPE tokenizers trained on English split this into 15+ meaningless subwords. Your 512-token chunk now covers much less text than expected.
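You can see the fragmentation directly. A quick probe, assuming the tiktoken package and OpenAI's cl100k_base vocabulary (exact counts vary by tokenizer, but the pattern holds):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "görüntüleyemeyebileceklerimizdenmişsinizcesine"
print(len(enc.encode(word)))    # double-digit subword count for one word
print(len(enc.encode("view")))  # 1 token for the rough English root
```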
### 2. Chunking

Token-count chunking cuts mid-word in Turkish because words are longer. A chunk boundary at token 512 might split "görüntüleyemeyebileceklerimizdenmişsinizcesine" in half.
### 3. Embeddings
Multilingual embeddings (e.g., multilingual-e5) help but still underperform. The embedding space doesn't fully capture Turkish morphological relationships.
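A minimal way to probe this, assuming the sentence-transformers package and the intfloat/multilingual-e5-base checkpoint (e5 models expect a "query: " prefix on inputs):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")
# "belge" = "document", "belgelerimizden" = "from our documents"
forms = ["query: belge", "query: belgelerimizden"]
emb = model.encode(forms, normalize_embeddings=True)
# Morphological variants of the same root often score lower than
# comparable English inflection pairs.
print(util.cos_sim(emb[0], emb[1]))
```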
## Our Solution: A 4-Step Pipeline

### Step 1: Morphological Preprocessing
Before chunking, we reduce words to stems so morphological variants line up in embedding space; the snippet below uses the Snowball Turkish stemmer, but any Turkish morphological analyzer can stand in:
```python
import snowballstemmer

# Snowball ships a Turkish stemmer; swap in a full morphological
# analyzer (e.g. Zemberek) if you need finer-grained stems.
stemmer = snowballstemmer.stemmer("turkish")

def preprocess_turkish(text):
    # Stem words for better embedding alignment,
    # but keep the original text for display.
    words = text.split()
    stems = stemmer.stemWords(words)
    return {
        "original": text,
        "stemmed": " ".join(stems),
        "tokens": words,
    }
```
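In use, the `stemmed` field goes into the index and the `original` field is what users see:

```python
doc = preprocess_turkish("Belgeleri dün görüntüledik.")  # "We viewed the documents yesterday."
# doc["stemmed"] feeds the vector index; doc["original"] is shown in results.
```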
### Step 2: Sentence-Boundary Chunking
Instead of splitting by token count, we split on sentence boundaries:
```python
import re

def chunk_by_sentences(text, max_sentences=5, overlap=1):
    # Group whole sentences into chunks, overlapping by `overlap`
    # sentences (assumes overlap < max_sentences).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    step = max_sentences - overlap
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + max_sentences]))
    return chunks
```
This ensures no word is ever split mid-morpheme.
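A quick check of the overlap behavior (the sentence text is arbitrary):

```python
text = "Cümle bir. Cümle iki. Cümle üç. Cümle dört. Cümle beş. Cümle altı."
for c in chunk_by_sentences(text, max_sentences=3, overlap=1):
    print(c)
# Cümle bir. Cümle iki. Cümle üç.
# Cümle üç. Cümle dört. Cümle beş.
# Cümle beş. Cümle altı.
```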
### Step 3: Weaviate Hybrid Search
We use Weaviate's hybrid search combining BM25 (keyword) with vector search:
```python
import weaviate
from weaviate.gql.get import HybridFusion

client = weaviate.Client("http://localhost:8080")

result = (
    client.query.get("Document", ["content", "metadata"])
    .with_hybrid(
        query="türkçe belge arama",  # "Turkish document search"
        alpha=0.5,  # 0 = pure BM25, 1 = pure vector; 0.5 weights them equally
        fusion_type=HybridFusion.RELATIVE_SCORE,
    )
    .with_autocut(2)
    .with_limit(5)
    .do()
)
```
BM25 catches exact term matches (critical for proper nouns in Turkish), while vector search handles semantic similarity.
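For completeness, a sketch of the indexing side. The `Document` class matches the query above; the batch size and the extra `original` property are illustrative:

```python
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {"name": "content",  "dataType": ["text"]},  # stemmed text, vectorized
        {"name": "original", "dataType": ["text"]},  # untouched text for display
        {"name": "metadata", "dataType": ["text"]},
    ],
})

client.batch.configure(batch_size=100)
with client.batch as batch:
    for chunk in chunks:  # output of chunk_by_sentences() on stemmed text
        batch.add_data_object({"content": chunk}, "Document")
```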
### Step 4: Production Docker Stack
```yaml
version: '3.8'
services:
  weaviate:
    image: semitechnologies/weaviate
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      DEFAULT_VECTORIZER_MODULE: text2vec-transformers
      ENABLE_MODULES: text2vec-transformers,reranker-transformers
      # the text2vec-transformers module needs an inference container:
      TRANSFORMERS_INFERENCE_API: http://t2v-transformers:8080
      # reranker-transformers likewise needs RERANKER_INFERENCE_API and
      # its own container (omitted here for brevity)
      CLUSTER_HOSTNAME: node1
  t2v-transformers:
    # model choice is illustrative; any multilingual checkpoint works
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2
    environment:
      ENABLE_CUDA: '0'
  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      WEAVIATE_URL: http://weaviate:8080
  ingest:
    build: ./ingest
    volumes:
      - ./documents:/data
```
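After `docker compose up -d`, a quick readiness check from Python:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
print(client.is_ready())  # True once Weaviate and its modules are up
```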
## Benchmark Results
We tested on 500 Turkish documents with 2000 queries:
| Approach | Recall@5 | Precision@5 | Latency (p95) |
|---|---|---|---|
| LangChain defaults | 61% | 54% | 450ms |
| + Sentence chunking | 72% | 65% | 420ms |
| + Morphological preprocessing | 84% | 78% | 480ms |
| + Hybrid search (Weaviate) | 93% | 88% | 520ms |
Each step adds measurable improvement. The full pipeline reaches 93% recall, comparable to English-optimized systems.
## Lessons Learned
- Don't trust default settings. Every RAG component assumes English text.
- Hybrid search is not optional for non-English. BM25 catches what embeddings miss.
- Preprocessing > better embeddings. Spending time on morphological analysis gave bigger gains than switching embedding models.
- Test with real queries. Our benchmark uses questions actual users asked, not synthetic queries.
- Monitor retrieval quality continuously. We log every query + retrieved chunks and review weekly; a minimal logging sketch follows this list.
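A minimal sketch of that retrieval log; field names and the truncation length are illustrative. JSON Lines keeps it grep-friendly, and `ensure_ascii=False` preserves Turkish characters:

```python
import json
import time

def log_retrieval(query, chunks, path="retrieval_log.jsonl"):
    record = {
        "ts": time.time(),
        "query": query,
        "chunks": [c["content"][:200] for c in chunks],  # truncate long chunks
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```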
## Try It Yourself
I've packaged this entire pipeline as a product:
- Starter ($79): Single project, up to 10K documents, Docker Compose stack, community support
- Professional ($249): Unlimited projects, custom embedding training, white-label, priority support
Check it out: BilgeStore RAG System
Read more: Blog post with full technical details
Questions? I'm happy to discuss Turkish NLP challenges or RAG architecture decisions in the comments.