In RAG systems, your LLM is only as smart as its retrieval. And retrieval is only as good as your chunks. A practical guide to every chunking strategy and exactly when to use each one.
"How you slice your knowledge determines what your AI can know. Chunking is not preprocessing; it's architecture."
I've spent the last couple of years building RAG-based applications across multiple domains: healthcare, HR chatbots, enterprise search, and customer support. If there's one question I keep coming back to, it's the same one every time: how do I split this document? Most tutorials hand you a code snippet with a fixed chunk size and move on. But after shipping real systems that failed in real ways, I started treating chunking as a first-class architectural decision, not an afterthought.
So, What Exactly Is RAG?
Large Language Models (LLMs) are extraordinarily capable, but they have a hard boundary: their knowledge is frozen at training time. Ask GPT-4 about your internal company policy updated last week, or Mistral about a legal clause in a contract it has never seen, and it will either confess ignorance or, worse, confidently make something up. This is the hallucination problem, and it's the single biggest obstacle to deploying LLMs in production.
Retrieval-Augmented Generation (RAG) is the architectural pattern that solves this. Instead of relying solely on what the model memorized during training, RAG gives the LLM a dynamic, queryable knowledge base at inference time, effectively updating the model's knowledge by providing context rather than retraining it. The flow is simple: a user asks a question → the system retrieves the most relevant documents from your knowledge base → those documents are injected into the LLM's prompt as context → the model answers using that fresh, grounded information. Your LLM is no longer guessing from memory. It's reading from a source.
RAG is the difference between an LLM that thinks it knows your domain and one that actually reads it every single time.
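That flow is easier to see in code than in arrows. Here's a minimal sketch of the retrieve-augment-generate loop; the vectorstore and llm objects are placeholders, the method names similarity_search and invoke assume LangChain-style components, and the prompt wording and k=3 retrieval depth are purely illustrative.

def answer_with_rag(question, vectorstore, llm, k=3):
    # 1. Retrieve: pull the k chunks most similar to the question
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    # 2. Augment: inject the retrieved chunks into the prompt as grounding context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generate: the model answers from the supplied context, not from memory
    return llm.invoke(prompt)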
Why Does Context Window Size Matter?
Every LLM has a context window: the maximum number of tokens it can process in a single interaction. Modern models have pushed this dramatically: GPT-4 supports 128K tokens, Gemini goes up to 1M, and open-source models like Mistral and Llama 3 offer 32K–128K windows. On the surface, this sounds like chunking should be a solved problem: just stuff the whole document in and let the model figure it out.
Reality is more complicated. First, larger context means higher cost and latency. Sending 100K tokens to a model on every query is expensive, slow, and often unnecessary when only three paragraphs are actually relevant. Second, and more critically, research has consistently shown the "lost in the middle" effect: LLMs reliably attend to content at the beginning and end of their context window, but struggle to reason from information buried in the middle. A 100K-token context stuffed with an entire document does not guarantee the model finds the right answer. It often buries it.
This is exactly why precise retrieval matters. You don't want to give the model everything; you want to give it the right thing. And that means your chunks need to be coherent, targeted, and meaningful enough to retrieve accurately.
Chunking: The Hidden Lever on Hallucination
When a RAG system retrieves the wrong chunk, or a chunk that contains half an idea cut off mid-paragraph, the LLM receives incomplete or misleading context. It doesn't say "I'm not sure." It fills the gap. It hallucinates. I've watched this happen in production: a legal AI retrieving a clause fragment without its qualifying condition, a medical bot answering from a chunk that contained the preamble of a guideline but not the actual recommendation. The model wasn't broken. The chunks were.
Good chunking directly reduces hallucination by ensuring that every retrievable unit of text is complete, contextually self-contained, and semantically precise enough to match the right query. It's not a data preprocessing step. It's a quality-of-reasoning decision.
My Approach: Semantic Chunking + Open-Source Models
Across the domains I've worked in, one pattern has proven itself repeatedly: semantic chunking, which groups text by meaning rather than character count, consistently delivers better retrieval precision than fixed-size approaches. It costs a bit more during the indexing phase, but in production, where a wrong answer can erode user trust in minutes, that investment pays back fast.
On the cost side, I've leaned heavily on open-source embedding models like Sentence Transformers. When you're embedding millions of document chunks, the difference between a cloud API and a self-hosted open-source model can be the difference between a sustainable product and an unsustainable one.
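To make "self-hosted" concrete, here's a minimal sketch using the sentence-transformers library; the model name all-MiniLM-L6-v2 is just a common lightweight default, and the chunk list is a placeholder.

from sentence_transformers import SentenceTransformer

# Runs locally on CPU or GPU; no per-token API charges
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["First chunk of text...", "Second chunk of text..."]
embeddings = model.encode(chunks, batch_size=64, show_progress_bar=True)
# embeddings is a (num_chunks, 384) array, ready to load into your vector store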
This post is everything I wish existed when I started: a practical breakdown of every major chunking strategy, when each one earns its place, and which type of application it actually belongs in. No fluff, just the patterns I've validated across real projects.
Why Chunking Decides Everything
Imagine handing someone a textbook with all its pages torn out and shuffled randomly, then asking them a question. That's what a poorly chunked RAG system does to an LLM. The model sees fragments devoid of context, is forced to guess what came before and after, and hallucinates to fill the gaps.
"If retrieval is the engine of your RAG system, chunking is the fuel. High-quality chunking produces clean, contextual responses. Poor chunking creates noise no matter how powerful your LLM is."
Most developers obsess over their vector database or embedding model choice, but the single biggest lever on RAG performance is almost always how you divide your documents. Every chunk becomes a discrete unit that gets embedded, stored, retrieved, and injected into a prompt. Get this wrong and your brilliant LLM is reasoning from garbage.
There are two failure modes.
Chunks too large:
The vector blurs across multiple topics, retrieval becomes imprecise, and you bloat the LLM's context with irrelevant content.
Chunks too small:
They lack enough context to be meaningful on their own, becoming orphaned fragments that mislead rather than inform. The art lies in finding the Goldilocks zone, and that zone is different for every application.
The 6 Core Chunking Strategies
These strategies span a spectrum from dirt-cheap and dumb to expensive and intelligent. None is universally best. Your data, your users, and your budget determine the winner.
Which App Needs Which Chunk?
The same chunking strategy that powers a customer support chatbot would be disastrous for a legal contract analyzer. Here's how to match strategy to application type.
Starter Code
Here's how each strategy looks in practice using LangChain, the most common RAG framework.
Fixed Size & Recursive
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""],  # paragraph → line → word
)
chunks = splitter.split_text(raw_text)
# Returns a list of strings, ready for embedding
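One caveat worth flagging: chunk_size above is counted in characters by default. If you'd rather think in tokens, as the tuning section below does, the tiktoken-backed constructor keeps the same interface. A minimal sketch, assuming the tiktoken package is installed:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Same splitter, but chunk_size and chunk_overlap are measured in tokens
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by GPT-4-class models
    chunk_size=512,
    chunk_overlap=100,
)
token_chunks = token_splitter.split_text(raw_text)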
Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation"
    breakpoint_threshold_amount=95,  # split at big topic shifts
)
chunks = chunker.split_text(raw_text)
# Chunks are semantically coherent topics
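Since I lean on open-source embeddings, the same chunker also works with a locally hosted model. A sketch assuming the langchain-huggingface integration package is installed; the model name is one reasonable default and can be swapped for whatever you self-host:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Local Sentence Transformers model instead of a paid embedding API
local_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
chunker = SemanticChunker(
    embeddings=local_embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(raw_text)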
Hierarchical (Parent-Child)
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for retrieval precision
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Large chunks returned to LLM for full context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=your_vectorstore,
    docstore=your_docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
# Query retrieves small child chunk → returns full parent context
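For completeness, here's one way the your_vectorstore and your_docstore placeholders could be wired up; Chroma, an in-memory docstore, the local embedding model, and the example query are assumptions for a local experiment, not the only choices.

from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

vectorstore = Chroma(
    collection_name="parent_child_demo",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
)
docstore = InMemoryStore()  # maps parent IDs to the full parent documents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # docs: a list of LangChain Document objects
relevant_parents = retriever.invoke("What does the policy say about refunds?")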
Tuning in Production
Choosing a strategy is step one. The real work is tuning. Here's the iterative loop that separates production-grade RAG from toy demos.
Step 1: Establish a baseline
Start with Recursive chunking at 512 tokens / 100 token overlap. Run your standard query set against it. Record your metrics: hit rate (was the right chunk retrieved at all?), precision (how much noise came with it?), and answer faithfulness (did the LLM use the context correctly?).
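A hand-rolled evaluation loop is enough to get started. This is a sketch under a few assumptions: you have a small eval_set of (question, expected_source_id) pairs, each indexed chunk carries a source_id in its metadata, and faithfulness is checked separately by a human or an LLM judge.

def evaluate_retrieval(eval_set, retriever, k=5):
    """eval_set: list of (question, expected_source_id) pairs."""
    hits, precision_sum = 0, 0.0
    for question, expected_id in eval_set:
        docs = retriever.invoke(question)[:k]
        retrieved_ids = [doc.metadata.get("source_id") for doc in docs]
        if expected_id in retrieved_ids:
            hits += 1
            # Crude precision: share of the top-k that came from the right source
            precision_sum += retrieved_ids.count(expected_id) / len(retrieved_ids)
    n = len(eval_set)
    return {"hit_rate": hits / n, "avg_precision": precision_sum / n}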
Step 2: Experiment systematically
Change one variable at a time. Try 256, 512, and 1024 token sizes. Try 0%, 10%, and 20% overlap. Try Semantic chunking vs. Recursive. Each experiment should run against the same evaluation set so comparisons are fair.
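In practice I script this as a small grid, changing one knob at a time against the same evaluation set; build_retriever is a hypothetical helper that re-chunks, re-embeds, and indexes the corpus for each configuration, and evaluate_retrieval is the sketch from Step 1.

results = []
for chunk_size in (256, 512, 1024):
    for overlap_ratio in (0.0, 0.1, 0.2):
        overlap = int(chunk_size * overlap_ratio)
        retriever = build_retriever(chunk_size=chunk_size, chunk_overlap=overlap)
        metrics = evaluate_retrieval(eval_set, retriever)
        results.append({"chunk_size": chunk_size, "overlap": overlap, **metrics})

# Rank configurations by hit rate before eyeballing the trade-offs
for row in sorted(results, key=lambda r: r["hit_rate"], reverse=True):
    print(row)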
The "lost in the middle" problem is real. Even with large context windows, LLMs struggle to reason about information buried in the middle of long chunks. Smaller, focused chunks often outperform bigger ones not because they contain more info, but because they force the retrieval system to find exactly the right piece.
Step 3: Human review
Metrics catch a lot, but not everything. Have domain experts review both the retrieved chunks and the final LLM responses. They'll catch subtle issues that automated metrics miss entirely, like a chunk that is technically on-topic but missing the critical preceding sentence.
Step 4: Monitor and iterate
User queries in production are always messier than your test set. Set up retrieval logging, track low confidence responses, and revisit your chunking strategy quarterly. As your document corpus evolves, so should your chunking approach.
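Retrieval logging doesn't need to be elaborate; one structured log line per query is enough to find low-confidence answers later. A sketch assuming a LangChain-style vector store (similarity_search_with_relevance_scores returns scores normalized to 0-1, higher is better), a source_id metadata key, and a 0.75 threshold chosen purely for illustration:

import json
import logging
import time

logger = logging.getLogger("rag.retrieval")
LOW_CONFIDENCE = 0.75  # illustrative threshold; tune against your own score distribution

def retrieve_and_log(question, vectorstore, k=3):
    docs_and_scores = vectorstore.similarity_search_with_relevance_scores(question, k=k)
    top_score = max((score for _, score in docs_and_scores), default=0.0)
    logger.info(json.dumps({
        "ts": time.time(),
        "question": question,
        "chunk_ids": [doc.metadata.get("source_id") for doc, _ in docs_and_scores],
        "top_score": float(top_score),
        "low_confidence": top_score < LOW_CONFIDENCE,
    }))
    return [doc for doc, _ in docs_and_scores]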
Chunking is not a one time configuration. It's an ongoing architectural decision that should be revisited as your data, your users, and your quality bar evolve.
The Takeaway
Chunking is deceptively simple: it looks like just splitting strings, but it's the most consequential decision in your RAG pipeline. A customer support bot using semantic chunking will answer questions the same fixed-size implementation simply cannot. A legal AI using LLM-based chunking will reason from complete clauses rather than arbitrary 500-token fragments.
The decision framework is straightforward: start simple, measure precisely, and invest in complexity only where the quality gain justifies the cost. For most applications, recursive chunking at 512 tokens is a solid default. For applications where accuracy is a business requirement, not just a nice-to-have, semantic or hierarchical chunking pays for itself many times over.
Get the chunks right, and your LLM finally has something worth reasoning from.
Thanks
Sreeni Ramadorai