Advanced RAG: Parsing Complex Medical PDFs with LayoutLMv3 and LlamaIndex

Let’s be honest: PDFs are where data goes to die. Especially medical check-up reports. They are a nightmare of nested tables, multi-column layouts, and cryptic abbreviations. If you’ve ever tried a naive "chunk-by-character" RAG (Retrieval-Augmented Generation) approach on a medical PDF, you know the pain—your LLM ends up hallucinating because it lost the context of which value belongs to which blood marker.

In this guide, we’re moving past "Hello World" RAG. We’ll build a production-grade pipeline using LayoutLMv3, LlamaIndex, and Qdrant to turn messy pixels into structured, searchable intelligence. We are focusing on layout-aware indexing to ensure your vector search actually understands the difference between a header and a table cell. 🚀

The Architecture: Layout-Aware RAG Pipeline

Standard RAG treats text as a flat stream. Our advanced approach treats it as a structured document. We use vision-based models to "see" the page before we "read" the text.

graph TD
    A[Raw Medical PDF] --> B{Layout Analysis}
    B -- LayoutLMv3 --> C[Identify Tables, Headers, Text]
    C --> D[Unstructured.io Partitioning]
    D --> E[Semantic Chunking & Metadata Enrichment]
    E --> F[(Qdrant Vector Store)]
    G[User Query] --> H[Hybrid Retriever]
    F --> H
    H --> I[LLM Synthesis]
    I --> J[Structured Medical Insight]

Why standard PDF parsers fail medical data

  1. Table Fragmentation: Tables often span multiple pages. A simple split cuts a row in half.
  2. Spatial Context: In a medical report, the "Reference Range" is just as important as the "Result." If they aren't indexed together, the data is useless.
  3. Visual Cues: Bold text often indicates a high/low alert. Standard parsers ignore this visual weight.

Prerequisites

To follow along, you’ll need:

  • Python 3.10+
  • unstructured[all-docs] (for layout analysis)
  • llama-index and the llama-index-vector-stores-qdrant integration (for orchestration)
  • qdrant-client (for vector storage)
  • Access to an LLM (OpenAI or local via Ollama)

Step 1: Intelligent Partitioning with LayoutLMv3

We use Unstructured.io powered by LayoutLMv3 to detect document elements. Unlike plain OCR, which just returns strings, this gives us typed "Elements" (Table, Title, NarrativeText).

from unstructured.partition.pdf import partition_pdf

# This identifies the structural components using a vision model
elements = partition_pdf(
    filename="medical_report_2023.pdf",
    strategy="hi_res",           # Required for LayoutLMv3
    infer_table_structure=True,  # Extract table HTML
    chunking_strategy="by_title",  # Group text under relevant headers
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Separate tables from narrative text for specialized handling
tables = [el for el in elements if el.category == "Table"]
text_segments = [el for el in elements if el.category != "Table"]
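
Because infer_table_structure=True is set, each Table element also carries a reconstructed HTML version of the table alongside its flattened text. A quick sanity check (a hedged sketch; the printed HTML is illustrative):

# Inspect what the layout model recovered for each table. The HTML
# representation preserves row/column alignment, unlike the plain .text.
for table in tables:
    print(f"Table on page {table.metadata.page_number}:")
    print((table.metadata.text_as_html or "")[:300])  # e.g. "<table><tr><td>Glucose</td>..."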

Step 2: Structured Indexing with LlamaIndex & Qdrant

Now that we have structured elements, we need to store them in Qdrant. We don't just store the text; we store the metadata (e.g., "this is a table from page 4").

For more production-ready examples and advanced indexing patterns, I highly recommend checking out the deep dives over at WellAlly Tech Blog, where they explore high-concurrency RAG architectures.

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

# Initialize Qdrant
client = qdrant_client.QdrantClient(location=":memory:") # Use cloud/docker for production
vector_store = QdrantVectorStore(client=client, collection_name="medical_reports")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Convert Unstructured elements to LlamaIndex Documents
documents = []
for el in elements:
    doc = Document(
        text=el.text,
        metadata={
            "type": el.category,
            "page_number": el.metadata.page_number,
            "table_html": el.metadata.text_as_html if el.category == "Table" else None
        }
    )
    documents.append(doc)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Step 3: The Hybrid Retrieval Strategy

When a doctor (or a user) asks "What was my Glucose level over the last three reports?", we need to retrieve both the relevant table rows and the summary text.
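
Qdrant also supports true hybrid (dense + sparse) search through LlamaIndex, which helps when queries mix exact marker names ("LDL-C") with natural language. A minimal sketch, assuming the fastembed package is installed so the default sparse embedding model can be loaded:

# Hedged sketch: rebuild the store with hybrid search enabled.
hybrid_store = QdrantVectorStore(
    client=client,
    collection_name="medical_reports_hybrid",
    enable_hybrid=True,
)
hybrid_index = VectorStoreIndex.from_documents(
    documents,
    storage_context=StorageContext.from_defaults(vector_store=hybrid_store),
)

# Dense and sparse scores are fused at query time
hybrid_engine = hybrid_index.as_query_engine(
    vector_store_query_mode="hybrid",
    similarity_top_k=5,
    sparse_top_k=10,
)

For the rest of this walkthrough, the plain dense retriever below is enough: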

from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Plain dense retriever over all indexed elements (tables and narrative text)
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
)

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[], # Add Re-ranking here for better precision!
)

response = query_engine.query("Analyze the lipid profile. Are there any values outside the reference range?")
print(f"Analysis: {response}")
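
To fill that node_postprocessors slot, a cross-encoder re-ranker is a natural fit. A hedged sketch using LlamaIndex's SentenceTransformerRerank (assumes sentence-transformers is installed; the model name is just one common choice):

from llama_index.core.postprocessor import SentenceTransformerRerank

# Re-score the top-k retrieved chunks against the query with a cross-encoder,
# then keep only the best few before handing them to the LLM.
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3,
)

reranked_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[reranker],
)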

Advanced Pro-Tip: Metadata Filtering

In medical RAG, time is a critical dimension, so you should always include a report_date in your metadata. With Qdrant's payload filtering you can then restrict retrieval to, say, only the most recent three months of reports, preventing "stale" information from contaminating the LLM's answer.
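
A minimal sketch of what that looks like with LlamaIndex's metadata filters, assuming you add an integer report_date field (e.g. YYYYMMDD) to each Document's metadata at indexing time, which the snippets above don't yet do:

from llama_index.core.vector_stores import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

# Only consider reports from 2024-01-01 onwards
# (assumes report_date was stored as a YYYYMMDD integer)
recent_only = MetadataFilters(
    filters=[
        MetadataFilter(
            key="report_date",
            value=20240101,
            operator=FilterOperator.GTE,
        )
    ]
)

recent_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
    filters=recent_only,  # translated into a Qdrant payload filter
)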

Conclusion

Handling medical PDFs isn't just about OCR; it's about Document Intelligence. By combining LayoutLMv3's vision capabilities with LlamaIndex's orchestration, we transform a flat PDF into a rich, structured, searchable knowledge base.

If you are looking to scale this to thousands of concurrent users or need to implement HIPAA-compliant data masking within your RAG pipeline, the WellAlly Tech Blog is a goldmine for those advanced engineering patterns.

What’s your biggest struggle with PDF parsing? Let's discuss in the comments below!
