Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.
So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.
Here's what I built, what worked, and what broke.
[Screenshot: uploading a PDF]
[Screenshot: querying the document]
The architecture
PDF --> extract text (pypdf) --> chunk (500 char, 50 overlap) --> embed (MiniLM-L6-v2)
|
v
question --> embed --> FAISS top-k search --> build prompt with chunks --> LLM --> answer + sources
Five Python files, ~300 lines total:
| File | Responsibility |
|---|---|
| main.py | FastAPI app, 3 endpoints, prompt engineering |
| pdf_loader.py | PDF text extraction via pypdf |
| rag.py | Chunking + embedding |
| store.py | FAISS vector store wrapper |
| llm.py | Swappable LLM client (Groq / OpenAI / Anthropic) |
How the upload works
When you POST a PDF to /upload, three things happen:
1. Text extraction — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.
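Here's a minimal sketch of that extraction step, assuming pypdf's PdfReader. The function names (pages_from_reader, load_pages) and the 1-based page numbering are my assumptions for illustration, not necessarily what pdf_loader.py does. It returns (text, page number) pairs, which is the shape the chunker consumes.

```python
from typing import Iterable, List, Tuple


def pages_from_reader(pages: Iterable) -> List[Tuple[str, int]]:
    """Return (text, 1-based page number) pairs, skipping pages with no text."""
    result = []
    for page_num, page in enumerate(pages, start=1):
        # extract_text() can come back empty for scanned, image-only pages.
        text = (page.extract_text() or "").strip()
        if text:
            result.append((text, page_num))
    return result


def load_pages(path: str) -> List[Tuple[str, int]]:
    # Imported here so the helper above stays usable without pypdf installed.
    from pypdf import PdfReader

    return pages_from_reader(PdfReader(path).pages)
```

Keeping the page number next to the text is what later lets each answer cite the page it came from.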
2. Chunking — each page is split into ~500-character chunks with 50 characters of overlap. The overlap prevents losing context at chunk boundaries.
    CHUNK_SIZE = 500
    CHUNK_OVERLAP = 50

    def chunk_pages(pages):
        chunks = []
        chunk_id = 0  # runs across the whole document, not per page
        for text, page_num in pages:
            start = 0
            while start < len(text):
                end = min(start + CHUNK_SIZE, len(text))
                chunk_text = text[start:end].strip()
                if chunk_text:
                    chunks.append(Chunk(chunk_id=chunk_id, text=chunk_text, page=page_num))
                    chunk_id += 1
                if end == len(text):
                    break  # last chunk of this page
                # step back so consecutive chunks share CHUNK_OVERLAP characters
                start = end - CHUNK_OVERLAP
        return chunks
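To see where the boundaries actually fall, here's a small worked example. It reuses the same window logic but returns only the (start, end) slice positions; chunk_spans is a helper I made up for this illustration and isn't part of the repo.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_spans(text_len: int) -> list:
    """Return the (start, end) slices the sliding window produces."""
    spans = []
    start = 0
    while start < text_len:
        end = min(start + CHUNK_SIZE, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        # Step back by the overlap so neighbouring chunks share 50 characters.
        start = end - CHUNK_OVERLAP
    return spans


print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

Each new chunk starts 450 characters after the previous one, so a 1,200-character page yields three chunks, and characters 450-500 and 900-950 each appear in two of them.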
3. Embedding — each chunk is embedded into a 384-dimensional vector using all-MiniLM-L6-v2. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.
    def embed_texts(texts):
        model = get_embed_model()  # lazy-loaded singleton
        vectors = model.encode(
            texts,
            normalize_embeddings=True,  # unit length, so inner product = cosine
            show_progress_bar=False,
            convert_to_numpy=True,
        )
        return vectors.astype("float32")  # FAISS works with float32
The vectors and chunk metadata go into a FAISS IndexFlatIP index — brute-force exact search, which is fine for up to ~100k vectors.
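To make the "inner product as cosine similarity" point concrete, here's a dependency-free sketch of what an exact inner-product search computes. This is a plain-Python stand-in I wrote for illustration, not FAISS and not the repo's store.py; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Exact search: score every stored vector, keep the k best."""
    q = normalize(query)
    scored = [(i, inner_product(q, normalize(v))) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
# Index 0 points the same way as the query (score 1.0); index 2 sits at 45 degrees.
print(top_k([2.0, 0.0], vectors, k=2))
```

Because every vector has length 1 after normalization, the inner product is exactly the cosine of the angle between them. Vector magnitude drops out, which is why [0.0, 2.0] and [3.0, 3.0] are compared on direction alone.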
How the query works
When you POST a question to /query:
- The question is embedded using the same model
- FAISS finds the top-k most similar chunks by cosine similarity
- The chunks are formatted into a prompt with labels like [Chunk 3 | Page 2]
- The LLM generates an answer grounded in those chunks
- Both the answer and source chunks are returned
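The prompt-assembly step in the list above could look roughly like this. The Chunk fields match the ones used in the chunker, but build_context, build_user_message, and the "Context: ... Question: ..." layout are assumptions made for this sketch, not the repo's actual prompt code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    chunk_id: int
    text: str
    page: int


def build_context(chunks: List[Chunk]) -> str:
    """Label each retrieved chunk so the answer can point back to its source."""
    blocks = [f"[Chunk {c.chunk_id} | Page {c.page}]\n{c.text}" for c in chunks]
    return "\n\n".join(blocks)


def build_user_message(question: str, chunks: List[Chunk]) -> str:
    return f"Context:\n{build_context(chunks)}\n\nQuestion: {question}"


# Placeholder chunk text, not taken from any real document.
retrieved = [Chunk(3, "First retrieved passage.", 2), Chunk(7, "Second retrieved passage.", 4)]
print(build_user_message("What does the document say?", retrieved))
```

Putting the chunk ID and page number in the label gives the model something to cite and makes it easy to return the matching source chunks alongside the answer.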
The system prompt is deliberately strict:
    You are a careful assistant that answers questions strictly
    from the provided document context.

    Rules:
    - Use ONLY the context below. Do not use outside knowledge.
    - If the answer is not in the context, say:
      "I couldn't find that in the document."
Swappable LLM providers
One thing I'm happy with — the LLM is swappable via a single environment variable:
    LLM_PROVIDER=groq  # or openai, or anthropic
All three providers share the same interface:
    from abc import ABC, abstractmethod

    class LLMClient(ABC):
        @abstractmethod
        def generate(self, system: str, user: str) -> str: ...
You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.
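One way to wire the environment variable to that interface is a small registry plus a factory. This is a sketch under assumptions: PROVIDERS, get_client, and EchoClient are names invented here, and EchoClient stands in for the real Groq / OpenAI / Anthropic clients so the example runs without any API key.

```python
import os
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class EchoClient(LLMClient):
    """Stand-in provider for the example; a real one would call its SDK here."""

    def generate(self, system: str, user: str) -> str:
        return f"(echo) {user}"


# Hypothetical registry: each key is a value LLM_PROVIDER may take.
PROVIDERS: Dict[str, Type[LLMClient]] = {"echo": EchoClient}


def get_client(provider: Optional[str] = None) -> LLMClient:
    """Pick a client by explicit name, falling back to the LLM_PROVIDER variable."""
    name = (provider or os.environ.get("LLM_PROVIDER", "echo")).strip().lower()
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(
            f"Unknown LLM_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}"
        ) from None


print(get_client("echo").generate("system prompt", "hello"))  # (echo) hello
```

The rest of the app only ever sees LLMClient, so adding a provider means adding one class and one registry entry.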
Testing it: what worked and what didn't
I created a fictional 5-page company document and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).
What worked well:
- Direct lookups: "What is the list price of the Magpie-7?" — nailed it
- Table data: "What's included in the Standard tier?" — correct
- Negative tests: "What's Zentara's stock ticker?" — correctly said "not in the document"
- Multi-hop: "If I want 1-hour SLA support, what will it cost?" — combined info from the pricing table
What failed:
- "Who is the CEO?" — couldn't find it
- "How many employees does Zentara have?" — couldn't find it
Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.
Why it failed (and what I learned)
The problem wasn't the LLM — it was the retriever. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.
This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search (BM25) would find it instantly. But vector search relies on semantic similarity, and a short query like "Who is the CEO?" doesn't produce a strong enough match against a chunk that's 80% about revenue, headquarters, and employee count.
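The fix: hybrid retrieval, which combines BM25 (keyword matching) with vector search so that an exact term like "CEO" can still surface a chunk the embedding ranks poorly. Many production RAG systems take this approach. It's on my to-do list.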
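I haven't built this yet, so treat the following as a sketch of the idea rather than the project's code. It pairs a tiny BM25 scorer with reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking. The three documents and the vector ranking are made up to mimic the failure described above, and k1=1.5, b=0.75, k=60 are conventional defaults, not tuned values.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return indices of docs that match at least one query term, best first."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    matched = [i for i, s in enumerate(scores) if s > 0]
    return sorted(matched, key=lambda i: scores[i], reverse=True)


def rrf_fuse(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


docs = [
    "Pricing: Standard tier includes email support.",
    "Company snapshot: CEO, CTO, HQ, employees, revenue.",
    "Warranty terms and return policy.",
]
keyword_ranking = bm25_rank("Who is the CEO?", docs)
vector_ranking = [0, 2, 1]  # made-up vector order where the snapshot chunk ranks last
print(rrf_fuse([keyword_ranking, vector_ranking]))  # [1, 0, 2]
```

In this toy setup the snapshot chunk is last in the vector ranking but the only keyword match, and fusion lifts it to first place. RRF works on ranks rather than raw scores, so BM25 scores and cosine similarities never have to be put on the same scale.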
Key design decisions (interview-ready)
If you're building this for interviews, these are the tradeoffs worth knowing:
| Decision | Why |
|---|---|
| Character-based chunking (not token-based) | Simpler, no tokenizer dependency. Production would likely count tokens instead, e.g. with tiktoken or the embedding model's own tokenizer. |
| Local embeddings (not OpenAI) | Free, offline, no API latency. Lower quality but fine for demos. |
| FAISS IndexFlatIP (not HNSW) | Exact search, no approximation. Fine up to ~100k vectors. |
| Normalized embeddings | Inner product = cosine similarity. One less thing to configure. |
| No streaming | v1 simplification. Streaming is where LLM SDKs diverge the most. |
| No conversation memory | Each query is independent. Adding memory is straightforward but adds complexity. |
What I'd add next
- Hybrid retrieval (BM25 + vector) — catches keyword matches that pure semantic search misses
- Reranker (cross-encoder) — re-scores the top-k results for better precision
- Evaluation set — automated accuracy measurement instead of manual testing
- Streaming — better UX for longer answers
- Conversation memory — follow-up questions
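For the evaluation set, a first version could be as small as the sketch below. It assumes a substring check is a good enough notion of "correct" and that negative tests should produce the refusal sentence from the system prompt; EvalCase, is_correct, and run_eval are hypothetical names, and the toy questions are placeholders rather than my real test set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

REFUSAL = "I couldn't find that in the document."


@dataclass
class EvalCase:
    question: str
    expected: Optional[str]  # None marks a negative test: the pipeline should refuse


def is_correct(case: EvalCase, answer: str) -> bool:
    """Case-insensitive substring match against the expected text or the refusal."""
    target = REFUSAL if case.expected is None else case.expected
    return target.lower() in answer.lower()


def run_eval(cases: List[EvalCase], ask: Callable[[str], str]) -> float:
    """Return the fraction of cases whose answer passes the substring check."""
    if not cases:
        return 0.0
    passed = sum(is_correct(case, ask(case.question)) for case in cases)
    return passed / len(cases)


def fake_pipeline(question: str) -> str:
    # Stand-in for the real /query call so the example runs offline.
    return "The sky is blue." if "sky" in question else REFUSAL


cases = [
    EvalCase("What color is the sky?", "blue"),
    EvalCase("What is the stock ticker?", None),
    EvalCase("What color is grass?", "green"),
]
print(f"accuracy: {run_eval(cases, fake_pipeline):.2f}")  # accuracy: 0.67
```

Substring matching is crude, but it turns "feels better" into a number I can re-run after every retrieval change.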
Try it yourself
The repo is here: github.com/santanu2908/chat-with-pdf-rag (v1)
    uv sync
    cp .env.example .env  # set your API key
    uv run uvicorn app.main:app --reload
Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and start asking questions.
If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.
I'm Santanu Mohanta — you can connect with me on LinkedIn or check out my other projects on GitHub.




Top comments (7)
"No LangChain, just FastAPI + FAISS" is a choice a lot of people are quietly making, and for good reason - for a straightforward RAG pipeline the framework often adds more abstraction (and debugging pain) than it saves, and rolling it yourself means you actually understand every step and can tune it. Frameworks earn their keep on complex multi-step orchestration; for "embed, store, retrieve, stuff context," raw is frequently cleaner. Knowing WHEN you've outgrown DIY is the real skill.
The payoff of building it raw is exactly what shows up when quality matters: you control chunking, the embedding choice, and can add a re-ranker - the levers that actually drive retrieval quality, which a framework can obscure. Owning those is owning the part that makes RAG good vs mediocre. That control-the-retrieval-quality discipline is what I lean on in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - scoped, well-retrieved context beats framework magic, for quality and cost. Clean build, FastAPI+FAISS is a solid no-framework base. At what point would you reach for a framework - or have you found raw scales fine even as the pipeline grows? Curious where your DIY ceiling is.
Thanks for the thoughtful comment! Completely agree — for a focused pipeline like this, raw gives you the understanding that frameworks abstract away. You can't debug what you don't understand.
My plan is to keep building raw for a few more iterations — hybrid retrieval (BM25 + vector), a reranker, evaluation harness, streaming — basically until I've implemented the core components that actually drive RAG quality. Once I'm comfortable with how each piece works under the hood, I'll move to a framework like LangChain or LlamaIndex for the orchestration layer. At that point, the framework becomes a productivity tool rather than a black box — I'll know exactly what it's doing for me and where to look when something breaks.
So to answer your question: my DIY ceiling is when I've touched enough of the moving parts to have strong intuition about what the framework is abstracting. Not there yet, but getting close. Moonshift sounds like a great example of that approach — scoped retrieval over framework magic.
Exactly, for a focused pipeline the no-framework route is the right call: FastAPI + FAISS gives you full control over chunking, the embedding step, and re-ranking, which IS retrieval quality. The framework only earns its weight once you need the generic abstractions, and most focused RAG never does. The one upgrade I'd put next on your list is an abstain path: when retrieval comes back thin, "I don't have support for this" beats letting the model paper over the gap with a fluent guess. That single check is what makes raw RAG trustworthy. Genuinely good build-from-scratch writeup.
Thanks! Agree on the abstain path — a retrieval-level confidence check before even hitting the LLM would be more robust. Something like a minimum similarity threshold on the top-k results. Good call, adding it to the list.
That's exactly the right ceiling: the framework stops being a black box the moment you've hand-built enough of the moving parts to know what it's hiding. Your ordering is also right: hybrid retrieval and a reranker move quality far more than orchestration sugar does, and the evaluation harness is the piece most people skip and then can't tell whether a change actually helped. Build that early; it turns "feels better" into a number you can defend. By the time you reach for LangChain/LlamaIndex you'll be using it for plumbing while keeping your own judgment on retrieval, which is the healthy split. I went the same way with Moonshift: own the parts that decide quality, let a framework handle the boilerplate. When you add the reranker, are you leaning cross-encoder, or LLM-as-reranker?
Cross-encoder first: lightweight, fast, no extra API calls. Something like cross-encoder/ms-marco-MiniLM-L-6-v2 keeps it local and consistent with the local-embeddings philosophy.
LLM-as-reranker is interesting but adds latency and API cost for what's still a small-scale pipeline. If cross-encoder precision isn't enough, that's when I'd experiment with LLM reranking.
Thanks for the validation on the evaluation harness.
Great to see a detailed walk-through! How did you handle real-time query processing with FastAPI?