The Challenge: Making 250-Page Documents Searchable by Meaning.
In Part 1, I showed you QuantTrade AI answering questions like:
"What's NVIDIA's China revenue exposure?" by pulling exact sections from its 10-K filing.
Here's what happens under the hood when you ask that question:
- Your query gets converted into a 384-dimensional vector (embedding)
- We search through thousands of document chunks stored in PostgreSQL
- Retrieve the top 5 most semantically similar sections
- Feed them to an LLM (Opus or GPT) along with context
- Get back a cited, evidence-backed answer
- Total time: ~800ms (500ms for chart, 300ms for RAG retrieval)
The secret? pgvector — a PostgreSQL extension that turns your familiar SQL database into a high-performance vector store.
Let me show you exactly how we built it.
You might be wondering: why pgvector over Pinecone, Weaviate, or Qdrant?
When I started this project, I evaluated several vector database options:
| Solution | Pros | Cons | Verdict |
|---|---|---|---|
| Pinecone | Managed, fast, great DX | $70/month minimum, vendor lock-in | Too expensive for MVP |
| Weaviate | Open source, feature-rich | Complex setup, separate infrastructure | Overkill for our scale |
| Qdrant | Rust-based, high performance | Another service to manage | Considered as fallback |
| pgvector | Lives in existing PostgreSQL, simple, free | Slightly slower at massive scale | Perfect fit |
For our use case (10,000-50,000 document chunks), pgvector hits the sweet spot between simplicity and performance.
pgvector fits our architecture because it runs directly inside PostgreSQL:
- Embeddings live alongside OHLCV data, sessions, and chat history, with no extra service to manage
- Writes get ACID-level transaction safety
- It stays cost-efficient on platforms like Neon, Render, or Supabase
- Searches come back in well under 100 ms at our scale
- Everything is plain, familiar SQL rather than a new query language
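The first and last points are easy to demonstrate. Here's a minimal sketch of a single SQL statement that mixes vector search with ordinary relational data; the connection string and the `ohlcv` table (name and columns) are illustrative placeholders, not our exact schema:

```python
# Minimal sketch: because embeddings live next to our other tables, one SQL
# statement can combine vector search with regular relational data.
# The connection string and the ohlcv table are illustrative placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/quanttrade")

def chunks_with_latest_close(query_embedding: list, ticker: str):
    """Top-5 chunks for a ticker, joined with its most recent closing price."""
    sql = text("""
        SELECT c.content,
               1 - (c.embedding <=> CAST(:qvec AS vector)) AS similarity,
               p.close AS latest_close
        FROM document_chunks c
        JOIN LATERAL (
            SELECT close
            FROM ohlcv
            WHERE ohlcv.ticker = c.ticker
            ORDER BY ts DESC
            LIMIT 1
        ) p ON TRUE
        WHERE c.ticker = :ticker
        ORDER BY c.embedding <=> CAST(:qvec AS vector)
        LIMIT 5
    """)
    with engine.connect() as conn:
        # pgvector accepts the '[v1, v2, ...]' text format, so str(list) works
        return conn.execute(sql, {"qvec": str(query_embedding), "ticker": ticker}).fetchall()
```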
Architecture: The RAG Pipeline
Here's the full flow from SEC filing → searchable vectors:
┌─────────────────────────────────────────────────────────────┐
│ 1. INGEST PHASE (Background Celery Task) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Fetch SEC Filing (10-K, 10-Q, 8-K) │
│ Source: SEC EDGAR API │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Chunk Document │
│ Strategy: 1000 tokens/chunk │
│ Overlap: 200 tokens │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Generate Embeddings │
│ Model: all-MiniLM-L6-v2 │
│ Output: 384-dimensional vectors │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Store in PostgreSQL with pgvector │
│ Table: document_chunks │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. QUERY PHASE (Real-time API Request) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ User asks: "NVDA China exposure?" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Convert query to embedding │
│ Same model: all-MiniLM-L6-v2 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Vector similarity search │
│ SELECT * │
│ ORDER BY embedding <=> query_embedding │
│ LIMIT 5 │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Feed chunks to LLM │
│ With conversation context │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Stream response to frontend │
└─────────────────────────────────────────────────────────────┘
Let's Build the Vector Database
Step 1: Database Schema Setup
First, we enable pgvector and create our tables:
```sql
CREATE EXTENSION IF NOT EXISTS vector;   -- Enable pgvector

-- Store document chunks with embeddings
CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    ticker VARCHAR(10) NOT NULL,          -- e.g., 'NVDA'
    document_type VARCHAR(20) NOT NULL,   -- '10-K', '10-Q', '8-K'
    filing_date DATE NOT NULL,
    chunk_index INTEGER NOT NULL,         -- Position in document
    content TEXT NOT NULL,                -- The actual text chunk
    embedding vector(384),                -- 384-dimensional embedding
    metadata JSONB,                       -- Extra context (section name, page, etc.)
    created_at TIMESTAMP DEFAULT NOW()
);

-- Critical: index for fast similarity search
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Additional indexes for filtering
CREATE INDEX idx_ticker ON document_chunks(ticker);
CREATE INDEX idx_document_type ON document_chunks(document_type);
```
- vector(384): Matches our embedding model's output dimension
- ivfflat index: pgvector's approximate nearest neighbor search (99% accuracy, 10x faster); its probes setting is tunable, as shown in the sketch below
- vector_cosine_ops: Uses cosine distance (best for normalized embeddings)
- lists = 100: Index parameter (tuned for ~10K-50K vectors)
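At query time, ivfflat only scans `probes` of those lists (default 1), so you can trade a little latency for better recall per session. A quick sketch, reusing the SQLAlchemy `engine` from the sketch above and a `query_embedding` produced by the embedding model introduced in Step 3:

```python
# Sketch: raising ivfflat.probes scans more lists per query, improving recall
# at the cost of latency. Assumes the engine and query_embedding defined in
# other snippets of this post.
from sqlalchemy import text

with engine.connect() as conn:
    conn.execute(text("SET ivfflat.probes = 10"))  # session-scoped setting
    rows = conn.execute(
        text("""
            SELECT id, ticker, content
            FROM document_chunks
            ORDER BY embedding <=> CAST(:qvec AS vector)
            LIMIT 5
        """),
        {"qvec": str(query_embedding.tolist())},
    ).fetchall()
```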
Step 2: Document Chunking Strategy
Semantic chunking with overlap
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text: str, ticker: str, doc_type: str):
    """
    Chunk SEC filing into semantically coherent segments.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,       # Target ~1000 tokens
        chunk_overlap=200,     # 200 token overlap for context
        separators=["\n\n", "\n", ". ", " ", ""],  # Respect paragraph boundaries
        length_function=len
    )
    chunks = splitter.split_text(text)

    # Add metadata for each chunk
    chunk_objects = []
    for idx, chunk in enumerate(chunks):
        chunk_objects.append({
            "ticker": ticker,
            "document_type": doc_type,
            "chunk_index": idx,
            "content": chunk,
            "metadata": {
                "total_chunks": len(chunks),
                "position": f"{idx+1}/{len(chunks)}"
            }
        })
    return chunk_objects
```
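To make the ingest flow concrete, here's roughly how this gets called; `fetch_filing_text` is a hypothetical helper standing in for the SEC EDGAR fetch step:

```python
# Illustrative usage; fetch_filing_text is a hypothetical helper standing in
# for the SEC EDGAR fetch step of the ingest pipeline.
filing_text = fetch_filing_text(ticker="NVDA", form_type="10-K")
chunks = chunk_document(filing_text, ticker="NVDA", doc_type="10-K")

print(len(chunks))                        # a few hundred chunks for a long 10-K
print(chunks[0]["metadata"]["position"])  # e.g. "1/<total_chunks>"
```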
Step 3: Generating Embeddings
We use sentence-transformers with the all-MiniLM-L6-v2 model:
```python
from sentence_transformers import SentenceTransformer
from typing import List
import numpy as np

class EmbeddingService:
    def __init__(self):
        # Lazy initialization to avoid cold start overhead
        self._model = None

    @property
    def model(self):
        """Load model only when needed."""
        if self._model is None:
            self._model = SentenceTransformer('all-MiniLM-L6-v2')
        return self._model

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate 384-dimensional embeddings for text chunks.

        Args:
            texts: List of text chunks to embed

        Returns:
            numpy array of shape (len(texts), 384)
        """
        # Batch processing for efficiency
        embeddings = self.model.encode(
            texts,
            batch_size=32,              # Process 32 chunks at once
            show_progress_bar=False,
            convert_to_numpy=True,
            normalize_embeddings=True   # L2 normalization for cosine similarity
        )
        return embeddings
```
Performance optimizations:
- Lazy loading : Model loads on first use (saves ~500MB RAM on cold start)
- Batch processing : 32 texts at once (5x faster than one-by-one)
- Normalization : Pre-normalized for cosine similarity
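A quick, illustrative sanity check of the service:

```python
# Illustrative sanity check of the embedding service.
import numpy as np

service = EmbeddingService()
vectors = service.generate_embeddings(["China represented 14% of total revenue"])

print(vectors.shape)                       # (1, 384)
print(float(np.linalg.norm(vectors[0])))   # ~1.0, thanks to normalize_embeddings=True
```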
Step 4: Storing Vectors in PostgreSQL
Here's how we insert chunks with embeddings:

```python
from sqlalchemy import create_engine, text
from typing import List, Dict
import json

class VectorStore:
    def __init__(self, db_url: str):
        self.engine = create_engine(db_url)
        self.embedding_service = EmbeddingService()

    def add_documents(self, chunks: List[Dict]):
        """
        Add document chunks with embeddings to PostgreSQL.
        """
        # Extract text for embedding generation
        texts = [chunk["content"] for chunk in chunks]

        # Generate embeddings in batch
        embeddings = self.embedding_service.generate_embeddings(texts)

        # Insert into database
        with self.engine.connect() as conn:
            for chunk, embedding in zip(chunks, embeddings):
                conn.execute(
                    text("""
                        INSERT INTO document_chunks
                            (ticker, document_type, filing_date, chunk_index,
                             content, embedding, metadata)
                        VALUES
                            (:ticker, :doc_type, :filing_date, :chunk_index,
                             :content, :embedding, :metadata)
                    """),
                    {
                        "ticker": chunk["ticker"],
                        "doc_type": chunk["document_type"],
                        "filing_date": chunk.get("filing_date"),
                        "chunk_index": chunk["chunk_index"],
                        "content": chunk["content"],
                        "embedding": embedding.tolist(),  # Convert numpy to list
                        "metadata": json.dumps(chunk["metadata"])
                    }
                )
            conn.commit()
```
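Putting the ingest phase together, the background job from the architecture diagram looks roughly like this; the Celery wiring, the `fetch_filing` helper, and the environment variable names are placeholders rather than our exact production code:

```python
# Rough sketch of the background ingest task from the architecture diagram.
# The Celery wiring, fetch_filing helper, and env var names are placeholders.
import os
from celery import Celery

celery_app = Celery("quanttrade", broker=os.environ["REDIS_URL"])

@celery_app.task
def ingest_filing(ticker: str, doc_type: str, filing_date: str):
    raw_text = fetch_filing(ticker, doc_type)             # SEC EDGAR fetch (placeholder)
    chunks = chunk_document(raw_text, ticker, doc_type)   # Step 2
    for chunk in chunks:
        chunk["filing_date"] = filing_date                # needed by the INSERT above

    store = VectorStore(os.environ["DATABASE_URL"])       # Steps 3-4
    store.add_documents(chunks)
    return {"ticker": ticker, "chunks": len(chunks)}
```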
Step 5: Vector Similarity Search
Now for the magic — retrieving relevant chunks:
```python
def similarity_search(
    self,
    query: str,
    ticker: str = None,
    k: int = 5
) -> List[Dict]:
    """
    Find top-k most similar document chunks to query.

    Args:
        query: User's question
        ticker: Optional stock filter (e.g., 'NVDA')
        k: Number of results to return

    Returns:
        List of matching chunks with similarity scores
    """
    # Convert query to embedding
    query_embedding = self.embedding_service.generate_embeddings([query])[0]

    # Build SQL query
    sql = """
        SELECT
            id,
            ticker,
            document_type,
            filing_date,
            content,
            metadata,
            1 - (embedding <=> :query_embedding) AS similarity
        FROM document_chunks
        WHERE 1=1
    """
    params = {"query_embedding": query_embedding.tolist(), "k": k}

    # Optional ticker filter
    if ticker:
        sql += " AND ticker = :ticker"
        params["ticker"] = ticker

    sql += """
        ORDER BY embedding <=> :query_embedding
        LIMIT :k
    """

    with self.engine.connect() as conn:
        results = conn.execute(text(sql), params).fetchall()

    return [
        {
            "id": row[0],
            "ticker": row[1],
            "document_type": row[2],
            "filing_date": row[3],
            "content": row[4],
            # JSONB may come back as a dict or a string depending on the driver
            "metadata": row[5] if isinstance(row[5], dict) else json.loads(row[5]),
            "similarity": float(row[6])
        }
        for row in results
    ]
```
Output:
Score: 0.847 | China represented 14% of total revenue in Q1
Score: 0.721 | Export controls may impact H200 chip sales
Score: 0.634 | NVIDIA's Data Center revenue was $10.3B in Q2
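From there, the retrieved chunks get stitched into the LLM prompt with citations. A minimal sketch of that step; the prompt wording and the `call_llm` helper are illustrative, not our exact production prompt:

```python
# Illustrative: turning retrieved chunks into LLM context with citations.
# The prompt wording and call_llm helper are placeholders; `store` is a
# VectorStore instance from Step 4.
def build_context_prompt(question, chunks):
    sources = "\n\n".join(
        f"[{i + 1}] ({c['ticker']} {c['document_type']}, {c['filing_date']})\n{c['content']}"
        for i, c in enumerate(chunks)
    )
    return (
        "Answer using ONLY the sources below and cite them by [number].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

question = "What is NVDA's China revenue exposure?"
chunks = store.similarity_search(question, ticker="NVDA", k=5)
prompt = build_context_prompt(question, chunks)
# answer = call_llm(prompt)  # placeholder for the Opus/GPT call
```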
Performance Metrics: Does It Actually Work?
| Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|
| Query latency | 3.2 s | 287 ms | 11× faster |
| Embedding generation | 450 ms (per chunk) | 89 ms (batch of 32) | 5× faster |
| Index build time | 15 min (10K docs) | 15 min | Same (one-time) |
| Memory usage | 2.1 GB | 850 MB | 2.5× reduction |
| Accuracy (Top-5 retrieval) | 94% | 94% | Same |
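If you want to confirm the ivfflat index is actually being used (rather than a sequential scan), `EXPLAIN ANALYZE` is the quickest check; a sketch reusing the engine and a query embedding from the earlier snippets:

```python
# Quick check that queries hit the ivfflat index instead of a sequential scan.
# Reuses the SQLAlchemy engine and a query_embedding from earlier sketches.
from sqlalchemy import text

with engine.connect() as conn:
    plan = conn.execute(
        text("""
            EXPLAIN ANALYZE
            SELECT id
            FROM document_chunks
            ORDER BY embedding <=> CAST(:qvec AS vector)
            LIMIT 5
        """),
        {"qvec": str(query_embedding.tolist())},
    ).fetchall()

for (line,) in plan:
    print(line)  # expect an "Index Scan" on the embedding index, not a "Seq Scan"
```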
What's Next?
In Part 3, I'll show you how we make AI predictions explainable using SHAP:
"From Black Box to Glass Box: Building Explainable Risk Scores"
- Why transparency matters when real money is involved
- How SHAP breaks down risk into individual factors
- Making ML predictions auditable with mathematical proofs
The full project lives at YashJoshi2109/QuantTrade-AI (QuantCopilot, an AI-powered trading and research terminal). Live app: https://quanttrade.us/