Your invoicing system needs to ingest scanned purchase orders. Your accounting platform handles contracts with cross-page tables. The text inside these PDFs has to come out as structured data, not just a wall of text, or your downstream code has nothing to act on.
In April 2026, LlamaIndex published their ParseBench benchmark showing vision LLMs with specific prompts outperform traditional OCR on layout-heavy documents. The buzz suggests we should all switch to Gemini 3 Flash or GPT-4o with HTML colspan/rowspan prompts. So I ran the comparison live on a messy 2-page purchase order. The results were not what the headlines suggest.
Want to test it on your own documents? Try the OCR Wizard API with a scanned PDF.
Quick comparison
Same 2-page purchase order, 7 line items, repeated shipping-address sub-headers, item 030 split across the page break. Mat.No identifiers (like ALRD00882) are the codes that matter: get one wrong and you ship the wrong product.
| Approach | Latency | Cost per document | Mat.No codes correct | Layout |
|---|---|---|---|---|
| OCR API alone | 1.14s | ~$0.001 | 7 of 7 | lost |
| GPT-4o-mini + prompts | 22s | $0.0087 | 1 of 7 | preserved |
| GPT-4o full + prompts | 20s | $0.0228 | 1 of 7 | preserved |
| Hybrid (OCR + GPT-4o-mini) | 23s | $0.002 | 7 of 7 | preserved |
What ParseBench got right
The benchmark tested 14 parsing methods and found prompt design matters more than model size. LlamaParse Agentic scored 84.9, Gemini 3 Flash 71, beating dedicated parsers like AWS Textract (47.9), Google DocAI (50.4), and Azure Document Intelligence (59.6).
The trick: ask the model to emit HTML tables with colspan and rowspan attributes. Here is the approach as runnable code:
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

SYSTEM_PROMPT = """You are a document parser. Convert PDFs into clean Markdown.
- Convert tables to HTML using <table>, <tr>, <th>, <td>.
- Use colspan and rowspan to preserve merged cells and hierarchical headers.
- Maintain reading order. Output only the parsed content."""

def encode(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "Parse this document. Merge tables split across pages."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode('page1.png')}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode('page2.png')}"}},
        ]},
    ],
)
print(resp.choices[0].message.content)
On my test, both GPT-4o-mini and GPT-4o full produced a correctly structured table. The layout claim holds up.
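"Correctly structured" can be checked mechanically rather than by eye. The sketch below is one way to do it, not part of the original test: it expands colspan and rowspan into a plain grid using only the standard library and reports whether every row ends up with the same number of cells. The names `CellCollector` and `expand_grid` and the sample table are made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (text, colspan, rowspan) for every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text parts, colspan, rowspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("colspan") or 1), int(a.get("rowspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), colspan, rowspan))
            self._cell = None


def expand_grid(html):
    """Return the table as a list of rows, with spanned cells repeated."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    grid = {}  # (row, col) -> cell text
    for r, cells in enumerate(collector.rows):
        c = 0
        for text, colspan, rowspan in cells:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # A None entry marks a hole, i.e. spans that do not add up to a rectangle.
    return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical sample, not the real purchase order.
sample = """
<table>
  <tr><th rowspan="2">Item</th><th colspan="2">Dimensions</th></tr>
  <tr><th>Length</th><th>Width</th></tr>
  <tr><td>030</td><td>3658</td><td>12.700</td></tr>
</table>
"""
grid = expand_grid(sample)
is_rectangular = all(cell is not None for row in grid for cell in row)
print(grid, is_rectangular)
```

A rectangular grid only says the spans are consistent. It says nothing about whether the cell values are right, which is the subject of the next section.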
What ParseBench did not stress-test
Per-character fidelity on identifiers. Both vision LLM runs invented Mat.No codes that look plausible but do not match the source:
| Source | GPT-4o-mini | GPT-4o full |
|---|---|---|
| ALRD00882 | ALU000892 | ALUM0088 |
| ALRD00913 | ALU000913 | ALUM00913 |
| ALSQ00716 | ALU050716 | (dropped) |
| ALPL00534 | ALPL005034 | ALPL05034 |
GPT-4o-mini also rewrote 12.700 (a tolerance in mm) as 12,700, three orders of magnitude off. It misread 3658 mm as 356 mm. GPT-4o full fixed those numeric mistakes but still hallucinated the identifiers.
This is not a flaw in the prompts. It is what happens when a language model generates text from pixels: alphanumeric codes have no linguistic regularity, so the model substitutes characters from codes it has seen in similar layouts. The larger model fixed the numeric errors in this test, but it got no more codes right than the smaller one.
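To see why these errors pass a glance, compare exact-match against a similarity score. The sketch below uses only the four rows shown in the table above (the other three line items are not reproduced here) and `difflib` from the standard library. The helper names are mine, and the similarity ratio is only an illustration of "looks plausible", not a metric from the original test.

```python
from difflib import SequenceMatcher

# (source, GPT-4o-mini, GPT-4o full), copied from the table above.
# None marks the code that GPT-4o full dropped.
ROWS = [
    ("ALRD00882", "ALU000892", "ALUM0088"),
    ("ALRD00913", "ALU000913", "ALUM00913"),
    ("ALSQ00716", "ALU050716", None),
    ("ALPL00534", "ALPL005034", "ALPL05034"),
]


def exact_matches(rows, column):
    """Count rows where the model output equals the source code exactly."""
    return sum(1 for row in rows if row[column] == row[0])


def similarity(source, output):
    """Character-level similarity in [0, 1]; a dropped code scores 0."""
    if output is None:
        return 0.0
    return SequenceMatcher(None, source, output).ratio()


for source, mini, full in ROWS:
    print(source, f"mini={similarity(source, mini):.2f}", f"full={similarity(source, full):.2f}")

print("exact matches, mini:", exact_matches(ROWS, 1))
print("exact matches, full:", exact_matches(ROWS, 2))
```

For an identifier, only the exact-match count matters: a code that is mostly right still points at the wrong product.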
See the full item-by-item output comparison in the complete guide.
The hybrid pipeline
Pure OCR reads every character literally with no language prior, which is why it preserved all 7 codes. But it emits text in a broken reading order on messy layouts. Hybrid splits the work: OCR for fidelity, LLM for layout reconstruction.
Step 1, OCR extracts exact text:
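import requests

def ocr_pdf(pdf_path):
    # Upload the PDF and return the recognized text of all pages, in page order.
    with open(pdf_path, "rb") as f:
        r = requests.post(
            "https://ocr-wizard.p.rapidapi.com/ocr-pdf",
            headers={"x-rapidapi-key": "YOUR_KEY", "x-rapidapi-host": "ocr-wizard.p.rapidapi.com"},
            files={"pdf_file": f},
            data={"first_page": 1, "last_page": 10},
            timeout=60,  # arbitrary; without a timeout, requests can wait forever
        )
    r.raise_for_status()  # fail on HTTP errors instead of parsing an error body
    pages = r.json()["body"]["pages"]
    return "\n\n".join(p["fullText"] for p in pages)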
Step 2, the LLM reconstructs structure under a prompt that forbids changing values:
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")

SYSTEM_PROMPT = """You receive raw OCR text. The OCR is accurate at the character
level but the reading order is broken. Reconstruct the document as clean HTML.
CRITICAL: Every code, number, identifier, email, and date in your output MUST
appear verbatim in the input. Do NOT invent, modify, or correct any value.
Convert tables to HTML with <table>, <tr>, <th>, <td>, colspan and rowspan.
Merge tables split across pages."""

def reconstruct(ocr_text):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"OCR TEXT:\n{ocr_text}\n\nOutput ONLY the HTML."},
        ],
    )
    return resp.choices[0].message.content

# Full pipeline
text = ocr_pdf("purchase_order.pdf")
html = reconstruct(text)
On the same purchase order, this preserved all 7 Mat.No codes, fixed the page-break fragmentation, separated the shipping-address blocks, and produced one well-formed HTML table.
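The prompt asks the model not to change values, but a prompt is a request, not a guarantee. One possible safeguard, which is my addition and not part of the tested pipeline, is to check after the fact that every digit-bearing token in the HTML also appears as a token in the OCR text. The helper names and the sample strings below are hypothetical. Tokens are compared whole rather than by substring, so that 356 is not accepted just because 3658 contains it.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Runs of letters and digits, allowing inner . , / - (as in 12.700 or EN 755-2).
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9.,/\-]*")


def value_tokens(text):
    """Return the set of tokens that contain at least one digit."""
    tokens = set()
    for token in TOKEN.findall(text):
        token = token.rstrip(".,")  # drop sentence punctuation
        if any(ch.isdigit() for ch in token):
            tokens.add(token)
    return tokens


def unverified_values(ocr_text, html):
    """List digit-bearing tokens in the HTML that never occur in the OCR text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    html_text = " ".join(extractor.parts)
    return sorted(value_tokens(html_text) - value_tokens(ocr_text))


# Hypothetical strings that reuse the error types reported above.
ocr_sample = "Mat.No ALRD00882 length 3658 mm tol 12.700"
html_sample = "<table><tr><td>ALRD00882</td><td>356 mm</td><td>12,700</td></tr></table>"
print(unverified_values(ocr_sample, html_sample))
```

An empty list does not prove the table is right (a value could land in the wrong cell), but a non-empty list is a cheap signal to retry or route the document to a human.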
Why hybrid costs less than direct vision LLM
Vision LLM input is dominated by image tokens. Two pages plus prompts run about 51,000 tokens. The hybrid sends only the OCR text, about 1,300 tokens, so the LLM input shrinks by a factor of about 39 (51,000 / 1,300). At 10,000 documents per month: $20 hybrid, $87 GPT-4o-mini direct, $228 GPT-4o full.
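The arithmetic behind those figures, as a short sketch. The token counts and per-document costs are the ones reported in this post. The variable names are mine.

```python
VISION_INPUT_TOKENS = 51_000  # two page images plus prompts
HYBRID_INPUT_TOKENS = 1_300   # OCR text only
DOCS_PER_MONTH = 10_000

# Per-document costs from the comparison table.
COST_PER_DOC = {
    "hybrid": 0.002,
    "gpt-4o-mini direct": 0.0087,
    "gpt-4o full direct": 0.0228,
}

token_ratio = VISION_INPUT_TOKENS / HYBRID_INPUT_TOKENS
monthly = {name: round(cost * DOCS_PER_MONTH, 2) for name, cost in COST_PER_DOC.items()}

print(f"input token ratio: {token_ratio:.1f}x")
for name, dollars in monthly.items():
    print(f"{name}: ${dollars:.0f} per month")
```

Note that the per-document cost gap (roughly 4x against GPT-4o-mini) is smaller than the 39x token gap. My reading is that the hybrid figure also has to cover the OCR call and the output tokens, which do not shrink.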
When to use what
- Searchable text only (RAG, archive): OCR alone.
- Structured tables, values must be exact (invoices, contracts): hybrid.
- Charts, graphs, signatures, hand-drawn marks: vision LLM direct, since OCR cannot see what is not text.
- Sub-second latency at high volume: OCR alone.
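The list above can be written as a small routing function. This is only a sketch: the function name and flags are hypothetical, and the order of the checks is my assumption for documents that match more than one bullet.

```python
def choose_pipeline(has_non_text_content=False,
                    needs_subsecond_latency=False,
                    needs_structured_tables=False):
    """Return 'vision', 'ocr', or 'hybrid' for a document.

    The precedence of the checks is an assumption, not something the test measured.
    """
    if has_non_text_content:
        # Charts, signatures, hand-drawn marks: OCR cannot see what is not text.
        return "vision"
    if needs_subsecond_latency:
        # The LLM step took about 20 s in the test, so only OCR alone fits.
        return "ocr"
    if needs_structured_tables:
        # Exact values plus layout: OCR for fidelity, LLM for structure.
        return "hybrid"
    # Searchable text only (RAG, archive).
    return "ocr"


print(choose_pipeline(needs_structured_tables=True))
```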
Sources
- LlamaIndex ParseBench
- Umair Ali Khan, "How to Accurately Extract Everything from Documents Using AI"
Read the full guide with the annotated test document and complete pipeline code on ai-engine.net.
Top comments (2)
"Hybrid OCR + LLM beats pure vision-LLM" is a result more people need to internalize - throwing the whole PDF at GPT-4o vision feels elegant but it's expensive, slow, and hallucinates cell boundaries on dense tables. Deterministic OCR/layout extraction for structure + LLM only for the fuzzy reasoning (header inference, merged cells, normalization) is both cheaper and more accurate. The vision model is a scalpel, not a bulldozer.
This is a perfect instance of the broader principle: don't use the biggest model for the part a cheaper/deterministic tool does better - route by what each layer is actually good at. Same instinct behind Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) - deterministic where you can, LLM only where you must, which is why a full build stays ~$3 flat. Great empirical post - what's your accuracy delta vs pure GPT-4o vision, and is the OCR layer Tesseract or something newer? The hybrid being both cheaper AND more accurate is the detail that should kill the "just use vision" default.
For your 2 questions:
Accuracy delta: the gap shows up most on alphanumeric identifiers. On the 7 Mat.No codes in the test purchase order, pure GPT-4o-mini got 1 of 7 right and GPT-4o full also got 1 of 7. The hybrid got 7 of 7. Across the broader set of 14 critical values (codes, tolerances, dimensions, standard refs) the hybrid preserved all 14, while vision mangled several in ways that pass a glance: 12.700 became 12,700 (3 orders of magnitude off), 3658 mm became 356 mm, EN 755-2 became EN 755.2. The failure mode is consistent: language models have no prior for random strings, so they substitute plausible neighbors. OCR reads them literally, with no "understanding" to get in the way, which is exactly why it wins there.
OCR layer: not Tesseract. It's a managed cloud OCR API (the OCR Wizard endpoint in the post). Tesseract was actually the baseline it beats on noisy/faxed scans, where Tesseract's accuracy falls off a cliff.
The cost angle reinforces your point too: hybrid runs ~$0.002/doc vs ~$0.009 (mini) and ~$0.023 (full), because we send ~1.3k tokens of OCR text to the LLM instead of ~51k tokens of page images. Cheaper and more accurate, which is the part that should kill the "just use vision" default.