DEV Community

Andrew Judd
Andrew Judd

Posted on • Originally published at judd.dev

Less Than a Penny Per Document

People hear "vision model" and assume expensive.

Fair. I assumed the same thing.

The Bill

Under a penny per document.

GPT-4o charges about $2.50 per million input tokens right now. A document photo is maybe 1,000-2,000 tokens for the image plus a few hundred for the prompt and response. That's $0.003 to $0.008.

Less than one cent.

What Nobody Compares

Textract is cheap per page too. About $1.50 per thousand pages. Per-unit, it's actually cheaper than the vision API.

But per-unit API cost is a terrible comparison.

Here's what the Textract approach actually cost:

My entire Saturday. Pipeline, pre-processing, regex parsers, manual review queue. At any reasonable hourly rate, that's thousands of dollars. Before a single document is correctly processed.

70% manual review. Half the time faster to just retype the thing than hunt for all the errors.

And the vision API approach? Two hours on Sunday morning. Write the integration, test a few documents, tweak the prompt. Done. 5-10% flagged for review, and those are quick fixes. A digit, an abbreviation. Not a full retype.

Numbers

500 documents:

Textract Vision API
API cost ~$0.75 ~$2.50
Dev time ~40 hrs @ $100/hr = $4,000 ~2 hrs @ $100/hr = $200
Manual review ~350 docs @ 5 min = 29 hrs ~35 docs @ 2 min = 1.2 hrs
Maintenance (3 months) ~20 hrs ~0 hrs
Total ~$6,000+ ~$320

Higher per API call. Lower in every other way.

When To Use Which

I'm not going to pretend the vision API is always right. Traditional OCR still makes sense when you've got millions of identical documents from the same template, same layout, same fields in the same spots. Template matching works great there. No need to pay for a model that understands context when there's nothing to understand.

Same thing if you can't make external API calls. Air-gapped networks, edge devices, strict data residency. Tesseract locally and that's that.

And compliance. Your OCR provider might already have the certs you need. Your vision API provider might not.

But handwritten documents? Mixed layouts? Documents where you need structure and not just characters? Anything where time-to-value matters? Vision API every time.

The Quick Test

Look at one of your documents.

Could you hand it to a random person and they'd get it in a few seconds?

If yes - vision model. Less than a penny.

If a template could extract the data - traditional OCR is cheaper at volume.

If a human would struggle with it too - neither approach saves you. That's a data quality problem, not a tool problem.


If you're sitting on a stack of documents that need digitizing and you've either already been down the OCR road or you've been putting it off because you know how it goes - this is worth looking at.

Less than a penny per document. That's what I'm actually paying.

Top comments (2)

Collapse
 
harjjotsinghh profile image
Harjot Singh

Pricing in cost-per-unit (per document) instead of vague monthly totals is the discipline that separates an AI feature with healthy margins from one that quietly loses money at scale. Once you know it's sub-penny per doc, you can actually reason about it: what's the markup, where's the break-even, can this survive 100x volume. Most teams never compute their per-unit AI cost and get blindsided when usage scales the bill nonlinearly. The per-document number is the one that tells you if you have a business or a liability.

Getting to sub-penny is almost always the compounding stack, not one trick - right-sized model for the extraction (a doc task rarely needs frontier reasoning), caching repeated structure, and batching. That per-unit-cost obsession is exactly what I design around in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - knowing and bounding the per-build cost (~$3 flat) is what makes flat pricing safe to offer. Great writeup, the unit-economics framing is the rare valuable one. What got you under a penny - was it mostly model choice for the extraction, or caching/batching the repeated work? Curious which lever did the heavy lifting.

Collapse
 
awjudd profile image
Andrew Judd

Great question. It was the compounding stack, like you described.
Model choice was the biggest single lever. GPT-4o mini handles extraction well. You don't need frontier reasoning to pull structured fields out of a recipe card. That alone dropped the per-document cost dramatically compared to running everything through GPT-4o or Claude Sonnet.
Prompt engineering did the heavy lifting after that. Getting the prompt tight enough that the model returns clean structured output on the first pass. No retries, no correction loops, no back-and-forth. Every retry is another API call at the same token cost. A prompt that works reliably the first time cuts your effective cost in half or more compared to one that needs a second pass 50% of the time.
Caching the extraction results closed the gap. Once you've pulled the structured data from a document, you store it. If that document comes through again, you skip the API call entirely. You're not re-prompting for something you already have.
So: model choice got me from dollars to cents, prompt engineering and caching got me from cents to sub-penny.