Last month, I showed you how my V2 beat $200 enterprise RAG systems with hybrid search and reranking. The response was incredible but one comment s...
Great timing. I just spent my evening building an image analyzer with AWS Rekognition and Lambda. It is interesting to see how you tackled the 'giving code eyes' problem with Cloudflare and Llama instead.
The pivot from CLIP to text descriptions for RAG is a smart move for accuracy.
What I love most: You added images and you kept costs very low for what it can do!
The section about "CLIP failing" is the most valuable part here for me.
We usually only see the polished wins, not the dead ends.
I just battled some S3 event triggers and encoding bugs myself tonight. Debugging these integration edges is where the real learning happens. I am not used to coding, since I come from a system integration background rather than a programming one, and it was hard to get everything running the way it was supposed to.
Great work @dannwaneri
thanks ali. really appreciate you following from chrome tabs to this.
the rekognition + lambda approach is solid - aws has the accuracy edge on pure OCR for sure. curious though: are you planning to make those analyzed images searchable later? that's where i hit the cost wall (rekognition analysis + bedrock embeddings + opensearch was pushing $150/month for what i needed).
the clip → text description pivot was frustrating (wasted a weekend on it) but yeah, for RAG use cases descriptions work better than visual embeddings anyway
how's rekognition handling complex layouts? receipts, forms, diagrams with mixed text/graphics? that's where llama 4's multimodal understanding surprised me. it gets context, not just character extraction
keep building 🔨
That $150/month metric is a huge red flag for me. Thanks for the heads-up. I am strictly optimizing for Free Tier right now, so "OpenSearch" is out of the question.
If I make them searchable later, I would probably start small by just dumping the JSON labels into DynamoDB for basic filtering before looking at vector databases.
Regarding complex layouts: I haven't stress-tested that yet. Today was purely detect_labels (identifying objects like 'Laptop', 'Chair') just to prove the concept and see if I could get the pipeline running.
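For the curious, this is roughly the shape I am picturing: an S3-triggered Lambda that runs detect_labels and dumps the result into DynamoDB. It is an untested TypeScript sketch, not what I actually deployed tonight, and the table and attribute names are made up.

```ts
import { RekognitionClient, DetectLabelsCommand } from "@aws-sdk/client-rekognition";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const rekognition = new RekognitionClient({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// S3 "object created" event -> detect_labels -> store labels for basic filtering
export const handler = async (event: any) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    // S3 event keys arrive URL-encoded, which I suspect was behind my encoding bug
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    const { Labels = [] } = await rekognition.send(
      new DetectLabelsCommand({
        Image: { S3Object: { Bucket: bucket, Name: key } },
        MaxLabels: 10,
        MinConfidence: 80,
      })
    );

    await ddb.send(
      new PutCommand({
        TableName: "image-labels", // placeholder table name
        Item: {
          imageKey: key,                     // partition key
          labels: Labels.map((l) => l.Name), // e.g. ["Laptop", "Chair"]
          analyzedAt: new Date().toISOString(),
        },
      })
    );
  }
};
```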
For receipts and forms, I suspect you are right: standard Rekognition/Textract might give me the text, but Llama likely wins on understanding the "semantic glue" between the fields without custom logic.
That was beyond the scope of what I tried out today, but definitely worth considering for a future project.
I will keep the cost wall in mind moving forward. Saving money on projects is critical for me.
the free tier optimization constraint is real. i hit the same wall which is why i went all-in on cloudflare.
the dynamodb filtering approach makes sense for basic queries but yeah, the moment you need "find images with similar layouts" or "dashboards showing performance metrics" you're back to needing embeddings.
the rekognition → textract path gives you accurate OCR but you're right about the semantic glue. llama 4 understanding "this is a receipt header vs line item vs total" without custom parsing logic is the unlock
if you ever want to test multimodal search without leaving free tier constraints, the stack i documented is basically: upload image → llama 4 scout (free on workers ai) → bge embeddings (free) → vectorize (free tier: 10M vectors). only costs when you scale past free limits
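rough sketch of that ingest path as a worker (not the exact code from the article; binding names are whatever you set in wrangler.toml, and the scout input/output shape here is from memory, so double-check the model card):

```ts
export interface Env {
  AI: any;        // Workers AI binding ([ai] binding = "AI" in wrangler.toml)
  VECTORIZE: any; // Vectorize binding, pointed at a 768-dim index for bge-base
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const image = new Uint8Array(await request.arrayBuffer());

    // 1. image -> text description via llama 4 scout on workers ai
    //    (input shape assumed from the older vision models - verify against the model card)
    const vision = await env.AI.run("@cf/meta/llama-4-scout-17b-16e-instruct", {
      prompt: "Describe this image, including any visible text.",
      image: [...image],
      max_tokens: 512,
    });
    const description: string = vision.response;

    // 2. description -> embedding with bge (free on workers ai)
    const emb = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [description] });

    // 3. embedding -> vectorize, so images land in the same index as text chunks
    await env.VECTORIZE.upsert([
      { id: crypto.randomUUID(), values: emb.data[0], metadata: { description } },
    ]);

    return Response.json({ description });
  },
};
```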
the $150/month metric came from real estimates when i priced out rekognition + bedrock + opensearch for a client project. aws pricing forced me to find alternatives
keep optimizing for free tier. that's how you learn what's actually essential vs nice-to-have 🔨
That specific stack breakdown (Cloudflare + Vectorize with 10M vectors) is a goldmine.
I am currently deep in the AWS ecosystem for my certification journey, but ignoring that kind of Free Tier value would be foolish. Thanks for validating the 'semantic glue' theory regarding Llama vs. Textract.
I will definitely bookmark your article for when I hit the limits of my current JSON/DynamoDB approach. Real-world client estimates like your $150 example are the best reality check.
Consultants would charge real money, and a lot of it, for this kind of intel! Thank you VERY much, Daniel
appreciate that ali.
yeah the free tier values on workers ai are wild. cloudflare is basically subsidizing the learning curve right now. 10M vectors in vectorize before you pay anything is unreal compared to pinecone/qdrant pricing.
the aws cert journey makes total sense.
when you do hit the dynamodb filtering limits (and you will, everyone does around 5-10k images), the migration path is clean. your rekognition labels → llama 4 descriptions is mostly just swapping the vision model, embeddings flow the same way.
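the query side barely changes either: embed the search text with the same bge model and hit vectorize instead of a dynamodb filter. rough sketch, same assumed bindings as the ingest sketch in my earlier reply:

```ts
// minimal query-side sketch, assuming the same AI + VECTORIZE bindings as the ingest worker
export async function searchImages(env: { AI: any; VECTORIZE: any }, query: string) {
  // embed the search text with the same bge model used at ingest time
  const emb = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] });

  // nearest-neighbour lookup in vectorize; metadata carries the stored description
  const { matches } = await env.VECTORIZE.query(emb.data[0], { topK: 5 });
  return matches.map((m: any) => ({ id: m.id, score: m.score }));
}
```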
good luck with the cert! aws knowledge transfers well, you're just learning the cheaper edge compute alternative alongside it 🔨
Good to know about that 5k-10k threshold. It is always better to know where the ceiling is before you hit your head on it.
I appreciate the insights today. It is rare to get this level of practical architectural advice in a comment section.
I just sent you a connection request on LinkedIn.
Would be great to keep up with your work there.
just accepted. appreciate the connect.
these architectural ceiling conversations are exactly why i write these articles. better to share the 5-10k threshold publicly than have everyone rediscover it through painful experience.
looking forward to seeing what you build. hit me up on linkedin if you run into cloudflare workers questions during your aws cert journey 🚀
This is a really solid evolution of the idea. What I like most is that you didn't bolt "vision" on as a separate system, you treated images as first-class knowledge that belongs in the same index as text. That's the part most multimodal RAG posts skip.
The description-over-CLIP takeaway feels very real too. For most real workflows people don't actually want pixel similarity, they want meaning plus text. Being able to search "TypeError undefined map" and have a screenshot come back because OCR caught it is way more useful than "this image looks similar".
Also appreciate the honesty around tradeoffs. Calling out that dedicated OCR might be slightly better on clean text but worse overall for search is the kind of detail that tells me this was actually tested, not just assembled from docs.
The cost angle matters a lot as well. Most teams I’ve seen never ship multimodal search not because it’s hard, but because once you add up vision, embeddings, storage, reranking, it quietly turns into a finance discussion. Keeping everything inside Workers and a single index is a big win.
One thing I’m curious about as this grows is how you see metadata and access control evolving. Screenshots and receipts get sensitive fast, and once search gets good people rely on it more than they expect.
Overall though this feels very practical. Not “look what AI can do”, but “here’s how you stop losing information your team already has”. Really nice work.
Huge +1 on the 5k-10k threshold. It is always better to know where the ceiling is before trying to optimize without checking.
Also, liked your dashboard.
congrats.