This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
PantryLens — a free, installable Progressive Web App that turns a photo of your fridge or pantry into a complete, ready-to-cook recipe in seconds.
The idea is simple: food waste is a real problem, and one of its biggest causes is not knowing what to cook with what you already have. PantryLens removes that friction entirely. You open the app, snap up to 3 photos of your ingredients, tap Generate Recipe, and watch a full recipe stream to your screen — live, token by token — before you even put your phone down.
Key features:
- 📷 Camera capture, file upload, or drag-and-drop (up to 3 images)
- ⚡ Recipe streams live to the screen as Gemma 4 generates it
- 📱 Installable PWA — works on iOS and Android home screens, no App Store needed
- 🔒 No account, no login, no data stored — your photos never leave the request cycle
- 🆓 Completely free to use
Demo
Live app: https://pantry-lens-one.vercel.app
Install on your mobile home screen: Open the web app in the browser and use the built-in "Add to Home Screen" feature
Code
klee1611/PantryLens
Turn your fridge photos into recipes — PWA powered by Google Gemma 4 via OpenRouter. Built for the Google Gemma 4 Hackathon on DEV.
PantryLens
Snap a photo of your fridge or pantry — get a recipe in seconds.
PantryLens is a progressive web app (PWA) that uses AI vision (Google Gemma 4) to identify ingredients from photos and stream a complete recipe directly to the screen, token by token.
How it works
- Capture — take a photo with your camera, upload from your gallery, or drag and drop (up to 3 images)
- Compress — the browser Canvas API resizes each image client-side to ≤1024 px before upload
- Analyze — a Next.js Edge Function proxies the images to OpenRouter's vision model
- Stream — the recipe streams back token-by-token, rendered progressively as Markdown
Tech stack
| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router, Edge Runtime) |
| UI | React 19, Tailwind CSS, react-markdown |
| AI | OpenRouter (google/gemma-4-26b-a4b-it) |
| Rate limiting | Upstash Redis (sliding window, 5 req/IP/hour) |
| Testing | Jest 29, Testing Library, |
The project is fully open source under the MIT license. The stack:
| Layer | Technology |
|---|---|
| Framework | Next.js 16 (App Router, Edge Runtime) |
| UI | React 19, Tailwind CSS, react-markdown |
| AI model | Gemma 4 via OpenRouter |
| Rate limiting | Upstash Redis (sliding window, 5 req/IP/hour) |
| Deploy | Vercel |
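To make the rate-limiting row concrete, here is a minimal in-memory sketch of a 5-requests-per-IP-per-hour limit. It is an illustration only: it uses a simple sliding-window log rather than Upstash's Redis-backed limiter, and the class name `SlidingWindowLimiter` is hypothetical, not taken from the repository.

```typescript
// Illustrative sliding-window log limiter (NOT the Upstash implementation).
// It remembers the timestamp of each accepted request per key (e.g. client IP).
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 5, windowMs: number = 60 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the key is over its limit.
  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(key, recent);
    return true;
  }
}
```

In-memory state like this is not shared between function instances, which is presumably why the app keeps its counters in Redis instead.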
How I Used Gemma 4
The model: google/gemma-4-26b-a4b-it
I used Gemma 4 26B A4B, the instruction-tuned Mixture-of-Experts (MoE) model with 26B total parameters and 4B active parameters per forward pass. I accessed it through OpenRouter using the model ID google/gemma-4-26b-a4b-it. It met three requirements:
- Vision capability — PantryLens is fundamentally a vision task. The model needs to look at a photo of a messy fridge shelf and accurately identify eggs, half-used condiments, wilting vegetables, and leftover containers. Gemma 4's multimodal architecture handles this reliably, even for cluttered or poorly lit photos.
- Instruction following — generating a well-structured recipe means strictly following a Markdown template with headings, bullet lists, and numbered steps. The instruction-tuned variant follows these formatting constraints consistently, which is critical for the progressive rendering to look correct as it streams.
- Speed vs. quality balance — the MoE architecture means the model activates only 4B parameters per token while drawing on the full 26B parameter knowledge base. In practice, this produces recipe quality close to the denser models at latency that works well for streaming UX.
How the other Gemma 4 variants compare for this use case:
Why reject the E2B (effective 2B) model? While E2B is a strong choice for air-gapped, on-device execution, its small parameter count limits the broad "world knowledge" that recipe generation relies on.
Why reject the dense 31B model? A dense model uses all of its parameters for every token, so each token costs more compute than in the MoE model. For a streaming UI, that tends to mean a longer Time-to-First-Token (TTFT) and slower output, which raises the risk of a gateway timeout at the proxy.
The Optimal Route (26B-A4B): The "A4B" suffix refers to the roughly 4B parameters that are active per token, not to 4-bit quantization. This MoE variant hit the sweet spot: it keeps strong reasoning for complex visual extraction, while the small active parameter count keeps per-token latency low.
The architecture: an opaque streaming proxy
The most interesting engineering decision was how the model is integrated. The frontend never touches the API key or sees the system prompt. The browser compresses the images and POSTs them to a Next.js API route; the route adds the API key and system prompt, forwards the request to OpenRouter, and streams the response straight back to the client.
The Next.js API route runs as a Vercel Function on the Edge Runtime for quick cold starts and low latency close to the user. The 300-second timeout is acceptable, since a recipe is unlikely to keep streaming for that long.
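As a rough illustration of the proxy step, the sketch below builds the upstream request that such a route might send. It assumes OpenRouter's OpenAI-compatible chat completions endpoint and message format. The helper name `buildOpenRouterRequest`, the user instruction text, and passing images as data URLs are all assumptions, not the repository's actual code.

```typescript
// Hypothetical helper: assembles the server-side request to OpenRouter.
// The API key and system prompt are added here, so the browser never sees them.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface UpstreamRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";
const MODEL_ID = "google/gemma-4-26b-a4b-it";
const MAX_IMAGES = 3;

function buildOpenRouterRequest(
  imageDataUrls: string[],
  apiKey: string,
  systemPrompt: string,
): UpstreamRequest {
  if (imageDataUrls.length === 0 || imageDataUrls.length > MAX_IMAGES) {
    throw new Error(`Expected 1 to ${MAX_IMAGES} images`);
  }
  // One text part (placeholder wording) followed by one image part per photo.
  const userContent: ContentPart[] = [
    { type: "text", text: "Create a recipe from the ingredients in these photos." },
    ...imageDataUrls.map((url): ContentPart => ({ type: "image_url", image_url: { url } })),
  ];
  return {
    url: OPENROUTER_URL,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: MODEL_ID,
        stream: true, // ask for a token-by-token SSE response
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userContent },
        ],
      }),
    },
  };
}
```

A route handler could then call `fetch(url, init)` and return the upstream `body` in a new `Response` with a `text/event-stream` content type, so the stream passes through without being buffered.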
Client-side image compression was another non-obvious requirement. Vercel Functions have a request payload limit of 4.5 MB. Raw iPhone photos are 4–12 MB each, so three of them would exceed the limit before a single byte of AI processing begins. The solution: compress each image on-device using the browser's Canvas API (resize to ≤1024px, JPEG at 75% quality) before the upload. A 10 MB photo becomes ~150 KB. This happens invisibly in under a second on any modern device, including older iPhones.
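The resize step comes down to a small piece of arithmetic. The sketch below assumes that "≤1024px" applies to the longest side and that the aspect ratio is preserved. The function name `fitWithin` is hypothetical.

```typescript
interface Size {
  width: number;
  height: number;
}

// Scales a size down so its longest side is at most maxSide, keeping the
// aspect ratio. Images that are already small enough are left unchanged.
function fitWithin(original: Size, maxSide: number = 1024): Size {
  const longest = Math.max(original.width, original.height);
  if (longest <= maxSide) {
    return { width: original.width, height: original.height };
  }
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(original.width * scale)),
    height: Math.max(1, Math.round(original.height * scale)),
  };
}
```

In the browser, the resulting size would be applied to a canvas's `width` and `height`, the photo drawn with `drawImage`, and the result exported with `canvas.toBlob(callback, "image/jpeg", 0.75)`.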
Server-Sent Events (SSE)
Waiting for a 26-billion-parameter model to generate a complete, structured Markdown response takes anywhere from 10 to 25 seconds. If the proxy buffered the whole response before replying, a request that slow would risk an HTTP 504 Gateway Timeout in a standard Vercel serverless environment.
To avoid this, the proxy has to pass data through without buffering it. I decided to go with Server-Sent Events (SSE):
- The Handshake: the PWA makes a standard HTTPS POST request to the Vercel Edge proxy.
- The Proxy Stream: the Edge proxy forwards the payload to OpenRouter with stream: true.
- Token-by-Token Egress: as the Gemma 4 model yields tokens (typically a word or a fragment of one), the Edge function immediately pipes those chunks back to the client over the still-open HTTP connection, using the text/event-stream content type.
This keeps the Time-to-First-Token (TTFT) low: the first words appear as soon as the model produces them. The Vercel function never buffers the full response in memory, so the response starts long before any gateway timeout would apply. On the frontend, the PWA reads these SSE chunks via the native ReadableStream API and feeds the accumulated text into react-markdown for a live, progressive render.
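On the client side, the stream arrives in arbitrary chunks, so a line can be split across two reads. Here is a minimal sketch of an incremental parser. It assumes the OpenAI-style SSE format that OpenRouter emits (`data: {json}` lines with the text in `choices[0].delta.content`, ending with `data: [DONE]`). The name `createSseTextParser` is hypothetical.

```typescript
// Hypothetical incremental parser: feed it decoded text chunks and it returns
// whatever new recipe text those chunks contained.
function createSseTextParser() {
  let buffer = ""; // holds an incomplete trailing line between chunks
  let done = false;

  return {
    feed(chunk: string): string {
      if (done) return "";
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? ""; // the last piece may be a partial line
      let text = "";
      for (const rawLine of lines) {
        const line = rawLine.trim();
        // Skip blank separators and ":" comment lines used as keep-alives.
        if (!line.startsWith("data:")) continue;
        const payload = line.slice(5).trim();
        if (payload === "[DONE]") {
          done = true;
          break;
        }
        try {
          const parsed = JSON.parse(payload);
          const delta = parsed?.choices?.[0]?.delta?.content;
          if (typeof delta === "string") text += delta;
        } catch {
          // Ignore a malformed event rather than abort the whole stream.
        }
      }
      return text;
    },
    isDone(): boolean {
      return done;
    },
  };
}
```

In the app, the chunks would come from `response.body.getReader()` decoded with a `TextDecoder`, and each returned string would be appended to the Markdown state that react-markdown renders.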
The system prompt
Getting Gemma 4 to output clean, consistently structured Markdown for streaming required careful prompt engineering. The key lessons:
- Explicit blank-line rules: without being told explicitly that every ### heading must have a blank line before and after it, the model occasionally merged headings and body text onto the same line, which broke the Markdown renderer.
- Numbered list enforcement: instructions would sometimes collapse into a single paragraph unless the prompt explicitly stated "each step must be on its own line — never merge steps into a paragraph."
- Pantry staples assumption: by default the model sometimes refused to generate a recipe if it couldn't see salt or oil in the photo. Telling it to assume common pantry staples are always on hand (even if not visible) fixed this.
- Single-response constraint: without explicit instruction to never ask follow-up questions, the model occasionally responded with clarifying questions ("Are these all the ingredients?") instead of generating the recipe immediately.
The final prompt includes a section of explicit formatting rules before the template:
FORMATTING RULES — follow exactly:
- The recipe title uses ## (two hashes). It must be alone on its own line.
- Each section header uses ### (three hashes). It must be alone on its own line.
- Every heading must have one blank line before it AND one blank line after it.
- The instructions section is a numbered list. Each step is on its own line. Never merge steps into a paragraph.
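Rules like these are easy to check mechanically, for example in a test that runs over saved model outputs. The following is only a sketch of such a check and not part of PantryLens. The function name `findHeadingSpacingIssues` is hypothetical, and it only covers the blank-line rule for ## and ### headings.

```typescript
// Returns the 1-based line numbers of ## / ### headings that are not
// surrounded by blank lines (the start and end of the text count as blank).
function findHeadingSpacingIssues(markdown: string): number[] {
  const lines = markdown.split("\n");
  const issues: number[] = [];
  lines.forEach((line, i) => {
    if (!/^#{2,3} \S/.test(line)) return; // not a level-2 or level-3 heading
    const prevOk = i === 0 || (lines[i - 1] ?? "").trim() === "";
    const nextOk = i === lines.length - 1 || (lines[i + 1] ?? "").trim() === "";
    if (!prevOk || !nextOk) issues.push(i + 1);
  });
  return issues;
}
```

An empty result means every heading in the output follows the spacing rule.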
Wrapping Up
The architecture here — Vercel Function as opaque AI proxy, client-side compression, streaming passthrough — is a reusable pattern for any vision AI application that needs to:
- Keep API credentials out of the frontend
- Handle large image payloads within serverless limits
- Deliver a live "typing" UX without server timeouts via SSE
PantryLens is the simplest possible version of this pattern, which makes the code easy to read and adapt.
