Jacob Nastaskin

Building FridgeChef: What I Learned Training a Custom Computer Vision Model with Roboflow

I spend too much time staring at my fridge trying to figure out what to make for dinner. So I built an app to solve it: take a photo of your fridge, detect what's inside, and get recipe suggestions. Along the way, I got hands-on experience with Roboflow's full computer vision pipeline, from zero-shot detection to training and iterating on a custom model.

This post covers what I built, what worked, what didn't, and what I learned about the gap between out-of-the-box models and custom-trained ones.


The App: FridgeChef

FridgeChef is a full-stack web application that:

  1. Accepts a photo of a fridge
  2. Runs computer vision to detect food items
  3. Lets the user review and correct detections
  4. Generates recipe suggestions based on the detected ingredients

The frontend is built with Next.js, React, TypeScript, and Tailwind CSS. The backend uses Python and FastAPI, chosen for ease of integration with Roboflow's inference and supervision packages.
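
To give a sense of the backend's shape, here's a minimal sketch of the two routes the frontend talks to. The paths and helper names (`detect_items`, `suggest_recipes`) are illustrative placeholders rather than the repo's actual code:

```python
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def detect_items(image_bytes: bytes) -> list[str]:
    # Placeholder: the real app calls Roboflow inference here (shown later in this post)
    return []

def suggest_recipes(ingredients: list[str]) -> list[dict]:
    # Placeholder: the real app calls GPT-4o-mini here (shown later in this post)
    return []

@app.post("/detect")
async def detect(image: UploadFile = File(...)):
    image_bytes = await image.read()
    return {"ingredients": detect_items(image_bytes)}

@app.post("/recipes")
async def recipes(ingredients: list[str]):
    return {"recipes": suggest_recipes(ingredients)}
```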

FridgeChef is open source — check out the GitHub repo.

FridgeChef Demo


Starting with Zero-Shot Detection: YOLO-World

My first approach used YOLO-World, a zero-shot object detection model. Zero-shot means it can detect any object described by a text prompt without being trained specifically on that object. There's no data collection, no annotation, no training. You just give it a list of classes and it goes to work.

Roboflow's Inference and Supervision libraries make using YOLO-World straightforward:

```python
import cv2
import supervision as sv
from inference.models.yolo_world.yolo_world import YOLOWorld

# Load the image for annotation and the large YOLO-World checkpoint
image = cv2.imread("image.jpeg")
model = YOLOWorld(model_id="yolo_world/l")

# Zero-shot: the classes are just text prompts; no training on these categories
classes = ["person", "backpack", "dog", "eye", "nose", "ear", "tongue"]
results = model.infer("image.jpeg", text=classes, confidence=0.03)

detections = sv.Detections.from_inference(results[0])

# Draw bounding boxes and class labels with supervision
bounding_box_annotator = sv.BoundingBoxAnnotator()
label_annotator = sv.LabelAnnotator()
labels = [classes[class_id] for class_id in detections.class_id]

annotated_image = bounding_box_annotator.annotate(scene=image, detections=detections)
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections, labels=labels)
sv.plot_image(annotated_image)
```

For FridgeChef, I created 110 food-related classes: "apple", "avocado", "butter", "yogurt", "ham", and so on. This list could easily be expanded and refined ("apple" could be split into "fuji apple" and "granny smith apple", for example), but I kept things simple for the initial proof of concept.
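
Swapping in the food vocabulary is just a matter of changing the class list in the snippet above. A shortened continuation of that snippet (the real list has 110 entries, and `fridge.jpeg` is a stand-in path):

```python
# Continuation of the snippet above: same model, FridgeChef's food classes
food_classes = ["apple", "avocado", "butter", "yogurt", "ham"]  # ...plus roughly 105 more
results = model.infer("fridge.jpeg", text=food_classes, confidence=0.03)
detections = sv.Detections.from_inference(results[0])
```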

YOLO-World got me to a working demo fast. But when I tested it on actual fridge photos, it struggled. It confused colors for identities; anything orange-colored got labeled "orange" regardless of what it actually was. It over-detected, returning 30+ detections for a fridge with ~20 items. Confidence scores never exceeded 70%.

This is actually the exact problem Roboflow's platform exists to solve: generic models don't work well enough for specific use cases, so you train a custom model using their pipeline.


Training a Custom Model with Roboflow

Data Collection

Instead of collecting and labeling thousands of images from scratch, I started by forking the Fridge Detection dataset from Roboflow Universe, their open-source dataset community. This gave me 3,148 pre-annotated images with 30 food classes and bounding boxes already drawn and labeled.

This is one of the things that makes the Roboflow ecosystem powerful: you can bootstrap a custom model using community datasets and then supplement with your own data.
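
If you want a Universe dataset on your own machine, for local experiments or to inspect the annotations, the `roboflow` package can pull it down. A sketch with placeholder API key, workspace, project, and version identifiers:

```python
from roboflow import Roboflow

# Placeholder identifiers: use your own API key and your fork's workspace/project/version
rf = Roboflow(api_key="YOUR_ROBOFLOW_API_KEY")
project = rf.workspace("your-workspace").project("fridge-detection")
dataset = project.version(1).download("coco")  # downloads images and annotations locally
print(dataset.location)
```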

Training V1

The forked dataset came with a skewed train/valid/test split of 95/3/2. I noted this as something to fix in a future version but decided to proceed and establish a baseline first.

I trained the model using RF-DETR (Roboflow's own state-of-the-art object detection model) at the Nano size for faster iteration. Training ran on Roboflow's hosted infrastructure, no local GPU needed.

V1 metrics (on the validation set):

| Metric    | V1    |
| --------- | ----- |
| mAP@50    | 98.0% |
| Precision | 97.3% |
| Recall    | 98.3% |

These numbers look great on paper, but they're inflated. With 95% of images in training and only 103 images in validation, the model was essentially being tested on a tiny, easy subset. More on this shortly.

Integrating the trained model into FridgeChef was straightforward using the Roboflow Inference SDK:

```python
import os

from inference_sdk import InferenceHTTPClient

# Inside FridgeChef's detection service
self.client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key=os.getenv("ROBOFLOW_API_KEY", ""),
)
result = self.client.infer(image, model_id="fridge-detection-eelvd/1")
```

Training V2: Adding Data and Fixing the Split

For V2, I took photos of my own fridge and a few others, manually annotated them in Roboflow's annotation tool, and added them to the dataset. I also rebalanced the split to a proper 70/20/10.

V2 metrics:

| Metric    | V1    | V2    |
| --------- | ----- | ----- |
| mAP@50    | 98.0% | 92.6% |
| Precision | 97.3% | 92.9% |
| Recall    | 98.3% | 92.3% |

V2's metrics are lower, but they're more honest. By moving ~800 images from training into validation and test, the model trained on less data and was evaluated against a larger, more representative set. V1's 98% mAP was like getting an A+ on a test where you've already seen 95% of the questions; V2's 93% came from a more legitimate exam.

It's worth noting that some images V1 trained on ended up in V2's validation set after rebalancing. Since V2 was trained from V1's checkpoint, there's a minor data leakage concern, but transfer learning retains general feature representations, not memorized recall of specific images, so the practical effect on metrics is negligible.


Wiring It All Up

The app lets users select which model to use and compare results side by side. After detection, users can review and correct the ingredient list before requesting recipe suggestions.

Comparison Feature
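
The comparison itself is just the two inference paths from earlier run against the same image. A simplified sketch (the actual request handling lives in the repo; `fridge.jpeg` and the shortened class list are placeholders):

```python
import os

import supervision as sv
from inference.models.yolo_world.yolo_world import YOLOWorld
from inference_sdk import InferenceHTTPClient

image_path = "fridge.jpeg"  # placeholder path

# Path 1: zero-shot YOLO-World with the food vocabulary
yolo_world = YOLOWorld(model_id="yolo_world/l")
food_classes = ["apple", "avocado", "butter", "yogurt", "ham"]  # abbreviated
zero_shot = sv.Detections.from_inference(
    yolo_world.infer(image_path, text=food_classes, confidence=0.03)[0]
)

# Path 2: the custom RF-DETR model on Roboflow's hosted inference API
client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key=os.getenv("ROBOFLOW_API_KEY", ""),
)
custom = sv.Detections.from_inference(
    client.infer(image_path, model_id="fridge-detection-eelvd/1")
)

print(f"YOLO-World: {len(zero_shot)} boxes, custom model: {len(custom)} boxes")
```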

The recipe generation step calls an LLM (GPT-4o-mini) with the detected ingredients and returns structured suggestions including prep time, difficulty, and step-by-step instructions.

Recipe List

Recipe Details
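
The prompt itself lives in the repo, but conceptually the recipe call is a single structured-output request. A minimal sketch, with a hypothetical `detected` list standing in for the reviewed ingredients:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
detected = ["apple", "yogurt", "butter", "ham"]  # hypothetical output of the review step

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Suggest recipes as JSON: a 'recipes' array where each entry has "
                       "name, prep_time_minutes, difficulty, and steps.",
        },
        {"role": "user", "content": f"Ingredients on hand: {', '.join(detected)}"},
    ],
)
recipes = json.loads(response.choices[0].message.content)["recipes"]
```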


Results: Testing on Unseen Images

I tested all three models on two new images not contained in any training, validation, or test set. The evaluation metrics were straightforward: what percentage of detections were correctly labeled, and what percentage of actual items in the image were detected at all.

| Model      | Annotation Accuracy | Detection Accuracy |
| ---------- | ------------------- | ------------------ |
| YOLO-World | 59.3%               | 60.4%              |
| Custom V1  | 38.6%               | 40.5%              |
| Custom V2  | 39.3%               | 22.7%              |
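
The two columns come from simple hand counts per image. A quick sketch of the arithmetic with made-up numbers (not the actual tallies):

```python
# Hypothetical counts for one test image
total_detections = 30      # boxes the model returned
correct_detections = 18    # boxes whose label matched the underlying item
items_in_image = 20        # ground-truth items visible in the photo
items_found = 12           # items covered by at least one detection

annotation_accuracy = correct_detections / total_detections  # were the labels right?
detection_accuracy = items_found / items_in_image            # did we find the items at all?
print(f"annotation: {annotation_accuracy:.1%}, detection: {detection_accuracy:.1%}")
```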

Analysis

YOLO-World was the strongest overall performer, finding more items and labeling them more accurately. This is expected: it was trained on a dataset orders of magnitude larger than the custom models', with a much broader vocabulary of classes. Its main weaknesses were precision-related: it over-detected and confused visually similar items, particularly anything with similar colors or shapes. YOLO-World also ran noticeably faster than the custom models, which rely on Roboflow's hosted inference API. For a production application, this latency difference would be an important consideration.

Custom V1 detected more items than V2 due to training on a larger portion of the dataset (~2,994 images vs ~2,200). However, its annotation accuracy was comparable to V2's, suggesting the additional training data improved recall without significantly improving the model's ability to correctly identify what it found. It also produced several detections with confidence above 70%, something neither YOLO-World nor V2 achieved.

Custom V2 was the most conservative, producing far fewer detections. With fewer training images, it's more cautious about what it flags.

A common theme: none of the models generated detections with confidence above 90%, and only Custom V1 produced any above 70%. This underscores that fridge food detection is an inherently difficult problem. Items are cluttered, partially occluded, and appear at irregular angles.

These results are based on a small sample and aren't statistically rigorous. A more thorough evaluation would require a larger, standardized test set. But they illustrate the key tradeoffs between zero-shot breadth and custom model specificity.


Lessons Learned

Food detection in fridges is a genuinely hard problem. Items obstruct each other, packages vary wildly, and even humans struggle to identify everything without moving things around. A video-based approach that scans the entire fridge would likely outperform single-image detection.

Dataset splits matter more than you think. The difference between V1's impressive-looking 98% mAP and V2's more honest 93% came entirely from how the data was split. If I were starting over, I'd set a proper split from day one and keep it consistent across versions for apples-to-apples comparison.

The real value of a platform like Roboflow is the iteration speed. Going from "I have some images" to "I have a deployed model I can call via API" took an afternoon. Universe for bootstrapping data, Annotate for labeling, hosted training, and one-click deployment together remove the infrastructure burden and let you focus on the actual problem: is my data good enough?

Classification nuance matters. Should the model detect each individual apple or a "group of apples"? Should it distinguish fuji from granny smith? What about a bowl of grapes — should these be considered one item or separate entities? These aren't technical questions; they're product questions that depend on the downstream use case. For recipe suggestions, you probably want "apples (3)" rather than three separate bounding boxes.

More data is the primary lever. The gap between YOLO-World and the custom models isn't a model architecture problem; it's a data problem. YOLO-World was trained on millions of images; the custom models had ~3,000. A useful next step would be scraping grocery catalog images, which would provide clean, well-lit product photos with metadata like brand and category. Even then, you'd still lack the variety of angles and lighting conditions seen at inference time, but it would be a significant improvement.


What I'd Build Next

  • Expand the dataset significantly — targeted data collection for the items and scenarios where the model currently fails
  • Video upload support — let users scan the full fridge rather than capturing a single frame
  • Manual annotation in-app — let users correct bounding boxes directly, feeding corrections back into the training pipeline (similar to what Roboflow's annotation tool already provides)
  • Grocery catalog data pipeline — scrape product images with metadata to build a richer, more structured dataset
  • Smart inventory approach — for the specific problem of "what's in my fridge," barcode scanning as items enter and leave would ultimately be more reliable than computer vision alone

Tech Stack

  • Frontend: Next.js, React, TypeScript, Tailwind CSS
  • Backend: Python, FastAPI
  • Computer Vision: Roboflow Inference, Supervision, RF-DETR, YOLO-World
  • Recipe Generation: OpenAI GPT-4o-mini
  • Model Training: Roboflow Train (hosted)
  • Dataset: Roboflow Universe (Fridge Detection, forked and extended)

FridgeChef is open source — check out the GitHub repo. Built with Roboflow.
