Jacob Nastaskin

Building FridgeChef: What I Learned Training a Custom Computer Vision Model with Roboflow

I spend too much time staring at my fridge trying to figure out what to make for dinner. So I built an app to solve it: take a photo of your fridge, detect what's inside, and get recipe suggestions. Along the way, I got hands-on experience with Roboflow's full computer vision pipeline, from zero-shot detection to training and iterating on a custom model.

This post covers what I built, what worked, what didn't, and what I learned about the gap between out-of-the-box models and custom-trained ones.


The App: FridgeChef

FridgeChef is a full-stack web application that:

  1. Accepts a photo of a fridge
  2. Runs computer vision to detect food items
  3. Lets the user review and correct detections
  4. Generates recipe suggestions based on the detected ingredients

The frontend is built with Next.js, React, TypeScript, and Tailwind CSS. The backend uses Python and FastAPI, chosen for ease of integration with Roboflow's inference and supervision packages.
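
To give a sense of the backend's shape, here's a minimal sketch of the two routes the frontend talks to. The paths and helper names (`detect_items`, `suggest_recipes`) are illustrative placeholders rather than the repo's actual code:

```python
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

def detect_items(image_bytes: bytes) -> list[str]:
    # Placeholder: the real app calls Roboflow inference here (shown later in this post)
    return []

def suggest_recipes(ingredients: list[str]) -> list[dict]:
    # Placeholder: the real app calls GPT-4o-mini here (shown later in this post)
    return []

@app.post("/detect")
async def detect(image: UploadFile = File(...)):
    image_bytes = await image.read()
    return {"ingredients": detect_items(image_bytes)}

@app.post("/recipes")
async def recipes(ingredients: list[str]):
    return {"recipes": suggest_recipes(ingredients)}
```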

FridgeChef is open source — check out the GitHub repo.

FridgeChef Demo


Starting with Zero-Shot Detection: YOLO-World

My first approach used YOLO-World, a zero-shot object detection model. Zero-shot means it can detect any object described by a text prompt without being trained specifically on that object. There's no data collection, no annotation, no training. You just give it a list of classes and it goes to work.

Roboflow's Inference and Supervision libraries make using YOLO-World straightforward:

```python
import cv2
import supervision as sv
from inference.models.yolo_world.yolo_world import YOLOWorld

# Load the image for annotation and the large YOLO-World checkpoint
image = cv2.imread("image.jpeg")
model = YOLOWorld(model_id="yolo_world/l")

# Zero-shot: the classes are just text prompts; no training on these categories
classes = ["person", "backpack", "dog", "eye", "nose", "ear", "tongue"]
results = model.infer("image.jpeg", text=classes, confidence=0.03)

detections = sv.Detections.from_inference(results[0])

# Draw bounding boxes and class labels with supervision
bounding_box_annotator = sv.BoundingBoxAnnotator()
label_annotator = sv.LabelAnnotator()
labels = [classes[class_id] for class_id in detections.class_id]

annotated_image = bounding_box_annotator.annotate(scene=image, detections=detections)
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections, labels=labels)
sv.plot_image(annotated_image)
```

For FridgeChef, I created 110 food-related classes: "apple", "avocado", "butter", "yogurt", "ham", and so on. This list could easily be expanded and refined ("apple" could be split into "fuji apple" and "granny smith apple", for example), but I kept things simple for the initial proof of concept.
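
Swapping in the food vocabulary is just a matter of changing the class list in the snippet above. A shortened continuation of that snippet (the real list has 110 entries, and `fridge.jpeg` is a stand-in path):

```python
# Continuation of the snippet above: same model, FridgeChef's food classes
food_classes = ["apple", "avocado", "butter", "yogurt", "ham"]  # ...plus roughly 105 more
results = model.infer("fridge.jpeg", text=food_classes, confidence=0.03)
detections = sv.Detections.from_inference(results[0])
```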

YOLO-World got me to a working demo fast. But when I tested it on actual fridge photos, it struggled. It confused colors for identities; anything orange-colored got labeled "orange" regardless of what it actually was. It over-detected, returning 30+ detections for a fridge with ~20 items. Confidence scores never exceeded 70%.

This is actually the exact problem Roboflow's platform exists to solve: generic models don't work well enough for specific use cases, so you train a custom model using their pipeline.


Training a Custom Model with Roboflow

Data Collection

Instead of collecting and labeling thousands of images from scratch, I started by forking the Fridge Detection dataset from Roboflow Universe, their open-source dataset community. This gave me 3,148 pre-annotated images with 30 food classes and bounding boxes already drawn and labeled.

This is one of the things that makes the Roboflow ecosystem powerful: you can bootstrap a custom model using community datasets and then supplement with your own data.
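
If you want a Universe dataset on your own machine, for local experiments or to inspect the annotations, the `roboflow` package can pull it down. A sketch with placeholder API key, workspace, project, and version identifiers:

```python
from roboflow import Roboflow

# Placeholder identifiers: use your own API key and your fork's workspace/project/version
rf = Roboflow(api_key="YOUR_ROBOFLOW_API_KEY")
project = rf.workspace("your-workspace").project("fridge-detection")
dataset = project.version(1).download("coco")  # downloads images and annotations locally
print(dataset.location)
```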

Training V1

The forked dataset came with a skewed train/valid/test split of 95/3/2. I noted this as something to fix in a future version but decided to proceed and establish a baseline first.

I trained the model using RF-DETR (Roboflow's own state-of-the-art object detection model) at the Nano size for faster iteration. Training ran on Roboflow's hosted infrastructure, no local GPU needed.

V1 metrics (on the validation set):

| Metric    | V1    |
| --------- | ----- |
| mAP@50    | 98.0% |
| Precision | 97.3% |
| Recall    | 98.3% |

These numbers look great on paper, but they're inflated. With 95% of images in training and only 103 images in validation, the model was essentially being tested on a tiny, easy subset. More on this shortly.

Integrating the trained model into FridgeChef was straightforward using the Roboflow Inference SDK:

```python
import os

from inference_sdk import InferenceHTTPClient

# Inside FridgeChef's detection service
self.client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key=os.getenv("ROBOFLOW_API_KEY", ""),
)
result = self.client.infer(image, model_id="fridge-detection-eelvd/1")
```

Training V2: Adding Data and Fixing the Split

For V2, I took photos of my own fridge and a few others, manually annotated them in Roboflow's annotation tool, and added them to the dataset. I also rebalanced the split to a proper 70/20/10.

V2 metrics:

| Metric    | V1    | V2    |
| --------- | ----- | ----- |
| mAP@50    | 98.0% | 92.6% |
| Precision | 97.3% | 92.9% |
| Recall    | 98.3% | 92.3% |

V2's metrics are lower, but they're more honest. By moving ~800 images from training into validation and test, the model trained on less data and was evaluated against a larger, more representative set. V1's 98% mAP was like getting an A+ on a test where you've already seen 95% of the questions; V2's 93% came from a more legitimate exam.

It's worth noting that some images V1 trained on ended up in V2's validation set after rebalancing. Since V2 was trained from V1's checkpoint, there's a minor data leakage concern, but transfer learning retains general feature representations, not memorized recall of specific images, so the practical effect on metrics is negligible.


Wiring It All Up

The app lets users select which model to use and compare results side by side. After detection, users can review and correct the ingredient list before requesting recipe suggestions.

Comparison Feature
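
The comparison itself is just the two inference paths from earlier run against the same image. A simplified sketch (the actual request handling lives in the repo; `fridge.jpeg` and the shortened class list are placeholders):

```python
import os

import supervision as sv
from inference.models.yolo_world.yolo_world import YOLOWorld
from inference_sdk import InferenceHTTPClient

image_path = "fridge.jpeg"  # placeholder path

# Path 1: zero-shot YOLO-World with the food vocabulary
yolo_world = YOLOWorld(model_id="yolo_world/l")
food_classes = ["apple", "avocado", "butter", "yogurt", "ham"]  # abbreviated
zero_shot = sv.Detections.from_inference(
    yolo_world.infer(image_path, text=food_classes, confidence=0.03)[0]
)

# Path 2: the custom RF-DETR model on Roboflow's hosted inference API
client = InferenceHTTPClient(
    api_url="https://detect.roboflow.com",
    api_key=os.getenv("ROBOFLOW_API_KEY", ""),
)
custom = sv.Detections.from_inference(
    client.infer(image_path, model_id="fridge-detection-eelvd/1")
)

print(f"YOLO-World: {len(zero_shot)} boxes, custom model: {len(custom)} boxes")
```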

The recipe generation step calls an LLM (GPT-4o-mini) with the detected ingredients and returns structured suggestions including prep time, difficulty, and step-by-step instructions.

Recipe List

Recipe Details
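
The prompt itself lives in the repo, but conceptually the recipe call is a single structured-output request. A minimal sketch, with a hypothetical `detected` list standing in for the reviewed ingredients:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
detected = ["apple", "yogurt", "butter", "ham"]  # hypothetical output of the review step

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Suggest recipes as JSON: a 'recipes' array where each entry has "
                       "name, prep_time_minutes, difficulty, and steps.",
        },
        {"role": "user", "content": f"Ingredients on hand: {', '.join(detected)}"},
    ],
)
recipes = json.loads(response.choices[0].message.content)["recipes"]
```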


Results: Testing on Unseen Images

I tested all three models on two new images not contained in any training, validation, or test set. The evaluation metrics were straightforward: what percentage of detections were correctly labeled, and what percentage of actual items in the image were detected at all.

| Model      | Annotation Accuracy | Detection Accuracy |
| ---------- | ------------------- | ------------------ |
| YOLO-World | 59.3%               | 60.4%              |
| Custom V1  | 38.6%               | 40.5%              |
| Custom V2  | 39.3%               | 22.7%              |
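
The two columns come from simple hand counts per image. A quick sketch of the arithmetic with made-up numbers (not the actual tallies):

```python
# Hypothetical counts for one test image
total_detections = 30      # boxes the model returned
correct_detections = 18    # boxes whose label matched the underlying item
items_in_image = 20        # ground-truth items visible in the photo
items_found = 12           # items covered by at least one detection

annotation_accuracy = correct_detections / total_detections  # were the labels right?
detection_accuracy = items_found / items_in_image            # did we find the items at all?
print(f"annotation: {annotation_accuracy:.1%}, detection: {detection_accuracy:.1%}")
```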

Analysis

YOLO-World was the strongest overall performer, finding more items and labeling them more accurately. This is expected: it was trained on a dataset orders of magnitude larger than the custom models', with a much broader vocabulary of classes. Its main weaknesses were precision-related: it over-detected and confused visually similar items, particularly anything with similar colors or shapes. YOLO-World also ran noticeably faster than the custom models, which rely on Roboflow's hosted inference API. For a production application, this latency difference would be an important consideration.

Custom V1 detected more items than V2 due to training on a larger portion of the dataset (~2,994 images vs ~2,200). However, its annotation accuracy was comparable to V2's, suggesting the additional training data improved recall without significantly improving the model's ability to correctly identify what it found. It also produced several detections with confidence above 70%, something neither YOLO-World nor V2 achieved.

Custom V2 was the most conservative, producing far fewer detections. With fewer training images, it's more cautious about what it flags.

A common theme: none of the models generated detections with confidence above 90%, and only Custom V1 produced any above 70%. This underscores that fridge food detection is an inherently difficult problem. Items are cluttered, partially occluded, and appear at irregular angles.

These results are based on a small sample and aren't statistically rigorous. A more thorough evaluation would require a larger, standardized test set. But they illustrate the key tradeoffs between zero-shot breadth and custom model specificity.


Lessons Learned

Food detection in fridges is a genuinely hard problem. Items obstruct each other, packages vary wildly, and even humans struggle to identify everything without moving things around. A video-based approach that scans the entire fridge would likely outperform single-image detection.

Dataset splits matter more than you think. The difference between V1's impressive-looking 98% mAP and V2's more honest 93% came entirely from how the data was split. If I were starting over, I'd set a proper split from day one and keep it consistent across versions for apples-to-apples comparison.

The real value of a platform like Roboflow is the iteration speed. Going from "I have some images" to "I have a deployed model I can call via API" took an afternoon. Universe for bootstrapping data, Annotate for labeling, hosted training, and one-click deployment together remove the infrastructure burden and let you focus on the actual problem: is my data good enough?

Classification nuance matters. Should the model detect each individual apple or a "group of apples"? Should it distinguish fuji from granny smith? What about a bowl of grapes — should these be considered one item or separate entities? These aren't technical questions; they're product questions that depend on the downstream use case. For recipe suggestions, you probably want "apples (3)" rather than three separate bounding boxes.

More data is the primary lever. The gap between YOLO-World and the custom models isn't a model architecture problem; it's a data problem. YOLO-World was trained on millions of images; the custom models had ~3,000. A useful next step would be scraping grocery catalog images, which would provide clean, well-lit product photos with metadata like brand and category. Even then, you'd still lack the variety of angles and lighting conditions seen at inference time, but it would be a significant improvement.


What I'd Build Next

  • Expand the dataset significantly — targeted data collection for the items and scenarios where the model currently fails
  • Video upload support — let users scan the full fridge rather than capturing a single frame
  • Manual annotation in-app — let users correct bounding boxes directly, feeding corrections back into the training pipeline (similar to what Roboflow's annotation tool already provides)
  • Grocery catalog data pipeline — scrape product images with metadata to build a richer, more structured dataset
  • Smart inventory approach — for the specific problem of "what's in my fridge," barcode scanning as items enter and leave would ultimately be more reliable than computer vision alone

Tech Stack

  • Frontend: Next.js, React, TypeScript, Tailwind CSS
  • Backend: Python, FastAPI
  • Computer Vision: Roboflow Inference, Supervision, RF-DETR, YOLO-World
  • Recipe Generation: OpenAI GPT-4o-mini
  • Model Training: Roboflow Train (hosted)
  • Dataset: Roboflow Universe (Fridge Detection, forked and extended)

FridgeChef is open source — check out the GitHub repo. Built with Roboflow.
