Have you ever found a perfect dataset for your object detection project, only to realize the ground truth is in the form of binary segmentation masks (black-and-white images) instead of YOLO bounding boxes (.txt files)?
If you are training a YOLO model (v5, v8, v10, v11, etc.), you need coordinates in the format:
<class_id> <x_center> <y_center> <width> <height> (normalized between 0 and 1).
Converting pixel-level masks into YOLO coordinates manually is a nightmare. Luckily, you can automate this using OpenCV and Python in just a few lines of code.
In this tutorial, we will walk through the exact steps and math required to build a robust conversion pipeline.
The Core Concept: Contours to Bounding Boxes
To convert a binary mask into a bounding box, we need to:
- Find the boundaries of the white pixels (foreground object).
- Compute the minimum enclosing rectangle around those boundaries.
- Normalize the pixel coordinates into YOLO's standard format.
Here is what the visual flow looks like:

(Derived from the ISIC skin lesion dataset: left is the mask region, right is the calculated bounding box).
Step 1: Extract the Contours using OpenCV
OpenCV provides a powerful function called cv2.findContours that detects boundaries of binary shapes.
import cv2
import numpy as np
# Load mask in grayscale
mask = cv2.imread('path_to_mask.png', cv2.IMREAD_GRAYSCALE)
# cv2.imread returns None instead of raising when the file cannot be read
if mask is None:
    raise FileNotFoundError("Could not read 'path_to_mask.png'")
# Find all external boundaries
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if not contours:
    print("No objects found in the mask!")
else:
    # Select the largest contour (assuming a single primary object)
    largest_contour = max(contours, key=cv2.contourArea)
Step 2: Calculate Bounding Box Coordinates
Once we have the contour points, we can compute two types of bounding boxes:
Option A: Standard Axis-Aligned Bounding Box
This is the standard box used by most YOLO models.
# x, y are top-left coordinates; w, h are width and height in pixels
x_min, y_min, w_pixel, h_pixel = cv2.boundingRect(largest_contour)
# Calculate center coordinates in pixel space
x_center = x_min + (w_pixel / 2.0)
y_center = y_min + (h_pixel / 2.0)
Option B: Rotated Bounding Box (Minimum Area)
If your object is tilted or elongated and you want oriented (rotated) bounding boxes, compute the minimum-area rectangle instead:
# Returns ((x_center, y_center), (width, height), angle_of_rotation)
rect = cv2.minAreaRect(largest_contour)
box_points = cv2.boxPoints(rect) # Coordinates of the 4 corners
Step 3: Normalize Coordinates to YOLO Format
YOLO labels must be normalized relative to the overall image dimensions so that they scale correctly regardless of image resolution.
$$x_{norm} = \frac{x_{center}}{\text{img\_width}}, \quad y_{norm} = \frac{y_{center}}{\text{img\_height}}$$
$$w_{norm} = \frac{w_{pixel}}{\text{img\_width}}, \quad h_{norm} = \frac{h_{pixel}}{\text{img\_height}}$$
Here is the helper function to calculate and save this:
def normalize_to_yolo(x_center, y_center, w_pixel, h_pixel, img_w, img_h):
    x_norm = x_center / img_w
    y_norm = y_center / img_h
    w_norm = w_pixel / img_w
    h_norm = h_pixel / img_h
    return x_norm, y_norm, w_norm, h_norm
# Normalize using the mask's real dimensions (mask.shape is (height, width))
img_h, img_w = mask.shape[:2]
x_n, y_n, w_n, h_n = normalize_to_yolo(x_center, y_center, w_pixel, h_pixel, img_w, img_h)
# Print or write to a YOLO label file (e.g. class 0)
print(f"0 {x_n:.6f} {y_n:.6f} {w_n:.6f} {h_n:.6f}")
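A label is only correct if the width and height passed in are the real dimensions of the image the mask belongs to. As a self-contained check of the arithmetic, here is a small sketch that goes straight from a top-left pixel box (the output of `cv2.boundingRect`) to a label line and rejects values outside the 0 to 1 range. The function name `box_to_yolo_line` is hypothetical, and the numbers are made up for illustration.

```python
def box_to_yolo_line(class_id, x_min, y_min, w_pixel, h_pixel, img_w, img_h):
    """Convert a top-left pixel box to a YOLO label line."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    x_norm = (x_min + w_pixel / 2.0) / img_w
    y_norm = (y_min + h_pixel / 2.0) / img_h
    w_norm = w_pixel / img_w
    h_norm = h_pixel / img_h
    values = (x_norm, y_norm, w_norm, h_norm)
    # Anything outside [0, 1] usually means the wrong image size was used
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError(f"box falls outside the image: {values}")
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# Worked example: a 64x240 box with top-left corner (96, 120) in a 640x480 image
# x_center = 96 + 32 = 128 -> 128 / 640 = 0.2
# y_center = 120 + 120 = 240 -> 240 / 480 = 0.5
line = box_to_yolo_line(0, 96, 120, 64, 240, img_w=640, img_h=480)
print(line)  # 0 0.200000 0.500000 0.100000 0.500000
```

To write a label file, save one such line per object into a `.txt` file that shares the image's base name.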
The Reverse: YOLO Bounding Boxes back to Masks
What if you want to reconstruct binary masks from your bounding boxes for validation or visualization? You can draw filled rectangles onto a black canvas of the original image dimensions.
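# Initialize a black canvas (grayscale) matching the original image size
img_height, img_width = mask.shape[:2]  # or the dimensions of the source image
reconstructed_mask = np.zeros((img_height, img_width), dtype=np.uint8)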
# Denormalize coordinates
x_min = int((x_n - w_n/2) * img_width)
y_min = int((y_n - h_n/2) * img_height)
x_max = int((x_n + w_n/2) * img_width)
y_max = int((y_n + h_n/2) * img_height)
# Draw a filled white rectangle on the canvas
cv2.rectangle(reconstructed_mask, (x_min, y_min), (x_max, y_max), 255, thickness=-1)
# Save mask
cv2.imwrite('reconstructed_mask.png', reconstructed_mask)
Streamlining the Workflow
While writing this manually is great for one-off tasks, managing it across hundreds of images, handling missing metadata, splitting datasets into train/test folders, and generating the data.yaml config file can take hours.
If you want a lightweight package that bundles all of this (including batch conversions, CSV/JSON metadata parsing, and visualizations), I've open-sourced a helper package called segment-toolkit that does it in a few commands.
How to use it:
- Install it via pip:
pip install segment-toolkit
- Convert a directory of masks to YOLO labels:
segment-toolkit mask-to-yolo \
--image-dir datasets/images/ \
--mask-dir datasets/masks/ \
--output-dir datasets/labels/
- Split into standard train/test structures for YOLO training:
segment-toolkit split \
--images datasets/images/ \
--labels datasets/labels/ \
--output final_dataset/ \
--ratio 0.8
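If you would rather script the split yourself, here is a rough standard-library sketch. It is not how segment-toolkit implements its split. The function name `split_dataset`, its parameters, and the `images/<subset>` and `labels/<subset>` folder layout are my own assumptions, so adjust them to whatever your training config expects.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_dataset(image_dir, label_dir, output_dir, ratio=0.8, seed=42):
    """Copy image/label pairs into train and test folders; return the counts."""
    image_dir, label_dir, output_dir = Path(image_dir), Path(label_dir), Path(output_dir)
    # Keep only images that have a matching label file (same stem, .txt)
    pairs = [
        (img, label_dir / f"{img.stem}.txt")
        for img in sorted(image_dir.iterdir())
        if img.is_file() and (label_dir / f"{img.stem}.txt").exists()
    ]
    random.Random(seed).shuffle(pairs)  # seeded for a reproducible split
    n_train = int(len(pairs) * ratio)
    subsets = {"train": pairs[:n_train], "test": pairs[n_train:]}
    for name, items in subsets.items():
        (output_dir / "images" / name).mkdir(parents=True, exist_ok=True)
        (output_dir / "labels" / name).mkdir(parents=True, exist_ok=True)
        for img, label in items:
            shutil.copy2(img, output_dir / "images" / name / img.name)
            shutil.copy2(label, output_dir / "labels" / name / label.name)
    return {name: len(items) for name, items in subsets.items()}


# Demo on a throwaway directory with ten dummy image/label pairs
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "labels").mkdir()
    for i in range(10):
        (root / "images" / f"img_{i}.png").write_bytes(b"")
        (root / "labels" / f"img_{i}.txt").write_text("0 0.5 0.5 0.2 0.2\n")
    counts = split_dataset(root / "images", root / "labels", root / "out", ratio=0.8)
print(counts)  # {'train': 8, 'test': 2}
```

Images without a matching label file are skipped rather than copied, which avoids training on unlabeled samples by accident.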
Whether you write your own script using the OpenCV math above or use the open-source toolkit, automating this step saves days of annotation work.
- Source Code: GitHub - mask-to-yolo-toolkit
- PyPI Package: segment-toolkit
How do you handle dataset conversions in your machine learning workflows? Let me know in the comments below!
The package had 422 downloads in its first week. Thanks for the support. If you run into issues or want a feature, open an issue on GitHub.