How to create your own dataset for Mask R-CNN models in the TensorFlow Object Detection API

Introduction

Training Mask R-CNN with the TensorFlow Object Detection API requires more than images and class labels. Because Mask R-CNN performs instance segmentation, your dataset must include per-object masks in addition to class information and bounding boxes.

Know What the Model Expects

A usable custom dataset for Mask R-CNN needs:

  • images
  • class labels
  • one instance mask per object
  • bounding boxes that match the masks
  • a label map that assigns integer ids to classes

If you only have image-level labels or plain object-detection boxes, that is not enough for a segmentation model.

Collect and Annotate the Data

Your images should represent the actual conditions in which the model will run:

  • real lighting
  • real backgrounds
  • realistic object sizes
  • enough variation in pose and occlusion

For each object instance, create a segmentation annotation. Many annotation tools can export polygons or masks. Internally, what matters is that you can produce a binary mask for each object instance.
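If your annotation tool exports polygons rather than masks, one way to get a binary mask is to rasterize each polygon yourself. The following is a minimal NumPy-only sketch that applies the even-odd rule to pixel centers. `polygon_to_mask` is a hypothetical helper name, and the sketch assumes a single simple polygon given as (x, y) vertices in pixel coordinates. An imaging library's polygon fill would likely be faster on large images.

```python
import numpy as np


def polygon_to_mask(polygon, height, width):
    """Rasterize one polygon [(x, y), ...] into a binary uint8 mask.

    A pixel is set when its center lies inside the polygon (even-odd rule).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of each pixel center
    py = ys + 0.5  # y coordinate of each pixel center
    inside = np.zeros((height, width), dtype=bool)

    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        # True where the edge spans the row of the pixel center.
        crosses = (y0 > py) != (y1 > py)
        # x position at which the edge intersects that row.
        x_at = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        # Each crossing to the right of the pixel flips inside/outside.
        inside ^= crosses & (px < x_at)

    return inside.astype(np.uint8)


# A square polygon covering the four central pixels of a 4x4 image.
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
mask = polygon_to_mask(square, height=4, width=4)
print(mask)
```

Testing pixel centers is one convention among several; tools differ on how they treat pixels that a polygon edge only partly covers, so masks produced this way may differ by a pixel at the boundary from those exported by an annotation tool.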

Derive a Bounding Box from a Mask

Even though the final target is segmentation, the pipeline still needs boxes. A simple NumPy helper can compute the box from a binary mask:

```python
import numpy as np


def bbox_from_mask(mask: np.ndarray):
    """Return (xmin, ymin, xmax, ymax) in pixel indices; max values are inclusive."""
    rows, cols = np.nonzero(mask > 0)
    if rows.size == 0:
        raise ValueError("mask is empty")
    ymin, ymax = rows.min(), rows.max()
    xmin, xmax = cols.min(), cols.max()
    # Cast NumPy integers to plain ints so the result prints and serializes cleanly.
    return int(xmin), int(ymin), int(xmax), int(ymax)


mask = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
], dtype=np.uint8)

print(bbox_from_mask(mask))  # (1, 1, 2, 2)
```

That is the core geometry you need for each instance.

Create the Label Map

The Object Detection API uses a label map file to define class ids:

```text
item {
  id: 1
  name: "cat"
}

item {
  id: 2
  name: "dog"
}
```

Start the ids at 1, because id 0 is reserved for the background class, and keep them stable. Changing class ids midway through a project creates hard-to-debug training and evaluation errors.
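One way to keep ids from drifting is to generate both the label map text and the name-to-id lookup used by your conversion script from a single ordered list of class names. The sketch below assumes exactly that; `build_label_map`, `CLASS_NAMES`, and `name_to_id` are hypothetical names chosen for illustration.

```python
def build_label_map(class_names):
    """Return label map text with ids assigned 1..N in list order."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(f'item {{\n  id: {class_id}\n  name: "{name}"\n}}\n')
    return "\n".join(items)


# Single source of truth for class order; append new classes at the end
# so that existing ids never change.
CLASS_NAMES = ["cat", "dog"]

label_map_text = build_label_map(CLASS_NAMES)
name_to_id = {name: i for i, name in enumerate(CLASS_NAMES, start=1)}

print(label_map_text)
print(name_to_id)
```

Writing `label_map_text` to the label map file and importing `name_to_id` in the conversion script means the two cannot disagree.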

Build TFRecord Examples

The training pipeline typically consumes TFRecord files. For each image, store:

  • encoded image bytes
  • image width and height
  • normalized bounding boxes
  • class ids
  • instance masks

The exact writer code can be lengthy, but the important design point is that one training example may contain multiple objects, each with its own box and mask. Your conversion script should validate that every mask and bounding box aligns with the same image dimensions before writing the record.
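The validation step described above can be sketched without any TensorFlow code. The helper below checks that each mask matches the image dimensions and collects normalized box coordinates for all instances in one image. `instance_fields` and the short dictionary keys are hypothetical names for this sketch; in the Object Detection API's TFRecord format they would correspond to feature keys such as `image/object/bbox/xmin` and `image/object/class/label`, with each mask stored separately (for example as PNG bytes under `image/object/mask`). Adding 1 to the maximum index before normalizing is a design choice made here so that the box encloses the last mask pixel.

```python
import numpy as np


def instance_fields(masks, class_ids, height, width):
    """Validate the instances of one image and return normalized box lists."""
    if len(masks) != len(class_ids):
        raise ValueError("need exactly one class id per mask")

    fields = {"xmin": [], "ymin": [], "xmax": [], "ymax": [], "label": []}
    for mask, class_id in zip(masks, class_ids):
        if mask.shape != (height, width):
            raise ValueError(
                f"mask shape {mask.shape} does not match image {(height, width)}"
            )
        rows, cols = np.nonzero(mask)
        if rows.size == 0:
            raise ValueError("mask is empty")
        # Max indices are inclusive, so +1 gives the far edge of the last pixel.
        fields["xmin"].append(float(cols.min()) / width)
        fields["xmax"].append(float(cols.max() + 1) / width)
        fields["ymin"].append(float(rows.min()) / height)
        fields["ymax"].append(float(rows.max() + 1) / height)
        fields["label"].append(int(class_id))
    return fields


# One 4x4 image containing a single instance of class id 1.
example_mask = np.zeros((4, 4), dtype=np.uint8)
example_mask[1:3, 1:3] = 1
fields = instance_fields([example_mask], [1], height=4, width=4)
print(fields)
```

Each list has one entry per object, which mirrors the point that a single training example can carry several boxes and masks.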

Configure the Pipeline

Once the data exists, update the pipeline config for:

  • label map path
  • train TFRecord path
  • eval TFRecord path
  • number of classes
  • fine-tune checkpoint

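The model config must match the dataset. If the label map says three classes and the pipeline config says two, training will fail or behave incorrectly. For Mask R-CNN, the input readers must also be told to load the masks (`load_instance_masks: true`, with `mask_type: PNG_MASKS` when the masks are stored as PNG bytes).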
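A cheap guard against this mismatch is to compare the two files before launching training. The sketch below does so with regular expressions rather than the API's own protobuf parsing, which is an assumption made to keep it dependency-free. The helper names and the file paths inside the sample config fragment are hypothetical.

```python
import re


def count_label_map_items(label_map_text):
    """Count the `item { ... }` entries in label map text."""
    return len(re.findall(r"\bitem\s*\{", label_map_text))


def read_num_classes(pipeline_text):
    """Return the first num_classes value found in pipeline config text."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", pipeline_text)
    if match is None:
        raise ValueError("num_classes not found in pipeline config")
    return int(match.group(1))


def check_consistency(label_map_text, pipeline_text):
    """Raise ValueError if the label map and pipeline config disagree."""
    n_items = count_label_map_items(label_map_text)
    n_classes = read_num_classes(pipeline_text)
    if n_items != n_classes:
        raise ValueError(
            f"label map has {n_items} classes but num_classes is {n_classes}"
        )
    return n_classes


LABEL_MAP_TEXT = """
item { id: 1 name: "cat" }
item { id: 2 name: "dog" }
"""

# Abbreviated fragment of a pipeline config; paths are placeholders.
PIPELINE_TEXT = """
model {
  faster_rcnn {
    num_classes: 2
  }
}
train_input_reader {
  label_map_path: "data/label_map.pbtxt"
  load_instance_masks: true
  mask_type: PNG_MASKS
  tf_record_input_reader {
    input_path: "data/train.record"
  }
}
"""

print(check_consistency(LABEL_MAP_TEXT, PIPELINE_TEXT))
```

In a real project you would read both texts from the files referenced in your training command instead of using inline strings.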

Split Train and Validation Sets Properly

Do not evaluate on the same images you used for training. Create a clean split:

  • training set
  • validation or evaluation set

Keep the class distribution reasonably balanced. If one class appears only a handful of times in evaluation, the metrics will not be very informative.
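A simple way to get a clean split is to divide by image id, so that every object in an image lands on the same side. The following is a minimal sketch using a seeded shuffle; `split_by_image`, the 20% evaluation fraction, and the example ids are assumptions for illustration. It does not balance classes, so check the per-class counts in each split afterwards.

```python
import random


def split_by_image(image_ids, eval_fraction=0.2, seed=0):
    """Return (train_ids, eval_ids) with no image shared between them."""
    ids = sorted(set(image_ids))  # sort first so the result is reproducible
    random.Random(seed).shuffle(ids)
    n_eval = max(1, round(len(ids) * eval_fraction))
    return ids[n_eval:], ids[:n_eval]


image_ids = [f"img_{i:03d}" for i in range(10)]
train_ids, eval_ids = split_by_image(image_ids)

print(len(train_ids), "training images")
print(len(eval_ids), "evaluation images")
```

Fixing the seed keeps the split identical across runs, which matters when you regenerate the TFRecord files and want evaluation numbers to stay comparable.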

Common Pitfalls

  • Preparing only bounding boxes and forgetting that Mask R-CNN also needs instance masks.
  • Using masks whose dimensions do not exactly match the source image dimensions.
  • Letting class ids drift between label map, conversion script, and pipeline config.
  • Training on a tiny or overly clean dataset and expecting good real-world segmentation.
  • Skipping validation of the generated TFRecord files before starting long training runs.

Summary

  • Mask R-CNN datasets need per-instance masks, not just boxes and labels.
  • Derive or validate bounding boxes so they match the masks and source image.
  • Keep the label map and pipeline config consistent with the dataset.
  • Write train and evaluation TFRecord files carefully and verify them early.
  • Good segmentation performance depends as much on annotation quality and data diversity as on the model choice.
