How to create your own dataset for using Mask R-CNN models from the TensorFlow Object Detection API?
Introduction
Training Mask R-CNN with the TensorFlow Object Detection API requires more than images and class labels. Because Mask R-CNN performs instance segmentation, your dataset must include per-object masks in addition to class information and bounding boxes.
Know What the Model Expects
A usable custom dataset for Mask R-CNN needs:
- images
- class labels
- one instance mask per object
- bounding boxes that match the masks
- a label map that assigns integer ids to classes
If you only have image-level labels or plain object-detection boxes, that is not enough for a segmentation model.
Collect and Annotate the Data
Your images should represent the actual conditions in which the model will run:
- real lighting
- real backgrounds
- realistic object sizes
- enough variation in pose and occlusion
For each object instance, create a segmentation annotation. Many annotation tools can export polygons or masks. Internally, what matters is that you can produce a binary mask for each object instance.
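If your tool exports polygons rather than masks, you have to rasterize them yourself. The following is a minimal sketch that uses only NumPy and the even-odd rule, tested at pixel centers. The function name `polygon_to_mask` and the `[(x, y), ...]` vertex format are assumptions made for illustration; check the export format of your annotation tool before relying on it.

```python
import numpy as np


def polygon_to_mask(polygon, height, width):
    """Rasterize one polygon into a (height, width) uint8 binary mask.

    polygon: list of (x, y) vertices in pixel coordinates (assumed format).
    A pixel is set when its center lies inside the polygon (even-odd rule).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of each pixel center
    py = ys + 0.5  # y coordinate of each pixel center
    inside = np.zeros((height, width), dtype=bool)
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        # Does a ray cast to the right from the pixel center cross this edge?
        crosses = (y0 <= py) != (y1 <= py)
        x_at_y = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (px < x_at_y)
    return inside.astype(np.uint8)
```

This loop is easy to read but slow for large images with many polygons; a rasterization routine from an imaging library is usually the better choice in a real conversion script.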
Derive a Bounding Box from a Mask
Even though the final target is segmentation, the pipeline still needs boxes. A simple NumPy helper can compute the box from a binary mask:
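A minimal sketch of such a helper, assuming a 2-D mask whose nonzero pixels belong to the object. The name `mask_to_bbox` and the exclusive-maximum convention are choices made here for illustration:

```python
import numpy as np


def mask_to_bbox(mask):
    """Return (ymin, xmin, ymax, xmax) in pixels for a 2-D binary mask.

    ymax and xmax are exclusive, so the box covers mask[ymin:ymax, xmin:xmax].
    Returns None when the mask contains no object pixels.
    """
    mask = np.asarray(mask)
    if mask.ndim != 2:
        raise ValueError("expected a 2-D mask")
    rows = np.flatnonzero(mask.any(axis=1))  # rows that contain the object
    cols = np.flatnonzero(mask.any(axis=0))  # columns that contain the object
    if rows.size == 0:
        return None
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1
```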
That is the core geometry you need for each instance.
Create the Label Map
The Object Detection API uses a label map file to define class ids:
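A label map is a text file made of `item` entries, each with an `id` and a `name`. Ids start at 1 because 0 is reserved for the background class. The sketch below generates that text from a list of class names; the helper name `build_label_map` and the class names are hypothetical.

```python
def build_label_map(class_names):
    """Return label map text with ids assigned 1..N in list order."""
    entries = []
    for class_id, name in enumerate(class_names, start=1):
        entries.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(entries)


# Example with two made-up classes; write the result to a .pbtxt file.
print(build_label_map(["scratch", "dent"]))
```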
Keep the ids stable. Changing class ids midway through a project creates hard-to-debug training and evaluation errors.
Build TFRecord Examples
The training pipeline typically consumes TFRecord files. For each image, store:
- encoded image bytes
- image width and height
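- bounding boxes normalized to the range 0 to 1 by image width and height
- class ids that match the label map
- one instance mask per object, the same size as the image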
The exact writer code can be lengthy, but the important design point is that one training example may contain multiple objects, each with its own box and mask. Your conversion script should validate that every mask and bounding box aligns with the same image dimensions before writing the record.
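The validation step can be sketched as follows. This is not a full TFRecord writer: it only checks each mask against the image size, derives normalized boxes, and collects the values in a plain dictionary. The function name `build_example_fields` is hypothetical, and the feature keys follow the naming used by the API's dataset tools as far as I know, so verify them against the version you use. Encoding the image bytes and the masks, and wrapping the values into a `tf.train.Example`, are left out.

```python
import numpy as np


def build_example_fields(image_height, image_width, masks, class_ids):
    """Validate instance masks and derive normalized boxes for one image.

    masks: list of 2-D binary arrays, one per object instance.
    class_ids: list of integer ids from the label map, one per mask.
    Returns a dict of plain Python values keyed by feature name.
    """
    if len(masks) != len(class_ids):
        raise ValueError("need exactly one class id per mask")
    fields = {
        "image/height": image_height,
        "image/width": image_width,
        "image/object/bbox/ymin": [],
        "image/object/bbox/xmin": [],
        "image/object/bbox/ymax": [],
        "image/object/bbox/xmax": [],
        "image/object/class/label": [],
    }
    for mask, class_id in zip(masks, class_ids):
        mask = np.asarray(mask)
        if mask.shape != (image_height, image_width):
            raise ValueError("mask shape %r does not match image" % (mask.shape,))
        if class_id < 1:
            raise ValueError("class ids must start at 1")
        rows = np.flatnonzero(mask.any(axis=1))
        cols = np.flatnonzero(mask.any(axis=0))
        if rows.size == 0:
            raise ValueError("empty mask")
        # Normalize pixel coordinates to [0, 1]; max edges are exclusive.
        fields["image/object/bbox/ymin"].append(float(rows[0]) / image_height)
        fields["image/object/bbox/xmin"].append(float(cols[0]) / image_width)
        fields["image/object/bbox/ymax"].append(float(rows[-1] + 1) / image_height)
        fields["image/object/bbox/xmax"].append(float(cols[-1] + 1) / image_width)
        fields["image/object/class/label"].append(int(class_id))
    return fields
```

Raising on the first mismatch is a deliberate choice here: a conversion run that stops early is cheaper than a long training run on silently misaligned masks.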
Configure the Pipeline
Once the data exists, update the pipeline config for:
- label map path
- train TFRecord path
- eval TFRecord path
- number of classes
- fine-tune checkpoint
The model config must match the dataset. If the label map says three classes and the pipeline config says two, training will fail or behave incorrectly.
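A quick consistency check before training can catch this kind of mismatch. The sketch below compares the number of `item` entries in a label map with the `num_classes` value in a pipeline config using regular expressions. The function names are hypothetical, and the regex approach is a simplification that assumes a single `num_classes` field; parsing the config with the API's own protobuf definitions would be more robust.

```python
import re


def count_label_map_items(label_map_text):
    """Count the `item { ... }` entries in label map text."""
    return len(re.findall(r"\bitem\s*\{", label_map_text))


def read_num_classes(pipeline_config_text):
    """Return the first num_classes value found in the pipeline config text."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", pipeline_config_text)
    if match is None:
        raise ValueError("num_classes not found in pipeline config")
    return int(match.group(1))


def check_class_count(label_map_text, pipeline_config_text):
    """Raise if the label map and the pipeline config disagree on class count."""
    n_items = count_label_map_items(label_map_text)
    n_classes = read_num_classes(pipeline_config_text)
    if n_items != n_classes:
        raise ValueError(
            "label map has %d classes but num_classes is %d" % (n_items, n_classes)
        )
    return n_classes
```

For Mask R-CNN it is also worth confirming that the input reader sections of the config enable mask loading (in the sample configs I am aware of, `load_instance_masks: true` together with a `mask_type` setting), since boxes alone will otherwise be read.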
Split Train and Validation Sets Properly
Do not evaluate on the same images you used for training. Create a clean split:
- training set
- validation or evaluation set
Keep the class distribution reasonably balanced in both sets. If a class appears only a handful of times in the evaluation set, the metrics for that class will be too noisy to be informative.
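A reproducible split can be sketched as below. It splits by image rather than by object instance, so all objects from one image land in the same set. The function name `split_dataset`, the 20% evaluation fraction, and the seed are illustrative assumptions, and this simple version does not stratify by class.

```python
import random


def split_dataset(image_ids, eval_fraction=0.2, seed=42):
    """Split image ids into (train_ids, eval_ids) with a fixed random seed."""
    ids = sorted(image_ids)  # sort first so the result does not depend on input order
    if len(ids) < 2:
        raise ValueError("need at least two images to split")
    random.Random(seed).shuffle(ids)
    n_eval = max(1, int(round(len(ids) * eval_fraction)))
    n_eval = min(n_eval, len(ids) - 1)  # always leave at least one training image
    return ids[n_eval:], ids[:n_eval]
```

After splitting, count the instances per class in each set; if a class is nearly absent from the evaluation set, adjust the split or collect more data for that class.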
Common Pitfalls
- Preparing only bounding boxes and forgetting that Mask R-CNN also needs instance masks.
- Using masks whose dimensions do not exactly match the source image dimensions.
- Letting class ids drift between label map, conversion script, and pipeline config.
- Training on a tiny or overly clean dataset and expecting good real-world segmentation.
- Skipping validation of the generated TFRecord files before starting long training runs.
Summary
- Mask R-CNN datasets need per-instance masks, not just boxes and labels.
- Derive or validate bounding boxes so they match the masks and source image.
- Keep the label map and pipeline config consistent with the dataset.
- Write train and evaluation TFRecord files carefully and verify them early.
- Good segmentation performance depends as much on annotation quality and data diversity as on the model choice.

