Create MS COCO style dataset

Introduction

Creating an MS COCO-style dataset means more than putting images into folders. You need image metadata, category definitions, and object annotations stored in the JSON shape that COCO-based tooling expects. Once the schema is correct, the real work becomes annotation quality and ID consistency.

The Core COCO Sections

A minimal COCO annotation file usually contains these top-level sections:

'images'
'annotations'
'categories'

Each section has a distinct role.

images stores file-level metadata such as image ID, filename, width, and height. categories defines class IDs and names. annotations connects an object instance to an image and category, often with bounding boxes, area, and iscrowd.
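The relationship between the three sections can be sketched as a set of lookup tables. This is a minimal illustration, not part of any COCO library; the helper name index_coco and the inline sample data are assumptions made for the example.

```python
from collections import defaultdict


def index_coco(coco):
    """Build lookup tables from a loaded COCO-style dict (hypothetical helper)."""
    images_by_id = {img["id"]: img for img in coco["images"]}
    names_by_category_id = {cat["id"]: cat["name"] for cat in coco["categories"]}
    annotations_by_image = defaultdict(list)
    for ann in coco["annotations"]:
        # Each annotation points back to exactly one image via image_id.
        annotations_by_image[ann["image_id"]].append(ann)
    return images_by_id, names_by_category_id, dict(annotations_by_image)


# Sample data assumed for illustration only.
coco = {
    "images": [{"id": 1, "file_name": "img1.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 80, 60], "area": 4800, "iscrowd": 0}
    ],
    "categories": [{"id": 1, "name": "person", "supercategory": "human"}],
}

images_by_id, names_by_category_id, annotations_by_image = index_coco(coco)
for ann in annotations_by_image[1]:
    image = images_by_id[ann["image_id"]]
    print(image["file_name"], names_by_category_id[ann["category_id"]], ann["bbox"])
```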

A Minimal Example

A small valid COCO-style file looks like this:

json

{
  "images": [
    {"id": 1, "file_name": "img1.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 120, 80, 60],
      "area": 4800,
      "iscrowd": 0
    }
  ],
  "categories": [
    {"id": 1, "name": "person", "supercategory": "human"}
  ]
}

The bbox format is [x, y, width, height], where x and y are the pixel coordinates of the box's top-left corner. Many mistakes come from accidentally storing corner coordinates [x1, y1, x2, y2] instead.
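If your source annotations use corner coordinates, the conversion is a small subtraction. The following is a minimal sketch; the function names corners_to_coco and coco_to_corners are made up for this example.

```python
def corners_to_coco(box):
    """Convert [x1, y1, x2, y2] corners to a COCO [x, y, width, height] box."""
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate or inverted box: {box}")
    return [x1, y1, x2 - x1, y2 - y1]


def coco_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box back to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


coco_box = corners_to_coco([100, 120, 180, 180])
print(coco_box)
```

Rejecting inverted boxes at conversion time is a design choice in this sketch: it surfaces swapped coordinates early instead of letting negative widths reach the training pipeline.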

Build the Dataset in a Stable Workflow

A practical workflow is:

collect and name image files consistently
define your category list early and keep IDs stable
annotate objects with boxes, masks, or keypoints as needed
export or generate COCO JSON files for train, validation, and test splits

The most important operational rule is to freeze IDs early. Changing category IDs or reusing image IDs later can create subtle training bugs that are hard to detect.
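One way to keep category IDs frozen is to treat the category list as append-only. The sketch below assumes categories are never deleted from the list (otherwise the highest ID could be reused); extend_categories is a hypothetical helper, and supercategory is left out of new entries for brevity.

```python
def extend_categories(categories, new_names):
    """Return a new category list; existing entries and their IDs are untouched."""
    known_names = {cat["name"] for cat in categories}
    next_id = max((cat["id"] for cat in categories), default=0) + 1
    result = list(categories)
    for name in new_names:
        if name in known_names:
            continue  # already defined, keep its original ID
        result.append({"id": next_id, "name": name})
        known_names.add(name)
        next_id += 1
    return result


categories = [{"id": 1, "name": "person", "supercategory": "human"}]
categories = extend_categories(categories, ["dog", "person", "cat"])
print(categories)
```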

Programmatically Generate a Small COCO File

If you are converting annotations from another source, generating COCO JSON in Python is straightforward.

python

import json

# Minimal COCO-style structure: one image, one box annotation, one category.
coco = {
    "images": [
        {"id": 1, "file_name": "img1.jpg", "width": 640, "height": 480}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100, 120, 80, 60],  # [x, y, width, height]
            "area": 80 * 60,  # width * height for a plain box
            "iscrowd": 0,
        }
    ],
    "categories": [
        {"id": 1, "name": "person", "supercategory": "human"}
    ],
}

# Write the annotation file as UTF-8 JSON.
with open("annotations.json", "w", encoding="utf-8") as f:
    json.dump(coco, f, indent=2)

This is enough to create a starter annotation file for many object-detection pipelines.

Plan for Splits and Additional Annotation Types

Most real datasets need separate train, validation, and test files. Decide that split policy early. If near-duplicate images land in both train and validation, your evaluation becomes misleading.
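A deterministic, group-aware split is one way to keep near-duplicates together. The sketch below hashes a group key so that every image sharing the key lands in the same split. The grouping rule (the filename prefix before the first underscore), the fractions, and the function names are all assumptions for illustration; use whatever key identifies related images in your data.

```python
import hashlib


def assign_split(group_key, val_fraction=0.1, test_fraction=0.1):
    """Map a group key to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


def split_coco(coco, group_key_fn, **fractions):
    """Partition a COCO-style dict into per-split dicts that share categories."""
    splits = {
        name: {"images": [], "annotations": [], "categories": list(coco["categories"])}
        for name in ("train", "val", "test")
    }
    split_of_image = {}
    for img in coco["images"]:
        name = assign_split(group_key_fn(img), **fractions)
        split_of_image[img["id"]] = name
        splits[name]["images"].append(img)
    for ann in coco["annotations"]:
        # Annotations follow their image into the same split.
        splits[split_of_image[ann["image_id"]]]["annotations"].append(ann)
    return splits


def scene_prefix(img):
    """Hypothetical grouping rule: 'sceneA_001.jpg' belongs to group 'sceneA'."""
    return img["file_name"].split("_")[0]
```

Because the assignment depends only on the group key, adding new images later does not move existing images between splits, although the actual split sizes only approximate the requested fractions.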

You should also decide which COCO features you actually need:

bounding boxes only
segmentation masks
keypoints
captions or other extra metadata

Not every project needs the full COCO feature set. The right dataset is the smallest correct one for the model and task.
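If you do need segmentation masks, a COCO polygon is stored as a flat list of alternating x and y coordinates, and the bbox and area fields can be derived from it. The following is a rough sketch for a single simple polygon; the helper names are illustrative, and the area uses the shoelace formula rather than a rasterized mask, so it may differ slightly from values produced by mask-based tools.

```python
def polygon_area(flat):
    """Area of a simple polygon given as [x1, y1, x2, y2, ...] (shoelace formula)."""
    xs, ys = flat[0::2], flat[1::2]
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2


def polygon_bbox(flat):
    """Tight COCO [x, y, width, height] box around a flat polygon."""
    xs, ys = flat[0::2], flat[1::2]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


# A rectangle written as a polygon, matching the box used earlier.
polygon = [100, 120, 180, 120, 180, 180, 100, 180]
annotation = {
    "id": 1,
    "image_id": 1,
    "category_id": 1,
    "segmentation": [polygon],  # list of polygons; one object may have several parts
    "bbox": polygon_bbox(polygon),
    "area": polygon_area(polygon),
    "iscrowd": 0,
}
print(annotation["bbox"], annotation["area"])
```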

Annotation Quality Matters More Than Schema Compliance

A JSON file can be perfectly valid and still produce a bad model. Common quality problems include:

inconsistent box tightness
overlapping class definitions
missing objects in crowded scenes
different labeling rules across annotators

Format conversion is easy compared with enforcing annotation consistency. If your team uses multiple annotators, write explicit labeling rules before the dataset grows.
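One rough way to measure consistency is to have two annotators label the same object and compare their boxes with intersection over union (IoU). The sketch below assumes boxes in COCO [x, y, width, height] form; the 0.8 review threshold is an arbitrary example value, not a recommendation from any standard.

```python
def bbox_iou(a, b):
    """Intersection over union of two COCO [x, y, width, height] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


REVIEW_THRESHOLD = 0.8  # assumed value; tune for your own labeling rules

box_annotator_1 = [100, 120, 80, 60]
box_annotator_2 = [110, 120, 80, 60]
iou = bbox_iou(box_annotator_1, box_annotator_2)
if iou < REVIEW_THRESHOLD:
    print(f"boxes disagree (IoU={iou:.2f}); flag for review")
```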

Validate Before Training

Before training any model, run sanity checks such as:

every annotation.image_id points to a real image
every annotation.category_id points to a real category
box widths and heights are positive
image dimensions match the actual files on disk
IDs are unique where required

A small validator script can save hours of debugging later when a training pipeline fails on malformed annotations.
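Most of the checks above fit in a few lines of plain Python. This is a minimal sketch with a hypothetical validate_coco function; it leaves out the on-disk dimension check, which would need an image library to read each file.

```python
def validate_coco(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    errors = []
    image_ids = [img["id"] for img in coco["images"]]
    category_ids = [cat["id"] for cat in coco["categories"]]
    annotation_ids = [ann["id"] for ann in coco["annotations"]]

    # IDs must be unique within each section.
    for label, ids in (
        ("image", image_ids),
        ("category", category_ids),
        ("annotation", annotation_ids),
    ):
        if len(ids) != len(set(ids)):
            errors.append(f"duplicate {label} ids")

    image_id_set, category_id_set = set(image_ids), set(category_ids)
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_id_set:
            errors.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_id_set:
            errors.append(
                f"annotation {ann['id']}: unknown category_id {ann['category_id']}"
            )
        _, _, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            errors.append(f"annotation {ann['id']}: non-positive box size")
    return errors
```

Returning a list of problems instead of raising on the first one lets you see every issue in a single pass, which is usually more convenient when cleaning a converted dataset.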

Common Pitfalls

A common mistake is storing boxes as corner coordinates instead of COCO width-height boxes. Another is changing category IDs midway through annotation work.

It is also easy to focus entirely on conversion and ignore annotation quality. A valid COCO file with inconsistent labels is still a weak dataset.

Finally, do not delay split planning until the end. Leakage between training and validation sets can make model quality look better than it really is.

Summary

A COCO-style dataset is defined by structured JSON plus the image files it references.
The essential sections are images, annotations, and categories.
COCO bounding boxes use [x, y, width, height].
Stable IDs and clear labeling rules are critical.
Good dataset quality depends as much on annotation discipline as on schema correctness.