Crop image to bounding box in TensorFlow Object Detection API
Introduction
Cropping an image to a detection result in TensorFlow usually means converting a model's bounding box into valid image coordinates. The main source of confusion is that TensorFlow Object Detection outputs are commonly normalized and ordered as ymin, xmin, ymax, xmax, which is easy to mix up with other box formats.
Convert Detection Boxes to Pixel Coordinates
If you want a crop from one image and one box, first convert the normalized coordinates into integer pixel edges. Then compute the crop height and width from those edges.
The important detail is that tf.image.crop_to_bounding_box does not take ending coordinates. It needs a top-left offset plus a height and width.
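The conversion described above can be sketched roughly as follows. This is a minimal sketch, assuming a single height-width-channels image and one normalized box that already lies inside the image; the helper names box_to_crop_args and crop_detection are hypothetical and not part of TensorFlow.

```python
def box_to_crop_args(box, image_height, image_width):
    """Turn a normalized [ymin, xmin, ymax, xmax] box into
    (offset_height, offset_width, target_height, target_width)."""
    ymin, xmin, ymax, xmax = box
    # Scale each normalized edge to an integer pixel edge.
    top = int(round(ymin * image_height))
    left = int(round(xmin * image_width))
    bottom = int(round(ymax * image_height))
    right = int(round(xmax * image_width))
    # crop_to_bounding_box wants a size, not the bottom-right corner.
    return top, left, bottom - top, right - left


def crop_detection(image, box):
    """Crop one HWC image tensor to one normalized detection box."""
    # Imported here so the coordinate helper above works without TensorFlow.
    import tensorflow as tf

    height, width = image.shape[0], image.shape[1]
    return tf.image.crop_to_bounding_box(
        image, *box_to_crop_args(box, height, width)
    )
```

For a 100 by 200 image and the box [0.25, 0.5, 0.75, 1.0], the helper yields an offset of (25, 100) and a size of 50 by 100 pixels. Rounding to the nearest pixel is one choice among several; flooring the top-left edge and ceiling the bottom-right edge would keep the whole box inside the crop.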
Clamp and Validate Before Cropping
Model output is not always perfectly clean. A box can be partly outside the image, reversed by a buggy post-processing step, or so small that integer rounding removes its size. Clamp the values before cropping.
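That defensive step matters because crop_to_bounding_box rejects negative offsets, non-positive sizes, and crops that extend past the image edge. Clamping first avoids those runtime errors and makes debugging much easier.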
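One way to express that clamping step is sketched below. It assumes the box has already been converted to integer pixel edges; the function name clamp_pixel_box is hypothetical, and returning None for an empty or reversed box is a design assumption (a caller could equally raise an error or swap the edges).

```python
def clamp_pixel_box(top, left, bottom, right, image_height, image_width):
    """Clamp pixel edges to the image and reject degenerate boxes.

    Returns (offset_height, offset_width, target_height, target_width),
    or None when nothing is left to crop.
    """
    # Pull every edge back inside [0, image size].
    top = max(0, min(top, image_height))
    bottom = max(0, min(bottom, image_height))
    left = max(0, min(left, image_width))
    right = max(0, min(right, image_width))

    # A reversed or zero-area box cannot be cropped.
    if bottom <= top or right <= left:
        return None
    return top, left, bottom - top, right - left
```

Callers can then skip any detection for which the helper returns None instead of passing an invalid size to the crop call.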
Use crop_and_resize for Batches
When you need many crops or fixed-size outputs, tf.image.crop_and_resize is usually the better tool. It takes a batch of images, normalized boxes in the same ymin, xmin, ymax, xmax order, and an index that maps each box to its image, then resizes every crop to one output size.
This is especially useful when a detector feeds another model stage that expects a fixed input size.
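A batched crop might look like the sketch below. It assumes TensorFlow 2.x and uses made-up placeholder data: two blank images, three boxes, and a 64 by 64 output size chosen only for illustration.

```python
import tensorflow as tf

# Placeholder batch: 2 RGB images of 100x200 pixels (hypothetical data).
images = tf.zeros([2, 100, 200, 3], dtype=tf.float32)

# Normalized boxes in [ymin, xmin, ymax, xmax] order.
boxes = tf.constant(
    [
        [0.25, 0.50, 0.75, 1.00],
        [0.00, 0.00, 0.50, 0.50],
        [0.10, 0.10, 0.90, 0.90],
    ],
    dtype=tf.float32,
)

# box_indices[i] names the image in the batch that boxes[i] belongs to.
box_indices = tf.constant([0, 0, 1], dtype=tf.int32)

# Every crop is resized to 64x64, so the result has one fixed shape.
crops = tf.image.crop_and_resize(
    images, boxes, box_indices=box_indices, crop_size=[64, 64]
)
```

Here crops has shape (3, 64, 64, 3): one fixed-size crop per box, regardless of how large each box was in its source image.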
Separate Visualization from Model Input
Not every crop has the same goal. For debugging and annotation tools, you may want the raw pixel crop without resizing so you can inspect the object exactly as it appeared. For training pipelines, consistent output size is often more important than preserving the original crop dimensions.
That distinction usually determines whether crop_to_bounding_box or crop_and_resize is the better fit.
Common Pitfalls
The most common bug is reading the box as xmin, ymin, xmax, ymax when TensorFlow detections are usually ymin, xmin, ymax, xmax. Swapping those fields produces incorrect crops that can still look almost valid, which makes the bug easy to miss.
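If boxes arrive from another tool in xmin, ymin, xmax, ymax order, a small reordering step keeps the rest of the pipeline consistent. The sketch below is illustrative only, and the function name xyxy_to_yxyx is hypothetical.

```python
def xyxy_to_yxyx(box):
    """Reorder [xmin, ymin, xmax, ymax] into [ymin, xmin, ymax, xmax]."""
    xmin, ymin, xmax, ymax = box
    return [ymin, xmin, ymax, xmax]
```

Unpacking into named variables, rather than indexing by position, makes the intended order visible at the one place where it changes.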
Another common error is passing normalized values straight into crop_to_bounding_box. That function expects integer pixel offsets and sizes, not fractions.
People also forget that crop_to_bounding_box wants a height and width, not bottom-right coordinates. If you pass ymax and xmax directly, the crop dimensions will be wrong.
Finally, do not skip bounds checking. Real predictions can include tiny negative offsets or coordinates slightly above 1.0, especially after custom transformations.
Summary
- TensorFlow Object Detection boxes are commonly normalized and ordered as ymin, xmin, ymax, xmax.
- Convert normalized coordinates to pixel coordinates before using crop_to_bounding_box.
- Compute crop height and width from the box edges explicitly.
- Use crop_and_resize when you need batched crops or fixed output dimensions.
- Most cropping errors come from coordinate-format mistakes, not from TensorFlow itself.

