
Binning a column with pandas

Introduction

Binning turns a continuous column into grouped intervals, which is useful for reporting, visualization, and feature engineering. In pandas, the two main tools are pd.cut for fixed ranges and pd.qcut for quantile-based groups. Choosing the right one depends on whether you care about business-defined boundaries or evenly sized buckets.

Use pd.cut for Fixed Interval Ranges

pd.cut is the direct choice when you already know the boundaries you want. For example, age groups, order value tiers, or temperature bands often come from domain rules rather than the data distribution itself.

python
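import pandas as pd

df = pd.DataFrame({
    "age": [19, 22, 29, 33, 37, 45, 54, 61]
})

# Edges define right-closed intervals by default: (0, 18], (18, 30], (30, 50], (50, 100]
bins = [0, 18, 30, 50, 100]
# One label per interval, so len(labels) == len(bins) - 1
labels = ["child", "young_adult", "adult", "senior"]

df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)
print(df)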

This produces a categorical column where each row is assigned to one labeled interval. Fixed bins are easy to explain because the cut points remain stable across datasets.
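
A quick way to check the result is to count rows per group. The sketch below reuses the sample data from above; the `counts` variable is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 22, 29, 33, 37, 45, 54, 61]})
bins = [0, 18, 30, 50, 100]
labels = ["child", "young_adult", "adult", "senior"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Categories with no rows still appear, with a count of 0.
# sort_index() orders the result by category order rather than by count.
counts = df["age_group"].value_counts().sort_index()
print(counts)
```

In this sample no age is 18 or below, so "child" shows up with a count of zero rather than disappearing from the summary.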

Use pd.qcut for Quantile-Based Buckets

If you want groups with roughly the same number of rows, use pd.qcut. It computes edges from the actual distribution.

python
import pandas as pd

df = pd.DataFrame({
    "score": [10, 12, 14, 15, 20, 25, 30, 60]
})

df["quartile"] = pd.qcut(df["score"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(df)

This is often better for ranking or segmentation tasks because each bucket contains a similar share of observations. The tradeoff is that the boundaries are data-dependent: the same call on a different dataset, or on the same dataset after new rows arrive, can produce different edges.
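
If you need the same boundaries later, one option is to capture the computed edges with retbins=True and reuse them through pd.cut. This is a minimal sketch; the `train` and `new_scores` series are made-up data, not part of the example above.

```python
import pandas as pd

train = pd.Series([10, 12, 14, 15, 20, 25, 30, 60])
labels = ["Q1", "Q2", "Q3", "Q4"]

# retbins=True returns the computed quantile edges alongside the binned values.
# For this data the edges are [10, 13.5, 17.5, 26.25, 60].
train_q, edges = pd.qcut(train, q=4, labels=labels, retbins=True)

# Reuse the frozen edges on new data so bucket boundaries do not move.
new_scores = pd.Series([11, 18, 40])
new_q = pd.cut(new_scores, bins=edges, labels=labels, include_lowest=True)
print(new_q)
```

New values below 10 or above 60 would come back as missing here, because they fall outside the edges learned from the original data.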

Control Edge Inclusion and Labels

Pandas lets you control which edge of each interval is closed. By default, right=True makes intervals right-closed, such as (0, 10], so a value equal to the lowest edge is excluded unless you pass include_lowest=True.

python
import pandas as pd

values = pd.Series([0, 10, 20, 30])
bucketed = pd.cut(
    values,
    bins=[0, 10, 20, 30],
    include_lowest=True,  # without this, 0 would fall outside (0, 10] and become NaN
    right=True,  # the default: each interval includes its right edge
)

print(bucketed)

These options matter when values land exactly on a boundary. If your business rule says that 30 belongs in the upper tier, set right and include_lowest explicitly and check the result rather than trusting the defaults.
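
To see how much the setting matters, the sketch below bins the same boundary values twice, once with the default right-closed intervals and once with right=False. The variable names are illustrative.

```python
import pandas as pd

values = pd.Series([0, 10, 20, 30])
edges = [0, 10, 20, 30]

# Default: intervals are (0, 10], (10, 20], (20, 30], so 0 is left out.
right_closed = pd.cut(values, bins=edges)

# right=False: intervals are [0, 10), [10, 20), [20, 30), so 30 is left out.
left_closed = pd.cut(values, bins=edges, right=False)

print(pd.DataFrame({
    "value": values,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

Each boundary value moves to a different bucket between the two calls, and in each case one endpoint of the full range ends up missing.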

Labels also deserve care. Human-readable labels are useful for reporting, but interval objects are useful for debugging because they show the exact range that pandas created.
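
As a rough illustration of the two styles, the sketch below bins the same values with and without custom labels. The label names are invented for the example.

```python
import pandas as pd

values = pd.Series([5, 15, 25])
edges = [0, 10, 20, 30]

# No labels: each row holds an Interval object showing the exact range.
as_intervals = pd.cut(values, bins=edges)

# Custom labels: easier to read in a report, but the ranges are hidden.
as_labels = pd.cut(values, bins=edges, labels=["low", "mid", "high"])

first = as_intervals.iloc[0]
print(first.left, first.right, first.closed)
print(as_labels.tolist())
```

A common workflow is to inspect the interval version first, confirm the ranges, and only then switch to readable labels.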

Missing Values and Out-of-Range Data

Rows can become missing after binning for two common reasons. First, the original value may already be missing. Second, the value may fall outside the supplied bin edges.

python
import pandas as pd

df = pd.DataFrame({"amount": [5, 15, 500, None]})
# The last edge is 100, so 500 has no interval to land in
df["band"] = pd.cut(df["amount"], bins=[0, 10, 20, 100])
print(df)

Here, both 500 and the missing input produce missing output: 500 lies beyond the last edge of 100, and a missing value has nothing to place in any interval. That is not a pandas bug. For 500, it means your bins do not cover the full range of the data.

When you need total coverage, define a wider final bin or clean the data before binning.
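
One way to widen the final bin is to use infinity as the last edge. The sketch below assumes an open-ended top tier is acceptable for your use case; the band labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"amount": [5, 15, 500, None]})

# float("inf") as the last edge makes the top bin open-ended.
edges = [0, 10, 20, 100, float("inf")]
labels = ["0-10", "10-20", "20-100", "100+"]

df["band"] = pd.cut(df["amount"], bins=edges, labels=labels)
print(df)
```

With this change 500 lands in the "100+" band. The row whose input was missing stays missing, which is usually what you want to see rather than hide.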

Binning for Analysis Versus Modeling

Binning is easy to overuse. For dashboard summaries, grouped intervals are often clearer than raw numbers. For statistical models, however, aggressive binning can throw away useful signal and create artificial boundaries.

That means binning should be an intentional modeling choice, not a reflex. If you are preparing data for machine learning, compare results with and without the discretized feature.

It is also worth remembering that pandas returns a categorical result. That is efficient and useful, but if you later export the column or feed it into another library, be clear about whether you need string labels, ordered categories, or encoded integers.
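
The conversions might look like the following sketch. The sample ages are made up, and 150 is included on purpose to show how an out-of-range value appears in each form.

```python
import pandas as pd

ages = pd.Series([19, 33, 61, 150])
group = pd.cut(
    ages,
    bins=[0, 18, 30, 50, 100],
    labels=["child", "young_adult", "adult", "senior"],
)

print(group.dtype)        # category
print(group.cat.ordered)  # True: pd.cut produces ordered categories by default

# Plain Python strings; the out-of-range row stays missing (NaN).
as_text = group.astype(object)

# Integer codes by category position; -1 marks a missing value.
as_codes = group.cat.codes
print(as_codes.tolist())
```

Note that the integer codes use -1 for missing rows, which downstream libraries may treat as a real value unless you handle it first.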

Common Pitfalls

  • Using pd.cut when the goal is equal-sized groups by row count. That is usually a pd.qcut problem.
  • Forgetting that exact boundary handling depends on parameters such as right and include_lowest.
  • Defining bins that do not cover the full data range and then being surprised by missing results.
  • Assuming the output is plain text rather than a categorical series.
  • Binning too early in a modeling pipeline and losing useful numeric detail.
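
The first pitfall is easy to see on a skewed column. As a rough sketch, the code below reuses the score values from the pd.qcut example and asks each function for four bins.

```python
import pandas as pd

scores = pd.Series([10, 12, 14, 15, 20, 25, 30, 60])

# An integer passed to pd.cut means four equal-width bins across the value range.
equal_width = pd.cut(scores, bins=4)

# pd.qcut means four bins with roughly equal row counts.
equal_count = pd.qcut(scores, q=4)

width_counts = equal_width.value_counts().sort_index()
count_counts = equal_count.value_counts().sort_index()
print(width_counts.tolist())  # [5, 2, 0, 1]
print(count_counts.tolist())  # [2, 2, 2, 2]
```

The single large value stretches the equal-width bins, so most rows pile into the first one and one bin is empty, while the quantile version keeps two rows per bucket.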

Summary

  • Use pd.cut when you want fixed, domain-driven intervals.
  • Use pd.qcut when you want roughly equal-sized groups based on quantiles.
  • Check edge inclusion rules so boundary values land where you expect.
  • Expect missing output when values are missing or outside the bin range.
  • Treat binning as a deliberate transformation, not just a formatting step.
