pandas
data analysis
python
unique values
groupby

Count unique values per groups with Pandas

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Counting unique values within groups is one of the most common pandas aggregation patterns. The core tool is groupby(...).nunique(), but the real details that matter are null handling, grouping level, and whether you are counting a single column or unique combinations across several columns.

Use groupby with nunique

For the basic case, group by one key and count distinct values in another column.

python
1import pandas as pd
2
3df = pd.DataFrame({
4    'city': ['Toronto', 'Toronto', 'Montreal', 'Montreal', 'Montreal'],
5    'fruit': ['apple', 'banana', 'apple', 'apple', 'pear'],
6})
7
8result = df.groupby('city')['fruit'].nunique()
9print(result)

This returns the number of distinct fruits for each city. It is the standard answer when you need grouped cardinality.

Be Explicit About Null Handling

nunique excludes missing values by default. Sometimes that is correct. Sometimes a missing value should count as its own category.

python
1df = pd.DataFrame({
2    'team': ['A', 'A', 'B', 'B'],
3    'user': ['u1', None, 'u2', None],
4})
5
6print(df.groupby('team')['user'].nunique(dropna=True))
7print(df.groupby('team')['user'].nunique(dropna=False))

That dropna choice should be treated as part of the business logic, not as a default you forget to notice.

Named Aggregation Helps in Real Reports

Grouped unique counts are often part of a broader summary. Named aggregation keeps the output readable.

python
1summary = (
2    df.groupby('team')
3      .agg(
4          unique_users=('user', 'nunique'),
5          total_rows=('user', 'size'),
6      )
7      .reset_index()
8)
9
10print(summary)

This creates stable column names that work better in downstream reporting and pipelines.

Count Unique Combinations Across Columns

Sometimes you do not want unique values from one column. You want unique pairs or tuples across several columns. In that case, remove duplicates on the relevant subset and then group.

python
1df = pd.DataFrame({
2    'region': ['east', 'east', 'east', 'west'],
3    'store': ['s1', 's1', 's2', 's1'],
4    'product': ['p1', 'p1', 'p2', 'p1'],
5})
6
7combo_counts = (
8    df.drop_duplicates(subset=['region', 'store', 'product'])
9      .groupby('region')
10      .size()
11)
12
13print(combo_counts)

That pattern avoids overcounting repeated identical rows.

Broadcast the Group Count Back to Each Row

Sometimes the goal is feature engineering rather than aggregation output. In that case, transform is useful because it keeps the original row shape.

python
1df = pd.DataFrame({
2    'city': ['Toronto', 'Toronto', 'Montreal'],
3    'fruit': ['apple', 'banana', 'apple'],
4})
5
6df['city_unique_fruit_count'] = df.groupby('city')['fruit'].transform('nunique')
7print(df)

This gives every row access to the unique-count statistic of its group.

Watch Performance on Large Data

Grouped uniqueness can be expensive on large tables. A few practical habits help:

  • select only needed columns before grouping
  • convert repetitive group keys to categorical when appropriate
  • use the simplest grouping level that answers the question

Performance tuning starts with reducing unnecessary work, not just with changing syntax.

Grouped distinct counts are also a useful data-quality signal. Sudden jumps in per-group uniqueness often reveal upstream deduplication failures, parsing changes, or identifier corruption before those issues show up in business-facing dashboards.

If the grouped result will later be merged into another table, reset the index deliberately so the grouping key becomes an ordinary column again. That small step often prevents confusing join behavior in downstream code.

Common Pitfalls

Using count when the requirement was distinct count is the classic mistake.

Forgetting the null policy can lead to reports that disagree even when the code looks almost identical.

Counting unique combinations by a single column instead of by a deduplicated subset gives the wrong business metric.

Summary

  • Use groupby(...).nunique() for grouped distinct counts.
  • Decide explicitly whether missing values should count as unique.
  • Use named aggregation for readable summaries.
  • Deduplicate subsets first when you need unique multi-column combinations.
  • Use transform('nunique') when each row needs the group-level count.

Course illustration
Course illustration

All Rights Reserved.