Select rows in pandas MultiIndex DataFrame

python

pandas

dataframe

multiindex

data-analysis

Select rows in pandas MultiIndex DataFrame

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

A MultiIndex (hierarchical index) DataFrame in pandas has two or more index levels, allowing you to represent higher-dimensional data in a 2D table. Selecting rows from a MultiIndex DataFrame uses different syntax than a single-index DataFrame — you can select by level, cross-section, or boolean conditions using .loc, .xs(), IndexSlice, and query().

Creating a MultiIndex DataFrame

python

1import pandas as pd
2import numpy as np
3
4# Method 1: From tuples
5index = pd.MultiIndex.from_tuples([
6    ('US', 'NYC'), ('US', 'LA'), ('US', 'Chicago'),
7    ('UK', 'London'), ('UK', 'Manchester'),
8    ('DE', 'Berlin'), ('DE', 'Munich')
9], names=['country', 'city'])
10
11df = pd.DataFrame({
12    'population': [8_336_817, 3_979_576, 2_693_976, 8_982_000, 553_230, 3_748_148, 1_472_000],
13    'area_km2': [783, 1_302, 606, 1_572, 115, 892, 310]
14}, index=index)
15
16print(df)
17#                    population  area_km2
18# country city
19# US      NYC         8336817       783
20#         LA          3979576      1302
21#         Chicago     2693976       606
22# UK      London      8982000      1572
23#         Manchester   553230       115
24# DE      Berlin      3748148       892
25#         Munich      1472000       310

Select by First Level (Outer Index)

python

1# All rows for 'US'
2df.loc['US']
3#          population  area_km2
4# city
5# NYC       8336817       783
6# LA        3979576      1302
7# Chicago   2693976       606
8
9# Multiple countries
10df.loc[['US', 'UK']]

Select by Both Levels

python

1# Specific row: country='US', city='NYC'
2df.loc[('US', 'NYC')]
3# population    8336817
4# area_km2          783
5
6# Multiple specific rows
7df.loc[[('US', 'NYC'), ('UK', 'London')]]

Select by Second Level with xs()

xs() (cross-section) selects rows by a value at a specific level:

python

1# All rows where city='London' (regardless of country)
2df.xs('London', level='city')
3#          population  area_km2
4# country
5# UK        8982000      1572
6
7# All rows where country='DE'
8df.xs('DE', level='country')
9#         population  area_km2
10# city
11# Berlin   3748148       892
12# Munich   1472000       310

Slicing with IndexSlice

pd.IndexSlice enables intuitive slicing on MultiIndex DataFrames:

python

1idx = pd.IndexSlice
2
3# All cities in 'US'
4df.loc[idx['US', :], :]
5# Same as df.loc['US']
6
7# All countries, but only cities starting with 'M' (requires sorted index)
8df = df.sort_index()  # Must sort for slicing
9df.loc[idx[:, 'London':'Munich'], :]
10
11# Specific level combinations
12df.loc[idx[['US', 'UK'], :], :]

Boolean Filtering

python

1# Filter by column values
2df[df['population'] > 3_000_000]
3
4# Filter by index level using get_level_values
5df[df.index.get_level_values('country') == 'US']
6
7# Combine conditions
8mask = (df.index.get_level_values('country').isin(['US', 'UK'])) & \
9       (df['population'] > 1_000_000)
10df[mask]

Using query() with MultiIndex

python

1# query() can reference index levels by name
2df.query('country == "US"')
3
4df.query('country == "US" and population > 3000000')
5
6# Use backticks for level names with special characters
7df.query('`country` in ["US", "UK"] and area_km2 > 500')

Resetting and Setting Index

python

1# Move index levels to columns
2df_flat = df.reset_index()
3#   country       city  population  area_km2
4# 0      US        NYC     8336817       783
5# 1      US         LA     3979576      1302
6
7# Filter on flat DataFrame, then set index back
8result = df_flat[df_flat['country'] == 'US'].set_index(['country', 'city'])

Selecting Specific Columns with MultiIndex

python

1# Select one column for one country
2df.loc['US', 'population']
3# city
4# NYC        8336817
5# LA         3979576
6# Chicago    2693976
7
8# Select specific rows and columns
9df.loc[('US', 'NYC'), 'population']
10# 8336817

GroupBy Operations on MultiIndex

python

1# Aggregate by first level
2df.groupby(level='country').sum()
3#          population  area_km2
4# country
5# DE        5220148      1202
6# UK        9535230      1687
7# US       15010369      2691
8
9# Aggregate by second level
10df.groupby(level='city').mean()

Swapping and Reordering Levels

python

1# Swap levels
2df_swapped = df.swaplevel()
3#                    population  area_km2
4# city       country
5# NYC        US       8336817       783
6# LA         US       3979576      1302
7
8# Now select by city first
9df_swapped.loc['London']
10
11# Reorder levels
12df_reordered = df.reorder_levels(['city', 'country'])

Common Pitfalls

Unsorted index for slicing: Slicing with loc on a MultiIndex requires the index to be sorted. Call df.sort_index() before slicing, or you get a UnsortedIndexError.
Tuple vs list in loc: df.loc[('US', 'NYC')] selects one row (tuple = single key). df.loc[['US', 'UK']] selects multiple first-level keys (list = multiple keys). Mixing these up gives unexpected results.
Dropping levels: After selecting by one level (e.g., df.loc['US']), that level is dropped from the result. Use drop_level=False in xs() to keep it: df.xs('US', level='country', drop_level=False).
Performance with large DataFrames: Repeated get_level_values() calls in loops are slow. For bulk filtering, reset the index, filter on columns, and set the index back.
String vs numeric index: If index levels are strings, df.loc[1] will fail. Ensure you use the correct type for indexing. Check with df.index.get_level_values(0).dtype.

Summary

Use df.loc['key'] to select by the outer (first) level of a MultiIndex
Use df.loc[('level1', 'level2')] to select by both levels
Use df.xs('value', level='name') to select by any level, including inner levels
Use pd.IndexSlice for intuitive slicing across multiple levels
Use df.query('level_name == "value"') for readable filtering on index levels
Always sort the index with df.sort_index() before slicing