
Adding meta-information/metadata to pandas DataFrame


Introduction

Pandas DataFrames do not have a dedicated metadata system, but there are several ways to attach extra information such as data source, creation date, units, or column descriptions. The main approaches are the attrs dictionary (pandas 1.0+), custom attributes set directly on the DataFrame, and a wrapper structure that stores metadata alongside the data. They differ in whether the metadata can be persisted to disk and whether it survives operations such as copy, merge, and serialization.

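Since pandas 1.0, every DataFrame has an attrs dictionary, which pandas propagates through many (but not all) operations. The pandas documentation marks attrs as experimental, so its behavior can change between versions: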

python
import pandas as pd

df = pd.DataFrame({'temperature': [20.5, 22.1, 19.8], 'humidity': [45, 50, 42]})

# Attach metadata
df.attrs['source'] = 'Weather Station A'
df.attrs['created_at'] = '2025-01-15'
df.attrs['units'] = {'temperature': 'Celsius', 'humidity': 'percent'}
df.attrs['author'] = 'Data Team'

print(df.attrs['source'])  # 'Weather Station A'

# attrs survives copy
df2 = df.copy()
print(df2.attrs['source'])  # 'Weather Station A'

attrs is propagated through copy(), slicing, and many other pandas operations, but it is not guaranteed to survive all of them. Transforms that combine or reshape data, such as merge() or groupby(), may drop it, and the exact behavior depends on the pandas version.
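
One way to cope with transforms that may drop attrs is to copy the metadata back explicitly. The sketch below assumes a hypothetical helper, carry_attrs, which is not part of pandas. It deep-copies the source attrs onto the result so that later edits to one DataFrame's metadata do not leak into the other. The sample frames and column names are made up for illustration.

```python
import copy

import pandas as pd


def carry_attrs(result, source):
    """Hypothetical helper: copy source.attrs onto result and return result."""
    result.attrs = copy.deepcopy(source.attrs)
    return result


readings = pd.DataFrame({'station': ['A', 'B'], 'temperature': [20.5, 22.1]})
readings.attrs['source'] = 'Weather Station A'
readings.attrs['units'] = {'temperature': 'Celsius'}

sites = pd.DataFrame({'station': ['A', 'B'], 'city': ['Oslo', 'Bergen']})

# merge() may or may not keep attrs depending on the pandas version,
# so re-attach them explicitly instead of relying on propagation
merged = carry_attrs(readings.merge(sites, on='station'), readings)

print(merged.attrs['source'])
```

Which input's metadata should win after a merge is a design decision; here the left-hand frame is simply assumed to be the authoritative one.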

Custom Attributes on the DataFrame

You can set arbitrary attributes directly on a DataFrame instance:

python
df = pd.DataFrame({'price': [10.5, 20.3, 15.0]})

df.data_source = 'sales_db'
df.collection_date = '2025-03-01'
df.version = 2

print(df.data_source)  # 'sales_db'

This works but the attributes are lost on any operation that returns a new DataFrame:

python
df_filtered = df[df['price'] > 12]
print(hasattr(df_filtered, 'data_source'))  # False — attribute is gone

Custom attributes only survive on the exact same object. Any pandas operation that creates a new DataFrame (filtering, sorting, merging) loses them.
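
If attribute-style access matters, pandas also documents a subclassing mechanism: names listed in a subclass's _metadata are copied onto results, provided _constructor returns the subclass. The following is a minimal sketch under that assumption. SourcedDataFrame and data_source are hypothetical names, and propagation is still not guaranteed for every operation, so treat this as an illustration and not a complete solution.

```python
import pandas as pd


class SourcedDataFrame(pd.DataFrame):
    """Hypothetical subclass whose data_source attribute is propagated."""

    # Names listed in _metadata are copied onto results by pandas
    _metadata = ['data_source']

    @property
    def _constructor(self):
        # Make pandas operations return SourcedDataFrame, not DataFrame
        return SourcedDataFrame


sales = SourcedDataFrame({'price': [10.5, 20.3, 15.0]})
sales.data_source = 'sales_db'

# Boolean filtering creates a new object, but the attribute is carried over
filtered = sales[sales['price'] > 12]
print(type(filtered).__name__)
print(filtered.data_source)
```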

Storing Metadata in a Wrapper Class

For metadata that must survive all operations, wrap the DataFrame:

python
class AnnotatedDataFrame:
    def __init__(self, df, metadata=None):
        self.df = df
        self.metadata = metadata or {}

    def __repr__(self):
        meta_str = ', '.join(f'{k}={v}' for k, v in self.metadata.items())
        return f"AnnotatedDataFrame(metadata={{{meta_str}}})\n{self.df}"

# Usage
adf = AnnotatedDataFrame(
    df=pd.DataFrame({'value': [1, 2, 3]}),
    metadata={
        'source': 'experiment_42',
        'created': '2025-01-15',
        'description': 'Sample measurements from Lab B'
    }
)

# Metadata always available
print(adf.metadata['source'])  # 'experiment_42'

# Operate on the DataFrame, metadata stays with the wrapper
adf.df = adf.df[adf.df['value'] > 1]
print(adf.metadata['source'])  # Still 'experiment_42'
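
Reassigning adf.df by hand works, but it mutates the wrapper in place. A possible variant, sketched here with a hypothetical class AnnotatedFrame and a hypothetical pipe method, returns a new wrapper for each transform and copies the metadata along, so even operations like groupby() keep their annotations.

```python
import copy

import pandas as pd


class AnnotatedFrame:
    """Sketch of a wrapper whose transforms return a new annotated wrapper."""

    def __init__(self, df, metadata=None):
        self.df = df
        self.metadata = dict(metadata or {})

    def pipe(self, func, *args, **kwargs):
        # Apply func to the wrapped DataFrame and carry a copy of the metadata
        result = func(self.df, *args, **kwargs)
        return AnnotatedFrame(result, copy.deepcopy(self.metadata))


adf = AnnotatedFrame(
    pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]}),
    metadata={'source': 'experiment_42'},
)

# groupby/sum runs on the plain DataFrame; the wrapper keeps the metadata
totals = adf.pipe(lambda d: d.groupby('group', as_index=False)['value'].sum())

print(totals.metadata['source'])
print(totals.df)
```

The trade-off is that every pandas call has to go through the wrapper, which is more verbose than working with a DataFrame directly.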

Column-Level Metadata

To describe individual columns, store descriptions alongside the DataFrame:

python
df = pd.DataFrame({
    'temp_c': [20.5, 22.1],
    'rh_pct': [45, 50],
    'wind_ms': [3.2, 5.1]
})

# Column metadata as a dictionary
column_meta = {
    'temp_c': {'description': 'Temperature', 'unit': 'Celsius', 'sensor': 'T-100'},
    'rh_pct': {'description': 'Relative Humidity', 'unit': '%', 'sensor': 'H-200'},
    'wind_ms': {'description': 'Wind Speed', 'unit': 'm/s', 'sensor': 'W-300'}
}

df.attrs['columns'] = column_meta

# Access column info
print(df.attrs['columns']['temp_c']['unit'])  # 'Celsius'
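
Because the column dictionary is maintained by hand, it can drift out of sync with the actual columns. The sketch below shows two hypothetical helpers (neither is a pandas API): one lists columns that have no metadata entry, the other builds display labels from the stored description and unit. The wind_ms entry is deliberately left out here to show the check at work.

```python
import pandas as pd

df = pd.DataFrame({'temp_c': [20.5, 22.1], 'rh_pct': [45, 50], 'wind_ms': [3.2, 5.1]})
df.attrs['columns'] = {
    'temp_c': {'description': 'Temperature', 'unit': 'Celsius'},
    'rh_pct': {'description': 'Relative Humidity', 'unit': '%'},
}


def undocumented_columns(frame):
    """Hypothetical helper: columns that have no entry in attrs['columns']."""
    documented = frame.attrs.get('columns', {})
    return [col for col in frame.columns if col not in documented]


def display_labels(frame):
    """Hypothetical helper: map column name to 'Description (unit)'."""
    documented = frame.attrs.get('columns', {})
    return {
        col: f"{meta['description']} ({meta['unit']})"
        for col, meta in documented.items()
        if col in frame.columns
    }


print(undocumented_columns(df))
print(display_labels(df))
```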

Persisting Metadata to Disk

HDF5 (Best for Metadata)

python
# Save the data (to_hdf requires the optional PyTables dependency)
df.attrs['source'] = 'Weather API'
df.to_hdf('data.h5', key='weather')

# Do not rely on to_hdf() to persist df.attrs; attach the metadata
# explicitly to the node's HDF5 attributes through HDFStore
with pd.HDFStore('data.h5') as store:
    store.get_storer('weather').attrs.metadata = {'source': 'Weather API', 'version': 3}

# Read back
with pd.HDFStore('data.h5') as store:
    metadata = store.get_storer('weather').attrs.metadata

Parquet (Partial Support)

python
# Parquet files carry schema-level metadata; whether df.attrs is written
# there depends on the pandas version and the engine
df.attrs['source'] = 'experiment_1'
df.to_parquet('data.parquet')

# attrs may not survive the round trip, so check after loading
df_loaded = pd.read_parquet('data.parquet')
print(df_loaded.attrs.get('source'))  # 'experiment_1' if preserved, otherwise None

JSON Sidecar File

python
import json

# Save DataFrame and metadata separately
df.to_csv('data.csv', index=False)
with open('data_meta.json', 'w') as f:
    json.dump({
        'source': 'Weather API',
        'created': '2025-01-15',
        'columns': {'temp_c': 'Celsius', 'rh_pct': 'percent'}
    }, f)

# Load both
df = pd.read_csv('data.csv')
with open('data_meta.json') as f:
    df.attrs = json.load(f)
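
The sidecar pattern is easier to keep consistent when saving and loading go through a pair of functions. The sketch below assumes two hypothetical helpers, save_with_sidecar and load_with_sidecar, and a made-up naming rule (data.csv pairs with data.meta.json). It only works when everything in attrs is JSON-serializable.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def save_with_sidecar(frame, csv_path):
    """Hypothetical helper: write the CSV plus a <name>.meta.json sidecar."""
    csv_path = Path(csv_path)
    frame.to_csv(csv_path, index=False)
    meta_path = csv_path.with_suffix('.meta.json')
    # attrs must be JSON-serializable for this to work
    meta_path.write_text(json.dumps(frame.attrs, indent=2))
    return meta_path


def load_with_sidecar(csv_path):
    """Hypothetical helper: read the CSV and restore attrs if a sidecar exists."""
    csv_path = Path(csv_path)
    frame = pd.read_csv(csv_path)
    meta_path = csv_path.with_suffix('.meta.json')
    if meta_path.exists():
        frame.attrs = json.loads(meta_path.read_text())
    return frame


original = pd.DataFrame({'temp_c': [20.5, 22.1], 'rh_pct': [45, 50]})
original.attrs = {'source': 'Weather API', 'columns': {'temp_c': 'Celsius'}}

# Round-trip through a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'data.csv'
    save_with_sidecar(original, path)
    restored = load_with_sidecar(path)

print(restored.attrs)
```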

Metadata Survival Through Operations

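| Operation | attrs preserved | Custom attributes preserved |
|---|---|---|
| df.copy() | Yes | No |
| df[df['col'] > 0] | Yes | No |
| df.head() | Yes | No |
| df.merge(other) | No | No |
| df.groupby().agg() | No | No |
| pd.concat([df1, df2]) | Only if all inputs carry identical, non-empty attrs (recent versions) | No |
| df.to_csv() / read_csv() | No | No |
| df.to_parquet() / read_parquet() | Partial (depends on engine and version) | No |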
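
Since propagation rules differ between pandas versions, it can be worth checking the behavior of the installed version directly. The sketch below uses a hypothetical helper, attrs_survival_report, that runs a few sample operations on a small made-up frame and records whether a marker in attrs is still present afterwards.

```python
import pandas as pd


def attrs_survival_report():
    """Hypothetical helper: run sample operations and report whether attrs survive."""
    df = pd.DataFrame({'key': [1, 2, 3], 'col': [-1, 0, 5]})
    df.attrs['source'] = 'probe'
    other = pd.DataFrame({'key': [1, 2, 3], 'extra': ['x', 'y', 'z']})

    operations = {
        'copy': lambda: df.copy(),
        'filter': lambda: df[df['col'] > 0],
        'head': lambda: df.head(2),
        'merge': lambda: df.merge(other, on='key'),
        'groupby_agg': lambda: df.groupby('key').agg('sum'),
        'concat': lambda: pd.concat([df, df]),
    }
    # True when the marker is still present on the result
    return {
        name: op().attrs.get('source') == 'probe'
        for name, op in operations.items()
    }


print(pd.__version__)
for name, survived in attrs_survival_report().items():
    print(f'{name}: {"kept" if survived else "dropped"}')
```

The output is intentionally not shown, because it depends on the pandas version in use.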

Common Pitfalls

  • Assuming attrs survives all operations: attrs is propagated through simple operations like copy() and slicing, but merge(), groupby(), and pivot_table() may not preserve it, and concat() keeps it only when every input carries identical attrs. Check the result and re-attach metadata after complex transforms.
  • Setting custom attributes and expecting persistence: df.my_attr = 'value' works on the current object but is lost whenever pandas creates a new DataFrame. This happens on nearly every operation.
  • Storing metadata in CSV files: CSV has no mechanism for metadata. The data is saved but attrs and custom attributes are lost. Use HDF5, Parquet, or a sidecar JSON file instead.
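  • Mutating attrs on a derived DataFrame: Depending on the pandas version, attrs is copied either shallowly or deeply when it is propagated to a slice or copy. With a shallow copy, nested mutable values (such as a units dictionary) are shared, so modifying them on the derived DataFrame can unexpectedly change the original DataFrame's attrs too. Copy nested values explicitly before modifying them.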
  • Using __dict__ for metadata: While df.__dict__ stores custom attributes, directly manipulating it is fragile and not part of the pandas API. Use attrs for metadata that pandas explicitly supports.

Summary

  • Use df.attrs (pandas 1.0+) for metadata that survives copy and slicing operations
  • Custom attributes (df.source = 'x') are lost on any operation that creates a new DataFrame
  • Wrap DataFrames in a custom class when metadata must survive all operations
  • HDF5 is the best format for persisting metadata alongside data
  • CSV and most serialization formats do not preserve metadata — use sidecar files
  • Always check and, where needed, re-attach metadata after merge(), groupby(), concat(), and similar transforms
