Comparing dictionaries based on a combination of keys

Introduction

When comparing Python dictionaries based on a subset of their keys, you extract the relevant keys from each dictionary and compare only those values. This is useful for matching records that share an identity (like name + date) but may differ in other fields. Approaches include extracting key subsets with comprehensions, using operator.itemgetter, converting to tuples for hashing, and using pandas for large-scale record matching.

Comparing Two Dictionaries on Specific Keys

python
def dicts_equal_on_keys(d1, d2, keys):
    """Check if two dicts have the same values for the given keys."""
    return all(d1.get(k) == d2.get(k) for k in keys)

record_a = {"name": "Alice", "age": 30, "city": "NYC", "score": 95}
record_b = {"name": "Alice", "age": 30, "city": "LA", "score": 88}

# Compare on name and age only
print(dicts_equal_on_keys(record_a, record_b, ["name", "age"]))  # True

# Compare on all fields
print(record_a == record_b)  # False (city and score differ)
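
One caveat of the .get() comparison above is that a missing key and a key explicitly set to None look the same. The sketch below is one possible stricter variant, assuming you want a missing key to never equal a present one; the sentinel _MISSING and the function name dicts_equal_on_keys_strict are illustrative and not part of the original example. A key absent from both dictionaries still counts as equal here.

```python
_MISSING = object()  # hypothetical sentinel: compares equal only to itself

def dicts_equal_on_keys_strict(d1, d2, keys):
    """Like the .get() comparison, but a missing key never equals a present one."""
    return all(d1.get(k, _MISSING) == d2.get(k, _MISSING) for k in keys)

with_none = {"name": "Alice", "age": None}
without_age = {"name": "Alice"}
keys = ["name", "age"]

# Plain .get(k) treats the missing "age" as None, so the dicts look equal.
print(all(with_none.get(k) == without_age.get(k) for k in keys))  # True

# With the sentinel, "missing" and "None" are distinguished.
print(dicts_equal_on_keys_strict(with_none, without_age, keys))  # False
```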

Extracting a Key Subset

python
def subset(d, keys):
    """Extract a subset of keys from a dictionary, skipping keys that are absent."""
    return {k: d[k] for k in keys if k in d}

record = {"name": "Alice", "age": 30, "city": "NYC", "email": "[email protected]"}

identity = subset(record, ["name", "age"])
print(identity)  # {'name': 'Alice', 'age': 30}

# Compare subsets (same records as in the previous example)
record_a = {"name": "Alice", "age": 30, "city": "NYC", "score": 95}
record_b = {"name": "Alice", "age": 30, "city": "LA", "score": 88}
print(subset(record_a, ["name", "age"]) == subset(record_b, ["name", "age"]))  # True

Finding Matching Records in Two Lists

python
from operator import itemgetter

employees = [
    {"name": "Alice", "dept": "Engineering", "id": 1},
    {"name": "Bob", "dept": "Sales", "id": 2},
    {"name": "Charlie", "dept": "Engineering", "id": 3},
]

contractors = [
    {"name": "Alice", "dept": "Engineering", "rate": 150},
    {"name": "Diana", "dept": "Sales", "rate": 120},
]

# Find employees that match contractors on (name, dept)
match_keys = ["name", "dept"]

# With two or more keys, itemgetter returns a tuple of the values
# (with a single key it returns the bare value instead).
get_key = itemgetter(*match_keys)

contractor_set = {get_key(c) for c in contractors}
# {('Alice', 'Engineering'), ('Diana', 'Sales')}

matches = [e for e in employees if get_key(e) in contractor_set]
print(matches)
# [{'name': 'Alice', 'dept': 'Engineering', 'id': 1}]
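
The set lookup above tells you which employees match, but not which contractor record each one matched. If you need the paired records, one option is to index one list in a dict keyed by the composite tuple. This is a minimal sketch, assuming the composite key is unique among contractors (a later duplicate would overwrite an earlier one); key_of and contractor_index are illustrative names, not part of the original example.

```python
employees = [
    {"name": "Alice", "dept": "Engineering", "id": 1},
    {"name": "Bob", "dept": "Sales", "id": 2},
    {"name": "Charlie", "dept": "Engineering", "id": 3},
]
contractors = [
    {"name": "Alice", "dept": "Engineering", "rate": 150},
    {"name": "Diana", "dept": "Sales", "rate": 120},
]
match_keys = ["name", "dept"]

def key_of(record, keys):
    """Build a hashable composite key (hypothetical helper for this sketch)."""
    return tuple(record[k] for k in keys)

# Index contractors by composite key for O(1) average-case lookups.
contractor_index = {key_of(c, match_keys): c for c in contractors}

# Pair each employee with the contractor record sharing its (name, dept).
pairs = [
    (e, contractor_index[key_of(e, match_keys)])
    for e in employees
    if key_of(e, match_keys) in contractor_index
]

for emp, con in pairs:
    print(emp["id"], emp["name"], con["rate"])  # 1 Alice 150
```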

Finding Differences Between Dictionaries

python
def diff_on_keys(d1, d2, keys):
    """Find keys where two dicts differ."""
    return {k: (d1.get(k), d2.get(k)) for k in keys if d1.get(k) != d2.get(k)}

old = {"name": "Alice", "age": 30, "city": "NYC", "salary": 100000}
new = {"name": "Alice", "age": 31, "city": "NYC", "salary": 110000}

changes = diff_on_keys(old, new, ["age", "city", "salary"])
print(changes)
# {'age': (30, 31), 'salary': (100000, 110000)}
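
When the compared values are floats, exact inequality can report differences that are only rounding noise. The sketch below is one way to fold math.isclose into the diff, assuming a relative tolerance is acceptable for your data; values_differ and diff_on_keys_tolerant are illustrative names, and the default tolerances are simply those of math.isclose.

```python
import math
from numbers import Real

def values_differ(a, b, rel_tol=1e-9, abs_tol=0.0):
    """Return True if a and b differ; real numbers are compared with a tolerance."""
    both_numeric = (
        isinstance(a, Real) and isinstance(b, Real)
        and not isinstance(a, bool) and not isinstance(b, bool)
    )
    if both_numeric:
        return not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a != b

def diff_on_keys_tolerant(d1, d2, keys, **tol):
    """Like a plain key-by-key diff, but tolerant of float rounding error."""
    return {
        k: (d1.get(k), d2.get(k))
        for k in keys
        if values_differ(d1.get(k), d2.get(k), **tol)
    }

old = {"name": "Alice", "ratio": 0.3}
new = {"name": "Alice", "ratio": 0.1 + 0.2}  # 0.30000000000000004

print(old["ratio"] == new["ratio"])                        # False
print(diff_on_keys_tolerant(old, new, ["name", "ratio"]))  # {}
```

For values near zero a relative tolerance alone is not meaningful, so passing an explicit abs_tol is usually the safer choice there.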

Grouping by Composite Keys

python
from collections import defaultdict

records = [
    {"date": "2024-01-01", "product": "A", "region": "East", "sales": 100},
    {"date": "2024-01-01", "product": "A", "region": "West", "sales": 150},
    {"date": "2024-01-01", "product": "B", "region": "East", "sales": 200},
    {"date": "2024-01-02", "product": "A", "region": "East", "sales": 120},
]

# Group by (date, product)
grouped = defaultdict(list)
for r in records:
    key = (r["date"], r["product"])
    grouped[key].append(r)

for key, group in grouped.items():
    total = sum(r["sales"] for r in group)
    print(f"{key}: total sales = {total}")

# ('2024-01-01', 'A'): total sales = 250
# ('2024-01-01', 'B'): total sales = 200
# ('2024-01-02', 'A'): total sales = 120

Deduplicating by Composite Key

python
def deduplicate(records, keys):
    """Keep the first record for each unique combination of keys."""
    seen = set()
    result = []
    for record in records:
        key = tuple(record.get(k) for k in keys)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result

data = [
    {"name": "Alice", "date": "2024-01-01", "amount": 100},
    {"name": "Alice", "date": "2024-01-01", "amount": 150},  # Duplicate
    {"name": "Bob", "date": "2024-01-01", "amount": 200},
]

unique = deduplicate(data, ["name", "date"])
print(unique)
# [{'name': 'Alice', 'date': '2024-01-01', 'amount': 100},
#  {'name': 'Bob', 'date': '2024-01-01', 'amount': 200}]
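
The function above keeps the first record seen for each composite key. If the most recent record should win instead, a plain dict keyed by the tuple is one possible approach, since assigning to an existing key overwrites its value. This is a sketch under that assumption; deduplicate_keep_last is an illustrative name. Output order follows the first appearance of each key, because dicts preserve insertion order.

```python
def deduplicate_keep_last(records, keys):
    """Keep the last record for each unique combination of keys."""
    latest = {}
    for record in records:
        # A later record with the same composite key replaces the earlier one.
        latest[tuple(record.get(k) for k in keys)] = record
    return list(latest.values())

data = [
    {"name": "Alice", "date": "2024-01-01", "amount": 100},
    {"name": "Alice", "date": "2024-01-01", "amount": 150},  # replaces the first
    {"name": "Bob", "date": "2024-01-01", "amount": 200},
]

print(deduplicate_keep_last(data, ["name", "date"]))
# [{'name': 'Alice', 'date': '2024-01-01', 'amount': 150},
#  {'name': 'Bob', 'date': '2024-01-01', 'amount': 200}]
```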

Using Pandas for Large-Scale Comparison

python
import pandas as pd

df1 = pd.DataFrame([
    {"name": "Alice", "dept": "Eng", "salary": 100000},
    {"name": "Bob", "dept": "Sales", "salary": 80000},
])

df2 = pd.DataFrame([
    {"name": "Alice", "dept": "Eng", "bonus": 10000},
    {"name": "Charlie", "dept": "Eng", "bonus": 8000},
])

# Merge on composite key (name, dept)
merged = pd.merge(df1, df2, on=["name", "dept"], how="inner")
print(merged)
#     name dept  salary  bonus
# 0  Alice  Eng  100000  10000

# Find records in df1 not in df2
left_only = pd.merge(df1, df2, on=["name", "dept"], how="left", indicator=True)
missing = left_only[left_only["_merge"] == "left_only"]
print(missing[["name", "dept"]])
#   name   dept
# 1  Bob  Sales

Common Pitfalls

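  • Using unhashable values in composite keys: A tuple key such as (d["name"], d["list_field"]) cannot be added to a set or used as a dict key if any element is unhashable (a list, dict, or set); hashing it raises TypeError. Convert such values to a hashable form before building the key tuple, for example a list to a tuple or a set to a frozenset. As a blunt fallback, tuple(str(d[k]) for k in keys) works, but it also makes 1 and "1" compare equal.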
  • Forgetting about missing keys: If a dictionary lacks one of the comparison keys, d[k] raises KeyError. d.get(k) returns None instead, but then a missing key and a key explicitly set to None compare as equal. If that distinction matters, validate that all required keys exist before comparing.
  • Case-sensitive string comparison: {"name": "Alice"} and {"name": "alice"} are not equal. If case-insensitive matching is needed, normalize values first: d.get(k, "").lower() for string values.
  • Comparing floating-point values: Direct equality (==) on float values may fail due to precision. Use math.isclose(d1[k], d2[k]) for floating-point comparisons, or round to a fixed number of decimal places before comparing.
  • O(n*m) performance with nested loops: Comparing every record in list A against every record in list B is O(n*m). Convert one list to a set of key tuples first for O(n+m) performance, or use pandas merge for large datasets.
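
The unhashable-value and case-sensitivity pitfalls above can both be handled at the point where the key tuple is built. The following is a minimal sketch of that idea, not a definitive implementation: normalize and composite_key are hypothetical helpers, and the choice to case-fold every string (including strings nested inside lists) is an assumption that may not suit all data.

```python
def normalize(value):
    """Hypothetical helper: return a hashable, case-folded form of value."""
    if isinstance(value, str):
        return value.casefold()
    if isinstance(value, (list, tuple)):
        return tuple(normalize(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(normalize(v) for v in value)
    if isinstance(value, dict):
        return frozenset((k, normalize(v)) for k, v in value.items())
    return value  # ints, floats, None, etc. are left unchanged

def composite_key(record, keys):
    """Build a hashable composite key; missing keys become None."""
    return tuple(normalize(record.get(k)) for k in keys)

a = {"name": "Alice", "tags": ["x", "y"]}
b = {"name": "ALICE", "tags": ["X", "Y"]}
keys = ["name", "tags"]

print(composite_key(a, keys))                            # ('alice', ('x', 'y'))
print(composite_key(a, keys) == composite_key(b, keys))  # True
print(composite_key(b, keys) in {composite_key(a, keys)})  # True
```

Unlike the str() fallback, this keeps 1 and "1" distinct, at the cost of a little more code.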

Summary

  • Compare dictionaries on specific keys with all(d1.get(k) == d2.get(k) for k in keys)
  • Extract key subsets with {k: d[k] for k in keys if k in d} for partial dictionary comparison
  • Convert composite keys to tuples for O(1) set lookups and efficient record matching
  • Use collections.defaultdict for grouping records by composite keys
  • Use pandas.merge(on=[keys]) for efficient large-scale record matching and comparison
  • Handle edge cases: missing keys (use .get()), case sensitivity, and unhashable values
