Algorithm to produce a difference of two collections of intervals

intervals

algorithm

collections

difference

data processing

Algorithm to produce a difference of two collections of intervals

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Computing the difference between two interval collections means taking every range from collection A and removing the parts covered by collection B. This appears in scheduling, calendar availability, genomic ranges, firewall rules, and timeline processing. The clean solution is to normalize both inputs and then walk them with two pointers.

Define the Operation Clearly

Assume each interval is closed on the left and open on the right, written as start, end. That convention avoids ambiguity at boundaries and is common in programming.

If A contains 1, 10 and B contains 3, 5, the result is 1, 3 and 5, 10. If B fully covers an interval from A, nothing remains. If they do not overlap, the interval from A passes through unchanged.

The algorithm becomes much easier if intervals inside each collection are already sorted and non-overlapping.

Step 1: Normalize Each Collection

Before subtracting, merge overlaps inside A and inside B. Without normalization, the subtraction logic has to reason about many repeated cases.

python

1def merge_intervals(intervals):
2    if not intervals:
3        return []
4
5    intervals = sorted(intervals)
6    merged = [list(intervals[0])]
7
8    for start, end in intervals[1:]:
9        last = merged[-1]
10        if start <= last[1]:
11            last[1] = max(last[1], end)
12        else:
13            merged.append([start, end])
14
15    return [tuple(item) for item in merged]

After this step, both collections are predictable: sorted by start position and internally disjoint.

Step 2: Sweep with Two Pointers

Now traverse A and B together. For each interval in A, advance through overlapping intervals in B and emit the remaining pieces.

python

1def subtract_intervals(a_intervals, b_intervals):
2    a_intervals = merge_intervals(a_intervals)
3    b_intervals = merge_intervals(b_intervals)
4
5    result = []
6    j = 0
7
8    for a_start, a_end in a_intervals:
9        current = a_start
10
11        while j < len(b_intervals) and b_intervals[j][1] <= a_start:
12            j += 1
13
14        k = j
15        while k < len(b_intervals) and b_intervals[k][0] < a_end:
16            b_start, b_end = b_intervals[k]
17
18            if b_start > current:
19                result.append((current, min(b_start, a_end)))
20
21            current = max(current, b_end)
22            if current >= a_end:
23                break
24            k += 1
25
26        if current < a_end:
27            result.append((current, a_end))
28
29    return result
30
31
32print(subtract_intervals([(1, 10), (12, 18)], [(3, 5), (8, 14)]))
33# [(1, 3), (5, 8), (14, 18)]

The variable current marks the left edge of the remaining part of the current interval from A. Every overlap in B either trims the front, splits the interval, or removes the rest entirely.

Why This Works Efficiently

Because both lists are sorted, the algorithm never restarts scanning B from the beginning. Each interval is visited in order, so the sweep phase is linear after sorting.

Time complexity:

Merging each collection costs O(n log n) and O(m log m) because of sorting.
The subtraction sweep costs O(n + m).

That is usually optimal for unsorted input.

Boundary Choices Matter

You must decide whether touching boundaries count as overlap. With half-open intervals, 1, 3 and 3, 5 do not overlap. With closed intervals, boundary handling is different and you may need to emit 1, 2 instead of 1, 3 for integer ranges.

If your domain is dates, timestamps, or floating-point spans, write the boundary rule down first. Many interval bugs are not algorithmic mistakes. They are definition mistakes.

Common Variations

Some systems need the symmetric difference instead of plain subtraction. Others need to preserve metadata, such as which rule or schedule entry produced each segment. In those cases, the same sweep-line idea still applies, but the output structure becomes richer than simple start and end pairs.

Another variation is online processing, where intervals arrive as a stream. If the input is not known in advance, balanced trees or segment structures may be more appropriate than a batch sweep.

Common Pitfalls

The biggest mistake is skipping normalization. If either collection contains internal overlaps, the subtraction logic gets duplicated segments or misses gaps.

Another common bug is mixing closed and half-open semantics. The code may look correct but still produce one-unit errors at boundaries.

Be careful with pointer advancement. If you move the pointer for collection B too aggressively, later intervals from A may miss an overlap that should still be considered.

Summary

Normalize both collections before subtracting.
A two-pointer sweep is the simplest efficient approach for sorted intervals.
Track the current unconsumed start of each interval from A as overlaps are processed.
Define boundary semantics up front, especially for touching intervals.
The overall complexity is dominated by sorting, followed by a linear sweep.