CUDA stream compaction algorithm
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Stream compaction means filtering an input array so that only elements matching a predicate remain, while preserving their original order. On CUDA, the classic parallel solution is a three-stage pipeline: compute a flag for each element, run a prefix scan over the flags, and scatter the selected elements into their compacted output positions.
Why Compaction Needs More Than A Filter Kernel
On the CPU, filtering often looks like a simple loop with a write cursor. On the GPU, thousands of threads decide in parallel whether an element survives, so you need a way to give every surviving element a unique output index without race conditions.
That is why stream compaction usually has these stages:
- map each element to a keep or discard flag
- compute prefix sums over the flags
- scatter kept elements to their output indices
The scan stage is the key step because it turns local boolean decisions into global write positions.
Step 1: Build The Flags
Suppose the predicate is "keep positive numbers".
Input:
Flags:
Each thread can compute one flag independently.
Step 2: Exclusive Scan
An exclusive scan of the flags gives the destination index for every kept element.
Flags:
Exclusive scan:
Now any thread with flag 1 writes its input element to the index shown by the scan value.
The output becomes:
A Practical CUDA Example With Thrust
In real code, using a tested library is often better than hand-writing a full scan pipeline from scratch. Thrust provides this directly.
This already performs the conceptual compaction pipeline internally.
Hand-Written Kernel Structure
If you implement compaction manually, the rough structure is:
- launch a kernel to compute flags
- run a parallel exclusive scan on the flags
- launch a scatter kernel using flags and scanned offsets
The scatter kernel looks conceptually like this:
The scan itself is the harder part and is usually where shared memory and block-level synchronization matter most.
Performance Considerations
Good CUDA compaction depends on more than asymptotic complexity. Real performance also depends on:
- memory coalescing
- minimizing branch divergence in the predicate kernel
- efficient scan implementation
- reducing global-memory traffic between stages
For many workloads, a library scan or copy_if implementation is the correct engineering choice unless you are studying the algorithm itself.
Common Pitfalls
The most common mistake is trying to compact in one kernel with a shared counter and atomics for every surviving element. That can work, but it often scales poorly and loses the clean ordering properties of the scan-based method.
Another mistake is forgetting that stable output order requires the prefix-scan logic. Without it, parallel writes become nondeterministic.
A third issue is overengineering a custom scan when a battle-tested primitive from Thrust or CUB would be faster, safer, and easier to maintain.
Summary
- CUDA stream compaction is usually implemented as flag, scan, and scatter.
- The scan stage assigns unique output positions to surviving elements.
- A library primitive such as
thrust::copy_ifis often the best practical solution. - Manual implementations need careful scan logic, memory layout, and synchronization.
- Preserving order cleanly is one of the main reasons the scan-based design is preferred.

