CUDA
stream compaction
parallel computing
GPU algorithms
efficient computing

CUDA stream compaction algorithm

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Stream compaction means filtering an input array so that only elements matching a predicate remain, while preserving their original order. On CUDA, the classic parallel solution is a three-stage pipeline: compute a flag for each element, run a prefix scan over the flags, and scatter the selected elements into their compacted output positions.

Why Compaction Needs More Than A Filter Kernel

On the CPU, filtering often looks like a simple loop with a write cursor. On the GPU, thousands of threads decide in parallel whether an element survives, so you need a way to give every surviving element a unique output index without race conditions.

That is why stream compaction usually has these stages:

  • map each element to a keep or discard flag
  • compute prefix sums over the flags
  • scatter kept elements to their output indices

The scan stage is the key step because it turns local boolean decisions into global write positions.

Step 1: Build The Flags

Suppose the predicate is "keep positive numbers".

Input:

text
[3, -1, 0, 7, 4, -2]

Flags:

text
[1, 0, 0, 1, 1, 0]

Each thread can compute one flag independently.

Step 2: Exclusive Scan

An exclusive scan of the flags gives the destination index for every kept element.

Flags:

text
[1, 0, 0, 1, 1, 0]

Exclusive scan:

text
[0, 1, 1, 1, 2, 3]

Now any thread with flag 1 writes its input element to the index shown by the scan value.

The output becomes:

text
[3, 7, 4]

A Practical CUDA Example With Thrust

In real code, using a tested library is often better than hand-writing a full scan pipeline from scratch. Thrust provides this directly.

cpp
1#include <thrust/device_vector.h>
2#include <thrust/copy.h>
3#include <thrust/execution_policy.h>
4#include <iostream>
5
6struct is_positive {
7    __host__ __device__ bool operator()(int x) const {
8        return x > 0;
9    }
10};
11
12int main() {
13    thrust::device_vector<int> input{3, -1, 0, 7, 4, -2};
14    thrust::device_vector<int> output(input.size());
15
16    auto end_it = thrust::copy_if(
17        thrust::device,
18        input.begin(), input.end(),
19        output.begin(),
20        is_positive()
21    );
22
23    output.resize(end_it - output.begin());
24
25    for (int x : output) {
26        std::cout << x << " ";
27    }
28    std::cout << "\n";
29}

This already performs the conceptual compaction pipeline internally.

Hand-Written Kernel Structure

If you implement compaction manually, the rough structure is:

  1. launch a kernel to compute flags
  2. run a parallel exclusive scan on the flags
  3. launch a scatter kernel using flags and scanned offsets

The scatter kernel looks conceptually like this:

cpp
1__global__ void scatter(const int* input, const int* flags, const int* offsets, int* output, int n) {
2    int i = blockIdx.x * blockDim.x + threadIdx.x;
3    if (i < n && flags[i]) {
4        output[offsets[i]] = input[i];
5    }
6}

The scan itself is the harder part and is usually where shared memory and block-level synchronization matter most.

Performance Considerations

Good CUDA compaction depends on more than asymptotic complexity. Real performance also depends on:

  • memory coalescing
  • minimizing branch divergence in the predicate kernel
  • efficient scan implementation
  • reducing global-memory traffic between stages

For many workloads, a library scan or copy_if implementation is the correct engineering choice unless you are studying the algorithm itself.

Common Pitfalls

The most common mistake is trying to compact in one kernel with a shared counter and atomics for every surviving element. That can work, but it often scales poorly and loses the clean ordering properties of the scan-based method.

Another mistake is forgetting that stable output order requires the prefix-scan logic. Without it, parallel writes become nondeterministic.

A third issue is overengineering a custom scan when a battle-tested primitive from Thrust or CUB would be faster, safer, and easier to maintain.

Summary

  • CUDA stream compaction is usually implemented as flag, scan, and scatter.
  • The scan stage assigns unique output positions to surviving elements.
  • A library primitive such as thrust::copy_if is often the best practical solution.
  • Manual implementations need careful scan logic, memory layout, and synchronization.
  • Preserving order cleanly is one of the main reasons the scan-based design is preferred.

Course illustration
Course illustration

All Rights Reserved.