Bit counting in a contiguous memory chunk
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Bit counting over a contiguous memory chunk means counting how many 1 bits appear across a byte buffer. This comes up in bitmap processing, compression, networking, and low-level analytics, so the best implementation depends not only on the bit trick you use but also on how efficiently you scan memory.
Start With a Simple Byte-Wise Scan
The most direct solution is to walk the buffer byte by byte and add the population count of each byte to a running total.
Here is a portable C example using Brian Kernighan's bit-clearing trick for a single byte:
This is easy to reason about and correct on any platform that supports standard C integer types.
Use Built-In Popcount When the Compiler Provides It
Modern compilers often expose a popcount builtin that can map to efficient CPU instructions when available. That is usually better than hand-writing the inner loop.
The same idea can be extended to 32-bit or 64-bit chunks if your platform, alignment, and aliasing rules let you do so safely.
A Lookup Table Is Still Practical
Another classic solution is a 256-entry lookup table for every possible byte value.
This trades a tiny amount of memory for a very small and predictable inner loop.
Memory Access Matters Too
A contiguous memory chunk is friendly to the CPU cache, which is good news. In large buffers, sequential scanning is usually the right baseline. The popcount algorithm is only part of the performance story. Cache behavior, alignment, and loop overhead also matter.
That is why a theoretically clever bit trick can lose in practice if it forces awkward loads or extra repacking steps. If your data is already a byte buffer, count it directly instead of first converting it into some other representation.
Wider Words Can Reduce Overhead
If the buffer is aligned and your platform allows it, processing 64-bit chunks can reduce loop overhead compared with byte-wise iteration. But the tradeoff is portability and complexity.
For many real programs, a good compromise is:
- keep the buffer scan sequential
- use compiler popcount support
- benchmark before attempting wider-word reinterpretation
The memory walk is often more important than the specific micro-optimization.
Common Pitfalls
- Optimizing the popcount math while ignoring the cost of reading the buffer itself.
- Repacking or copying the buffer before counting bits and losing any speed advantage.
- Assuming a hand-written trick will beat the compiler's builtin on modern CPUs.
- Casting to wider integer types without considering alignment and portability.
- Benchmarking tiny synthetic inputs and then drawing conclusions about large real buffers.
Summary
- Bit counting over a contiguous memory chunk usually starts with a sequential scan and a running total.
- A byte-wise implementation is portable and easy to verify.
- Compiler builtins are often the best practical speedup when available.
- Lookup tables remain a useful middle ground between portability and performance.
- For large buffers, memory access patterns matter almost as much as the popcount operation itself.

