C 64-bit loop performance on x86
Introduction
When writing performance-critical loops in C on x86-64, the choice of data types and loop structure can significantly affect execution speed. Switching between 32-bit and 64-bit loop variables, array indices, and pointer arithmetic can produce surprising performance differences because of how the CPU pipeline, registers, and instruction encoding interact on the x86-64 architecture.
x86-64 Architecture Basics
The x86-64 architecture extends x86 in ways that directly impact loop performance:
- 16 general-purpose registers (up from 8 in x86-32): rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8-r15
- Larger address space: pointers are 64 bits wide and can architecturally address up to 2^64 bytes, although current CPUs implement 48 or 57 bits of virtual address
- REX prefix: Instructions that use a 64-bit operand size, or that touch any of r8-r15, need an extra prefix byte (the REX prefix), slightly increasing code size
- Default operand size is 32-bit: Most arithmetic instructions default to 32-bit operands; selecting 64-bit operands requires the REX.W bit
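The encoding cost described above can be made concrete with a few hand-assembled instructions. The sketch below stores the machine-code bytes of `add reg, 1` for four registers as C arrays; the array and helper names are illustrative only, and the bytes are data here, never executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-assembled encodings of "add reg, 1" (opcode 83 /0 ib). */
static const uint8_t add_eax_1[] = { 0x83, 0xC0, 0x01 };       /* add eax, 1: no prefix       */
static const uint8_t add_rax_1[] = { 0x48, 0x83, 0xC0, 0x01 }; /* add rax, 1: REX.W           */
static const uint8_t add_r8d_1[] = { 0x41, 0x83, 0xC0, 0x01 }; /* add r8d, 1: REX.B, 32-bit   */
static const uint8_t add_r8_1[]  = { 0x49, 0x83, 0xC0, 0x01 }; /* add r8, 1:  REX.W and REX.B */

/* In 64-bit mode a REX prefix is a byte in the range 0x40..0x4F. */
static int has_rex(const uint8_t *insn) {
    assert(insn != NULL);
    return (insn[0] & 0xF0) == 0x40;
}

/* REX.W (bit 3 of the prefix) selects a 64-bit operand size. */
static int has_rex_w(const uint8_t *insn) {
    return has_rex(insn) && (insn[0] & 0x08) != 0;
}
```

Note that the 32-bit `add r8d, 1` still pays for a prefix byte: the saving from 32-bit operands only applies when the compiler can keep the value in one of the eight legacy registers.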
32-bit vs 64-bit Loop Counters
A common performance observation is that using a 32-bit loop counter (int or unsigned int) can be faster than a 64-bit counter (size_t, or long on LP64 systems such as Linux and macOS; long is still 32 bits on 64-bit Windows).
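As a minimal sketch of the two variants being compared, the functions below sum the same array with a 32-bit and a 64-bit counter. The function names and the choice of a summation loop are assumptions made for illustration; any tight loop over an array would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit counter: n is limited to what unsigned int can represent. */
int64_t sum_u32(const int32_t *data, unsigned int n) {
    assert(data != NULL || n == 0);
    int64_t total = 0;
    for (unsigned int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* 64-bit counter: can cover any array that fits in memory. */
int64_t sum_u64(const int32_t *data, size_t n) {
    assert(data != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Both functions return the same result; any speed difference comes only from the code the compiler generates for the counter and the address computation.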
Why 32-bit Can Be Faster
- Smaller instruction encoding: 32-bit operations on the eight legacy registers do not need the REX prefix, producing smaller code that fits better in the instruction cache.
- Implicit zero-extension: Writing to a 32-bit register automatically zeros the upper 32 bits, so the result does not depend on the register's previous upper half and widening an unsigned value to 64 bits needs no extra instruction.
- Loop alignment: Smaller loop bodies are more likely to fit within a single aligned fetch block, reducing fetch penalties.
When 64-bit Is Necessary
You must use 64-bit counters when:
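- Iterating over arrays with more than INT_MAX (2^31 - 1) elements for int, or more than UINT_MAX (2^32 - 1) elements for unsigned int
- The counter feeds offset arithmetic (for example i * stride) whose result can exceed 32 bits on large buffers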
- The loop counter itself can exceed INT_MAX
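The conditions above can be checked before choosing a counter type. The helpers below are a hypothetical sketch (none of these names come from a standard library) that uses fixed-width types so the limits do not depend on the platform's int size.

```c
#include <assert.h>
#include <stdint.h>

/* 1 if a loop "for (uint32_t i = 0; i < count; i++)" can reach count. */
int fits_u32_counter(uint64_t count) {
    return count <= UINT32_MAX;
}

/* 1 if a loop "for (int32_t i = 0; i < count; i++)" can reach count. */
int fits_i32_counter(uint64_t count) {
    return count <= (uint64_t)INT32_MAX;
}

/* What a careless narrowing conversion does: keeps only the low 32 bits. */
uint32_t wrapping_narrow(uint64_t count) {
    return (uint32_t)count;
}

/* Narrowing that refuses (in debug builds) to lose information. */
uint32_t checked_narrow(uint64_t count) {
    assert(count <= UINT32_MAX);
    return (uint32_t)count;
}
```

A count of 2^32 + 5 silently becomes 5 after `wrapping_narrow`, so a loop bounded by the narrowed value would process only the first five elements of the buffer.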
Compiler Optimization and Sign Extension
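A subtle issue arises when using int (signed 32-bit) as an index into a 64-bit pointer: the index has to be sign-extended to 64 bits (a movsxd instruction) before it can take part in the address computation. In a simple counted loop the compiler can usually widen the counter once, because signed overflow is undefined, so the extension mostly shows up when indices are loaded from memory or computed in ways the compiler cannot widen.
Using unsigned int avoids sign extension, and zero extension is usually free on x86-64 because any write to a 32-bit register clears the upper half. The trade-off is that unsigned arithmetic is defined to wrap at 2^32, so the compiler must preserve that wraparound and sometimes cannot widen the index.
With size_t or long, no extension is needed at all since the index is already 64-bit, but the REX prefix cost returns.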
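The three index types can be compared on a loop where the extension cannot be hoisted, because every index is loaded from memory. This is a sketch with illustrative function names; the instructions named in the comments are what compilers typically emit, not a guarantee, so inspect the assembly for a specific compiler and flag set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed 32-bit indices: each idx[i] is sign-extended (typically movsxd). */
int64_t gather_int(const int32_t *table, const int *idx, size_t n) {
    assert(table != NULL && idx != NULL);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += table[idx[i]];
    return total;
}

/* Unsigned 32-bit indices: a plain 32-bit load already zero-extends. */
int64_t gather_uint(const int32_t *table, const unsigned int *idx, size_t n) {
    assert(table != NULL && idx != NULL);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += table[idx[i]];
    return total;
}

/* 64-bit indices: no extension needed, but each index is as wide as a pointer. */
int64_t gather_size(const int32_t *table, const size_t *idx, size_t n) {
    assert(table != NULL && idx != NULL);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += table[idx[i]];
    return total;
}
```

On LP64 targets the size_t index array is twice the size of the 32-bit ones, so for index-heavy data the extra memory traffic can matter more than the extension instruction.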
Benchmarking Example
Compile with optimization (-O2 or -O3) to see the real difference; an unoptimized build keeps the counter in memory and hides the effect.
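A minimal timing harness for such a comparison might look like the sketch below. The function names, the use of clock() from the C standard library, and the summation workload are all assumptions for illustration; a serious benchmark would also pin the thread, repeat the measurement, and report variance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*sum_fn)(const uint32_t *data, size_t n);

/* Loop driven by a 32-bit counter. */
uint64_t sum_counter32(const uint32_t *data, size_t n) {
    assert(n <= UINT32_MAX);
    uint64_t total = 0;
    for (uint32_t i = 0; i < (uint32_t)n; i++) total += data[i];
    return total;
}

/* Same loop driven by a 64-bit counter. */
uint64_t sum_counter64(const uint32_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += data[i];
    return total;
}

/* Runs fn reps times, stores the last sum in *out, returns CPU seconds. */
double time_sum(sum_fn fn, const uint32_t *data, size_t n, int reps,
                uint64_t *out) {
    /* volatile forces every store; an aggressive optimizer may still
       hoist the computation itself, so check the generated assembly. */
    volatile uint64_t sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++) sink = fn(data, n);
    clock_t end = clock();
    *out = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum once per variant on a large array, from a main function in a file built with something like `gcc -O2 bench.c -o bench` (the file name is a placeholder), gives two timings to compare. The first call also warms the cache, so running each variant at least twice and discarding the first result is a reasonable precaution.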
Results vary by CPU microarchitecture, but the 32-bit version is often 0-5% faster for tight loops.
Compiler Flags That Affect Loop Performance
Auto-vectorization can dwarf the 32-bit vs 64-bit difference by processing multiple elements per cycle using SIMD instructions.
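Flags that commonly matter here include -O2 and -O3 (optimization level), -march=native (use every instruction set extension of the build machine), -funroll-loops, and -ftree-vectorize on GCC. The sketch below is a loop written to be easy to vectorize, with build lines that ask the compiler to report what it did; the function and file names are placeholders.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build lines to try (saxpy.c is a placeholder file name):
 *   gcc   -O3 -march=native -fopt-info-vec        -c saxpy.c
 *   clang -O3 -march=native -Rpass=loop-vectorize -c saxpy.c
 * Both print a remark for each loop that was vectorized.
 */

/* y[i] = a * x[i] + y[i]; restrict promises that x and y do not overlap,
   which removes the aliasing check that can block vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

If the compiler reports the loop as vectorized with both counter types, the counter width is unlikely to be the bottleneck for that loop.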
Common Pitfalls
- Premature optimization: The 32-bit vs 64-bit loop counter difference is typically small (0-5%). Profile before optimizing and focus on algorithmic improvements first.
- Signed overflow is undefined behavior: Incrementing an int loop counter past INT_MAX is undefined behavior in C. The compiler may optimize on the assumption that signed overflow never happens, producing unexpected results such as a loop that never terminates.
- Ignoring auto-vectorization: Modern compilers can vectorize loops with both 32-bit and 64-bit counters. If the counter type prevents vectorization, the performance loss is much larger than the REX prefix cost.
- Benchmark methodology: Always compile with optimization flags (-O2 or -O3), warm up the cache, and run multiple iterations. Unoptimized code has entirely different bottlenecks.
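The signed-overflow pitfall is easiest to hit with an inclusive upper bound. A loop written as `for (int i = first; i <= last; i++)` overflows i when last equals INT_MAX. The sketch below shows two possible ways around it; the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Fix 1: use a wider counter, so stepping past INT_MAX is well defined. */
long long count_inclusive_wide(int first, int last) {
    long long visited = 0;
    for (long long i = first; i <= last; i++) {
        visited++;
    }
    return visited;
}

/* Fix 2: keep the int counter but test for the end before incrementing.
   Requires first <= last, since the body always runs at least once. */
long long count_inclusive_int(int first, int last) {
    assert(first <= last);
    long long visited = 0;
    for (int i = first; ; i++) {
        visited++;
        if (i == last) break;   /* exit before i++ could overflow */
    }
    return visited;
}
```

Both versions handle a range that ends at INT_MAX, for example from INT_MAX - 3 to INT_MAX, without ever computing INT_MAX + 1 in a signed 32-bit type.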
Summary
- 32-bit loop counters can be slightly faster due to smaller instruction encoding (no REX prefix)
- The difference is typically 0-5% and depends on the CPU microarchitecture
- Use unsigned int for indices when possible to avoid sign-extension overhead
- Use 64-bit counters when arrays can exceed 2^31 elements
- Focus on enabling auto-vectorization (-O2 -march=native) for much larger performance gains

