C 64-bit loop performance on x86

Introduction

When writing performance-critical loops in C on x86-64, the choice of data types and loop structure can affect execution speed. Switching between 32-bit and 64-bit loop variables, array indices, and pointer arithmetic can produce surprising performance differences because of how instruction encoding, register widths, and the CPU pipeline interact.

x86-64 Architecture Basics

The x86-64 architecture extends x86 in ways that directly impact loop performance:

16 general-purpose registers (up from 8 in 32-bit x86): rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and r8-r15
Larger address space: Pointers are 64 bits wide, although current CPUs implement 48-bit (or, with 5-level paging, 57-bit) virtual addresses rather than the full 2^64 bytes
REX prefix: Instructions that use a 64-bit operand size or any of r8-r15 need an extra prefix byte (the REX prefix), slightly increasing code size
Default operand size is 32-bit: Most arithmetic instructions default to 32-bit operands; a 64-bit operand size requires the REX.W prefix bit
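
To make the REX cost concrete, the sketch below stores hand-assembled encodings of two common loop instructions in both operand sizes and checks that each 64-bit form is the 32-bit form with a 0x48 (REX.W) byte in front. The array names and the helper is_rex_w_form are hypothetical, introduced only for this illustration; the bytes follow the standard opcode tables but are best confirmed with a disassembler such as objdump -d.

```c
#include <stddef.h>
#include <string.h>

/* Hand-assembled encodings, for illustration only. */
const unsigned char add_eax_1[]   = { 0x83, 0xC0, 0x01 };        /* add eax, 1   */
const unsigned char add_rax_1[]   = { 0x48, 0x83, 0xC0, 0x01 };  /* add rax, 1   */
const unsigned char cmp_eax_esi[] = { 0x39, 0xF0 };              /* cmp eax, esi */
const unsigned char cmp_rax_rsi[] = { 0x48, 0x39, 0xF0 };        /* cmp rax, rsi */

/* Returns 1 if `wide` is exactly `narrow` preceded by a REX.W byte (0x48). */
int is_rex_w_form(const unsigned char *narrow, size_t narrow_len,
                  const unsigned char *wide, size_t wide_len)
{
    return wide_len == narrow_len + 1
        && wide[0] == 0x48
        && memcmp(wide + 1, narrow, narrow_len) == 0;
}
```

Calling is_rex_w_form(add_eax_1, sizeof add_eax_1, add_rax_1, sizeof add_rax_1) returns 1: the 64-bit increment costs exactly one extra byte.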

32-bit vs 64-bit Loop Counters

A common performance observation is that a 32-bit loop counter (int or unsigned int) can be faster than a 64-bit counter (size_t, or long on LP64 platforms such as Linux and macOS; on 64-bit Windows, long is still 32 bits):

// Version A: 32-bit counter
void sum_array_32(const int* arr, int n, long long* result) {
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    *result = sum;
}

// Version B: 64-bit counter
void sum_array_64(const int* arr, long n, long long* result) {
    long long sum = 0;
    for (long i = 0; i < n; i++) {
        sum += arr[i];
    }
    *result = sum;
}

Why 32-bit Can Be Faster

Smaller instruction encoding: 32-bit operations on the legacy registers (eax through edi) do not need the REX prefix, so the code is slightly smaller and fits better in the instruction cache.
Implicit zero-extension: Writing to a 32-bit register automatically zeros the upper 32 bits, so the result carries no dependency on the register's previous 64-bit value.
Loop alignment: A smaller loop body is less likely to straddle an instruction-fetch or cache-line boundary, which can reduce fetch penalties.

When 64-bit Is Necessary

You must use 64-bit counters when:

Iterating over arrays with more than INT_MAX (2^31 - 1) elements
The counter is scaled by hand into a byte offset (for example i * sizeof(T)) that can exceed 32 bits on large buffers
The loop counter itself can exceed INT_MAX (or UINT_MAX for an unsigned int counter)
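
The conditions above can be sketched in code. This is a minimal illustration, not a prescribed pattern: count_fits_int, sum_bytes, and next_index are hypothetical helper names. It shows a range check for int counters, a size_t loop that is safe for any object size, and the defined wraparound of unsigned int (incrementing a signed int past INT_MAX, by contrast, is undefined behavior).

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an int counter can index every element of an n-element array. */
int count_fits_int(size_t n)
{
    return n <= (size_t)INT_MAX;
}

/* Sums n bytes using a size_t counter, which cannot overflow for any
   valid object size. */
uint64_t sum_bytes(const unsigned char *buf, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Unsigned arithmetic wraps modulo 2^N, so UINT_MAX + 1 is 0. A loop such as
   `for (unsigned i = 0; i < n; i++)` with n > UINT_MAX would never end. */
unsigned int next_index(unsigned int i)
{
    return i + 1u;
}
```

A caller could test count_fits_int(n) once up front and fall back to the size_t loop when the check fails.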

Compiler Optimization and Sign Extension

A subtle issue arises when using int (signed 32-bit) as an index into a 64-bit pointer:

// The compiler may need to sign-extend i to 64 bits for address calculation
void access(int* arr, int i) {
    arr[i] = 0;  // i must be sign-extended to 64-bit for pointer math
}

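Using unsigned int replaces the sign extension with a zero extension. Zero extension is free when the index was just produced by a 32-bit instruction, because such writes clear the upper half of the register. For a function argument, the compiler typically still emits a 32-bit register move, since the calling convention does not guarantee that the upper bits are zero:

// Zero extension: implicit after a 32-bit write, otherwise a plain 32-bit mov
void access(int* arr, unsigned int i) {
    arr[i] = 0;  // i is zero-extended to 64 bits for the address calculation
}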

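With size_t or long, no extension is needed at all since the index is already 64-bit, but the REX prefix cost returns on the counter's increment and compare instructions. Inside a loop, the compiler can often remove the per-iteration sign extension of an int counter by widening it to 64 bits, which is legal because signed overflow is undefined. An unsigned int counter has defined wraparound, which can block that widening.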
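
To compare the three index types side by side, a small sketch such as the following can be compiled with gcc -O2 -S and the assembly inspected. The function names are hypothetical, and the instructions named in the comments are what GCC and Clang typically emit on x86-64 Linux; they are an expectation, not a guarantee.

```c
#include <stddef.h>

/* int index: typically a sign extension (movslq / movsxd) before the store. */
void store_int(int *arr, int i)
{
    arr[i] = 1;
}

/* unsigned int index: typically a 32-bit register move that zero-extends. */
void store_uint(int *arr, unsigned int i)
{
    arr[i] = 2;
}

/* size_t index: already pointer width, so typically just the store. */
void store_size(int *arr, size_t i)
{
    arr[i] = 3;
}
```

The types also differ semantically: store_int(p + 1, -1) writes p[0], whereas converting -1 to unsigned int gives UINT_MAX, which indexes far outside the array.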

Benchmarking Example

#include <stdio.h>
#include <time.h>

#define N 100000000

// volatile forces a real load and store every iteration, so the loop is not optimized away.
// The array is unsigned so the accumulated sums wrap instead of overflowing a signed int.
void bench_32(volatile unsigned int* arr) {
    for (int i = 0; i < N; i++) {
        arr[i & 0xFF] += (unsigned int)i;
    }
}

void bench_64(volatile unsigned int* arr) {
    for (long i = 0; i < N; i++) {
        arr[i & 0xFF] += (unsigned int)i;  // i < N, so the narrowing is lossless
    }
}

int main(void) {
    unsigned int arr[256] = {0};
    clock_t start;

    start = clock();
    bench_32(arr);
    printf("32-bit: %.3f ms\n", (double)(clock() - start) / CLOCKS_PER_SEC * 1000);

    start = clock();
    bench_64(arr);
    printf("64-bit: %.3f ms\n", (double)(clock() - start) / CLOCKS_PER_SEC * 1000);

    return 0;
}

Compile with optimization to see the real difference:

gcc -O2 -o bench bench.c
./bench

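Results vary by CPU microarchitecture and compiler. The gap is typically small, with the 32-bit version anywhere from no faster to roughly 5% faster on tight loops. In this benchmark the volatile read-modify-write in the loop body likely dominates, so the difference may fall within measurement noise.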

Compiler Flags That Affect Loop Performance

# Enable auto-vectorization (SSE/AVX); already on at -O3, and at -O2 since GCC 12
gcc -O2 -march=native -ftree-vectorize loop.c

# Show vectorization reports
gcc -O2 -march=native -fopt-info-vec-optimized loop.c

# Unroll loops aggressively
gcc -O2 -funroll-loops loop.c

Auto-vectorization can dwarf the 32-bit vs 64-bit difference by processing multiple elements per instruction using SIMD.
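
As an illustration of a loop shape that compilers usually vectorize, the sketch below uses a simple counted loop with restrict-qualified pointers so the compiler can assume the arrays do not overlap. The function name add_arrays is hypothetical, and whether the loop is actually vectorized depends on the compiler version and flags, so it is worth confirming with -fopt-info-vec-optimized.

```c
#include <stddef.h>

/* Element-wise sum of two float arrays. `restrict` promises the compiler that
   dst, a, and b do not alias, which removes a common obstacle to vectorization. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling this with gcc -O3 -march=native -fopt-info-vec-optimized should report the loop as vectorized on recent GCC versions; the counter width matters far less here than the absence of aliasing and the simple trip count.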

Common Pitfalls

Premature optimization: The 32-bit vs 64-bit loop counter difference is typically small (0-5%). Profile before optimizing and focus on algorithmic improvements first.
Signed overflow is undefined behavior: Incrementing an int loop counter past INT_MAX is undefined behavior in C. The compiler may optimize on the assumption that signed overflow never happens, producing unexpected results such as a loop that never terminates.
Ignoring auto-vectorization: Modern compilers can vectorize loops with both 32-bit and 64-bit counters. If a 64-bit counter prevents vectorization, the performance loss is much larger than the REX prefix cost.
Benchmark methodology: Always compile with optimization flags (-O2 or -O3), warm up the cache, and run multiple iterations. Unoptimized code has entirely different bottlenecks.
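
The methodology advice above (warm up, then repeat) can be sketched as a small harness. This is one possible shape under stated assumptions, not a complete benchmarking framework: best_of and bump_counter are hypothetical names, and clock() measures CPU time at a coarse resolution, so the workload should run long enough to register.

```c
#include <time.h>

/* Calls fn(arg) once as a warm-up, then `runs` more times under the clock.
   Returns the fastest timed run in seconds, or -1.0 if runs <= 0. */
double best_of(void (*fn)(void *), void *arg, int runs)
{
    double best = -1.0;

    fn(arg);  /* warm-up: populate caches and branch predictors */
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        fn(arg);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Example workload (hypothetical): 1000 volatile increments of the long
   that arg points to. */
void bump_counter(void *arg)
{
    volatile long *counter = arg;
    for (int i = 0; i < 1000; i++)
        *counter += 1;
}
```

Taking the minimum rather than the mean filters out runs disturbed by interrupts or scheduling; a call such as best_of(bump_counter, &counter, 5) executes the workload six times in total.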

Summary

32-bit loop counters can be slightly faster due to smaller instruction encoding (no REX prefix)
The difference is typically 0-5% and depends on the CPU microarchitecture
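unsigned int indices avoid an explicit sign-extension instruction, but their defined wraparound can stop the compiler from widening the counter; check the generated assembly
Use 64-bit counters when arrays can exceed INT_MAX (2^31 - 1) elements
Focus on enabling auto-vectorization (-O3 -march=native, or -O2 -ftree-vectorize -march=native) for much larger performance gains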