What is the fastest way to transpose a matrix in C?

Matrix Transposition

C++ Programming

Algorithms

Code Optimization

Computational Efficiency

What is the fastest way to transpose a matrix in C?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Matrix transpose performance is usually limited by memory access patterns, not arithmetic. A naive element-by-element transpose works correctly but can be slow on large matrices due to cache misses. In C, blocked transpose and careful loop ordering are typically the fastest practical approaches.

Baseline Transpose Implementation

Start with a simple reference implementation for correctness.

1#include <stddef.h>
2
3void transpose_naive(const float *src, float *dst, size_t rows, size_t cols) {
4    for (size_t i = 0; i < rows; ++i) {
5        for (size_t j = 0; j < cols; ++j) {
6            dst[j * rows + i] = src[i * cols + j];
7        }
8    }
9}

This is easy to verify, but memory writes to dst are strided and cache-unfriendly.

Improve Cache Locality with Blocking

Blocked transpose processes small tiles that fit better in cache.

1#include <stddef.h>
2
3void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols) {
4    const size_t B = 32;  // tune per hardware
5
6    for (size_t ii = 0; ii < rows; ii += B) {
7        for (size_t jj = 0; jj < cols; jj += B) {
8            size_t i_max = (ii + B < rows) ? ii + B : rows;
9            size_t j_max = (jj + B < cols) ? jj + B : cols;
10
11            for (size_t i = ii; i < i_max; ++i) {
12                for (size_t j = jj; j < j_max; ++j) {
13                    dst[j * rows + i] = src[i * cols + j];
14                }
15            }
16        }
17    }
18}

Blocking reduces cache thrashing and usually gives substantial speedups for large matrices.

In-Place Transpose for Square Matrices

If the matrix is square and stored in one buffer, swap symmetric elements.

1#include <stddef.h>
2
3void transpose_inplace_square(float *a, size_t n) {
4    for (size_t i = 0; i < n; ++i) {
5        for (size_t j = i + 1; j < n; ++j) {
6            float tmp = a[i * n + j];
7            a[i * n + j] = a[j * n + i];
8            a[j * n + i] = tmp;
9        }
10    }
11}

This avoids extra output buffer allocation, but works only for square shapes.

Benchmarking Methodology

Always benchmark with realistic matrix sizes and compiler optimization flags.

bash

gcc -O3 -march=native transpose.c -o transpose

Measure multiple runs, warm caches, and report median timing. Performance conclusions from tiny matrices are often misleading.

Additional Optimization Ideas

For high-end workloads, consider vectorized intrinsics and parallelization with OpenMP. However, blocked scalar code is often a strong baseline and easier to maintain.

If using parallel loops, ensure each thread works on separate blocks to reduce false sharing.

When matrix sizes are known at compile time, specialized kernels can outperform generic loops.

Practical Benchmark Harness in C

A small benchmark harness helps compare naive and blocked versions on your hardware.

1#include <stdio.h>
2#include <stdlib.h>
3#include <time.h>
4
5static double seconds_now(void) {
6    return (double)clock() / CLOCKS_PER_SEC;
7}
8
9void bench(void (*fn)(const float*, float*, size_t, size_t), const float *src, float *dst, size_t r, size_t c, const char *name) {
10    double t0 = seconds_now();
11    fn(src, dst, r, c);
12    double t1 = seconds_now();
13    printf("%s: %.6f s
14", name, t1 - t0);
15}

Use large matrices and multiple repetitions. Record median times and verify output correctness after each run to ensure optimizations do not introduce indexing bugs.

Common Pitfalls

A common pitfall is focusing only on algorithmic complexity. Both naive and blocked transpose are linear in element count, yet cache behavior changes real speed dramatically.

Another issue is choosing block size without profiling. Optimal tile size depends on cache hierarchy and data type.

Developers also benchmark debug builds and draw incorrect conclusions.

A final problem is incorrect index math for non-square matrices, causing silent data corruption.

Summary

Transpose speed is mostly driven by memory access locality.
Naive loops are correct but often slow for large matrices.
Blocked transpose improves cache usage and practical throughput.
In-place transpose is efficient for square matrices only.
Benchmark with optimized builds and representative matrix sizes.