How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?

Introduction

There is no single fixed speedup for NCHW versus NHWC in TensorFlow with cuDNN. The faster layout depends on the TensorFlow version, the specific operators in the model, the GPU generation, and whether TensorFlow must insert transpose operations to satisfy the kernels it chooses.

What the two layouts mean

The names describe the order of dimensions in a 4D image tensor:

NHWC means batch, height, width, channels.
NCHW means batch, channels, height, width.
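To make the difference concrete, the sketch below computes where an element lands in memory under each layout. It assumes dense, row-major storage, and the helper names (`flat_offset`, `nhwc_offset`, `nchw_offset`) and the tiny tensor size are illustrative only, not part of TensorFlow.

```python
def flat_offset(index, shape):
    """Row-major offset of a multi-dimensional index (assumes dense storage)."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset

# Hypothetical tiny tensor: batch 1, 2x2 image, 3 channels.
N, H, W, C = 1, 2, 2, 3

def nhwc_offset(n, h, w, c):
    return flat_offset((n, h, w, c), (N, H, W, C))

def nchw_offset(n, h, w, c):
    return flat_offset((n, c, h, w), (N, C, H, W))

# Distance in elements between neighbouring channels of the same pixel.
nhwc_channel_stride = nhwc_offset(0, 0, 0, 1) - nhwc_offset(0, 0, 0, 0)
nchw_channel_stride = nchw_offset(0, 0, 0, 1) - nchw_offset(0, 0, 0, 0)

# Distance in elements between neighbouring pixels in the same row and channel.
nhwc_width_stride = nhwc_offset(0, 0, 1, 0) - nhwc_offset(0, 0, 0, 0)
nchw_width_stride = nchw_offset(0, 0, 1, 0) - nchw_offset(0, 0, 0, 0)

print("NHWC strides (channel, width):", nhwc_channel_stride, nhwc_width_stride)
print("NCHW strides (channel, width):", nchw_channel_stride, nchw_width_stride)
```

In NHWC the channels of one pixel sit next to each other, while in NCHW each channel is a contiguous plane. Which access pattern a kernel prefers is one reason the faster layout varies.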

TensorFlow APIs often default to channels_last, which corresponds to NHWC. Low-level GPU discussions historically focused on NCHW, because many cuDNN convolution paths were optimized for channel-first layouts on older stacks.

Why the answer is benchmark-dependent

Older advice often said that NCHW is faster on NVIDIA GPUs. That was true often enough to become common folklore, but modern TensorFlow guidance is more nuanced. Current TensorFlow performance guidance explicitly recommends preferring channel-last NHWC layouts for many workloads because they work well with Tensor Cores and avoid extra transpose overhead.

In practice, the real question is not "is NCHW always faster?" but "does my exact model run faster after layout conversions, kernel selection, mixed precision, and memory traffic are all accounted for?"

That is why two teams can report different winners and both be correct.

Benchmark both layouts in your own model

A simple benchmark is more reliable than a rule of thumb. The code below compares two Conv2D layers, one using channels_last and one using channels_first:

```python
import time
import tensorflow as tf

def benchmark(data_format, x, steps=50):
    # One 3x3 convolution in the requested layout.
    layer = tf.keras.layers.Conv2D(
        filters=64,
        kernel_size=3,
        padding="same",
        data_format=data_format,
    )

    @tf.function
    def step(tensor):
        return layer(tensor)

    # Warm-up runs: exclude tracing and other one-time setup from the timing.
    for _ in range(10):
        _ = tf.reduce_sum(step(x)).numpy()

    start = time.perf_counter()
    for _ in range(steps):
        # .numpy() copies the result to the host, so each step finishes before the next.
        _ = tf.reduce_sum(step(x)).numpy()

    # Average seconds per step.
    return (time.perf_counter() - start) / steps

# The same logical shape in both layouts. Intended for a GPU; some CPU builds
# may not support channels_first convolutions.
nhwc_input = tf.random.normal([32, 224, 224, 64])
nchw_input = tf.random.normal([32, 64, 224, 224])

nhwc_time = benchmark("channels_last", nhwc_input)
nchw_time = benchmark("channels_first", nchw_input)

print("NHWC:", nhwc_time)
print("NCHW:", nchw_time)
```

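This kind of measurement is more useful than a generic multiplier because it reflects your own hardware and TensorFlow build. Treat it as a starting point: it times a single convolution, so extend it to the full model before drawing conclusions.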
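A single average can hide noisy runs. The sketch below is one possible timing harness that records each run separately and reports the median; `time_callable` and `busy_work` are hypothetical names, and `busy_work` is only a CPU stand-in so the example runs without a GPU.

```python
import statistics
import time

def time_callable(fn, warmup=10, repeats=50):
    """Time fn() per call and summarize the samples (times in seconds)."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "samples": len(samples),
    }

def busy_work():
    # Stand-in workload; replace with a call that blocks until the GPU is done.
    return sum(i * i for i in range(10_000))

stats = time_callable(busy_work, warmup=2, repeats=5)
print(stats)
```

For a TensorFlow measurement, the callable has to wait for the result, for example by wrapping the `tf.reduce_sum(step(x)).numpy()` expression used in the benchmark above.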

When each layout tends to make sense

On modern TensorFlow stacks, NHWC is often the safer default, especially when you are using high-level Keras APIs and want TensorFlow to optimize the graph naturally. It also tends to be the better choice when the rest of the ecosystem, preprocessing pipeline, or deployment format already uses channel-last tensors.

NCHW can still make sense in specialized situations:

legacy code that already stores tensors in channel-first format,
custom CUDA or cuDNN integrations,
or narrowly benchmarked workloads where channel-first kernels are measurably faster end to end.

Even in those cases, you have to account for layout conversion costs. A theoretically faster convolution kernel can lose in real workloads if every layer boundary requires transposes.
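The arithmetic behind that trade-off can be sketched in a few lines. All numbers here are hypothetical and chosen only to illustrate the effect; `end_to_end_us` is an illustrative helper, not a TensorFlow API, and real transpose costs have to be measured.

```python
def end_to_end_us(kernel_us, num_layers, transpose_us=0, num_transposes=0):
    """Total time in microseconds: convolution kernels plus layout transposes."""
    return kernel_us * num_layers + transpose_us * num_transposes

# Hypothetical model with 10 convolution layers.
# NHWC: slower kernel in this made-up case, but no transposes needed.
nhwc_total = end_to_end_us(kernel_us=1000, num_layers=10)

# NCHW: 20% faster kernel in this made-up case, but a transpose
# before and after every layer.
nchw_total = end_to_end_us(
    kernel_us=800, num_layers=10, transpose_us=300, num_transposes=20
)

print("NHWC total (us):", nhwc_total)
print("NCHW total (us):", nchw_total)
```

With these made-up inputs the faster kernel loses overall, because the transposes cost more than the kernels save.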

Common Pitfalls

The most common mistake is quoting a fixed number such as "NCHW is 30 percent faster" without stating hardware, TensorFlow version, model shape, and precision mode. That claim is not portable.
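One way to keep a reported number portable is to store it together with its context. The following is a minimal sketch, assuming a hypothetical `BenchmarkRecord` class; every field value is a placeholder, not a real measurement.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class BenchmarkRecord:
    """A timing result plus the context needed to interpret it."""
    layout: str
    seconds_per_step: float
    gpu: str
    tensorflow_version: str
    input_shape: tuple
    precision: str

# Placeholder values for illustration only. In a real script the version
# could come from tf.__version__ and the timing from your own benchmark.
record = BenchmarkRecord(
    layout="NHWC",
    seconds_per_step=0.0123,
    gpu="example-gpu",
    tensorflow_version="2.x",
    input_shape=(32, 224, 224, 64),
    precision="float32",
)

print(json.dumps(asdict(record), indent=2))
```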

Another problem is benchmarking one layer in isolation and assuming the result applies to an entire model. Real networks include normalization, pooling, reshapes, residual connections, and input pipelines, all of which affect the total result.

Be careful with framework defaults too. If most of your code uses channels_last and you force channels_first in only one part of the model, TensorFlow may add expensive layout conversions.
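Each such conversion is a transpose with a fixed axis permutation. The sketch below derives that permutation from the layout names; `layout_perm` and `permute_shape` are hypothetical helpers written for this example. The resulting list is the kind of value you would pass as `perm` to `tf.transpose`.

```python
def layout_perm(src, dst):
    """Axis permutation that converts a tensor from layout src to layout dst."""
    if sorted(src) != sorted(dst) or len(set(src)) != len(src):
        raise ValueError(f"incompatible layouts: {src!r} and {dst!r}")
    # Output axis i takes its data from the source axis with the same letter.
    return [src.index(axis) for axis in dst]

def permute_shape(shape, perm):
    """Shape of the tensor after applying the permutation."""
    return [shape[i] for i in perm]

to_nchw = layout_perm("NHWC", "NCHW")
to_nhwc = layout_perm("NCHW", "NHWC")

nhwc_shape = [32, 224, 224, 64]
nchw_shape = permute_shape(nhwc_shape, to_nchw)

print("NHWC -> NCHW perm:", to_nchw)
print("NCHW -> NHWC perm:", to_nhwc)
print("NCHW shape:", nchw_shape)
```

Every place where this permutation would have to be applied in a model is a copy of the whole activation tensor, so counting those places is a quick way to estimate how much a mixed-layout model pays.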

Finally, remember that CPU and GPU behavior are not the same. A layout that helps one execution target may hurt another.

Summary

There is no universal speedup number for NCHW versus NHWC in TensorFlow/cuDNN.
Modern TensorFlow guidance often favors NHWC because it reduces transpose overhead and works well with Tensor Cores.
Historical advice about NCHW being faster came from older GPU and cuDNN optimization patterns.
The right answer for your model is to benchmark the full workload on your actual hardware.
Treat layout choice as a performance experiment, not as a permanent rule.