
Keras TensorFlow backend slower on GPU than on CPU when training certain networks


Introduction

Yes, this can happen, and it is usually not a bug. GPUs win when the workload is large enough and regular enough to offset transfer, scheduling, and kernel-launch overhead. For small models, tiny batches, or input pipelines that cannot keep the GPU busy, the CPU can be faster.

Why a GPU Can Lose

A GPU is optimized for large parallel operations such as big matrix multiplications and convolutions. If your model does not generate much work per step, the fixed cost of using the GPU becomes visible.

Common reasons include:

  • the model is small
  • the batch size is too small
  • the input pipeline cannot feed data quickly enough
  • some ops fall back to the CPU
  • the timing includes one-time startup or graph compilation overhead

In those cases, the GPU spends too much time waiting or launching tiny kernels and not enough time doing useful math.
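To make the overhead argument concrete, here is a minimal sketch of a toy cost model in which each training step costs a fixed overhead plus compute time. The device numbers below are made-up assumptions chosen only to illustrate the shape of the tradeoff; they are not measurements of any real CPU or GPU.

```python
def step_time(work_flops, fixed_overhead_s, throughput_flops_per_s):
    """Estimated seconds per step: fixed cost plus time spent on arithmetic."""
    return fixed_overhead_s + work_flops / throughput_flops_per_s


# Illustrative, hypothetical device parameters (assumptions, not benchmarks).
CPU = {"fixed_overhead_s": 20e-6, "throughput_flops_per_s": 1e11}
GPU = {"fixed_overhead_s": 500e-6, "throughput_flops_per_s": 1e13}


def faster_device(work_flops, cpu=CPU, gpu=GPU):
    """Return "cpu" or "gpu", whichever the toy model predicts is faster."""
    cpu_t = step_time(work_flops, **cpu)
    gpu_t = step_time(work_flops, **gpu)
    return "cpu" if cpu_t < gpu_t else "gpu"


def break_even_flops(cpu=CPU, gpu=GPU):
    """Work per step at which both devices take the same time.

    Solves o_cpu + w / t_cpu == o_gpu + w / t_gpu for w.
    """
    overhead_gap = gpu["fixed_overhead_s"] - cpu["fixed_overhead_s"]
    rate_gap = 1.0 / cpu["throughput_flops_per_s"] - 1.0 / gpu["throughput_flops_per_s"]
    return overhead_gap / rate_gap


if __name__ == "__main__":
    for work in (1e6, 1e8, 1e10):
        print(f"{work:.0e} FLOPs per step -> {faster_device(work)}")
    print(f"break-even at about {break_even_flops():.2e} FLOPs per step")
```

Under this model, raising the batch size increases the work per step, which is what pushes a workload past the break-even point. Real hardware is less tidy, so treat the model as intuition rather than a predictor.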

Small Models and Tiny Batches

This is the classic case. A shallow multilayer perceptron with small dense layers may not generate enough arithmetic work to amortize GPU overhead.

If your batch size is 8 or 16, the GPU may be severely underutilized. The CPU, with lower dispatch overhead and fast cache behavior for small workloads, can finish sooner.

A first experiment is to increase batch size and feed data through a tf.data pipeline:

python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data: 10,000 samples, 128 features.
x = np.random.rand(10000, 128).astype("float32")
y = np.random.randint(0, 2, size=(10000, 1)).astype("float32")

# Larger batches plus prefetching keep the device supplied with work.
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(10000).batch(512).prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=3)

This does not guarantee the GPU will win, but it removes two common causes of poor performance: tiny batches and a non-prefetched input pipeline.

Input Pipeline Bottlenecks

Sometimes the model is fine, but the data loader is slow. If Python preprocessing, image decoding, or disk reads dominate the step time, the GPU sits idle waiting for the next batch.

Symptoms include:

  • low GPU utilization
  • high CPU usage in preprocessing threads
  • training time barely changes when you switch to a faster GPU

This is why you should profile the whole training step, not just assume the compute device is the bottleneck.
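One rough way to check whether the input pipeline is the bottleneck is to time the wait for each batch separately from the step itself. The sketch below is framework-agnostic: `profile_steps`, `batches`, and `train_step` are hypothetical names, `batches` can be any iterable (a `tf.data.Dataset` is iterable in eager mode), and `train_step` is assumed to block until its result is ready. If the step returns before the GPU has finished, the split will be misleading.

```python
import time


def profile_steps(batches, train_step, max_steps=50):
    """Split wall-clock time into waiting-for-data and running-the-step."""
    data_s = 0.0
    compute_s = 0.0
    steps = 0
    iterator = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is input-pipeline wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # assumed to block until the step has finished
        t2 = time.perf_counter()
        data_s += t1 - t0
        compute_s += t2 - t1
        steps += 1
    total = data_s + compute_s
    return {
        "steps": steps,
        "data_s": data_s,
        "compute_s": compute_s,
        "data_fraction": data_s / total if total else 0.0,
    }
```

A `data_fraction` close to 1 suggests the device is mostly waiting for input, in which case a faster GPU is unlikely to help. With prefetching enabled, the measured wait is only the part of data loading that was not overlapped with compute.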

CPU Fallbacks and Unsupported Ops

Not every TensorFlow operation has an efficient GPU implementation, and some have none at all. If parts of the graph execute on the CPU, you can end up paying device-transfer overhead without getting full GPU acceleration.

You can inspect device placement during debugging:

python
import tensorflow as tf
tf.debugging.set_log_device_placement(True)  # enable before building the model

If you see frequent movement between CPU and GPU or critical ops staying on the CPU, the slowdown becomes easier to explain.
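Placement logs get long quickly, so it can help to summarize them. The sketch below counts ops per device from log text you have already captured into a string. It assumes lines shaped like `Executing op <OpName> in device <device string>`, with the device string ending in `device:CPU:<n>` or `device:GPU:<n>`; check that assumption against your own TensorFlow version's output. `summarize_placement` and `ops_on` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed log line shape (verify against your TensorFlow version):
#   Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
_PLACEMENT_LINE = re.compile(
    r"Executing op (\S+) in device \S*device:(CPU|GPU):(\d+)"
)


def summarize_placement(log_text):
    """Count how often each op appears on each device in the captured log."""
    counts = Counter()
    for match in _PLACEMENT_LINE.finditer(log_text):
        op_name, kind, index = match.groups()
        counts[(f"{kind}:{index}", op_name)] += 1
    return counts


def ops_on(counts, device_kind):
    """Sorted op names seen on devices of the given kind ("CPU" or "GPU")."""
    return sorted({op for (device, op) in counts if device.startswith(device_kind)})
```

Calling `ops_on(counts, "CPU")` on a log from a GPU run gives a shortlist of ops worth investigating. Some CPU placement is expected, for example input-pipeline ops.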

Benchmarking Mistakes

Timing only the first epoch is a common mistake. The first epoch may include:

  • model tracing
  • memory allocation
  • autotuning
  • dataset warm-up

A fair comparison should usually ignore warm-up and compare steady-state throughput.

Another mistake is measuring wall-clock time around code that forces unnecessary synchronization between the CPU and GPU. That can make the GPU path look worse than it is.
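A simple way to avoid the first-epoch trap is to run a few untimed warm-up calls and report the median of the timed ones. This is a minimal sketch using only the standard library; `steady_state_seconds` is a hypothetical helper, and `fn` is assumed to be a callable that returns only after its work has completed.

```python
import statistics
import time


def steady_state_seconds(fn, warmup=2, repeats=5):
    """Median wall-clock seconds per call, measured after untimed warm-up calls."""
    for _ in range(warmup):
        fn()  # absorbs tracing, allocation, and autotuning costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For a Keras model you might pass something like `lambda: model.fit(dataset, epochs=1, verbose=0)` and run the same measurement once with the GPU visible and once with it hidden. The median is used here because it is less sensitive than the mean to an occasional slow run.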

Practical Tuning Steps

When GPU training looks slower, check these in order:

  1. increase batch size
  2. use tf.data with prefetch
  3. reduce Python-side preprocessing in the training loop
  4. verify that the important ops actually run on the GPU
  5. benchmark after warm-up, not only at startup

If the model is genuinely tiny, the correct conclusion may simply be that the CPU is the better device for that workload.

Common Pitfalls

The most common mistake is assuming a GPU must always be faster for every neural network. That is not how hardware tradeoffs work.

Another mistake is blaming TensorFlow when the real issue is an inefficient input pipeline or a batch size that is too small.

A third pitfall is measuring overall training time without separating data loading, startup overhead, and actual compute.

Summary

  • A GPU can be slower than a CPU when the model or batch size is too small.
  • Input pipeline bottlenecks often hide the benefits of GPU compute.
  • CPU fallbacks and device transfers can erase expected gains.
  • Benchmark after warm-up and inspect device placement before drawing conclusions.
  • For some small networks, the CPU is simply the right tool.
