Configuring TensorFlow to Use All CPUs

Introduction

TensorFlow usually uses multiple CPU cores automatically, but "use all CPUs" is more subtle than it sounds. The best performance depends on how TensorFlow schedules individual operations, how many operations run in parallel, and whether other libraries such as NumPy, oneDNN, or your data loader are competing for the same cores.

A good configuration starts with explicit thread settings, then measures actual throughput. On CPU workloads, more threads do not always mean faster training.

Understanding TensorFlow CPU Threading

TensorFlow exposes two important controls: intra_op_parallelism_threads, which sets threads used inside one operation, and inter_op_parallelism_threads, which sets how many independent operations can run at the same time.

If both values are left at their default of 0, TensorFlow picks settings based on the machine and runtime. That is often fine, but setting them manually makes performance more repeatable across machines and runs.

Configuring Thread Counts

You should configure threading before building tensors or models:

python
import os
import tensorflow as tf

# os.cpu_count() can return None, so fall back to 1.
cpu_count = os.cpu_count() or 1

# Set both values before any tensor or model is created.
tf.config.threading.set_intra_op_parallelism_threads(cpu_count)
tf.config.threading.set_inter_op_parallelism_threads(2)

print("Logical CPU count:", cpu_count)
print("Intra-op:", tf.config.threading.get_intra_op_parallelism_threads())
print("Inter-op:", tf.config.threading.get_inter_op_parallelism_threads())

This gives each heavy op access to all logical CPUs while allowing a small amount of concurrent scheduling. It is a reasonable starting point for dense CPU-heavy inference or training.

Verifying The Runtime

A short benchmark is better than assuming the settings worked. The snippet below runs repeated matrix multiplications and reports elapsed time:

python
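import time
import tensorflow as tf

a = tf.random.uniform((4000, 4000))
b = tf.random.uniform((4000, 4000))

# Warm-up run so one-time initialization is not included in the timing.
_ = tf.matmul(a, b).numpy()

# perf_counter() is a monotonic clock intended for measuring intervals.
start = time.perf_counter()
for _ in range(5):
    c = tf.matmul(a, b)
    _ = c.numpy()  # force the result to be computed before timing stops

elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.2f} seconds")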

Run the benchmark with several thread settings and compare results. CPU tuning is empirical. A machine with many logical CPUs may perform better with fewer TensorFlow threads if the workload is memory-bound.
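Because thread settings are meant to be fixed at process startup, one way to compare several configurations is to launch the benchmark in a fresh process for each one. The following is a minimal sketch of that idea, not a definitive harness. It assumes the thread counts are passed through the OMP_NUM_THREADS, TF_NUM_INTRAOP_THREADS, and TF_NUM_INTEROP_THREADS environment variables, and the helper names build_env, time_command, and sweep are hypothetical.

```python
import os
import subprocess
import sys
import time


def build_env(intra, inter, base_env=None):
    """Return a copy of the environment with the thread-count variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_NUM_INTRAOP_THREADS"] = str(intra)
    env["TF_NUM_INTEROP_THREADS"] = str(inter)
    # Assumption: OpenMP-based kernels should match the intra-op count.
    env["OMP_NUM_THREADS"] = str(intra)
    return env


def time_command(command, env):
    """Run command in a fresh process and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start


def sweep(command, configs):
    """Time command once per (intra, inter) pair.

    Returns a dict mapping (intra, inter) to elapsed seconds.
    """
    return {
        (intra, inter): time_command(command, build_env(intra, inter))
        for intra, inter in configs
    }
```

A call such as sweep([sys.executable, "benchmark.py"], [(2, 1), (4, 2), (8, 2)]) would time a hypothetical benchmark.py under three settings. The measured time includes process startup and the TensorFlow import, so the benchmark should run long enough for that overhead to be negligible, or print its own timing.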

Data Pipeline And Environment Variables

CPU usage is not limited to TensorFlow kernels. Input preprocessing can become the real bottleneck. If you use tf.data, enable parallel mapping and prefetching:

python
dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)

You can also coordinate external math libraries with environment variables before Python starts:

bash
export OMP_NUM_THREADS=8
export TF_NUM_INTRAOP_THREADS=8
export TF_NUM_INTEROP_THREADS=2
python train.py

These values matter when underlying kernels use OpenMP or oneDNN. If you set both TensorFlow APIs and environment variables, keep them consistent so you do not accidentally create conflicting limits.
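One way to keep the two sources consistent is to check the environment before applying values through the API. The sketch below is an illustration under stated assumptions rather than a recommended pattern: resolve_threads and configure_tensorflow are hypothetical helper names, and treating a mismatch as an error is a design choice, not TensorFlow behavior.

```python
import os


def resolve_threads(name, requested, environ=None):
    """Return the thread count to use for one setting.

    If the environment variable `name` is set, it must agree with
    `requested`; a mismatch raises ValueError instead of silently
    picking one of the two values.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return requested
    value = int(raw)
    if value != requested:
        raise ValueError(f"{name}={value} conflicts with requested {requested}")
    return value


def configure_tensorflow(intra, inter):
    """Apply checked thread counts through the TensorFlow API."""
    intra = resolve_threads("TF_NUM_INTRAOP_THREADS", intra)
    inter = resolve_threads("TF_NUM_INTEROP_THREADS", inter)
    # Imported here so the consistency check runs before TensorFlow loads.
    import tensorflow as tf
    tf.config.threading.set_intra_op_parallelism_threads(intra)
    tf.config.threading.set_inter_op_parallelism_threads(inter)
    return intra, inter
```

Calling configure_tensorflow(8, 2) at the top of train.py would then fail fast if the shell exported different values.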

Pinning To CPU Explicitly

If a machine also has GPUs, you may want to force a CPU-only run for testing or deployment:

python
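import tensorflow as tf

# Hide all GPUs from TensorFlow; call this before any op has run.
tf.config.set_visible_devices([], "GPU")

# Place the ops explicitly on the first CPU device.
with tf.device("/CPU:0"):
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    y = tf.reduce_sum(x)
    print(y.numpy())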

This does not increase CPU usage by itself, but it ensures the runtime stays on the CPU path while you benchmark and tune.

Common Pitfalls

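The biggest mistake is setting thread counts after TensorFlow has already initialized the runtime, for example after the first tensor has been created. In TensorFlow 2 the setters then typically raise a RuntimeError instead of applying the new values. Configure threading at process startup.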
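A small guard can make this failure easier to diagnose. The sketch below assumes the setters raise RuntimeError once the runtime is initialized, which is worth confirming for your TensorFlow version. The helper name set_threads_or_explain is hypothetical, and the threading API is passed in as an argument (expected to be tf.config.threading) so the helper can be imported without loading TensorFlow.

```python
def set_threads_or_explain(threading_api, intra, inter):
    """Apply thread counts and report whether they took effect.

    `threading_api` is expected to be `tf.config.threading`.
    Returns a tuple (ok, message).
    """
    try:
        threading_api.set_intra_op_parallelism_threads(intra)
        threading_api.set_inter_op_parallelism_threads(inter)
    except RuntimeError as exc:
        # Assumption: a RuntimeError here means the runtime already started.
        return False, f"configure threading before running any op: {exc}"
    return True, "thread settings applied"
```

In a real script this would be called as set_threads_or_explain(tf.config.threading, 8, 2) immediately after importing TensorFlow.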

Another issue is confusing logical CPUs with physical cores. os.cpu_count() includes hyper-threads on many systems, which can overstate the number of threads worth assigning to heavy numeric kernels.
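On Linux, one rough way to estimate physical cores is to count unique (physical id, core id) pairs in /proc/cpuinfo. The sketch below assumes that file format, which is typical on x86 Linux but not guaranteed elsewhere, and falls back to os.cpu_count() when the fields are missing. The function names count_physical_cores and suggested_intra_op_threads are hypothetical.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text.

    Returns 0 when the text carries no such fields.
    """
    cores = set()
    # Each logical CPU is described by one blank-line-separated block.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" in fields and "core id" in fields:
            cores.add((fields["physical id"], fields["core id"]))
    return len(cores)


def suggested_intra_op_threads(cpuinfo_path="/proc/cpuinfo"):
    """Prefer the physical core count; fall back to the logical count."""
    try:
        with open(cpuinfo_path, encoding="utf-8") as handle:
            physical = count_physical_cores(handle.read())
    except OSError:
        physical = 0
    return physical or os.cpu_count() or 1
```

The result is only a starting value for the intra-op setting; whether hyper-threads help still depends on the workload and should be checked with a benchmark.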

A final pitfall is tuning only model execution and ignoring data loading. If preprocessing is single-threaded, the training step will wait for input no matter how many CPU threads TensorFlow has available.

Summary

  • TensorFlow already uses multiple CPUs, but explicit thread settings improve repeatability.
  • The key knobs are intra-op and inter-op parallelism.
  • Benchmark several configurations instead of assuming the maximum thread count is optimal.
  • Tune the data pipeline along with compute kernels.
  • Set threading before TensorFlow initializes the runtime.
