Configuring TensorFlow to Use All CPUs

Introduction

TensorFlow usually uses multiple CPU cores automatically, but "use all CPUs" is more subtle than it sounds. The best performance depends on how TensorFlow schedules individual operations, how many operations run in parallel, and whether other libraries such as NumPy, oneDNN, or your data loader are competing for the same cores.

A good configuration starts with explicit thread settings, then measures actual throughput. On CPU workloads, more threads do not always mean faster training.

Understanding TensorFlow CPU Threading

TensorFlow exposes two important controls: intra_op_parallelism_threads, which sets threads used inside one operation, and inter_op_parallelism_threads, which sets how many independent operations can run at the same time.

If both values are left at their default of 0, TensorFlow picks settings based on the machine and runtime. That is often fine, but setting them manually makes performance more repeatable.
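To see which mode a process is in, it can help to print the current settings in readable form. The sketch below is illustrative only: `describe_thread_setting` and `report_tensorflow_threading` are hypothetical helper names, and the TensorFlow import is deferred so the formatting helper can be used on its own.

```python
def describe_thread_setting(name, value):
    """Format a thread setting; 0 means the runtime chooses the count."""
    if value < 0:
        raise ValueError(f"{name} must be >= 0, got {value}")
    if value == 0:
        return f"{name}: 0 (runtime chooses)"
    return f"{name}: {value} (set explicitly)"


def report_tensorflow_threading():
    """Return one line per thread pool for the current process."""
    # Imported here so the formatting helper above works without TensorFlow.
    import tensorflow as tf

    threading_api = tf.config.threading
    return [
        describe_thread_setting(
            "intra_op", threading_api.get_intra_op_parallelism_threads()
        ),
        describe_thread_setting(
            "inter_op", threading_api.get_inter_op_parallelism_threads()
        ),
    ]
```

Calling `report_tensorflow_threading()` at process startup, before and after any explicit configuration, shows whether the values you intended were actually applied.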

Configuring Thread Counts

You should configure threading before building tensors or models:

python

import os
import tensorflow as tf

# os.cpu_count() reports logical CPUs and can return None, so fall back to 1.
cpu_count = os.cpu_count() or 1

# Set both thread pools before any tensor or model is created.
tf.config.threading.set_intra_op_parallelism_threads(cpu_count)
tf.config.threading.set_inter_op_parallelism_threads(2)

print("Logical CPU count:", cpu_count)
print("Intra-op:", tf.config.threading.get_intra_op_parallelism_threads())
print("Inter-op:", tf.config.threading.get_inter_op_parallelism_threads())

This gives each heavy op access to all logical CPUs while allowing a small amount of concurrent scheduling. It is a reasonable starting point for dense CPU-heavy inference or training.

Verifying The Runtime

A short benchmark is better than assuming the settings worked. The snippet below runs repeated matrix multiplications and reports elapsed time:

python

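import time
import tensorflow as tf

a = tf.random.uniform((4000, 4000))
b = tf.random.uniform((4000, 4000))

# Warm-up run so one-time setup cost is not counted in the timing.
_ = tf.matmul(a, b).numpy()

# perf_counter() is a monotonic clock intended for measuring short intervals.
start = time.perf_counter()
for _ in range(5):
    c = tf.matmul(a, b)
    _ = c.numpy()  # copy the result out so each iteration finishes fully

elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.2f} seconds")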

Run the benchmark with several thread settings and compare results. CPU tuning is empirical. A machine with many logical CPUs may perform better with fewer TensorFlow threads if the workload is memory-bound.
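Because thread settings are fixed once the runtime starts, each configuration needs a fresh process. The following is a minimal sketch of such a sweep, assuming the benchmark lives in a hypothetical script named `benchmark.py` that prints its own timing. The helper names (`thread_grid`, `env_for`, `run_sweep`) and the choice of full, half, and quarter thread counts are illustrative assumptions, not a recommended grid.

```python
import os
import subprocess
import sys


def thread_grid(logical_cpus, inter_values=(1, 2)):
    """Return (intra, inter) pairs using full, half, and quarter CPU counts."""
    intra_values = sorted(
        {max(1, logical_cpus // divisor) for divisor in (1, 2, 4)},
        reverse=True,
    )
    return [(intra, inter) for intra in intra_values for inter in inter_values]


def env_for(intra, inter, base=None):
    """Copy an environment and set the thread-related variables in the copy."""
    env = dict(os.environ if base is None else base)
    env["TF_NUM_INTRAOP_THREADS"] = str(intra)
    env["TF_NUM_INTEROP_THREADS"] = str(inter)
    env["OMP_NUM_THREADS"] = str(intra)
    return env


def run_sweep(script="benchmark.py", logical_cpus=None):
    """Run the (hypothetical) benchmark script once per configuration."""
    cpus = logical_cpus or os.cpu_count() or 1
    for intra, inter in thread_grid(cpus):
        result = subprocess.run(
            [sys.executable, script],
            env=env_for(intra, inter),
            capture_output=True,
            text=True,
            check=False,
        )
        print(f"intra={intra} inter={inter}: {result.stdout.strip()}")
```

For the environment variables to take effect, the benchmark script should not also call the `tf.config.threading` setters with different values.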

Data Pipeline And Environment Variables

CPU usage is not limited to TensorFlow kernels. Input preprocessing can become the real bottleneck. If you use tf.data, enable parallel mapping and prefetching:

python

dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.prefetch(tf.data.AUTOTUNE)

You can also coordinate external math libraries with environment variables before Python starts:

bash

export OMP_NUM_THREADS=8
export TF_NUM_INTRAOP_THREADS=8
export TF_NUM_INTEROP_THREADS=2
python train.py

These values matter when underlying kernels use OpenMP or oneDNN. If you set both TensorFlow APIs and environment variables, keep them consistent so you do not accidentally create conflicting limits.
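One way to catch a mismatch early is to compare the environment against the values you intend to pass to the TensorFlow API. This is a small sketch under that assumption. `find_thread_conflicts` is a hypothetical helper, and it deliberately leaves `OMP_NUM_THREADS` out because that variable does not have to equal the intra-op count.

```python
import os


def find_thread_conflicts(intra, inter, environ=None):
    """Return messages for env vars that disagree with the intended settings."""
    env = os.environ if environ is None else environ
    expected = {
        "TF_NUM_INTRAOP_THREADS": intra,
        "TF_NUM_INTEROP_THREADS": inter,
    }
    conflicts = []
    for name, wanted in expected.items():
        raw = env.get(name)
        if raw is None:
            continue  # an unset variable cannot conflict
        if not raw.isdigit() or int(raw) != wanted:
            conflicts.append(f"{name}={raw!r} but the API call will use {wanted}")
    return conflicts
```

Running this check at startup and logging any returned messages makes it obvious when a shell profile or container image is setting limits you did not expect.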

Pinning To CPU Explicitly

If a machine also has GPUs, you may want to force a CPU-only run for testing or deployment:

python

import tensorflow as tf

tf.config.set_visible_devices([], "GPU")

with tf.device("/CPU:0"):
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    y = tf.reduce_sum(x)
    print(y.numpy())

This does not increase CPU usage by itself, but it ensures the runtime stays on the CPU path while you benchmark and tune.

Common Pitfalls

The biggest mistake is setting thread counts after TensorFlow has already initialized the runtime. At that point the setter calls typically raise a RuntimeError instead of changing the thread pools. Configure threading at process startup.
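A startup helper can make a late call visible instead of letting it pass unnoticed. The sketch below assumes the setters signal a late call with `RuntimeError`. `configure_threading` is a hypothetical helper that takes the threading API object as a parameter, so in real use you would pass `tf.config.threading`.

```python
def configure_threading(threading_api, intra_op, inter_op):
    """Apply thread counts; return False if the runtime rejected the change."""
    try:
        threading_api.set_intra_op_parallelism_threads(intra_op)
        threading_api.set_inter_op_parallelism_threads(inter_op)
    except RuntimeError as exc:
        # Assumed failure mode: the runtime was initialized before this call.
        print(f"Threading not applied: {exc}")
        return False
    return True


# Intended usage at the very top of the entry-point script:
#
#   import tensorflow as tf
#   if not configure_threading(tf.config.threading, 8, 2):
#       raise SystemExit("Configure threading before creating tensors")
```

Failing fast here is a design choice: a run that silently uses default thread pools can produce benchmark numbers that look valid but do not reflect the configuration you meant to test.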

Another issue is confusing logical CPUs with physical cores. os.cpu_count() includes hyper-threads on many systems, which can overstate the number of threads worth assigning to heavy numeric kernels.
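If the third-party `psutil` package is available, its `cpu_count(logical=False)` call reports physical cores. Here is a hedged sketch that falls back to the logical count when `psutil` is missing or cannot determine the value. `physical_core_count` is a hypothetical helper name.

```python
import os


def physical_core_count():
    """Best-effort physical core count, falling back to logical CPUs."""
    logical = os.cpu_count() or 1
    try:
        import psutil  # optional third-party dependency
    except ImportError:
        return logical
    # psutil returns None when the physical count cannot be determined.
    return psutil.cpu_count(logical=False) or logical
```

Using this value instead of `os.cpu_count()` as the intra-op setting is one more configuration worth benchmarking, not a guaranteed improvement.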

A final pitfall is tuning only model execution and ignoring data loading. If preprocessing is single-threaded, the training step will wait for input no matter how many CPU threads TensorFlow has available.
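A quick way to check whether input is the bottleneck is to time the pipeline on its own, with no model attached. The sketch below works with any iterable of batches, such as a `tf.data` dataset in eager mode. `seconds_per_batch` is a hypothetical helper, and the default of 50 batches is an arbitrary sample size.

```python
import time
from itertools import islice


def seconds_per_batch(batches, max_batches=50):
    """Average time to pull one batch from an iterable, with no model work."""
    start = time.perf_counter()
    count = sum(1 for _ in islice(batches, max_batches))
    elapsed = time.perf_counter() - start
    if count == 0:
        raise ValueError("the iterable produced no batches")
    return elapsed / count
```

If this figure is close to the full training step time, the step is mostly waiting on input, and adding TensorFlow compute threads is unlikely to help.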

Summary

TensorFlow already uses multiple CPUs, but explicit thread settings improve repeatability.
The key knobs are intra-op and inter-op parallelism.
Benchmark several configurations instead of assuming the maximum thread count is optimal.
Tune the data pipeline along with compute kernels.
Set threading before TensorFlow initializes the runtime.