Changing the number of threads in TensorFlow on Cifar10

Introduction

When TensorFlow training on CIFAR-10 runs mainly on CPU, thread configuration can change training throughput quite a bit. TensorFlow exposes two main thread settings: intra-op threads, which control parallelism inside one operation, and inter-op threads, which control how many independent operations can run at the same time.

In modern TensorFlow 2 code, the standard way to set these values is through tf.config.threading. That is preferable to older session-based examples unless you are maintaining legacy TensorFlow 1 code.

What the Thread Settings Mean

The thread settings solve different problems:

tf.config.threading.set_intra_op_parallelism_threads(n) limits parallel work inside one op, such as a matrix multiply.
tf.config.threading.set_inter_op_parallelism_threads(n) limits how many separate ops can be scheduled concurrently.

These settings mostly matter for CPU execution. If you train on GPU, CPU threads still affect input preparation and some supporting work, but they are usually not the primary performance lever.
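To see what a process is currently configured with, the matching getters can be read back. This is a minimal sketch assuming TensorFlow 2.x; a value of 0 is documented as leaving the choice of thread count to TensorFlow.

```python
import os

import tensorflow as tf

# Logical cores visible to Python; can be None on unusual platforms.
logical_cores = os.cpu_count()

# Read back the current settings. A value of 0 means no explicit count
# has been set and TensorFlow picks one itself.
intra = tf.config.threading.get_intra_op_parallelism_threads()
inter = tf.config.threading.get_inter_op_parallelism_threads()

print(f"logical cores: {logical_cores}")
print(f"intra-op threads: {intra} (0 = chosen by TensorFlow)")
print(f"inter-op threads: {inter} (0 = chosen by TensorFlow)")
```

Printing these next to the logical core count makes it easier to judge whether an explicit setting is reasonable for the machine.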

Set the Values Early in Process Startup

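Configure threading at process startup, before any TensorFlow op runs and before you build the model. TensorFlow reads these values when its runtime initializes, and in TensorFlow 2 a later attempt to change them raises a RuntimeError rather than taking effect.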

```python
import tensorflow as tf
from tensorflow import keras

# Set thread counts before any other TensorFlow work happens.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Scale pixel values from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Small convolutional network; enough to exercise CPU compute.
model = keras.Sequential([
    keras.layers.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(x_train, y_train, batch_size=128, epochs=3, validation_split=0.1)
```

There is no universal best value. Four and two are only example numbers. Good settings depend on the machine, whether the workload is CPU-bound, and what else is competing for cores.
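Since the right values depend on the machine, one option is to generate a short list of candidates from the logical core count and benchmark each. The helper below is a hypothetical sketch, not a TensorFlow API: the name `candidate_thread_configs` and the heuristic (powers of two up to the core count, paired with one or two inter-op threads) are assumptions chosen only to keep the search small.

```python
import os


def candidate_thread_configs(logical_cores=None):
    """Return (intra_op, inter_op) pairs worth benchmarking.

    Heuristic only: intra-op counts are 1, 2, 4, ... below the core count,
    plus the core count itself, each paired with 1 or 2 inter-op threads.
    """
    cores = logical_cores or os.cpu_count() or 1

    intra_values = []
    n = 1
    while n < cores:
        intra_values.append(n)
        n *= 2
    intra_values.append(cores)

    # A single-core machine gains nothing from a second inter-op thread.
    inter_values = [1, 2] if cores > 1 else [1]
    return [(intra, inter) for intra in intra_values for inter in inter_values]


if __name__ == "__main__":
    for intra, inter in candidate_thread_configs():
        print(f"intra_op={intra} inter_op={inter}")
```

The list deliberately stops at the logical core count, because values above it oversubscribe the CPU.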

Benchmark Instead of Guessing

A common mistake is assuming "more threads equals more speed." Past a certain point, extra threads can make things slower because of scheduling overhead, memory pressure, or contention between ops.

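```python
import time
import tensorflow as tf

def matmul_benchmark(repeats=30):
    x = tf.random.normal((2000, 1024))
    w = tf.random.normal((1024, 1024))

    # Warm up once so one-time setup cost is not counted.
    tf.matmul(x, w).numpy()

    start = time.perf_counter()  # monotonic clock, better suited to timing
    for _ in range(repeats):
        # .numpy() forces the result to be computed before the loop continues.
        tf.matmul(x, w).numpy()
    return time.perf_counter() - start

print("elapsed seconds:", matmul_benchmark())
```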

Use small benchmarks like this for quick feedback, but rely on end-to-end epoch time for real decisions. CIFAR-10 training involves more than one hot operation. Data loading, augmentation, and batching can dominate the result.
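Because thread settings are fixed once the TensorFlow runtime has initialized, comparing configurations end to end means starting one fresh process per configuration. The driver below is a sketch under stated assumptions: the environment variable names `INTRA_OP_THREADS` and `INTER_OP_THREADS` are hypothetical, and the training script passed on the command line is assumed to read them and call the tf.config.threading setters before doing anything else.

```python
import os
import subprocess
import sys
import time


def time_training_run(script_path, intra_op, inter_op):
    """Run the script in a fresh process and return wall-clock seconds."""
    env = dict(os.environ)
    # Hypothetical variable names; the training script must read them itself.
    env["INTRA_OP_THREADS"] = str(intra_op)
    env["INTER_OP_THREADS"] = str(inter_op)

    start = time.perf_counter()
    subprocess.run([sys.executable, script_path], env=env, check=True)
    return time.perf_counter() - start


def compare_configs(script_path, configs):
    """Return {(intra_op, inter_op): seconds} for each configuration."""
    return {cfg: time_training_run(script_path, *cfg) for cfg in configs}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python compare_threads.py path/to/your_training_script.py
    results = compare_configs(sys.argv[1], [(1, 1), (2, 1), (4, 2)])
    for (intra, inter), seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"intra_op={intra} inter_op={inter}: {seconds:.1f}s")
```

The measured time includes interpreter startup, the TensorFlow import, and data loading. That overhead is roughly constant across configurations, so train for enough epochs that it does not dominate the comparison.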

Tune the Input Pipeline Too

If the input pipeline is slow, changing compute thread counts will not fix the whole system. In TensorFlow, tf.data often matters just as much.

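```python
import tensorflow as tf

# x_train and y_train are the NumPy arrays from the earlier training example.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000)
train_ds = train_ds.map(
    # Random horizontal flip as a light augmentation step.
    lambda x, y: (tf.image.random_flip_left_right(x), y),
    num_parallel_calls=tf.data.AUTOTUNE,  # let tf.data choose map parallelism
)
train_ds = train_ds.batch(128).prefetch(tf.data.AUTOTUNE)
```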

If your CPU spends time decoding, augmenting, and batching images, that pipeline can become the real bottleneck. In that case, changing intra-op and inter-op counts alone gives only limited gains.
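One way to check whether input is the limiting factor is to time the pipeline on its own, with no model attached. The helper below is a hypothetical sketch that works on any Python iterable, which in eager mode includes a tf.data.Dataset such as train_ds. The demo uses a stand-in generator so the block runs without TensorFlow.

```python
import itertools
import time


def measure_batches_per_second(batches, max_batches=100):
    """Iterate up to max_batches items and return (count, batches_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in itertools.islice(batches, max_batches):
        count += 1
    elapsed = time.perf_counter() - start
    return count, (count / elapsed if elapsed > 0 else float("inf"))


if __name__ == "__main__":
    # Stand-in for a real pipeline: each "batch" takes about 1 ms to produce.
    def slow_batches(n):
        for i in range(n):
            time.sleep(0.001)
            yield i

    count, rate = measure_batches_per_second(slow_batches(50))
    print(f"{count} batches at {rate:.0f} batches/s")
```

If the rate measured for train_ds alone is not comfortably higher than the batches per second seen during model.fit, the input pipeline is probably what limits training speed, and thread tuning for compute ops will help little.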

Legacy TensorFlow 1.x Code Looks Different

Older tutorials may use ConfigProto with a session:

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,
    inter_op_parallelism_threads=2,
)
session = tf.compat.v1.Session(config=config)
```

That is still relevant for TensorFlow 1 compatibility code, but new TensorFlow 2 projects should prefer tf.config.threading.

Common Pitfalls

The biggest mistake is trying to change thread counts after TensorFlow has already initialized its runtime and then assuming the new numbers took effect. Another is oversubscribing the CPU by setting values close to or above the number of logical cores without measuring. That often hurts more than it helps.
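The first pitfall can be reproduced in a few lines. This is a minimal sketch assuming TensorFlow 2.x, where a late change to a different value is expected to raise a RuntimeError; the variable `late_change_rejected` is only there for illustration.

```python
import tensorflow as tf

# Configure threading first, before any op has run.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)

# Running any op initializes the runtime with the values above.
_ = tf.constant(1.0) + 1.0

late_change_rejected = False
try:
    # Asking for a different value after initialization should be refused.
    tf.config.threading.set_intra_op_parallelism_threads(8)
except RuntimeError as err:
    late_change_rejected = True
    print("late change rejected:", err)

print("intra-op threads in effect:",
      tf.config.threading.get_intra_op_parallelism_threads())
```

Catching the error here is only for demonstration. In real code it is usually better to let it propagate, so a misplaced configuration call is noticed immediately.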

Developers also sometimes blame thread settings for poor training speed when the model is really GPU-bound or when the input pipeline is the limiting factor. Always profile the whole training path before deciding what to tune.

Summary

Use tf.config.threading in modern TensorFlow 2 code.
intra_op controls parallelism inside one op; inter_op controls concurrency across independent ops.
Set thread values early, before training begins.
Benchmark end-to-end training time instead of guessing from core count.
Check the tf.data pipeline too, because input work may be the real bottleneck.