Change the number of threads for TensorFlow inference with the C API
Introduction
With the TensorFlow C API, CPU thread settings must be decided before the session is created. If inference is already running, you generally cannot dial the thread count up or down on the existing session and expect TensorFlow to rebuild its execution pools in place.
What Thread Counts Mean
TensorFlow uses two related CPU settings:
- intra-op threads for parallel work inside one operation such as matrix multiplication
- inter-op threads for running independent operations at the same time
If you set these too low, inference may underuse the machine. If you set them too high, contention and context switching can hurt latency.
Practical Option: Configure Before Session Creation
The C API exposes TF_SetConfig on TF_SessionOptions. Under the hood, that function expects a serialized TensorFlow config proto. In real C applications, the important rule is that the configuration must be applied before creating or loading the session.
The simplest route in C is to set thread-related environment variables before building the session. This is easy to test and needs no protobuf handling.
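As a minimal sketch of the environment-variable approach, the helper below exports thread counts before any TensorFlow call is made. It assumes a POSIX system (for setenv) and a TensorFlow build that reads TF_NUM_INTRAOP_THREADS and TF_NUM_INTEROP_THREADS; check those variable names against the version you link. The helper name set_tf_thread_env is hypothetical.

```c
#if !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L  /* exposes setenv() under strict C modes */
#endif
#include <stdio.h>
#include <stdlib.h>

/* Export thread-count hints for the TensorFlow CPU runtime.
 * Assumption: the linked TensorFlow build honors these variable names.
 * Returns 0 on success, -1 on invalid input or setenv failure. */
int set_tf_thread_env(int intra_op, int inter_op) {
    char buf[16];

    if (intra_op < 1 || inter_op < 1) {
        return -1;
    }

    snprintf(buf, sizeof buf, "%d", intra_op);
    if (setenv("TF_NUM_INTRAOP_THREADS", buf, 1) != 0) {
        return -1;
    }

    snprintf(buf, sizeof buf, "%d", inter_op);
    if (setenv("TF_NUM_INTEROP_THREADS", buf, 1) != 0) {
        return -1;
    }

    /* Only after this point should the program call
     * TF_NewSessionOptions / TF_LoadSessionFromSavedModel. */
    return 0;
}
```

A call such as set_tf_thread_env(4, 1) would go at the top of main, ahead of every TensorFlow C API call in the process.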
The key detail is placement: set thread-related configuration before TF_LoadSessionFromSavedModel or before creating a graph session.
Per-Session Control With TF_SetConfig
If you need explicit per-session control instead of process-level configuration, use TF_SetConfig with a serialized ConfigProto. That is the official C API hook.
The tradeoff is that you need protobuf bytes for fields such as intra_op_parallelism_threads and inter_op_parallelism_threads. Many teams generate that proto in a higher-level language or with TensorFlow protobuf definitions during the build, then pass the serialized bytes to TF_SetConfig.
Conceptually the flow is:
- create TF_SessionOptions
- serialize a config proto with the thread settings
- call TF_SetConfig
- create or load the session
If you call TF_SetConfig after the session exists, it is too late for that session.
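The flow above can be sketched in C without a protobuf library, because both thread fields are plain varints. This sketch assumes the field numbers from TensorFlow's config.proto (2 for intra_op_parallelism_threads, 5 for inter_op_parallelism_threads), which are worth confirming against your TensorFlow version. The function names and the HAVE_TENSORFLOW build flag are hypothetical; the flag only keeps the encoder compilable on machines without the TensorFlow headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one varint field (wire type 0) and return the bytes written.
 * The single-byte tag is only valid for field numbers below 16. */
size_t put_varint_field(uint8_t *out, uint32_t field, uint32_t value) {
    size_t n = 0;
    out[n++] = (uint8_t)(field << 3);
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value | 0x80);  /* low 7 bits + continuation bit */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;
    return n;
}

/* Serialize a ConfigProto that holds only the two thread-count fields.
 * `out` needs room for 12 bytes (two fields, at most 6 bytes each). */
size_t encode_thread_config(uint8_t *out, uint32_t intra_op, uint32_t inter_op) {
    size_t n = 0;
    n += put_varint_field(out + n, 2, intra_op);  /* intra_op_parallelism_threads */
    n += put_varint_field(out + n, 5, inter_op);  /* inter_op_parallelism_threads */
    return n;
}

#ifdef HAVE_TENSORFLOW  /* hypothetical flag: define it when linking TensorFlow */
#include <tensorflow/c/c_api.h>

/* Build session options with the thread settings applied.
 * The caller checks TF_GetCode(status) == TF_OK, passes the options to
 * TF_NewSession or TF_LoadSessionFromSavedModel, and then releases them
 * with TF_DeleteSessionOptions. */
TF_SessionOptions *new_options_with_threads(uint32_t intra_op,
                                            uint32_t inter_op,
                                            TF_Status *status) {
    uint8_t config[12];
    size_t len = encode_thread_config(config, intra_op, inter_op);
    TF_SessionOptions *opts = TF_NewSessionOptions();
    TF_SetConfig(opts, config, len, status);  /* before the session exists */
    return opts;
}
#endif
```

For small values the result is just four bytes: two threads for each setting encodes as 0x10 0x02 0x28 0x02. Hand-encoding is reasonable only because two scalar fields are involved; for anything larger, generating the bytes from the protobuf definitions is the safer choice.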
How to Tune the Values
There is no universal best pair of numbers. A good starting point for CPU inference is:
- try a small inter-op value such as 1 or 2
- vary intra-op threads around the number of physical cores available to the process
- measure throughput and latency instead of guessing
For single-request latency, fewer threads can sometimes be faster because they reduce scheduling overhead. For batched throughput, a higher setting may help.
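To make the "measure instead of guessing" advice concrete, here is a minimal timing sketch. It assumes a POSIX clock (clock_gettime with CLOCK_MONOTONIC) and a caller-supplied callback that runs one inference on a session already built with the thread settings under test. The names infer_fn, mean_latency_ms, and count_call are hypothetical, and count_call is only a stand-in workload, not a TensorFlow call.

```c
#if !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L  /* exposes clock_gettime() under strict C modes */
#endif
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: runs one inference on a prepared session. */
typedef void (*infer_fn)(void *ctx);

/* Mean wall-clock latency in milliseconds over `iters` timed calls,
 * after `warmup` untimed calls. Returns -1.0 on invalid arguments. */
double mean_latency_ms(infer_fn run, void *ctx, int warmup, int iters) {
    struct timespec t0, t1;
    double total_ms;
    int i;

    if (run == NULL || iters < 1) {
        return -1.0;
    }

    for (i = 0; i < warmup; i++) {
        run(ctx);  /* let caches and thread pools settle */
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++) {
        run(ctx);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    total_ms = (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
    return total_ms / iters;
}

/* Stand-in workload that only counts invocations; a real callback
 * would call TF_SessionRun on the session stored in ctx. */
void count_call(void *ctx) {
    ++*(int *)ctx;
}
```

Because thread counts are fixed per session, a sweep would create one session per (intra-op, inter-op) pair, time each with the same inputs and batch size, and compare the results.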
Common Pitfalls
The most common mistake is changing thread settings after the session has already been created. TensorFlow sizes its thread pools from the options it is given at session creation, so later changes do not reach that session.
Another issue is tuning on logical CPU count alone. Hyperthreaded cores do not always behave like fully independent cores for inference workloads.
A third problem is benchmarking without fixing the workload. Thread settings that improve large-batch throughput may hurt single-request latency.
Summary
- Set TensorFlow CPU thread counts before creating or loading the session.
- intra-op controls parallelism inside an op; inter-op controls parallelism across ops.
- The C API hook for session configuration is TF_SetConfig.
- A practical runnable approach is to set thread-related environment variables before session creation.
- Tune with measurements, because the best thread counts depend on model shape and workload.

