My numpy build doesn't use multiple CPU cores

Introduction

NumPy itself is not a general-purpose multithreading engine. Some operations use multiple CPU cores through linked BLAS or LAPACK libraries such as OpenBLAS or MKL, but many ordinary NumPy array operations are still single-threaded. So when a NumPy build appears to use only one core, the first question is not "Is NumPy broken?" but "Which operation am I measuring, and which math library is actually underneath it?"

Not Every NumPy Operation Is Parallel

The most common misconception is that every NumPy operation runs in parallel. Matrix multiplication and some linear algebra calls may use multiple threads through BLAS. Element-wise operations such as simple additions often do not.

For example:

python
import numpy as np

x = np.random.rand(2000, 2000)
y = np.random.rand(2000, 2000)

z = x + y          # often memory-bound and not heavily threaded
m = x @ y          # often uses BLAS and may use many cores

If you benchmark x + y and see one core doing most of the work, that does not prove your NumPy build is wrong.
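One way to see why the two lines behave so differently is to estimate how much arithmetic each performs per byte of memory it touches. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes float64 arrays, counts each array as moving through memory exactly once, and uses the textbook 2n³ operation count for matrix multiplication. The function names are illustrative.

```python
BYTES_PER_FLOAT64 = 8


def add_intensity(n: int) -> float:
    """Floating-point operations per byte for z = x + y on n-by-n float64 arrays."""
    flops = n * n                               # one addition per element
    traffic = 3 * n * n * BYTES_PER_FLOAT64     # read x, read y, write z
    return flops / traffic


def matmul_intensity(n: int) -> float:
    """Operations per byte for m = x @ y, assuming each array moves once."""
    flops = 2 * n ** 3                          # about n multiplies and n adds per output element
    traffic = 3 * n * n * BYTES_PER_FLOAT64     # read x, read y, write m
    return flops / traffic


if __name__ == "__main__":
    n = 2000
    print(f"x + y: {add_intensity(n):.3f} flops per byte")
    print(f"x @ y: {matmul_intensity(n):.1f} flops per byte")
```

Under these assumptions the addition does about 0.04 operations per byte, while the multiplication does well over a hundred for n = 2000. The addition spends most of its time waiting on memory, so extra cores help it far less than they help the multiplication.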

Check Which BLAS Library NumPy Uses

The easiest first diagnostic is:

python
import numpy as np
np.show_config()

Look for references to libraries such as:

  • OpenBLAS
  • MKL
  • BLIS
  • Accelerate on macOS

If NumPy is linked against a single-threaded or minimal backend, then matrix-heavy operations will not scale the way you expect.
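If you would rather check for these names programmatically than read the output by eye, one possible approach is to capture what np.show_config() prints and search it. This is a rough sketch. The helper names and the list of substrings are assumptions made for illustration, and a plain substring match can miss a backend or report a false positive.

```python
import contextlib
import io

# Substrings to look for; this list is an assumption and is not exhaustive.
KNOWN_BACKENDS = ("openblas", "mkl", "blis", "accelerate")


def find_backends(config_text: str) -> list[str]:
    """Return the known backend names mentioned in the given configuration text."""
    lowered = config_text.lower()
    return [name for name in KNOWN_BACKENDS if name in lowered]


def numpy_config_text() -> str:
    """Capture what np.show_config() prints as a string."""
    import numpy as np  # imported here so find_backends also works on saved output

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    found = find_backends(numpy_config_text())
    print("Backends mentioned:", found or "none recognized")
```

Because find_backends takes plain text, it can also be run on configuration output pasted from another machine.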

Test a BLAS-Heavy Operation

To see whether the build can use multiple cores, test a large matrix multiplication rather than a simple element-wise expression.

python
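import time

import numpy as np

x = np.random.rand(4000, 4000)
y = np.random.rand(4000, 4000)

# perf_counter is a monotonic clock, better suited to timing than time.time
start = time.perf_counter()
_ = x @ y
print(f"matmul took {time.perf_counter() - start:.2f} s")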

While this runs, watch per-core CPU usage in a system monitor such as top, htop, or Task Manager. If the BLAS backend is multithreaded and configured to use several threads, you should typically see multiple cores active.
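If watching a system monitor is inconvenient, a rough alternative is to compare CPU time with wall-clock time from inside Python. time.process_time() reports CPU time summed over all threads in the process, so a ratio near 1 suggests one busy core and a ratio well above 1 suggests several. The helper name below is hypothetical, and the result is only an indication: short operations and coarse clocks make it noisy.

```python
import time

import numpy as np


def cpu_to_wall_ratio(operation) -> float:
    """Run operation() once and return process CPU time divided by wall time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()     # CPU time summed over all threads in the process
    operation()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


if __name__ == "__main__":
    x = np.random.rand(2000, 2000)
    y = np.random.rand(2000, 2000)
    print(f"x + y: {cpu_to_wall_ratio(lambda: x + y):.2f}")
    print(f"x @ y: {cpu_to_wall_ratio(lambda: x @ y):.2f}")
```

On a build with a multithreaded BLAS you would expect the second ratio to be noticeably higher than the first.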

Thread Count May Be Limited by Environment Variables

Even when NumPy is linked against a multithreaded BLAS, the number of threads can be capped by environment variables.

Common ones include:

  • 'OMP_NUM_THREADS'
  • 'OPENBLAS_NUM_THREADS'
  • 'MKL_NUM_THREADS'
  • 'VECLIB_MAXIMUM_THREADS' on some macOS setups

Example:

bash
export OPENBLAS_NUM_THREADS=8
python your_script.py

Or for MKL:

bash
export MKL_NUM_THREADS=8
python your_script.py

If one of these is set to 1, the build may be behaving exactly as configured.
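To check this from inside Python rather than the shell, a small helper can report which of these variables are set in the current process. This is a minimal sketch. The function name is hypothetical, and it only covers the variables listed above.

```python
import os

# Variables named in the list above.
THREAD_VARIABLES = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
)


def find_thread_caps(environ=None) -> dict[str, str]:
    """Return the thread-limit variables that are set, mapped to their values."""
    if environ is None:
        environ = os.environ
    return {name: environ[name] for name in THREAD_VARIABLES if name in environ}


if __name__ == "__main__":
    caps = find_thread_caps()
    if caps:
        for name, value in caps.items():
            print(f"{name}={value}")
    else:
        print("No thread-limit variables are set; the backend chooses its own default.")
```

To change a limit from inside a script instead of the shell, assign to os.environ before the first import of NumPy, since these libraries generally read the variables when they load.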

Use threadpoolctl to Inspect Runtime Thread Pools

A practical Python-side tool is threadpoolctl, which can reveal the thread pool libraries loaded into the process.

python
from threadpoolctl import threadpool_info

for item in threadpool_info():
    print(item)

This helps answer questions such as:

  • which numeric backend is loaded
  • how many threads it is configured to use
  • whether OpenBLAS or MKL is actually present

That is often more informative than guessing from package names alone.
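threadpoolctl can also cap threads temporarily, which gives a direct test: time the same matrix multiplication with the default settings and again limited to one thread. The sketch below uses threadpool_limits for this. The helper name and the matrix size are arbitrary choices, and a single timing run is only a rough signal.

```python
import time

import numpy as np
from threadpoolctl import threadpool_limits


def time_matmul(a, b, max_threads=None) -> float:
    """Time a @ b, optionally capping BLAS threads for the duration of the call."""
    if max_threads is None:
        start = time.perf_counter()
        a @ b
        return time.perf_counter() - start
    # threadpool_limits restores the previous limits when the block exits.
    with threadpool_limits(limits=max_threads, user_api="blas"):
        start = time.perf_counter()
        a @ b
        return time.perf_counter() - start


if __name__ == "__main__":
    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    print(f"default threads: {time_matmul(a, b):.3f} s")
    print(f"one thread:      {time_matmul(a, b, max_threads=1):.3f} s")
```

If the two timings are about the same, the backend is probably running on one thread either way, and the backend or the environment variables deserve a closer look.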

Building from Source Is Usually Not the First Fix

People often assume they need to rebuild NumPy manually. Usually they do not. In many environments, the simpler solution is to install a distribution already linked to a good BLAS implementation.

Examples include:

  • conda packages that ship with MKL or OpenBLAS
  • wheel builds that already link the intended backend

Rebuilding from source only makes sense when you have a specific reason and understand which numeric backend you want to link.

Common Pitfalls

  • Expecting all NumPy operations to scale across cores the way matrix multiplication often does.
  • Benchmarking element-wise array math and concluding that BLAS threading is broken.
  • Forgetting that environment variables may limit BLAS thread count to one.
  • Guessing about the backend instead of checking np.show_config() or threadpoolctl.
  • Rebuilding NumPy before verifying whether the slow operation is even one that should be multithreaded.

Summary

  • NumPy does not automatically multithread every operation.
  • Multi-core behavior usually depends on the linked BLAS or LAPACK backend.
  • Check the backend with np.show_config() and runtime thread pools with threadpoolctl.
  • Benchmark a BLAS-heavy operation such as matrix multiplication, not just element-wise math.
  • Verify thread-limit environment variables before assuming the build is wrong.
