TensorFlow simultaneous prediction on GPU and CPU

Introduction

TensorFlow can execute work on both CPU and GPU, but it does not automatically split a single prediction call across the two devices. If you want simultaneous inference, the usual pattern is to place separate model executions on different devices and run them concurrently.

How Device Placement Works

TensorFlow assigns each operation to a device that supports it. In most single-model inference setups, operations that have GPU kernels are placed on the GPU by default when one is available, and supporting work such as input handling stays on the CPU. That already gives mixed-device execution, but it is not the same as serving half of one batch on the CPU and the other half on the GPU.

For explicit simultaneous prediction, think in terms of independent workloads. One request stream might be latency-sensitive and small enough for the CPU. Another might be throughput-oriented and better on the GPU. TensorFlow lets you control that with tf.device(...).
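A minimal sketch of explicit placement, assuming TensorFlow 2.x with eager execution. The tensor shapes are arbitrary and chosen only for illustration; the `device` attribute of an eager tensor reports where the result lives.

```python
import tensorflow as tf

# Pin a small computation to the CPU regardless of whether a GPU is present.
with tf.device("/CPU:0"):
    a = tf.ones((2, 3))
    b = tf.ones((3, 2))
    product = tf.matmul(a, b)

# Eager tensors expose the device that holds them as a string,
# for example ".../device:CPU:0".
print(product.device)
```

Swapping the scope to `"/GPU:0"` places the same computation on the first GPU, provided TensorFlow can see one.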

A Concurrent Inference Pattern

The simplest design is to keep one model instance on the CPU and one on the GPU, then submit separate batches to each. The example below builds two identical models, copies the weights, and runs predictions in parallel with a thread pool.

python

import concurrent.futures

import numpy as np
import tensorflow as tf


def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    # Run one dummy forward pass so the model is fully built inside the
    # enclosing device scope.
    model(np.zeros((1, 16), dtype=np.float32))
    return model


# One model copy per device: variables are created on the device that is
# active when the model is built.
with tf.device("/CPU:0"):
    cpu_model = build_model()

with tf.device("/GPU:0"):
    gpu_model = build_model()

# Give both copies identical weights so they return the same predictions.
gpu_model.set_weights(cpu_model.get_weights())


def predict_on_cpu(batch):
    with tf.device("/CPU:0"):
        # .numpy() waits for the result and copies it to host memory.
        return cpu_model(batch, training=False).numpy()


def predict_on_gpu(batch):
    with tf.device("/GPU:0"):
        return gpu_model(batch, training=False).numpy()


# A small batch for the CPU and a large batch for the GPU.
cpu_batch = tf.random.normal((64, 16))
gpu_batch = tf.random.normal((512, 16))

# Submit both predictions before waiting on either, so they overlap.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    cpu_future = executor.submit(predict_on_cpu, cpu_batch)
    gpu_future = executor.submit(predict_on_gpu, gpu_batch)
    cpu_result = cpu_future.result()
    gpu_result = gpu_future.result()

print(cpu_result.shape, gpu_result.shape)

This is a realistic pattern for serving. The CPU can handle small requests without paying GPU kernel-launch and host-to-device transfer overhead, while the GPU handles large batches that benefit from parallel math.

When This Helps and When It Does Not

Running simultaneous inference makes sense when the workloads are independent and large enough to justify the extra coordination. It is most useful in systems that already have mixed request sizes or when the GPU is not fully utilized by itself.

It helps less when you have a tiny model and small batches. In that case, the overhead of copying tensors, managing threads, and maintaining two model instances can outweigh the gain. It also helps less if the input pipeline is the real bottleneck. If CPU preprocessing is saturated, adding concurrent GPU inference may not improve end-to-end latency.
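To check whether the overhead outweighs the gain for a given workload, a small timing harness is usually enough. The sketch below uses only the standard library; `time_call` and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in workload so the example runs on its own.

```python
import statistics
import time


def time_call(fn, *args, warmup=3, repeats=20):
    """Return the median wall-clock latency of fn(*args) in milliseconds."""
    # Warm-up runs absorb one-time costs such as tracing or memory allocation.
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload so the sketch runs on its own; swap in a real
# prediction function and batch to compare devices.
def fake_predict(batch):
    return [x * 2 for x in batch]


latency_ms = time_call(fake_predict, list(range(1000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Passing `predict_on_cpu` or `predict_on_gpu` from the earlier example in place of `fake_predict` compares the two devices at a given batch size. Because those functions call `.numpy()`, the timer stops only after the result has reached host memory.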

Before optimizing, inspect the available devices:

python

import tensorflow as tf

for device in tf.config.list_logical_devices():
    print(device)

That confirms whether TensorFlow sees both the CPU and the GPU. If it lists only the CPU, there is no GPU to place a second model copy on.
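The same check can drive a fallback, so the program still runs on a CPU-only machine. This is a sketch under that assumption; the variable names `large_batch_device` and `small_batch_device` are illustrative and not part of any TensorFlow API.

```python
import tensorflow as tf

# Physical devices are the hardware TensorFlow detected at startup.
gpus = tf.config.list_physical_devices("GPU")

# Fall back to a CPU-only plan when no GPU is visible.
large_batch_device = "/GPU:0" if gpus else "/CPU:0"
small_batch_device = "/CPU:0"

print(f"GPUs visible: {len(gpus)}")
print(f"large batches -> {large_batch_device}, small batches -> {small_batch_device}")
```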

Practical Design Advice

For training, use distribution strategies and device-aware batching. For inference, keep the design simpler. Duplicate the model when necessary, pin each copy to a device, and route requests intentionally.
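Routing requests intentionally can be as simple as a batch-size threshold. The sketch below is one possible policy, not a recommendation: `choose_device` is a hypothetical helper, and the default threshold of 128 is an arbitrary placeholder that should come from your own benchmarks.

```python
def choose_device(batch_size, gpu_available, gpu_min_batch=128):
    """Return the device string a batch should be sent to."""
    # Small batches stay on the CPU to avoid GPU launch and transfer overhead.
    if gpu_available and batch_size >= gpu_min_batch:
        return "/GPU:0"
    return "/CPU:0"


for size in (8, 64, 512):
    print(size, choose_device(size, gpu_available=True))
```

The returned string can be passed straight to `tf.device(...)` or used to pick which model copy handles the request.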

Also remember that some operations may still execute on the CPU even inside a GPU-placed model. Device placement is per operation, not just per high-level function call. That is normal and does not mean the configuration failed.
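To see per-operation placement directly, TensorFlow can log the device chosen for each operation. A minimal sketch, assuming TensorFlow 2.x; the tensor shapes are arbitrary, and the log messages go to TensorFlow's own log output rather than to the return values.

```python
import tensorflow as tf

# Ask TensorFlow to log the device chosen for each executed operation.
# Enabling this at the start of a program gives the most complete picture.
tf.debugging.set_log_device_placement(True)

x = tf.random.normal((4, 16))
y = tf.nn.relu(tf.matmul(x, tf.ones((16, 2))))

# Turn the logging back off so later code is not flooded with messages.
tf.debugging.set_log_device_placement(False)
print(y.shape)
```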

In production, measure throughput, tail latency, and memory use. A second model copy costs memory (host RAM for the CPU copy, device memory for the GPU copy) and may increase startup time. The goal is not to use every device at all times; the goal is to improve the metrics that matter for your workload.

Common Pitfalls

Expecting one model.predict(...) call to be split automatically across CPU and GPU is the wrong mental model. TensorFlow places operations, not arbitrary slices of one request.

Sharing one model instance across devices without thinking about placement can lead to confusing performance. Use explicit device scopes when you need predictable behavior.

Sending very small batches to the GPU often performs worse than keeping them on the CPU because GPU launch overhead dominates.

Ignoring input transfer and preprocessing costs can hide the real bottleneck. Measure the whole pipeline, not just the math kernels.

Summary

Simultaneous prediction on CPU and GPU is possible when you run separate inference workloads concurrently.
The common pattern is one model copy per device with explicit tf.device(...) placement.
This approach helps when workloads are independent and large enough to justify the overhead.
Always benchmark full pipeline latency and throughput before keeping the extra complexity.