onnxruntime inference is way slower than pytorch on GPU

Introduction

ONNX Runtime is often fast on GPU, but it is not guaranteed to beat PyTorch automatically. If ONNX Runtime is much slower, the problem is usually not "ONNX is bad" but a benchmarking mistake, an execution-provider issue, or a model graph that does not map efficiently to the available kernels.

The first job is to verify that both runtimes are really using the GPU the way you think they are. After that, you can look at graph export quality, warmup, and host-device transfer overhead.

Start by Verifying the Execution Provider

The most common mistake is that ONNX Runtime is not actually running the graph on the GPU end to end.

python

import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

print(session.get_providers())
print(session.get_inputs()[0].name)

If CUDAExecutionProvider is missing from the printed list, the session is running on the CPU and you are not benchmarking GPU inference. That often means the wrong package is installed, such as the CPU-only onnxruntime instead of onnxruntime-gpu.
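The check above can be turned into a small diagnostic. This is a minimal sketch, not part of ONNX Runtime: `diagnose_providers` is a hypothetical helper that takes two lists of provider names, which with a real session would come from `session.get_providers()` and `ort.get_available_providers()`. Literal lists are used here so the sketch runs without a GPU.

```python
CUDA = "CUDAExecutionProvider"


def diagnose_providers(active, available):
    """Return a short message describing whether a session can use the GPU.

    active:    provider names registered for the session
    available: provider names supported by the installed package
    """
    if CUDA in active:
        return "ok: CUDA provider is active for this session"
    if CUDA in available:
        # The build supports CUDA, but this session is not using it.
        return "CUDA provider is available but not active; it was not requested or failed to load"
    return "CUDA provider is not available; check for a CPU-only onnxruntime install"


# Hypothetical inputs standing in for a CPU-only installation.
print(diagnose_providers(["CPUExecutionProvider"], ["CPUExecutionProvider"]))
```

Even when the CUDA provider is active, individual nodes can still be assigned to the CPU provider, so this check is necessary but not sufficient.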

Benchmark Correctly

GPU benchmarks are easy to distort. You need warmup iterations and explicit CUDA synchronization on the PyTorch side.

python

import time
import torch
import onnxruntime as ort

# Placeholder model; substitute the PyTorch model that model.onnx was exported from.
x = torch.randn(32, 3, 224, 224, device="cuda")
model = torch.nn.Identity().cuda().eval()

# Warm up so one-time CUDA setup costs are not timed.
with torch.no_grad():
    for _ in range(20):
        _ = model(x)
torch.cuda.synchronize()

# CUDA kernels launch asynchronously, so synchronize before reading the clock.
start = time.perf_counter()
with torch.no_grad():
    for _ in range(100):
        _ = model(x)
torch.cuda.synchronize()
print("PyTorch seconds:", time.perf_counter() - start)

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
input_name = session.get_inputs()[0].name
x_np = x.cpu().numpy()  # host-side copy of the input

# Warm up ONNX Runtime as well.
for _ in range(20):
    session.run(None, {input_name: x_np})

# Each call uploads x_np to the GPU and returns outputs as NumPy arrays.
start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: x_np})
print("ONNX Runtime seconds:", time.perf_counter() - start)

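This is still not a perfect apples-to-apples benchmark, because the ONNX Runtime loop is fed a CPU NumPy array. Every session.run call copies the input from host to GPU and copies the outputs back to the host, while the PyTorch loop never leaves the GPU. Those copies alone can dominate runtime for small models.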
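The warmup-then-time pattern can be factored into one helper so both runtimes are measured the same way. The sketch below is an illustration rather than a standard API: `benchmark` and its `sync` parameter are names chosen here. It synchronizes after every call and reports the median, so it measures per-call latency rather than the total-loop time used above.

```python
import statistics
import time


def benchmark(fn, warmup=20, iters=100, sync=None):
    """Return the median seconds per call of fn, measured after warmup.

    sync is an optional callable that blocks until queued GPU work has
    finished (for PyTorch on CUDA that would be torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warmup work is finished before timing starts

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # include the asynchronous GPU work in the sample
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only stand-in workload so the sketch runs anywhere.
print(benchmark(lambda: sum(range(1000)), warmup=5, iters=50))
```

With the objects from the benchmark above, the calls would look like `benchmark(lambda: model(x), sync=torch.cuda.synchronize)` and `benchmark(lambda: session.run(None, {input_name: x_np}))`.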

Host-Device Transfers Can Kill the Result

A frequent reason ONNX Runtime looks slow is that data is bouncing between CPU and GPU. PyTorch code may already have tensors resident on the GPU, while the ONNX pipeline converts them to CPU NumPy arrays for session.run.

If you compare that to pure GPU-side PyTorch execution, ONNX Runtime loses before the model even starts.

For serious GPU benchmarking, minimize transfers and look at IO binding when needed. The important idea is that runtime speed and data-movement speed are not the same thing.
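As a rough illustration of IO binding, the sketch below binds a GPU-resident tensor directly as the session input and keeps the outputs on the GPU. `run_with_io_binding` is a hypothetical wrapper, not an ONNX Runtime function. It assumes a single-input model, a contiguous CUDA tensor `x` whose producing work has already finished (for example after `torch.cuda.synchronize()`), and an `element_type` that matches the tensor's dtype.

```python
def run_with_io_binding(session, x, element_type, device_id=0):
    """Run session on a CUDA-resident tensor without a host round trip.

    session:      an onnxruntime.InferenceSession using CUDAExecutionProvider
    x:            contiguous CUDA tensor exposing .shape and .data_ptr()
    element_type: NumPy dtype matching x, e.g. numpy.float32
    """
    binding = session.io_binding()

    # Point ONNX Runtime at the existing GPU buffer instead of copying it.
    binding.bind_input(
        name=session.get_inputs()[0].name,
        device_type="cuda",
        device_id=device_id,
        element_type=element_type,
        shape=tuple(x.shape),
        buffer_ptr=x.data_ptr(),
    )

    # Let ONNX Runtime allocate every output on the GPU.
    for out in session.get_outputs():
        binding.bind_output(out.name, "cuda", device_id)

    session.run_with_iobinding(binding)
    return binding.get_outputs()  # OrtValue objects still on the device
```

A call might look like `run_with_io_binding(session, x.contiguous(), numpy.float32)`. If the results are needed on the host, `binding.copy_outputs_to_cpu()` returns them as NumPy arrays, at the cost of the transfer this sketch is meant to avoid.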

Small Models and Small Batches Often Favor PyTorch

If the model is tiny or the batch size is small, launch overhead and framework integration costs matter more than raw kernel efficiency. In those cases, the theoretical benefits of graph optimization may not be large enough to overcome setup overhead.

This is why some workloads show ONNX Runtime helping a lot, while others show little benefit or even regressions. Measure the actual production shape, not just a toy batch of size 1 unless that is really your production case.
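One way to see where fixed overhead stops dominating is to sweep batch sizes and compare time per sample. The helper below is a hypothetical sketch: `per_sample_seconds` expects a `run(batch_size)` callable that blocks until the work is complete, which for GPU code means including the synchronization inside `run`. The `fake_run` workload is a CPU stand-in with an artificial fixed cost, not a real model.

```python
import time


def per_sample_seconds(run, batch_sizes, iters=50):
    """Map each batch size to the average seconds per sample of run(batch_size)."""
    results = {}
    for batch in batch_sizes:
        run(batch)  # one warmup call per input shape
        start = time.perf_counter()
        for _ in range(iters):
            run(batch)
        results[batch] = (time.perf_counter() - start) / (iters * batch)
    return results


def fake_run(batch):
    """Stand-in workload: a fixed per-call cost plus work that scales with batch."""
    time.sleep(0.0002)       # simulated launch/framework overhead
    sum(range(batch * 200))  # simulated per-sample compute


print(per_sample_seconds(fake_run, [1, 8, 64], iters=20))
```

Running the same sweep against both runtimes, with the batch sizes you actually serve, shows whether a gap is a fixed per-call cost or a per-sample cost.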

Export Quality Matters

A bad export can produce a graph with extra reshapes, unsupported ops, or fallback behavior. If ONNX Runtime cannot execute important nodes efficiently on the GPU, performance suffers.

In PyTorch, an operation may use a highly optimized kernel. After export, the graph may be decomposed differently, or some operators may be handled less efficiently by the available ONNX Runtime provider.

That is why graph inspection matters. Do not assume that because the model is mathematically the same, the runtime path is equally optimized.
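A cheap first inspection is to count operator types in the exported graph and look for unexpected volumes of Reshape, Transpose, Cast, or similar glue nodes. The sketch below is one possible approach: `count_op_types` is a hypothetical helper that works on any objects with an `op_type` attribute, and the demo nodes are made up so the example runs without the `onnx` package or a model file.

```python
from collections import Counter
from types import SimpleNamespace


def count_op_types(nodes):
    """Count how often each operator type appears in an iterable of graph nodes."""
    return Counter(node.op_type for node in nodes)


# Made-up nodes standing in for a real exported graph.
demo_nodes = [
    SimpleNamespace(op_type=t)
    for t in ["Conv", "Relu", "Reshape", "Reshape", "Conv"]
]
print(count_op_types(demo_nodes).most_common())
```

With the `onnx` package installed, the real graph could be passed as `count_op_types(onnx.load("model.onnx").graph.node)`.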

Tune Session Options When Needed

ONNX Runtime has tunable session options, and they can matter for performance.

python

import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

session = ort.InferenceSession(
    "model.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

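This will not solve every slowdown, but it is a reasonable baseline. ORT_ENABLE_ALL is normally the default level already, so setting it explicitly mainly rules out a lower level configured elsewhere. You should also make sure the ONNX Runtime version, the installed CUDA and cuDNN libraries, and the exported model opset are compatible.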

Common Pitfalls

Accidentally using CPU-only ONNX Runtime and thinking you are benchmarking GPU inference.
Comparing GPU-resident PyTorch tensors against ONNX Runtime calls that first copy inputs to CPU NumPy arrays.
Measuring without warmup or without synchronizing CUDA work on the PyTorch side.
Assuming a successful ONNX export means the resulting graph is equally optimized for the GPU.
Benchmarking tiny batches and drawing broad conclusions about runtime performance.

Summary

ONNX Runtime being slower than PyTorch on GPU usually points to benchmarking or execution-path issues, not just the runtime choice.
Confirm that CUDAExecutionProvider is actually active.
Eliminate unnecessary CPU-GPU transfers before comparing runtimes.
Benchmark with warmup and synchronization so the numbers are meaningful.
Inspect the exported graph and provider compatibility when performance remains poor.