Getting a prediction from an ONNX model in Python

Introduction

The usual way to get predictions from an ONNX model in Python is to load the model with onnxruntime.InferenceSession, inspect the input and output metadata, prepare NumPy arrays with the exact expected shape and dtype, and then call session.run(...). In practice, inference problems tend to come from mismatched inputs and preprocessing more often than from the runtime call itself.

So loading the file and calling run is the easy part. The part that needs care is matching the model's input names, shapes, dtypes, and preprocessing assumptions exactly.

Install the Runtime

For CPU inference:

bash
pip install onnxruntime

If you have a compatible CUDA environment and want GPU execution:

bash
pip install onnxruntime-gpu

The inference code stays similar. What changes is which execution providers the runtime can use.

The Basic Inference Flow

python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

# Metadata for the first input and first output of the model.
input_info = session.get_inputs()[0]
output_info = session.get_outputs()[0]

print(input_info.name)
print(input_info.shape)
print(input_info.type)

# Dummy input; this shape assumes an image model that takes 1x3x224x224 float32.
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
result = session.run([output_info.name], {input_info.name: x})

print(result[0].shape)

This is the normal pattern:

  • create the session
  • inspect the metadata
  • prepare the input dictionary
  • call run

Why Input Inspection Matters

You should not guess input names, shapes, or dtypes.

python
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)

for out in session.get_outputs():
    print(out.name, out.shape, out.type)

Typical mistakes include:

  • sending float64 when the model expects float32
  • forgetting the batch dimension
  • using the wrong channel order for images
  • guessing the input name incorrectly

Inspect first, then build the feed dictionary.
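
One way to make that inspection enforceable is a small pre-flight check that compares a feed dictionary against the session metadata before calling run. The sketch below is a hypothetical helper, not part of ONNX Runtime: the names check_feed and ORT_TO_NUMPY are made up for illustration, the type mapping is deliberately partial, and only fixed integer dimensions are compared, because symbolic dimensions show up in the metadata as strings or None.

```python
import numpy as np

# Partial mapping from ONNX Runtime type strings to NumPy dtypes (an assumption:
# extend it if your model uses other element types).
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(double)": np.float64,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
    "tensor(int32)": np.int32,
    "tensor(uint8)": np.uint8,
    "tensor(bool)": np.bool_,
}


def check_feed(session, feed):
    """Return a list of problems; an empty list means the feed looks consistent."""
    problems = []
    expected = {inp.name: inp for inp in session.get_inputs()}

    # Names the model does not know about.
    for name in feed:
        if name not in expected:
            problems.append(f"unexpected input name: {name!r}")

    for name, info in expected.items():
        if name not in feed:
            problems.append(f"missing input: {name!r}")
            continue
        array = feed[name]

        # Dtype check, skipped for type strings outside the partial mapping.
        want = ORT_TO_NUMPY.get(info.type)
        if want is not None and array.dtype != np.dtype(want):
            problems.append(f"{name}: dtype {array.dtype}, expected {np.dtype(want)}")

        # Rank check first, then per-axis check on fixed dimensions only.
        if array.ndim != len(info.shape):
            problems.append(f"{name}: rank {array.ndim}, expected {len(info.shape)}")
            continue
        for axis, (got, dim) in enumerate(zip(array.shape, info.shape)):
            if isinstance(dim, int) and got != dim:
                problems.append(f"{name}: axis {axis} has size {got}, expected {dim}")

    return problems
```

Calling check_feed(session, feed) just before session.run(...) turns a float64 array or a missing batch dimension into a readable message rather than a runtime error raised from inside the session.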

Example with Image Preprocessing

python
from PIL import Image
import numpy as np
import onnxruntime as ort

# Load as RGB, resize to the spatial size the model expects, scale to [0, 1].
image = Image.open("cat.jpg").convert("RGB").resize((224, 224))
array = np.asarray(image).astype(np.float32) / 255.0
array = np.transpose(array, (2, 0, 1))  # HWC -> CHW
array = np.expand_dims(array, axis=0)   # add the batch dimension

session = ort.InferenceSession("image_model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

pred = session.run([output_name], {input_name: array})[0]
print(pred.shape)
print(np.argmax(pred, axis=1))  # index of the highest-scoring class per batch item

This example shows that the preprocessing pipeline is part of getting a correct prediction. Here the image is only scaled to the [0, 1] range and reordered to channels-first; if the original model expected different normalization or channel ordering, the runtime call can succeed while the prediction is still wrong.
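
As an illustration of how much can hide in that pipeline, here is a sketch of a preprocessing function with per-channel normalization and an optional RGB-to-BGR flip. The function name preprocess and its parameters are hypothetical, and the mean and standard deviation are the values commonly used for ImageNet-trained models, included only as an assumption. The values that matter are whatever the original training code used.

```python
import numpy as np

# Assumed statistics: commonly used for ImageNet-trained models.
# Replace them with the values from your model's training pipeline.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(rgb_hwc_uint8, mean=IMAGENET_MEAN, std=IMAGENET_STD, to_bgr=False):
    """Turn an HxWx3 uint8 RGB image into a 1x3xHxW float32 batch."""
    array = rgb_hwc_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    array = (array - mean) / std                       # per-channel, broadcast over H and W
    if to_bgr:
        array = array[..., ::-1]                       # reverse the channel axis: RGB -> BGR
    array = np.transpose(array, (2, 0, 1))             # HWC -> CHW
    array = array[np.newaxis, ...]                     # add the batch dimension
    return np.ascontiguousarray(array)                 # copy into a contiguous buffer
```

With Pillow, the input could come from np.asarray(Image.open("cat.jpg").convert("RGB").resize((224, 224))). Whether to_bgr should be set depends on how the training images were loaded, which the ONNX file itself does not record.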

Multiple Inputs and Outputs

Some ONNX models take more than one input tensor.

python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("multi_input_model.onnx")

# One dictionary entry per model input, keyed by the input's name.
inputs = {
    "input_a": np.random.randn(1, 10).astype(np.float32),
    "input_b": np.random.randn(1, 5).astype(np.float32),
}

outputs = session.run(None, inputs)
for i, output in enumerate(outputs):
    print(i, output.shape)

Passing None as the first argument asks for all outputs.
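
When a model has several outputs, a positional list is easy to misread. A minimal sketch of one alternative is to request the outputs by name and return them as a dictionary. The helper name run_named is hypothetical; it relies only on session.get_outputs() and session.run(...), and it takes the session as an argument so it does not need to import onnxruntime itself.

```python
def run_named(session, feed):
    """Run every model output and return the results keyed by output name."""
    names = [out.name for out in session.get_outputs()]
    # Requesting the names explicitly guarantees that values line up with names.
    values = session.run(names, feed)
    return dict(zip(names, values))
```

With the session and inputs from the example above, run_named(session, inputs) returns a dictionary mapping each output name to its array, so downstream code can refer to an output by name rather than by position.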

Execution Providers and Performance

Provider selection matters if you care about speed or hardware placement.

python
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CPUExecutionProvider"],
)

# Shows which providers the session actually ended up with.
print(session.get_providers())

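With the GPU runtime installed, you can request a GPU provider such as CUDAExecutionProvider, normally listed ahead of CPUExecutionProvider so that the CPU remains available as a fallback. The model file stays the same, but the deployment environment now needs the driver and library stack that the chosen provider depends on.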
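
A sketch of how provider selection might be made explicit: filter a preference list against what the installed runtime reports. The helper pick_providers is a hypothetical name, and the default preference order (CUDA first, then CPU) is an assumption about the deployment, not a recommendation from ONNX Runtime.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that are actually available, in preference order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} found in {list(available)}")
    return chosen


# Intended use (requires onnxruntime and a model file, so left as comments):
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
#   print(session.get_providers())
```

Printing session.get_providers() afterwards is still worthwhile, because it reports what the session actually uses rather than what was requested.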

Validate Against the Original Model

If the ONNX call runs but predictions look wrong, the issue is often one of:

  • preprocessing mismatch
  • wrong dtype
  • wrong batch shape
  • missing normalization
  • RGB versus BGR mismatch

The most reliable debugging step is to run the same sample through the original framework and compare the outputs. If they diverge immediately, the export or preprocessing path is the first place to investigate.
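
That comparison can be sketched with NumPy alone. The helper below, compare_outputs, is a hypothetical name, and the tolerances are assumptions that may need loosening for float16 or quantized models. It expects the reference array from the original framework to have been converted to NumPy already.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Compare an ONNX Runtime output against the original framework's output."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)

    # A shape mismatch usually points at the export or at batch handling.
    if reference.shape != candidate.shape:
        return {
            "match": False,
            "reason": f"shape {candidate.shape} != reference shape {reference.shape}",
        }

    # Compute the difference in float64 so the report itself adds no rounding noise.
    diff = np.abs(reference.astype(np.float64) - candidate.astype(np.float64))
    max_abs_diff = float(diff.max()) if diff.size else 0.0
    match = bool(np.allclose(candidate, reference, rtol=rtol, atol=atol))
    return {"match": match, "max_abs_diff": max_abs_diff}
```

A tiny max_abs_diff with match set to True suggests the export is sound and any remaining problem lives in preprocessing. A large difference on the very first sample points back at the export or at the inputs being fed.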

Common Pitfalls

A very common mistake is feeding arrays with the wrong dtype, especially NumPy's default float64 instead of float32. Another is guessing the input name instead of reading it from session.get_inputs(). Developers also often assume the ONNX model contains all preprocessing logic when the original application actually did preprocessing outside the model graph. Finally, if the model expects multiple inputs, the feed dictionary needs one entry per input name; passing a single tensor, or keying the entries by the wrong names, is a frequent source of failure.

Summary

  • Load the model with onnxruntime.InferenceSession.
  • Inspect input and output names, shapes, and types before running inference.
  • Build NumPy inputs that match the model exactly.
  • Call session.run(...) with the correct feed dictionary.
  • If predictions are wrong, verify preprocessing before blaming ONNX Runtime.
