Getting a prediction from an ONNX model in Python
Introduction
The usual way to get predictions from an ONNX model in Python is to load the model with onnxruntime.InferenceSession, inspect the input and output metadata, prepare NumPy arrays with the exact expected shape and dtype, and then call session.run(...). Most ONNX inference problems come from mismatched preprocessing rather than from the runtime call itself.
Loading the file and calling run is the easy part. What determines whether the prediction is correct is matching the model's input names, shapes, dtypes, and preprocessing assumptions exactly.
Install the Runtime
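For CPU inference, install the CPU package with pip install onnxruntime.
If you have a compatible CUDA environment and want GPU execution, install the GPU package instead with pip install onnxruntime-gpu.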
The inference code stays similar. What changes is which execution providers the runtime can use.
The Basic Inference Flow
This is the normal pattern:
- create the session
- inspect the metadata
- prepare the input dictionary
- call run
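The four steps above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: load_session and predict are hypothetical helper names, the file name model.onnx and the (1, 3, 224, 224) input shape in the usage comment are assumptions, and predict only covers models with a single input.

```python
import numpy as np


def load_session(model_path):
    """Step 1: create the session, requesting the CPU provider explicitly."""
    # Imported here so predict() can be used with any session-like object.
    import onnxruntime as ort

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def predict(session, batch):
    """Steps 2-4 for a model with exactly one input."""
    # Step 2: read the names from the metadata instead of guessing them.
    input_name = session.get_inputs()[0].name
    output_names = [out.name for out in session.get_outputs()]

    # Step 3: the feed dictionary maps input names to NumPy arrays.
    feed = {input_name: batch}

    # Step 4: run returns a list of arrays, one per requested output name.
    outputs = session.run(output_names, feed)
    return dict(zip(output_names, outputs))


# Example usage (assumes a file named model.onnx that takes a float32
# tensor of shape (1, 3, 224, 224); adjust both to your model):
#
#   session = load_session("model.onnx")
#   batch = np.zeros((1, 3, 224, 224), dtype=np.float32)
#   for name, value in predict(session, batch).items():
#       print(name, value.shape, value.dtype)
```

Returning a dictionary keyed by output name is a design choice for readability; session.run itself returns a plain list.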
Why Input Inspection Matters
You should not guess input names, shapes, or dtypes.
Typical mistakes include:
- sending float64 when the model expects float32
- forgetting the batch dimension
- using the wrong channel order for images
- guessing the input name incorrectly
Inspect first, then build the feed dictionary.
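One way to make "inspect first" concrete is to print the metadata and check a feed dictionary against it before calling run. The sketch below relies only on the name, shape, and type attributes of the entries returned by session.get_inputs(). The helpers describe_inputs and check_feed and the ONNX_TO_NUMPY table are hypothetical names introduced for illustration, and the table deliberately covers only a few common tensor types.

```python
import numpy as np

# Partial mapping from ONNX Runtime type strings to NumPy dtypes.
ONNX_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(double)": np.float64,
    "tensor(int32)": np.int32,
    "tensor(int64)": np.int64,
    "tensor(uint8)": np.uint8,
    "tensor(bool)": np.bool_,
}


def describe_inputs(session):
    """Print name, shape, and type for every model input."""
    for inp in session.get_inputs():
        # Dynamic dimensions show up as strings or None instead of integers.
        print(f"{inp.name}: shape={inp.shape}, type={inp.type}")


def check_feed(session, feed):
    """Return a list of problems; an empty list means no mismatch was found."""
    problems = []
    expected = {inp.name: inp for inp in session.get_inputs()}

    for name in feed:
        if name not in expected:
            problems.append(f"unknown input name: {name!r}")

    for name, meta in expected.items():
        if name not in feed:
            problems.append(f"missing input: {name!r}")
            continue
        array = np.asarray(feed[name])
        want = ONNX_TO_NUMPY.get(meta.type)
        if want is not None and array.dtype != np.dtype(want):
            problems.append(f"{name}: dtype {array.dtype}, expected {np.dtype(want)}")
        if array.ndim != len(meta.shape):
            problems.append(f"{name}: rank {array.ndim}, expected {len(meta.shape)}")
            continue
        for axis, (got, dim) in enumerate(zip(array.shape, meta.shape)):
            # Only fixed (integer) dimensions can be compared.
            if isinstance(dim, int) and got != dim:
                problems.append(f"{name}: axis {axis} is {got}, expected {dim}")
    return problems
```

A check like this catches the float64, missing-batch-dimension, and wrong-name mistakes listed above. It cannot catch a wrong channel order, because that produces an array with a perfectly valid shape and dtype.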
Example with Image Preprocessing
Image models are a good illustration because the preprocessing pipeline is part of getting a correct prediction. If the original model expected different normalization or channel ordering, the runtime call can succeed while the prediction is still wrong.
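As an illustration, here is a minimal preprocessing sketch in plain NumPy. Everything model-specific in it is an assumption: it supposes the model wants RGB input in NCHW layout as float32, scaled to [0, 1] and normalized with the commonly used ImageNet mean and standard deviation. The function name preprocess and its bgr_input flag are hypothetical, and the image is assumed to be already decoded and resized into an (H, W, 3) uint8 array.

```python
import numpy as np

# Assumed normalization constants (the common ImageNet values); use whatever
# the original training pipeline used.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image, bgr_input=False):
    """Turn an (H, W, 3) uint8 image into a (1, 3, H, W) float32 batch."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError(f"expected an (H, W, 3) image, got shape {image.shape}")
    if bgr_input:
        # Reverse the channel axis, for example for OpenCV-style BGR arrays.
        image = image[:, :, ::-1]

    scaled = image.astype(np.float32) / 255.0   # stay in float32, scale to [0, 1]
    normalized = (scaled - MEAN) / STD          # per-channel normalization
    chw = np.transpose(normalized, (2, 0, 1))   # HWC -> CHW
    # Add the batch dimension and make the memory layout contiguous.
    return np.ascontiguousarray(chw[np.newaxis, ...])
```

If the model was trained with different constants, a different layout, or BGR input, each of those lines has to change to match. The call to run would still succeed with the wrong values.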
Multiple Inputs and Outputs
Some ONNX models take more than one input tensor.
In that case the feed dictionary needs one entry per input name. Passing None as the first argument to session.run(...) asks for all outputs.
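A sketch of the multi-input case, under stated assumptions: the input names input_ids and attention_mask, their shapes, and the int64 dtype are made up for illustration, and run_all_outputs is a hypothetical helper. Read the real names and types from session.get_inputs().

```python
import numpy as np


def run_all_outputs(session, feed):
    """Run a multi-input model and return {output_name: array}."""
    required = [inp.name for inp in session.get_inputs()]
    missing = [name for name in required if name not in feed]
    if missing:
        raise KeyError(f"feed is missing inputs: {missing}")

    # None as the first argument requests every output, in the same order
    # as session.get_outputs().
    outputs = session.run(None, feed)
    names = [out.name for out in session.get_outputs()]
    return dict(zip(names, outputs))


# Hypothetical feed for a model whose inputs are named "input_ids" and
# "attention_mask"; the names, shapes, and dtypes are assumptions.
example_feed = {
    "input_ids": np.array([[101, 2023, 102]], dtype=np.int64),
    "attention_mask": np.ones((1, 3), dtype=np.int64),
}
```

Checking for missing names before calling run is optional, but it turns a runtime error into a message that names the missing key.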
Execution Providers and Performance
Provider selection matters if you care about speed or hardware placement.
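With the GPU package installed, you can request GPU providers such as CUDAExecutionProvider, listed in priority order ahead of CPUExecutionProvider, if the environment supports them. The model file stays the same, but the deployment requirements (GPU drivers and matching CUDA libraries, for example) change with the provider stack.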
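A minimal sketch of provider selection, assuming you prefer CUDA and want to fall back to CPU. The helpers pick_providers and create_session are hypothetical names; get_available_providers, the providers argument of InferenceSession, and session.get_providers() are the ONNX Runtime calls the sketch relies on.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that this installation offers, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} available; found {list(available)}")
    return chosen


def create_session(model_path):
    """Create a session with the best available provider from the preferred list."""
    # Imported lazily so pick_providers can be used on its own.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    session = ort.InferenceSession(model_path, providers=providers)
    # get_providers() reports what the session actually ended up using.
    print("requested:", providers, "active:", session.get_providers())
    return session
```

Comparing the requested list with session.get_providers() is worth doing once per deployment, since a provider that is installed can still fail to initialize and leave the session running on CPU.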
Validate Against the Original Model
If the ONNX call runs but predictions look wrong, the issue is often one of:
- preprocessing mismatch
- wrong dtype
- wrong batch shape
- missing normalization
- RGB versus BGR mismatch
The most reliable debugging step is to run the same sample through the original framework and compare the outputs. If they diverge immediately, the export or preprocessing path is the first place to investigate.
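The comparison itself can be a few lines of NumPy. In this sketch compare_outputs is a hypothetical helper and the tolerances are assumptions; small float differences between frameworks are normal, so pick thresholds that fit your model. The reference array is whatever the original framework produced for the same preprocessed sample, converted to NumPy.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Compare a framework output with the ONNX Runtime output for the same sample."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)

    # A shape mismatch usually points at the export or at a missing batch axis.
    if reference.shape != candidate.shape:
        return {"match": False, "reason": f"shape {candidate.shape} vs {reference.shape}"}

    max_abs_diff = float(np.max(np.abs(reference - candidate))) if reference.size else 0.0
    return {
        # Tolerance is relative to the reference values.
        "match": bool(np.allclose(candidate, reference, rtol=rtol, atol=atol)),
        "max_abs_diff": max_abs_diff,
    }
```

A large max_abs_diff on the very first sample suggests a systematic problem such as normalization or channel order, while a tiny one that only fails a strict tolerance is usually ordinary numerical noise.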
Common Pitfalls
The biggest mistake is feeding arrays with the wrong dtype, especially NumPy float64 instead of float32. Another is guessing the input name instead of reading it from session.get_inputs(). Developers also often assume the ONNX model contains all preprocessing logic when the original application actually did preprocessing outside the model graph. Finally, if the model expects multiple inputs, passing a single tensor, or a feed dictionary that is missing one of the input names, is a very common source of failure.
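The float64 problem is easy to introduce without noticing, because several everyday NumPy operations default to float64. A short demonstration, using a made-up three-pixel array:

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Dividing an integer array by a Python float produces float64.
implicit = pixels / 255.0

# Converting first keeps the whole computation in float32.
explicit = pixels.astype(np.float32) / 255.0

# Arrays built from Python floats also default to float64.
from_list = np.array([0.5, 1.5])

print(implicit.dtype, explicit.dtype, from_list.dtype)
# float64 float32 float64
```

Converting with astype(np.float32) as the first step of preprocessing avoids the issue for models that expect tensor(float) inputs.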
Summary
- Load the model with onnxruntime.InferenceSession.
- Inspect input and output names, shapes, and types before running inference.
- Build NumPy inputs that match the model exactly.
- Call session.run(...) with the correct feed dictionary.
- If predictions are wrong, verify preprocessing before blaming ONNX Runtime.

