How to run inference using an ONNX model in Python

Introduction

ONNX Runtime is one of the simplest ways to run cross-framework models in Python. You train in PyTorch or TensorFlow, export to ONNX, and then perform inference with a runtime optimized for CPU, CUDA, and other execution providers. This decouples deployment from training framework internals.

Successful ONNX inference depends on three things: matching input names and shapes, selecting the right providers, and validating output postprocessing. This guide covers those steps in a production-oriented flow.

Core Sections

1. Install runtime and inspect model I/O

bash
pip install onnx onnxruntime numpy
# for GPU, install onnxruntime-gpu in place of onnxruntime rather than both
pip install onnxruntime-gpu

python
import onnxruntime as ort

# Load the model on CPU and list the graph's input and output names.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in session.get_inputs()])
print([o.name for o in session.get_outputs()])

Always inspect graph inputs instead of guessing names.
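
Names are only part of the metadata: each input and output also reports a shape and an element type, which the next step depends on. The following is a minimal sketch, assuming a session created as above; `describe_io` and `print_io` are hypothetical helper names, not part of ONNX Runtime.

```python
def describe_io(session):
    """Return (kind, name, shape, type) rows for every graph input and output.

    `session` is expected to be an onnxruntime.InferenceSession; only its
    get_inputs() and get_outputs() methods are used here.
    """
    rows = []
    for kind, args in (("input", session.get_inputs()), ("output", session.get_outputs())):
        for arg in args:
            # Dynamic dimensions appear as None or as a symbolic string name.
            rows.append((kind, arg.name, arg.shape, arg.type))
    return rows


def print_io(session):
    """Print one line per graph input and output."""
    for kind, name, shape, dtype in describe_io(session):
        print(f"{kind:6s} {name}: shape={shape} type={dtype}")
```

Element types are reported as strings such as `tensor(float)`, which corresponds to `np.float32` on the NumPy side.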

2. Prepare input with exact dtype and shape

python
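import numpy as np

# Reuses `session` from step 1. The shape (1, 3, 224, 224) is a placeholder
# for an image model; use the shape your model reports.
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Passing None as the first argument returns every model output.
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)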

Many inference failures come from a wrong rank or dtype. Add assertions around the preprocessing pipeline to enforce the model's input contract.
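
One way to enforce that contract is to compare each array against the shape and type the model declares before calling `session.run`. The sketch below is an illustration under assumptions: `check_input` and `ORT_TO_NUMPY` are hypothetical names, the type table covers only a few common element types, and it raises `ValueError` rather than using `assert` so the check still runs under `python -O`.

```python
import numpy as np

# Partial mapping from ONNX Runtime type strings to NumPy dtypes (illustrative subset).
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(double)": np.float64,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
    "tensor(int32)": np.int32,
    "tensor(uint8)": np.uint8,
    "tensor(bool)": np.bool_,
}


def check_input(x, expected_shape, expected_type):
    """Raise ValueError if `x` does not match the model's declared input.

    expected_shape: e.g. session.get_inputs()[0].shape; dynamic dimensions
        (None or a symbolic string) match any size.
    expected_type: e.g. session.get_inputs()[0].type, such as "tensor(float)".
    """
    want_dtype = ORT_TO_NUMPY.get(expected_type)
    if want_dtype is None:
        raise ValueError(f"unmapped element type: {expected_type}")
    if x.dtype != want_dtype:
        raise ValueError(f"dtype {x.dtype} != expected {np.dtype(want_dtype)}")
    if x.ndim != len(expected_shape):
        raise ValueError(f"rank {x.ndim} != expected {len(expected_shape)}")
    for axis, (got, want) in enumerate(zip(x.shape, expected_shape)):
        # Only fixed integer dimensions are compared; dynamic ones are skipped.
        if isinstance(want, int) and got != want:
            raise ValueError(f"axis {axis}: size {got} != expected {want}")
```

Calling `check_input(x, inp.shape, inp.type)` with `inp = session.get_inputs()[0]` just before `session.run` turns a cryptic runtime error into a message that names the offending axis or dtype.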

3. Enable GPU provider when available

python
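import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Shows the providers the session actually enabled, which may differ from the request.
print(session.get_providers())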

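Provider order matters. ONNX Runtime treats the list as a priority order: each graph node is assigned to the first listed provider that supports it, and operators the CUDA provider cannot handle fall back to the CPU provider. Check the output of session.get_providers() to confirm that CUDA was actually enabled, because a CPU-only install runs everything on the CPU provider.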
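
To make the provider choice explicit rather than implicit, you can filter a preferred list against what the installed package offers. This is a small sketch; `pick_providers` is a hypothetical helper, and the only ONNX Runtime calls it relies on (shown in the comments) are `ort.get_available_providers()` and the `providers` argument of `InferenceSession`.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that this install offers, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"none of {list(preferred)} available; got {list(available)}")
    return chosen


# Typical use (needs onnxruntime and a model file, so it is left as a comment):
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

A provider that is listed as available may still fail to initialize, for example when the GPU libraries are missing on the host, so `session.get_providers()` remains the final check.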

4. Production considerations

Pin the ONNX Runtime version and the model opset to avoid compatibility drift. For latency-sensitive services, warm up the session at startup and reuse it across requests instead of recreating it on every call.
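
Session reuse and warm-up can be sketched as a small registry that builds each session once and a helper that runs a few throwaway inferences. `SessionRegistry` and `warm_up` are hypothetical names, and the number of warm-up runs is an arbitrary assumption; the session factory is injected so the same pattern works with whatever providers you choose.

```python
import threading


class SessionRegistry:
    """Create each model's session once and hand the same object to every caller."""

    def __init__(self, factory):
        # `factory` maps a model path to a ready session, for example:
        #   lambda p: onnxruntime.InferenceSession(p, providers=["CPUExecutionProvider"])
        self._factory = factory
        self._sessions = {}
        self._lock = threading.Lock()

    def get(self, model_path):
        # The lock prevents two threads from building the same session twice.
        with self._lock:
            if model_path not in self._sessions:
                self._sessions[model_path] = self._factory(model_path)
            return self._sessions[model_path]


def warm_up(session, feed, runs=3):
    """Run a few throwaway inferences so first-request latency is paid at startup."""
    for _ in range(runs):
        session.run(None, feed)
```

At service startup you would build the registry, call `get` for each model, and pass a dummy input of the right shape and dtype to `warm_up`; request handlers then only ever call `get`.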

For classification pipelines, centralize postprocessing (softmax, argmax, label mapping) and version it alongside model artifacts to prevent mismatched outputs across services.
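
A centralized postprocessing module might look like the sketch below. It assumes the model emits raw logits of shape (batch, num_classes); `classify` and `POSTPROCESS_VERSION` are hypothetical names used only for illustration.

```python
import numpy as np

POSTPROCESS_VERSION = "1"  # hypothetical tag, bumped whenever labels or logic change


def softmax(logits, axis=-1):
    # Subtract the row maximum first so exp() cannot overflow.
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def classify(logits, labels):
    """Map a (batch, num_classes) logits array to (label, probability) pairs."""
    if logits.shape[-1] != len(labels):
        raise ValueError("label map size does not match the model's class count")
    probs = softmax(logits)
    top = np.argmax(probs, axis=-1)
    return [(labels[i], float(probs[row, i])) for row, i in enumerate(top)]
```

The size check catches the case where a label map from one model version is paired with a different model, which is the mismatch this section warns about.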

5. Build repeatable verification around ONNX inference pipelines in Python

Once the implementation works, lock in its behavior with repeatable verification artifacts. At minimum, maintain one baseline case, one edge case, and one failure-path case, each with its expected outcome written down in plain language. This prevents accidental regressions when dependencies, runtime versions, or surrounding infrastructure change.

Use lightweight automation for these checks so they run in local development and CI. A practical pattern is to keep a tiny fixture dataset and one command that executes the critical path end to end. If that command fails, engineers can reproduce issues quickly without rebuilding the entire environment from scratch.

text
verification checklist
- baseline scenario with expected output
- edge scenario with constrained input
- failure scenario with expected error behavior
- runtime and dependency versions captured

Treat this checklist as versioned, code-adjacent documentation. Updating ONNX inference pipelines in Python without updating their verification contract is a common source of drift and support incidents.
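
The first three checklist items can be automated with a single function that CI and local development both call. This is a sketch under assumptions: `run_checks` is a hypothetical name, the tolerances are placeholders to tune per model, and `predict` stands for any callable that wraps your session.

```python
import numpy as np


def run_checks(predict, baseline_input, baseline_expected, edge_input, bad_input,
               rtol=1e-4, atol=1e-5):
    """Run the baseline, edge, and failure scenarios and return a report dict.

    `predict` maps an input array to an output array, for example
    lambda x: session.run(None, {input_name: x})[0].
    """
    report = {}

    # Baseline: output must match the stored expectation within tolerance.
    got = predict(baseline_input)
    report["baseline"] = bool(np.allclose(got, baseline_expected, rtol=rtol, atol=atol))

    # Edge: a constrained input should still produce finite numbers.
    report["edge"] = bool(np.all(np.isfinite(predict(edge_input))))

    # Failure path: a malformed input must raise instead of returning output.
    try:
        predict(bad_input)
        report["failure"] = False
    except Exception:
        report["failure"] = True

    return report
```

A CI job can fail the build when any value in the report is `False`, and store the report next to the captured runtime and dependency versions.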

6. Operational guidance and maintenance strategy

The long-term reliability of ONNX inference pipelines in Python depends on observability and change discipline. Add structured logging and targeted metrics around the most failure-prone stages so you can quickly answer three questions: what input was processed, what branch was taken, and why the output changed. Incident response improves dramatically when these signals exist before the outage.
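
As one possible shape for that logging, the sketch below wraps `session.run` and emits a single JSON record per call with input shapes, dtypes, status, and latency. `run_logged`, the logger name, and the record fields are all assumptions for illustration; feed values are assumed to be NumPy arrays.

```python
import json
import logging
import time

logger = logging.getLogger("onnx_inference")  # hypothetical logger name


def run_logged(session, feed, model_version="unknown"):
    """Run inference and emit one JSON log record describing the call."""
    record = {
        "model_version": model_version,
        "inputs": {k: {"shape": list(v.shape), "dtype": str(v.dtype)} for k, v in feed.items()},
    }
    start = time.perf_counter()
    try:
        outputs = session.run(None, feed)
        record["status"] = "ok"
        # Non-tensor outputs (sequences, maps) have no shape and are logged as [].
        record["output_shapes"] = [list(getattr(o, "shape", ())) for o in outputs]
        return outputs
    except Exception as exc:
        record["status"] = "error"
        record["error"] = type(exc).__name__
        raise
    finally:
        # Runs on both success and failure, so every call produces a record.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(record))
```

Logging shapes and dtypes rather than raw tensors keeps records small and avoids writing user data to logs.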

Also define ownership for changes. When libraries, runtime versions, or platform policies evolve, someone should review compatibility and re-run validation artifacts before rollout. Small proactive checks are cheaper than emergency rollback windows.

Finally, schedule periodic contract checks even when no incident is active. Silent drift accumulates over time through dependency updates and environment differences. Preventive checks keep ONNX inference pipelines in Python predictable and reduce production surprises.
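
A periodic contract check can be as simple as recording the model's input and output signature once and comparing it on a schedule. The sketch below assumes the recorded contract is stored as JSON next to the model; `io_contract` and `contract_drift` are hypothetical helper names.

```python
def io_contract(session):
    """Summarize a session's inputs and outputs as plain, JSON-serializable data."""
    def describe(args):
        return [{"name": a.name, "shape": list(a.shape), "type": a.type} for a in args]

    return {
        "inputs": describe(session.get_inputs()),
        "outputs": describe(session.get_outputs()),
    }


def contract_drift(recorded, current):
    """Return the sections ("inputs", "outputs") where the two contracts differ."""
    return [key for key in ("inputs", "outputs") if recorded.get(key) != current.get(key)]
```

A scheduled job would load the recorded contract, call `io_contract` on a freshly created session, and alert when `contract_drift` returns a non-empty list. Recording `onnxruntime.__version__` alongside the contract makes it easier to tie a change to a runtime upgrade.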

Common Pitfalls

  • Guessing input tensor names instead of reading model metadata.
  • Feeding wrong dtype or shape and interpreting errors as model corruption.
  • Installing the CPU-only runtime but expecting the CUDA provider to be available.
  • Recreating inference sessions per request and wasting startup overhead.
  • Versioning model files without versioning postprocessing label maps.

Summary

Running ONNX inference in Python is straightforward with ONNX Runtime once input/output contracts are explicit. Inspect model metadata, enforce preprocessing shape and dtype, and select providers intentionally. With version pinning, session reuse, and stable postprocessing, ONNX becomes a dependable deployment format across environments.

