ML Systems & Infrastructure
Real-Time vs Batch Inference
When a user types a search query, your model has roughly 50-100ms to respond before latency becomes perceptible. That constraint defines online inference: a request arrives, the model runs a forward pass, and the result is returned synchronously before the caller times out. Every architectural decision flows from that latency budget.

The Serving Stack
The typical architecture looks like this: a load balancer distributes traffic across a pool of model servers (TorchServe, Triton Inference Server, or TF Serving), each backed by one or more GPUs. The model server owns the GPU lifecycle — it loads the model into VRAM on startup, keeps it resident, and processes requests against the live weights.
A request hits the serving stack in stages:
- Load balancer routes the request to a healthy model server instance
- Preprocessing — feature store lookup (5-10ms), tokenization, normalization
- Batching — the server collects the request into a batch with other concurrent requests
- GPU forward pass — the model runs inference on the batch
- Postprocessing — format the output, apply business logic, return the response
If the feature store lookup adds 8ms and tokenization adds 3ms, you have already consumed 11ms of your 50ms budget before a single GPU cycle runs. This is why profiling the full pipeline matters — the model forward pass is often not the bottleneck.
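To see where the budget goes, each stage can be timed separately. The sketch below is a minimal illustration, not a production profiler; the StageTimer class, the stage names, and the 50ms default are assumptions chosen to match the example above.

```python
import time
from contextlib import contextmanager

BUDGET_MS = 50.0  # assumed end-to-end budget, taken from the example in the text


def remaining_budget_ms(stage_ms, budget_ms=BUDGET_MS):
    """Return the budget left after the measured stages."""
    return budget_ms - sum(stage_ms.values())


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage (hypothetical helper)."""

    def __init__(self):
        self.stage_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stage_ms[name] = self.stage_ms.get(name, 0.0) + elapsed_ms


# Worked example with the numbers from the text: 8 ms lookup + 3 ms tokenization.
measured = {"feature_store_lookup": 8.0, "tokenization": 3.0}
print(remaining_budget_ms(measured))  # 39.0
```

In a request handler, each stage would be wrapped as `with timer.stage("tokenization"): ...` and the per-stage totals exported as metrics, so the forward pass can be compared against everything around it.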
Dynamic Batching
GPUs are parallel processors, so running one prediction at a time wastes most of the hardware. When a single sample does not saturate the device, an A100 can process 1 sample or 64 samples in roughly the same wall-clock time because the matrix multiplications parallelize across thousands of CUDA cores.
Model servers solve this by collecting incoming requests over a short window (5-10ms) and grouping them into a single batched forward pass. From the caller's perspective, each request is independent. From the GPU's perspective, it processes 32 or 64 samples simultaneously.
The impact is dramatic: a roughly 32x throughput improvement at the cost of about 10ms of added latency. That is the core trade-off of dynamic batching. For most user-facing applications with a 50-100ms budget, 10ms is an acceptable price.
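The collection window can be sketched with a plain thread-safe queue. This is a simplified illustration of the idea, not how TorchServe, Triton, or TF Serving implement it; the function name collect_batch and its defaults (32 requests, 10ms) are assumptions.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size=32, window_ms=10.0):
    """Block for the first request, then gather more until the batch
    is full or the collection window closes."""
    batch = [request_queue.get()]  # wait for at least one request
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: run with what we have
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived inside the window
    return batch
```

A serving loop would call collect_batch repeatedly, run one forward pass per returned batch, and route each output back to its caller. The window starts when the first request arrives, so an idle server adds no delay until traffic appears.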
Auto-scaling on model server queue depth responds faster than scaling on CPU utilization. CPU utilization lags the actual bottleneck because the GPU, not the CPU, is the constraint. When queue depth crosses a threshold (for example, 20 pending requests), trigger a scale-out event.
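One way to turn queue depth into a replica count is a proportional rule, which is the same shape as the formula the Kubernetes Horizontal Pod Autoscaler uses. The sketch below is an assumption about how the threshold might be applied, not a prescribed policy; desired_replicas, the replica bounds, and the use of average depth per replica are all hypothetical.

```python
import math


def desired_replicas(current_replicas, avg_queue_depth,
                     target_depth=20, min_replicas=1, max_replicas=16):
    """Scale the replica count in proportion to how far the average
    per-replica queue depth is from the target (20 pending requests)."""
    raw = math.ceil(current_replicas * avg_queue_depth / target_depth)
    # Clamp to assumed bounds so a spike cannot request unlimited GPUs.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 20))  # 4: at target, no change
print(desired_replicas(4, 35))  # 7: depth is 1.75x target
```

Unlike a bare threshold, this rule also scales in when the queue is shallow. A real policy would usually add a cooldown so that replicas are not removed while new pods are still loading weights.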
Model Versioning in Production
A new model version must roll out with zero downtime. The standard approach is a blue-green deployment with a canary phase:
- Stand up the new model version on a separate server group
- Wait for the model to load into GPU memory (30-120 seconds for large models)
- Shift 1-5% of traffic as a canary
- Monitor error rates, latency percentiles, and prediction quality for 30 minutes
- If the canary passes, promote to 100% traffic
- If it fails, drain traffic back to the previous version in seconds
This is more complex than a typical web service deployment because the warm-up phase (loading multi-GB model weights into VRAM) must complete before the new version can serve any traffic.
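The canary split and the promotion check can be sketched as two small functions. This is an illustrative sketch under stated assumptions: route_version, canary_passes, the metric names, and the thresholds (0.1 percentage points of extra errors, 10% higher p99 latency) are hypothetical, and prediction quality checks are left out for brevity.

```python
import hashlib


def route_version(request_id, canary_percent):
    """Deterministically send about canary_percent of requests to the
    new version, so a given request ID always lands on the same side."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"


def canary_passes(stable, canary,
                  max_error_rate_increase=0.001, max_p99_ratio=1.10):
    """Compare canary metrics against the stable version (assumed thresholds)."""
    error_ok = canary["error_rate"] <= stable["error_rate"] + max_error_rate_increase
    latency_ok = canary["p99_ms"] <= stable["p99_ms"] * max_p99_ratio
    return error_ok and latency_ok
```

Promotion then amounts to raising canary_percent to 100 once canary_passes has held for the full observation window, and rollback amounts to setting it to 0. Neither step requires reloading weights on the previous version, which is why draining traffic back takes seconds.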
Failure Modes
Online serving introduces failure modes that batch pipelines never encounter:
GPU OOM — A batch of unexpectedly long inputs exhausts VRAM. Set a maximum input length and batch memory budget. Reject oversized requests with a 413 error rather than crashing the server and taking down all concurrent requests.
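A minimal sketch of both guards follows. It assumes token count is a reasonable proxy for memory and that batches are padded to their longest input; the limits, the RequestTooLarge exception, and the function names are hypothetical.

```python
MAX_INPUT_TOKENS = 512     # assumed per-request limit
MAX_BATCH_TOKENS = 16_384  # assumed budget for padded tokens per batch


class RequestTooLarge(Exception):
    """Maps to an HTTP 413 response in the serving layer."""
    status_code = 413


def validate_request(token_count, max_input_tokens=MAX_INPUT_TOKENS):
    """Reject an oversized input before it can reach the GPU."""
    if token_count > max_input_tokens:
        raise RequestTooLarge(
            f"{token_count} tokens exceeds limit of {max_input_tokens}"
        )


def take_batch(pending_token_counts, max_batch_tokens=MAX_BATCH_TOKENS):
    """Take requests in arrival order while the padded batch
    (batch size x longest input) stays within the memory budget."""
    batch = []
    longest = 0
    for count in pending_token_counts:
        new_longest = max(longest, count)
        if (len(batch) + 1) * new_longest > max_batch_tokens:
            break  # adding this request would exceed the budget
        batch.append(count)
        longest = new_longest
    return batch
```

Keeping the per-request limit at or below the batch budget ensures that any request passing validate_request can always be served in a batch of one.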
Cold start — A new model server pod takes 60-120 seconds to load weights. Kubernetes readiness probes prevent traffic from hitting an unready pod, but autoscale lag still creates queuing spikes during traffic bursts.
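On the server side, a readiness probe needs an endpoint that reports failure until the weights are resident. The sketch below uses only the Python standard library; the /ready path, the load_model placeholder, and the handler name are assumptions, and real model servers expose their own health endpoints.

```python
import threading
from http.server import BaseHTTPRequestHandler

model_ready = threading.Event()  # set once weights are loaded


def load_model():
    """Placeholder for the slow weight load described in the text."""
    # ... load weights into GPU memory here (hypothetical) ...
    model_ready.set()


def readiness_status():
    """200 when the model can serve traffic, 503 while still loading."""
    return 200 if model_ready.is_set() else 503


class ReadyHandler(BaseHTTPRequestHandler):
    """Answers probe requests on the assumed /ready path."""

    def do_GET(self):
        if self.path == "/ready":
            self.send_response(readiness_status())
        else:
            self.send_response(404)
        self.end_headers()
```

Running load_model in a background thread lets the HTTP server start immediately and answer 503 during the load, so the orchestrator sees a live but unready pod instead of a connection failure.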
Request queuing — Under a traffic burst, the model server queue fills faster than the GPU drains it. Set a queue depth limit and return 503 rather than letting latency degrade from 50ms to 5 seconds. A fast failure is better than a slow one.
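Load shedding of this kind can be sketched with a bounded queue that refuses new work when full. The Overloaded exception, the enqueue_or_shed function, and the depth of 64 are assumptions for illustration.

```python
import queue

MAX_QUEUE_DEPTH = 64  # assumed limit; tune against the latency budget


class Overloaded(Exception):
    """Maps to an HTTP 503 response in the serving layer."""
    status_code = 503


def enqueue_or_shed(request_queue, request):
    """Admit the request if there is room, otherwise fail fast."""
    try:
        request_queue.put_nowait(request)
    except queue.Full:
        raise Overloaded("queue depth limit reached") from None


pending = queue.Queue(maxsize=MAX_QUEUE_DEPTH)
enqueue_or_shed(pending, {"id": "req-1"})
```

A reasonable starting point for the limit is the number of requests the GPU can drain within the latency budget, since anything queued beyond that would time out anyway.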