Real-Time vs Batch Inference

Topics Covered

Online Serving Patterns

The Serving Stack

Dynamic Batching

Model Versioning in Production

Failure Modes

Batch Prediction Pipelines

Architecture

The Cost Math

Use Cases Where Batch Dominates

Freshness: The Core Trade-Off

Failure Handling

Near-Real-Time with Micro-Batching

How Micro-Batching Works

Use Cases Where Micro-Batching Wins

Latency and Cost Characteristics

Operational Concerns

Choosing the Right Pattern

When Online Inference Is Required

When Batch Is Sufficient

When Hybrid Makes Sense

The Decision Framework

Real-World Examples

Migration Path

When a user types a search query, your model has roughly 50-100ms to respond before latency becomes perceptible. That constraint defines online inference: a request arrives, the model runs a forward pass, and the result is returned synchronously before the caller times out. Every architectural decision flows from that latency budget.

[Figure: side-by-side comparison of online inference, which computes predictions on demand, and batch inference, which precomputes all predictions overnight]

The Serving Stack

The typical architecture looks like this: a load balancer distributes traffic across a pool of model servers (TorchServe, Triton Inference Server, or TF Serving), each backed by one or more GPUs. The model server owns the GPU lifecycle — it loads the model into VRAM on startup, keeps it resident, and processes requests against the live weights.

A request hits the serving stack in stages:

  1. Load balancer routes the request to a healthy model server instance
  2. Preprocessing — feature store lookup (5-10ms), tokenization, normalization
  3. Batching — the server collects the request into a batch with other concurrent requests
  4. GPU forward pass — the model runs inference on the batch
  5. Postprocessing — format the output, apply business logic, return the response

If the feature store lookup adds 8ms and tokenization adds 3ms, you have already consumed 11ms of your 50ms budget before a single GPU cycle runs. This is why profiling the full pipeline matters — the model forward pass is often not the bottleneck.
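The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a profiler: the helper names (`timed`, `remaining_budget`, `slowest_stage`) are hypothetical, and the stage timings are the 8ms and 3ms figures from the text rather than measurements.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings_ms: dict):
    """Record the wall-clock duration of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


def remaining_budget(budget_ms: float, timings_ms: dict) -> float:
    """Latency budget left after the recorded stages have run."""
    return budget_ms - sum(timings_ms.values())


def slowest_stage(timings_ms: dict) -> str:
    """Name of the stage that consumed the most time."""
    return max(timings_ms, key=timings_ms.get)


# Stage timings taken from the example in the text (assumed, not measured).
measured = {"feature_store_lookup": 8.0, "tokenization": 3.0}
print(remaining_budget(50.0, measured))  # 39.0
print(slowest_stage(measured))           # feature_store_lookup
```

In a real handler, each stage would be wrapped as `with timed("tokenization", timings): ...` so that the same dictionary is filled from live requests instead of hard-coded values.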

Dynamic Batching

GPUs are parallel processors, so running one prediction at a time wastes most of the hardware. A single A100 can often process 64 samples in not much more wall-clock time than 1 sample, because the matrix multiplications parallelize across thousands of CUDA cores.

Model servers solve this by collecting incoming requests over a short window (5-10ms) and grouping them into a single batched forward pass. From the caller's perspective, each request is independent. From the GPU's perspective, it processes 32 or 64 samples simultaneously.
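The collection window described above can be sketched with a standard-library queue. This is a simplified illustration of the idea, not how TorchServe, Triton, or TF Serving implement it; the function name `collect_batch` and its defaults (64 requests, 10ms window) are assumptions chosen to match the numbers in the text.

```python
import queue
import time


def collect_batch(requests: queue.Queue,
                  max_batch_size: int = 64,
                  window_s: float = 0.010) -> list:
    """Gather requests until the batch is full or the window expires."""
    # Block until the first request arrives; the window starts from there.
    batch = [requests.get()]
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window elapsed: run with whatever has been collected
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived inside the window
    return batch


# Demo: 100 queued requests produce one full batch and one partial batch.
pending = queue.Queue()
for request_id in range(100):
    pending.put(request_id)

first = collect_batch(pending)
second = collect_batch(pending)
print(len(first), len(second))  # 64 36
```

The second batch shows the latency side of the trade-off: it waits out the full window before running, because it never fills up.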

The impact is dramatic:

 
Without batching:  1 request  per forward pass  → 200 predictions/sec
With batching:     64 requests per forward pass  → 6,400 predictions/sec
Throughput gain:   32x
Added latency:     up to 10ms (the collection window)

This 32x throughput improvement costs up to 10ms of added latency, and that is the core trade-off of dynamic batching. For most user-facing applications with a 50-100ms budget, 10ms is an acceptable price.
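The throughput figures above can be reproduced with simple arithmetic. The per-pass times used here (5ms for a single request, 10ms for a batch of 64) are not stated in the text; they are back-calculated from the 200 and 6,400 predictions/sec figures and should be read as assumptions.

```python
def throughput_per_sec(batch_size: int, pass_ms: float) -> float:
    """Predictions per second for one GPU running back-to-back forward passes."""
    return batch_size * 1000.0 / pass_ms


# Assumed pass times, derived from the throughput figures in the text.
unbatched = throughput_per_sec(batch_size=1, pass_ms=5.0)
batched = throughput_per_sec(batch_size=64, pass_ms=10.0)

print(unbatched)            # 200.0
print(batched)              # 6400.0
print(batched / unbatched)  # 32.0
```

Under these assumptions the batched pass takes twice as long as the single-request pass, which is why a batch of 64 yields a 32x gain rather than 64x.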

Key Insight

Autoscaling GPU model servers on queue depth is more responsive than autoscaling on CPU utilization. When the model server queue depth crosses a threshold (for example, 20 pending requests), trigger a scale-out event. CPU utilization lags the actual bottleneck, because the GPU is the constraint, not the CPU.
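A queue-depth scaling rule might look like the sketch below. The threshold of 20 comes from the text; the proportional scaling formula, the replica cap, and the function name `desired_replicas` are assumptions for illustration, and scale-in is deliberately left out.

```python
import math


def desired_replicas(current: int,
                     avg_queue_depth: float,
                     threshold: float = 20.0,
                     max_replicas: int = 8) -> int:
    """Scale out in proportion to how far queue depth exceeds the threshold."""
    if avg_queue_depth <= threshold:
        return current  # at or under the threshold: no scale-out event
    scaled = math.ceil(current * avg_queue_depth / threshold)
    return min(scaled, max_replicas)


print(desired_replicas(current=2, avg_queue_depth=10))   # 2
print(desired_replicas(current=2, avg_queue_depth=50))   # 5
print(desired_replicas(current=4, avg_queue_depth=100))  # 8 (capped)
```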

Model Versioning in Production

Deploying a new model version requires a zero-downtime rollout. A standard approach is blue-green deployment with a canary phase:

  1. Stand up the new model version on a separate server group
  2. Wait for the model to load into GPU memory (30-120 seconds for large models)
  3. Shift 1-5% of traffic as a canary
  4. Monitor error rates, latency percentiles, and prediction quality for 30 minutes
  5. If the canary passes, promote to 100% traffic
  6. If it fails, drain traffic back to the previous version in seconds

This is more complex than a typical web service deployment because the warm-up phase (loading multi-GB model weights into VRAM) must complete before the new version can serve any traffic.
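The canary steps in the rollout can be sketched as a traffic splitter plus a promotion check. Everything named here is hypothetical: the `CanaryStats` fields, the tolerance values, and the "promote"/"rollback" labels are illustrative assumptions, and prediction-quality monitoring is omitted for brevity.

```python
import random
from dataclasses import dataclass


@dataclass
class CanaryStats:
    error_rate: float      # fraction of failed requests, e.g. 0.01 for 1%
    p99_latency_ms: float


def choose_version(canary_fraction: float, rng=random) -> str:
    """Route one request: a small fraction goes to the canary, the rest to stable."""
    return "canary" if rng.random() < canary_fraction else "stable"


def canary_decision(stable: CanaryStats,
                    canary: CanaryStats,
                    max_error_delta: float = 0.005,
                    max_latency_ratio: float = 1.2) -> str:
    """Compare the canary with the stable version after the monitoring window."""
    if canary.error_rate > stable.error_rate + max_error_delta:
        return "rollback"
    if canary.p99_latency_ms > stable.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


stable = CanaryStats(error_rate=0.010, p99_latency_ms=80.0)
print(canary_decision(stable, CanaryStats(0.011, 85.0)))   # promote
print(canary_decision(stable, CanaryStats(0.030, 85.0)))   # rollback
print(canary_decision(stable, CanaryStats(0.010, 120.0)))  # rollback
```

In practice the split usually lives in the load balancer or service mesh rather than in application code; the function only shows the routing logic.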

Failure Modes

Online serving introduces failure modes that batch pipelines never encounter:

GPU OOM — A batch of unexpectedly long inputs exhausts VRAM. Set a maximum input length and batch memory budget. Reject oversized requests with a 413 error rather than crashing the server and taking down all concurrent requests.
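One way to enforce both limits is an admission step that runs before the batch reaches the GPU. The sketch below uses padded token count as a stand-in for VRAM usage, which is an assumption; the limits and the function name `plan_batch` are hypothetical.

```python
def plan_batch(lengths: list,
               max_input_tokens: int = 512,
               max_batch_tokens: int = 8192):
    """Split request indices into accepted, deferred, and rejected.

    Rejected requests exceed the per-input limit (the caller would answer 413).
    Deferred requests are valid but would push the padded batch over budget,
    so they wait for the next batch.
    """
    accepted, deferred, rejected = [], [], []
    longest = 0
    for index, length in enumerate(lengths):
        if length > max_input_tokens:
            rejected.append(index)
            continue
        # Padded cost: every sequence is padded to the longest one in the batch.
        new_longest = max(longest, length)
        if (len(accepted) + 1) * new_longest > max_batch_tokens:
            deferred.append(index)
            continue
        accepted.append(index)
        longest = new_longest
    return accepted, deferred, rejected


print(plan_batch([100, 600, 512, 200], max_input_tokens=512, max_batch_tokens=1024))
# ([0, 2], [3], [1])
```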

Cold start — A new model server pod takes 60-120 seconds to load weights. Kubernetes readiness probes prevent traffic from hitting an unready pod, but autoscale lag still creates queuing spikes during traffic bursts.
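The readiness behaviour can be sketched as a flag that flips only after the weights finish loading. This is an illustrative stand-in, not a real model server: `ModelServer`, `readiness_status`, and the fake loader are hypothetical, and a real deployment would expose the status over an HTTP endpoint that the readiness probe polls.

```python
import threading
import time


class ModelServer:
    """Reports not-ready until the (slow) weight load has completed."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._ready = threading.Event()
        self.model = None

    def start(self) -> threading.Thread:
        """Load weights in the background so the probe endpoint can answer early."""
        worker = threading.Thread(target=self._load, daemon=True)
        worker.start()
        return worker

    def _load(self) -> None:
        self.model = self._load_fn()
        self._ready.set()

    def readiness_status(self) -> int:
        """HTTP-style status a readiness probe would see."""
        return 200 if self._ready.is_set() else 503


def fake_load():
    time.sleep(0.05)  # stands in for the 60-120 second weight load
    return "weights"


server = ModelServer(fake_load)
worker = server.start()
worker.join()
print(server.readiness_status())  # 200
```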

Request queuing — Under a traffic burst, the model server queue fills faster than the GPU drains it. Set a queue depth limit and return 503 rather than letting latency degrade from 50ms to 5 seconds. A fast failure is better than a slow one.
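A bounded queue gives the fail-fast behaviour described above. In this minimal sketch the limit of 2 is only for the demo, and the helper name `try_enqueue` is hypothetical; the caller is assumed to translate a `False` result into a 503 response.

```python
import queue


def try_enqueue(pending: queue.Queue, request) -> bool:
    """Accept the request if there is room; otherwise refuse immediately."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        return False  # caller should respond with 503 instead of waiting


pending = queue.Queue(maxsize=2)  # queue depth limit
print(try_enqueue(pending, "req-1"))  # True
print(try_enqueue(pending, "req-2"))  # True
print(try_enqueue(pending, "req-3"))  # False
```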