Model Serving Architecture

Serving patterns overview

A trained model sitting on a researcher's laptop produces zero value. Serving is where ML meets users — and the architecture you choose determines latency, cost, reliability, and how fast you can ship model improvements. There are two fundamental patterns, and the choice between them shapes everything downstream.

Embedded Serving

The simplest pattern: load the model directly into your application process. Your Flask or FastAPI service imports the model file, loads it into memory at startup, and runs inference on the same machine that handles HTTP requests. No network hop to a separate service, no additional infrastructure to manage.

This works well when the model is small (under 500MB), inference is CPU-friendly (tree-based models, small neural networks), and you have a single application that needs predictions. A fraud scoring microservice loading an XGBoost model is the classic example — the model file is 50MB, inference takes 2ms on CPU, and the application can scale horizontally by adding more pods.
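A minimal sketch of the embedded pattern, under stated assumptions: `FraudModel`, its weights, and the feature names are hypothetical stand-ins for a real XGBoost model file, and `handle_score_request` represents the body of a Flask or FastAPI route. The point is only that the model is loaded once per process and called in-process.

```python
import math


class FraudModel:
    """Hypothetical stand-in for a small CPU model such as an XGBoost booster."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        # Logistic score over known features; missing features count as 0.
        z = self.bias + sum(w * features.get(name, 0.0) for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))


def load_model():
    # A real service would read the model file from disk here, at startup.
    return FraudModel(weights={"amount_usd": 0.002, "new_device": 1.5}, bias=-3.0)


# Loaded once per process, then shared by every request handler.
MODEL = load_model()


def handle_score_request(payload):
    """What the route body would do: no network hop, just a function call."""
    score = MODEL.predict(payload["features"])
    return {"fraud_score": round(score, 4), "flagged": score >= 0.5}
```

Scaling this service horizontally means every new pod runs `load_model()` again and holds its own copy of the model in memory.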

The problem: the model lifecycle is coupled to the application lifecycle. Updating the model means redeploying the entire application. If the model grows to 2GB, every application replica carries that extra 2GB in memory. If you need GPU inference, every application instance needs a GPU, even if the GPU sits idle 90% of the time.

Dedicated Model Server

The decoupled pattern: run the model on a specialized inference server and call it over the network. Your application sends a prediction request via gRPC or HTTP, the model server runs inference, and returns the result. The model server handles model loading, batching, GPU management, and version routing. Your application handles business logic.
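From the application's side, the call can be sketched as follows. This assumes the model server speaks the KServe v2 inference protocol over HTTP (Triton is one server that does); the host `model-server:8000`, the model name, and the tensor name `INPUT__0` are illustrative assumptions, not values from this lesson.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, rows, input_name="INPUT__0"):
    """Build a POST request for the v2 endpoint /v2/models/<name>/infer."""
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same number of features")
    body = {
        "inputs": [{
            "name": input_name,
            "shape": [len(rows), n_cols],
            "datatype": "FP32",
            # Tensor data is sent flattened in row-major order.
            "data": [float(x) for r in rows for x in r],
        }]
    }
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_infer_request("http://model-server:8000", "fraud_scorer",
                          [[120.0, 1.0], [15.5, 0.0]])
# Sending it is one call: urllib.request.urlopen(req, timeout=0.5)
```

The application never touches weights or GPUs; it only needs the endpoint, the tensor layout, and a timeout that accounts for the extra network hop.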

Figure: Model server architecture with load balancing and auto-scaling

Dedicated servers win when models are large (hundreds of MB to hundreds of GB), when you need GPU inference, when multiple applications share the same model, or when you need to update models independently from application code. The cost: an additional network hop (typically 1-5ms within a datacenter) and the operational complexity of running the serving infrastructure.

The Decision Framework

The question is not which pattern is better — it is which constraints dominate your situation. Embedded serving is the right default when model size is under 500MB, inference runs on CPU, and only one service needs predictions. Dedicated serving is the right default when models exceed 1GB, inference requires GPUs, multiple services consume predictions, or you need to update models without redeploying applications. Most production ML teams end up on dedicated servers because models grow, GPU requirements emerge, and the operational overhead of coupling model and application lifecycles becomes unsustainable.
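The framework can be written down as a small rule function. This is a sketch of the thresholds stated above, not a substitute for measuring your own workload; the function and parameter names are illustrative.

```python
def choose_serving_pattern(model_size_mb, needs_gpu, consumer_services, independent_updates):
    """Return (pattern, reasons) using the thresholds from the framework above."""
    reasons = []
    if model_size_mb > 1024:
        reasons.append("model exceeds 1GB")
    if needs_gpu:
        reasons.append("inference requires a GPU")
    if consumer_services > 1:
        reasons.append("multiple services consume predictions")
    if independent_updates:
        reasons.append("model must update without redeploying the application")
    if reasons:
        return "dedicated", reasons
    if model_size_mb < 500:
        return "embedded", ["small CPU model used by a single service"]
    # The text gives no default between 500MB and 1GB, so the sketch does not pick one.
    return "either", ["size is between the two thresholds; measure memory and startup time"]
```

Any single dedicated-serving trigger is enough to tip the decision, which matches the observation that most teams end up on dedicated servers as requirements accumulate.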

Interview Tip

In system design interviews, default to a dedicated model server behind a load balancer. This shows you understand the separation of concerns between business logic and inference, and it naturally leads to discussions about scaling, versioning, and deployment strategies that interviewers want to hear.

Model servers and runtimes

A model server is a specialized inference engine. It handles the parts of serving that are common to every model: loading weights into memory, managing GPU allocation, batching requests, serving multiple model versions, and exposing a prediction API. You write zero inference code — you give the server a model file and it handles the rest.

TensorFlow Serving (TFServing)

The original production model server, built by Google for TensorFlow models. TFServing loads SavedModel files, manages model versions automatically (new version appears on disk, TFServing loads it and drains the old one), and serves predictions via gRPC or REST. It handles batching and GPU memory management out of the box.

TFServing's strength is its maturity and battle-tested reliability. Its limitation: it only serves TensorFlow models. If your team trains in PyTorch, TFServing is not an option unless you export to SavedModel format (possible but lossy for some model architectures).

NVIDIA Triton Inference Server

Triton is framework-agnostic — it serves TensorFlow, PyTorch, ONNX, TensorRT, and custom Python models from a single server. This matters when your organization trains models in multiple frameworks. Triton handles dynamic batching, model ensembles (chaining multiple models in a pipeline), and concurrent model execution on the same GPU.
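The idea behind dynamic batching can be illustrated with a small simulation. This is a simplified sketch, not Triton's scheduler: it ignores inference time and assumes a batch closes either when it is full or when the oldest request in it has waited `max_queue_delay_ms`.

```python
def form_batches(arrivals_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times (in ms) into batches."""
    batches = []
    current = []
    for t in sorted(arrivals_ms):
        # Close the open batch if this request arrives after its deadline.
        if current and t > current[0] + max_queue_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


batches = form_batches([0, 1, 2, 3, 10, 11, 30], max_batch_size=4, max_queue_delay_ms=5)
print(batches)  # [[0, 1, 2, 3], [10, 11], [30]]
```

The two parameters trade latency for throughput: a longer delay fills batches more often, but every request in a partly filled batch waits for the deadline.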

Triton's killer feature is its model repository pattern. You place model files in a directory structure:

model_repository/
├── text_classifier/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
├── image_detector/
│   ├── config.pbtxt
│   └── 1/
│       └── model.plan   # TensorRT optimized
└── fraud_scorer/
    ├── config.pbtxt
    └── 1/
        └── model.pt     # PyTorch

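When started in polling mode (--model-control-mode=poll), Triton watches this directory. Drop a new version folder (2/model.onnx) and Triton loads it without a restart; remove a version folder and Triton unloads it. In the default mode the repository is read once at startup, and in explicit mode models are loaded and unloaded through the model control API.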
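To make the version-folder convention concrete, the sketch below builds a repository with the same layout in a temporary directory and reports the highest numeric version per model, which is what a latest-version policy would pick. The helper names are hypothetical, the model files are empty placeholders, and no Triton API is involved.

```python
import tempfile
from pathlib import Path


def latest_versions(repo):
    """Map each model directory to its highest numeric version that contains a file."""
    result = {}
    for model_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
        versions = [int(v.name) for v in model_dir.iterdir()
                    if v.is_dir() and v.name.isdigit() and any(v.iterdir())]
        if versions:
            result[model_dir.name] = max(versions)
    return result


def add_version(repo, model, version, filename):
    version_dir = repo / model / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / filename).write_bytes(b"")  # placeholder, not a real model


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    add_version(repo, "text_classifier", 1, "model.onnx")
    add_version(repo, "fraud_scorer", 1, "model.pt")
    before = latest_versions(repo)
    add_version(repo, "text_classifier", 2, "model.onnx")  # drop in a new version folder
    after = latest_versions(repo)
```

Because versions are plain numbered folders, rolling back is a filesystem operation: removing `2/` makes `1/` the latest again.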

vLLM for Large Language Models

LLM serving is a different problem. These models are tens to hundreds of gigabytes, generation is autoregressive (each token depends on all previous tokens), and requests vary wildly in output length. Traditional model servers treat each request independently — an LLM server needs to manage GPU memory across many concurrent generation streams.

vLLM solves this with PagedAttention, which manages KV-cache memory like an operating system manages virtual memory: allocating pages on demand, freeing them as sequences complete, and avoiding the fragmentation that wastes GPU memory in naive implementations. The result, as reported by the vLLM authors: 2-4x higher throughput than earlier serving systems at similar latency on the same hardware.
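The paging idea can be shown with a toy allocator. This is an illustration of on-demand block allocation only, not vLLM's implementation: the class name and block sizes are assumptions, and real systems store attention keys and values in the blocks rather than just counting tokens.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}   # seq_id -> list of physical block ids
        self.num_tokens = {}    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        count = self.num_tokens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if count % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; the scheduler must queue or preempt")
            self.block_table.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id):
        # Blocks return to the pool as soon as the sequence completes.
        self.free_blocks.extend(self.block_table.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("a")   # 17 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens occupy 1 block
```

A sequence wastes at most one partly filled block, whereas reserving memory up front for the maximum output length wastes everything a short response never uses.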

Model Formats

ONNX (Open Neural Network Exchange): A portable format that works across frameworks. Train in PyTorch, export to ONNX, serve on any ONNX-compatible runtime. The export process captures the model's computation graph in a framework-agnostic representation.

TensorRT: NVIDIA's optimization toolkit that converts models into GPU-optimized engines. TensorRT fuses layers, selects optimal GPU kernels, and can run at reduced precision (FP16, or INT8 with calibration). Inference is often 2-5x faster than running the original model, depending on the architecture and precision, but the optimized engine is tied to the GPU architecture (and, by default, the TensorRT version) it was built for.

SafeTensors: A fast, safe format for storing model weights. Unlike pickle-based formats (which can execute arbitrary code during loading), a SafeTensors file is a small JSON header followed by raw tensor bytes, so loading it cannot execute code. Loading is fast because the file can be memory-mapped and tensors read in place, with no object deserialization step.

Common Pitfall

Never load model files from untrusted sources using pickle-based formats (PyTorch .pt files, Python pickle files). Pickle can execute arbitrary code during deserialization — a malicious model file can compromise your inference server. Use SafeTensors or ONNX for models from external sources, and scan all model artifacts in your CI pipeline.
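The risk is easy to reproduce with the standard library. In the sketch below, a harmless `eval("6 * 7")` stands in for whatever an attacker would actually run; the class names are illustrative. The second half shows one mitigation from the Python documentation's pattern of overriding `find_class`, here in its strictest form (block every global), which is an assumption about policy rather than a general-purpose loader.

```python
import io
import pickle


class MaliciousPayload:
    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and executes it at load time.
        return (eval, ("6 * 7",))


blob = pickle.dumps(MaliciousPayload())
loaded = pickle.loads(blob)   # runs eval; the result is an int, not a model
print(type(loaded).__name__, loaded)  # int 42


class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses to resolve any importable name, so no callable can be invoked."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global {module}.{name}")


try:
    NoGlobalsUnpickler(io.BytesIO(blob)).load()
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

Blocking globals only works for plain data; real model checkpoints need classes to be resolved, which is why a format with no code path at all is the safer choice for external artifacts.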

Containerized model deployment

Models run in containers for the same reason applications do: reproducibility, isolation, and portability. A container packages the model weights, the runtime (Python, CUDA libraries, model server binary), and the configuration into a single artifact that runs identically on a developer's laptop, in CI, and in production.

The Model Container

A typical model serving container includes three layers. The base image provides the OS, CUDA runtime libraries, and Python runtime (the NVIDIA driver itself lives on the host node, not in the image). The framework layer adds the model server (Triton, vLLM, TFServing) and its dependencies. The model layer adds the specific model weights and configuration. Separating these layers matters for build speed: the base image rarely changes, so Docker caches it. The framework layer changes when you upgrade the server. The model layer changes on every model update.

A common mistake is baking model weights into the container image. This makes the image multi-gigabyte, slow to pull, and requires rebuilding the image for every model update. The better pattern: the container image contains only the runtime, and model weights are mounted from external storage (S3, GCS, or a persistent volume) at startup. This way, updating a model means changing a storage pointer, not rebuilding and redeploying a container.
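A startup hook for the "storage pointer" pattern might look like the sketch below. The environment variable names `MODEL_URI` and `MODEL_CACHE_DIR`, the bucket, and the return shape are all assumptions for illustration; the actual download step (via a cloud SDK or an init container) is deliberately left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_model_source(env):
    """Decide where weights come from at startup, based on environment variables."""
    uri = env.get("MODEL_URI")
    if not uri:
        raise RuntimeError("MODEL_URI is not set")
    parsed = urlparse(uri)
    if parsed.scheme in ("s3", "gs"):
        key = parsed.path.lstrip("/")
        cache_dir = PurePosixPath(env.get("MODEL_CACHE_DIR", "/models"))
        return {"action": "download", "bucket": parsed.netloc, "key": key,
                "local_path": str(cache_dir / parsed.netloc / key)}
    if parsed.scheme in ("", "file"):
        # Already on a mounted volume: nothing to fetch.
        return {"action": "use_mounted", "local_path": parsed.path}
    raise ValueError(f"unsupported model URI scheme: {parsed.scheme!r}")


plan = resolve_model_source({"MODEL_URI": "s3://ml-models/fraud_scorer/v7/model.onnx"})
```

In a service this would be called with `os.environ`, so promoting a new model is a change to one environment variable in the deployment spec rather than an image rebuild.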

Kubernetes for Model Serving

Kubernetes is the standard orchestration platform for model serving because it handles the exact problems model serving needs solved: scheduling pods onto nodes with GPUs, health checking and restarting failed instances, load balancing across replicas, and rolling updates for zero-downtime deployments.

GPU scheduling in Kubernetes uses the NVIDIA device plugin, which exposes GPUs as a schedulable resource. A model serving pod requests GPUs in its resource spec, and Kubernetes places it on a node that has available GPUs. The limitation: Kubernetes currently schedules whole GPUs. If your model needs only 4GB of a 40GB A100, the remaining 36GB is wasted. NVIDIA's Multi-Instance GPU (MIG) and time-slicing address this by partitioning a single GPU across multiple pods.
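The resource request itself is small. The sketch below builds a pod manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The pod name and image are placeholders; `nvidia.com/gpu` is the resource name the NVIDIA device plugin registers, and it only accepts whole numbers.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests whole GPUs."""
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("GPUs are requested as positive whole numbers")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "model-server",
                "image": image,
                # GPUs go under limits; the scheduler then only considers
                # nodes with that many unallocated GPUs.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }]
        },
    }


manifest = gpu_pod_manifest("fraud-scorer", "registry.example.com/model-server:1.0")
print(json.dumps(manifest, indent=2))
```

The integer constraint is the whole-GPU limitation in practice: there is no way to write 0.1 here, which is the gap MIG and time-slicing fill.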

Readiness Probes and Model Loading

A model server is not ready to serve traffic the moment its process starts. It must first load model weights from storage into CPU memory, transfer them to GPU memory, and optionally run warm-up inference to pre-compile GPU kernels. For large models, this can take 30 seconds to several minutes. Kubernetes readiness probes prevent the load balancer from routing traffic to a pod until the model is fully loaded and warmed up. Without readiness probes, users hit a pod that is still loading and receive timeout errors.
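The logic behind a readiness endpoint can be sketched as a small state machine. The stage names and the `ModelServerState` class are illustrative; the only Kubernetes behavior relied on is that an HTTP probe treats status codes from 200 to 399 as success and anything else as failure.

```python
from enum import Enum


class Stage(Enum):
    STARTING = 0
    WEIGHTS_IN_CPU_MEMORY = 1
    WEIGHTS_ON_GPU = 2
    WARMED_UP = 3


class ModelServerState:
    def __init__(self):
        self.stage = Stage.STARTING

    def advance(self, stage):
        self.stage = stage

    def liveness(self):
        # The process is up, so Kubernetes should not restart it while loading.
        return 200, "alive"

    def readiness(self):
        # Only a fully loaded and warmed-up model should receive traffic.
        if self.stage is Stage.WARMED_UP:
            return 200, "ready"
        return 503, f"loading: {self.stage.name.lower()}"


state = ModelServerState()
during_load = state.readiness()
state.advance(Stage.WEIGHTS_ON_GPU)
before_warmup = state.readiness()
state.advance(Stage.WARMED_UP)
after_warmup = state.readiness()
```

Keeping liveness and readiness separate matters for slow loads: a pod that reports not-ready is left alone, while a pod that fails liveness during a multi-minute load would be restarted in a loop.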

Auto-scaling for inference

Model serving traffic is rarely constant. A recommendation service peaks during evening hours. A fraud detection model spikes during holiday shopping. A document processing service surges when a large customer uploads a batch. Auto-scaling adds inference replicas when load increases and removes them when load drops — matching capacity to demand without manual intervention.

Why Standard CPU-Based Scaling Fails for ML

Kubernetes HPA (Horizontal Pod Autoscaler) is most commonly configured to scale on CPU utilization. This works for web servers where CPU correlates with load. For GPU inference, CPU utilization is often low and poorly correlated with load because the GPU does most of the work. Scaling on CPU may never add replicas even when the GPU is saturated and requests are queuing for seconds.

ML serving needs GPU-aware scaling metrics. The most useful signals are GPU utilization (percentage of time the GPU is actively computing), GPU memory utilization (how full the GPU memory is), and request queue depth (how many requests are waiting for a free inference slot). Queue depth is often the best single metric because it directly measures user-facing impact — a growing queue means latency is increasing.
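Whatever metric is chosen, the HPA applies the same documented rule: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default). The sketch below applies it to average queue depth per pod; the target of 10 queued requests and the replica bounds are assumed values.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20, tolerance=0.1):
    """Apply the HPA scaling rule to a per-pod average metric."""
    ratio = current_metric / target_metric
    # Small deviations are ignored to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(4, current_metric=30, target_metric=10)
steady = desired_replicas(4, current_metric=10.5, target_metric=10)
scale_down = desired_replicas(4, current_metric=2, target_metric=10)
```

With queue depth as the metric, a tripling of the per-pod backlog asks for three times the replicas, regardless of what CPU utilization reports.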

Figure: Model warm-up eliminating cold-start latency

Model Warm-Up and Cold Start

The biggest challenge with auto-scaling model servers is cold start. When the autoscaler adds a new pod, that pod must pull model weights (30 seconds to 2 minutes for large models), load them into GPU memory, and optionally run warm-up inference to pre-compile GPU kernels. During this time, the pod is consuming resources but not serving traffic. If the traffic spike is sharp, the new pods may not be ready before the spike overwhelms existing capacity.

Warm-up strategies mitigate this. Model preloading downloads weights to local SSD during off-peak hours so startup only requires the GPU transfer step. Predictive scaling uses traffic pattern analysis to add pods before the expected spike — if traffic always peaks at 6 PM, start scaling at 5:45 PM. Over-provisioning maintains a buffer of 1-2 extra pods that absorb spikes while new pods start. The trade-off: over-provisioning wastes money during low-traffic periods, but a cold start during a traffic spike wastes user experience.
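The trade-off can be put in numbers with a back-of-the-envelope model. The sketch assumes constant arrival and service rates during the spike, which real traffic will not follow exactly; the rates in the example are invented.

```python
import math


def backlog_after_cold_start(arrival_rps, per_pod_rps, ready_pods, cold_start_s):
    """Requests queued by the time new pods become ready."""
    deficit_rps = max(0.0, arrival_rps - per_pod_rps * ready_pods)
    return deficit_rps * cold_start_s


def buffer_pods_to_absorb(arrival_rps, per_pod_rps, ready_pods):
    """Extra warm pods needed so the spike never exceeds ready capacity."""
    deficit_rps = max(0.0, arrival_rps - per_pod_rps * ready_pods)
    return math.ceil(deficit_rps / per_pod_rps)


# Spike to 500 req/s, 8 warm pods at 50 req/s each, 90 s cold start.
queued = backlog_after_cold_start(500, 50, ready_pods=8, cold_start_s=90)
buffer_needed = buffer_pods_to_absorb(500, 50, ready_pods=8)
queued_with_buffer = backlog_after_cold_start(500, 50, ready_pods=10, cold_start_s=90)
```

In this example two idle pods are the price of avoiding a backlog of several thousand requests, which is the comparison the over-provisioning decision comes down to.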

Scaling to Zero

For models with intermittent traffic (internal tools, batch-triggered inference), scaling to zero eliminates cost during idle periods. The challenge: the first request after scale-to-zero hits a cold start. Solutions include Knative's activator (which buffers incoming requests while the pod starts), keeping a minimal CPU-based fallback for the cold-start period, or accepting the cold-start latency for non-latency-critical workloads.

Traffic management and A/B testing

Deploying a new model is not the same as rolling out a new feature. A code bug usually produces an error. A bad model produces plausible but wrong answers — silently degrading business metrics for days before anyone notices. Traffic management strategies exist to catch bad models before they reach all users.

Shadow Deployment

The safest way to validate a new model in production: run it alongside the current model on real traffic, but only serve the current model's predictions to users. The new model's predictions are logged and compared offline. This reveals performance differences — latency, throughput, accuracy, output distribution — without any user impact. The cost: you pay for the compute of running two models on every request, and you cannot measure business metrics (click-through rate, conversion) because users never see the new model's predictions.
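The request path for a shadow deployment can be sketched as below. The function name and log format are hypothetical, and for brevity the shadow model is called inline; in production it would normally be called asynchronously so its latency never adds to the user's response time.

```python
def serve_with_shadow(features, primary, shadow, shadow_log):
    """Return the primary model's prediction; record the shadow's for offline comparison."""
    result = primary(features)
    try:
        shadow_log.append({
            "features": features,
            "primary": result,
            "shadow": shadow(features),
        })
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        shadow_log.append({
            "features": features,
            "primary": result,
            "shadow_error": repr(exc),
        })
    return result


log = []
served = serve_with_shadow({"x": 2}, lambda f: f["x"] * 10, lambda f: f["x"] * 11, log)
```

The log is what makes the offline comparison possible: each entry pairs both predictions for the same input, so disagreement rates and output distributions can be computed later.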

Canary Deployment

Gradually shift traffic from the old model to the new one. Start with 5% of traffic to the new model, monitor key metrics (latency, error rate, business KPIs), and increase the percentage if metrics look healthy. A typical progression: 5% for 1 hour, 20% for 2 hours, 50% for 4 hours, then 100%. If metrics degrade at any step, roll back to 0% immediately.

The advantage over shadow: you measure real user impact including business metrics. The risk: that 5% of users see the new model's predictions. If the model is badly broken, those users have a degraded experience. The mitigation: automated rollback triggers that revert to 0% if error rate exceeds a threshold or if a business metric drops below a guardrail.
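A canary controller has two parts: a stable traffic split and a rule for advancing or rolling back. The sketch below uses the progression from the text; hashing the user ID (an assumed design choice) keeps each user on the same model between requests, and the threshold parameters are placeholders.

```python
import hashlib

CANARY_STEPS = [5, 20, 50, 100]


def routes_to_canary(user_id, canary_percent):
    """Deterministically place a user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


def next_canary_percent(current_percent, error_rate, max_error_rate, kpi, kpi_guardrail):
    """Roll back to 0 on any guardrail breach, otherwise move to the next step."""
    if error_rate > max_error_rate or kpi < kpi_guardrail:
        return 0
    later_steps = [s for s in CANARY_STEPS if s > current_percent]
    return later_steps[0] if later_steps else 100
```

Because buckets are fixed per user, raising the percentage only adds users to the canary group; nobody who already saw the new model is switched back mid-rollout unless a rollback occurs.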

A/B Testing for Models

The gold standard for model comparison. Randomly assign users to model A or model B, run both long enough to reach statistical significance (usually 1-2 weeks), and compare business metrics. Unlike canary deployment, A/B testing is designed for long-term comparison rather than gradual rollout. The traffic split is usually 50/50 and remains fixed for the duration of the test.

The key difference from A/B testing software features: model changes often have subtle, delayed effects. A new recommendation model might slightly reduce click-through rate today but increase long-term user engagement. The test duration must be long enough to capture these downstream effects, and the metrics must include both short-term signals (clicks, conversions) and long-term signals (retention, lifetime value).
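For a short-term signal such as conversion rate, the comparison usually comes down to a significance test. The sketch below is a standard two-proportion z-test using only the standard library; it is one common choice rather than the only valid analysis, and the conversion counts in the example are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Model A: 1000 conversions from 20000 users; model B: 1100 from 20000.
z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
```

A test like this only covers the short-term metric; delayed effects such as retention need the longer observation window, and the same test should be decided on before the experiment starts rather than after looking at the data.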

Blue-Green Deployment

Run two complete environments: blue (current) and green (new). All traffic goes to blue. When green is validated, switch all traffic at once. If green fails, switch back to blue instantly. This provides zero-downtime deployment and instant rollback, but requires double the infrastructure cost during the transition period. Blue-green works best for critical models where gradual rollout is too slow and you need binary cut-over with instant rollback capability.

Key Insight

Shadow deployment answers 'is the new model safe?' Canary deployment answers 'is the new model better for a small group?' A/B testing answers 'is the new model better for everyone, and by how much?' Use them in sequence: shadow first to catch catastrophic issues, canary to validate in production, A/B test to measure the business impact before full rollout.
