Inference Optimization

ML Systems & Infrastructure

Quantization

A trained model stores its knowledge as millions or billions of floating-point numbers — the weights. By default, these are 32-bit floats (FP32), each consuming 4 bytes. A 7-billion parameter model in FP32 needs 28GB of memory just for the weights. Quantization reduces the precision of these numbers — FP16 uses 2 bytes, INT8 uses 1 byte, INT4 uses half a byte — shrinking the model and accelerating inference with minimal accuracy loss.
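The arithmetic above can be reproduced in a few lines. This is a minimal sketch that counts weights only (no activations, KV-cache, or framework overhead); the helper name `weight_memory_gb` is illustrative, and sizes use decimal gigabytes.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Footprint of a 7-billion-parameter model at each precision.
for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```

Real deployments need headroom on top of these figures, so treat them as lower bounds.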

Why does lower precision work? Most weight values cluster near zero. The difference between 0.12345678 (FP32) and 0.1235 (FP16) is invisible in the model's predictions. The information that matters — the relative magnitude and sign of weights — survives quantization. Only extreme outlier weights, which are rare, suffer meaningful precision loss.

Figure: Quantization compressing FP32 to INT8 with minimal accuracy loss

FP32 to FP16 (Half Precision)

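The easiest win. Replace 32-bit floats with 16-bit floats. Model size halves. Memory bandwidth halves (the GPU moves half the data). Modern GPUs (A100, H100) have dedicated tensor cores whose FP16 matrix throughput is several times higher than that of their standard FP32 units, so compute-bound layers can speed up by more than the 2x that the smaller data size alone would suggest. Accuracy loss is typically under 0.1%, negligible for nearly all applications. Most frameworks support FP16 inference with a single configuration flag. The main caveat is FP16's narrow dynamic range: models with very large activations can overflow, in which case BF16 (same size, wider range) is the usual substitute. Otherwise there is almost no reason not to use 16-bit precision for inference.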

FP16 to INT8 (8-bit Integer)

A bigger trade-off. Integer arithmetic is faster than floating-point, and INT8 values are half the size of FP16. But integers cannot represent the same range of values. The quantization process maps the continuous float range to 256 discrete integer values (0 to 255 for unsigned, -128 to 127 for signed). This mapping requires choosing a scale factor and zero point that minimize information loss for the specific weight distribution.
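A minimal sketch of the scale and zero-point mapping described above, in plain Python. The function names and the simple min/max calibration rule are illustrative assumptions; production toolkits often use per-channel scales and more careful calibration.

```python
from typing import List, Tuple


def affine_params(values: List[float], qmin: int = -128, qmax: int = 127) -> Tuple[float, int]:
    """Pick a scale and zero point so [min, max] of values maps onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # include 0.0 so zero is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:  # all values are zero; any positive scale works
        scale = 1.0
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> List[int]:
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from the integer codes."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.1, 0.0, 0.05, 0.33, 0.91]
scale, zero_point = affine_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f} zero_point={zero_point} worst_error={worst:.5f}")
```

The round-trip error per weight stays on the order of one quantization step (`scale`), which is why a tight weight range quantizes better than one stretched by outliers.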

Accuracy loss with INT8 is typically 0.5-1% — acceptable for most production models but noticeable for tasks requiring fine-grained distinctions (medical imaging, financial forecasting). Always benchmark on your specific task before deploying INT8 models.

INT8 to INT4 (4-bit)

Aggressive quantization that maps weights to just 16 discrete values. Model size drops to one-eighth of FP32. A 70B parameter model shrinks from 280GB in FP32 (140GB in FP16) to about 35GB, which fits on a single 48GB or 80GB GPU instead of a multi-GPU node; models in the 7B to 30B range fit on a 24GB consumer GPU. Accuracy loss is 1-3%, which is significant for some applications but acceptable when the alternative is not running the model at all. Techniques like GPTQ and AWQ (Activation-aware Weight Quantization) minimize the accuracy impact by calibrating the quantization against representative inputs.

Post-Training Quantization vs Quantization-Aware Training

Post-training quantization (PTQ) takes a trained FP32 model and converts it to lower precision after the fact. Fast and easy — run a calibration dataset through the model, compute the optimal scale factors, and save the quantized model. Works well for FP16 and INT8. Can struggle with INT4 because the accuracy loss is harder to minimize without adjusting weights.

Quantization-aware training (QAT) simulates quantization during training. The model learns to be robust to reduced precision, producing weights that quantize better. QAT produces higher-quality INT8 and INT4 models but requires retraining — which means GPU time, training data, and engineering effort. Use QAT when PTQ accuracy loss is unacceptable and the model is important enough to justify the retraining cost.

Key Insight

Quantization is not just about saving memory; it is about throughput. A model that fits in half the memory leaves room for larger batches and more KV-cache, so the same GPU can serve roughly twice as many concurrent requests. For serving, this means quantization can come close to halving your GPU costs with minimal accuracy loss. The economic argument is often stronger than the performance argument.

Model distillation

Quantization makes the same model smaller. Distillation makes a different, smaller model that behaves like the original. A large "teacher" model (expensive to serve) trains a small "student" model (cheap to serve) to mimic its outputs. The student learns not just the correct answers but the teacher's confidence distribution across all possible answers — and this richer signal produces a better small model than training the student directly on labels.

Figure: Teacher model training student model via soft labels

Why Soft Labels Beat Hard Labels

A classification model trained on hard labels (1 for correct class, 0 for everything else) learns only "this is a cat." A model trained on the teacher's soft probability distribution learns "this is 90% cat, 7% lynx, 2% dog, 1% fox." The soft labels encode relationships between classes that hard labels discard. The student learns that cats and lynxes are similar, that this particular image is somewhat ambiguous, and that foxes are more plausible than airplanes. This relational knowledge transfers from teacher to student, producing a student that generalizes better than one trained on hard labels alone.

The Temperature Parameter

The teacher's softmax output is sharply peaked by default: 90% on the top class, 10% spread across thousands of others. Most of the relational information is in that 10%, but the individual probabilities are so small that they contribute almost nothing to the training signal. Temperature scaling divides the logits by T before the softmax; with T > 1 it "softens" the distribution, spreading probability mass across more classes and making the inter-class relationships more visible. Temperatures in the low single digits (roughly 2 to 6) are a common starting range. The student trains on both the softened teacher outputs (knowledge transfer) and the hard labels (ground truth), combined in a weighted loss.
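A minimal sketch of temperature scaling and the combined loss for a single example, in plain Python. The weighting parameter `alpha` and the function names are illustrative assumptions; the factor of T squared on the soft term is the usual convention for keeping its gradient scale comparable to the hard-label term.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      label: int, temperature: float = 4.0, alpha: float = 0.5) -> float:
    """alpha * cross-entropy(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between the softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


teacher = [6.0, 3.5, 2.0, 1.0]  # e.g. cat, lynx, dog, fox
print("T=1:", [round(p, 3) for p in softmax(teacher)])
print("T=4:", [round(p, 3) for p in softmax(teacher, 4.0)])
print("loss:", distillation_loss([2.0, 1.0, 0.5, 0.0], teacher, label=0))
```

Printing the two distributions side by side shows how the higher temperature moves probability mass from the top class to the runners-up.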

When to Distill vs When to Quantize

Quantization preserves the model architecture and reduces precision. Distillation changes the architecture to a smaller one. Use quantization when the model architecture is right but the serving cost is too high: you keep the same model, just cheaper. Use distillation when you need a fundamentally smaller model with a different architecture, fewer layers, and fewer parameters. The two techniques combine: distill a 70B model into a 7B model, then quantize the 7B model to INT4. The weights shrink from 140GB (70B in FP16) to 3.5GB, a 40x reduction, and serving cost falls by a comparable order of magnitude.

Practical Distillation Pipeline

Step 1: Run the teacher model on your training data and save the soft label outputs for every example.
Step 2: Define a smaller student architecture (fewer layers, smaller hidden dimensions).
Step 3: Train the student with a combined loss: cross-entropy against hard labels plus KL-divergence against the teacher's soft labels.
Step 4: Evaluate the student against the teacher on your test set. Expect 1-5% accuracy loss for a 10x parameter reduction.
Step 5: If the gap is too large, try a larger student, increase training data, or use a higher temperature.

Interview Tip

Distillation is a common pattern behind production LLM applications. Teams fine-tune a large model (70B+) on their specific task to establish quality, then distill into a small model (7B or 1B) that can serve at production cost. In this setup the large model never serves production traffic; it is a teacher that exists only to train better small models.

Batching strategies

A GPU often processes a batch of 32 inputs in nearly the same time as a single input. This is because the GPU's thousands of cores are underutilized on a single input; a batch fills the cores and amortizes the fixed overhead of launching GPU kernels, transferring data, and synchronizing results. Batching is the single highest-leverage optimization for inference throughput.

Figure: Dynamic batching accumulating requests and dispatching to GPU

Static Batching

The simplest approach: collect exactly N requests, process them as one batch, return all results. The problem is waiting. If requests arrive slowly, the batch fills slowly, and early requests wait unnecessarily. If you set N too low, you waste GPU utilization. If you set N too high, latency suffers. Static batching works only when request arrival rate is predictable and consistent — which it rarely is in production.

Dynamic Batching

The production-standard approach. Accumulate requests in a buffer and dispatch a batch when either the buffer reaches a maximum size (e.g., 32 requests) or a timeout expires (e.g., 10ms), whichever comes first. During high traffic, batches fill quickly and dispatch at maximum size — optimal throughput. During low traffic, the timeout ensures requests do not wait indefinitely — bounded latency.

The two knobs: maximum batch size (higher increases throughput but adds latency as more requests wait) and timeout (lower reduces latency but may dispatch smaller, less efficient batches). The right values depend on your latency SLA and traffic pattern. A typical starting point: batch size 32, timeout 10ms.
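The size-or-timeout rule can be sketched with a standard-library queue. This is a simplified, single-consumer illustration; `collect_batch` is a hypothetical helper rather than any serving framework's API, and it starts the timeout clock when the first request of a batch arrives.

```python
import queue
import time
from typing import Any, List


def collect_batch(requests: "queue.Queue[Any]", max_batch_size: int = 32,
                  timeout_s: float = 0.010) -> List[Any]:
    """Return a batch once it is full or the timeout expires, whichever is first."""
    batch = [requests.get()]  # block until at least one request exists
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout knob: dispatch a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


pending: "queue.Queue[int]" = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

print(collect_batch(pending, max_batch_size=3))  # full batch, dispatched immediately
print(collect_batch(pending, max_batch_size=3))  # partial batch, dispatched on timeout
```

A serving loop would call `collect_batch` repeatedly and hand each returned batch to the model.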

Continuous Batching for LLMs

Traditional batching does not work well for LLMs because requests have wildly different output lengths. In a static batch of 32 requests, if one request generates 500 tokens and 31 generate 20 tokens, those 31 requests are blocked waiting for the long one to finish. GPU utilization drops as short requests complete but their slots sit empty.

Continuous batching (also called iteration-level batching) solves this. When a request in the batch finishes generating, its slot is immediately filled by a new request from the queue. The GPU stays fully utilized because completed slots are recycled on every generation step, not at the end of the entire batch. vLLM, TGI (Text Generation Inference), and TensorRT-LLM all implement continuous batching. The throughput improvement over static batching is 2-5x for LLM workloads with variable output lengths.
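The slot-recycling effect can be seen in a toy simulation that counts generation steps. This is a sketch under simplifying assumptions: every step costs the same regardless of how many slots are occupied, prefill is ignored, and all requests are queued up front. The function names are illustrative.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), slots):
        steps += max(output_lengths[start:start + slots])
    return steps


def continuous_batching_steps(output_lengths: List[int], slots: int) -> int:
    """Freed slots are refilled from the queue before every generation step."""
    waiting = deque(output_lengths)
    active: List[int] = []  # tokens still to generate for each occupied slot
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; finished requests free their slot.
        active = [left - 1 for left in active if left > 1]
    return steps


# One long request mixed in with every three short ones.
lengths = [500, 20, 20, 20] * 4
print("static:    ", static_batching_steps(lengths, slots=4))
print("continuous:", continuous_batching_steps(lengths, slots=4))
```

With this workload the static schedule pays for the 500-token request in every batch, while the continuous schedule overlaps the long requests with each other and with the short ones.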

Speculative Decoding

A different approach to faster generation: use a small, fast "draft" model to generate several candidate tokens, then have the large target model verify the entire sequence in a single forward pass. If the draft model proposes 5 tokens and the target accepts the first 4, that one target pass yields 5 tokens: the 4 accepted ones plus the target's own token at the first rejected position. The cost is 5 cheap draft steps and 1 expensive target pass instead of 5 sequential target passes. The key property: with the standard acceptance rule, the output follows exactly the same distribution as running the large model alone, and under greedy decoding it is token-for-token identical. Speculative decoding changes speed but not quality.
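A toy sketch of the greedy variant, with plain functions standing in for the two models. Everything here is illustrative: `toy_target` and `toy_draft` are made-up next-token functions, and the verification loop calls the target once per position only to simulate what a real system computes in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Target verifies all positions; counted as ONE pass (simulated here).
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if guess != expected:
                break  # first mismatch: discard the rest of the proposal
        else:
            accepted.append(target(tokens + accepted))  # bonus token when all match
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def toy_target(tokens: List[int]) -> int:
    return (31 * sum(tokens) + 7) % 50


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    # Hypothetical draft that is wrong whenever the context length is a multiple of 5.
    return (guess + 1) % 50 if len(tokens) % 5 == 0 else guess


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
fast, passes = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20, k=4)
print("identical output:", fast == baseline, "| target passes:", passes)
```

Sampling-based decoding replaces the exact-match check with a probabilistic accept/reject rule, which this sketch does not attempt to show.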

Common Pitfall

Dynamic batching adds latency. A request that arrives right after a batch dispatches must wait up to the full timeout period for the next batch. For latency-critical applications with SLAs under 10ms, the batching timeout must be set very low (1-2ms) or batching may need to be disabled entirely. Always measure p99 latency with batching enabled, not just average throughput.

Inference caching

The cheapest inference is the one you never run. If 20% of your prediction requests have identical inputs, caching those results eliminates 20% of your GPU compute — no model changes, no optimization effort, just a key-value lookup in Redis or Memcached.

When Caching Works

Caching works when inputs repeat. Search queries repeat: "weather in new york" is asked thousands of times per hour. Product recommendations for popular items are requested by millions of users. Document classification for the same document is requested by multiple services. In these cases, a cache hit rate of 20-60% is common, and each hit saves the full inference cost.

Caching does not work when every input is unique (personalized recommendations that depend on real-time user state), when inputs are large and almost never repeat byte-for-byte (high-resolution images), or when staleness is unacceptable (fraud detection where the model must see the latest transaction context).
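A minimal sketch of an exact-match inference cache. An in-process dictionary stands in for Redis or Memcached, and the class name `CachedPredictor` and its counters are illustrative. Inputs are serialized to canonical JSON before hashing so that key order does not cause spurious misses.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class CachedPredictor:
    """Wraps a predict function with an exact-match result cache."""

    def __init__(self, predict: Callable[[Dict[str, Any]], Any]) -> None:
        self._predict = predict
        self._cache: Dict[str, Any] = {}  # stand-in for an external key-value store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(features: Dict[str, Any]) -> str:
        # Canonical serialization: same content gives the same key.
        canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def predict(self, features: Dict[str, Any]) -> Any:
        key = self._key(features)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._predict(features)  # the expensive model call
        self._cache[key] = result
        return result
```

A production version would also need an expiry policy (TTL or invalidation on model redeploy) so that stale predictions do not outlive the model that produced them.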

KV-Cache for Transformer Inference

A different kind of cache, specific to transformer models. During autoregressive generation (one token at a time), each new token attends to all previous tokens. Without caching, every generation step reruns the model over the whole prefix to rebuild the attention keys and values for all previous tokens, so the work at step n grows at least linearly with n and the total is at least O(n^2) for n tokens. The KV-cache stores the key and value tensors from each layer for all previously processed tokens. Each new token only computes its own key-value pair and attends to the cached pairs. This reduces the per-token cost of producing keys and values from O(n) to O(1). The attention itself still reads all n cached pairs, so each step remains O(n), but the redundant recomputation is gone, which makes long sequence generation feasible.
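A toy single-head, single-layer illustration of the idea in plain Python. The names `KVCache` and `attend_without_cache` are illustrative, and the token vectors are supplied directly rather than generated; the point is only that appending one key-value pair per step gives the same attention output as re-projecting the whole prefix.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(x: Vector, weight: Matrix) -> Vector:
    """Matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


class KVCache:
    """Keeps keys and values of past tokens so each step projects only the new token."""

    def __init__(self, wq: Matrix, wk: Matrix, wv: Matrix) -> None:
        self.wq, self.wk, self.wv = wq, wk, wv
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, x: Vector) -> Vector:
        self.keys.append(project(x, self.wk))    # O(1) new projections per step
        self.values.append(project(x, self.wv))
        return attend(project(x, self.wq), self.keys, self.values)  # still reads n pairs


def attend_without_cache(xs: List[Vector], wq: Matrix, wk: Matrix, wv: Matrix) -> Vector:
    """Recomputes keys and values for the whole prefix at every step."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values)
```

Feeding the same sequence through both paths and comparing the outputs is a simple check that the cache changes cost, not results.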

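The trade-off: the KV-cache grows linearly with sequence length and batch size. For a 70B-class model with 80 layers, a single 4096-token sequence needs roughly 10GB of FP16 KV-cache if every layer caches full 8192-wide keys and values (grouped-query attention reduces this several-fold), so a batch of four to eight such sequences can consume 40-80GB of GPU memory. This is why GPU memory, not compute, is usually the bottleneck for LLM serving, and why vLLM's PagedAttention (efficient KV-cache memory management) is so impactful.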
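The linear growth is easy to quantify. The sketch below assumes FP16 cache entries and uses illustrative configuration values (80 layers, head dimension 128, either 64 or 8 key-value heads); check the actual model configuration before relying on the numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache: keys and values for every layer, head, token, and sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3

# Assumed 70B-class shape with full multi-head attention (64 KV heads of size 128).
full = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096, batch_size=1)
# Same shape with grouped-query attention (8 KV heads), also an assumption.
grouped = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096, batch_size=1)

print(f"full attention: {full / GIB:.2f} GiB per sequence")
print(f"grouped-query:  {grouped / GIB:.2f} GiB per sequence")
print(f"batch of 8, full attention: {8 * full / GIB:.0f} GiB")
```

Because batch size multiplies the total directly, the cache rather than the weights often decides how many concurrent sequences a GPU can hold.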

Prefix Caching

Many LLM requests share a common prefix — the system prompt. If every request starts with the same 500-token system prompt, the KV-cache for those 500 tokens is identical across requests. Prefix caching computes the system prompt's KV states once and reuses them for all requests that share that prefix. For applications with long system prompts (2000+ tokens), prefix caching can reduce time-to-first-token by 50-70%.
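A minimal sketch of the reuse logic, with the model's prefill step abstracted behind a hypothetical `compute_kv(tokens, past)` callable. The toy implementation below only records how many tokens it was asked to process, which is enough to show that the shared prefix is computed once.

```python
from typing import Callable, Dict, List, Sequence, Tuple

KVState = List[Tuple[int, int]]  # toy stand-in for per-token key/value tensors
ComputeKV = Callable[[Sequence[int], KVState], KVState]


class PrefixCache:
    """Computes KV state for a shared prefix once and reuses it across requests."""

    def __init__(self, compute_kv: ComputeKV) -> None:
        self._compute_kv = compute_kv
        self._prefix_kv: Dict[Tuple[int, ...], KVState] = {}

    def prefill(self, system_prompt: Sequence[int], user_tokens: Sequence[int]) -> KVState:
        key = tuple(system_prompt)
        if key not in self._prefix_kv:
            # First request with this prefix pays the full prefill cost.
            self._prefix_kv[key] = self._compute_kv(system_prompt, [])
        prefix_kv = self._prefix_kv[key]
        # Only the request-specific tokens are processed, conditioned on the prefix.
        return prefix_kv + self._compute_kv(user_tokens, prefix_kv)


processed_counts: List[int] = []  # tokens processed per compute call (for illustration)


def toy_compute_kv(tokens: Sequence[int], past: KVState) -> KVState:
    """Hypothetical prefill: returns one (position, token) entry per new token."""
    processed_counts.append(len(tokens))
    return [(len(past) + i, token) for i, token in enumerate(tokens)]
```

Real servers typically match prefixes at the granularity of fixed-size token blocks rather than whole prompts, which this sketch does not model.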

Hardware selection

The hardware you run inference on determines your cost-performance ceiling. No amount of software optimization compensates for choosing the wrong hardware. A small XGBoost model on a $30,000 A100 GPU wastes money. A 70B LLM on a CPU is unusable. The right hardware matches the model's computational profile to the hardware's strengths.

CPU Inference

CPUs are the right choice when models are small (under 1GB), inference is not compute-bound (tree-based models, small neural networks), or cost is the primary constraint. CPU instances are 3-10x cheaper per hour than GPU instances. For models like XGBoost, LightGBM, or small sklearn models, CPU inference handles thousands of requests per second at under 5ms latency. ONNX Runtime with CPU optimizations (AVX-512, threading) maximizes throughput.

The break-even point: when your model has more than roughly 100M parameters and uses dense matrix multiplication (neural networks), GPU inference becomes cost-effective because the GPU's parallelism outpaces the CPU's cheaper hourly rate.

GPU Inference

GPUs dominate ML inference for neural networks because they process matrix multiplications in parallel across thousands of cores. The NVIDIA ecosystem is the standard: A10 (budget, 24GB), A100 (high-end, 40/80GB), H100 (newer generation, 80GB with faster memory bandwidth). GPU memory is the primary constraint for LLMs: the model must fit in GPU memory with space left over for the KV-cache and intermediate activations.

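For multi-GPU inference (models too large for one GPU), tensor parallelism splits the model across GPUs connected by NVLink (fast inter-GPU communication). A 70B FP16 model needs 140GB for the weights alone, so the minimum is 2xA100-80GB or 2xH100. That leaves only about 20GB for the KV-cache and activations, so deployments with long contexts or large batches often use more GPUs.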
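A back-of-the-envelope sizing helper for the arithmetic above. The `reserve_fraction` parameter, the share of each GPU set aside for KV-cache and activations, is an illustrative assumption rather than a standard setting.

```python
import math


def min_gpus(num_params: float, bytes_per_param: float, gpu_memory_gb: float,
             reserve_fraction: float = 0.0) -> int:
    """Smallest GPU count whose usable memory holds the weights."""
    weights_gb = num_params * bytes_per_param / 1e9
    usable_gb = gpu_memory_gb * (1.0 - reserve_fraction)
    return math.ceil(weights_gb / usable_gb)


# 70B parameters in FP16 on 80GB GPUs.
print("weights only:       ", min_gpus(70e9, 2, 80))
print("30% reserved per GPU:", min_gpus(70e9, 2, 80, reserve_fraction=0.3))
```

Tensor-parallel degree usually has to divide the model's attention head count evenly, so a computed count of three would in practice often be rounded up to four.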

TPU and Custom Accelerators

TPUs (Tensor Processing Units): Google's custom chips, optimized for transformer workloads. Available only on Google Cloud. TPUs excel at large-batch inference and training, with cost advantages over GPUs for very large models. The limitation: vendor lock-in to Google Cloud and a smaller ecosystem (fewer frameworks, fewer community resources).

AWS Inferentia: Amazon's custom inference chip. AWS cites up to 70% lower cost per inference than comparable GPU-based instances for supported model architectures. The limitation: not all model architectures are supported, and compilation (converting models to Inferentia's format) can be time-consuming and sometimes lossy.

Groq LPU: Specialized for LLM inference with deterministic latency. Achieves extremely fast token generation by using SRAM instead of HBM, eliminating the memory bandwidth bottleneck. Still emerging and limited in availability.

The hardware landscape evolves rapidly. The principles stay constant: match the model's compute profile (CPU-bound vs memory-bandwidth-bound vs compute-bound) to the hardware's strengths, and always benchmark on your specific workload before committing.
