Training at Scale

Topics Covered

From Single GPU to Distributed Training

The Memory Wall

The Throughput Ceiling

Two Fundamental Strategies

Choosing Your Strategy

Data Parallelism

How Data Parallelism Works

All-Reduce: The Communication Backbone

Synchronous vs Asynchronous SGD

Batch Size Scaling and the Linear Scaling Rule

Communication-Computation Overlap

Model Parallelism and Pipeline Parallelism

Tensor Parallelism: Splitting Within Layers

Pipeline Parallelism: Splitting Between Layers

Micro-Batching to Reduce Bubbles

Choosing Between Tensor and Pipeline Parallelism

GPU Scheduling and Cluster Management

Slurm for HPC Clusters

Kubernetes with GPU Operator

Gang Scheduling and Preemption

Multi-Tenancy and Resource Quotas

Spot and Preemptible Instances

Training Optimization Techniques

Mixed-Precision Training

Gradient Accumulation

Gradient Checkpointing

Fault-Tolerance Checkpointing

Combining Techniques in Practice

Training a neural network on a single GPU works fine when the model fits in memory and you can tolerate the training time. But both of those assumptions break down quickly as models grow. Understanding exactly where the limits are tells you when distributed training stops being optional and becomes the only path forward.

The Memory Wall

A model's memory footprint during training is not just its parameters. You need memory for the parameters themselves, the optimizer states (Adam stores two extra values per parameter, the first and second moment estimates), the gradients, and the activations saved for backpropagation. A 7B-parameter model at FP32 needs 28GB just for weights (7B × 4 bytes). Adam optimizer states add another 56GB (two FP32 copies). Gradients add 28GB more. That is 112GB before you store a single activation. An A100 with 80GB of HBM cannot even hold the parameters plus optimizer states (84GB), let alone run a forward pass.
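The arithmetic above can be reproduced with a small calculator. This is a minimal sketch, assuming FP32 everywhere and 1 GB = 1e9 bytes; the function name `training_state_gb` is hypothetical, and activations are deliberately left out because they depend on batch size and architecture.

```python
BYTES_FP32 = 4  # bytes per FP32 value

def training_state_gb(num_params: float) -> dict:
    """Per-component training memory in GB (1 GB = 1e9 bytes), activations excluded."""
    weights = num_params * BYTES_FP32
    gradients = num_params * BYTES_FP32
    adam_states = 2 * num_params * BYTES_FP32  # first and second moment estimates
    total = weights + gradients + adam_states
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "adam_states": adam_states / 1e9,
        "total": total / 1e9,
    }

# The 7B-parameter example from the text.
breakdown = training_state_gb(7e9)
for name, gb in breakdown.items():
    print(f"{name:>12}: {gb:6.1f} GB")
```

Calling the same function with a different parameter count gives a quick first check of whether a model's training state fits on a given GPU.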

The memory wall hits sooner than most engineers expect. Even a 1.5B-parameter model at FP32 with Adam needs roughly 24GB for parameters, gradients, and optimizer states (1.5B × 16 bytes). That fits on an 80GB A100, but on a 40GB card it leaves only about 16GB for batch activations and intermediate tensors. Past 3B parameters the same state reaches 48GB: it no longer fits on a 40GB GPU at all, and even on an 80GB GPU the room left for activations constrains batch size and sequence length. The gap between model sizes that researchers want to train and the memory available on a single GPU has been widening every year, making distributed training a core competency rather than an advanced topic.

The Throughput Ceiling

Even when a model fits in memory, a single GPU delivers a finite number of floating-point operations per second. By one widely cited estimate, training GPT-3's 175B parameters on a single V100 would take roughly 355 years; a single A100 is faster but would still need decades. No one waits that long. The only way to make this practical is to spread the work across hundreds or thousands of GPUs running in parallel. The goal is to reduce wall-clock training time from centuries to weeks.
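A back-of-the-envelope version of this estimate is sketched below. It relies on assumptions that are not in the text: the common rule of thumb that training costs about 6 FLOPs per parameter per token, a 300-billion-token training run, and a sustained rate of 28 TFLOP/s per GPU. The function name `training_years` is hypothetical, and the multi-GPU line assumes perfect linear scaling, which real clusters do not reach.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def training_years(num_params, num_tokens, sustained_flops, num_gpus=1):
    """Wall-clock years, using the ~6 * params * tokens rule of thumb for total FLOPs."""
    total_flops = 6 * num_params * num_tokens
    seconds = total_flops / (sustained_flops * num_gpus)
    return seconds / SECONDS_PER_YEAR

# Assumed inputs: 300e9 tokens, 28e12 sustained FLOP/s per GPU.
one_gpu = training_years(175e9, 300e9, 28e12)
cluster = training_years(175e9, 300e9, 28e12, num_gpus=1024)

print(f"1 GPU:     {one_gpu:.0f} years")
print(f"1024 GPUs: {cluster * 52:.1f} weeks (ideal scaling)")
```

Under these assumptions the single-GPU result lands in the same range as the figure quoted above, and a thousand-GPU cluster brings it down to weeks.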

Two Fundamental Strategies

Distributed training splits the problem in one of two ways. Data parallelism keeps a full copy of the model on every GPU and splits the training data across them. Each GPU processes different examples and they synchronize gradients after each step. This works when the model fits on one GPU but training is too slow.
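To see why synchronizing gradients works, the sketch below simulates data parallelism in plain Python for a one-parameter model. The workers are ordinary loop iterations rather than GPUs, the final average stands in for an all-reduce, and the function names are hypothetical. It assumes the batch divides evenly across workers; with unequal shards the local gradients would need to be weighted.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute local gradients, then average them."""
    shard = len(xs) // num_workers
    local_grads = []
    for rank in range(num_workers):
        lo, hi = rank * shard, (rank + 1) * shard
        # Every worker holds the same weight w but sees only its own shard.
        local_grads.append(grad_mse(w, xs[lo:hi], ys[lo:hi]))
    return sum(local_grads) / num_workers  # stands in for an all-reduce (mean)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)
sharded = data_parallel_grad(w, xs, ys, num_workers=4)
print(full, sharded)
```

The averaged shard gradients match the full-batch gradient, which is why every replica can apply the same update and stay in sync.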

Model parallelism splits the model itself across GPUs when it is too large for any single device. There are two flavors: tensor parallelism splits individual layers across GPUs, and pipeline parallelism assigns different layers to different GPUs. In practice, large-scale training systems combine data, tensor, and pipeline parallelism in a hierarchy.
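The difference between the two flavors can be illustrated with a toy two-layer linear network in plain Python. This is a sketch under simplifying assumptions: "devices" are just list entries, the column-wise split is one of several ways to shard a layer, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split a weight matrix column-wise into equal shards, one per device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def tensor_parallel_layer(x, w, parts):
    """Each device multiplies by its column shard; the outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

def pipeline_forward(x, stages):
    """Each stage owns whole layers; activations pass from one stage to the next."""
    for stage_weights in stages:   # one entry per device
        for w in stage_weights:    # layers resident on that device
            x = matmul(x, w)
    return x

x = [[1.0, 2.0]]
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # layer 1: 2 x 4
w2 = [[1.0], [1.0], [1.0], [1.0]]                  # layer 2: 4 x 1

reference = matmul(matmul(x, w1), w2)
tp_out = tensor_parallel_layer(x, w1, parts=2)     # layer 1 split across 2 devices
pp_out = pipeline_forward(x, stages=[[w1], [w2]])  # one layer per device
print(tp_out, pp_out, reference)
```

Tensor parallelism communicates inside every layer (the concatenation step), while pipeline parallelism communicates only at stage boundaries.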

Choosing Your Strategy

The decision tree is straightforward. If your model fits on one GPU and you just need more throughput, start with data parallelism. If your model does not fit on one GPU, you need some form of model parallelism. If your model is so large that it does not fit even when split across the GPUs in one node, you need pipeline parallelism across nodes combined with tensor parallelism within nodes. And if you want the best of both worlds (splitting a huge model AND scaling throughput), you layer data parallelism on top of model parallelism. This layered approach is called 3D parallelism and is how systems like Megatron-DeepSpeed train models with hundreds of billions of parameters.

One common mistake is jumping to model parallelism prematurely. If your model fits on a single GPU, model parallelism only adds communication overhead without any memory benefit. Always check the memory math first: parameters times 16 bytes (4 for FP32 weights, 4 for gradients, 8 for the two Adam states) gives you the minimum memory, and if that fits on your GPU with room for activations, data parallelism is the right starting point.
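The decision tree and the memory check can be written down as a short function. This is an illustrative sketch, not a sizing tool: the name `choose_strategy` and the `activation_headroom` parameter (the fraction of GPU memory reserved for activations, set to an arbitrary 30% here) are assumptions, and real planning also has to account for interconnect bandwidth and activation size.

```python
def choose_strategy(num_params, gpu_mem_gb, gpus_per_node, activation_headroom=0.3):
    """Apply the decision tree using the 16-bytes-per-parameter floor (FP32 + Adam)."""
    state_gb = num_params * 16 / 1e9                     # weights + gradients + Adam states
    usable_gb = gpu_mem_gb * (1 - activation_headroom)   # keep a share free for activations
    if state_gb <= usable_gb:
        return "data parallelism"
    if state_gb <= usable_gb * gpus_per_node:
        return "model parallelism within one node"
    return "pipeline parallelism across nodes + tensor parallelism within nodes"

# Hypothetical cluster: 80GB GPUs, 8 per node.
for params in (1.5e9, 7e9, 175e9):
    print(f"{params / 1e9:6.1f}B -> {choose_strategy(params, 80, 8)}")
```

In the last two cases, data parallelism can still be layered on top of the model-parallel group when more throughput is needed.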

Key Insight

The decision between data parallelism and model parallelism is not either-or. Modern systems like Megatron-LM use tensor parallelism within a node (where NVLink provides high bandwidth), pipeline parallelism across the nodes that together hold one copy of the model, and data parallelism across those model replicas. The right combination depends on model size, cluster topology, and interconnect bandwidth.
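One way to picture such a combination is as a three-dimensional grid of GPU ranks. The sketch below shows one possible layout, with tensor parallelism as the innermost axis so that each tensor-parallel group sits inside a single node. The function name, the axis ordering of the pipeline and data dimensions, and the cluster sizes are all assumptions for illustration and do not describe any particular framework's rank assignment.

```python
def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a global rank to (dp, pp, tp) coordinates with tensor parallelism innermost."""
    assert 0 <= rank < tp_size * pp_size * dp_size
    tp = rank % tp_size                 # consecutive ranks share a tensor-parallel group
    pp = (rank // tp_size) % pp_size    # next axis: pipeline stage
    dp = rank // (tp_size * pp_size)    # outermost axis: model replica
    return dp, pp, tp

# Hypothetical cluster: 8 GPUs per node, 4 pipeline stages, 2 replicas = 64 GPUs.
TP, PP, DP, GPUS_PER_NODE = 8, 4, 2, 8

for rank in (0, 7, 8, 31, 32, 63):
    node = rank // GPUS_PER_NODE
    dp, pp, tp = rank_to_coords(rank, TP, PP, DP)
    print(f"rank {rank:2d} on node {node}: replica {dp}, stage {pp}, tensor shard {tp}")
```

With the tensor-parallel size equal to the number of GPUs per node, the heaviest communication stays on the fast intra-node links, and only stage-to-stage activations and gradient synchronization cross node boundaries.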