Cost Optimization for ML

GPU Cost Economics

Machine learning workloads are expensive. A single large model training run can cost tens of thousands of dollars, and inference serving can quietly accumulate millions per year. Understanding where the money goes is the first step to spending less of it. This section breaks down GPU pricing, how to choose the right hardware, and why inference costs often dwarf training costs over a model's lifetime.

GPU Pricing Tiers

Cloud providers offer GPUs at three pricing tiers, each with different cost and reliability tradeoffs.

On-demand instances are the simplest option. You request a GPU, you get one (if capacity is available), and you pay by the hour until you release it. An NVIDIA A100 instance on AWS (the p4d.24xlarge, which bundles eight A100s) has listed at roughly $32/hour on-demand. A single-T4 instance costs about $0.50/hour. On-demand pricing is the most expensive tier, but once an instance is running it will not be interrupted, and there is no long-term commitment. Use on-demand for production inference serving where downtime is unacceptable, or for short experiments where the total cost is small regardless of the per-hour rate.

Reserved instances (or committed use discounts) offer 30-60% savings in exchange for a 1-3 year commitment. You agree to pay for a specific instance type for the commitment period whether you use it or not. A 1-year reserved A100 might cost $19/hour effective rate instead of $32. A 3-year commitment drops it further. Reserved pricing makes sense when you have predictable, sustained workloads. If you know you will run inference servers 24/7 for the next year, reserved instances save serious money. The risk is overcommitting: if your workload shrinks or you switch to a different GPU type, you still pay for the reservation.
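The overcommitment risk can be expressed as a break-even utilization: the fraction of hours an instance must actually be in use before a reservation beats paying on-demand only for the hours used. The sketch below uses this section's illustrative $32 and $19 rates; the function names are hypothetical, and it assumes the reservation bills every hour of the term while on-demand bills only hours used.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of hours the instance must be used for a reservation to pay off.

    The two options cost the same when
    utilization * on_demand_rate == reserved_rate.
    """
    if on_demand_rate <= 0:
        raise ValueError("on_demand_rate must be positive")
    return reserved_rate / on_demand_rate


def annual_cost(rate_per_hour: float, utilization: float = 1.0) -> float:
    """Yearly cost at a given hourly rate and fraction of hours billed."""
    return rate_per_hour * HOURS_PER_YEAR * utilization


threshold = break_even_utilization(on_demand_rate=32.0, reserved_rate=19.0)
print(f"Break-even utilization: {threshold:.1%}")

# Compare both options at an assumed 40% utilization (illustrative figure).
print(f"On-demand at 40% use: ${annual_cost(32.0, 0.40):,.0f}")
print(f"Reserved (always billed): ${annual_cost(19.0):,.0f}")
```

Under these rates, a workload that keeps the instance busy less than about 59% of the time is cheaper on-demand, which is why reservations suit always-on inference better than occasional training.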

Spot instances (called Spot VMs, formerly preemptible VMs, on GCP) offer the deepest discounts: 50-70% off on-demand pricing. That $32/hour A100 drops to roughly $10-16/hour. The catch is that the cloud provider can reclaim your instance on short notice when demand for that GPU type spikes: about two minutes on AWS, and as little as 30 seconds on GCP. Spot instances are ideal for fault-tolerant workloads like training with checkpointing, batch inference, and hyperparameter sweeps. They are not suitable for latency-sensitive production serving.

Key Insight

The pricing gap between tiers is large enough to change the economics of entire projects. At the rates above, a 24-hour A100 training run costs $768 on-demand, $456 reserved, or $240-384 on spot. For a team running dozens of experiments per month, the difference between on-demand and spot can be $10,000+ monthly.
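The arithmetic behind these figures can be reproduced with a few lines. This is a minimal sketch using the illustrative hourly rates from this section; the figure of 24 runs per month is an assumption standing in for "dozens of experiments", and the function names are hypothetical.

```python
# Illustrative hourly rates from this section (USD), not live price quotes.
HOURLY_RATES_USD = {
    "on_demand": 32.0,
    "reserved_1yr": 19.0,
    "spot_low": 10.0,   # roughly 70% off on-demand
    "spot_high": 16.0,  # 50% off on-demand
}


def tier_costs(hours: float, rates: dict = HOURLY_RATES_USD) -> dict:
    """Cost of a run of the given length under each pricing tier."""
    return {tier: hours * rate for tier, rate in rates.items()}


def monthly_gap(runs_per_month: int, hours_per_run: float,
                expensive_rate: float, cheap_rate: float) -> float:
    """Monthly difference between two hourly rates for a fixed workload."""
    return runs_per_month * hours_per_run * (expensive_rate - cheap_rate)


single_run = tier_costs(hours=24)
for tier, cost in single_run.items():
    print(f"{tier:>12}: ${cost:,.0f}")

# Assumed workload: 24 runs per month, 24 hours each.
gap = monthly_gap(24, 24, HOURLY_RATES_USD["on_demand"], HOURLY_RATES_USD["spot_low"])
print(f"Monthly on-demand vs spot gap: ${gap:,.0f}")
```

The spot figure ignores time lost to interruptions and restarts, so the realized gap will be somewhat smaller than the list-price gap.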

Choosing the Right GPU

Not every workload needs the most powerful GPU. Matching hardware to workload is one of the easiest cost optimizations, yet teams frequently default to the biggest GPU available.

A100 and H100 GPUs have 40-80 GB of high-bandwidth memory and massive tensor core throughput. They are designed for training large models (billions of parameters) and high-throughput inference on large models. Using an A100 to train a logistic regression or serve a small BERT model is like renting a freight truck to deliver a letter. You pay for capacity you never use.

T4 and L4 GPUs have 16 GB and 24 GB of memory respectively and cost a fraction of an A100. They handle inference for most production models comfortably. A quantized 7B parameter model fits on a single T4: at 8-bit precision the weights take roughly 7 GB of the T4's 16 GB. For models under 1B parameters, T4s are almost always the right choice for inference. L4 GPUs offer a good middle ground, with better performance than a T4 at a lower price point than an A100.

CPU inference should not be overlooked. For models under 100M parameters (small classifiers, lightweight NLP models, tabular ML), CPU inference is often fast enough and dramatically cheaper. A CPU instance costs $0.05-0.50/hour compared to $0.50-32/hour for GPU instances. If your model responds in under 50ms on CPU, there is no reason to pay for GPU time.
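Whether CPU inference is "fast enough" is something to measure rather than guess. The sketch below times a prediction callable and compares the 95th-percentile latency to a budget. The 50 ms budget comes from the text; the helper names, the warmup and run counts, and the `fake_predict` stand-in for a real model are all assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(predict: Callable[[], object],
                       warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time repeated calls to predict() and return p50/p95 latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):  # let caches and lazy initialization settle
        predict()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


def cpu_is_fast_enough(latency: Dict[str, float], budget_ms: float = 50.0) -> bool:
    """True if tail latency fits within the budget."""
    return latency["p95"] <= budget_ms


def fake_predict() -> int:
    """Stand-in for a real model call; replace with your own inference function."""
    return sum(i * i for i in range(1000))


result = measure_latency_ms(fake_predict)
print(f"p50={result['p50']:.3f} ms, p95={result['p95']:.3f} ms")
print("CPU is sufficient" if cpu_is_fast_enough(result) else "Consider a GPU")
```

Checking a tail percentile rather than the mean is a deliberate choice: a model that averages 30 ms but spikes to 200 ms may still miss a latency target.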

The decision framework is straightforward: profile your model's memory footprint and latency requirements, then pick the cheapest hardware that meets those requirements with reasonable headroom.
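The memory half of this framework can be sketched as a simple filter over a hardware catalog. Everything here is illustrative: the `Hardware` type and function names are hypothetical, the 1.2x overhead factor for activations and runtime buffers is an assumption, and the hourly prices are placeholders that loosely follow this section's figures (the L4 price in particular is assumed). Substitute your provider's current price list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hardware:
    name: str
    memory_gb: float
    hourly_usd: float


# Placeholder catalog; prices are illustrative, not quotes.
CATALOG = [
    Hardware("T4", memory_gb=16, hourly_usd=0.50),
    Hardware("L4", memory_gb=24, hourly_usd=1.00),    # assumed price
    Hardware("A100", memory_gb=80, hourly_usd=32.00),
]


def estimate_memory_gb(n_params: float, bytes_per_param: float,
                       overhead: float = 1.2) -> float:
    """Rough serving footprint: weights times an assumed overhead factor."""
    return n_params * bytes_per_param * overhead / 1e9


def cheapest_fit(required_gb: float, catalog: List[Hardware]) -> Optional[Hardware]:
    """Cheapest hardware whose memory covers the requirement, or None."""
    candidates = [h for h in catalog if h.memory_gb >= required_gb]
    return min(candidates, key=lambda h: h.hourly_usd) if candidates else None


# A 7B model at 8-bit (1 byte/param) versus fp16 (2 bytes/param).
for label, bytes_per_param in [("int8", 1), ("fp16", 2)]:
    need = estimate_memory_gb(7e9, bytes_per_param)
    choice = cheapest_fit(need, CATALOG)
    print(f"7B {label}: ~{need:.1f} GB -> {choice.name if choice else 'no single GPU fits'}")
```

This covers only the memory constraint. Latency and throughput still have to be measured on the candidate hardware before committing to it.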

Training vs Inference Cost Profile

Training and inference have fundamentally different cost profiles, and understanding this distinction changes how you prioritize optimization efforts.

Training is a burst workload. You spin up GPUs, run training for hours or days, then release them. The cost is bounded: you know when training starts and roughly when it ends. A GPT-scale training run might cost $100,000-$1,000,000 but it happens once (or a few times with retraining). Training costs are highly visible because they come in large, concentrated bills.

Inference is an ongoing workload. Once you deploy a model, it serves predictions continuously. Even a modest inference deployment of 4 T4 GPUs running 24/7 costs about $1,460/month or $17,520/year. A popular model serving millions of requests per day on A100s can cost $50,000-100,000/month. Over the model's lifetime (often 6-18 months before replacement), inference costs frequently exceed training costs by 5-10x.
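The always-on figure above follows directly from the hourly rate. A minimal sketch, assuming a 365-day year and the $0.50/hour T4 rate used earlier in this section (the function name is hypothetical):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def fleet_cost(gpu_count: int, hourly_rate: float) -> dict:
    """Monthly and yearly cost of GPUs that run around the clock."""
    yearly = gpu_count * hourly_rate * HOURS_PER_YEAR
    return {"monthly": yearly / 12, "yearly": yearly}


t4_fleet = fleet_cost(gpu_count=4, hourly_rate=0.50)
print(f"4x T4, 24/7: ${t4_fleet['monthly']:,.0f}/month, ${t4_fleet['yearly']:,.0f}/year")
```

Because the fleet is billed for every hour regardless of traffic, low utilization does not reduce this number; only fewer or cheaper instances do.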

This is why inference optimization often delivers more total savings than training optimization. A 2x improvement in inference efficiency compounds every hour of every day the model is deployed, while a 2x improvement in training speed saves money only during the occasional training run.
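A rough comparison makes the compounding effect visible. The sketch below assumes cost scales inversely with efficiency (a 2x improvement halves the bill) and uses hypothetical inputs: a $20,000 training run repeated twice, and a $10,000/month inference bill over a 12-month deployment. None of these figures come from a real system.

```python
def savings_from_speedup(baseline_cost: float, speedup: float) -> float:
    """Money saved if cost scales inversely with the efficiency gain."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_cost * (1 - 1 / speedup)


# Hypothetical workload figures for illustration only.
TRAINING_COST_PER_RUN = 20_000.0
TRAINING_RUNS = 2
INFERENCE_COST_PER_MONTH = 10_000.0
DEPLOYMENT_MONTHS = 12

training_total = TRAINING_COST_PER_RUN * TRAINING_RUNS
inference_total = INFERENCE_COST_PER_MONTH * DEPLOYMENT_MONTHS

training_savings = savings_from_speedup(training_total, speedup=2.0)
inference_savings = savings_from_speedup(inference_total, speedup=2.0)

print(f"2x faster training saves:  ${training_savings:,.0f}")
print(f"2x cheaper inference saves: ${inference_savings:,.0f}")
```

With these inputs the same 2x gain is worth three times as much on the inference side, and the gap widens the longer the model stays deployed.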
