ML Systems & Infrastructure
ML Fundamentals for Engineers
Training Infrastructure
Model Serving
Evaluation and Testing
Production Operations
Specialized Systems and Capstone
Data Pipelines: Batch and Stream
Before a machine learning model can make predictions, it needs features — numerical representations of your data. Before you have features, you need clean, transformed data. And before you have that, you need a way to move raw data from where it lives (databases, logs, APIs) to where it needs to be (a data warehouse, a feature store, a training dataset). That is what a data pipeline does, and batch processing is the oldest and most reliable way to build one.
Batch processing means collecting data over a period — an hour, a day, a week — and processing it all at once. You do not react to each event as it arrives. Instead, you wait, accumulate, and then crunch everything in a single run. The tradeoff is latency for throughput: you get results hours later, but you process terabytes efficiently.

The ETL Pattern
ETL stands for Extract, Transform, Load — the three stages of every batch pipeline.
Extract pulls raw data from source systems. This might be a full database dump, an incremental export of rows modified since the last run, or downloading files from an API. The key constraint is that extraction should not overload the source system. Running a SELECT * on a production database during peak traffic is a recipe for incidents.
Transform is where the real work happens. You clean nulls, deduplicate records, join tables, compute aggregations, and reshape data into the schema your downstream consumers expect. This is also where feature engineering lives — computing rolling averages, ratios, category encodings, and other derived values.
Load writes the transformed data to its destination: a data warehouse (BigQuery, Snowflake, Redshift), a feature store (Feast, Tecton), or flat files in object storage (S3, GCS). The load step must be atomic — either all the data lands or none of it does. Partial loads create data inconsistency that is painful to debug.
Apache Spark: The Engine
MapReduce was the original distributed batch framework, but it wrote intermediate results to disk between every step, making multi-stage pipelines slow. Apache Spark replaced it by keeping intermediate data in memory across stages. A pipeline that took 30 minutes on MapReduce often runs in 3 minutes on Spark.
Spark's programming model is built around DataFrames — distributed tables that you manipulate with SQL-like operations. Spark evaluates these operations lazily: when you write df.filter(...).groupBy(...).agg(...), Spark does not execute anything immediately. It builds a Directed Acyclic Graph (DAG) of operations. Only when you trigger an action (like .write() or .count()) does Spark optimize the DAG and execute it across the cluster.
This job runs nightly. By morning, every user has fresh features computed from the previous day's activity. The model training pipeline reads these features, and the prediction service looks them up at inference time.
Lazy evaluation is not just an optimization trick — it changes how you debug. Because Spark builds a DAG before executing, you can inspect the execution plan with .explain() to see how Spark will distribute work across the cluster. If you see a shuffle (data redistribution across nodes), that is your bottleneck. Shuffles move data over the network and are 10-100x slower than local operations.
Partitioning and Shuffles
When Spark reads a Parquet file from S3, it splits the data into partitions — chunks that are processed independently on different cluster nodes. Operations that work within a single partition (filters, maps) are fast because they require no data movement. Operations that need to combine data across partitions — like groupBy, join, or distinct — trigger a shuffle.
A shuffle redistributes data across the cluster so that all records with the same key land on the same node. This involves serializing data, writing it to disk, transferring it over the network, and deserializing it on the receiving end. Shuffles are the single most expensive operation in Spark. A poorly designed pipeline that triggers unnecessary shuffles can be 10x slower than one that minimizes them.
The practical lesson: partition your data by the columns you most frequently group or join on. If every query groups by user_id, partition by user_id. If you join transactions with users, co-partition both tables by user_id so the join is local and shuffle-free.