Data Pipelines: Batch and Stream

Topics Covered

Batch Processing Fundamentals

The ETL Pattern

Apache Spark: The Engine

Partitioning and Shuffles

Stream Processing Fundamentals

Apache Kafka: The Backbone

Stream Processing Frameworks

Windowing: Turning Streams into Aggregates

Watermarks and Late Events

Pipeline Orchestration

Apache Airflow

Idempotency: The Most Important Property

Backfills: Reprocessing History

Monitoring and SLAs

Data Validation and Quality

Why Validate: The Cascade of Bad Data

Great Expectations: Define What Good Data Looks Like

Schema Enforcement

Data Quality Metrics

The Quarantine Pattern

Choosing Batch vs Stream

The Latency Spectrum

Complexity and Cost

Lambda Architecture: Batch + Stream

Kappa Architecture: Stream Only

The Practical Decision Framework

Before a machine learning model can make predictions, it needs features — numerical representations of your data. Before you have features, you need clean, transformed data. And before you have that, you need a way to move raw data from where it lives (databases, logs, APIs) to where it needs to be (a data warehouse, a feature store, a training dataset). That is what a data pipeline does, and batch processing is the oldest and most reliable way to build one.

Batch processing means collecting data over a period — an hour, a day, a week — and processing it all at once. You do not react to each event as it arrives. Instead, you wait, accumulate, and then crunch everything in a single run. The tradeoff is latency for throughput: you get results hours later, but you process terabytes efficiently.

Batch ETL pipeline with scheduled extraction, Spark transformation stages, and loading into feature store

The ETL Pattern

ETL stands for Extract, Transform, Load — the three stages of every batch pipeline.

Extract pulls raw data from source systems. This might be a full database dump, an incremental export of rows modified since the last run, or downloading files from an API. The key constraint is that extraction should not overload the source system. Running a SELECT * on a production database during peak traffic is a recipe for incidents.

Transform is where the real work happens. You clean nulls, deduplicate records, join tables, compute aggregations, and reshape data into the schema your downstream consumers expect. This is also where feature engineering lives — computing rolling averages, ratios, category encodings, and other derived values.

Load writes the transformed data to its destination: a data warehouse (BigQuery, Snowflake, Redshift), a feature store (Feast, Tecton), or flat files in object storage (S3, GCS). The load step must be atomic — either all the data lands or none of it does. Partial loads create data inconsistency that is painful to debug.

Apache Spark: The Engine

MapReduce was the original distributed batch framework, but it wrote intermediate results to disk between every step, making multi-stage pipelines slow. Apache Spark replaced it by keeping intermediate data in memory across stages. A pipeline that took 30 minutes on MapReduce often runs in 3 minutes on Spark.

Spark's programming model is built around DataFrames — distributed tables that you manipulate with SQL-like operations. Spark evaluates these operations lazily: when you write df.filter(...).groupBy(...).agg(...), Spark does not execute anything immediately. It builds a Directed Acyclic Graph (DAG) of operations. Only when you trigger an action (like .write() or .count()) does Spark optimize the DAG and execute it across the cluster.

python
1from pyspark.sql import SparkSession
2from pyspark.sql import functions as F
3
4spark = SparkSession.builder.appName("daily_features").getOrCreate()
5
6# Extract: read yesterday's transaction logs
7transactions = spark.read.parquet("s3://data-lake/transactions/date=2025-01-15/")
8
9# Transform: compute per-user features
10user_features = (
11    transactions
12    .groupBy("user_id")
13    .agg(
14        F.count("*").alias("txn_count_1d"),
15        F.sum("amount").alias("total_spend_1d"),
16        F.avg("amount").alias("avg_txn_amount_1d"),
17        F.countDistinct("merchant_id").alias("unique_merchants_1d"),
18    )
19)
20
21# Load: write to feature store
22user_features.write.mode("overwrite").parquet(
23    "s3://feature-store/user_daily_features/date=2025-01-15/"
24)

This job runs nightly. By morning, every user has fresh features computed from the previous day's activity. The model training pipeline reads these features, and the prediction service looks them up at inference time.

Key Insight

Lazy evaluation is not just an optimization trick — it changes how you debug. Because Spark builds a DAG before executing, you can inspect the execution plan with .explain() to see how Spark will distribute work across the cluster. If you see a shuffle (data redistribution across nodes), that is your bottleneck. Shuffles move data over the network and are 10-100x slower than local operations.

Partitioning and Shuffles

When Spark reads a Parquet file from S3, it splits the data into partitions — chunks that are processed independently on different cluster nodes. Operations that work within a single partition (filters, maps) are fast because they require no data movement. Operations that need to combine data across partitions — like groupBy, join, or distinct — trigger a shuffle.

A shuffle redistributes data across the cluster so that all records with the same key land on the same node. This involves serializing data, writing it to disk, transferring it over the network, and deserializing it on the receiving end. Shuffles are the single most expensive operation in Spark. A poorly designed pipeline that triggers unnecessary shuffles can be 10x slower than one that minimizes them.

The practical lesson: partition your data by the columns you most frequently group or join on. If every query groups by user_id, partition by user_id. If you join transactions with users, co-partition both tables by user_id so the join is local and shuffle-free.