ML Pipeline Orchestration

Course

ML Systems & Infrastructure

Topics Covered

Pipeline Concepts and DAG Design

Fan-out and fan-in patterns

Task granularity trade-offs

Parameterized pipelines

Pipeline versioning

Orchestration Tools

Apache Airflow -- the incumbent

Dagster -- the asset-centric challenger

Kubeflow Pipelines -- Kubernetes-native ML

Prefect -- the modern alternative

When to pick which

Pipeline Reliability and Recovery

Task idempotency

Retry with exponential backoff

Partial DAG reruns

Dead letter queues for poison data

Alerting and SLA monitoring

Scheduling and Triggers

Time-based scheduling

Event-driven triggers

Data-aware scheduling (sensors)

Hybrid scheduling

Backfill

Pipeline Concepts and DAG Design

Without an orchestrator, your ML workflow is a chain of cron jobs and bash scripts. When the training step fails at 3 AM, nobody notices until the daily predictions are stale. The data science team reruns everything manually, hoping they remembered the right order. One person runs feature engineering before the new data lands; another kicks off training against yesterday's features. The pipeline concept exists to eliminate this chaos by making dependencies explicit, execution automatic, and failures visible.

The difference between a Jupyter notebook and a production ML system is the pipeline. A notebook runs interactively: a human runs cells in order, inspects intermediate results, and retries when something fails. A pipeline encodes that same sequence of steps so that it runs unattended, retries failed steps automatically, and produces reproducible results regardless of who or what triggers it. Getting from notebook to pipeline is where many ML teams spend much of their engineering effort.

An ML pipeline is a directed acyclic graph (DAG) where each node is a task and each edge is a dependency. "Directed" means every edge points one way, from an upstream task to the downstream task that consumes its output, so work flows from ingestion toward deployment. "Acyclic" means no circular dependencies: task A cannot depend on task B if B depends on A, whether directly or through a chain of other tasks. This constraint is what makes pipelines schedulable: the orchestrator can topologically sort the DAG, execute tasks in a valid order, and parallelize independent branches.

A typical ML pipeline DAG looks like this:

ingest_raw_data
       |
validate_data
       |
compute_features
    /      \
train    build_serving_index
  |
evaluate
  |
deploy_model

Each node encapsulates a single unit of work — pulling data from a warehouse, computing feature vectors, training a model, evaluating metrics, pushing a model artifact to a registry. The edges encode the contract: "do not start training until feature computation finishes."
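To make the scheduling idea concrete, here is a minimal sketch of how an orchestrator might derive an execution order from declared dependencies. It uses Python's standard-library graphlib module and the task names from the example DAG; the execution_waves helper is a hypothetical name introduced for illustration, not part of any orchestrator's API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each task to the set of tasks it depends on (its upstream tasks).
# Task names mirror the example DAG in this section.
PIPELINE = {
    "ingest_raw_data": set(),
    "validate_data": {"ingest_raw_data"},
    "compute_features": {"validate_data"},
    "train": {"compute_features"},
    "build_serving_index": {"compute_features"},
    "evaluate": {"train"},
    "deploy_model": {"evaluate"},
}


def execution_waves(dag):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so successors become ready
    return waves


for number, wave in enumerate(execution_waves(PIPELINE), start=1):
    print(number, wave)
```

Tasks that land in the same wave, such as train and build_serving_index, are the ones that can run in parallel. Grouping into waves is a simplification: a real orchestrator would typically start evaluate as soon as train finishes rather than waiting for the whole wave.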

Why "acyclic"? Imagine a pipeline where the model evaluation step feeds back into the training step to adjust hyperparameters. If you encode this as a cycle in the DAG, the orchestrator cannot determine execution order — training depends on evaluation, which depends on training. The solution is to break the cycle by making hyperparameter tuning an outer loop: run the entire train-evaluate pipeline as a single DAG invocation, then use the evaluation results to parameterize the next DAG run. Each DAG run is acyclic; the iteration happens across runs, not within a single run.
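The outer-loop idea can be sketched as follows. This is an illustration under stated assumptions, not how any particular orchestrator implements tuning: run_pipeline is a hypothetical stand-in for triggering one complete train-evaluate DAG run, its loss is a toy function, and the doubling rule is just one simple way to let one run's result choose the next run's parameters.

```python
def run_pipeline(params):
    """Stand-in for one acyclic train -> evaluate DAG run.

    Hypothetical: a real version would trigger the orchestrator and read back
    the evaluation report. The loss is a toy function with its minimum at a
    learning rate of 0.08.
    """
    learning_rate = params["learning_rate"]
    return {"validation_loss": (learning_rate - 0.08) ** 2}


def tune(initial_learning_rate, max_runs=5):
    """Outer loop: each iteration is a separate, complete pipeline run."""
    best_params = {"learning_rate": initial_learning_rate}
    best_loss = run_pipeline(best_params)["validation_loss"]
    for _ in range(max_runs - 1):
        # The previous run's evaluation result decides the next run's parameters.
        candidate = {"learning_rate": best_params["learning_rate"] * 2}
        loss = run_pipeline(candidate)["validation_loss"]
        if loss >= best_loss:
            break  # doubling stopped helping, so keep the previous best
        best_params, best_loss = candidate, loss
    return best_params


best = tune(0.01)
print(best)
```

No single call to run_pipeline contains a feedback edge. The feedback lives entirely in the loop that launches runs, which is why each run stays schedulable.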

Figure: ML pipeline DAG showing task execution order and failure retry flow.

Fan-out and fan-in patterns

Real pipelines are rarely linear chains. Fan-out occurs when a single task feeds multiple downstream tasks that can run in parallel. After compute_features completes, you might train three candidate models (XGBoost, LightGBM, neural net) simultaneously. Fan-in is the reverse: a task that waits for multiple upstream tasks to finish. An evaluate_best_model task cannot start until all three training tasks complete.

Fan-out is where orchestration earns its keep. Without it, you would need custom logic to launch parallel jobs and track their completion. The orchestrator handles this natively — you declare the dependencies, and it manages concurrent execution, resource allocation, and failure isolation across branches.

          compute_features
         /       |        \
train_xgb   train_lgbm   train_nn
         \       |        /
           evaluate_best
                 |
          deploy_champion
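For a sense of what the orchestrator does on your behalf here, the sketch below reproduces the fan-out and fan-in by hand with Python's standard concurrent.futures module. The train and evaluate_best functions and the scores they use are hypothetical placeholders, chosen so the example runs without any ML library.

```python
from concurrent.futures import ThreadPoolExecutor


def train(model_name, features):
    """Hypothetical training task; returns a placeholder validation score."""
    fake_scores = {"xgb": 0.91, "lgbm": 0.93, "nn": 0.89}
    return {"model": model_name, "score": fake_scores[model_name], "rows": len(features)}


def evaluate_best(results):
    """Fan-in: needs every training result before it can pick a champion."""
    return max(results, key=lambda result: result["score"])


features = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for the compute_features output

# Fan-out: the three training tasks are independent, so submit them together.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(train, name, features) for name in ("xgb", "lgbm", "nn")]
    # Fan-in: result() blocks until each upstream task has finished.
    results = [future.result() for future in futures]

champion = evaluate_best(results)
print(champion["model"])
```

Even this small version has to manage a worker pool and wait on futures. An orchestrator adds retries, per-task resource limits, and failure isolation on top, which is the custom logic the text above says you would otherwise write yourself.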

Task granularity trade-offs

How fine-grained should each task be? This is one of the most consequential design decisions in pipeline engineering.

Too fine-grained means each task does very little — one task per feature column, one task per data validation check. The overhead of scheduling, serializing inputs/outputs, and tracking state across hundreds of micro-tasks dominates the actual computation. A pipeline with 500 tiny tasks might spend more time on orchestrator overhead than on ML work.

Too coarse-grained means each task does too much. If compute_features_and_train is a single monolithic task and it fails during training, you must rerun the entire feature computation even though it succeeded. Coarse tasks also prevent parallelism — you cannot fan out training variants if training is bundled with feature engineering.

The practical guideline: each task should represent a logically atomic unit of work that produces a meaningful intermediate artifact. Feature computation produces a feature table. Training produces a model artifact. Evaluation produces a metrics report. If a task fails, you want to rerun exactly that task and everything downstream — nothing more.

Key Insight

A good litmus test for task granularity: if a task fails and you rerun it, does it waste significant compute by redoing work that already succeeded? If yes, split the task. If the task is so small that the orchestrator overhead exceeds the compute time, merge it with its neighbor.
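The "rerun exactly that task and everything downstream" rule can be made precise with a small graph traversal. The sketch below assumes the example DAG from earlier in this section, expressed as a mapping from each task to its direct downstream tasks; tasks_to_rerun is a hypothetical helper name, not an orchestrator API.

```python
from collections import deque

# Map each task to its direct downstream tasks (example DAG from this section).
DOWNSTREAM = {
    "ingest_raw_data": ["validate_data"],
    "validate_data": ["compute_features"],
    "compute_features": ["train", "build_serving_index"],
    "train": ["evaluate"],
    "build_serving_index": [],
    "evaluate": ["deploy_model"],
    "deploy_model": [],
}


def tasks_to_rerun(failed_task, downstream):
    """Return the failed task plus everything reachable downstream of it."""
    to_rerun = {failed_task}
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in downstream[current]:
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun


print(sorted(tasks_to_rerun("train", DOWNSTREAM)))
```

With separate tasks, a failure in train leaves compute_features and build_serving_index untouched. If feature computation and training were merged into one task, the rerun set would grow to include the feature work that had already succeeded.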

Parameterized pipelines

Hard-coding values like training dates, model hyperparameters, or data source paths into task code makes pipelines brittle. Parameterized pipelines accept runtime arguments that flow through the entire DAG.

python

# Parameterized pipeline definition (Airflow-style)
from airflow import DAG
from airflow.models.param import Param

dag = DAG(
    dag_id="training_pipeline",
    # Defaults below can be overridden per run when the DAG is triggered.
    params={
        "training_date": Param("2025-01-15", type="string"),
        "learning_rate": Param(0.01, type="number"),
        "data_source": Param("s3://ml-data/prod/", type="string"),
    },
)

Parameters enable backfills (rerun the pipeline for a historical date), experiments (run the same pipeline with different hyperparameters), and environment promotion (same pipeline code, different data source for staging vs. production). Without parameterization, each of these requires a code change and a new deployment.
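The three uses above can be sketched in plain Python, independent of any orchestrator. PipelineParams and backfill_params are hypothetical names introduced for illustration, and the staging path is an assumed example rather than a real location; the defaults mirror the parameter values shown earlier.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class PipelineParams:
    training_date: str
    learning_rate: float = 0.01
    data_source: str = "s3://ml-data/prod/"


def backfill_params(base, start, end):
    """One parameter set per day in [start, end]; other values come from base."""
    days = (end - start).days + 1
    return [
        replace(base, training_date=(start + timedelta(days=offset)).isoformat())
        for offset in range(days)
    ]


base = PipelineParams(training_date="2025-01-15")

# Backfill: same pipeline, one run per historical date.
runs = backfill_params(base, date(2025, 1, 13), date(2025, 1, 15))

# Experiment: same pipeline, different hyperparameter.
experiment = replace(base, learning_rate=0.05)

# Environment promotion: same pipeline, different data source (assumed path).
staging = replace(base, data_source="s3://ml-data/staging/")

for run in runs:
    print(run)
```

Each variant is a new set of values passed to unchanged pipeline code, which is what lets all three happen without a deployment.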

Pipeline versioning

ML pipelines are code, and code changes over time. When you add a new feature column to the feature engineering step, every model trained after that change uses a different feature set than models trained before it. Without versioning, you cannot answer basic questions: "Which version of the pipeline produced the model that is currently serving predictions?"

Version your pipelines the same way you version application code — Git commits, tags, or semantic versions. Tag each pipeline run with the code version, the parameter values, and the input data snapshot. This creates an audit trail from prediction back to the exact pipeline run, code version, and training data that produced it.

Pipeline versioning also enables safe rollbacks. If a new pipeline version produces a model with degraded accuracy, you can redeploy the previous model and revert to the previous pipeline version, then compare the two versions to find the change that caused the regression instead of guessing.

In practice, pipeline versioning combines three dimensions:

Code version — the Git commit SHA or tag that defines the pipeline logic (feature transformations, model architecture, hyperparameters).
Data version — a snapshot identifier for the input data. This can be a data warehouse snapshot timestamp, a DVC (Data Version Control) hash, or a partition date.
Environment version — the container image tag or dependency lockfile hash that specifies the runtime environment (Python version, library versions, CUDA version).

A fully reproducible pipeline run requires all three. Two runs with the same code but different library versions can produce different model outputs due to numerical differences in underlying implementations. Recording the environment version alongside the code version eliminates this ambiguity.
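One way to record all three dimensions is a small manifest attached to every run. The sketch below is an assumption about how such a record might look, not a standard format: RunManifest, its field names, and the example values (commit SHA, image tag) are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    code_version: str         # e.g. a Git commit SHA or tag
    data_version: str         # e.g. a partition date, snapshot timestamp, or DVC hash
    environment_version: str  # e.g. a container image tag or lockfile hash
    params: dict              # runtime parameter values for this run

    def fingerprint(self):
        """Short stable hash: identical inputs always give the same value."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# All values below are made-up examples.
manifest = RunManifest(
    code_version="3f2a9c1",
    data_version="2025-01-15",
    environment_version="trainer:1.4.2",
    params={"learning_rate": 0.01},
)
upgraded_env = RunManifest(
    code_version="3f2a9c1",
    data_version="2025-01-15",
    environment_version="trainer:1.5.0",
    params={"learning_rate": 0.01},
)

print(manifest.fingerprint(), upgraded_env.fingerprint())
```

Storing the fingerprint with the model artifact gives a quick equality check: two models with the same fingerprint came from the same code, data, environment, and parameters, while a library upgrade alone is enough to change it.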
