ML Pipeline Orchestration
Without an orchestrator, your ML workflow is a chain of cron jobs and bash scripts. When the training step fails at 3 AM, nobody notices until the daily predictions are stale. The data science team reruns everything manually, hoping they remembered the right order. One person runs feature engineering before the new data lands; another kicks off training against yesterday's features. The pipeline concept exists to eliminate this chaos by making dependencies explicit, execution automatic, and failures visible.
The difference between a Jupyter notebook and a production ML system is the pipeline. A notebook runs interactively: a human clicks cells in order, inspects intermediate results, and retries when something fails. A pipeline encodes that same sequence of steps in a way that runs unattended, recovers from failures automatically, and produces reproducible results regardless of who or what triggers it. Getting from notebook to pipeline is where many ML teams spend much of their engineering effort.
An ML pipeline is a directed acyclic graph (DAG) where each node is a task and each edge is a dependency. "Directed" means every edge points one way, from an upstream task to the task that consumes its output, so work flows from ingestion toward deployment. "Acyclic" means no circular dependencies: task A cannot depend on task B if B already depends on A, whether directly or through a chain of other tasks. This constraint is what makes pipelines schedulable: the orchestrator can topologically sort the DAG and execute tasks in the correct order, parallelizing independent branches.
A typical ML pipeline DAG runs through stages such as data ingestion, feature computation, training, evaluation, and model registration.
Each node encapsulates a single unit of work — pulling data from a warehouse, computing feature vectors, training a model, evaluating metrics, pushing a model artifact to a registry. The edges encode the contract: "do not start training until feature computation finishes."
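To make the scheduling idea concrete, here is a minimal sketch that uses the standard library's graphlib module (Python 3.9+) to order a small DAG. The task names are hypothetical, and a real orchestrator does far more than compute an order; this only illustrates how declared dependencies determine execution order and how a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "ingest_data": set(),
    "compute_features": {"ingest_data"},
    "train_model": {"compute_features"},
    "evaluate_model": {"train_model"},
    "register_model": {"evaluate_model"},
}


def execution_order(deps):
    """Return one valid execution order, or raise CycleError if deps has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
# ['ingest_data', 'compute_features', 'train_model', 'evaluate_model', 'register_model']

# Adding an edge from evaluation back into training creates a cycle,
# and the sorter refuses to produce an order.
cyclic = dict(dependencies)
cyclic["train_model"] = {"compute_features", "evaluate_model"}
try:
    execution_order(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)  # True
```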
Why "acyclic"? Imagine a pipeline where the model evaluation step feeds back into the training step to adjust hyperparameters. If you encode this as a cycle in the DAG, the orchestrator cannot determine execution order — training depends on evaluation, which depends on training. The solution is to break the cycle by making hyperparameter tuning an outer loop: run the entire train-evaluate pipeline as a single DAG invocation, then use the evaluation results to parameterize the next DAG run. Each DAG run is acyclic; the iteration happens across runs, not within a single run.
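The outer-loop idea can be sketched in a few lines. In this illustration, run_pipeline is a hypothetical stand-in for one complete train-evaluate DAG run, and its scoring formula is a made-up placeholder rather than a real model. The point is only that each call is acyclic and the feedback happens between calls.

```python
def run_pipeline(learning_rate: float) -> float:
    """Stand-in for one acyclic train-then-evaluate DAG run.

    Returns a validation score. The quadratic below is an invented
    placeholder that peaks at learning_rate = 0.1.
    """
    return 1.0 - (learning_rate - 0.1) ** 2


def tune(start: float, step: float, n_runs: int):
    """Simple hill climb: each run's result chooses the next run's parameter."""
    best_lr = start
    best_score = run_pipeline(best_lr)
    for _ in range(n_runs):
        candidate = best_lr + step
        score = run_pipeline(candidate)  # one full DAG invocation
        if score > best_score:
            best_lr, best_score = candidate, score
        else:
            step = -step / 2  # overshoot: reverse direction and shrink the step
    return best_lr, best_score


best_lr, best_score = tune(start=0.5, step=-0.1, n_runs=20)
print(round(best_lr, 3), round(best_score, 3))
```

The loop itself lives outside the DAG, so the orchestrator never sees a cycle; it only sees a series of independent, parameterized runs.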

Fan-out and fan-in patterns
Real pipelines are rarely linear chains. Fan-out occurs when a single task feeds multiple downstream tasks that can run in parallel. After compute_features completes, you might train three candidate models (XGBoost, LightGBM, neural net) simultaneously. Fan-in is the reverse: a task that waits for multiple upstream tasks to finish. An evaluate_best_model task cannot start until all three training tasks complete.
Fan-out is where orchestration earns its keep. Without it, you would need custom logic to launch parallel jobs and track their completion. The orchestrator handles this natively — you declare the dependencies, and it manages concurrent execution, resource allocation, and failure isolation across branches.
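As a rough illustration of what the orchestrator does for you, the sketch below hand-rolls fan-out and fan-in with the standard library's concurrent.futures. The function bodies and scores are invented placeholders, not real training code; only the task names compute_features and evaluate_best_model come from the text above.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_features():
    # Placeholder feature table: a list of (x, y) rows.
    return [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]


def train(model_name, features):
    # Placeholder "training": returns an invented validation score per model.
    made_up_scores = {"xgboost": 0.91, "lightgbm": 0.93, "neural_net": 0.89}
    return {"model": model_name, "score": made_up_scores[model_name], "rows": len(features)}


def evaluate_best_model(results):
    # Fan-in task: needs every training result before it can choose.
    return max(results, key=lambda r: r["score"])


features = compute_features()
model_names = ["xgboost", "lightgbm", "neural_net"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Fan-out: three independent training tasks submitted at once.
    futures = [pool.submit(train, name, features) for name in model_names]
    # Fan-in: result() blocks until each upstream task has finished.
    results = [f.result() for f in futures]

best = evaluate_best_model(results)
print(best["model"])
```

Even this small version has to decide on a worker count and what to do if one branch raises. Retries, resource limits, and failure isolation are the parts an orchestrator takes over.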
Task granularity trade-offs
How fine-grained should each task be? This is one of the most consequential design decisions in pipeline engineering.
Too fine-grained means each task does very little — one task per feature column, one task per data validation check. The overhead of scheduling, serializing inputs/outputs, and tracking state across hundreds of micro-tasks dominates the actual computation. A pipeline with 500 tiny tasks might spend more time on orchestrator overhead than on ML work.
Too coarse-grained means each task does too much. If compute_features_and_train is a single monolithic task and it fails during training, you must rerun the entire feature computation even though it succeeded. Coarse tasks also prevent parallelism — you cannot fan out training variants if training is bundled with feature engineering.
The practical guideline: each task should represent a logically atomic unit of work that produces a meaningful intermediate artifact. Feature computation produces a feature table. Training produces a model artifact. Evaluation produces a metrics report. If a task fails, you want to rerun exactly that task and everything downstream — nothing more.
A good litmus test for task granularity: if a task fails and you rerun it, does it waste significant compute by redoing work that already succeeded? If yes, split the task. If the task is so small that the orchestrator overhead exceeds the compute time, merge it with its neighbor.
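The rerun rule (the failed task plus everything downstream, nothing more) is a graph traversal. Here is a minimal sketch over a hypothetical DAG; the task names and the tasks_to_rerun helper are assumptions made for illustration, not part of any particular orchestrator.

```python
from collections import deque

# Hypothetical DAG: each task maps to the tasks that consume its output.
downstream = {
    "ingest_data": ["compute_features"],
    "compute_features": ["train_xgboost", "train_lightgbm"],
    "train_xgboost": ["evaluate_best_model"],
    "train_lightgbm": ["evaluate_best_model"],
    "evaluate_best_model": [],
}


def tasks_to_rerun(failed_task, downstream):
    """Return the failed task plus every task reachable downstream of it."""
    to_rerun = {failed_task}
    queue = deque([failed_task])
    while queue:
        task = queue.popleft()
        for child in downstream.get(task, []):
            if child not in to_rerun:
                to_rerun.add(child)
                queue.append(child)
    return to_rerun


print(sorted(tasks_to_rerun("train_xgboost", downstream)))
# ['evaluate_best_model', 'train_xgboost']
```

With separate tasks, a failure in train_xgboost leaves ingest_data, compute_features, and train_lightgbm untouched. If features and training were bundled into one task, the same failure would force the feature work to run again.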
Parameterized pipelines
Hard-coding values like training dates, model hyperparameters, or data source paths into task code makes pipelines brittle. Parameterized pipelines accept runtime arguments that flow through the entire DAG.
Parameters enable backfills (rerun the pipeline for a historical date), experiments (run the same pipeline with different hyperparameters), and environment promotion (same pipeline code, different data source for staging vs. production). Without parameterization, each of these requires a code change and a new deployment.
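One way to picture this is a small parameter object handed to every run. The sketch below is illustrative only: PipelineParams, run_pipeline, and the source paths are hypothetical names, and run_pipeline simply returns what the tasks would receive instead of doing real work.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class PipelineParams:
    run_date: date
    data_source: str            # hypothetical warehouse location
    learning_rate: float = 0.1


def run_pipeline(params: PipelineParams) -> dict:
    """Placeholder for the DAG: shows the values each task would receive."""
    partition = f"{params.data_source}/dt={params.run_date.isoformat()}"
    return {"input_partition": partition, "learning_rate": params.learning_rate}


prod = PipelineParams(run_date=date(2024, 1, 15), data_source="warehouse/prod/events")

# Backfill: same code, three historical dates.
backfill_runs = [
    run_pipeline(replace(prod, run_date=prod.run_date - timedelta(days=d)))
    for d in range(1, 4)
]

# Experiment: same code and data, different hyperparameter.
experiment = run_pipeline(replace(prod, learning_rate=0.01))

# Environment promotion: same code, staging data source.
staging = run_pipeline(replace(prod, data_source="warehouse/staging/events"))

print(backfill_runs[0]["input_partition"])
# warehouse/prod/events/dt=2024-01-14
```

All three scenarios reuse the same run_pipeline code; only the argument changes.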
Pipeline versioning
ML pipelines are code, and code changes over time. When you add a new feature column to the feature engineering step, every model trained after that change uses a different feature set than models trained before it. Without versioning, you cannot answer basic questions: "Which version of the pipeline produced the model that is currently serving predictions?"
Version your pipelines the same way you version application code — Git commits, tags, or semantic versions. Tag each pipeline run with the code version, the parameter values, and the input data snapshot. This creates an audit trail from prediction back to the exact pipeline run, code version, and training data that produced it.
Pipeline versioning also enables safe rollbacks. If a new pipeline version produces a model with degraded accuracy, you can redeploy the previous model and revert to the previous pipeline version without first having to work out which code change caused the regression.
In practice, pipeline versioning combines three dimensions:
- Code version — the Git commit SHA or tag that defines the pipeline logic (feature transformations, model architecture, hyperparameters).
- Data version — a snapshot identifier for the input data. This can be a data warehouse snapshot timestamp, a DVC (Data Version Control) hash, or a partition date.
- Environment version — the container image tag or dependency lockfile hash that specifies the runtime environment (Python version, library versions, CUDA version).
A fully reproducible pipeline run requires all three. Two runs with the same code but different library versions can produce different model outputs due to numerical differences in underlying implementations. Recording the environment version alongside the code version eliminates this ambiguity.
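The three dimensions can be recorded together and compared across runs. The following is a minimal sketch: the RunVersion class, the differing_dimensions helper, and all version strings are hypothetical placeholders, and a real system would store this record alongside the model artifact.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunVersion:
    code_version: str         # e.g. a Git commit SHA or tag
    data_version: str         # e.g. snapshot timestamp, DVC hash, or partition date
    environment_version: str  # e.g. container image tag or lockfile hash

    def fingerprint(self) -> str:
        """Stable hash over all three dimensions; equal only if all three match."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def differing_dimensions(a: RunVersion, b: RunVersion) -> list:
    """List the version dimensions on which two runs disagree."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Placeholder values for illustration only.
run_a = RunVersion("abc1234", "2024-01-15", "trainer-image:1.4.2")
run_b = RunVersion("abc1234", "2024-01-15", "trainer-image:1.5.0")

print(differing_dimensions(run_a, run_b))              # ['environment_version']
print(run_a.fingerprint() == run_b.fingerprint())      # False
```

Here the two runs share code and data, so a difference in their outputs points at the environment, which is the case a code-only version tag would miss.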