Dataset Management and Versioning

Topics Covered

Data Versioning Fundamentals

Why Data Versioning Is Hard

Versioning Tools

Immutable Datasets

Lineage Tracking

Why Lineage Matters

Implementing Lineage

Reproducibility and Data Splits

Deterministic Train/Test Splits

Data Snapshots for Point-in-Time Training

Avoiding Data Leakage

Data Governance for ML

PII and Privacy Regulations

Data Contracts Between Producers and Consumers

Access Control and Audit Trails

Data Versioning Fundamentals

Code has git. Models need the equivalent for data. When a model regresses in production, the first question is always: "what changed: the code, the data, or the features?" If you version your code but not your data, you can only answer one-third of that question. Data versioning makes data changes trackable, diffable, and reversible, the same properties that make git indispensable for code.

Why Data Versioning Is Hard

Data versioning is harder than code versioning for three reasons. First, scale: a git repository rarely exceeds a few gigabytes; a training dataset can be terabytes. Storing full copies of every version is prohibitively expensive. Second, format: code is text files with line-level diffs; data is Parquet files, images, CSVs, or database dumps where "diff" does not have an obvious meaning. Third, frequency: code changes with explicit commits; data changes continuously as new records arrive, old records are corrected, and upstream schemas evolve.

Figure: Dataset versioning with diffs across versions

Versioning Tools

Three tools dominate data versioning, each with a different architecture:

DVC (Data Version Control) extends git. You keep your code in git as usual. DVC tracks large data files alongside the code by storing lightweight pointer files (.dvc files) in git and the actual data in remote storage (S3, GCS, Azure Blob). A .dvc file contains a hash of the data content. When the data changes, the hash changes, and git tracks the pointer change. To reproduce any experiment, check out the git commit and run dvc pull — you get the exact data that existed at that point.

LakeFS takes a different approach: it provides a git-like interface directly on top of object storage. Your data lake becomes a repository with branches, commits, and merges. You can create a branch, load new data into it, validate it, and merge it to main — all without copying data. LakeFS uses copy-on-write semantics, so branching is nearly free regardless of dataset size.

Delta Lake (originally from Databricks) adds versioning to tables stored as files in a data lake. It stores data in Parquet files with a transaction log that records every change. You can query any historical version of a table with SELECT * FROM table VERSION AS OF 5 or SELECT * FROM table TIMESTAMP AS OF '2024-01-15'. Time travel is built into the query engine, not bolted on.
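The pointer-file idea described for DVC above can be sketched in a few lines. This is an illustration of content-addressed pointers, not DVC's implementation or file format: the function names (content_hash, write_pointer), the .pointer.json suffix, and the JSON layout are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file next to the data (hypothetical layout)."""
    pointer = {
        "path": data_path.name,
        "sha256": content_hash(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer_path
```

Only the small pointer file would be committed to git. When the data changes, the digest in the pointer changes with it, so the git history records which exact bytes each commit refers to.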

Key Insight

The right versioning tool depends on where your data lives. DVC fits teams that store data as files in object storage and work in git-centric workflows. LakeFS fits teams that need branch-based data workflows on a data lake. Delta Lake fits teams that work in Spark or Databricks and query data through SQL. All three achieve the same goal — reproducible, auditable data — through different architectures.

Immutable Datasets

The cardinal rule of data versioning: never modify a dataset in place. Once version v1 is created, it is immutable. If you need to fix a label error, add new records, or remove PII, create version v2. The old version remains accessible for reproducibility; the one exception is a legally required deletion, where the versions that contain the affected records must be retired as well.

Immutability is not just a best practice. It prevents a class of bugs that are almost impossible to debug otherwise. Without immutability, a model trained last week and a model trained today use "the same dataset" that has silently changed. Performance differences are unexplainable because you cannot diff what no longer exists.

The practical implementation: store datasets in object storage with versioned paths (s3://datasets/training/v1/, s3://datasets/training/v2/) or use a tool (DVC, LakeFS, Delta Lake) that enforces immutability through its version control mechanism. Outside of required deletions, delete nothing. Storage is cheap; debugging irreproducible results is expensive.
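As a minimal sketch of the versioned-path approach, the function below publishes a dataset directory under a version name and refuses to overwrite an existing version. It uses a local filesystem as a stand-in for object storage; publish_version and the directory names are hypothetical.

```python
import shutil
from pathlib import Path


def publish_version(source_dir: Path, root: Path, version: str) -> Path:
    """Copy source_dir to root/version, refusing to overwrite an existing version."""
    target = root / version
    if target.exists():
        raise FileExistsError(f"{target} already exists; publish a new version instead")
    shutil.copytree(source_dir, target)
    # Mark files read-only as a guard against accidental in-place edits.
    for file in target.rglob("*"):
        if file.is_file():
            file.chmod(0o444)
    return target
```

The read-only flag is only a guard rail. On real object storage the equivalent protection would come from the storage system or versioning tool rather than from file permissions.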

Lineage Tracking

Data versioning tells you what each version of a dataset looks like. Lineage tracking tells you how it got there. Lineage is the directed graph from raw data sources through every transformation, feature computation, model training, and prediction, with version tags at every node. When a model regresses, lineage lets you trace backward from "model v5 produces bad predictions" to "the training data for model v5 came from feature pipeline v3, which consumed raw data v12, which introduced corrupted records on January 15th."

Figure: Data lineage graph from raw data to predictions

Why Lineage Matters

Without lineage, debugging a model regression is guesswork. You know the model got worse, but you do not know whether the cause is a code change, a data change, a feature engineering change, or an upstream schema migration. With lineage, you query: "show me every input that went into model v5 that differs from model v4." The diff immediately narrows the search to the component that changed.

Lineage also enables impact analysis: before changing a data pipeline, query "what models and features depend on this pipeline?" If the answer is "3 production models and 12 feature definitions," you know the blast radius of your change and can test accordingly. Without lineage, you change a pipeline and discover the downstream impact when models break in production.
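Impact analysis of this kind amounts to a reachability query on the lineage graph. The sketch below assumes the graph is available as a plain dictionary from each node to its direct consumers; the node names and the downstream helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: producer -> direct consumers.
EDGES = {
    "raw_events": ["sessionize_pipeline"],
    "sessionize_pipeline": ["feature:session_length", "feature:clicks_7d"],
    "feature:session_length": ["model:churn"],
    "feature:clicks_7d": ["model:churn", "model:ranking"],
}


def downstream(node: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every node reachable from `node`, i.e. its blast radius."""
    seen: set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for consumer in edges.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


print(sorted(downstream("sessionize_pipeline", EDGES)))
# prints ['feature:clicks_7d', 'feature:session_length', 'model:churn', 'model:ranking']
```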

Implementing Lineage

Lineage tracking does not require a specialized tool, though tools help. The minimal implementation: every training run logs its inputs (data versions, feature versions, code commit) and outputs (model artifact, metrics). Every feature pipeline logs its inputs (raw data version, code commit) and outputs (feature table version). These logs form the edges of the lineage graph.
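A minimal sketch of such run logs, assuming each run is recorded as an output plus a mapping of input names to versions. The RunRecord type, the diff_inputs helper, and the version strings are illustrative, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One lineage record: what a run produced and which inputs it consumed."""
    output: str
    inputs: dict[str, str]  # input name -> version


def diff_inputs(old: RunRecord, new: RunRecord) -> dict[str, tuple]:
    """Return the inputs whose version differs between two runs."""
    names = old.inputs.keys() | new.inputs.keys()
    return {
        name: (old.inputs.get(name), new.inputs.get(name))
        for name in sorted(names)
        if old.inputs.get(name) != new.inputs.get(name)
    }


v4 = RunRecord("model:v4", {"data": "v11", "features": "v3", "code": "a1b2c3d"})
v5 = RunRecord("model:v5", {"data": "v12", "features": "v3", "code": "a1b2c3d"})
print(diff_inputs(v4, v5))  # prints {'data': ('v11', 'v12')}
```

Here the diff points at the data version as the only input that changed between the two models, which is the narrowing step a lineage query is meant to provide.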

Metadata stores (MLflow, Amundsen, DataHub) centralize lineage information. OpenLineage is not a store itself: it defines an open standard for emitting lineage events from any pipeline tool (Airflow, Spark, dbt). The events are collected in a metadata store where they can be queried and visualized as a graph.

Figure: Tracing model regression back through lineage to data

Automated lineage is preferable to manual logging. Orchestration tools like Airflow can emit lineage events automatically for each task — which datasets a task read, which datasets it wrote, which code ran. Spark and dbt plugins can capture column-level lineage (which input columns influenced which output columns), enabling even finer-grained debugging.

The practical minimum: if you cannot implement full lineage tooling, at least log the data version, feature version, and code commit with every model artifact. This three-tuple is enough to reproduce any model and narrow the search when something goes wrong.
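One low-tech way to keep that three-tuple attached to the model is a small sidecar file written next to the artifact. The sketch below assumes JSON and a .lineage.json suffix; both, like the function name write_model_metadata, are choices made for illustration.

```python
import json
from pathlib import Path


def write_model_metadata(model_path: Path, data_version: str,
                         feature_version: str, code_commit: str) -> Path:
    """Write the lineage three-tuple next to a model artifact as JSON."""
    metadata = {
        "model_artifact": model_path.name,
        "data_version": data_version,
        "feature_version": feature_version,
        "code_commit": code_commit,
    }
    sidecar = model_path.with_name(model_path.name + ".lineage.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return sidecar
```

A model registry that supports tags or metadata fields can hold the same three values instead of a sidecar file.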

Interview Tip

In system design interviews involving ML, mentioning lineage tracking signals senior-level thinking. When the interviewer asks how you would debug a model that suddenly produces bad predictions, the answer is: check the lineage graph. Trace back from the model version to the data version that trained it. Diff that data version against the previous one. This structured approach is what distinguishes production ML engineers from researchers.

Reproducibility and Data Splits

Reproducibility in ML means running the same training pipeline on the same data produces the same model with the same performance. This sounds trivial but is surprisingly hard. The data must be identical (versioning), the code must be identical (git), the random seed must be fixed (deterministic splits), and even the hardware and library versions must match (environment pinning). Miss any one of these and your "reproduction" produces different results.

Deterministic Train/Test Splits

The train/test split is the foundation of model evaluation. If the split changes between experiments, you are comparing models that were trained and tested on different data — an apples-to-oranges comparison that makes experimental results meaningless.

A deterministic split requires two things: a fixed random seed and a fixed dataset version. Given both, and the same row order, the split function produces the same partition every time. In scikit-learn, for example: train_test_split(data, test_size=0.2, random_state=42). The seed (42) ensures the same rows land in the same partition across runs.
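The mechanism can be shown with the standard library alone. The function below is a simplified stand-in for a framework split function, not a reimplementation of train_test_split; seeded_split is a hypothetical name.

```python
import random


def seeded_split(rows: list, test_size: float = 0.2, seed: int = 42):
    """Shuffle a copy of `rows` with a fixed seed, then cut at the test fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


rows = list(range(10))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)
print(train_a == train_b and test_a == test_b)  # prints True
```

The guarantee holds only for the same rows in the same order. Appending or reordering rows changes which positions the shuffle moves, and with them the partition.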

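Hash-based splitting is more robust than seed-based splitting for production systems. Instead of relying on a random seed and row ordering, hash a stable identifier (user ID, record ID) and assign each record to a partition based on the hash value: records where hash(user_id) % 10 < 8 go to training, the rest to test. The hash must be a stable one, such as MD5 or SHA-256 over the ID's bytes; Python's built-in hash() is salted per process for strings and gives different results from run to run. With a stable hash, the approach is order-independent (adding new records does not reshuffle existing assignments), consistent across languages and frameworks, and, when the hashed identifier is the entity itself, prevents entity leakage (all records for one user land in the same partition).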
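A minimal sketch of hash-based assignment, using SHA-256 from hashlib rather than Python's built-in hash(), which is salted per process for strings. The function name and the 8-of-10 bucket layout are illustrative.

```python
import hashlib


def assign_partition(entity_id: str, train_buckets: int = 8,
                     total_buckets: int = 10) -> str:
    """Assign an entity to 'train' or 'test' from a stable hash of its ID."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a bucket.
    bucket = int.from_bytes(digest[:8], "big") % total_buckets
    return "train" if bucket < train_buckets else "test"


# The assignment depends only on the ID, not on row order or dataset size.
print(assign_partition("user-123") == assign_partition("user-123"))  # prints True
```

Because the result is a pure function of the identifier, a record keeps its partition when new records arrive, and every record that shares a user ID lands on the same side of the split.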

Data Snapshots for Point-in-Time Training

A model trained on "the latest data" is not reproducible because "the latest data" is a moving target. Data snapshots freeze the state of a dataset at a specific moment. The snapshot becomes the versioned input to a training run.

Snapshot strategies depend on the data infrastructure:

  • Object storage: Copy the current dataset to a versioned path (s3://data/training/2024-01-15/). Immutable by convention.
  • Delta Lake: The timestamp or version number is the snapshot. SELECT * FROM features VERSION AS OF 42 retrieves the exact state.
  • DVC: The git commit hash serves as the snapshot. Each commit locks the data version via the .dvc pointer file.

The critical rule: every model artifact must record which snapshot trained it. Without this link, you cannot reproduce the model or debug regressions.

Avoiding Data Leakage

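Data leakage is when information that will not be available at prediction time reaches the model during training, most commonly because test-set information contaminates the training set. The model appears to perform well in evaluation but fails in production because it "memorized" answers it should never have seen.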

Common leakage patterns:

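Temporal leakage: Using information from the future to predict the past. For example, a churn prediction model is trained on January data that includes features computed from February activity. In production, February features do not exist at prediction time. Fix: split by time, training on data before a cutoff and testing on data after it, and compute every feature only from data available at the prediction timestamp.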

Entity leakage: A user appears in both training and test sets. The model memorizes user-specific patterns rather than learning generalizable features. Fix: split by entity — all records for a user go to one partition.

Feature leakage: A feature encodes the label, directly or indirectly. An insurance claim prediction model that includes "claim amount" as a feature predicts claims perfectly, because claim amount is only nonzero when a claim exists. Fix: audit each feature for whether its value is known at prediction time, and treat any feature that is a consequence of the label as suspect.
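The first two fixes above can be sketched as small helpers: a time-based split and a check for entities that appear on both sides of a split. The record layout (dictionaries with user_id and event_date keys) and the function names are assumptions for the example.

```python
from datetime import date


def temporal_split(records: list[dict], cutoff: date):
    """Train on records strictly before `cutoff`; test on records on or after it."""
    train = [r for r in records if r["event_date"] < cutoff]
    test = [r for r in records if r["event_date"] >= cutoff]
    return train, test


def shared_entities(train: list[dict], test: list[dict], key: str = "user_id") -> set:
    """Return entity IDs present in both partitions; an empty set means no overlap."""
    return {r[key] for r in train} & {r[key] for r in test}


records = [
    {"user_id": "a", "event_date": date(2024, 1, 10)},
    {"user_id": "b", "event_date": date(2024, 1, 20)},
    {"user_id": "a", "event_date": date(2024, 2, 5)},
]
train, test = temporal_split(records, cutoff=date(2024, 2, 1))
print(shared_entities(train, test))  # prints {'a'}
```

The example also shows that the fixes are independent: a clean time-based split can still place the same user in both partitions, so the entity check is worth running even after splitting by time.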

Data Governance for ML

ML models are data products. They consume data, learn from data, and produce predictions that affect users. This makes ML systems subject to every data governance requirement that applies to the underlying data, plus additional requirements specific to ML. A model trained on PII-containing data can memorize aspects of that data. Deleting the training data does not delete the model's learned representations. Governance for ML must account for this.

PII and Privacy Regulations

GDPR, CCPA, and similar regulations give users rights over their data: the right to access (what data do you have about me?), the right to deletion (remove my data), and the right to explanation (why did you make this decision about me?). For ML systems, these rights create specific obligations:

Right to deletion (GDPR Article 17): If a user requests deletion and their data was used to train a model, the conservative response is to either retrain the model without their data or demonstrate that the model cannot practically reconstruct the deleted data. How far erasure obligations extend to trained models is still debated. Full retraining is expensive but certain. The alternative, machine unlearning techniques that approximately remove a user's influence, is an active research area but not yet a reliable compliance strategy.

Right to explanation (GDPR Article 22): Article 22 restricts decisions based solely on automated processing that significantly affect users (credit scoring, hiring, insurance). Read together with the regulation's transparency provisions, it is commonly interpreted as requiring that such decisions be explainable. This constrains model choice: a deep neural network that provides no feature attributions may be accurate but non-compliant. Teams in regulated domains often choose interpretable models (logistic regression, gradient boosted trees with feature importance) or add explainability layers (SHAP, LIME) to complex models.

Data minimization: Collect only the data you need for the stated purpose. Training a recommendation model does not justify collecting health records. Audit your feature set for unnecessary PII — remove features that do not improve model performance but increase privacy risk.

Figure: Data rollback and model recovery flow

Data Contracts Between Producers and Consumers

ML models are fragile consumers. A slight change in an upstream data schema (a renamed column, a new null pattern, a changed encoding) can break feature pipelines outright or, worse, pass through unnoticed and produce garbage predictions. Data contracts formalize the agreement between data producers and ML consumers:

Schema contracts: The producer guarantees that the table has specific columns with specific types. Adding a column is fine; renaming or removing one is a breaking change that requires coordinated migration.

Freshness contracts: The producer guarantees that data is updated at least every N hours. If the ML feature pipeline expects hourly updates and the producer switches to daily without notice, online features go stale and model quality degrades silently.

Quality contracts: The producer guarantees bounds on null rates, value ranges, and distribution properties. "The age column will have less than 1% nulls and no values above 150." Violations trigger alerts before the data reaches the ML pipeline.
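The three contract types can be checked with a single validation pass before data reaches the ML pipeline. The sketch below is a simplified illustration over in-memory rows: the CONTRACT dictionary, its keys, and the validate function are hypothetical, and the thresholds mirror the examples in the text (hourly updates, under 1% nulls, no age above 150).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for a user table.
CONTRACT = {
    "columns": {"user_id": str, "age": int},
    "max_staleness": timedelta(hours=1),
    "max_null_rate": {"age": 0.01},
    "max_value": {"age": 150},
}


def validate(rows, last_updated, now, contract=CONTRACT):
    """Return a list of contract violations; an empty list means the data passes."""
    violations = []
    # Schema: every required column is present and holds the expected type.
    for column, expected in contract["columns"].items():
        for row in rows:
            if column not in row:
                violations.append(f"schema: column {column!r} is missing")
                break
            value = row[column]
            if value is not None and not isinstance(value, expected):
                violations.append(f"schema: column {column!r} holds {type(value).__name__}")
                break
    # Freshness: the last update must be recent enough.
    if now - last_updated > contract["max_staleness"]:
        violations.append(f"freshness: last update was {now - last_updated} ago")
    # Quality: the null rate must stay below the limit.
    for column, limit in contract["max_null_rate"].items():
        nulls = sum(1 for row in rows if row.get(column) is None)
        if rows and nulls / len(rows) >= limit:
            violations.append(f"quality: {column!r} null rate is {nulls / len(rows):.2%}")
    # Quality: numeric values must not exceed the upper bound.
    for column, limit in contract["max_value"].items():
        if any(isinstance(row.get(column), (int, float)) and row[column] > limit
               for row in rows):
            violations.append(f"quality: {column!r} has values above {limit}")
    return violations


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
rows = [{"user_id": "u1", "age": 34}, {"user_id": "u2", "age": 200}]
print(validate(rows, last_updated=now - timedelta(minutes=30), now=now))
# prints ["quality: 'age' has values above 150"]
```

In practice the violation list would feed an alert or block the pipeline run; dedicated data validation tools cover the same ground with richer checks.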

Common Pitfall

The most dangerous data governance failure in ML is not a privacy violation — it is silent model degradation from upstream data changes. A renamed column crashes the pipeline (loud, detected immediately). A subtle distribution shift in a feature passes through silently and degrades model accuracy for weeks before anyone notices. Data contracts with automated validation are the primary defense.

Access Control and Audit Trails

ML datasets need the same access controls as any sensitive data, plus ML-specific controls:

Training data access: Who can access the data used to train production models? In regulated industries, training data access must be logged and limited to authorized personnel.

Model artifact access: Who can deploy a model to production? The model registry should enforce approval workflows — a model trained on sensitive data should not be deployable without review.

Prediction audit trails: For high-stakes decisions (credit, insurance, hiring), log the model version, input features, and output prediction for every decision. This enables post-hoc auditing and is required by some regulations. Retention policies determine how long these logs are kept; 7 years is a common figure for financial decisions, though the exact requirement depends on the jurisdiction and regulation.
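A prediction audit trail can start as an append-only log with one record per decision. The sketch below assumes a JSON Lines file on local disk; the log_prediction function, the field names, and the example model version are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_prediction(log_path: Path, model_version: str,
                   features: dict, prediction) -> dict:
    """Append one decision record to a JSON Lines audit log and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    # Append mode keeps earlier records untouched.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Because the log stores input features, it can contain the same sensitive data as the training set and falls under the same access controls and retention policy.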