ML Testing Strategies
Traditional software testing assumes deterministic behavior: given the same input, the function always returns the same output. ML breaks this assumption wherever learned parameters are involved. A model retrained on slightly different data produces different weights, different predictions, and different edge-case behavior. The same model architecture trained twice with different random seeds can disagree on 5-15% of predictions. This means test suites built around exact equality assertions on model outputs break on every retraining and tell you little about anything touching model inference.
This non-determinism creates a real organizational problem: teams that apply traditional testing mindsets to ML code either write brittle tests that break on every retraining (and get ignored), or write no tests at all (and ship bugs). Neither approach works. The solution requires understanding which parts of an ML system ARE deterministic and which are not, then applying different testing strategies to each.
The good news: a large fraction of ML code is NOT the model itself. Data preprocessing, feature engineering, input validation, and output formatting are all deterministic functions that follow standard software testing rules. The testing strategy is to draw a sharp boundary between "code that is deterministic" and "code that involves learned parameters," then apply different testing techniques to each. In a typical ML codebase, 60-80% of the code is deterministic infrastructure that can be tested with standard techniques. The remaining 20-40% requires the specialized approaches covered in this lesson.
What You Can Unit Test
Data preprocessing functions are the highest-value unit test target in any ML codebase. A function that normalizes pixel values to [0, 1], tokenizes text, or one-hot encodes categorical variables should produce identical output for identical input every time. These functions are also where silent bugs cause the most damage: a preprocessing bug that shifts all features by 0.01 will not crash anything, but it can degrade model accuracy by 3-8% in ways that are extremely difficult to debug.
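As a minimal sketch of what such tests can look like, the two functions below are hypothetical stand-ins for a real pipeline's preprocessing code (the names `normalize_pixels` and `one_hot` are assumptions, not from any particular library). Because they are deterministic, exact equality assertions are appropriate here.

```python
def normalize_pixels(pixels):
    """Scale 8-bit pixel values (0-255) into the range [0, 1]."""
    if any(p < 0 or p > 255 for p in pixels):
        raise ValueError("pixel values must be in [0, 255]")
    return [p / 255.0 for p in pixels]


def one_hot(value, categories):
    """Return a one-hot list for `value` over an ordered list of categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def test_normalize_pixels():
    # Deterministic code: identical input must give identical output.
    assert normalize_pixels([0, 51, 255]) == [0.0, 0.2, 1.0]
    assert normalize_pixels([]) == []


def test_one_hot():
    assert one_hot("red", ["blue", "red", "green"]) == [0, 1, 0]


def test_rejects_out_of_range_pixels():
    # Out-of-range input should fail loudly, not be silently scaled.
    try:
        normalize_pixels([256])
    except ValueError:
        return
    raise AssertionError("expected ValueError for pixel value 256")
```

A test runner such as pytest would collect the `test_` functions automatically; they can also be called directly.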
Feature transformation tests are equally important. If your pipeline computes log transforms, interaction features, or rolling averages, every one of these functions can be tested with exact expected outputs. The effort pays off during model retraining: when preprocessing bugs are caught immediately, you never waste a 6-hour training run on corrupted features.
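The same approach applies to feature transformations. The sketch below assumes two hypothetical helpers, `log1p_transform` and `rolling_mean`, written for illustration; a real pipeline would likely use a numerical library, but the testing idea is identical: hand-compute a small expected output and assert on it.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x) to each non-negative value."""
    if any(v < 0 for v in values):
        raise ValueError("log1p_transform expects non-negative values")
    return [math.log1p(v) for v in values]


def rolling_mean(values, window):
    """Trailing rolling mean; positions with fewer than `window` values are None."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def test_rolling_mean():
    # Expected values computed by hand: (1+2)/2, (2+3)/2, (3+4)/2.
    assert rolling_mean([1, 2, 3, 4], 2) == [None, 1.5, 2.5, 3.5]


def test_log1p_transform():
    assert log1p_transform([0.0]) == [0.0]
    # log(1 + (e - 1)) = log(e) = 1, up to floating-point rounding.
    assert math.isclose(log1p_transform([math.e - 1])[0], 1.0)
```

Note the use of `math.isclose` for the one result that depends on floating-point rounding; exact equality is reserved for values that are exactly representable.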
Input Validation and Edge Cases
Every ML system has an implicit contract about what inputs are valid. A recommendation model expects user IDs that exist in the embedding table, product IDs that map to known categories, and timestamps that are not from the future. Input validation tests verify that the system enforces this contract explicitly rather than failing silently.
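One way to make that contract explicit is a validation function with its own tests. The following is a sketch under stated assumptions: `KNOWN_USER_IDS` and `KNOWN_PRODUCT_IDS` are hypothetical in-memory stand-ins for the embedding table and product catalog, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timezone

# Hypothetical lookup tables standing in for the embedding table and catalog.
KNOWN_USER_IDS = {101, 102, 103}
KNOWN_PRODUCT_IDS = {"sku-1", "sku-2"}


class InvalidInputError(ValueError):
    """Raised when a request violates the model's input contract."""


def validate_request(user_id, product_id, timestamp, now=None):
    """Raise InvalidInputError unless the request satisfies the contract.

    `timestamp` and `now` must be timezone-aware datetimes.
    """
    now = now if now is not None else datetime.now(timezone.utc)
    if user_id not in KNOWN_USER_IDS:
        raise InvalidInputError(f"unknown user_id: {user_id!r}")
    if product_id not in KNOWN_PRODUCT_IDS:
        raise InvalidInputError(f"unknown product_id: {product_id!r}")
    if timestamp > now:
        raise InvalidInputError("timestamp is in the future")


def is_rejected(**request):
    """Test helper: True if validation raises, False if the request passes."""
    try:
        validate_request(**request)
    except InvalidInputError:
        return True
    return False
```

Passing `now` as a parameter keeps the future-timestamp check deterministic in tests, so the result does not depend on when the suite runs.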
Edge case testing for ML goes beyond what traditional software considers edge cases. An empty string input to a text classifier, a single-pixel image to a vision model, and a user with zero historical interactions fed to a recommendation system are all valid production scenarios that many models handle poorly. The model might not crash, but it might produce a uniform probability distribution (effectively saying "I have no idea"), which is a valid output that your downstream system needs to handle correctly.
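One possible way for a downstream system to handle the "I have no idea" case is to measure how close the output is to uniform and fall back when it is too close. The sketch below uses normalized entropy for this; the function names and the 0.95 threshold are illustrative assumptions, not a recommended setting.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(n): 0 is fully confident, 1 is uniform."""
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two classes")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


def decide(probs, labels, max_entropy=0.95):
    """Return the top label, or None when the distribution is near-uniform."""
    if normalized_entropy(probs) >= max_entropy:
        return None  # caller should apply a fallback (default, popular items, ...)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


def test_uniform_output_triggers_fallback():
    assert decide([0.25, 0.25, 0.25, 0.25], ["a", "b", "c", "d"]) is None


def test_confident_output_returns_label():
    assert decide([0.9, 0.05, 0.03, 0.02], ["a", "b", "c", "d"]) == "a"
```

The edge case test then asserts on the fallback path explicitly, instead of only checking that the model did not crash.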

Smoke Tests for Model Inference
Smoke tests are the fastest way to catch catastrophic deployment failures. They answer one question: does the model load and produce a structurally valid output? This sounds basic, but model loading failures are among the most common deployment blockers: corrupted model files, missing library dependencies, incompatible serialization formats, or insufficient memory all manifest as crashes at load time or on the first prediction.
You cannot assert that a model produces the exact right answer, but you CAN assert that it produces an answer at all, in the right format, within expected bounds. A smoke test loads the model, passes a known input, and checks that the output is structurally valid.
These tests catch deployment-breaking issues: serialization corruption, missing dependencies, shape mismatches after a feature schema change. They run in seconds and should execute on every commit.
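The load, predict, and check sequence described above can be sketched as follows. To keep the example self-contained, it assumes a toy linear-softmax model stored as JSON; `write_toy_model`, `load_model`, and `smoke_test` are hypothetical names, and a real system would substitute its own serialization format and loader.

```python
import json
import math
import os
import tempfile


def write_toy_model(path):
    """Write a tiny 2-feature, 3-class linear model to `path` (illustration only)."""
    params = {"weights": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]],
              "biases": [0.0, 0.1, -0.1]}
    with open(path, "w") as f:
        json.dump(params, f)


def load_model(path):
    """Load the toy model and return a predict(features) -> probabilities function."""
    with open(path) as f:
        params = json.load(f)
    weights, biases = params["weights"], params["biases"]

    def predict(features):
        logits = [sum(w * x for w, x in zip(row, features)) + b
                  for row, b in zip(weights, biases)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    return predict


def smoke_test(path, sample, n_classes):
    """Check that the model loads and returns a structurally valid output."""
    predict = load_model(path)          # fails on corrupt or missing files
    probs = predict(sample)             # fails on shape or dependency problems
    assert len(probs) == n_classes      # right output shape
    assert all(math.isfinite(p) and 0.0 <= p <= 1.0 for p in probs)
    assert math.isclose(sum(probs), 1.0, abs_tol=1e-6)
    return probs
```

The smoke test deliberately asserts nothing about which class wins. It checks only shape, finiteness, and bounds, so it stays valid across retrainings.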
A common mistake is running smoke tests only in the training environment. The production serving environment often has different library versions, different hardware (CPU vs GPU), and different memory constraints. A model that passes smoke tests in a Jupyter notebook can fail to load in a production Docker container because a required library is missing from the container image. Run smoke tests in the exact environment where the model will serve.
Property-Based Testing
When exact outputs are unpredictable, test properties that must always hold. A multi-class classification model must produce probabilities that sum to 1, within floating-point tolerance. A regression model predicting house prices must never return negative values. A ranking model must be transitive: if A ranks above B and B ranks above C, then A must rank above C (this holds automatically when items are sorted by a single score, but not for models that compare items pairwise). These invariants should hold for ANY valid input, which makes them perfect candidates for automated testing with random inputs.
Property-based testing with libraries like Hypothesis generates hundreds of random inputs automatically. This is far more effective than hand-picked test cases for finding edge cases: extreme values, NaN propagation, integer overflow in feature indices, or inputs with unusual distributions.
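To show the idea without any third-party dependency, the sketch below hand-rolls the random input generation with Python's `random` module; a library like Hypothesis would replace the generator and loop and would also minimize failing inputs. The `softmax` function stands in for a model's output layer, and all names here are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; stands in for a classifier's output layer."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_logits(rng):
    """Generate 2-10 logits at a randomly chosen scale, including extremes."""
    n = rng.randint(2, 10)
    scale = rng.choice([1e-6, 1.0, 1e3, 1e6])
    return [rng.uniform(-scale, scale) for _ in range(n)]


def check_probability_properties(predict, trials=500, seed=0):
    """Assert the output is a valid probability vector for many random inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        logits = random_logits(rng)
        probs = predict(logits)
        # The failing input is attached to each assertion for debugging.
        assert len(probs) == len(logits), logits
        assert all(math.isfinite(p) and 0.0 <= p <= 1.0 for p in probs), logits
        assert math.isclose(sum(probs), 1.0, abs_tol=1e-9), logits
    return trials
```

Including very large scales in the generator is what exposes overflow: a softmax that calls `math.exp` on the raw logits would raise `OverflowError` on such inputs, which a handful of hand-picked small values would not reveal.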
Beyond individual property assertions, consider invariance testing: the model's prediction should not change when irrelevant features change. If a sentiment classifier gives a different prediction when you change the user's name in the text but keep everything else the same, the model has learned a spurious correlation. Invariance tests systematically perturb non-relevant input features and verify that predictions remain stable. This catches a class of fairness and robustness bugs that other test types rarely detect.
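A minimal sketch of an invariance test for the name-swapping case follows. The keyword-counting `toy_sentiment` function is a hypothetical stand-in for a real classifier, and the template and names are made up for illustration.

```python
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"terrible", "hate", "awful"}


def toy_sentiment(text):
    """Stand-in classifier: counts sentiment keywords and ignores everything else."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def invariance_failures(predict, template, names):
    """Return the names whose prediction differs from the first name's prediction.

    `template` must contain a `{name}` placeholder; only that slot is perturbed.
    """
    baseline = predict(template.format(name=names[0]))
    return [n for n in names[1:]
            if predict(template.format(name=n)) != baseline]


def test_prediction_is_invariant_to_name():
    names = ["Alice", "Bob", "Chen", "Fatima"]
    template = "{name} said the service was great."
    assert invariance_failures(toy_sentiment, template, names) == []
```

Returning the list of failing substitutions, instead of a bare boolean, makes the test report point directly at the inputs that expose the spurious correlation.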
Another powerful property test is the directional expectation test: when you change an input in a known direction, the output should move in a predictable direction, or at least not move the wrong way. Doubling a house's square footage should not decrease the price prediction. Adding a positive review to a user's history should not lower their predicted satisfaction score. Strict equivariance, where a known input transformation must produce an exactly corresponding output transformation, is the stronger form of the same idea. These directional expectations are easy to encode as property tests and catch models that violate basic domain logic.
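The square-footage example can be encoded roughly as follows. This is a sketch: `toy_price_model` is a hypothetical linear model used only so the example runs, and `directional_violations` is an illustrative helper name.

```python
def toy_price_model(sqft, bedrooms):
    """Stand-in regression model; coefficients are made up for illustration."""
    return 50_000 + 120 * sqft + 10_000 * bedrooms


def directional_violations(predict, base_inputs, feature, factor=2.0):
    """Return the inputs for which scaling `feature` up DEcreases the prediction."""
    violations = []
    for inputs in base_inputs:
        changed = dict(inputs)
        changed[feature] = inputs[feature] * factor  # perturb only one feature
        if predict(**changed) < predict(**inputs):
            violations.append(inputs)
    return violations


def test_more_square_footage_never_lowers_price():
    houses = [
        {"sqft": 800, "bedrooms": 2},
        {"sqft": 1500, "bedrooms": 3},
        {"sqft": 4000, "bedrooms": 6},
    ]
    assert directional_violations(toy_price_model, houses, "sqft") == []
```

The check uses `<` and not `<=` on purpose: the expectation in the text is that the prediction should not decrease, so an unchanged prediction is allowed.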
The ML testing pyramid has unit tests at the base because they are fast and cheap, but unlike traditional software, the top of the pyramid (model validation, integration tests) catches the most ML-specific bugs. A passing unit test suite with no model validation gives a false sense of security.