ML Testing Strategies

Topics Covered

Unit Tests for ML

What You Can Unit Test

Input Validation and Edge Cases

Smoke Tests for Model Inference

Property-Based Testing

Data Validation and Testing

Schema Validation

Distribution Drift Detection

Range and Consistency Validation

Feature Importance Stability

Data Freshness

Model Validation

Minimum Metric Thresholds

Regression Detection

Slice-Based Evaluation

Bias and Fairness Checks

Shadow Testing

Integration and End-to-End Tests

End-to-End Pipeline Tests

Prediction Range and Sanity Tests

Latency and Throughput Testing

Stress Testing and Breaking Points

Infrastructure Validation

Model Version and Dependency Verification

Traditional software testing assumes deterministic behavior: given the same input, the function always returns the same output. ML weakens this assumption wherever learned parameters are involved. A model retrained on slightly different data produces different weights, different predictions, and different edge-case behavior. The same model architecture trained twice with different random seeds can disagree on 5-15% of predictions. As a result, tests that assert exact equality on model predictions break on every retraining and say little about whether the new model is actually worse.

This non-determinism creates a real organizational problem: teams that apply traditional testing mindsets to ML code either write brittle tests that break on every retraining (and get ignored), or write no tests at all (and ship bugs). Neither approach works. The solution requires understanding which parts of an ML system ARE deterministic and which are not, then applying different testing strategies to each.

The good news: a large fraction of ML code is NOT the model itself. Data preprocessing, feature engineering, input validation, and output formatting are all deterministic functions that follow standard software testing rules. The testing strategy is to draw a sharp boundary between "code that is deterministic" and "code that involves learned parameters," then apply different testing techniques to each. In a typical ML codebase, 60-80% of the code is deterministic infrastructure that can be tested with standard techniques. The remaining 20-40% requires the specialized approaches covered in this lesson.

What You Can Unit Test

Data preprocessing functions are the highest-value unit test target in any ML codebase. A function that normalizes pixel values to [0, 1], tokenizes text, or one-hot encodes categorical variables should produce identical output for identical input every time. These functions are also where silent bugs cause the most damage: a preprocessing bug that shifts all features by 0.01 will not crash anything, but it can degrade model accuracy by 3-8% in ways that are extremely difficult to debug.

python
import numpy as np

# normalize_pixels, tokenize, and detokenize are the project's own functions under test.

def test_normalize_pixel_values():
    raw = np.array([0, 128, 255], dtype=np.uint8)
    result = normalize_pixels(raw)
    # 128 / 255 = 0.50196...
    np.testing.assert_allclose(result, [0.0, 0.50196, 1.0], atol=1e-4)
    assert result.dtype == np.float32

def test_tokenizer_handles_unicode():
    text = "café résumé naïve"
    tokens = tokenize(text)
    assert all(isinstance(t, int) for t in tokens)
    assert len(tokens) > 0
    # Roundtrip: decode should recover the original (assumes a lossless tokenizer)
    assert detokenize(tokens) == text

Feature transformation tests are equally important. If your pipeline computes log transforms, interaction features, or rolling averages, every one of these functions can be tested with exact expected outputs. The effort pays off during model retraining: when preprocessing bugs are caught immediately, you never waste a 6-hour training run on corrupted features.
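The feature transformation tests described above can be sketched as follows. This is a minimal illustration in plain Python, not the lesson's own pipeline: `log1p_transform` and `rolling_mean` are hypothetical helpers invented for the example, and a real codebase would more likely implement them with NumPy or pandas.

```python
import math


def log1p_transform(values):
    """Hypothetical feature transform: log(1 + x) for non-negative inputs."""
    if any(v < 0 for v in values):
        raise ValueError("log1p_transform expects non-negative values")
    return [math.log1p(v) for v in values]


def rolling_mean(values, window):
    """Hypothetical trailing rolling mean; positions without enough history are None."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # fewer than `window` values seen so far
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


def test_log1p_transform_known_values():
    result = log1p_transform([0.0, math.e - 1.0])
    # log(1 + 0) = 0 and log(1 + (e - 1)) = 1, compared within a float tolerance
    assert math.isclose(result[0], 0.0, abs_tol=1e-12)
    assert math.isclose(result[1], 1.0, rel_tol=1e-12)


def test_rolling_mean_known_values():
    assert rolling_mean([1.0, 2.0, 3.0, 4.0], window=2) == [None, 1.5, 2.5, 3.5]
```

Floating-point transforms are best compared within a small tolerance, while values that are exactly representable (like the rolling means here) can be compared with plain equality.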

Input Validation and Edge Cases

Every ML system has an implicit contract about what inputs are valid. A recommendation model expects user IDs that exist in the embedding table, product IDs that map to known categories, and timestamps that are not from the future. Input validation tests verify that the system enforces this contract explicitly rather than failing silently.

python
from datetime import datetime, timedelta

import pytest

# `model` is assumed to be the loaded model under test (for example, a pytest fixture).

def test_rejects_unknown_user_id():
    with pytest.raises(ValueError, match="Unknown user"):
        model.predict(user_id="nonexistent_user_12345")

def test_rejects_future_timestamp():
    future = datetime.now() + timedelta(days=365)
    with pytest.raises(ValueError, match="future timestamp"):
        model.predict(user_id="user_1", timestamp=future)

def test_handles_missing_optional_features():
    # Optional features should use defaults, not crash
    result = model.predict(user_id="user_1", location=None)
    assert result is not None
    assert result.shape == (1, 10)

Edge case testing for ML goes beyond what traditional software considers edge cases. Consider an empty string input to a text classifier, a single-pixel image to a vision model, or a user with zero historical interactions fed to a recommendation system. These are all valid production scenarios that many models handle poorly. The model might not crash, but it might produce a uniform probability distribution (effectively saying "I have no idea"), which is a valid output that your downstream system needs to handle correctly.
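One way a downstream system might handle a near-uniform output is to detect it and fall back to a default. The sketch below is an assumption-laden illustration: `normalized_entropy`, `choose_label`, the 0.95 threshold, and the "unknown" fallback are all hypothetical choices made for this example, not part of any particular library.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by its maximum, log(n); about 1.0 means uniform."""
    n = len(probs)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


def choose_label(probs, labels, max_entropy=0.95, fallback="unknown"):
    """Return the top label, or the fallback when the output is near-uniform."""
    if normalized_entropy(probs) >= max_entropy:
        return fallback
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


def test_uniform_output_triggers_fallback():
    labels = ["neg", "neu", "pos"]
    assert choose_label([1 / 3, 1 / 3, 1 / 3], labels) == "unknown"


def test_confident_output_returns_top_label():
    labels = ["neg", "neu", "pos"]
    assert choose_label([0.05, 0.05, 0.90], labels) == "pos"
```

The threshold is a product decision rather than a universal constant; the point of the test is that the "I have no idea" case has an explicit, verified code path.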

Figure: ML testing pyramid, with unit tests at the base, data validation in the middle, model validation above, and integration tests at the top.

Smoke Tests for Model Inference

Smoke tests are the fastest way to catch catastrophic deployment failures. They answer one question: does the model load and produce a structurally valid output? This sounds basic, but model loading failures are among the most common deployment blockers: corrupted model files, missing library dependencies, incompatible serialization formats, and insufficient memory all manifest as crashes at load time or on the first prediction.

You cannot assert that a model produces the exact right answer, but you CAN assert that it produces an answer at all, in the right format, within expected bounds. A smoke test loads the model, passes a known input, and checks that the output is structurally valid.

python
import numpy as np
import pytest

# load_model and create_sample_input are assumed to be project helpers.

def test_model_loads_and_predicts():
    model = load_model("models/v2.3.pkl")
    sample = create_sample_input()
    prediction = model.predict(sample)
    # Output exists and has correct shape
    assert prediction.shape == (1, 10)  # 10-class classifier
    # Probabilities are valid
    assert np.all(prediction >= 0)
    assert np.all(prediction <= 1)
    assert abs(prediction.sum() - 1.0) < 1e-5

def test_model_handles_empty_input():
    model = load_model("models/v2.3.pkl")
    with pytest.raises(ValueError, match="empty input"):
        model.predict(np.array([]))

These tests catch deployment-breaking issues: serialization corruption, missing dependencies, shape mismatches after a feature schema change. They run in seconds and should execute on every commit.

A common mistake is running smoke tests only in the training environment. The production serving environment often has different library versions, different hardware (CPU vs GPU), and different memory constraints. A model that passes smoke tests in a Jupyter notebook can fail to load in a production Docker container because a required library is missing from the container image. Run smoke tests in the exact environment where the model will serve.

Property-Based Testing

When exact outputs are unpredictable, test properties that must always hold. A classification model must produce probabilities that sum to 1. A regression model predicting house prices must never return negative values. A ranking model must be transitive: if A ranks above B and B ranks above C, then A must rank above C. These invariants should hold for ANY valid input, which makes them perfect candidates for automated testing with random inputs.

python
from hypothesis import given, strategies as st

# `model` is assumed to be a loaded classifier that takes 5 numeric features.
# Bounded floats exclude NaN and infinity; test those inputs separately.

@given(st.lists(st.floats(min_value=-1e6, max_value=1e6), min_size=5, max_size=5))
def test_output_probabilities_sum_to_one(features):
    prediction = model.predict([features])
    assert prediction.min() >= 0, "Negative probability"
    assert abs(prediction.sum() - 1.0) < 1e-5, "Probabilities do not sum to 1"

Property-based testing with libraries like Hypothesis generates hundreds of random inputs automatically. This is far more effective than hand-picked test cases for finding edge cases: extreme values, NaN propagation, integer overflow in feature indices, or inputs with unusual distributions.

Beyond individual property assertions, consider invariance testing: the model's prediction should not change when irrelevant features change. If a sentiment classifier gives a different prediction when you change the user's name in the text but keep everything else the same, the model has learned a spurious correlation. Invariance tests systematically perturb non-relevant input features and verify that predictions remain stable. This catches a class of fairness and robustness bugs that no other test type detects.
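An invariance test for the name-swap case might look like the sketch below. Everything here is illustrative: `toy_sentiment_score` is a keyword-counting stand-in for a real classifier, and `max_prediction_shift`, the name list, and the 0.01 tolerance are hypothetical choices for this example.

```python
def toy_sentiment_score(text):
    """Stand-in for a real classifier: share of positive sentiment words, in [0, 1]."""
    positive = {"great", "love", "excellent"}
    negative = {"terrible", "hate", "awful"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    if pos + neg == 0:
        return 0.5  # no sentiment words found: neutral
    return pos / (pos + neg)


def max_prediction_shift(predict, template, names):
    """Largest gap between scores when only the name slot changes."""
    scores = [predict(template.format(name=name)) for name in names]
    return max(scores) - min(scores)


NAMES = ["Alice", "Bob", "Chen", "Fatima", "Oluwaseun"]


def test_sentiment_is_invariant_to_names():
    template = "{name} said the service was great but the food was terrible."
    shift = max_prediction_shift(toy_sentiment_score, template, NAMES)
    assert shift <= 0.01, f"Prediction moved by {shift:.3f} when only the name changed"
```

To use this against a real model, pass its scoring function as `predict`. A small tolerance rather than exact equality leaves room for numerical noise while still flagging a model that keys on the name.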

Another powerful property test is the directional expectation test, a monotonicity check closely related to equivariance: when you apply a known transformation to the input, the output should move in a predictable direction. Doubling a house's square footage should not decrease the price prediction. Adding a positive review to a user's history should not lower their predicted satisfaction score. These directional expectations are easy to encode as property tests and catch models that violate basic domain logic.
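The directional expectations above can be encoded roughly as follows. This is a sketch under stated assumptions: `toy_price_model` is a hand-written linear stand-in for a trained regressor, and `find_monotonicity_violations`, the input ranges, and the trial count are hypothetical. A fixed seed keeps the randomized check reproducible.

```python
import random


def toy_price_model(sqft, bedrooms):
    """Stand-in for a trained regressor; replace with the real predict call."""
    return 50_000.0 + 120.0 * sqft + 10_000.0 * bedrooms


def find_monotonicity_violations(predict, n_trials=200, seed=0):
    """Return inputs where doubling square footage lowered the prediction."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    violations = []
    for _ in range(n_trials):
        sqft = rng.uniform(300.0, 5_000.0)
        bedrooms = rng.randint(1, 6)
        base = predict(sqft, bedrooms)
        doubled = predict(2.0 * sqft, bedrooms)
        if doubled < base:
            violations.append((sqft, bedrooms, base, doubled))
    return violations


def test_price_does_not_drop_when_sqft_doubles():
    violations = find_monotonicity_violations(toy_price_model)
    assert not violations, f"{len(violations)} violations, first: {violations[0]}"
```

Returning the violating inputs, rather than a bare pass or fail, gives a concrete counterexample to debug when a real model breaks the expectation.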

Key Insight

The ML testing pyramid has unit tests at the base because they are fast and cheap, but unlike traditional software, the top of the pyramid (model validation, integration tests) catches the most ML-specific bugs. A passing unit test suite with no model validation is a false sense of security.