scikit-learn cross validation custom splits for time series data

Introduction

Time series validation is different from ordinary cross-validation because future rows must never leak into the training set. In scikit-learn, the right approach is either TimeSeriesSplit for expanding windows or a custom splitter when you need fixed horizons, gaps, or domain-specific boundaries.

Why Standard K-Fold Is Wrong For Time Series

Regular KFold assumes examples can be shuffled or partitioned without regard to order. That assumption breaks for temporal data because the model would train on observations from the future and then be evaluated on the past.

For forecasting, demand prediction, anomaly detection, and many event models, that leakage creates unrealistically good scores. The validation process must mirror real deployment: train on older data, test on newer data.
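The leakage can be made visible by comparing index ranges. The following is a minimal sketch, assuming the rows are sorted oldest to newest; the helper name count_leaky_folds is hypothetical and only illustrates the check.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Twelve observations, assumed to be sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)


def count_leaky_folds(splitter, X):
    """Count folds in which some training row is later than the earliest test row."""
    leaky = 0
    for train_idx, test_idx in splitter.split(X):
        if train_idx.max() > test_idx.min():
            leaky += 1
    return leaky


# Hypothetical helper applied to a shuffled KFold and to TimeSeriesSplit.
kfold_leaks = count_leaky_folds(KFold(n_splits=3, shuffle=True, random_state=0), X)
ts_leaks = count_leaky_folds(TimeSeriesSplit(n_splits=3), X)

print("leaky KFold folds:", kfold_leaks)
print("leaky TimeSeriesSplit folds:", ts_leaks)
```

With three folds, at most one KFold test block can sit entirely after its training rows, so at least two folds train on the future. TimeSeriesSplit reports none.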

Start With TimeSeriesSplit

Scikit-learn includes TimeSeriesSplit, which grows the training window over time while keeping each test fold later than its corresponding training fold.

python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows, assumed to be sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on every row that precedes its test block.
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}")
    print("train:", train_idx)
    print("test: ", test_idx)

This pattern is a good default when you want an expanding-window evaluation. Earlier folds train on less history, later folds train on more.
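TimeSeriesSplit also exposes a few knobs of its own. The sketch below assumes scikit-learn 0.24 or later, where the test_size and gap parameters are available alongside max_train_size. The specific values are arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twenty rows, assumed to be in chronological order.
X = np.arange(20).reshape(-1, 1)

# Illustrative settings: cap the training window at 8 rows, test on 3 rows,
# and skip 2 rows between the end of training and the start of testing.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=8, test_size=3, gap=2)

folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Capping max_train_size turns the expanding window into a rolling one once enough history has accumulated, which may be all the customization a project needs.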

Build A Custom Splitter When You Need More Control

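Real projects often need rules that go beyond the defaults of TimeSeriesSplit. Its max_train_size, test_size, and gap parameters cover some cases, but a custom splitter gives full control over window placement and step size. Common examples include: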

  • A fixed training window instead of an expanding one.
  • A gap between training and test data to avoid leakage from delayed signals.
  • A fixed forecast horizon such as the next 7 days.

Scikit-learn accepts any iterable that yields (train_indices, test_indices) pairs as the cv argument, so you can define a custom generator without writing a full splitter class.

python
import numpy as np


def rolling_time_series_split(n_samples, train_size, test_size, gap=0, step=None):
    """Yield (train_idx, test_idx) pairs for a fixed-size rolling window."""
    if step is None:
        # Default: advance by one test block so test folds do not overlap.
        step = test_size
    if step <= 0:
        raise ValueError("step must be a positive integer")

    start = 0
    # Stop once the train window, gap, and test window no longer fit.
    while start + train_size + gap + test_size <= n_samples:
        train_end = start + train_size
        test_start = train_end + gap
        test_end = test_start + test_size

        train_idx = np.arange(start, train_end)
        test_idx = np.arange(test_start, test_end)

        yield train_idx, test_idx
        start += step


for train_idx, test_idx in rolling_time_series_split(
    n_samples=20,
    train_size=8,
    test_size=4,
    gap=2,
):
    print("train:", train_idx, "test:", test_idx)

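That generator creates a rolling window with a two-step gap. With the arguments above it yields two folds: the first trains on rows 0 to 7 and tests on rows 10 to 13, the second trains on rows 4 to 11 and tests on rows 14 to 17. The gap is useful when features contain information that would not be available immediately in a live setting.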

Use The Custom Splitter In Model Evaluation

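You can pass the generator directly to functions such as cross_val_score or use it in your own training loop. A generator is exhausted after one pass, so wrap it in list(...) if you need to reuse the same splits.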

python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Assumes rolling_time_series_split from the previous example is defined.

# Toy data: a noiseless linear trend, already in chronological order.
X = np.arange(30).reshape(-1, 1)
y = X.ravel() * 2 + 3

model = LinearRegression()
scores = []

splits = rolling_time_series_split(
    n_samples=len(X),
    train_size=10,
    test_size=5,
    gap=1,
)

# Refit on each training window and score on the later test window.
for train_idx, test_idx in splits:
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    score = mean_absolute_error(y[test_idx], predictions)
    scores.append(score)

print(scores)
print(sum(scores) / len(scores))  # mean MAE across folds

The important part is not the model choice. It is the split logic. If the index ranges reflect production behavior, the evaluation will be much more trustworthy.
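The same evaluation can be handed to cross_val_score instead of a manual loop. This is a minimal sketch: rolling_splits is a compact, hypothetical variant of the rolling generator, redefined here so the snippet runs on its own, and the splits are materialized into a list so they can be reused.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def rolling_splits(n_samples, train_size, test_size, gap=0):
    """Yield (train_idx, test_idx) pairs for a fixed-size rolling window."""
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        test_start = start + train_size + gap
        yield (
            np.arange(start, start + train_size),
            np.arange(test_start, test_start + test_size),
        )
        start += test_size


# Toy data: a noiseless linear trend in chronological order.
X = np.arange(30).reshape(-1, 1)
y = X.ravel() * 2 + 3

# Materialize the generator so the same folds can be inspected and reused.
cv_splits = list(rolling_splits(len(X), train_size=10, test_size=5, gap=1))

scores = cross_val_score(
    LinearRegression(), X, y, cv=cv_splits, scoring="neg_mean_absolute_error"
)
mae_per_fold = -scores  # the scorer returns negated MAE
print(mae_per_fold)
```

Because the scorer follows the "higher is better" convention, the sign is flipped to recover the mean absolute error per fold.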

Choose The Window Strategy Deliberately

Expanding windows are useful when older history remains relevant and more data should always help. Rolling windows are better when the process drifts and very old data becomes misleading.

A fixed forecast horizon is also worth encoding explicitly. Predicting one step ahead, seven steps ahead, and thirty steps ahead are different tasks. Your splitter should match the operational question, not just the library default.
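One way to encode a horizon is through the gap. The sketch below assumes a simple setup where row t pairs the value at time t (feature) with the value at time t + horizon (target). Under that assumption, a training row is only usable if its target has been observed by the time the first test forecast is made, which works out to a gap of horizon - 1 rows. The horizon of 7 and the toy series are illustrative.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

horizon = 7  # assumed forecast horizon, in time steps
series = np.arange(60, dtype=float)  # toy series in chronological order

# Row t uses the value at time t to predict the value at time t + horizon.
X = series[:-horizon].reshape(-1, 1)
y = series[horizon:]

# A gap of horizon - 1 rows keeps unobserved targets out of training.
tscv = TimeSeriesSplit(n_splits=3, test_size=5, gap=horizon - 1)
horizon_folds = list(tscv.split(X))

for train_idx, test_idx in horizon_folds:
    last_train_target_time = train_idx.max() + horizon
    first_forecast_time = test_idx.min()
    print("last train target at", last_train_target_time,
          "| first forecast made at", first_forecast_time)
```

A different horizon changes both the target construction and the gap, which is why the two should be derived from the same variable.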

For panel data with multiple entities, consider splitting by both time and entity carefully. A valid temporal split can still leak information if the same event or aggregate feature shows up across groups in an unrealistic way.
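For panel data, one option is to split on the unique time periods and then map those periods back to rows, so every entity shares the same time boundary. The sketch below uses a hypothetical panel of three entities and a hypothetical helper panel_time_splits. It addresses only the time boundary, not leakage through aggregate features.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical panel: 3 entities, each observed at periods 0..9.
# Rows are grouped by entity, so they are not globally sorted by time.
periods = np.tile(np.arange(10), 3)
entities = np.repeat(np.array(["a", "b", "c"]), 10)


def panel_time_splits(periods, n_splits=3, gap=0):
    """Split on unique periods, then return the matching row indices."""
    unique_periods = np.unique(periods)  # sorted ascending
    tscv = TimeSeriesSplit(n_splits=n_splits, gap=gap)
    for train_p, test_p in tscv.split(unique_periods):
        train_rows = np.flatnonzero(np.isin(periods, unique_periods[train_p]))
        test_rows = np.flatnonzero(np.isin(periods, unique_periods[test_p]))
        yield train_rows, test_rows


panel_folds = list(panel_time_splits(periods))
for train_rows, test_rows in panel_folds:
    print("train periods up to", periods[train_rows].max(),
          "| test periods", np.unique(periods[test_rows]))
```

Splitting row indices directly would instead cut through the middle of one entity's history while leaving later periods of another entity in the training set.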

Common Pitfalls

The biggest mistake is using shuffled cross-validation on ordered data. That leaks future information and produces inflated scores.

Another mistake is forgetting a gap when features are derived from delayed signals, overlapping windows, or rolling statistics that would not be finalized at prediction time.

Developers also sometimes optimize hyperparameters on one temporal split and report the same split as final performance. Use a separate holdout period if the model selection process is extensive.
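A holdout period can be carved off before any tuning happens. The sketch below is one possible arrangement, not a prescription: the synthetic data, the 80/20 cutoff, the Ridge model, and the alpha grid are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic trend with noise, in chronological order.
rng = np.random.default_rng(0)
X = np.arange(200, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=200)

# Reserve the most recent 20% of rows as a final holdout period.
cutoff = int(len(X) * 0.8)
X_dev, y_dev = X[:cutoff], y[:cutoff]
X_holdout, y_holdout = X[cutoff:], y[cutoff:]

# Tune only on the development period, using temporal folds.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=4),
    scoring="neg_mean_absolute_error",
)
search.fit(X_dev, y_dev)

# Report performance once, on data the search never saw.
holdout_mae = mean_absolute_error(y_holdout, search.predict(X_holdout))
print(search.best_params_, holdout_mae)
```

The holdout score is computed a single time after the search finishes, so it is not influenced by the hyperparameter choices.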

Finally, inspect the generated indices. Many time-series bugs come from off-by-one errors in split boundaries rather than from the model itself.
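Index inspection can be automated with a small check. The helper below, check_temporal_splits, is hypothetical and not part of scikit-learn. It assumes that row indices increase with time and raises an error when a fold overlaps, runs backwards, or leaves too small a gap.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def check_temporal_splits(splits, gap=0):
    """Validate (train_idx, test_idx) pairs; return the number of folds checked."""
    n_folds = 0
    for fold, (train_idx, test_idx) in enumerate(splits, start=1):
        train_idx, test_idx = np.asarray(train_idx), np.asarray(test_idx)
        if train_idx.size == 0 or test_idx.size == 0:
            raise ValueError(f"fold {fold}: empty train or test set")
        if np.intersect1d(train_idx, test_idx).size:
            raise ValueError(f"fold {fold}: train and test indices overlap")
        # Rows strictly between the last training row and the first test row.
        separation = test_idx.min() - train_idx.max() - 1
        if separation < gap:
            raise ValueError(
                f"fold {fold}: separation {separation} is below required gap {gap}"
            )
        n_folds += 1
    return n_folds


X = np.arange(12).reshape(-1, 1)
n_checked = check_temporal_splits(TimeSeriesSplit(n_splits=3).split(X))
print("folds checked:", n_checked)
```

Running a check like this in a unit test catches boundary mistakes before they show up as suspiciously good validation scores.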

Summary

  • Time series validation must preserve chronological order.
  • TimeSeriesSplit is a solid default for expanding-window evaluation.
  • Custom generators let you add fixed windows, forecast horizons, and safety gaps.
  • The best split strategy is the one that matches real deployment timing.
  • Always inspect index boundaries to catch leakage and off-by-one mistakes early.
