Cross-Validation for Decision Trees in sklearn

Introduction

Decision trees can fit training data extremely well, which is exactly why they need careful evaluation. In scikit-learn, cross-validation is the standard way to estimate how well a tree generalizes beyond one lucky train-test split.

Why Cross-Validation Matters for Trees

A decision tree can overfit very quickly if you let it grow without constraints. If you evaluate it on only one split of the data, the result may look better or worse than reality depending on how that split happened.

Cross-validation reduces that luck factor by repeating the train-evaluate cycle on several folds. Each fold serves once as validation data while the model trains on the remaining folds, and scikit-learn returns one score per fold.
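To make the train-evaluate cycle concrete, here is a minimal sketch of the loop written out by hand. The iris dataset, the five folds, and max_depth=3 are illustrative assumptions, and the helper functions in scikit-learn do more than this sketch (for example, scorer handling and parallelism).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
cv = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, val_idx in cv.split(X, y):
    # Fit a fresh, unfitted copy so no state carries over between folds
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # Score only on the held-out fold (accuracy for classifiers)
    fold_scores.append(fold_model.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print("Fold scores:", fold_scores)
print("Mean score:", fold_scores.mean())
```

Every sample lands in the validation set exactly once, so no single split decides the final number.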

A Basic cross_val_score Example

The most direct starting point is cross_val_score with a DecisionTreeClassifier.

python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
scores = cross_val_score(model, X, y, cv=5)

print("Fold scores:", scores)
print("Mean score:", scores.mean())

This trains five different trees, each time holding out a different fold for validation. The mean score is usually a better summary than a single split.

Choosing the Right Splitter

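When cv is an integer and the estimator is a classifier with binary or multiclass targets, scikit-learn uses StratifiedKFold; in other cases it falls back to plain KFold. Stratification keeps class proportions roughly equal across folds. Neither default shuffles the data before splitting.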

You can also choose the splitter explicitly:

python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)
print(scores.mean())

Shuffling with a fixed random_state is useful when you want reproducible results.
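If you want to confirm that stratification is doing its job, you can count the classes in each validation fold. This is a small sketch that assumes the iris dataset, which has 50 samples in each of its three classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Count how many samples of each class ended up in this validation fold
    counts = np.bincount(y[val_idx], minlength=3)
    fold_counts.append(counts)
    print(f"fold {fold}: validation class counts = {counts}")
```

With perfectly balanced classes, each fold receives the same number of samples per class. On imbalanced data the counts differ between classes but stay close to the overall proportions.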

Cross-Validation Helps Tune Tree Complexity

Trees are sensitive to hyperparameters such as these:

  • max_depth
  • min_samples_split
  • min_samples_leaf
  • ccp_alpha (for pruning)

Cross-validation lets you compare settings based on out-of-fold performance instead of training accuracy.

python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for depth in [1, 2, 3, 4, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"max_depth={depth!r}, mean={scores.mean():.3f}")

This gives you a practical way to see whether a deeper tree is helping or just fitting noise.
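When several hyperparameters interact, a manual loop gets unwieldy. One option is GridSearchCV, which runs cross-validation for every combination in a grid. The sketch below assumes a small, purely illustrative grid; sensible ranges depend on your data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid (an assumption, not a recommendation)
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.01, 0.05],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation for every parameter combination
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV score: {search.best_score_:.3f}")
```

By default, the search refits the best combination on the full dataset, so search can be used directly for prediction afterwards.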

Use the Right Metric

Accuracy is fine for balanced classification problems, but it is not always the right metric. If classes are imbalanced, consider metrics such as F1, precision, recall, or ROC AUC.

Example with an explicit metric:

python
# Reuses model, X, and y from the earlier examples
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
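To see how metrics can disagree on imbalanced data, you can request several of them at once with cross_validate. This sketch assumes a synthetic dataset from make_classification with roughly a 90/10 class split; the numbers it prints are specific to that made-up data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (illustrative assumption)
X_imb, y_imb = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=0
)

imb_model = DecisionTreeClassifier(max_depth=3, random_state=42)
metrics = ["accuracy", "f1_macro", "recall"]
results = cross_validate(imb_model, X_imb, y_imb, cv=5, scoring=metrics)

for name in metrics:
    # cross_validate stores each metric under the key "test_<name>"
    print(f"{name}: {results[f'test_{name}'].mean():.3f}")
```

Accuracy can stay high on data like this even when recall on the minority class is weak, which is why the choice of metric matters.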

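For regression trees, use DecisionTreeRegressor and choose a regression metric such as negative mean squared error (scoring="neg_mean_squared_error") or R² (scoring="r2"). scikit-learn negates error metrics so that a higher score is always better.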
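A regression version might look like the following sketch. The diabetes dataset and max_depth=3 are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=42)

# Error metrics are negated, so flip the sign to recover the MSE
neg_mse = cross_val_score(reg, X_reg, y_reg, cv=5, scoring="neg_mean_squared_error")
r2 = cross_val_score(reg, X_reg, y_reg, cv=5, scoring="r2")

print(f"Mean MSE: {-neg_mse.mean():.1f}")
print(f"Mean R^2: {r2.mean():.3f}")
```

Because the target is continuous, an integer cv here uses plain KFold rather than stratified folds.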

Keep Preprocessing Inside the CV Loop

If your model needs preprocessing, put it into a pipeline. Otherwise you risk data leakage by transforming the full dataset before cross-validation.

Decision trees usually do not need feature scaling, but pipelines still matter when you have imputation, encoding, or feature selection.
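As a sketch of the pipeline approach, the example below imputes missing values inside each fold. The missing values are injected artificially into iris purely for illustration, and median imputation is an assumed choice, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumption for illustration: blank out about 10% of values to simulate missing data
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),  # medians are learned from each training fold only
    DecisionTreeClassifier(max_depth=3, random_state=42),
)

scores = cross_val_score(pipeline, X_missing, y, cv=5)
print("Fold scores:", scores)
print("Mean score:", scores.mean())
```

Because the imputer is fitted inside each training fold, the validation fold never influences the statistics used to fill in its own missing values.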

Common Pitfalls

A common mistake is reporting only training accuracy. Decision trees can memorize the training set, so that number often says little about generalization.
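One way to see this gap is to ask cross_validate for training scores as well. This is a minimal sketch, assuming the iris dataset and an unconstrained tree as illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No depth or leaf constraints, so the tree is free to memorize each training fold
unconstrained = DecisionTreeClassifier(random_state=42)
results = cross_validate(unconstrained, X, y, cv=5, return_train_score=True)

train_mean = results["train_score"].mean()
val_mean = results["test_score"].mean()
print(f"Mean training accuracy:   {train_mean:.3f}")
print(f"Mean validation accuracy: {val_mean:.3f}")
```

A large difference between the two numbers is a sign that the tree is overfitting.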

Another mistake is tuning hyperparameters on the same cross-validation result you later report as final performance. That can make the estimate too optimistic. If model selection is substantial, use a separate test set or nested cross-validation.
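Nested cross-validation can be sketched by passing a GridSearchCV object to cross_val_score: the inner loop picks hyperparameters, and the outer loop scores the whole tuning procedure on data it never saw. The fold counts, seeds, and grid below are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # model selection
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [1, 2, 3, 4, None]},
    cv=inner_cv,
)

# Each outer fold reruns the full grid search on its own training portion
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Outer fold scores:", nested_scores)
print("Nested CV mean:", nested_scores.mean())
```

The nested mean estimates how well the tuning procedure generalizes, rather than how well one particular hyperparameter setting scored on folds it was selected with.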

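A third issue is forgetting reproducibility. scikit-learn permutes features randomly at each split, so when several candidate splits are equally good the chosen one can differ between runs. Set random_state on the tree, and on any shuffling splitter, when you want comparable results.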

Summary

  • Cross-validation gives a more reliable estimate than one train-test split
  • Decision trees often benefit from cross-validated tuning of depth and leaf constraints
  • cross_val_score is the quickest way to evaluate a scikit-learn tree
  • Use metrics that match the problem, not only default accuracy
  • Keep preprocessing inside the cross-validation pipeline to avoid leakage
