How to explore a decision tree built using scikit-learn

Introduction

Decision trees are popular because they are fast to train and easy to interpret compared with many other models. After training, exploring the fitted tree helps you verify the split logic, detect overfitting, and communicate model behavior to stakeholders. scikit-learn provides several tools to inspect tree structure, feature importances, and prediction paths.

Train A Baseline Tree

Start with a simple classifier on a known dataset.

python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))

Use a fixed seed for reproducible exploration.

Visualize The Tree Structure

plot_tree gives quick structural insight into splits and leaf purity.

python

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

iris = load_iris()  # load once and reuse the feature and class names

plt.figure(figsize=(12, 8))
plot_tree(
    model,
    filled=True,
    rounded=True,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),  # pass class labels as a plain list
)
plt.tight_layout()
plt.show()

Node labels show feature, threshold, class counts, and impurity, which helps explain model decisions.
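The same information that appears in the node labels can also be read programmatically from the fitted estimator's tree_ attribute. The following is a minimal sketch rather than a definitive recipe: it refits a small tree on the full iris data so that it runs on its own, and the variable names (clf, tree, label) are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Refit a small tree here so the snippet is self-contained.
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)
tree = clf.tree_

print("nodes:", tree.node_count, "depth:", tree.max_depth)
for node_id in range(tree.node_count):
    # Leaves have no children: both child pointers hold the same sentinel (-1).
    is_leaf = tree.children_left[node_id] == tree.children_right[node_id]
    if is_leaf:
        label = "leaf"
    else:
        name = iris.feature_names[tree.feature[node_id]]
        label = f"{name} <= {tree.threshold[node_id]:.2f}"
    print(
        f"node {node_id}: {label}, "
        f"samples={tree.n_node_samples[node_id]}, "
        f"impurity={tree.impurity[node_id]:.3f}"
    )
```

Reading the arrays directly is handy when you want to filter nodes, for example to list only leaves with few samples.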

Export Human Readable Rules

For compact inspection, export text rules.

python

from sklearn.tree import export_text

rules = export_text(model, feature_names=load_iris().feature_names)
print(rules)

This is useful for code reviews and model governance documents where image plots are inconvenient.

Inspect Feature Importance

Tree models expose impurity-based feature importances through the feature_importances_ attribute.

python

for name, importance in zip(load_iris().feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")

Treat these values as directional, not absolute truth. Correlated features can distort interpretation.

Trace Prediction Path For One Sample

To understand why a single prediction occurred, inspect the decision path.

python

sample = X_test[0:1]
node_indicator = model.decision_path(sample)
leaf_id = model.apply(sample)

print("leaf id:", leaf_id[0])
print("visited nodes:", node_indicator.indices)

Combine node ids with tree thresholds from model.tree_ for detailed explanation.
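One way to combine the visited node IDs with the thresholds is sketched below. It assumes the same iris split and tree settings as the baseline, refits the model so the snippet runs on its own, and uses illustrative names (clf, visited, steps). Each line of output states which comparison sent the sample left or right.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

sample = X_test[0:1]
visited = clf.decision_path(sample).indices  # node IDs for this single sample
leaf_id = clf.apply(sample)[0]
tree = clf.tree_

steps = []
for node_id in visited:
    if node_id == leaf_id:
        continue  # the leaf holds no split to explain
    feat = tree.feature[node_id]
    thr = tree.threshold[node_id]
    value = sample[0, feat]
    op = "<=" if value <= thr else ">"
    steps.append(
        f"node {node_id}: {iris.feature_names[feat]} = {value:.2f} {op} {thr:.2f}"
    )

print("\n".join(steps))
print("leaf", leaf_id, "-> predicted class:", iris.target_names[clf.predict(sample)[0]])
```

For several samples at once, decision_path returns one row per sample, so the indices would need to be sliced per row using the matrix's indptr array.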

Control Complexity And Overfitting

Unconstrained trees can memorize training data. Use depth and sample constraints.

python

pruned = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=3,
    random_state=42
)
pruned.fit(X_train, y_train)

Compare train and test scores to detect overfitting. If training accuracy is near perfect while test accuracy is noticeably lower, tighten the constraints.
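A quick way to make that comparison is to fit an unconstrained tree next to a constrained one and print both scores. This is a sketch under the same iris split as the baseline; the dictionary keys and variable names are illustrative. On a dataset as small and clean as iris the gap may be modest, so the pattern matters more than the exact numbers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

candidates = {
    "unconstrained": DecisionTreeClassifier(random_state=42),
    "constrained": DecisionTreeClassifier(
        max_depth=4, min_samples_leaf=3, random_state=42
    ),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)  # accuracy on data the tree has seen
    test_acc = clf.score(X_test, y_test)     # accuracy on held-out data
    scores[name] = (train_acc, test_acc)
    print(
        f"{name}: train={train_acc:.3f} test={test_acc:.3f} "
        f"gap={train_acc - test_acc:.3f}"
    )
```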

Cost complexity pruning is another effective method.

python

path = model.cost_complexity_pruning_path(X_train, y_train)
print("alphas:", path.ccp_alphas[:5])

Then evaluate models over alpha values and select based on validation performance.
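The selection step might look like the sketch below. It makes a few assumptions not stated above: the pruning path is computed from an unconstrained tree, 5-fold cross-validation on the training split stands in for "validation performance", and the largest alpha (which prunes the tree down to a single node) is dropped. Alphas are clipped at zero as a guard against tiny negative values from floating-point rounding.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

base = DecisionTreeClassifier(random_state=42)
path = base.cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (single-node tree) and guard against rounding noise.
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

mean_scores = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    mean_scores.append(cross_val_score(clf, X_train, y_train, cv=5).mean())

best_alpha = alphas[int(np.argmax(mean_scores))]
final = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
final.fit(X_train, y_train)

print("best alpha:", best_alpha)
print("leaves:", final.get_n_leaves(), "test accuracy:", final.score(X_test, y_test))
```

The test split is only touched once, after the alpha has been chosen, so the reported test accuracy is not used for selection.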

Build A Repeatable Exploration Notebook

A strong workflow includes:

Fixed seed and dataset split.
Baseline metrics.
Structural plot and textual rules.
Complexity tuning experiments.
Final model report with key paths and feature signals.

This makes interpretation repeatable and easier for data and product teams to review.

Compare Trees Across Hyperparameters

Exploration is stronger when you compare multiple tree configurations side by side. Vary depth, minimum samples per leaf, and split criterion, then track both metrics and interpretability.

A very deep tree may improve training accuracy but produce unreadable rules and unstable behavior. A slightly shallower tree can be easier to explain with minimal accuracy loss.

Store experiment metadata with model parameters and generated rule text. This creates an audit trail for later review and helps teams justify why one tree was selected over another.
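As an illustration of such a comparison, the sketch below loops over a small grid and keeps the parameters, scores, tree size, and exported rule text together in one record per configuration. The grid values and the record layout are assumptions made for the example, not recommended settings.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)

records = []
for max_depth, min_leaf, criterion in product([2, 3, None], [1, 5], ["gini", "entropy"]):
    params = {
        "max_depth": max_depth,
        "min_samples_leaf": min_leaf,
        "criterion": criterion,
    }
    clf = DecisionTreeClassifier(random_state=42, **params).fit(X_train, y_train)
    records.append({
        "params": params,
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
        "n_leaves": clf.get_n_leaves(),  # rough proxy for readability
        "depth": clf.get_depth(),
        "rules": export_text(clf, feature_names=iris.feature_names),
    })

# Best test accuracy first; among ties, prefer the smaller tree.
for r in sorted(records, key=lambda r: (-r["test_acc"], r["n_leaves"])):
    print(
        r["params"],
        f"train={r['train_acc']:.3f} test={r['test_acc']:.3f} leaves={r['n_leaves']}",
    )
```

The records list can be written to JSON or a table so that the rule text of each candidate stays attached to the parameters that produced it.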

Regression Tree Variant

For regression tasks, use DecisionTreeRegressor and inspect metrics such as mean absolute error. Interpretation principles stay similar, but leaf outputs represent continuous values rather than class probabilities.
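A minimal regression sketch follows. The diabetes dataset bundled with scikit-learn is an assumption chosen for convenience, as are the depth limit and the split. The exported rules end in a numeric value per leaf, which is the prediction for every sample that lands there.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

reg = DecisionTreeRegressor(max_depth=3, random_state=42)
reg.fit(X_train, y_train)

mae = mean_absolute_error(y_test, reg.predict(X_test))
print("MAE:", round(mae, 2))

# Each leaf line shows the continuous value predicted for that region.
print(export_text(reg, feature_names=data.feature_names))
```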

Common Pitfalls

Interpreting impurity importance as causal evidence.
Ignoring overfitting signals in deep unrestricted trees.
Relying on a single random split and overtrusting the conclusions.
Presenting tree visuals without class distribution context.
Forgetting to document preprocessing applied before training.
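To guard against the single-split pitfall in particular, accuracy can be estimated over many splits and reported with its spread. The sketch below assumes repeated stratified 5-fold cross-validation on iris; the number of repeats is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 folds repeated 10 times gives 50 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

fold_scores = cross_val_score(clf, X, y, cv=cv)
print(
    f"accuracy: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f} "
    f"min={fold_scores.min():.3f} max={fold_scores.max():.3f}"
)
```

A wide range between the minimum and maximum is a sign that conclusions drawn from any one split should be treated with caution.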

Summary

Use plotting and rule export to inspect decision logic.
Evaluate feature importance carefully with domain context.
Trace single sample paths for explainability.
Control tree complexity with depth and pruning techniques.
Keep exploration reproducible with fixed seeds and documented workflow.