Is there any way to get the samples under each leaf of a decision tree?
Introduction
Yes, you can get the samples that end up under each leaf of a decision tree. In scikit-learn, the usual pattern is to run apply() on the training or evaluation data, which gives you the leaf node id for each sample, then group samples by that id. This is useful for model interpretation, debugging, and understanding where the tree is overfitting.
The Core Trick: Use apply()
Every sample in a scikit-learn decision tree ends at exactly one leaf. The apply() method returns the leaf index for each row.
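A minimal sketch of that call. The Iris dataset, the `max_depth=3` setting, and the variable names are assumptions chosen for illustration, not part of the original example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and hyperparameters (assumptions, not from the article).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# apply() returns one leaf node id per row of X, in the same order as X.
leaf_ids = tree.apply(X)

print(leaf_ids.shape)  # one entry per sample
print(leaf_ids[:5])    # leaf ids of the first five rows
```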
At this point, leaf_ids[i] tells you which leaf sample X[i] landed in.
Group Samples by Leaf
Once you have the leaf ids, grouping is straightforward.
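One way to do the grouping is a plain dictionary keyed by leaf id. This is a sketch under assumed data (Iris) and an assumed variable name, `samples_by_leaf`:

```python
from collections import defaultdict

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumptions, not from the article).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)

# Map each leaf id to the list of row indices that landed in it.
samples_by_leaf = defaultdict(list)
for row_index, leaf_id in enumerate(leaf_ids):
    samples_by_leaf[int(leaf_id)].append(row_index)

for leaf_id, rows in sorted(samples_by_leaf.items()):
    print(f"leaf {leaf_id}: {len(rows)} samples")
```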
This version stores row indices. That is usually better than copying the raw samples immediately because:
- indices are cheaper to store,
- you can look up labels or metadata later,
- you can reuse the same grouping for multiple arrays.
Retrieve the Actual Samples and Labels
If you want the underlying rows:
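The sketch below uses NumPy indexing to pull the feature rows and labels for each leaf. The dataset and the `leaf_data` name are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumptions, not from the article).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
leaf_ids = tree.apply(X)

# For each leaf, collect the feature rows and labels of its samples.
leaf_data = {}
for leaf_id in np.unique(leaf_ids):
    rows = np.flatnonzero(leaf_ids == leaf_id)  # row indices in this leaf
    leaf_data[int(leaf_id)] = (X[rows], y[rows])

for leaf_id, (X_leaf, y_leaf) in leaf_data.items():
    # Class counts per leaf show how pure each leaf is.
    print(leaf_id, X_leaf.shape, np.bincount(y_leaf, minlength=3))
```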
This is helpful when you want to inspect what kinds of examples the tree clustered into the same terminal decision.
Leaf Statistics from the Tree Itself
The tree object also exposes structural information. For example, tree_.n_node_samples tells you how many training samples reached each node.
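A short sketch of reading leaf counts from the fitted structure. It relies on scikit-learn marking leaves with `-1` in `children_left`; the dataset and variable names are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumptions, not from the article).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = tree.tree_
# Leaves have no children; scikit-learn stores -1 for a missing child.
leaf_nodes = np.flatnonzero(t.children_left == -1)
leaf_counts = t.n_node_samples[leaf_nodes]

for node_id, count in zip(leaf_nodes, leaf_counts):
    print(f"leaf {node_id}: {count} training samples")
```

These counts describe the data the tree was fitted on, so they should match what you get by counting `apply()` results on the same training set.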
This is useful when you only want counts, not the actual sample membership.
Use decision_path() for More Detail
If you need not only the leaf, but also the full path each sample took through the tree, use decision_path().
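The following sketch prints the decisions taken by a single sample. `decision_path()` returns a sparse indicator matrix with one row per sample and one column per node. The dataset, the choice of `sample_id = 0`, and the printing format are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumptions, not from the article).
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = tree.tree_

# Sparse CSR matrix of shape (n_samples, n_nodes); nonzero = node was visited.
node_indicator = tree.decision_path(X)

sample_id = 0  # hypothetical sample to inspect
start, end = node_indicator.indptr[sample_id], node_indicator.indptr[sample_id + 1]
path_nodes = node_indicator.indices[start:end]

for node_id in path_nodes:
    if t.children_left[node_id] == -1:
        print(f"reached leaf {node_id}")
        continue
    feature, threshold = t.feature[node_id], t.threshold[node_id]
    value = X[sample_id, feature]
    op = "<=" if value <= threshold else ">"
    print(f"node {node_id}: feature[{feature}] = {value:.2f} {op} {threshold:.2f}")
```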
This helps answer a richer question: not just “which leaf did the sample reach,” but “which sequence of decisions got it there.”
Training Data Versus New Data
Be explicit about which dataset you are analyzing.
- Using the training set shows how the fitted tree partitioned the data it learned from.
- Using validation or test data shows how new samples distribute across leaves.
That distinction matters. A leaf with many training samples but almost no validation samples may indicate a branch that is too specialized.
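To compare the two distributions side by side, something like the sketch below may help. The 70/30 split, the unrestricted tree depth, and the variable names are assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and split (assumptions, not from the article).
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Count samples per node id for each dataset separately.
n_nodes = tree.tree_.node_count
train_counts = np.bincount(tree.apply(X_train), minlength=n_nodes)
val_counts = np.bincount(tree.apply(X_val), minlength=n_nodes)

# Every leaf holds at least one training sample, so nonzero entries are leaves.
for leaf_id in np.flatnonzero(train_counts):
    print(f"leaf {leaf_id}: train={train_counts[leaf_id]}, val={val_counts[leaf_id]}")
```

Keeping the two count arrays separate avoids mixing training and validation samples in a single grouping.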
Practical Interpretation Uses
Leaf-level sample inspection is useful for:
- diagnosing overfitting,
- explaining model behavior to stakeholders,
- finding mislabeled or unusual samples,
- extracting rule-based cohorts from a tree.
For example, if one leaf contains only two training examples and predicts a class with high confidence, that is often a sign the tree is memorizing rather than generalizing.
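A rough way to flag such leaves is sketched below. The cutoff `min_leaf_size = 3` is a hypothetical threshold, and the purity check normalizes `tree_.value` so that it works whether the installed scikit-learn version stores counts or fractions there:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; an unrestricted tree so that small leaves can appear.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
t = tree.tree_

min_leaf_size = 3  # hypothetical cutoff for "suspiciously small"
leaf_nodes = np.flatnonzero(t.children_left == -1)

flagged = []
for node_id in leaf_nodes:
    n_samples = t.n_node_samples[node_id]
    class_dist = t.value[node_id, 0]             # per-class weight in this leaf
    purity = class_dist.max() / class_dist.sum()  # 1.0 means a single class
    if n_samples < min_leaf_size:
        flagged.append((int(node_id), int(n_samples), float(purity)))

for node_id, n_samples, purity in flagged:
    print(f"leaf {node_id}: {n_samples} samples, purity {purity:.2f}")
```

If many leaves are flagged, constraints such as `min_samples_leaf` or `max_depth` are the usual levers to try.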
Common Pitfalls
- Looking only at leaf sample counts and forgetting to recover the actual row indices when interpretation needs specific examples.
- Assuming the leaf ids are meaningful labels rather than internal node ids assigned by the tree structure.
- Mixing training and test samples in one grouping without tracking which dataset each sample came from.
- Ignoring decision_path() when the real question is about rule traversal rather than final leaf membership.
- Treating tiny leaf sample counts as normal when they may be evidence of overfitting.
Summary
- Use tree.apply(X) to get the leaf id for each sample.
- Group row indices by leaf id to recover the samples that end under each leaf.
- Use tree.tree_.n_node_samples when you only need counts.
- Use decision_path() when you need the full route through the tree, not just the terminal node.
- Inspect leaf membership on both training and validation data for better model interpretation.

