Decision trees. Choosing thresholds to split objects
Introduction
When a decision tree splits a numeric feature, it does not guess a random threshold. It evaluates candidate cut points and chooses the one that produces the largest improvement in purity or loss, depending on whether the tree is solving a classification or regression problem.
Where candidate thresholds come from
For a numeric feature, the usual candidates are the midpoints between adjacent distinct values once the feature is sorted. In classification, an implementation can restrict the search further to boundaries where the class label changes, because for impurity measures such as Gini and entropy the best cut never falls between two adjacent samples that share a label.
Suppose the feature column, after sorting, contains the distinct values 2, 4, 7, and 9.
Reasonable thresholds are the midpoints:
- 3.0, between 2 and 4
- 5.5, between 4 and 7
- 8.0, between 7 and 9
The tree tests each one and measures how pure the left and right child nodes become.
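The midpoint rule can be sketched in a few lines. This is a minimal illustration rather than any library's implementation; the function name `candidate_thresholds` is an assumption, and the input reuses the values 2, 4, 7, and 9 from the example above.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct values."""
    distinct = sorted(set(values))  # duplicates cannot be separated by a cut
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


# Input order does not matter; the function sorts internally.
print(candidate_thresholds([7, 2, 9, 4]))  # [3.0, 5.5, 8.0]
```

A feature with k distinct values therefore produces at most k - 1 candidates, and a constant feature produces none.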
Classification: impurity reduction
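For classification trees, the split score is usually based on Gini impurity or entropy. Both are zero for a node that contains a single class and largest when the classes are evenly mixed. Gini impurity is one minus the sum of the squared class proportions in the node.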
A simple Gini function in Python looks like this:
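One minimal version, assuming the labels arrive as a plain Python list of hashable class values (the name `gini` and the convention for an empty node are choices made for this sketch):

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here: an empty node counts as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["a", "a", "b", "b"]))  # 0.5 (evenly mixed, two classes)
print(gini(["a", "a", "a"]))       # 0.0 (pure node)
```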
To score a threshold, split the labels into left and right groups and compute the weighted impurity:
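A sketch of that scoring step follows. It assumes samples with a value less than or equal to the threshold go left, repeats a `gini` helper so the snippet runs on its own, and uses made-up labels for the feature values 2, 4, 7, and 9. The name `weighted_gini` is illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two children created by `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


values = [2, 4, 7, 9]
labels = ["no", "no", "yes", "yes"]  # hypothetical labels for illustration

print(weighted_gini(values, labels, 5.5))  # 0.0: both children are pure
print(weighted_gini(values, labels, 3.0))  # about 0.333: right child is mixed
```

Weighting by child size keeps a tiny pure child from dominating the score.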
The best threshold is the one with the lowest weighted impurity after the split.
A complete threshold search example
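The sketch below combines candidate generation and Gini scoring into one search. It is an illustration under stated assumptions, not a library's implementation: the function names, the left-if-less-or-equal convention, and the labels attached to the values 2, 4, 7, and 9 are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best midpoint split.

    Returns (None, inf) when the feature has fewer than two distinct values.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]   # samples with value <= lo
        right = [y for _, y in pairs[i:]]  # samples with value >= hi
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = (lo + hi) / 2, score
    return best_t, best_score


values = [2, 4, 7, 9]
labels = ["no", "no", "yes", "yes"]  # hypothetical labels for illustration
print(best_threshold(values, labels))  # (5.5, 0.0)
```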
This simple example shows the core idea that tree libraries automate internally.
Regression trees use a different objective
For regression, the goal is not class purity but error reduction. Instead of Gini or entropy, regression trees often use mean squared error or variance reduction. The candidate thresholds are generated similarly, but the scoring function changes.
That is why the tree-building procedure feels similar across tasks even though the objective is different.
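As a sketch of the regression case, the search loop can stay the same while the score becomes the total squared error of the two children around their own means. Minimizing that sum is equivalent to minimizing the size-weighted MSE. The function names and the target values below are assumptions for illustration.

```python
def sse(targets):
    """Sum of squared errors of `targets` around their mean."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets)


def best_regression_threshold(values, targets):
    """Return (threshold, total_sse) for the midpoint with the lowest error."""
    pairs = sorted(zip(values, targets), key=lambda p: p[0])
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best_score:
            best_t, best_score = (lo + hi) / 2, score
    return best_t, best_score


values = [2, 4, 7, 9]
targets = [10.0, 12.0, 39.0, 41.0]  # hypothetical numeric targets
print(best_regression_threshold(values, targets))  # (5.5, 4.0)
```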
Why trees do not test every possible number
Two thresholds such as 5.500001 and 5.500002 produce the same partition if no training value falls between them. A decision tree therefore only needs to test thresholds that actually change which samples go left or right, and one candidate per gap between adjacent distinct values covers every such partition. That is why midpoint candidates are enough.
This keeps training efficient and makes the split search well-defined.
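A quick way to see this is to compare the partitions produced by several thresholds inside the same gap. The sketch assumes the left-if-less-or-equal convention and reuses the values 2, 4, 7, and 9; the helper name `partition` is illustrative.

```python
def partition(values, threshold):
    """Return the sample indices sent left (x <= threshold) and right."""
    left = [i for i, x in enumerate(values) if x <= threshold]
    right = [i for i, x in enumerate(values) if x > threshold]
    return left, right


values = [2, 4, 7, 9]

# Any threshold strictly between 4 and 7 sends the same samples each way.
same = partition(values, 5.5) == partition(values, 5.500001) == partition(values, 6.9)
print(same)                    # True
print(partition(values, 5.5))  # ([0, 1], [2, 3])
```

The midpoint is a convention: every value in the gap gives the same training partition, although different values could route unseen samples differently at prediction time.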
Common Pitfalls
The most common mistake is thinking the best threshold is chosen by hand-crafted business logic such as "split at age 30." A trained tree picks thresholds based on the objective function and the data distribution.
Another issue is forgetting that threshold choice at one node depends only on the samples that reached that node. The tree is solving a local optimization repeatedly, not a single global threshold problem.
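The node-local nature of the search can be illustrated by running the same search twice: once on all samples, and once on only the samples sent to the right child. The data and function names are hypothetical, and the helper is a compact version of a Gini-based midpoint search.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return the midpoint threshold with the lowest weighted Gini, or None."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = (lo + hi) / 2, score
    return best_t


values = [1, 2, 3, 4, 5, 6, 7]
labels = ["a", "a", "a", "b", "b", "a", "a"]  # hypothetical data

root_t = best_threshold(values, labels)

# The right child sees only the samples routed to it by the root split.
right = [(x, y) for x, y in zip(values, labels) if x > root_t]
child_t = best_threshold([x for x, _ in right], [y for _, y in right])

print(root_t, child_t)  # 3.5 5.5
```

The second threshold is chosen without looking at the samples that went left, which is what makes the procedure greedy.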
Be careful with overfitting too. If the tree is allowed to keep splitting until every tiny impurity improvement is exploited, it may fit noise rather than signal.
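Stopping rules can be added directly to the threshold search. The sketch below is one possible form, with hypothetical parameter names `min_samples_leaf` and `min_decrease`: a split is rejected if either child would be too small or if the impurity drop is not larger than the required minimum. The data is made up, with a single odd label at the end standing in for noise.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def constrained_split(values, labels, min_samples_leaf=1, min_decrease=0.0):
    """Best midpoint threshold, or None if no split passes the stopping rules."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    leaf = max(1, min_samples_leaf)
    parent = gini([y for _, y in pairs])
    # A candidate must beat the parent impurity by more than `min_decrease`.
    best_t, best_score = None, parent - min_decrease
    for i in range(leaf, n - leaf + 1):  # both children keep >= `leaf` samples
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_t, best_score = (lo + hi) / 2, score
    return best_t


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "a", "a", "b"]  # hypothetical: one possibly noisy label

print(constrained_split(values, labels))                      # 5.5
print(constrained_split(values, labels, min_samples_leaf=2))  # 4.5
print(constrained_split(values, labels, min_samples_leaf=2,
                        min_decrease=0.2))                    # None
```

Without constraints the search isolates the single odd sample. With both rules active it declines to split at all.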
Finally, remember that tree thresholds are data-dependent. Small changes in the dataset can produce different split points. This instability is one reason ensembles such as random forests and boosting, which combine many trees, often outperform a single tree.
Summary
- Numeric decision tree splits use candidate thresholds derived from sorted feature values.
- In classification, the best threshold minimizes weighted impurity such as Gini or entropy.
- In regression, the same search pattern is used with an error-based criterion such as MSE.
- Midpoints between adjacent values are enough because only changed partitions matter.
- Threshold selection is local to each node and can overfit without proper stopping or pruning.

