Decision trees. Choosing thresholds to split objects

Introduction

When a decision tree splits a numeric feature, it does not guess a random threshold. It evaluates candidate cut points and chooses the one that produces the largest improvement in purity or loss, depending on whether the tree is solving a classification or regression problem.

Where candidate thresholds come from

For a numeric feature, the usual candidates are the midpoints between adjacent distinct values after sorting. In classification, some implementations narrow this further to the boundaries where the class label changes, because for impurity measures such as Gini and entropy a cut between two neighbours with the same label cannot beat the best cut at a label boundary.

Suppose a feature column and labels look like this after sorting:

text

feature:  2   4   7   9
label:    A   A   B   B

Reasonable thresholds are the midpoints:

3.0, between 2 and 4
5.5, between 4 and 7
8.0, between 7 and 9

The tree tests each one and measures how pure the left and right child nodes become.
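
As a minimal sketch of the candidate-generation step described above, the helper below (the name `candidate_thresholds` is hypothetical, not taken from any library) sorts the distinct feature values and returns the midpoint of each adjacent pair:

```python
def candidate_thresholds(values):
    """Return midpoints between adjacent distinct values of a numeric feature."""
    distinct = sorted(set(values))
    # Pair each distinct value with its successor and take the midpoint.
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]


print(candidate_thresholds([2, 4, 7, 9]))  # [3.0, 5.5, 8.0]
```

Duplicated values collapse into one entry, so a feature with a single distinct value yields no candidates at all.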

Classification: impurity reduction

For classification trees, the split score is usually based on Gini impurity or entropy.

A simple Gini function in Python looks like this:

python

from collections import Counter


def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

To score a threshold, split the labels into left and right groups and compute the weighted impurity:

python

def weighted_gini(left_labels, right_labels):
    # Weight each child's impurity by its share of the samples.
    total = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / total * gini(left_labels) +
        len(right_labels) / total * gini(right_labels)
    )

The best threshold is the one with the lowest weighted impurity after the split.
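
The same scoring pattern works with entropy in place of Gini. The sketch below is an illustration only: the names `entropy` and `weighted_entropy` are assumptions chosen to mirror the Gini helpers, and the label lists are the children produced by two of the example thresholds.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    # p * log2(1 / p) summed over classes; a pure node gives 0.0.
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


def weighted_entropy(left_labels, right_labels):
    """Entropy of the two children, weighted by their share of the samples."""
    total = len(left_labels) + len(right_labels)
    return (
        len(left_labels) / total * entropy(left_labels)
        + len(right_labels) / total * entropy(right_labels)
    )


# Threshold 5.5 separates the classes, so both children are pure.
print(weighted_entropy(['A', 'A'], ['B', 'B']))  # 0.0
# Threshold 3.0 leaves a mixed right child, so the score is higher (about 0.689).
print(weighted_entropy(['A'], ['A', 'B', 'B']))
```

As with Gini, the lower score wins, so both criteria pick 5.5 on this toy data.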

A complete threshold search example

python

def best_threshold(values, labels):
    # Sort samples by feature value so neighbouring values are adjacent.
    pairs = sorted(zip(values, labels), key=lambda item: item[0])
    best_score = float('inf')
    best_split = None

    for i in range(len(pairs) - 1):
        current_value, current_label = pairs[i]
        next_value, next_label = pairs[i + 1]

        # Equal neighbours leave no room for a threshold between them.
        if current_value == next_value:
            continue

        # Candidate cut point: midpoint between two distinct neighbours.
        threshold = (current_value + next_value) / 2
        left_labels = [label for value, label in pairs if value <= threshold]
        right_labels = [label for value, label in pairs if value > threshold]

        # Lower weighted impurity means purer children.
        score = weighted_gini(left_labels, right_labels)
        if score < best_score:
            best_score = score
            best_split = threshold

    return best_split, best_score


values = [2, 4, 7, 9]
labels = ['A', 'A', 'B', 'B']
print(best_threshold(values, labels))  # (5.5, 0.0)

This simple example shows the core idea that tree libraries automate internally.
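
The search above rebuilds both label lists for every candidate. One possible refinement, sketched here as an assumption rather than a description of any particular library, is to scan the sorted samples once and keep running class counts for each side. The names `gini_from_counts` and `best_threshold_single_pass` are hypothetical.

```python
from collections import Counter


def gini_from_counts(counts, total):
    """Gini impurity computed from a mapping of class -> count."""
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_threshold_single_pass(values, labels):
    """Find the lowest weighted-Gini threshold in one scan over sorted samples."""
    pairs = sorted(zip(values, labels), key=lambda item: item[0])
    n = len(pairs)
    left = Counter()                              # class counts left of the cut
    right = Counter(label for _, label in pairs)  # class counts right of the cut
    best_score, best_split = float('inf'), None

    for i in range(n - 1):
        value, label = pairs[i]
        # Move one sample from the right child to the left child.
        left[label] += 1
        right[label] -= 1

        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # no threshold fits between equal values

        n_left, n_right = i + 1, n - (i + 1)
        score = (
            n_left / n * gini_from_counts(left, n_left)
            + n_right / n * gini_from_counts(right, n_right)
        )
        if score < best_score:
            best_score, best_split = score, (value + next_value) / 2

    return best_split, best_score


print(best_threshold_single_pass([2, 4, 7, 9], ['A', 'A', 'B', 'B']))  # (5.5, 0.0)
```

The result matches the straightforward version; only the bookkeeping differs, since each candidate now costs work proportional to the number of classes instead of the number of samples.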

Regression trees use a different objective

For regression, the goal is not class purity but error reduction. Instead of Gini or entropy, regression trees often use mean squared error or variance reduction. The candidate thresholds are generated similarly, but the scoring function changes.

That is why the tree-building procedure feels similar across tasks even though the objective is different.
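
To make the parallel concrete, here is a minimal regression sketch that keeps the midpoint search and swaps the score for the weighted mean squared error of the two children. The function names and the target values are made up for illustration.

```python
def sse(targets):
    """Sum of squared errors around the mean of the targets."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)


def best_regression_threshold(values, targets):
    """Return (threshold, weighted MSE) for the best midpoint split."""
    pairs = sorted(zip(values, targets), key=lambda item: item[0])
    n = len(pairs)
    best_score, best_split = float('inf'), None

    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold fits between equal values
        left = [t for _, t in pairs[: i + 1]]
        right = [t for _, t in pairs[i + 1:]]
        # Weighted MSE of the children equals their combined SSE divided by n.
        score = (sse(left) + sse(right)) / n
        if score < best_score:
            best_score = score
            best_split = (pairs[i][0] + pairs[i + 1][0]) / 2

    return best_split, best_score


values = [2, 4, 7, 9]
targets = [1.0, 1.2, 3.9, 4.1]  # toy targets, invented for this example
print(best_regression_threshold(values, targets))  # threshold 5.5, MSE about 0.01
```

Each child predicts the mean of its targets, so minimizing the weighted MSE is the same as maximizing variance reduction relative to the parent node.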

Why trees do not test every possible number

Thresholds such as 5.500001 and 5.500002 produce the same partition if no training value falls between them. So a decision tree only needs to consider thresholds that actually change which samples go left or right, and one candidate per gap between adjacent distinct values covers every such partition. That is why midpoint candidates are enough.

This keeps training efficient and makes the split search well-defined.
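
A quick check of this claim on the earlier toy feature, using a hypothetical `partition` helper that applies the same `value <= threshold` rule as the search code:

```python
def partition(values, threshold):
    """Split values into (left, right) using the rule value <= threshold."""
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    return left, right


values = [2, 4, 7, 9]
# Every threshold strictly between 4 and 7 sends the same samples left and right.
print(partition(values, 5.500001) == partition(values, 5.500002))  # True
print(partition(values, 4.1) == partition(values, 6.9))            # True
# Crossing a training value changes the partition.
print(partition(values, 5.5) == partition(values, 7.5))            # False
```

With four distinct values there are only three distinct non-trivial partitions, so three candidates cover the whole search space.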

Common Pitfalls

The most common mistake is assuming that thresholds come from hand-crafted business logic such as "split at age 30." A trained tree picks its thresholds from the objective function and the data distribution.

Another issue is forgetting that threshold choice at one node depends only on the samples that reached that node. The tree is solving a local optimization repeatedly, not a single global threshold problem.

Be careful with overfitting too. If the tree is allowed to keep splitting until every tiny impurity improvement is exploited, it may fit noise rather than signal.
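
The two points above, local threshold search and limits on splitting, can be seen together in a small recursive builder. This is a sketch under stated assumptions: it handles a single numeric feature, and the parameter names `max_depth` and `min_samples_split` are chosen here for illustration.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """rows: list of (value, label). Return (threshold, weighted Gini) or None."""
    rows = sorted(rows, key=lambda r: r[0])
    best = None
    for (v1, _), (v2, _) in zip(rows, rows[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [lab for v, lab in rows if v <= t]
        right = [lab for v, lab in rows if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (t, score)
    return best


def build(rows, depth=0, max_depth=2, min_samples_split=2):
    labels = [lab for _, lab in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure, too small, or too deep.
    if len(set(labels)) == 1 or len(rows) < min_samples_split or depth >= max_depth:
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    t, _ = split
    # Each child searches for its own threshold using only its own samples.
    left = build([r for r in rows if r[0] <= t], depth + 1, max_depth, min_samples_split)
    right = build([r for r in rows if r[0] > t], depth + 1, max_depth, min_samples_split)
    return {"threshold": t, "left": left, "right": right}


rows = [(1, 'A'), (2, 'A'), (3, 'B'), (4, 'B'), (5, 'A'), (6, 'A')]
print(build(rows))
# {'threshold': 2.5, 'left': 'A', 'right': {'threshold': 4.5, 'left': 'B', 'right': 'A'}}
```

The threshold 4.5 is chosen only from the four samples that reached the right child, and lowering `max_depth` cuts the recursion off before that second split is made.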

Finally, remember that tree thresholds are data-dependent. Small changes in the dataset can produce different split points, which is one reason random forests and boosting often outperform a single tree.
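
The sensitivity to data is easy to reproduce. The sketch below uses a condensed copy of the midpoint search and adds one made-up sample to the earlier toy dataset:

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Condensed midpoint search: return the lowest weighted-Gini threshold."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    best_split, best_score = None, float('inf')
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_split, best_score = t, score
    return best_split


print(best_threshold([2, 4, 7, 9], ['A', 'A', 'B', 'B']))          # 5.5
# One extra (made-up) sample at 5 with label B moves the boundary.
print(best_threshold([2, 4, 5, 7, 9], ['A', 'A', 'B', 'B', 'B']))  # 4.5
```

A single added point shifts the chosen threshold from 5.5 to 4.5, and in a deeper tree that shift would change which samples every descendant node sees.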

Summary

Numeric decision tree splits use candidate thresholds derived from sorted feature values.
In classification, the best threshold minimizes weighted impurity such as Gini or entropy.
In regression, the same search pattern is used with an error-based criterion such as MSE.
Midpoints between adjacent values are enough because only changed partitions matter.
Threshold selection is local to each node and can overfit without proper stopping or pruning.