Decision Tree Learning and Impurity
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Decision tree learning is a widely used method in machine learning for both classification and regression tasks. It involves the creation of a model that predicts values of a target variable by learning decision rules inferred from data features. The model is represented as a tree structure, which is easy to interpret, visualize, and deploy. One key aspect of constructing decision trees is understanding how nodes in the tree split data, which is directly informed by a concept known as "impurity."
Decision Tree Learning
Understanding Decision Trees
A decision tree consists of nodes representing attribute tests, branches representing outcomes of these tests, and leaf nodes containing output labels:
• Root Node: The topmost node in the tree, representing the entire dataset. • Internal Nodes: These nodes denote test conditions on one of the feature variables. Each node separates data into child nodes that represent different possible outcomes of the test. • Leaf Nodes: These are terminal nodes that provide the predicted outcome.
Decision Tree Algorithm
The construction of a decision tree involves determining which feature to consider at each node and what threshold or condition should be applied to decide data splits. The two popular algorithms for building decision trees include:
• ID3 (Iterative Dichotomiser 3): Utilizes information gain based on entropy to construct a tree. • CART (Classification and Regression Trees): Uses Gini impurity to construct binary trees.
The choice of criteria for splitting impacts the efficiency and accuracy of the decision tree and is crucial in decision tree learning.
Impurity Measures
Impurity measures quantify the disorder or impurity in a dataset, with the primary goal of minimizing impurity at each node. Here are the primary impurity indices used:
Entropy and Information Gain
Entropy measures the homogeneity of a sample. It can be defined as:
where is the proportion of class in the subset .
Information Gain (IG) is the difference between the entropy of the original dataset and the weighted average entropy after a split:
Here, is the attribute being used for the split, are the possible values of , and is the subset of for which attribute takes the value .
Gini Impurity
Gini impurity is another measure used to gauge the impurity of a data sample:
A node with lower gini impurity is preferred as it indicates better purity.
Choosing Between Splitting Criteria
The choice between entropy and gini impurity often boils down to specific application requirements. Information gain (entropy-based) tends to find better splits for specific distributions, whereas gini impurity can be more computationally efficient and yield similar results.
Example
Consider a dataset of weather conditions used to decide whether to play a game outside. It includes attributes like outlook (sunny, overcast, rainy), temperature, humidity, and wind.
- Compute Entropy: Calculate the entropy of the entire dataset.
- Calculate Information Gain: For each attribute, calculate the information gain to determine which attribute provides the highest gain, indicating it as the best feature to split the node.
For example, splitting based on the outlook might yield the highest information gain.
- Gini Impurity Calculation: Perform similar calculations using gini impurity to ascertain the purity improvement after splits.
Summary Table
| Feature | Initial Impurity | Split Impurity | Gain | Reduction |
| Outlook | 0.94 (Entropy) | 0.694 (Avg) | 0.246 | |
| Temperature | 0.94 (Entropy) | 0.911 (Avg) | 0.029 | |
| Humidity | 0.94 (Entropy) | 0.788 (Avg) | 0.151 | |
| Wind | 0.94 (Entropy) | 0.892 (Avg) | 0.048 |
Overfitting and Pruning
One major issue with decision trees is that they can easily overfit, especially if allowed to grow too deep. Pruning techniques like reduced-error pruning and cost-complexity pruning help simplify trees, improving generalization by removing sections of the tree that provide little power in predicting the target variable.
Conclusion
Decision tree learning stands out for its simplicity and interpretability. The careful selection of splitting criteria is paramount in constructing an accurate and efficient model, with impurity measures being key in driving split decisions. Whether employing entropy or gini impurity, practitioners must balance the complexity of trees against overfitting. By doing so, decision trees remain a staple choice for many classification and regression analysis tasks.

