Can the value of information gain be negative?
The short answer is no, not if you are using the standard decision-tree definition of information gain and computing it correctly.
Information gain measures how much uncertainty is reduced after splitting a dataset on some feature. In a classification setting, we usually define it as:
Entropy(S) = -sum_i p_i * log2(p_i)
InformationGain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
Here, S is the original dataset, p_i is the fraction of rows in S that belong to class i, A is the attribute you split on, and each S_v is the subset of rows where A takes the value v.
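Why is the result non-negative? The second term is a weighted average of the subset entropies, with weights |S_v| / |S| that sum to 1, and the class distribution of S is the same weighted average of the subsets' class distributions. Entropy is a concave function, so by Jensen's inequality the entropy of that mixture is at least the weighted average of the subset entropies. Equivalently, information gain is the mutual information between the attribute and the class label, which is never negative. At worst, the split tells you nothing useful, and the information gain is 0.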
Intuition
Think of entropy as disorder. Before splitting, your labels have some amount of uncertainty. After splitting on a feature, you look at the child groups separately. If the feature is helpful, those groups are more pure, so the average entropy drops. If the feature is useless, the average entropy stays the same. In the standard setup, it should not become larger.
That is why information gain is usually described as an entropy reduction, not just an arbitrary score.
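The two extremes described above can be checked on a toy dataset. This is a minimal sketch: the helper names `entropy` and `information_gain` and the four-row data are illustrative assumptions, not taken from any library.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


labels = ["Yes", "Yes", "No", "No"]
useless = ["A", "B", "A", "B"]  # each group keeps the original 50/50 mix
perfect = ["A", "A", "B", "B"]  # each group contains a single class

gain_useless = information_gain(useless, labels)
gain_perfect = information_gain(perfect, labels)
print(gain_useless, gain_perfect)
```

The useless feature leaves the average entropy unchanged, so its gain is 0, while the perfect feature removes all uncertainty and gains the full 1 bit.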
Small Example
Suppose your dataset has four rows:
A -> Yes
A -> No
B -> Yes
B -> Yes
Before the split:
p(Yes) = 3/4
p(No) = 1/4
So the starting entropy is -(3/4) * log2(3/4) - (1/4) * log2(1/4), which is about 0.811.
Now split on the feature value:
- Subset A has one Yes and one No, so its entropy is 1.0
- Subset B has two Yes and zero No, so its entropy is 0.0
The weighted entropy after the split is:
(2/4) * 1.0 + (2/4) * 0.0 = 0.5
So the information gain is:
0.811 - 0.5 = 0.311
That is positive, which matches the intuition that this split gives useful information.
When People See Negative Values
If you see a negative result in code, the usual explanation is not that information gain is truly negative. It is that something in the computation is wrong or mismatched.
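As one illustration of such a mismatch, the sketch below reuses the four-row example (3 Yes and 1 No, split into subsets A and B) and deliberately drops the subset weights. The `entropy` helper and the variable names are assumptions made for this illustration.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) from a list of class counts."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)


parent = [3, 1]              # 3 Yes, 1 No
children = [[1, 1], [2, 0]]  # subset A, subset B
n = sum(parent)

# Correct: each child entropy is weighted by its share of the rows.
correct = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# Buggy: the weights are dropped, so the child entropies are simply added.
buggy = entropy(parent) - sum(entropy(ch) for ch in children)

print(round(correct, 3), round(buggy, 3))
```

The weighted version stays positive, while the unweighted version subtracts too much and goes negative. The negative number is a symptom of the bug, not a property of the split.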
Common causes include:
- Using the wrong sign in the entropy formula
- Forgetting the subset weights |S_v| / |S|
- Using a different logarithm base in one part of the computation
- Using counts in one place and probabilities in another
- Dividing by the wrong sample size
- Floating-point roundoff leading to a tiny negative value such as -1e-15
That last case is especially common. In practice, negative values within roundoff distance of zero are often clamped to 0.
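A clamp along these lines might look as follows. The function name `clamp_gain` and the tolerance of 1e-12 are assumptions for this sketch; a suitable tolerance depends on the scale of your data. Larger negative values are deliberately left untouched so that a real bug stays visible.

```python
def clamp_gain(gain, tol=1e-12):
    """Return 0.0 for tiny negative values caused by roundoff.

    Values more negative than -tol are returned unchanged, because they
    more likely indicate a bug than numerical noise.
    """
    if -tol <= gain < 0.0:
        return 0.0
    return gain


print(clamp_gain(-1e-15))  # roundoff-sized value, clamped to 0.0
print(clamp_gain(-0.2))    # too large to be roundoff, left as -0.2
```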
Python Example
Here is a small, runnable example that computes entropy and information gain for a categorical split:
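A minimal sketch of such an example follows. The names `entropy` and `information_gain` are illustrative choices rather than functions from any library, and the data is the four-row dataset from the small example.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    # Group the labels by the feature value of their row.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)

    # Weight each subset's entropy by its share of the rows: |S_v| / |S|.
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


feature = ["A", "A", "B", "B"]
labels = ["Yes", "No", "Yes", "Yes"]

gain = information_gain(feature, labels)
print(round(gain, 3))
```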
This prints a positive number close to 0.311.
Important Nuance
There is one subtle point worth mentioning. In decision trees, information gain as entropy reduction should be non-negative. But in broader statistical or approximate estimation settings, you can sometimes see negative values from an estimator due to sampling noise, regularization, or numerical approximation. That does not mean the true underlying quantity is negative. It means the estimate is imperfect.
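One way to see an estimator go negative, as a sketch only: subtract a chance baseline from the plug-in gain. This adjustment is chosen here purely for illustration and is not part of ID3 or C4.5. The baseline is assumed to be the mean gain over every reordering of the labels, and the names `plug_in`, `baseline`, and `adjusted` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


feature = ["A", "B", "A", "B"]
labels = ["Yes", "Yes", "No", "No"]

# Plug-in estimate: the standard entropy reduction, never below 0.
plug_in = information_gain(feature, labels)

# Hypothetical chance baseline: mean gain over all reorderings of the labels.
shuffled = [information_gain(feature, list(p)) for p in permutations(labels)]
baseline = sum(shuffled) / len(shuffled)

# The adjusted estimate can fall below 0 even though the plug-in gain cannot.
adjusted = plug_in - baseline
print(plug_in, round(adjusted, 3))
```

Here the plug-in gain is exactly 0, but some reorderings of the labels happen to line up with the feature, so the baseline is positive and the adjusted estimate is negative. The underlying entropy reduction has not become negative; only the corrected estimate has.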
So if you are reading a machine-learning textbook or implementing ID3, C4.5, or a similar tree algorithm, the practical rule is:
- Theoretical information gain: non-negative
- Slight negative output in code: usually numerical noise or a bug
Summary
- Standard decision-tree information gain should not be negative.
- A value of 0 means the split provides no reduction in uncertainty.
- A positive value means the feature makes the labels more predictable after the split.
- A negative result usually points to a coding mistake, a sign error, or numerical precision issues.
- In approximate estimation settings, tiny negative estimates can appear even though the true quantity is not negative.

