Can the value of information gain be negative?

The short answer is no, not if you are using the standard decision-tree definition of information gain and computing it correctly.

Information gain measures how much uncertainty is reduced after splitting a dataset on some feature. In a classification setting, we usually define it as:

  • Entropy(S) = -sum_i p_i * log2(p_i)
  • InformationGain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)

Here, S is the original dataset, p_i is the fraction of rows in S that belong to class i, A is the attribute you split on, and each S_v is the subset of S in which A takes the value v.

Why is the result non-negative? The second term is a weighted average of the subset entropies, and entropy is a concave function of the class distribution. The parent's class distribution is exactly the same weighted average of the subsets' class distributions, so by Jensen's inequality the weighted average of the subset entropies cannot exceed the entropy of the original set. At worst, the split tells you nothing useful, and the information gain is 0.
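
One way to make this precise is sketched below. The notation is an assumption added for illustration and is not part of the definitions above: Y is the class label, p(v) = |S_v| / |S|, p(y) is the class proportion in S, and p(v, y) is the fraction of rows with attribute value v and class y. With that notation, information gain is the mutual information between the label and the attribute, which is a Kullback-Leibler divergence and so cannot be negative:

```latex
\begin{aligned}
\mathrm{IG}(S, A) &= H(Y) - \sum_v p(v)\, H(Y \mid A = v) \\
                  &= \sum_v \sum_y p(v, y) \log_2 \frac{p(v, y)}{p(v)\, p(y)} \\
                  &= D_{\mathrm{KL}}\big(p(v, y) \,\|\, p(v)\, p(y)\big) \;\ge\; 0
\end{aligned}
```

Equality holds exactly when the attribute and the label are independent in S, which is the "split tells you nothing" case.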

Intuition

Think of entropy as disorder. Before splitting, your labels have some amount of uncertainty. After splitting on a feature, you look at the child groups separately. If the feature is helpful, those groups are more pure, so the average entropy drops. If the feature is useless, the average entropy stays the same. In the standard setup, it should not become larger.
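
The two cases can be checked on toy data. The sketch below is illustrative only: the helper names and the two hypothetical splits are assumptions made up for this example, not part of any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def weighted_child_entropy(groups):
    """Average entropy of the child groups, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum((len(g) / total) * entropy(g) for g in groups)


labels = ["Yes", "Yes", "No", "No"]

# Hypothetical useless feature: each child keeps the parent's 50/50 label mix.
useless_split = [["Yes", "No"], ["Yes", "No"]]

# Hypothetical helpful feature: each child is pure.
helpful_split = [["Yes", "Yes"], ["No", "No"]]

parent = entropy(labels)
gain_useless = parent - weighted_child_entropy(useless_split)
gain_helpful = parent - weighted_child_entropy(helpful_split)

print(gain_useless)  # 0.0
print(gain_helpful)  # 1.0
```

The useless split leaves the average entropy at 1 bit, so the gain is 0. The helpful split removes all uncertainty, so the gain equals the full parent entropy.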

That is why information gain is usually described as an entropy reduction, not just an arbitrary score.

Small Example

Suppose your dataset has four rows:

  • A -> Yes
  • A -> No
  • B -> Yes
  • B -> Yes

Before the split:

  • p(Yes) = 3/4
  • p(No) = 1/4

So the starting entropy is about 0.811.
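
Written out, with logarithms rounded to three decimal places, that figure comes from:

```latex
\begin{aligned}
\mathrm{Entropy}(S) &= -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \\
                    &\approx \tfrac{3}{4}(0.415) + \tfrac{1}{4}(2.000) \\
                    &\approx 0.311 + 0.500 = 0.811
\end{aligned}
```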

Now split on the feature value:

  • Subset A has one Yes and one No, so its entropy is 1.0
  • Subset B has two Yes and zero No, so its entropy is 0.0

The weighted entropy after the split is:

  • (2/4) * 1.0 + (2/4) * 0.0 = 0.5

So the information gain is:

  • 0.811 - 0.5 = 0.311

That is positive, which matches the intuition that this split gives useful information.

When People See Negative Values

If you see a negative result in code, the usual explanation is not that information gain is truly negative. It is that something in the computation is wrong or mismatched.

Common causes include:

  • Using the wrong sign in the entropy formula
  • Forgetting the subset weights |S_v| / |S|
  • Mixing logarithm bases, for example log2 for the parent entropy and the natural log for the child entropies
  • Using counts in one place and probabilities in another
  • Dividing by the wrong sample size
  • Floating-point roundoff leading to a tiny negative value such as -1e-15

That last case is especially common. In practice, values extremely close to zero are often clamped:

python
gain = parent_entropy - weighted_child_entropy
if abs(gain) < 1e-12:
    gain = 0.0
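
Negative values that are clearly not roundoff usually come from the other items in the list. As a minimal sketch, using the four-row dataset from the small example and a hypothetical `entropy` helper, here is what happens when the subset weights are forgotten:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


parent = ["Yes", "No", "Yes", "Yes"]
children = [["Yes", "No"], ["Yes", "Yes"]]  # subsets A and B

# Buggy: child entropies are summed without the |S_v| / |S| weights.
buggy_gain = entropy(parent) - sum(entropy(c) for c in children)

# Correct: each child entropy is weighted by its share of the rows.
correct_gain = entropy(parent) - sum(
    (len(c) / len(parent)) * entropy(c) for c in children
)

print(round(buggy_gain, 3))    # -0.189
print(round(correct_gain, 3))  # 0.311
```

A value like -0.189 is far too large to be floating-point noise, so it should be treated as a sign of a bug rather than clamped away.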

Python Example

Here is a small, runnable example that computes entropy and information gain for a categorical split:

python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Return the Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    result = 0.0
    for count in counts.values():
        p = count / total
        result -= p * log2(p)
    return result


def information_gain(rows, feature_index, label_index):
    """Return the entropy reduction from splitting rows on one feature."""
    parent_labels = [row[label_index] for row in rows]
    parent_entropy = entropy(parent_labels)

    # Group the labels by the value of the chosen feature.
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[label_index])

    # Weight each child's entropy by its share of the rows.
    total = len(rows)
    weighted_child_entropy = 0.0
    for child_labels in groups.values():
        weighted_child_entropy += (len(child_labels) / total) * entropy(child_labels)

    return parent_entropy - weighted_child_entropy


rows = [
    ("A", "Yes"),
    ("A", "No"),
    ("B", "Yes"),
    ("B", "Yes"),
]

print(information_gain(rows, feature_index=0, label_index=1))

This prints a positive number close to 0.311.

Important Nuance

There is one subtle point worth mentioning. In decision trees, information gain computed as entropy reduction on a single dataset is non-negative. In broader statistical settings, where the quantity is estimated from a sample (for example with a bias-corrected or smoothed estimator), the estimate can come out negative because of sampling noise, regularization, or numerical approximation. That does not mean the true underlying quantity is negative. It means the estimate is imperfect.
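
As one illustration of such an estimator, the sketch below applies the Miller-Madow bias correction, which adds (m - 1) / (2n) nats to the plug-in entropy, where m is the number of distinct observed values and n is the sample size. The choice of this estimator, the function names, and the toy sample are all assumptions made for this example. Information gain is written here in its equivalent mutual-information form, H(label) + H(feature) - H(feature, label).

```python
from collections import Counter
from math import log, log2


def plugin_entropy(items):
    """Plug-in (maximum-likelihood) entropy estimate in bits."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def miller_madow_entropy(items):
    """Plug-in estimate plus the Miller-Madow correction, converted to bits."""
    n = len(items)
    m = len(set(items))  # number of distinct observed values
    return plugin_entropy(items) + (m - 1) / (2 * n * log(2))


# Hypothetical sample in which feature and label are exactly independent.
features = ["A", "A", "B", "B"] * 2
labels = ["Yes", "No", "Yes", "No"] * 2
pairs = list(zip(features, labels))

plugin_gain = plugin_entropy(labels) + plugin_entropy(features) - plugin_entropy(pairs)
corrected_gain = (
    miller_madow_entropy(labels)
    + miller_madow_entropy(features)
    - miller_madow_entropy(pairs)
)

print(plugin_gain)     # 0.0
print(corrected_gain)  # negative, roughly -0.09
```

The plug-in estimate is exactly 0 because the sample is perfectly balanced, while the corrected estimate dips below 0. The true quantity being estimated is still non-negative. Only the estimate is negative.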

So if you are reading a machine-learning textbook or implementing ID3, C4.5, or a similar tree algorithm, the practical rule is:

  • Theoretical information gain: non-negative
  • Slight negative output in code: usually numerical noise or a bug

Summary

  • Standard decision-tree information gain should not be negative.
  • A value of 0 means the split provides no reduction in uncertainty.
  • A positive value means the feature makes the labels more predictable after the split.
  • A negative result usually points to a coding mistake, a sign error, or numerical precision issues.
  • In approximate estimation settings, tiny negative estimates can appear even though the true quantity is not negative.
