Information Gain Calculation with scikit-learn

Introduction

Information gain measures how much knowing a feature reduces uncertainty about the target. scikit-learn has no function literally named information_gain. You can either compute the entropy-based quantity yourself or use a related feature-selection API such as mutual_info_classif. Which one fits depends on whether you want exact split-style reasoning or a practical ranking function.

What Information Gain Means

For classification, information gain is usually described as:

  • entropy before a split
  • minus the weighted entropy after the split

That is why decision trees prefer splits that make the class labels inside each child node more predictable.

At a conceptual level:

text
information gain = parent entropy - weighted child entropy

If a split makes the target much less uncertain, the gain is high.
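The same relationship can be written with symbols. The notation is an assumption of this sketch rather than anything scikit-learn defines: S is the set of samples at the parent node, A is the feature, S_v is the subset of S where A takes the value v, and p_c is the fraction of samples in class c.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{IG}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

With a base-2 logarithm the result is in bits. A pure child node contributes zero to the weighted sum.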

Compute It Manually for a Categorical Split

If you want the textbook quantity directly, compute it from counts.

python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Shannon entropy in bits of the label distribution.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature, target):
    parent_entropy = entropy(target)
    grouped = defaultdict(list)

    # Collect the target labels that fall under each feature value.
    for x, y in zip(feature, target):
        grouped[x].append(y)

    # Weight each child's entropy by its share of the samples.
    child_entropy = 0.0
    total = len(target)
    for labels in grouped.values():
        child_entropy += (len(labels) / total) * entropy(labels)

    return parent_entropy - child_entropy


feature = ["sunny", "sunny", "rainy", "rainy", "overcast"]
target = ["no", "no", "yes", "yes", "yes"]

print(information_gain(feature, target))

This is the clearest route when you are learning the formula or debugging a small example.
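As a check on the five-row example, the arithmetic can be done by hand. The target holds two "no" and three "yes" labels, and every feature value maps to a single class, so each child entropy is zero.

```latex
H(\text{target}) = -\left(\tfrac{2}{5}\log_2\tfrac{2}{5} + \tfrac{3}{5}\log_2\tfrac{3}{5}\right) \approx 0.971

\mathrm{IG} \approx 0.971 - \left(\tfrac{2}{5}\cdot 0 + \tfrac{2}{5}\cdot 0 + \tfrac{1}{5}\cdot 0\right) = 0.971 \text{ bits}
```

The gain equals the full parent entropy because this feature separates the classes perfectly.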

What scikit-learn Gives You Directly

For practical feature selection, scikit-learn exposes mutual_info_classif and mutual_info_regression. Both estimate the mutual information between each feature and the target, which for a discrete feature and a discrete target is the same quantity as information gain. They are usually the right built-in tools when you want to rank features by dependency with the target.

python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

X = pd.DataFrame({
    "age": [20, 22, 35, 40, 60, 62],
    "owns_house": [0, 0, 1, 1, 1, 1],
    "city_code": [1, 1, 2, 2, 3, 3],
})

y = [0, 0, 1, 1, 1, 1]

# discrete_features marks age as continuous and the other two columns as discrete.
scores = mutual_info_classif(X, y, discrete_features=[False, True, True], random_state=42)
print(dict(zip(X.columns, scores)))

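This does not replicate a single decision-tree split formula exactly: scores are reported in nats rather than bits, and continuous features are handled by a nearest-neighbor estimator rather than by a split. It still gives you a non-negative dependency score that serves a similar feature-ranking purpose.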
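For a feature marked as discrete, the score and the textbook quantity can be compared directly. The sketch below assumes that mutual_info_classif falls back to a count-based estimate in nats when both the feature and the target are discrete, so dividing by ln 2 should recover the gain in bits. The toy data and the helper names entropy_bits and information_gain_bits are hypothetical.

```python
import math
from collections import Counter, defaultdict

import numpy as np
from sklearn.feature_selection import mutual_info_classif


def entropy_bits(labels):
    # Shannon entropy in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain_bits(feature, target):
    # Parent entropy minus the sample-weighted entropy of each feature value.
    groups = defaultdict(list)
    for x, y in zip(feature, target):
        groups[x].append(y)
    weighted = sum(len(g) / len(target) * entropy_bits(g) for g in groups.values())
    return entropy_bits(target) - weighted


# Hypothetical toy data: one integer-coded categorical feature.
city_code = [1, 1, 2, 2, 3, 3, 1, 2]
y = [0, 0, 1, 1, 1, 0, 0, 1]

X = np.array(city_code).reshape(-1, 1)
mi_nats = mutual_info_classif(X, y, discrete_features=True, random_state=0)[0]

ig_bits = information_gain_bits(city_code, y)
mi_bits = mi_nats / math.log(2)  # convert nats to bits
print(ig_bits, mi_bits)
```

If the two printed values agree, the remaining difference between the approaches is the unit and the treatment of continuous columns, not the underlying idea.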

Decision Trees Use the Idea Internally

If your real goal is to understand how trees choose splits, scikit-learn decision trees already use impurity criteria internally. With an entropy-based criterion, the chosen split is effectively guided by information gain.

python
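from sklearn.tree import DecisionTreeClassifier

# Reuses X and y from the previous example.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
clf.fit(X, y)

# Feature index used at each node; leaf nodes are reported as -2.
print(clf.tree_.feature)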

This tells you which feature index was chosen at each split. The exact gain values are not exposed as a simple top-level helper function, which is why manual calculation or related feature-selection tools are often used when you need interpretability outside the tree itself.
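If you do need per-split gains from a fitted tree, one option is to derive them from the arrays on clf.tree_. The sketch below is one way to do that, not an official helper: it assumes the entropy criterion stores node impurity in bits and recomputes parent impurity minus weighted child impurity for every internal node. The data repeats the earlier toy example as a plain array.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same toy data as the earlier example: age, owns_house, city_code.
X = np.array([
    [20, 0, 1],
    [22, 0, 1],
    [35, 1, 2],
    [40, 1, 2],
    [60, 1, 3],
    [62, 1, 3],
])
y = np.array([0, 0, 1, 1, 1, 1])

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=42)
clf.fit(X, y)

tree = clf.tree_
gains = {}
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaf nodes have no children and no split
        continue
    n = tree.weighted_n_node_samples[node]
    child_impurity = (
        tree.weighted_n_node_samples[left] / n * tree.impurity[left]
        + tree.weighted_n_node_samples[right] / n * tree.impurity[right]
    )
    gains[node] = tree.impurity[node] - child_impurity

for node, gain in gains.items():
    print(f"node {node}: feature {tree.feature[node]}, gain {gain:.3f}")
```

On this data every feature can separate the classes in one split, so the tree should have a single internal node whose gain equals the parent entropy.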

Choose the Right Tool for the Goal

Use manual entropy calculations when:

  • you are studying the formula
  • you want exact split-level information gain
  • the feature is categorical and the example is small

Use scikit-learn feature-selection functions when:

  • you want a practical ranking of many features
  • the dataset includes continuous variables
  • you are building a pipeline rather than teaching the math

The confusion usually comes from expecting one scikit-learn helper to cover both goals directly.
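For the pipeline case, a minimal sketch is to hand mutual_info_classif to SelectKBest as the scoring function. The integer-coded data is hypothetical, and functools.partial is used here only to fix discrete_features and random_state, which SelectKBest would not otherwise pass through.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical integer-coded features: column 0 mirrors the target,
# column 1 is unrelated to it, column 2 is only partly informative.
X = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=score_func, k=1)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)         # one score per column
print(selector.get_support())   # boolean mask of the columns that were kept
```

The fitted selector can be placed in front of a classifier as a pipeline step, so the ranking is recomputed on each training fold rather than on the full dataset.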

Common Pitfalls

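One common mistake is assuming mutual_info_classif is a literal drop-in implementation of the textbook discrete information-gain split formula. It is closely related, but it is a feature-selection estimator: it scores whole features rather than individual splits, and it estimates the score for continuous features instead of computing it from counts. It is not a direct tree-split debugger.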

Another is mixing regression and classification scoring functions. Use mutual_info_regression for continuous targets and mutual_info_classif for discrete targets.
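The pairing can be illustrated on synthetic data. In this sketch the data is generated purely for illustration: column 0 drives both targets and column 1 is noise, so each function should score column 0 higher when it is matched with the right kind of target.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # column 0 is informative, column 1 is noise

y_continuous = X[:, 0] + 0.1 * rng.normal(size=300)  # regression-style target
y_discrete = (X[:, 0] > 0).astype(int)               # classification-style target

# Continuous target -> mutual_info_regression.
reg_scores = mutual_info_regression(X, y_continuous, random_state=0)
# Discrete target -> mutual_info_classif.
clf_scores = mutual_info_classif(X, y_discrete, random_state=0)

print("regression scores:", reg_scores)
print("classification scores:", clf_scores)
```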

Developers also sometimes forget that entropy-based calculations depend on how features are represented. The count-based formula assumes a small set of discrete values, so a continuous variable has to be thresholded or binned first, or scored with an estimator designed for continuous inputs.
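One way to treat a numeric feature is to search for the binary threshold with the highest gain, in the spirit of a decision-tree split. The sketch below uses midpoints between adjacent distinct values as candidate thresholds; that is a common convention assumed for illustration, and best_threshold_gain is a hypothetical helper name.

```python
import math
from collections import Counter


def entropy(labels):
    # Shannon entropy in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold_gain(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`."""
    parent = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


age = [20, 22, 35, 40, 60, 62]
target = [0, 0, 1, 1, 1, 1]

threshold, gain = best_threshold_gain(age, target)
print(threshold, round(gain, 3))  # 28.5 0.918
```

Splitting at 28.5 puts both class-0 rows on one side and all class-1 rows on the other, so the gain equals the parent entropy.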

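Finally, do not over-interpret small differences in scores on tiny datasets. Estimates from a handful of samples are noisy, and even a feature unrelated to the target can show a positive gain by chance. Information-based measures become more meaningful with enough data.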
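The effect of sample size can be seen with a small simulation. This sketch scores a purely random feature against a purely random target, so the true gain is zero and anything measured is sampling noise. The sample sizes, trial counts, and the helper mean_gain_of_noise are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    # Shannon entropy in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature, target):
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    weighted = sum(len(g) / len(target) * entropy(g) for g in groups.values())
    return entropy(target) - weighted


def mean_gain_of_noise(n_samples, n_trials, rng):
    """Average gain of a random 4-valued feature against a random binary target."""
    total = 0.0
    for _ in range(n_trials):
        feature = [rng.randrange(4) for _ in range(n_samples)]
        target = [rng.randrange(2) for _ in range(n_samples)]
        total += information_gain(feature, target)
    return total / n_trials


rng = random.Random(0)
small = mean_gain_of_noise(n_samples=10, n_trials=200, rng=rng)
large = mean_gain_of_noise(n_samples=2000, n_trials=50, rng=rng)
print(f"n=10: {small:.3f} bits, n=2000: {large:.4f} bits")
```

The average gain at n=10 should be clearly above zero while the average at n=2000 should be close to it, which is why a small score gap on a tiny dataset is weak evidence.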

Summary

  • Information gain is entropy reduction after splitting on a feature.
  • You can compute it manually in Python for textbook categorical examples.
  • In scikit-learn, mutual_info_classif is the usual built-in tool for information-based feature ranking in classification.
  • Entropy-based decision trees use the same underlying idea internally.
  • Choose manual calculation or scikit-learn utilities based on whether you want theory, debugging, or production feature selection.
