Imbalanced Data: Undersampling or Oversampling?

Introduction

There is no universal rule that oversampling is better than undersampling or vice versa. The right choice depends on how much majority-class data you can afford to discard, how small the minority class is, how likely the model is to overfit duplicated examples, and whether class weighting can solve the problem without any resampling at all.

Start With the Real Goal, Not the Technique

The goal is not to make the dataset look balanced on paper. The goal is to improve minority-class performance without distorting evaluation or losing too much useful signal.

Before resampling, fix the evaluation setup:

split train and test first
use metrics such as recall, precision, F1, or PR AUC instead of plain accuracy (ROC AUC is also common, but it can look optimistic when positives are very rare)
consider class weights as a baseline
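The checklist above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn is installed; the synthetic dataset, the roughly 5% positive rate, and all variable names are assumptions made for the example. It shows why plain accuracy misleads: a model that always predicts the majority class scores high accuracy with zero minority recall.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 5% positives (illustrative assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Split first, stratified so both splits keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A model that always predicts the majority class: high accuracy, zero recall.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)
dummy_accuracy = accuracy_score(y_test, dummy_pred)
dummy_recall = recall_score(y_test, dummy_pred)

# Class-weighted baseline, scored with imbalance-aware metrics.
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_train, y_train)
baseline_recall = recall_score(y_test, baseline.predict(X_test))
# average_precision_score is a common summary of the precision-recall curve.
baseline_pr_auc = average_precision_score(
    y_test, baseline.predict_proba(X_test)[:, 1]
)

print(f"dummy: accuracy={dummy_accuracy:.3f} recall={dummy_recall:.3f}")
print(f"weighted baseline: recall={baseline_recall:.3f} PR AUC={baseline_pr_auc:.3f}")
```

Any resampling technique tried later should beat these numbers on the same untouched test split to be worth keeping.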

Many imbalanced problems are made worse by rushing into SMOTE or random undersampling before establishing a proper baseline.

Undersampling: Faster but Riskier for Information Loss

Undersampling reduces the number of majority-class examples.

python

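from imblearn.under_sampling import RandomUnderSampler

# X_train and y_train are the training split only, never the test set.
# By default, every class except the minority is randomly cut to the minority size.
X_resampled, y_resampled = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print(X_resampled.shape)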

Advantages:

faster training
smaller datasets
useful when the majority class is extremely large

Disadvantages:

throws away data
can remove rare but important majority-class patterns
may destabilize the model if the retained subset is not representative

Undersampling is often reasonable when the majority class is huge and somewhat redundant.

Oversampling: Keeps Data but Can Overfit

Oversampling increases the representation of the minority class.

python

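from imblearn.over_sampling import RandomOverSampler

# X_train and y_train are the training split only, never the test set.
# By default, minority rows are sampled with replacement until they match the majority count.
X_resampled, y_resampled = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print(X_resampled.shape)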

Advantages:

preserves all majority-class examples
often improves minority recall
simple to apply

Disadvantages:

increases training time
naive duplication can overfit the minority class
synthetic techniques can create unrealistic samples if the feature space is awkward

Oversampling is often a stronger default than aggressive undersampling when the minority class is truly scarce and every majority example may still carry signal.

Consider SMOTE and Class Weights Carefully

A common middle ground is synthetic oversampling such as SMOTE.

python

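from imblearn.over_sampling import SMOTE

# SMOTE builds synthetic minority rows by interpolating between a minority
# sample and its nearest minority-class neighbors, so it expects numeric features.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)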

SMOTE can help, but it is not automatic magic. It can work poorly when:

the minority class is noisy
classes overlap heavily
features are categorical or highly structured

Also consider class weights before resampling, especially for linear models, tree ensembles, and neural networks.

python

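from sklearn.linear_model import LogisticRegression

# class_weight="balanced" weights each class inversely to its frequency in y_train,
# so mistakes on the minority class cost more during fitting.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)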

Sometimes class weighting gets most of the benefit without altering the data distribution at all.
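It helps to see what the "balanced" setting actually computes. As a small worked example on made-up labels (950 negatives, 50 positives, an assumption for illustration), scikit-learn's documented heuristic is `n_samples / (n_classes * count_per_class)`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption): 950 majority, 50 minority.
y_train = np.array([0] * 950 + [1] * 50)
classes = np.array([0, 1])

# Manual version of the "balanced" heuristic.
manual_weights = y_train.size / (classes.size * np.bincount(y_train))

# The same weights computed by scikit-learn's helper.
sklearn_weights = compute_class_weight("balanced", classes=classes, y=y_train)

print(dict(zip(classes.tolist(), manual_weights.round(3).tolist())))
# → {0: 0.526, 1: 10.0}
```

Here each minority example counts about 19 times as much as a majority example in the loss, which has an effect loosely similar to oversampling without adding any rows.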

Resample Only the Training Split

This is one of the most important rules. Never oversample or undersample before the train-test split.

Wrong:

resample the whole dataset
then split into train and test

Right:

split first
resample only the training data
evaluate on the untouched validation or test data

Otherwise the evaluation becomes optimistic: copies or synthetic neighbors of test examples end up in the training data, and the resampled test set no longer reflects the real class distribution.

A Practical Decision Rule

A reasonable order of attack is:

establish a baseline with proper metrics
try class weights
try modest oversampling if the minority class is too small
try undersampling if the majority class is extremely large
compare results with cross-validation, resampling inside each training fold only

This is more reliable than picking a technique by slogan.

Common Pitfalls

Optimizing for accuracy on an imbalanced dataset often hides the fact that the minority class is being missed almost entirely.
Resampling before the train-test split causes leakage and makes the model look better than it really is.
Heavy undersampling can discard useful majority patterns and make the classifier brittle.
Random oversampling can overfit by repeating the same rare examples too many times.
Using SMOTE on unsuitable feature spaces or noisy minority data can generate misleading synthetic points.

Summary

Neither undersampling nor oversampling is universally best.
Start with proper metrics and a baseline before changing the class distribution.
Try class weights first because they are often simpler and safer.
Use oversampling when minority data is scarce and majority data is still valuable.
Use undersampling when the majority class is huge and highly redundant, but be careful not to throw away important information.