Imbalanced Data: Undersampling or Oversampling?

Introduction

There is no universal rule that oversampling is better than undersampling or vice versa. The right choice depends on how much majority-class data you can afford to discard, how small the minority class is, how likely the model is to overfit duplicated examples, and whether class weighting can solve the problem without any resampling at all.

Start With the Real Goal, Not the Technique

The goal is not to make the dataset look balanced on paper. The goal is to improve minority-class performance without distorting evaluation or losing too much useful signal.

Before resampling, fix the evaluation setup:

  • split train and test first
  • use metrics such as recall, precision, F1, or PR AUC instead of plain accuracy; ROC AUC is also useful but can look optimistic when the minority class is very rare
  • consider class weights as a baseline

Many imbalanced problems are made worse by rushing into SMOTE or random undersampling before establishing a proper baseline.
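As a rough illustration of that setup, the sketch below splits first, trains an unweighted baseline, and reports per-class metrics plus PR AUC. The synthetic data from `make_classification` and the choice of `LogisticRegression` are assumptions made only for the example, not part of any specific project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 5% of rows belong to the minority class (label 1).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Split first; stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Unweighted baseline trained on the original (still imbalanced) training split.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-class precision, recall, and F1 instead of a single accuracy number.
print(classification_report(y_test, baseline.predict(X_test), digits=3))

# PR AUC (average precision) is computed from predicted scores, not hard labels.
pr_auc = average_precision_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"PR AUC: {pr_auc:.3f}")
```

Any later technique (class weights, oversampling, undersampling) can then be judged against these numbers on the same untouched test split.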

Undersampling: Faster but Risks Information Loss

Undersampling reduces the number of majority-class examples.

python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows from the training split until the classes are balanced.
X_resampled, y_resampled = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print(X_resampled.shape)

Advantages:

  • faster training
  • smaller datasets
  • useful when the majority class is extremely large

Disadvantages:

  • throws away data
  • can remove rare but important majority-class patterns
  • may destabilize the model if the retained subset is not representative

Undersampling is often reasonable when the majority class is huge and somewhat redundant.

Oversampling: Keeps Data but Can Overfit

Oversampling increases the representation of the minority class.

python
from imblearn.over_sampling import RandomOverSampler

# Randomly repeat minority-class rows in the training split until the classes are balanced.
X_resampled, y_resampled = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print(X_resampled.shape)

Advantages:

  • preserves all majority-class examples
  • often improves minority recall
  • simple to apply

Disadvantages:

  • increases training time
  • naive duplication can overfit the minority class
  • synthetic techniques can create unrealistic samples if the feature space is awkward

Oversampling is often a stronger default than aggressive undersampling when the minority class is truly scarce and every majority example may still carry signal.

Consider SMOTE and Class Weights Carefully

A common middle ground is synthetic oversampling such as SMOTE.

python
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

SMOTE can help, but it is not automatic magic. It can work poorly when:

  • the minority class is noisy
  • classes overlap heavily
  • features are categorical or highly structured
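The categorical case follows from how SMOTE builds points: each synthetic sample lies on the line segment between a minority sample and one of its minority-class nearest neighbours. The NumPy sketch below reproduces only that interpolation step on two hypothetical rows (`age`, `is_smoker`); it is a simplified illustration, not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority samples: [age, is_smoker], where is_smoker is a 0/1 flag.
sample = np.array([30.0, 0.0])
neighbor = np.array([50.0, 1.0])

# SMOTE-style step: pick a random point on the segment between the sample
# and its neighbour. Every feature is interpolated with the same factor.
gap = rng.uniform(0.0, 1.0)
synthetic = sample + gap * (neighbor - sample)

print(synthetic)
```

The interpolated age is plausible, but the `is_smoker` flag becomes a fraction between 0 and 1, which is not a value any real row can take. That is the kind of unrealistic sample the list above warns about.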

Also consider class weights before resampling, especially for linear models, tree ensembles, and neural networks.

python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights errors inversely to class frequency; no rows are added or removed.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

Sometimes class weighting gets most of the benefit without altering the data distribution at all.
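For reference, scikit-learn's "balanced" mode sets each class weight to `n_samples / (n_classes * count_of_that_class)`. The sketch below checks that formula against `compute_class_weight` on an assumed label vector with 900 majority and 100 minority rows.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy labels: 900 majority rows (0) and 100 minority rows (1).
y_train = np.array([0] * 900 + [1] * 100)

# Weights scikit-learn would use for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)

# The same values computed by hand: n_samples / (n_classes * per-class counts).
manual = len(y_train) / (2 * np.bincount(y_train))

print(dict(zip([0, 1], weights.round(3).tolist())))
```

With a 9:1 imbalance, a minority-class error ends up costing nine times as much as a majority-class error, while the training data itself is unchanged.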

Resample Only the Training Split

This is one of the most important rules. Never oversample or undersample before the train-test split.

Wrong:

  • resample the whole dataset
  • then split into train and test

Right:

  • split first
  • resample only the training data
  • evaluate on the untouched validation or test data

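Otherwise the evaluation becomes optimistic. Duplicated or synthetic minority points derived from the same original examples can land on both sides of the split, so the model is tested on near-copies of rows it trained on, and the test set no longer reflects the real class distribution.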

A Practical Decision Rule

A reasonable order of attack is:

  1. establish a baseline with proper metrics
  2. try class weights
  3. try modest oversampling if the minority class is too small
  4. try undersampling if the majority class is extremely large
  5. compare results with cross-validation, applying any resampling inside each training fold only

This is more reliable than picking a technique by slogan.

Common Pitfalls

  • Optimizing for accuracy on an imbalanced dataset often hides the fact that the minority class is being missed almost entirely.
  • Resampling before the train-test split causes leakage and makes the model look better than it really is.
  • Heavy undersampling can discard useful majority patterns and make the classifier brittle.
  • Random oversampling can overfit by repeating the same rare examples too many times.
  • Using SMOTE on unsuitable feature spaces or noisy minority data can generate misleading synthetic points.
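The first pitfall is easy to reproduce. In this small sketch, a hypothetical "model" that always predicts the majority class is scored on assumed labels with a 95:5 split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Assumed labels: 950 majority rows (0) and 50 minority rows (1).
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

Accuracy comes out at 0.95 even though not a single minority example is detected, which is why recall, precision, or PR AUC should drive the comparison.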

Summary

  • Neither undersampling nor oversampling is universally best.
  • Start with proper metrics and a baseline before changing the class distribution.
  • Try class weights first because they are often simpler and safer.
  • Use oversampling when minority data is scarce and majority data is still valuable.
  • Use undersampling when the majority class is huge and highly redundant, but be careful not to throw away important information.
