Data Standardization
Data Normalization
Robust Scaler
Data Preprocessing
Machine Learning Techniques

Data Standardization vs Normalization vs Robust Scaler

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

In the realm of data preprocessing, particularly in machine learning, it is crucial to scale your features correctly to ensure that your models perform optimally. Three common techniques to adjust the scales of features are data standardization, normalization, and robust scaling. Each method has its own set of use-cases, benefits, and limitations. In this article, we'll dissect each approach, explain the mathematical underpinnings, and discuss scenarios where they are most applicable.

Data Standardization

Data standardization is a scaling technique where the features are rescaled so that they have the properties of a Gaussian distribution with zero mean and a standard deviation of one. This is often referred to as Z-score standardization.

Technical Explanation

The formula for standardizing a dataset is:

X=XμσX' = \frac{X - \mu}{\sigma}

Where: • XX is the original data • μ\mu is the mean of the data • σ\sigma is the standard deviation of the data • XX' is the standardized data

Use-Case

Standardization is typically used when the algorithms assume that the data is centered around zero. Linear regression, logistic regression, and algorithms like Support Vector Machines (SVM) and Principal Component Analysis (PCA) often benefit from standardized data.

Example

Consider a dataset with a feature height ranging from 150 cm to 200 cm. If we standardize this feature, it will have a mean of 0 and a standard deviation of 1, centering the data around zero.

Data Normalization

Normalization is the process of scaling individual samples to have unit norms. It commonly scales the data between 0 and 1.

Technical Explanation

The formula for min-max normalization is:

X=XXminXmaxXminX' = \frac{X - X_{min}}{X_{max} - X_{min}}

Where: • XX is the original data • $X_\{min\}$ and $X_\{max\}$ are the minimum and maximum values in the dataset, respectively • XX' is the normalized data

Use-Case

Normalization is useful when the data is not bounded and you want to convert it into a bounded interval. It is highly beneficial when the model does not assume any nature or distribution of the data, such as K-nearest neighbors and neural networks.

Example

If you have a feature price ranging from 30,000to30,000 to120,000, normalizing this feature will scale it between 0 and 1.

Robust Scaler

Robust scaling involves scaling features using statistics that are robust to outliers. Unlike standard normalization, it uses the median and the interquartile range (IQR).

Technical Explanation

The formula for robust scaling is:

X=XQ1(X)IQR(X)X' = \frac{X - Q_{1}(X)}{IQR(X)}

Where: • Q1(X)Q_{1}(X) is the 25th percentile of data • IQR(X)=Q3(X)Q1(X)IQR(X) = Q_{3}(X) - Q_{1}(X), where Q3(X)Q_{3}(X) is the 75th percentile • XX' is the robust scaled data

Use-Case

Robust scaling is especially useful in datasets with many outliers. Models like tree-based algorithms might not need feature scaling, but using robust scaling can still improve gradient-based optimization convergence in others.

Example

For a feature income where most of the data is concentrated around the median and has large outliers, robust scaling helps center the data around the median with scales less influenced by extreme values.

Key Differences Summarized

Below is a table summarizing the key characteristics, benefits, and typical usage scenarios of each method:

FeatureStandardizationNormalizationRobust Scaling
Formula$\frac\{X - \mu\}\{\sigma\}$$\frac\{X - X_\{min\}\}\{X_\{max\} - X_\{min\}\}$$\frac\{X - Q_\{1\}(X)\}\{IQR(X)\}$
CentroidZero meanNo standard centroidMedian-centered
ScaleUnit variance0 to 1 for bounded normalizationMedian and IQR influenced
SensitivitySensitive to outliersSensitive to outliersLess sensitive to outliers
Use CasesSVM, PCA, regression algorithmsKNN, neural networksData with heavy outliers

Conclusion

Choosing the right preprocessing technique is critical for model performance and convergence. Understanding your data's distribution and outlier sensitivity is key to selecting between standardization, normalization, and robust scaling. Always remember to test different scaling methods to ascertain which one offers the best improvement for your specific application.


Course illustration
Course illustration

All Rights Reserved.