Data Standardization vs Normalization vs Robust Scaler

Data Standardization

Data Normalization

Robust Scaler

Data Preprocessing

Machine Learning Techniques

Data Standardization vs Normalization vs Robust Scaler

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

In the realm of data preprocessing, particularly in machine learning, it is crucial to scale your features correctly to ensure that your models perform optimally. Three common techniques to adjust the scales of features are data standardization, normalization, and robust scaling. Each method has its own set of use-cases, benefits, and limitations. In this article, we'll dissect each approach, explain the mathematical underpinnings, and discuss scenarios where they are most applicable.

Data Standardization

Data standardization is a scaling technique where the features are rescaled so that they have the properties of a Gaussian distribution with zero mean and a standard deviation of one. This is often referred to as Z-score standardization.

Technical Explanation

The formula for standardizing a dataset is:

$X' = \frac{X - \mu}{\sigma}$

Where: • $X$ is the original data • $\mu$ is the mean of the data • $\sigma$ is the standard deviation of the data • $X'$ is the standardized data

Use-Case

Standardization is typically used when the algorithms assume that the data is centered around zero. Linear regression, logistic regression, and algorithms like Support Vector Machines (SVM) and Principal Component Analysis (PCA) often benefit from standardized data.

Example

Consider a dataset with a feature height ranging from 150 cm to 200 cm. If we standardize this feature, it will have a mean of 0 and a standard deviation of 1, centering the data around zero.

Data Normalization

Normalization is the process of scaling individual samples to have unit norms. It commonly scales the data between 0 and 1.

Technical Explanation

The formula for min-max normalization is:

$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$

Where: • $X$ is the original data • $X_\{min\}$ and $X_\{max\}$ are the minimum and maximum values in the dataset, respectively • $X'$ is the normalized data

Use-Case

Normalization is useful when the data is not bounded and you want to convert it into a bounded interval. It is highly beneficial when the model does not assume any nature or distribution of the data, such as K-nearest neighbors and neural networks.

Example

If you have a feature price ranging from $30,000 to$ 120,000, normalizing this feature will scale it between 0 and 1.

Robust Scaler

Robust scaling involves scaling features using statistics that are robust to outliers. Unlike standard normalization, it uses the median and the interquartile range (IQR).

Technical Explanation

The formula for robust scaling is:

$X' = \frac{X - Q_{1}(X)}{IQR(X)}$

Where: • $Q_{1}(X)$ is the 25th percentile of data • $IQR(X) = Q_{3}(X) - Q_{1}(X)$ , where $Q_{3}(X)$ is the 75th percentile • $X'$ is the robust scaled data

Use-Case

Robust scaling is especially useful in datasets with many outliers. Models like tree-based algorithms might not need feature scaling, but using robust scaling can still improve gradient-based optimization convergence in others.

Example

For a feature income where most of the data is concentrated around the median and has large outliers, robust scaling helps center the data around the median with scales less influenced by extreme values.

Key Differences Summarized

Below is a table summarizing the key characteristics, benefits, and typical usage scenarios of each method:

Feature	Standardization	Normalization	Robust Scaling
Formula	`$\frac\{X - \mu\}\{\sigma\}`$	$`\frac\{X - X_\{min\}\}\{X_\{max\} - X_\{min\}\}`$	$`\frac\{X - Q_\{1\}(X)\}\{IQR(X)\}$`
Centroid	Zero mean	No standard centroid	Median-centered
Scale	Unit variance	0 to 1 for bounded normalization	Median and IQR influenced
Sensitivity	Sensitive to outliers	Sensitive to outliers	Less sensitive to outliers
Use Cases	SVM, PCA, regression algorithms	KNN, neural networks	Data with heavy outliers

Conclusion

Choosing the right preprocessing technique is critical for model performance and convergence. Understanding your data's distribution and outlier sensitivity is key to selecting between standardization, normalization, and robust scaling. Always remember to test different scaling methods to ascertain which one offers the best improvement for your specific application.