How to detect points which are drastically different than their neighbours

anomaly detection

data analysis

outlier detection

statistical methods

pattern recognition

How to detect points which are drastically different than their neighbours

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Detecting outliers or anomalous points in a dataset is a fundamental task in data analysis, statistics, and machine learning. These are points that are significantly different from the rest of the data, potentially indicating errors, novel information, or rare events. This article explores various methods to detect such outliers, explains the math behind them, and provides practical guidance on when to use each approach.

What Are Outliers and Why Do They Matter?

Outliers are observations that deviate so much from other observations that they appear to be generated by a different mechanism. Detecting outliers is crucial because they can skew statistical estimates, lead to inaccurate models, and bias the results of hypothesis tests. On the other hand, outliers sometimes represent the most interesting data points, such as fraudulent transactions or equipment failures.

Statistical Methods

Z-Score Analysis

The Z-score measures how many standard deviations a data point is from the mean:

$Z = \frac{X - \mu}{\sigma}$

where $X$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation. A data point with $|Z| > 3$ is typically flagged as an outlier, since under a normal distribution, only about 0.3% of data falls beyond 3 standard deviations from the mean.

Limitation: The Z-score assumes the data follows a normal distribution. It is also sensitive to outliers themselves, since the mean and standard deviation are affected by extreme values.

Modified Z-Score

The modified Z-score replaces the mean with the median and the standard deviation with the Median Absolute Deviation (MAD), making it robust to outliers:

$M = \frac{0.6745 \cdot (X - \tilde{X})}{\text{MAD}}$

where $\tilde{X}$ is the median of the dataset and $\text{MAD} = \text{median}(|X_i - \tilde{X}|)$ . The constant 0.6745 makes the MAD consistent with the standard deviation for normal distributions. Points with $|M| > 3.5$ are considered outliers.

IQR Method

The interquartile range method defines outliers as points falling below $Q_1 - 1.5 \cdot \text{IQR}$ or above $Q_3 + 1.5 \cdot \text{IQR}$ , where $Q_1$ and $Q_3$ are the first and third quartiles and $\text{IQR} = Q_3 - Q_1$ . This method makes no distributional assumptions and is the basis for box plot whiskers.

Clustering-Based Methods

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points into clusters based on density. Points that do not belong to any cluster are labeled as noise, which serves as an outlier detection mechanism. Unlike K-means, DBSCAN can find arbitrarily shaped clusters and does not require specifying the number of clusters in advance.

Key parameters:

$\epsilon$ : the maximum distance between two points for them to be considered neighbors
$\text{minPts}$ : the minimum number of points required to form a dense region

Machine Learning Methods

Isolation Forest

Isolation Forest works on the principle that outliers are easier to isolate than normal points. It builds random binary trees by selecting a random feature and a random split value at each node. Anomalies require fewer splits to be isolated, leading to shorter average path lengths. The anomaly score is derived from the average path length across all trees:

$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$

where $E[h(x)]$ is the average path length for point $x$ and $c(n)$ is the average path length in an unsuccessful search in a binary search tree with $n$ nodes.

Local Outlier Factor (LOF)

LOF compares the local density of a point to the densities of its neighbors. A point with a substantially lower density than its neighbors is considered an outlier. LOF is particularly effective for datasets where outliers exist in low-density regions adjacent to high-density clusters.

Detecting Outliers in Multi-Dimensional Data

In high-dimensional data, outliers may not be apparent along any single dimension. Specialized techniques include:

PCA (Principal Component Analysis): Project data onto principal components. Outliers often stand out in the directions of least variance (the last principal components).
Mahalanobis Distance: Measures the distance of a point from the distribution center while accounting for correlations between variables: $D_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$ , where $\Sigma$ is the covariance matrix.

Handling Outliers

Once detected, you need to decide how to handle outliers:

Remove: If outliers are clearly errors (data entry mistakes, sensor glitches), removal is appropriate.
Transform: Log transformations or winsorization can reduce the influence of extreme values without removing them.
Use robust methods: Models like decision trees, random forests, and robust regression handle outliers naturally without requiring removal.
Investigate: Sometimes outliers are the most valuable data points. In fraud detection, they are exactly what you are looking for.

Method Comparison

Method	Assumptions	Best For	Handles High Dimensions
Z-Score	Normal distribution	Simple univariate data	No
Modified Z-Score	None (robust)	Skewed or contaminated data	No
IQR	None	General univariate data	No
DBSCAN	Density-based clusters	Spatial data with noise	Yes
Isolation Forest	None	General-purpose, large datasets	Yes
LOF	Local density variation	Clusters of varying density	Yes

Summary

Detecting points that are drastically different from their neighbors requires choosing the right method for your data characteristics. For univariate data, the modified Z-score or IQR method provides robust detection. For multivariate or high-dimensional data, Isolation Forest and DBSCAN are strong general-purpose choices. Always consider whether outliers represent errors to remove or signals to investigate, because the right treatment depends entirely on the domain context.