
How to detect points that are drastically different from their neighbours


Introduction

Detecting outliers or anomalous points in a dataset is a fundamental task in data analysis, statistics, and machine learning. These are points that are significantly different from the rest of the data, potentially indicating errors, novel information, or rare events. This article explores various methods to detect such outliers, explains the math behind them, and provides practical guidance on when to use each approach.

What Are Outliers and Why Do They Matter?

Outliers are observations that deviate so much from other observations that they appear to be generated by a different mechanism. Detecting outliers is crucial because they can skew statistical estimates, lead to inaccurate models, and bias the results of hypothesis tests. On the other hand, outliers sometimes represent the most interesting data points, such as fraudulent transactions or equipment failures.

Statistical Methods

Z-Score Analysis

The Z-score measures how many standard deviations a data point is from the mean:

$$Z = \frac{X - \mu}{\sigma}$$

where $X$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation. A data point with $|Z| > 3$ is typically flagged as an outlier, since under a normal distribution, only about 0.3% of data falls beyond 3 standard deviations from the mean.

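Limitation: The Z-score assumes the data follows a normal distribution. It is also sensitive to the outliers it is trying to find: extreme values pull the mean towards themselves and inflate the standard deviation, so a few large outliers can shrink their own Z-scores and go undetected.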
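
As a minimal sketch of the rule above, the following uses only Python's standard library. The function name `zscore_outliers`, the sample data, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions made for illustration; the threshold of 3 comes from the text.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every point with |Z| above the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    flagged = []
    for i, x in enumerate(values):
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, x, z))
    return flagged

# Illustrative data: 19 readings near 10 and one extreme reading of 50.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
flagged = zscore_outliers(readings)
for index, value, z in flagged:
    print(f"index={index} value={value} z={z:.2f}")
```

Note that with very small samples no point can reach $|Z| > 3$ at all, because a single extreme value inflates $\sigma$ along with its own deviation, which is one reason the sample here has 20 points.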

Modified Z-Score

The modified Z-score replaces the mean with the median and the standard deviation with the Median Absolute Deviation (MAD), making it robust to outliers:

$$M = \frac{0.6745 \cdot (X - \tilde{X})}{\text{MAD}}$$

where $\tilde{X}$ is the median of the dataset and $\text{MAD} = \text{median}(|X_i - \tilde{X}|)$. The constant 0.6745 makes the MAD consistent with the standard deviation for normal distributions. Points with $|M| > 3.5$ are considered outliers.
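
The modified Z-score can be sketched with the standard library as follows. The function name `modified_zscores`, the sample data, and the decision to raise an error when the MAD is zero are assumptions for illustration; the constant 0.6745 and the 3.5 cutoff are the ones given above.

```python
import statistics

def modified_zscores(values):
    """Compute 0.6745 * (x - median) / MAD for every value."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # More than half of the values equal the median; the score is undefined.
        raise ValueError("MAD is zero, modified Z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

# Illustrative data: nine readings near 10 and two extreme readings.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50, 55]
scores = modified_zscores(readings)
outlier_indices = [i for i, m in enumerate(scores) if abs(m) > 3.5]
print(outlier_indices)
```

On this particular sample the two extreme readings inflate the ordinary standard deviation enough that neither reaches $|Z| > 3$ under the plain Z-score, while the median and MAD are unaffected by them, so both are flagged here.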

IQR Method

The interquartile range method defines outliers as points falling below $Q_1 - 1.5 \cdot \text{IQR}$ or above $Q_3 + 1.5 \cdot \text{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles and $\text{IQR} = Q_3 - Q_1$. This method makes no distributional assumptions and is the basis for box plot whiskers.
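
A short sketch of the IQR fences, assuming Python 3.8+ for `statistics.quantiles`. Quartile definitions differ between tools; the `"inclusive"` method used here is one reasonable choice rather than the only one, and the function name `iqr_outliers` and the sample data are illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the fences and the values that fall outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50]
lower, upper, outliers = iqr_outliers(readings)
print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```

The multiplier `k` is exposed as a parameter because 1.5 is a convention; a larger value such as 3 is sometimes used to flag only the most extreme points.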

Clustering-Based Methods

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points into clusters based on density. Points that do not belong to any cluster are labeled as noise, which serves as an outlier detection mechanism. Unlike K-means, DBSCAN can find arbitrarily shaped clusters and does not require specifying the number of clusters in advance.

Key parameters:

  • $\epsilon$: the maximum distance between two points for them to be considered neighbors
  • $\text{minPts}$: the minimum number of points required to form a dense region
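
To make the roles of the two parameters concrete, here is a deliberately small, brute-force sketch of the DBSCAN idea in pure Python (Python 3.8+ for `math.dist`). It is not an optimized or reference implementation: neighbor search is quadratic, and it assumes that a point counts as its own neighbor when comparing against `min_pts`. In practice a library implementation such as scikit-learn's `DBSCAN` would normally be used.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    def neighbors(i):
        # Brute-force region query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Two tight groups of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point has no neighbors within $\epsilon$ other than itself, so it never joins a dense region and keeps the noise label, which is what makes DBSCAN usable as an outlier detector.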

Machine Learning Methods

Isolation Forest

Isolation Forest works on the principle that outliers are easier to isolate than normal points. It builds random binary trees by selecting a random feature and a random split value at each node. Anomalies require fewer splits to be isolated, leading to shorter average path lengths. The anomaly score is derived from the average path length across all trees:

$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$

where $E[h(x)]$ is the average path length for point $x$ and $c(n)$ is the average path length in an unsuccessful search in a binary search tree with $n$ nodes.
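
The scoring formula can be illustrated with a compact pure-Python sketch. Several details below are assumptions not stated in the text: $c(n)$ is computed as $2H(n-1) - 2(n-1)/n$ with the harmonic number approximated by $\ln(n-1)$ plus the Euler-Mascheroni constant, trees are capped at depth $\lceil \log_2(\text{sample size}) \rceil$, and the defaults of 100 trees and a sample size of 256 are illustrative. The sketch also assumes at least two data points. A library implementation such as scikit-learn's `IsolationForest` is the usual choice for real work.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return ("node", feature, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(x, tree, depth=0):
    """Depth at which x lands, plus c(size) for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + c(tree[1])
    _, feature, split, left, right = tree
    child = left if x[feature] < split else right
    return path_length(x, child, depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c(sample_size)
    return [2.0 ** (-(sum(path_length(x, t) for t in trees) / n_trees) / norm)
            for x in data]

# 200 synthetic points around the origin plus one far-away point.
gen = random.Random(42)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)] + [(8.0, 8.0)]
scores = anomaly_scores(data)
outlier_score = scores[-1]
print(f"outlier score: {outlier_score:.3f}, median score: {sorted(scores)[100]:.3f}")
```

Scores close to 1 correspond to short average paths (likely anomalies), while scores around 0.5 or below correspond to points that are about as hard to isolate as a typical one.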

Local Outlier Factor (LOF)

LOF compares the local density of a point to the densities of its neighbors. A point with a substantially lower density than its neighbors is considered an outlier. LOF is particularly effective for datasets where outliers exist in low-density regions adjacent to high-density clusters.
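
The text does not spell out how LOF is computed, so the sketch below follows the commonly used definition: the reachability distance from $p$ to a neighbor $o$ is the larger of their actual distance and the distance from $o$ to its own $k$-th nearest neighbor; the local reachability density of $p$ is the inverse of its mean reachability distance; and LOF is the average ratio of the neighbors' densities to the point's own. The brute-force search, the choice to take exactly $k$ neighbors when distances tie, and the assumption that there are no duplicate points are simplifications for illustration. scikit-learn's `LocalOutlierFactor` is a production alternative.

```python
import math

def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force)."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    knn = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        knn.append(dists[:k])
    k_dist = [nb[-1][0] for nb in knn]  # distance to the k-th neighbour

    def local_reachability_density(i):
        reach = [max(k_dist[j], d) for d, j in knn[i]]
        return len(reach) / sum(reach)  # assumes no duplicate points

    lrd = [local_reachability_density(i) for i in range(n)]
    return [sum(lrd[j] for _, j in knn[i]) / (k * lrd[i]) for i in range(n)]

# A 3x3 grid of points plus one point sitting off to the side.
points = [(x, y) for x in range(3) for y in range(3)] + [(5, 5)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Values near 1 mean a point is about as dense as its neighbors; the point at (5, 5) scores well above 1 because its nearest neighbors sit in a much denser region than it does.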

Detecting Outliers in Multi-Dimensional Data

In high-dimensional data, outliers may not be apparent along any single dimension. Specialized techniques include:

  • PCA (Principal Component Analysis): Project data onto principal components. Outliers often stand out in the directions of least variance (the last principal components).
  • Mahalanobis Distance: Measures the distance of a point from the distribution center while accounting for correlations between variables: $D_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$, where $\Sigma$ is the covariance matrix.
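
The Mahalanobis distance is easiest to see in two dimensions, where the covariance matrix can be inverted by hand. The sketch below is restricted to 2D for that reason; the function name `mahalanobis_2d`, the sample data, and the use of the sample covariance (dividing by $n - 1$) are illustrative assumptions. For higher dimensions a linear algebra library would be the natural tool.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse(cov) * (dx, dy)^T, using the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances

# Eight points close to the line y = x, plus (3, 7), which breaks the pattern.
points = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2),
          (5, 4.8), (6, 6.1), (7, 6.9), (8, 8.0),
          (3, 7.0)]
distances = mahalanobis_2d(points)
most_unusual = max(range(len(points)), key=lambda i: distances[i])
print(points[most_unusual], round(distances[most_unusual], 2))
```

The point (3, 7) is not extreme in either coordinate taken alone, since both 3 and 7 lie inside the range of the other points, yet it has the largest Mahalanobis distance because it violates the strong correlation between the two variables.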

Handling Outliers

Once detected, you need to decide how to handle outliers:

  • Remove: If outliers are clearly errors (data entry mistakes, sensor glitches), removal is appropriate.
  • Transform: Log transformations or winsorization can reduce the influence of extreme values without removing them.
  • Use robust methods: Tree-based models such as decision trees and random forests, as well as robust regression, are less sensitive to extreme values and can often be used without removing outliers first.
  • Investigate: Sometimes outliers are the most valuable data points. In fraud detection, they are exactly what you are looking for.
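
As an illustration of the "Transform" option, winsorization can be sketched as clipping values to chosen percentiles. The 5th and 95th percentile limits, the function name `winsorize`, and the `"inclusive"` quantile method are assumptions for this example; the appropriate limits depend on the data.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values below/above the given percentiles to those percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in values]

values = list(range(1, 20)) + [1000]  # 1..19 plus one extreme value
clipped = winsorize(values)
print(max(values), "->", max(clipped))
```

Unlike removal, this keeps the number of observations unchanged and only pulls the tails inward, so the extreme value still counts as "large" but no longer dominates the mean or variance.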

Method Comparison

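| Method           | Assumptions             | Best For                        | Handles High Dimensions |
|------------------|-------------------------|---------------------------------|-------------------------|
| Z-Score          | Normal distribution     | Simple univariate data          | No                      |
| Modified Z-Score | None (robust)           | Skewed or contaminated data     | No                      |
| IQR              | None                    | General univariate data         | No                      |
| DBSCAN           | Density-based clusters  | Spatial data with noise         | Yes                     |
| Isolation Forest | None                    | General-purpose, large datasets | Yes                     |
| LOF              | Local density variation | Clusters of varying density     | Yes                     |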

Summary

Detecting points that are drastically different from their neighbors requires choosing the right method for your data characteristics. For univariate data, the modified Z-score or IQR method provides robust detection. For multivariate or high-dimensional data, Isolation Forest and DBSCAN are strong general-purpose choices. Always consider whether outliers represent errors to remove or signals to investigate, because the right treatment depends entirely on the domain context.

