How to detect points which are drastically different than their neighbours
Introduction
Detecting outliers or anomalous points in a dataset is a fundamental task in data analysis, statistics, and machine learning. These are points that are significantly different from the rest of the data, potentially indicating errors, novel information, or rare events. This article explores various methods to detect such outliers, explains the math behind them, and provides practical guidance on when to use each approach.
What Are Outliers and Why Do They Matter?
Outliers are observations that deviate so much from other observations that they appear to be generated by a different mechanism. Detecting outliers is crucial because they can skew statistical estimates, lead to inaccurate models, and bias the results of hypothesis tests. On the other hand, outliers sometimes represent the most interesting data points, such as fraudulent transactions or equipment failures.
Statistical Methods
Z-Score Analysis
The Z-score measures how many standard deviations a data point is from the mean:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is the data point, $\mu$ is the mean, and $\sigma$ is the standard deviation. A data point with $|z| > 3$ is typically flagged as an outlier, since under a normal distribution, only about 0.3% of data falls beyond 3 standard deviations from the mean.
Limitation: The Z-score assumes the data follows a normal distribution. It is also sensitive to outliers themselves, since the mean and standard deviation are affected by extreme values.
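As a minimal sketch of the Z-score rule, the snippet below uses only Python's standard library. The function name `zscore_outliers`, the sample `readings`, and the choice of the population standard deviation are illustrative assumptions, not part of any particular library.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose |z| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return []  # all values identical, so nothing stands out
    scored = [(x, (x - mean) / std) for x in values]
    return [(x, z) for x, z in scored if abs(z) > threshold]

# Hypothetical sample: twenty readings near 10 and one extreme value
readings = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [25.0]
flagged = zscore_outliers(readings)
print(flagged)  # only the 25.0 reading is flagged
```

Note that with the population standard deviation, no point in a sample of size $n$ can have $|z|$ larger than $\sqrt{n - 1}$, so a threshold of 3 can never be exceeded with 10 or fewer points.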
Modified Z-Score
The modified Z-score replaces the mean with the median and the standard deviation with the Median Absolute Deviation (MAD), making it robust to outliers:
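$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}$$
where $\tilde{x}$ is the median of the dataset and $\mathrm{MAD} = \operatorname{median}(|x_i - \tilde{x}|)$. The constant 0.6745 makes the MAD consistent with the standard deviation for normal distributions. Points with $|M_i| > 3.5$ are commonly considered outliers.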
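The following sketch computes modified Z-scores with the standard library. The cutoff of 3.5 is the commonly cited convention and is used here as an assumption; `modified_zscores` and the sample `data` are hypothetical names and values chosen for illustration.

```python
import statistics

def modified_zscores(values):
    """Modified Z-score of each value, based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # More than half of the values equal the median; the score is undefined
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (x - med) / mad for x in values]

# Hypothetical sample with two extreme values
data = [10.0, 10.2, 9.8, 10.1, 9.9, 25.0, 30.0]
scores = modified_zscores(data)
outliers = [x for x, m in zip(data, scores) if abs(m) > 3.5]
print(outliers)  # [25.0, 30.0]
```

On this sample the plain Z-scores of 25.0 and 30.0 both stay below 2, because the two extreme values inflate the mean and standard deviation. The median and MAD are barely affected, so both points are still flagged.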
IQR Method
The interquartile range method defines outliers as points falling below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$, where $Q_1$ and $Q_3$ are the first and third quartiles and $\mathrm{IQR} = Q_3 - Q_1$. This method makes no distributional assumptions and is the basis for box plot whiskers.
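A short sketch of the IQR rule is shown below. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, and the function name `iqr_outliers` and the sample values are illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in values if x < lower or x > upper]
    return flagged, (lower, upper)

# Hypothetical sample: Q1 = 4, Q3 = 8, so the fences are -2 and 14
sample = [2, 4, 4, 5, 6, 7, 8, 9, 40]
flagged, fences = iqr_outliers(sample)
print(flagged, fences)  # [40] (-2.0, 14.0)
```

The multiplier `k` is exposed as a parameter because some analyses use a wider fence, such as 3, to flag only extreme outliers.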
Clustering-Based Methods
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points into clusters based on density. Points that do not belong to any cluster are labeled as noise, which serves as an outlier detection mechanism. Unlike K-means, DBSCAN can find arbitrarily shaped clusters and does not require specifying the number of clusters in advance.
Key parameters:
- $\varepsilon$ (eps): the maximum distance between two points for them to be considered neighbors
- MinPts: the minimum number of points required to form a dense region
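To make the roles of the two parameters concrete, here is a deliberately small, quadratic-time sketch of DBSCAN in plain Python. It is meant for illustration rather than production use; the names `dbscan`, `eps`, `min_pts`, and the sample points are assumptions of this example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not dense
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels

# Hypothetical data: two tight squares and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point has no neighbors within `eps`, so it never joins a cluster and keeps the noise label, which is what makes DBSCAN usable as an outlier detector.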
Machine Learning Methods
Isolation Forest
Isolation Forest works on the principle that outliers are easier to isolate than normal points. It builds random binary trees by selecting a random feature and a random split value at each node. Anomalies require fewer splits to be isolated, leading to shorter average path lengths. The anomaly score is derived from the average path length across all trees:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
where $E(h(x))$ is the average path length for point $x$ and $c(n)$ is the average path length in an unsuccessful search in a binary search tree with $n$ nodes.
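The sketch below is a simplified, standard-library version of this idea, intended only to show how path lengths turn into scores. It makes several assumptions: a node becomes a leaf when the randomly chosen feature is constant, $c(n)$ uses the approximation $H(i) \approx \ln(i) + 0.5772$, and all names (`build_tree`, `path_length`, `anomaly_scores`) and data are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n nodes."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(sample, rng, depth, max_depth):
    """Recursively split the sample on a random feature and split value."""
    if len(sample) <= 1 or depth >= max_depth:
        return {"size": len(sample)}
    dim = rng.randrange(len(sample[0]))
    lo = min(p[dim] for p in sample)
    hi = max(p[dim] for p in sample)
    if lo == hi:
        return {"size": len(sample)}  # simplification: stop on a constant feature
    split = rng.uniform(lo, hi)
    left = [p for p in sample if p[dim] < split]
    right = [p for p in sample if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, adjusted for the unbuilt subtree."""
    if "split" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score each point (at least two required); values near 1 suggest anomalies."""
    rng = random.Random(seed)
    psi = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(points, psi), rng, 0, max_depth)
             for _ in range(n_trees)]
    scores = []
    for x in points:
        mean_h = sum(path_length(x, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_h / c(psi)))
    return scores

# Hypothetical data: a small dense grid plus one far-away point
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(4)] + [(10.0, 10.0)]
scores = anomaly_scores(points)
print(max(range(len(points)), key=scores.__getitem__))  # index of the far point
```

The far-away point is usually separated by the very first split, so its average path length is close to 1 and its score is the highest in the sample.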
Local Outlier Factor (LOF)
LOF compares the local density of a point to the densities of its neighbors. A point with a substantially lower density than its neighbors is considered an outlier. LOF is particularly effective for datasets where outliers exist in low-density regions adjacent to high-density clusters.
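The density comparison can be sketched directly from the usual LOF definitions (k-distance, reachability distance, and local reachability density). The version below is a minimal illustration that assumes there are no duplicate points and ignores ties at the k-th neighbor; the function name `lof_scores` and the sample data are hypothetical.

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor of each point; values well above 1 suggest outliers."""
    n = len(points)
    # For every point, the other points sorted by distance
    dists = [sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
             for i, p in enumerate(points)]
    k_dist = [d[k - 1][0] for d in dists]       # distance to the k-th neighbor
    knn = [[j for _, j in d[:k]] for d in dists]  # indices of the k nearest neighbors

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j
        return max(k_dist[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance
    lrd = [k / sum(reach_dist(i, j) for j in knn[i]) for i in range(n)]
    # LOF: mean density of the neighbors relative to the point's own density
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: two tight squares and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
scores = lof_scores(points, k=3)
print([round(s, 2) for s in scores])
```

Points inside the two squares score approximately 1 because their density matches that of their neighbors, while the isolated point scores far above 1.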
Detecting Outliers in Multi-Dimensional Data
In high-dimensional data, outliers may not be apparent along any single dimension. Specialized techniques include:
- PCA (Principal Component Analysis): Project data onto principal components. Outliers often stand out in the directions of least variance (the last principal components).
- Mahalanobis Distance: Measures the distance of a point from the distribution center while accounting for correlations between variables: $D_M(x) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.
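As an illustration of the Mahalanobis distance, the sketch below handles the two-dimensional case with the standard library, inverting the 2x2 sample covariance matrix in closed form. The function name `mahalanobis_2d` and the data are assumptions for this example; for more dimensions a linear algebra library would be the practical choice.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) * inverse covariance * (dx, dy)^T, written out for the 2x2 case
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical data: ten points close to the line y = x, plus (3, 8),
# whose coordinates are individually unremarkable but break the correlation
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1),
          (6, 5.8), (7, 7.2), (8, 7.9), (9, 9.1), (10, 9.9),
          (3, 8.0)]
distances = mahalanobis_2d(points)
print(max(range(len(points)), key=distances.__getitem__))  # 10, the point (3, 8.0)
```

Neither coordinate of (3, 8.0) is extreme on its own, so a per-dimension Z-score or IQR check would miss it. The Mahalanobis distance flags it because it lies far from the main direction of correlation.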
Handling Outliers
Once detected, you need to decide how to handle outliers:
- Remove: If outliers are clearly errors (data entry mistakes, sensor glitches), removal is appropriate.
- Transform: Log transformations or winsorization can reduce the influence of extreme values without removing them.
- Use robust methods: Tree-based models such as decision trees and random forests, as well as robust regression, are much less sensitive to extreme values, so removal is often unnecessary.
- Investigate: Sometimes outliers are the most valuable data points. In fraud detection, they are exactly what you are looking for.
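For the transformation option in the list above, the following sketch shows percentile-based winsorization and a log transform using the standard library. The 5th and 95th percentile limits, the function name `winsorize`, and the sample values are illustrative assumptions; appropriate limits depend on the data.

```python
import math
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given lower and upper percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in values]

# Hypothetical sample: the integers 1..19 plus one extreme value
values = list(range(1, 20)) + [500]
clamped = winsorize(values)
logged = [math.log1p(x) for x in values]  # log(1 + x), valid for x > -1

print(max(values), max(clamped))  # the extreme value is pulled in sharply
print(statistics.median(values), statistics.median(clamped))  # median is unchanged
```

Both transformations keep every observation in the dataset. Winsorization caps the extremes at a percentile, while the log transform compresses large values smoothly.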
Method Comparison
| Method | Assumptions | Best For | Handles High Dimensions |
| --- | --- | --- | --- |
| Z-Score | Normal distribution | Simple univariate data | No |
| Modified Z-Score | None (robust) | Skewed or contaminated data | No |
| IQR | None | General univariate data | No |
| DBSCAN | Density-based clusters | Spatial data with noise | Yes |
| Isolation Forest | None | General-purpose, large datasets | Yes |
| LOF | Local density variation | Clusters of varying density | Yes |
Summary
Detecting points that are drastically different from their neighbors requires choosing the right method for your data characteristics. For univariate data, the modified Z-score or IQR method provides robust detection. For multivariate or high-dimensional data, Isolation Forest and DBSCAN are strong general-purpose choices. Always consider whether outliers represent errors to remove or signals to investigate, because the right treatment depends entirely on the domain context.

