Anomaly detection using Python
Introduction
Anomaly detection is the task of identifying observations that differ strongly from the normal pattern of a dataset. It is widely used in fraud detection, infrastructure monitoring, quality control, and security because the rare cases are often the ones that matter most.
In Python, the right tool depends on the shape of the problem. For general tabular data, scikit-learn provides several practical starting points, and Isolation Forest is one of the most common.
Types Of Anomalies
Not every anomaly looks the same. In practice, people usually talk about three broad categories:
- point anomalies: one individual record looks unusual
- contextual anomalies: the record is unusual only in a certain context such as time or season
- collective anomalies: a group of records forms an abnormal pattern together
The distinction matters because a method that works well for isolated outliers may not be enough for time-series or sequence anomalies.
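To make the contextual case concrete, here is a minimal sketch using made-up temperature readings (the data, the season labels, and the `zscores` helper are assumptions for illustration). A warm day that looks ordinary against the whole year stands out once it is compared only with other winter days:

```python
import numpy as np

# Hypothetical daily temperatures in degrees Celsius, grouped by season.
rng = np.random.default_rng(1)
summer = rng.normal(28.0, 2.0, size=90)
winter = rng.normal(2.0, 2.0, size=90)
winter[10] = 27.0  # a summer-like reading planted in the winter data

temps = np.concatenate([summer, winter])
season = np.array(["summer"] * 90 + ["winter"] * 90)


def zscores(x):
    """Standardize values to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()


# Global view: compare every reading with the full year.
global_z = zscores(temps)

# Contextual view: compare each reading only with its own season.
context_z = np.empty_like(temps)
for s in np.unique(season):
    mask = season == s
    context_z[mask] = zscores(temps[mask])

idx = 90 + 10  # position of the planted reading in the combined array
print("global z-score:       ", round(float(global_z[idx]), 2))
print("within-season z-score:", round(float(context_z[idx]), 2))
```

The global score stays small because 27 degrees is a common value over the whole year, while the within-season score is large. A point-anomaly detector run on the raw values alone would likely miss this record.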
Isolation Forest Example
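Isolation Forest builds an ensemble of random trees. Each tree recursively partitions the data by picking a random feature and a random split value. Outliers tend to be isolated in fewer splits than normal points, so their shorter average path length across the trees translates into a stronger anomaly score.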
A small runnable example:
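The sketch below assumes synthetic two-dimensional data: 200 points drawn from a standard normal distribution plus five hand-placed points far from the cluster. The data, the variable names, and the `contamination=0.03` value are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): a dense cluster
# around the origin plus five planted points far away from it.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array(
    [[8.0, 8.0], [-8.0, 7.0], [9.0, -9.0], [-7.0, -8.0], [0.0, 10.0]]
)
X = np.vstack([normal, planted])

# contamination is the expected fraction of anomalies; 0.03 is a guess.
model = IsolationForest(n_estimators=100, contamination=0.03, random_state=42)
labels = model.fit_predict(X)        # 1 = inlier, -1 = anomaly
scores = model.decision_function(X)  # lower = more anomalous

print("rows flagged as anomalies:", np.flatnonzero(labels == -1))
print("five lowest-scoring rows: ", np.argsort(scores)[:5])
```

The planted points occupy rows 200 to 204, so those indices are the ones to look for in the output.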
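This gives you a simple unsupervised baseline. The contamination parameter tells the model roughly what fraction of the data you expect to be anomalous. In scikit-learn it only sets the score threshold used to assign the inlier and anomaly labels; it does not change how the points are ranked.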
Why Scaling And Cleaning Matter
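Many anomaly detection methods are sensitive to feature scale. If one feature has values around 1 and another has values around 100000, distance-based and density-based methods such as LocalOutlierFactor and OneClassSVM can be dominated by the large-scale feature. Tree-based methods such as Isolation Forest are much less affected, because each split considers one feature at a time.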
A typical preprocessing step looks like this:
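A minimal sketch with scikit-learn's `StandardScaler`, assuming a tiny made-up array whose two columns differ in scale by about five orders of magnitude:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is around 1, column 1 is around 100000.
X = np.array(
    [
        [1.2, 98000.0],
        [0.8, 102000.0],
        [1.0, 100000.0],
        [1.1, 250000.0],
    ]
)

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("column means after scaling:", X_scaled.mean(axis=0))
print("column stds after scaling: ", X_scaled.std(axis=0))
```

After scaling, both columns contribute on a comparable footing to any distance computation. `RobustScaler`, which centers on the median and scales by the interquartile range, is an alternative when extreme values would distort the mean and standard deviation.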
Missing values also matter. If the input contains NaN values, many models will fail or behave unpredictably unless you impute or remove those rows first.
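Both options can be sketched in a few lines. The array below is invented for illustration, and median imputation is just one reasonable default rather than the right choice for every dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column.
X = np.array(
    [
        [1.0, 10.0],
        [np.nan, 12.0],
        [3.0, np.nan],
        [2.0, 11.0],
    ]
)

# Option 1: fill each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# Option 2: drop every row that contains at least one NaN.
X_dropped = X[~np.isnan(X).any(axis=1)]

print(X_imputed)
print(X_dropped)
```

Dropping rows is simpler but discards data. Imputing keeps every row, at the cost of inserting values that were never observed.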
Other Useful Methods
Isolation Forest is not the only option.
LocalOutlierFactor is useful when anomalies are defined by local density rather than global rarity. OneClassSVM can work for some boundary-learning problems but can be sensitive to parameter choice and scale. Simple statistical methods such as z-scores can still be effective when the data distribution is well understood.
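As a rough illustration of two of these alternatives, the sketch below runs `LocalOutlierFactor` on a synthetic cluster with one planted far-away point, then applies a plain z-score rule to a synthetic one-dimensional sample. The data, `n_neighbors=20`, and the cutoff of 3 standard deviations are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# --- Local Outlier Factor on 2-D data ---
cluster = rng.normal(0.0, 1.0, size=(100, 2))
X = np.vstack([cluster, [[8.0, 8.0]]])  # row 100 is the planted point

lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)            # 1 = inlier, -1 = outlier
lof_scores = lof.negative_outlier_factor_  # lower = more anomalous

print("LOF most anomalous row:", int(np.argmin(lof_scores)))

# --- z-score rule on 1-D data ---
values = np.append(rng.normal(50.0, 5.0, size=200), 120.0)  # last value planted
z = (values - values.mean()) / values.std()
z_flagged = np.flatnonzero(np.abs(z) > 3)

print("z-score flagged rows:", z_flagged)
```

The z-score rule assumes the bulk of the data is roughly bell-shaped. On skewed or multi-modal data its flags are much harder to interpret.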
The best model depends on whether your data is tabular, temporal, dense, sparse, high-dimensional, or heavily imbalanced.
Evaluating Anomaly Detection
Evaluation is tricky because anomalies are rare and labels are often incomplete. If you do have labeled anomalies, precision and recall are more informative than overall accuracy, since accuracy can look high even when the detector misses the rare cases that matter.
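The gap between accuracy and the other two metrics is easy to see with a small hand-built example. The labels below are invented: 95 normal records and 5 anomalies, scored against a trivial "everything is normal" predictor and a hypothetical imperfect detector.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Ground truth: 95 normal records (0) followed by 5 anomalies (1).
y_true = [0] * 95 + [1] * 5

# Predictor A: marks everything as normal.
y_all_normal = [0] * 100

# Predictor B (hypothetical): 2 false alarms, 3 of the 5 anomalies caught.
y_detector = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2

for name, y_pred in [("all normal", y_all_normal), ("detector", y_detector)]:
    acc = accuracy_score(y_true, y_pred)
    # zero_division=0 avoids a warning when nothing is predicted positive.
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    print(f"{name:>10}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

The trivial predictor reaches 0.95 accuracy with zero recall. The detector scores 0.96 accuracy, almost the same, yet its precision and recall of 0.60 show that it finds most of the anomalies.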
If you do not have labels, evaluation often becomes domain-driven. You inspect top-ranked anomalies and check whether they are genuinely interesting or just artifacts of bad preprocessing.
Common Pitfalls
The biggest mistake is skipping preprocessing. Scaling, missing-value handling, and feature engineering often matter as much as the model choice.
Another pitfall is setting the anomaly fraction arbitrarily and treating the output as ground truth. Most anomaly detectors rank unusual points; the final threshold still needs human or domain judgment.
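One way to keep the threshold an explicit decision is to work with the continuous scores and rank the points, instead of relying on the built-in labels. The sketch below assumes synthetic data, a hypothetical review budget of 10 records, and an arbitrary 99th-percentile cutoff; in practice both numbers would come from domain knowledge.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption): 300 normal points plus 3 planted ones.
rng = np.random.default_rng(7)
X = np.vstack(
    [
        rng.normal(0.0, 1.0, size=(300, 2)),
        np.array([[7.0, 7.0], [-8.0, 6.0], [9.0, -7.0]]),
    ]
)

model = IsolationForest(contamination="auto", random_state=7).fit(X)

# score_samples returns lower values for more anomalous points;
# negate it so that a higher number means "more unusual".
anomaly_score = -model.score_samples(X)

# Option 1: rank everything and review only as many rows as you can afford.
review_budget = 10  # hypothetical capacity of the review team
to_review = np.argsort(anomaly_score)[::-1][:review_budget]

# Option 2: choose an explicit cutoff on the score distribution.
threshold = np.quantile(anomaly_score, 0.99)
flagged = np.flatnonzero(anomaly_score >= threshold)

print("rows to review first:", to_review)
print("rows above the cutoff:", flagged)
```

Either way, the cutoff is a visible parameter that can be revisited as reviewers give feedback, not a value buried inside the model.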
A third issue is evaluating with accuracy on highly imbalanced data. A detector that marks everything as normal can still score well on accuracy while being useless in practice.
Summary
- Anomaly detection looks for rare observations that differ from normal behavior.
- Isolation Forest is a strong practical starting point for many tabular datasets.
- Preprocessing such as scaling and missing-value handling is critical.
- Different anomaly types may require different algorithms.
- Evaluate with domain-aware metrics and thresholds, not just overall accuracy.

