Fraud Detection and Anomaly Systems

Anomaly Detection Approaches

Every payment platform, marketplace, and financial service faces the same problem: a tiny fraction of transactions are fraudulent, and you cannot label every single one. In a dataset of 100,000 transactions, perhaps 100 are fraudulent. You cannot hire analysts to review every transaction, and you cannot wait for chargebacks to arrive weeks later. You need methods that identify outliers automatically, often without any labeled examples at all.

This is the domain of anomaly detection: finding the data points that do not belong. The core insight is that fraudulent transactions look different from legitimate ones in measurable ways. A customer who normally spends $50 on groceries suddenly purchases $3,000 of electronics in a foreign country. An account that has been dormant for six months suddenly initiates 15 wire transfers in an hour. These deviations from normal behavior are the signal that anomaly detection systems learn to find.

Key Insight

The fundamental challenge of fraud detection is class imbalance: in a dataset of 100,000 transactions, only 100 might be fraudulent. A model that predicts every transaction as legitimate achieves 99.9% accuracy while catching zero fraud. This is why precision-recall and cost-sensitive metrics matter more than accuracy.
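The accuracy trap described above can be checked with a few lines of arithmetic. This is a minimal sketch on a synthetic label list that matches the numbers in the text (100 fraud cases in 100,000 transactions); the "model" is simply a constant prediction.

```python
# Synthetic labels matching the text: 100 fraud cases in 100,000 transactions.
labels = [1] * 100 + [0] * 99_900        # 1 = fraud, 0 = legitimate
predictions = [0] * len(labels)          # a "model" that always predicts legitimate

true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
false_neg = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
correct = sum(1 for y, p in zip(labels, predictions) if y == p)

accuracy = correct / len(labels)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.3f} recall={recall:.3f}")  # accuracy=0.999 recall=0.000
```

Accuracy looks excellent while recall, the share of fraud actually caught, is zero.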

Statistical Methods

The simplest anomaly detection methods use statistics. Z-score measures how many standard deviations a value falls from the mean. If the average transaction amount for a user is $45 with a standard deviation of $20, a $200 transaction has a z-score of 7.75. Anything beyond 3 standard deviations is flagged as anomalous. This works well for single-variable outliers but misses multi-dimensional patterns.
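As a quick check of the arithmetic above, here is a minimal sketch of the z-score rule. The function name and the cut-off constant are illustrative choices, not part of any particular library.

```python
Z_THRESHOLD = 3.0  # flag anything beyond 3 standard deviations, as in the text

def z_score(value: float, mean: float, std: float) -> float:
    """How many standard deviations `value` lies from `mean`."""
    return (value - mean) / std

# The user from the example: mean $45, standard deviation $20, new transaction $200.
z = z_score(200.0, mean=45.0, std=20.0)
is_anomalous = abs(z) > Z_THRESHOLD

print(z, is_anomalous)  # 7.75 True
```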

IQR (interquartile range) identifies outliers as values more than 1.5 times the interquartile range above Q3 or below Q1. IQR is more robust to skewed distributions than z-scores because it is built from quartiles rather than the mean and standard deviation, both of which are pulled around by extreme values. Transaction amounts tend to be right-skewed (many small purchases, few large ones), making IQR a better fit for raw amount-based detection.
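The IQR rule can be sketched with the standard library. The amounts below are made-up illustrative values, and the quartiles use the default ("exclusive") method of `statistics.quantiles`; other quartile conventions give slightly different fences.

```python
from statistics import quantiles

# Hypothetical transaction amounts for one user, in dollars.
amounts = [12, 15, 18, 20, 22, 25, 30, 35, 40, 400]

q1, _median, q3 = quantiles(amounts, n=4)   # quartile cut points
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in amounts if a < lower_fence or a > upper_fence]
print(q1, q3, upper_fence, outliers)  # 17.25 36.25 64.75 [400]
```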

Statistical methods are fast, interpretable, and require no training data. Their weakness is that they operate on single features. A $200 transaction might be normal for a business account but fraudulent for a college student. You need methods that consider multiple features simultaneously.

Isolation Forests

Isolation forests detect anomalies by isolating data points through random partitioning. The algorithm randomly selects a feature and a split value, then partitions the data. Normal data points are surrounded by similar points, so they require many splits to isolate. Anomalies sit in sparse regions and are isolated in just a few splits. Averaged over many random trees, the number of splits needed to isolate a point determines its anomaly score: the shorter the average path, the more anomalous the point.

This is an elegant approach because it directly measures what makes anomalies different: they are rare and different. You do not need to define what "normal" looks like. You only need to observe that anomalies are easier to separate. Isolation forests handle high-dimensional data well, run efficiently on large datasets, and require minimal hyperparameter tuning.
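The isolation idea can be illustrated without any ML library. The sketch below is a simplified stand-in for a real isolation forest: it skips subsampling and score normalization and only measures how many random splits it takes to isolate one point. The data set, function names, and parameters are all hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth  # this feature cannot separate the remaining rows
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=100, seed=0):
    """Average isolation depth over `n_trees` random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical (amount, transactions per hour) rows: a tight cluster plus one outlier.
cluster = [(40 + i, 1 + i % 3) for i in range(20)]
outlier = (3000, 15)
data = cluster + [outlier]
normal = cluster[10]

print(mean_path_length(outlier, data), mean_path_length(normal, data))
```

The outlier is usually cut off by the very first split, while a point in the middle of the cluster needs several, which is exactly the signal an isolation forest turns into a score.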

Autoencoders

Autoencoders are neural networks trained to reconstruct their input. The network compresses the input into a lower-dimensional bottleneck layer, then expands it back to the original dimensions. When trained on legitimate transactions, the autoencoder learns to reconstruct normal patterns. When it encounters a fraudulent transaction, it cannot reconstruct it accurately because fraud patterns were not in the training data. The reconstruction error, the difference between input and output, serves as the anomaly score.

A legitimate transaction might have a reconstruction error of 0.02. A fraudulent transaction might have an error of 0.35. Setting a threshold on reconstruction error separates normal from anomalous. Autoencoders excel at capturing complex, non-linear relationships between features that statistical methods and tree-based methods miss.
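The thresholding step can be sketched on its own. No autoencoder is trained here: the "reconstructed" vectors are hard-coded stand-ins for a model's output, the feature values are invented, and the threshold is a hypothetical cut-off that would normally be chosen from the error distribution on held-out legitimate traffic.

```python
THRESHOLD = 0.10  # hypothetical cut-off on reconstruction error

def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)

# Stand-in (input, reconstruction) pairs; a real system would get these from the model.
legit_input, legit_output = [0.2, 0.5, 0.1, 0.9], [0.3, 0.4, 0.2, 0.8]
fraud_input, fraud_output = [0.9, 0.1, 0.8, 0.2], [0.3, 0.6, 0.2, 0.7]

legit_error = reconstruction_error(legit_input, legit_output)
fraud_error = reconstruction_error(fraud_input, fraud_output)

print(legit_error > THRESHOLD, fraud_error > THRESHOLD)  # False True
```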

Unsupervised vs Supervised

Unsupervised methods (isolation forests, autoencoders, statistical methods) detect anomalies without labeled fraud examples. They catch unknown patterns, novel attack vectors that no one has seen before. But they also produce more false positives because they flag anything unusual, not just fraud.

Supervised methods (gradient-boosted trees, neural networks trained on labeled fraud data) are more accurate for known fraud types. If you have 10,000 confirmed chargeback cases, a supervised model learns exactly what that fraud looks like and catches similar patterns with high precision. But it misses entirely new fraud types that were not in the training data.

The practical approach is to use both. Supervised models catch known fraud patterns with high precision. Unsupervised models flag unusual activity that might represent new fraud types, and analysts investigate these flags to generate new labels for future supervised training.

Feature Engineering for Fraud

Raw transaction data (amount, merchant, timestamp) is not enough. The most predictive features are derived ones.

Velocity features measure the rate of activity: 5 transactions in 2 minutes, 3 new payees added in 1 hour, first international transaction in 6 months. These capture behavioral shifts that individual transactions cannot.

Graph features capture relationships between entities: two accounts sharing the same shipping address, a device used by multiple accounts, a phone number linked to previous fraud. Fraudsters reuse infrastructure, and graph features expose these connections.
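One of the simplest graph features is "how many other accounts share a device with this one". A minimal sketch, assuming login events arrive as (account, device) pairs; the identifiers and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical (account_id, device_id) observations.
logins = [
    ("acct_1", "dev_A"), ("acct_2", "dev_A"), ("acct_3", "dev_B"),
    ("acct_4", "dev_A"), ("acct_3", "dev_C"),
]

accounts_by_device = defaultdict(set)
for account, device in logins:
    accounts_by_device[device].add(account)

def shared_device_count(account: str) -> int:
    """Number of other accounts seen on any device this account has used."""
    others = set()
    for accounts in accounts_by_device.values():
        if account in accounts:
            others |= accounts
    others.discard(account)
    return len(others)

print(shared_device_count("acct_1"), shared_device_count("acct_3"))  # 2 0
```

The same pattern extends to shipping addresses or phone numbers by swapping the key that accounts are grouped on.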

Device fingerprinting identifies the hardware and software profile of the device making the transaction. A transaction from a device with a different browser, OS, timezone, and language than the account's usual device is suspicious.

Geolocation anomalies flag physically impossible patterns: a card used in New York at 2:00 PM and in London at 2:30 PM. No legitimate user travels that fast.
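The impossible-travel check reduces to distance divided by elapsed time. The sketch below uses the haversine great-circle formula and assumes both timestamps have already been normalized to UTC; the speed cut-off and the record layout are hypothetical choices.

```python
from datetime import datetime, timezone
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # hypothetical cut-off, roughly airliner cruising speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def is_impossible_travel(prev, curr):
    """True if moving between two card uses would require an implausible speed."""
    hours = (curr["ts"] - prev["ts"]).total_seconds() / 3600
    distance = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    if hours <= 0:
        return distance > 0  # same instant, different places
    return distance / hours > MAX_PLAUSIBLE_KMH

new_york = {"lat": 40.7128, "lon": -74.0060,
            "ts": datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)}
london = {"lat": 51.5074, "lon": -0.1278,
          "ts": datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)}

print(is_impossible_travel(new_york, london))  # True
```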

Real-Time Scoring Architecture

Fraud must be blocked before the transaction completes, not detected after the money is gone. A batch system that runs overnight and flags yesterday's fraud is useless when the stolen funds have already been withdrawn. Real-time scoring means evaluating every transaction as it happens, making a decision in under 100 milliseconds, and then allowing, blocking, or routing the transaction to manual review before the payment processor settles the funds.

The Scoring Pipeline

The real-time fraud scoring pipeline has five stages, each adding information and making the decision more precise.

Stage 1: Event Ingestion. The transaction event arrives via a streaming platform like Kafka or Kinesis. The event contains raw transaction data: amount, currency, merchant, card hash, device ID, IP address, and timestamp. Kafka provides the durability and per-partition ordering guarantees needed for financial data, so transactions are not lost even if downstream services are temporarily unavailable.

Stage 2: Feature Enrichment. A feature service looks up precomputed and real-time features from an online feature store (Redis for sub-millisecond lookups, DynamoDB for durable storage). Velocity features (transactions in the last 5 minutes) must be computed in real-time because they change with every new transaction. Historical features (lifetime spend, average transaction amount, account age) can be batch-computed hourly or daily because they change slowly. The feature store bridges these two worlds: real-time features are updated with each transaction via stream processing, while batch features are refreshed on a schedule.

Stage 3: Rule Engine. Deterministic rules evaluate patterns that are known and absolute. Block all transactions from sanctioned countries. Flag any card used in two countries within 30 minutes. Reject transactions where the amount exceeds 10 times the user's average. Rules are fast (microsecond evaluation), explainable (you can tell the customer exactly why their transaction was blocked), and have zero false negatives for the patterns they cover.

Stage 4: ML Model Scoring. The ML model receives the enriched feature vector and computes a fraud probability between 0.0 and 1.0. This is typically a gradient-boosted tree model (XGBoost, LightGBM) served via a low-latency inference service. Tree models are preferred over neural networks for this stage because they offer inference times of a few milliseconds, handle tabular features well, and are easier to interpret than deep networks. The model catches complex patterns that rules cannot express: the subtle combination of a slightly unusual amount, a merchant in a rarely-visited category, and a device fingerprint that has changed one field.

Stage 5: Decision Engine. The decision engine combines the rule output and model score into a final action. If any hard rule fires (sanctioned country, impossible geography), the transaction is blocked immediately. If the model score exceeds the block threshold (for example, 0.85), the transaction is blocked. If the score falls in the review zone (0.40-0.85), the transaction is routed to a human analyst for manual review. Below 0.40, the transaction is allowed. These thresholds are tuned based on the business's tolerance for false positives versus missed fraud.
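The decision logic in Stage 5 is small enough to sketch directly. The thresholds are the example values from the text; the function name, the action strings, and the handling of scores exactly on a boundary are assumptions.

```python
BLOCK_THRESHOLD = 0.85   # example value from the text
REVIEW_THRESHOLD = 0.40  # example value from the text

def decide(hard_rule_fired: bool, model_score: float) -> str:
    """Combine rule output and model score into 'block', 'review', or 'allow'."""
    if hard_rule_fired:
        return "block"            # hard rules override the model
    if model_score > BLOCK_THRESHOLD:
        return "block"
    if model_score >= REVIEW_THRESHOLD:
        return "review"           # routed to a human analyst
    return "allow"

print(decide(False, 0.91), decide(False, 0.55), decide(False, 0.12), decide(True, 0.05))
# block review allow block
```

Keeping the thresholds as named constants (or configuration) makes it easier to retune them as the business's tolerance for false positives changes.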

Interview Tip

In system design interviews, fraud detection is a common topic. Always mention the dual-path architecture: rules for known patterns (fast, explainable, zero false negatives on the patterns they cover) and ML for novel patterns (catches what rules miss). Interviewers want to see that you understand why both are needed: rules alone cannot catch novel fraud, and ML alone cannot easily explain decisions to regulators.

Latency Budget

The entire pipeline must complete in under 100 milliseconds to avoid delaying the payment authorization. A typical budget allocation:

Stage                                Latency Budget
─────────────────────────────────    ──────────────
Event ingestion (Kafka consumer)     5-10ms
Feature enrichment (Redis lookup)    5-15ms
Rule engine evaluation               1-2ms
ML model inference                   3-10ms
Decision engine + response           2-5ms
Network overhead                     10-20ms
Total                                26-62ms

This leaves headroom within the 100ms SLA. The feature enrichment step is usually the bottleneck because it requires network calls to the feature store. Keeping the feature store co-located in the same availability zone and using connection pooling keeps this under 15ms.
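The budget arithmetic is easy to keep honest in code, for example as a check that runs whenever a stage's allocation changes. The per-stage ranges are the ones from the table; the dictionary keys and variable names are illustrative.

```python
SLA_MS = 100

# (best case, worst case) per stage in milliseconds, taken from the table above.
budget_ms = {
    "event_ingestion": (5, 10),
    "feature_enrichment": (5, 15),
    "rule_engine": (1, 2),
    "model_inference": (3, 10),
    "decision_and_response": (2, 5),
    "network_overhead": (10, 20),
}

best_case = sum(lo for lo, _ in budget_ms.values())
worst_case = sum(hi for _, hi in budget_ms.values())
headroom = SLA_MS - worst_case

print(best_case, worst_case, headroom)  # 26 62 38
```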

Feature Freshness

Not all features need the same update frequency. Velocity features (transactions in the last 5 minutes, login attempts in the last hour) must be computed in real-time because they change with every event. A stream processor (Flink, Kafka Streams) maintains sliding window counters that update as each transaction arrives.
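A sliding window counter of the kind a stream processor maintains can be sketched in plain Python. This is an in-memory illustration for a single key, not Flink or Kafka Streams code, and it assumes events arrive in timestamp order; the class name is hypothetical.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events that fall inside the most recent `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, ts: float) -> int:
        """Record an event at time `ts` (seconds) and return the count in the window."""
        self._timestamps.append(ts)
        cutoff = ts - self.window_seconds
        # Evict events that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)

counter = SlidingWindowCounter(window_seconds=300)  # transactions in the last 5 minutes
counts = [counter.add(ts) for ts in (0, 60, 120, 290, 400)]
print(counts)  # [1, 2, 3, 4, 3]
```

In production the state would be kept per card or per account and would need to tolerate late or out-of-order events, which is part of what a stream processor provides.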

Historical features (average transaction amount over 90 days, total lifetime spend, account age) change slowly and can be batch-computed every hour or every day. A batch pipeline (Spark, dbt) computes these features on a schedule and writes them to the feature store. The online feature store serves both real-time and batch features with the same low-latency read path, hiding the difference in computation frequency from the scoring pipeline.

Rule Engines vs ML Models

Why not just use rules? Why not just use ML? Every fraud team asks this question, and the answer is always the same: you need both. Rules and ML models have complementary strengths, and a system that relies on only one approach leaves significant gaps.

Why Rules Matter

Rules are deterministic, fast, and explainable. "Block if more than 10 transactions from different countries in 1 hour" is immediately understandable to regulators, analysts, and customers. When a customer calls to ask why their transaction was blocked, you can give a specific, concrete reason. When a regulator audits your system, you can show them exactly what each rule does and why it exists.

Rules execute in microseconds. There is no model loading, no feature vector construction, no inference latency. For known fraud patterns, rules provide instant protection the moment you deploy them. A new phishing campaign targeting your users can be mitigated within minutes by adding a rule, without waiting for model retraining.

Rules have zero false negatives for the patterns they cover. If the rule says "block transactions from sanctioned country X," every single transaction from country X is blocked. There is no probability threshold, no score calibration, no chance of the model having a bad day.

Why Rules Are Not Enough

Rules are brittle. Fraudsters learn the thresholds and stay just below them. If your rule blocks accounts with more than 10 transactions per hour, fraudsters will make 9. If your rule flags transactions above $5,000, fraudsters will split their purchases into $4,900 increments. Rules create a fixed boundary that sophisticated fraudsters map and exploit.

Rules cannot capture complex feature interactions. A transaction might be suspicious only when the amount is slightly above average AND the merchant category is unusual AND the device fingerprint has changed AND the transaction time is outside normal hours. Writing rules for every combination of interacting features is impractical. With 50 features there are already 1,225 possible pairs and 19,600 possible triples, before choosing any thresholds, so the number of candidate interaction rules grows combinatorially.

Rules do not generalize. Each rule covers one specific pattern. New fraud types require new rules, and you cannot write rules for patterns you have not seen yet. The rule base grows over time, becoming a maintenance burden where rules interact in unexpected ways, sometimes blocking legitimate transactions through unintended combinations.

Why ML Models Are Not Enough

ML models behave like black boxes. A gradient-boosted ensemble of 500 trees makes decisions that are difficult to explain in plain language, even with only 10 input features. Regulators, particularly under PSD2 in Europe and various US banking regulations, require that automated decisions affecting consumers be explainable. "The model said so" is not an acceptable explanation for blocking someone's payment.

ML models have probabilistic outputs. If the model is well calibrated, a score of 0.72 means that roughly 72% of transactions with that score are fraud. But is that enough to block the transaction? The threshold decision involves business trade-offs (how much fraud loss versus how many legitimate customers blocked) that the model cannot make on its own.
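One way to make that trade-off explicit is to compare expected costs. The sketch below assumes the score is a calibrated probability, that the only actions are block or allow, and that each error type has a single average cost; the dollar figures and the function name are hypothetical. Blocking is cheaper in expectation when p × cost_false_negative exceeds (1 − p) × cost_false_positive, which rearranges to the threshold computed here.

```python
def block_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Fraud probability above which blocking has lower expected cost than allowing."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Hypothetical costs: $15 for wrongly blocking a customer, $85 for a missed fraud.
threshold = block_threshold(cost_false_positive=15.0, cost_false_negative=85.0)
should_block = 0.72 > threshold

print(round(threshold, 2), should_block)  # 0.15 True
```

A real system also has a manual review option and costs that vary with transaction amount, so this is a starting point rather than a full policy.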

ML models require training data, and fraud labels arrive with a 30-90 day delay (chargebacks). A brand new fraud pattern has no training data, so the model has never seen it and may not flag it. Rules can be deployed immediately when a new pattern is identified.

The Hybrid Approach

The production answer is a layered architecture. Rules serve as the first pass: they evaluate fast, catch known patterns with certainty, and provide explainable decisions. The ML model serves as the second pass: it catches the complex, novel patterns that rules miss.

The decision engine combines both signals. Hard rules (sanctioned countries, impossible geography) override everything. Soft rules (velocity limits, amount thresholds) contribute to a risk score alongside the ML output. The combined score determines the final action.

Common Pitfall

A common mistake is setting the ML fraud threshold once and forgetting it. As fraud patterns shift and your model retrains, the optimal threshold changes. A threshold of 0.7 might be right when the model is fresh, but after three months of distribution shift, that same threshold could be blocking 2x more legitimate transactions. Monitor your precision and recall weekly, not just your model accuracy.

Score Calibration

The ML model outputs a raw score, but the business needs a calibrated probability. A calibrated model means that among all transactions scored 0.70, approximately 70% are actually fraudulent. Without calibration, a score of 0.70 might correspond to a 40% actual fraud rate or a 90% fraud rate, making threshold selection unreliable.

Platt scaling and isotonic regression are common calibration techniques. Platt scaling fits a logistic function to the model's scores using a held-out calibration set. Isotonic regression fits a non-parametric monotonic function. Both map raw scores to approximately calibrated probabilities, enabling principled threshold selection based on the business's cost of false positives versus false negatives.
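Isotonic regression is commonly fitted with the pool-adjacent-violators algorithm, which is short enough to sketch. The calibration set below is a tiny made-up example, and the function name is illustrative; in practice a library implementation and a far larger held-out set would be used.

```python
def isotonic_fit(labels_sorted_by_score):
    """Pool-adjacent-violators: one non-decreasing fitted value per input label."""
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels_sorted_by_score:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the latest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical held-out calibration set: raw model scores and confirmed outcomes.
raw_scores = [0.10, 0.30, 0.55, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

pairs = sorted(zip(raw_scores, labels))
calibrated = isotonic_fit([label for _, label in pairs])
print(calibrated)  # [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

The raw scores 0.55 and 0.60 disagree with their labels' ordering, so they are pooled into one step with an observed fraud rate of 0.5.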

Feedback Loops and Label Collection

The hardest part of fraud detection is not building the model. It is getting the labels. At the moment you score a transaction, you do not know whether it is fraudulent. The transaction looks like a normal purchase. The customer has not complained yet. The chargeback has not been filed. The truth arrives 30-90 days later, long after the model made its decision.

The Label Delay Problem

When a customer disputes a charge, the bank initiates a chargeback process. The merchant has 30-45 days to respond. The issuing bank then has another 30-45 days to resolve. For some fraud types (account takeover, synthetic identity), the victim may not notice for months. This means your training data is always incomplete: your most recent 90 days of transactions have unreliable labels because many fraud cases have not been reported yet.

This creates a training bias. If you retrain your model on the last 30 days of data, most of the fraud labels are missing. The model learns that recent transactions are mostly legitimate, because the fraud labels simply have not arrived yet. The practical fix is to train on data that is at least 90 days old, where label coverage is more complete, and accept that your model is always learning from slightly stale patterns.
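The maturity cut-off can be enforced when the training set is assembled. A minimal sketch, assuming each row carries a transaction date; the row layout, function name, and dates are hypothetical.

```python
from datetime import date, timedelta

LABEL_MATURITY_DAYS = 90  # wait this long before trusting a "not fraud" label

def mature_training_rows(rows, today):
    """Keep only transactions old enough for chargebacks to have arrived."""
    cutoff = today - timedelta(days=LABEL_MATURITY_DAYS)
    return [row for row in rows if row["date"] <= cutoff]

rows = [
    {"id": "t1", "date": date(2024, 1, 15)},
    {"id": "t2", "date": date(2024, 3, 31)},
    {"id": "t3", "date": date(2024, 4, 2)},
    {"id": "t4", "date": date(2024, 6, 1)},
]

kept = mature_training_rows(rows, today=date(2024, 6, 30))
print([row["id"] for row in kept])  # ['t1', 't2']
```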

The Feedback Loop

The fraud detection feedback loop works as follows: the model scores transactions, high-scoring transactions are reviewed by human analysts, analyst decisions (confirmed fraud or false positive) become labels, and the model retrains on these new labels. This is a standard active learning loop, and it works well when executed carefully.

But there is a critical bias in this loop. The model only generates labels for transactions it flags. Transactions that the model scores below the review threshold are never reviewed, so you never learn whether they were actually fraudulent. This is a selection bias, sometimes described as survivorship bias: you only see the outcomes of decisions the model already made. If the model has a blind spot for a particular fraud type, that fraud type is never flagged, never reviewed, never labeled, and the model never learns to detect it.

The fix is random sampling. Randomly select a small percentage (0.1-1%) of transactions that the model scored as low risk and send them for analyst review anyway. This provides an unbiased estimate of the model's false negative rate and surfaces fraud types that the model is missing. It costs more in analyst time, but the resulting labels are what let the model learn the fraud types it currently misses.
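The review queue then has two sources: model flags and a random audit sample. A minimal sketch, with a hypothetical transaction layout and function name; tagging each item with its source keeps the audit sample separable later, since only that sample is unbiased.

```python
import random

def select_for_review(transactions, review_threshold=0.40, sample_rate=0.01, seed=None):
    """Return (transaction id, reason) pairs to send to analysts."""
    rng = random.Random(seed)
    selected = []
    for tx in transactions:
        if tx["score"] >= review_threshold:
            selected.append((tx["id"], "model_flag"))
        elif rng.random() < sample_rate:
            # Low-risk transaction picked purely at random for an audit review.
            selected.append((tx["id"], "random_audit"))
    return selected

# Hypothetical traffic: mostly low scores, a few high ones.
transactions = [{"id": i, "score": 0.05} for i in range(1000)]
transactions += [{"id": 1000 + i, "score": 0.90} for i in range(3)]

queue = select_for_review(transactions, seed=7)
print(sum(1 for _, reason in queue if reason == "model_flag"))  # 3
```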

Adversarial Drift

Fraud detection is fundamentally different from most ML problems because the data distribution changes in response to the model. If the model learns to block card-testing attacks (rapid sequences of small transactions), fraudsters switch to account takeover (compromising a legitimate user's credentials). If the model learns to detect account takeover, fraudsters move to synthetic identity fraud (creating fake identities with real data fragments).

This is an adversarial arms race, and it means the model must continuously retrain. A model trained on last quarter's fraud patterns degrades within weeks as fraudsters adapt. The retraining cadence depends on how quickly fraud patterns shift. Most production systems retrain weekly or biweekly, with a monitoring system that triggers emergency retraining when performance metrics (precision, recall, false positive rate) degrade beyond a threshold.
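A monitoring trigger of this kind can be as simple as comparing current metrics to a baseline. The sketch below is one possible rule, not a standard: the 10% relative-drop tolerance, the baseline values, and the weekly counts are all hypothetical.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def needs_retraining(current: dict, baseline: dict, max_relative_drop: float = 0.10) -> bool:
    """True if any metric fell more than `max_relative_drop` below its baseline."""
    return any(current[name] < baseline[name] * (1 - max_relative_drop) for name in baseline)

baseline = {"precision": 0.80, "recall": 0.60}        # measured when the model was deployed
precision, recall = precision_recall(tp=45, fp=15, fn=45)  # this week's matured labels
current = {"precision": precision, "recall": recall}

print(current, needs_retraining(current, baseline))
# {'precision': 0.75, 'recall': 0.5} True
```

Precision is still within tolerance here, but recall has dropped from 0.60 to 0.50, which is enough to trip the trigger.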

Explainability

When a transaction is blocked, someone needs to know why: the customer, the analyst reviewing the case, and the regulator auditing the system. SHAP (SHapley Additive exPlanations) values decompose the model's score into per-feature contributions relative to a baseline (the model's average output). For a transaction scored 0.82, SHAP might show: velocity features contributed +0.30, device change contributed +0.25, unusual merchant contributed +0.15, and normal amount contributed -0.08. These contributions sum to 0.62, and the remaining 0.20 comes from the baseline and any smaller contributions not listed. This tells the analyst which features drove the decision.
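The additive structure of a SHAP explanation can be shown with plain arithmetic. This sketch does not call the `shap` library: the contributions are the example numbers from the text, and the baseline of 0.20 is an assumption chosen so that the parts add up to the 0.82 score.

```python
base_value = 0.20  # assumed baseline (average model output); not given in the text

# Per-feature contributions from the example above.
contributions = {
    "velocity": 0.30,
    "device_change": 0.25,
    "unusual_merchant": 0.15,
    "normal_amount": -0.08,
}

# Additivity: baseline plus all contributions reproduces the model's score.
score = base_value + sum(contributions.values())

# Rank features by the size of their effect, regardless of direction.
ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)

print(round(score, 2), ranked)
# 0.82 ['velocity', 'device_change', 'unusual_merchant', 'normal_amount']
```

The ranked list is what an analyst-facing tool would typically surface as the top reasons for a block.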

LIME (Local Interpretable Model-agnostic Explanations) fits a simple, interpretable model (typically a linear model) that approximates the complex model's behavior in the neighborhood of the specific prediction. The simple model's coefficients serve as local feature importance scores for that prediction.

Feature importance shows which features the model relies on globally, not for a specific prediction. If velocity features have the highest importance, the model is primarily a behavioral anomaly detector. If device features dominate, the model is primarily a device fraud detector. This informs feature engineering priorities.

Regulatory Requirements

PSD2 (Payment Services Directive 2) in Europe requires Strong Customer Authentication and mandates that payment service providers explain automated fraud decisions. GDPR grants data subjects the right to meaningful information about the logic involved in automated decision-making. In the US, ECOA (Equal Credit Opportunity Act) requires adverse action notices that explain why a credit decision was made.

These regulations mean that explainability is not optional. Every blocked transaction must have a documented reason that can be provided to the customer and to regulators on request. This is another argument for the hybrid rule-plus-ML approach: rules provide built-in explanations, and SHAP/LIME provides approximate explanations for ML-driven decisions.
