Detecting 'unusual behavior' using machine learning with CouchDB and Python?

Introduction

If you want to detect unusual behavior with CouchDB and Python, the useful architecture is to treat CouchDB as the event store and Python as the feature-engineering and model-execution layer. CouchDB is good at collecting JSON documents and querying them; it is not the place where anomaly detection models should train.

That separation keeps the system simple. Store events in CouchDB, build features in Python, train an unsupervised model such as Isolation Forest, and write anomaly scores or alerts back to the database or another downstream system.

Start With Structured Event Documents

Anomaly detection is only as good as the features you log. Raw message text is rarely enough by itself. Store events with fields that describe who acted, when, from where, and with what result.

A useful event document might include:

'user_id'
'ip_address'
'timestamp'
'endpoint'
'status_code'
'bytes_sent'
'latency_ms'

That gives Python enough signal to compute features such as request rate, failure ratio, byte volume, and time-of-day behavior.
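
As a rough illustration of the document shape described above, the sketch below builds one event and checks it before it is written to CouchDB. The helper names (`make_event`, `validate_event`), the expected types, and the sample values are assumptions made for this example; the `type` field is included only so that documents can be selected by type when querying.

```python
from datetime import datetime, timezone

# Expected fields and types (illustrative; adjust to your own schema).
REQUIRED_FIELDS = {
    "user_id": str,
    "ip_address": str,
    "timestamp": str,
    "endpoint": str,
    "status_code": int,
    "bytes_sent": int,
    "latency_ms": (int, float),
}


def make_event(user_id, ip_address, endpoint, status_code, bytes_sent,
               latency_ms, when=None):
    """Build an event document with an ISO 8601 UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    return {
        "type": "request_event",
        "user_id": user_id,
        "ip_address": ip_address,
        "timestamp": when.isoformat(),
        "endpoint": endpoint,
        "status_code": status_code,
        "bytes_sent": bytes_sent,
        "latency_ms": latency_ms,
    }


def validate_event(doc):
    """Return a list of problems; an empty list means the document is usable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


event = make_event("u-123", "203.0.113.7", "/api/orders", 200, 5120, 42.5)
print(validate_event(event))
```

Rejecting malformed events at write time is usually cheaper than cleaning them up later in the feature pipeline.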

Pull Data From CouchDB in Python

CouchDB exposes an HTTP API, so a simple pipeline can use the requests library to POST a Mango query to the database's _find endpoint.

python

import requests

# Mango query endpoint of the "events" database.
COUCHDB_URL = "http://localhost:5984/events/_find"
# Placeholder credentials; load real ones from configuration or the environment.
AUTH = ("admin", "password")

# Select request events and return only the fields needed for feature building.
query = {
    "selector": {
        "type": "request_event"
    },
    "fields": [
        "user_id",
        "timestamp",
        "status_code",
        "bytes_sent",
        "latency_ms"
    ],
    "limit": 1000
}

response = requests.post(COUCHDB_URL, json=query, auth=AUTH, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors
documents = response.json()["docs"]
print(len(documents))

This keeps the extraction logic explicit and easy to test.
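
A single query with a `limit` returns only the first page of matches. The `_find` response also carries a `bookmark` that can be sent with the next request to continue where the previous page ended. The sketch below pages through results that way; `fetch_all` is a hypothetical helper, and the HTTP function is passed in as `post` so the loop can be exercised without a running CouchDB.

```python
def fetch_all(post, url, selector, fields, auth, page_size=1000):
    """Collect every matching document by following _find bookmarks.

    `post` is any callable with the same keyword interface as requests.post.
    """
    docs = []
    bookmark = None
    while True:
        query = {"selector": selector, "fields": fields, "limit": page_size}
        if bookmark:
            query["bookmark"] = bookmark
        response = post(url, json=query, auth=auth, timeout=30)
        response.raise_for_status()
        body = response.json()
        page = body["docs"]
        docs.extend(page)
        bookmark = body.get("bookmark")
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size or not bookmark:
            break
    return docs
```

With the requests library this would be called as `fetch_all(requests.post, COUCHDB_URL, {"type": "request_event"}, ["user_id", "timestamp"], AUTH)`.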

Build Features That Represent “Normal” Behavior

Anomaly models work on numbers, not on raw JSON. For request logs, a useful first pass is to aggregate by user and time window.

python

import pandas as pd

# `documents` is the list of event dicts fetched from CouchDB.
df = pd.DataFrame(documents)
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
# Treat any 4xx or 5xx response as an error.
df["is_error"] = (df["status_code"] >= 400).astype(int)

# One row per user per 15-minute window.
grouped = (
    df.set_index("timestamp")
      .groupby("user_id")
      .resample("15min")
      .agg({
          "bytes_sent": "sum",
          "latency_ms": "mean",
          "is_error": "sum",
      })
      .fillna(0)  # windows with no events have no mean latency; use 0
      .reset_index()
)

# Hour of day (UTC) captures time-of-day behavior.
grouped["hour"] = grouped["timestamp"].dt.hour
print(grouped.head())

Now each row represents a small behavioral slice instead of one isolated event. That is much more useful for anomaly detection.
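
To make the windowing concrete, here is the same idea spelled out in plain Python for two features mentioned earlier, request count and failure ratio. This is an illustrative sketch, not a replacement for the pandas pipeline: `window_start` and `summarize` are hypothetical helpers, and the timestamps are assumed to be ISO 8601 strings with an explicit UTC offset.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 15 * 60


def window_start(ts):
    """Floor an ISO 8601 timestamp to the start of its 15-minute window (UTC)."""
    epoch = int(datetime.fromisoformat(ts).timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


def summarize(events):
    """Count requests and errors per (user_id, window) and derive the error ratio."""
    buckets = defaultdict(lambda: {"requests": 0, "errors": 0})
    for e in events:
        key = (e["user_id"], window_start(e["timestamp"]))
        buckets[key]["requests"] += 1
        buckets[key]["errors"] += int(e["status_code"] >= 400)
    return {
        key: {**counts, "error_ratio": counts["errors"] / counts["requests"]}
        for key, counts in buckets.items()
    }


events = [
    {"user_id": "u1", "timestamp": "2024-01-01T10:02:00+00:00", "status_code": 200},
    {"user_id": "u1", "timestamp": "2024-01-01T10:14:59+00:00", "status_code": 500},
    {"user_id": "u1", "timestamp": "2024-01-01T10:15:00+00:00", "status_code": 200},
]
for key, features in summarize(events).items():
    print(key, features)
```

A ratio is often more informative than a raw error count, because ten failures out of ten requests is a very different signal from ten failures out of ten thousand.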

Train an Unsupervised Model

In many real systems you do not have labeled examples of “bad” behavior, so start with an unsupervised detector. Isolation Forest is a sensible baseline because it works well on tabular features and does not require anomaly labels.

python

from sklearn.ensemble import IsolationForest

feature_cols = ["bytes_sent", "latency_ms", "is_error", "hour"]
X = grouped[feature_cols].fillna(0)

model = IsolationForest(
    n_estimators=200,
    contamination=0.02,  # assumes roughly 2% of windows are anomalous
    random_state=42,     # fixed seed for reproducible results
)

# fit_predict returns -1 for anomalies and 1 for normal rows.
grouped["anomaly_flag"] = model.fit_predict(X)
# decision_function: the lower the score, the more abnormal the row.
grouped["anomaly_score"] = model.decision_function(X)

anomalies = grouped[grouped["anomaly_flag"] == -1]
print(anomalies[["user_id", "timestamp", "anomaly_score"]].head())

This does not magically detect attacks. It detects observations that look unlike the rest of the baseline. That is still valuable, but it needs domain interpretation.

Where CouchDB Fits Well

CouchDB is a good fit for event capture and downstream review because:

JSON documents map naturally to logs
replication helps if events originate in multiple locations
you can store anomaly results as documents too

A practical pattern is to write anomaly results back as documents, either into the same database or into a separate one, and to index them with a view or Mango index for dashboards and manual investigation. Keep training and scoring in Python or a separate service, not inside CouchDB views.
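
One way to write results back is CouchDB's `_bulk_docs` endpoint, which accepts a JSON body of the form `{"docs": [...]}` and returns one status entry per document. The sketch below only builds the payload and inspects the response. The document layout, the `anomaly:` ID scheme, and the `model_version` field are assumptions for illustration.

```python
def build_result_docs(rows, model_version):
    """Turn scored windows into a _bulk_docs payload.

    Deterministic _id values mean that re-scoring the same window is reported
    as a conflict instead of silently creating a duplicate document.
    """
    docs = []
    for row in rows:
        docs.append({
            "_id": f"anomaly:{row['user_id']}:{row['window_start']}",
            "type": "anomaly_result",
            "user_id": row["user_id"],
            "window_start": row["window_start"],
            "anomaly_score": float(row["anomaly_score"]),
            "model_version": model_version,
        })
    return {"docs": docs}


def failed_writes(results):
    """Return the per-document entries that CouchDB reported as errors."""
    return [entry for entry in results if "error" in entry]


payload = build_result_docs(
    [{"user_id": "u1", "window_start": "2024-01-01T10:00:00+00:00",
      "anomaly_score": -0.12}],
    model_version="iforest-v1",
)
print(payload["docs"][0]["_id"])
```

The payload could then be sent with `requests.post("http://localhost:5984/anomalies/_bulk_docs", json=payload, auth=AUTH, timeout=30)`, where the `anomalies` database name is hypothetical, and the parsed JSON response passed to `failed_writes`.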

How to Evaluate the System

Anomaly detection systems fail when teams skip evaluation. Even without labeled attacks, you can still review the most anomalous windows (with scikit-learn's decision_function, those are the rows with the lowest scores), compare them with known incidents, and track false positive rates.
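
A simple way to track this is precision at k: of the k most anomalous windows that a reviewer has looked at, what fraction turned out to be real problems? The sketch below assumes the scikit-learn convention that lower scores are more anomalous. `precision_at_k` and the review dictionary are hypothetical names for this example.

```python
def precision_at_k(scored, reviewed, k):
    """Fraction of the k most anomalous items that reviewers confirmed.

    scored:   list of (item_id, score); lower score = more anomalous
    reviewed: dict mapping item_id to True (confirmed) or False (false alarm)
    Items in the top k that nobody has reviewed yet are ignored.
    """
    top = sorted(scored, key=lambda pair: pair[1])[:k]
    verdicts = [reviewed[item_id] for item_id, _ in top if item_id in reviewed]
    if not verdicts:
        return None
    return sum(verdicts) / len(verdicts)


scored = [("a", -0.3), ("b", -0.1), ("c", 0.2), ("d", -0.2)]
reviewed = {"a": True, "d": False, "b": True}
print(precision_at_k(scored, reviewed, k=2))
```

Tracking this number over time shows whether feature or threshold changes actually reduce the review burden.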

You should also baseline by entity. “Normal” for one user, service, or API key may be abnormal for another. Per-user or per-service models often outperform one global model.
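
Short of training one model per entity, a lightweight way to baseline by entity is to normalize each feature against that entity's own history before scoring. The sketch below computes per-user z-scores with the standard library. `per_entity_zscores` is a hypothetical helper and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_entity_zscores(rows, feature):
    """Add a z-score of `feature` computed against each user's own history."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row[feature])
    stats = {user: (mean(vals), pstdev(vals)) for user, vals in by_user.items()}

    scored = []
    for row in rows:
        mu, sigma = stats[row["user_id"]]
        # A user with constant history has no spread; report 0 instead of dividing by 0.
        z = 0.0 if sigma == 0 else (row[feature] - mu) / sigma
        scored.append({**row, f"{feature}_z": z})
    return scored


rows = (
    [{"user_id": "u1", "bytes_sent": b} for b in (100, 100, 100, 100, 500)]
    + [{"user_id": "u2", "bytes_sent": b} for b in (5000, 5000)]
)
for row in per_entity_zscores(rows, "bytes_sent"):
    print(row)
```

Here 500 bytes stands out for u1 even though it is far below what u2 sends routinely, which a single global baseline would miss.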

Common Pitfalls

A common mistake is feeding raw event rows directly into a model without feature engineering. That usually produces noisy results because one request is rarely enough context.

Another mistake is treating CouchDB as the ML engine. CouchDB is the storage layer here, not the place to train anomaly models.

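People also underestimate drift in the baseline itself (often called data drift or concept drift). A deployment change, a new customer, or a traffic spike can make last month's "normal" obsolete, so retrain on a recent window of data at regular intervals.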
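
A crude way to notice that the baseline has gone stale is to compare recent feature values against the data the model was trained on. The sketch below flags features whose recent mean has moved by more than a few baseline standard deviations. It is a heuristic, not a statistical test, and both `drifted_features` and the threshold of 3.0 are assumptions for illustration.

```python
from statistics import mean, pstdev


def drifted_features(baseline, recent, threshold=3.0):
    """Return names of features whose recent mean moved far from the baseline.

    baseline, recent: dicts mapping feature name to a list of values.
    """
    flagged = []
    for name, base_values in baseline.items():
        mu, sigma = mean(base_values), pstdev(base_values)
        shift = abs(mean(recent[name]) - mu)
        if sigma == 0:
            # Constant baseline: any movement at all counts as drift.
            if shift > 0:
                flagged.append(name)
        elif shift / sigma > threshold:
            flagged.append(name)
    return flagged


baseline = {"latency_ms": [40, 50, 60], "bytes_sent": [1000, 1000, 1000]}
recent = {"latency_ms": [200, 210, 190], "bytes_sent": [1000, 1000, 1000]}
print(drifted_features(baseline, recent))
```

A flagged feature is a prompt to retrain on newer data or to check whether a deployment changed what "normal" looks like.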

Finally, do not deploy anomaly scores without a review loop. Unsupervised models always produce surprises, and some of those surprises are just legitimate product behavior.

Summary

Use CouchDB to store structured event documents, not to run the ML logic itself.
Engineer windowed behavioral features in Python before training a model.
Start with an unsupervised baseline such as Isolation Forest.
Write anomaly scores back to storage and review them with domain context.
The quality of features and evaluation matters more than the choice of anomaly library.