Active Learning: Pool-Based Sampling for SVMs in Python

Introduction

Active learning is useful when unlabeled data is plentiful but labels are expensive. Instead of labeling examples at random, you train a model on a small seed set and repeatedly ask for labels on the most informative items.

Support Vector Machines fit this setting especially well because their decision boundary gives you a natural uncertainty signal. In pool-based active learning, you keep a large pool of unlabeled examples and query the points closest to the current decision boundary.

Why SVMs Work Well for Pool Sampling

For a binary SVM, decision_function returns a signed score for each example that is proportional to its distance from the separating hyperplane. Examples with large positive or negative scores are classified confidently. Examples with scores close to zero are ambiguous, which makes them good candidates for labeling.
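As a rough illustration of that signal, the sketch below fits a linear SVC on a toy dataset and inspects the scores. The dataset parameters and the 100/100 split are arbitrary assumptions chosen for the example, not part of the method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy binary problem; sizes and seed are illustrative assumptions.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="linear").fit(X[:100], y[:100])

# One signed score per sample: positive favors clf.classes_[1],
# negative favors clf.classes_[0].
scores = clf.decision_function(X[100:])
predicted = clf.predict(X[100:])

# The sign of the score determines the predicted class.
agrees = np.array_equal(predicted, clf.classes_[(scores > 0).astype(int)])

# A smaller absolute score means the point is closer to the boundary.
most_ambiguous = np.argsort(np.abs(scores))[:5]
print(agrees)
print(np.abs(scores[most_ambiguous]))
```

The last line prints the five smallest absolute scores in ascending order, which are the points a margin-based query would pick first.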

That leads to a common query strategy:

Train on the labeled subset.
Score the unlabeled pool with decision_function.
Select the samples with the smallest absolute margins.
Ask the oracle for their labels.
Add them to the labeled set and repeat.

This is often called uncertainty sampling or margin sampling.

A Simple Pool-Based Loop in Python

The example below simulates an oracle by reading labels from the full dataset. In a real system, that step would be a human annotation workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    n_redundant=2,
    class_sep=1.0,
    random_state=42,
)

# Hold out a test set that the active learner never queries.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

rng = np.random.default_rng(42)
initial_size = 20  # size of the random seed set
query_size = 10    # labels requested per round
rounds = 8

# Split the training indices into a labeled seed set and an unlabeled pool.
all_indices = np.arange(len(X_train))
labeled_idx = rng.choice(all_indices, size=initial_size, replace=False)
unlabeled_idx = np.setdiff1d(all_indices, labeled_idx)

# Scaling lives inside the pipeline, so it is refit on each labeled subset.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear"),
)

for step in range(rounds):
    # Retrain on everything labeled so far and evaluate on the holdout set.
    model.fit(X_train[labeled_idx], y_train[labeled_idx])
    test_pred = model.predict(X_test)
    test_acc = accuracy_score(y_test, test_pred)
    print(f"round={step} labeled={len(labeled_idx)} acc={test_acc:.3f}")

    # Score the pool and pick the points closest to the decision boundary.
    decision_scores = model.decision_function(X_train[unlabeled_idx])
    query_order = np.argsort(np.abs(decision_scores))
    chosen = unlabeled_idx[query_order[:query_size]]

    # Simulated oracle: moving indices to the labeled set exposes their
    # y_train values to the next fit.
    labeled_idx = np.concatenate([labeled_idx, chosen])
    unlabeled_idx = np.setdiff1d(unlabeled_idx, chosen)
```

The important line is np.argsort(np.abs(decision_scores)). It ranks the unlabeled pool by how close each point is to the current boundary.
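To make the ranking concrete, here is a tiny worked example. The five scores are made-up values for illustration only.

```python
import numpy as np

# Made-up decision scores for five pool items (illustrative values only).
decision_scores = np.array([-2.1, 0.3, 1.5, -0.1, 0.9])

# Absolute margins are [2.1, 0.3, 1.5, 0.1, 0.9]; argsort orders them ascending.
query_order = np.argsort(np.abs(decision_scores))
print(query_order.tolist())      # [3, 1, 4, 2, 0]
print(query_order[:2].tolist())  # [3, 1]
```

With a query size of 2, items 3 and 1 would be sent for labeling, because their scores (-0.1 and 0.3) are nearest to zero. The sign is ignored: a point just on the negative side is as ambiguous as one just on the positive side.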

Choosing the Right SVM Setup

A linear SVM is usually the easiest place to start. It is fast, and the margin is straightforward to interpret. For nonlinear problems you can switch to an RBF kernel, but each round gets more expensive, because both retraining and scoring the pool cost more.
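Switching kernels does not change the query logic. The sketch below assumes a toy two-moons dataset and leaves C and gamma at scikit-learn's defaults; in practice those would need tuning, and the 40-point labeled set is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: a toy problem a straight line cannot separate.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_labeled, y_labeled = X[:40], y[:40]
X_pool = X[40:]

# Only the kernel changes; C and gamma are scikit-learn's default values.
rbf_model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
rbf_model.fit(X_labeled, y_labeled)

# Query selection is the same as in the linear case.
pool_scores = rbf_model.decision_function(X_pool)
chosen = np.argsort(np.abs(pool_scores))[:10]
print(chosen)
```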

If class imbalance matters, evaluate with something more informative than raw accuracy. Precision, recall, balanced accuracy, or a confusion matrix can tell you whether the active learner is only improving the majority class.
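One way to do that is a small reporting helper called once per round. The function name evaluate and the 90/10 class split below are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate(model, X_test, y_test):
    """Hypothetical helper: imbalance-aware metrics for one fitted model."""
    pred = model.predict(X_test)
    return {
        # Mean of the per-class recalls, so the majority class cannot dominate.
        "balanced_accuracy": balanced_accuracy_score(y_test, pred),
        # One recall value per class, in sorted label order.
        "recall_per_class": recall_score(y_test, pred, average=None),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, pred),
    }


# Roughly 90/10 imbalanced toy data (the weights are an illustrative assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)

report = evaluate(model, X_test, y_test)
print(report["balanced_accuracy"])
print(report["recall_per_class"])
print(report["confusion_matrix"])
```

Tracking the minority-class recall across rounds shows whether new labels are helping the class you care about, which a single accuracy number can hide.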

You should also seed the process with a labeled set that contains examples from every class. An active learner cannot discover a missing class if the initial model has never seen it.
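In a simulation where all labels are already known, a per-class seed can be drawn directly. The helper name stratified_seed is an assumption for this sketch; with a real annotation workflow the labels are not available up front, so you would instead keep labeling random items until every class has appeared.

```python
import numpy as np


def stratified_seed(y, per_class, rng):
    """Hypothetical helper: pick `per_class` random indices from each class in y."""
    chosen = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        chosen.append(rng.choice(members, size=per_class, replace=False))
    return np.concatenate(chosen)


rng = np.random.default_rng(0)

# Simulated labels: 95 examples of class 0 and 5 of class 1.
y_pool = np.array([0] * 95 + [1] * 5)

seed_idx = stratified_seed(y_pool, per_class=3, rng=rng)
print(sorted(y_pool[seed_idx].tolist()))  # [0, 0, 0, 1, 1, 1]
```

A purely random draw of six items from this pool would often contain no class 1 examples at all, and an SVC cannot be fit on a single class.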

Adding a Stopping Rule

A production active learning loop needs a stopping rule. Common choices include:

A fixed annotation budget.
Validation performance that stops improving.
Margin scores that are no longer concentrated near zero.
Human review capacity for the current sprint or batch.

Without a stopping rule, the loop tends to drift into ordinary supervised training with repeated retraining overhead.
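The first three rules can be combined in one check that runs at the end of each round. This is a minimal sketch: the function name should_stop and the default values for patience, min_delta, and margin_threshold are assumptions to tune, not recommendations. The validation scores are assumed to come from a validation set that is separate from the final test set.

```python
import numpy as np


def should_stop(val_scores, labels_used, budget, pool_margins=None,
                patience=3, min_delta=0.005, margin_threshold=1.0):
    """Hypothetical helper: return True when any stopping rule fires."""
    # Rule 1: the annotation budget is spent.
    if labels_used >= budget:
        return True

    # Rule 2: no remaining pool item scores near zero.
    if pool_margins is not None and len(pool_margins) > 0:
        if np.min(np.abs(pool_margins)) >= margin_threshold:
            return True

    # Rule 3: the best score of the last `patience` rounds did not beat
    # the best earlier score by at least `min_delta`.
    if len(val_scores) > patience:
        best_before = max(val_scores[:-patience])
        best_recent = max(val_scores[-patience:])
        if best_recent - best_before < min_delta:
            return True

    return False


print(should_stop([0.70, 0.75, 0.80, 0.85], labels_used=40, budget=100))          # False
print(should_stop([0.80, 0.85, 0.85, 0.851, 0.852], labels_used=60, budget=100))  # True
print(should_stop([0.70, 0.80], labels_used=50, budget=50))                       # True
```

Human review capacity is an operational limit rather than a model signal, so it usually maps onto the budget argument for the current batch.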

When to Use a Library

If you want a research-friendly workflow, a library such as modAL can save time. It provides ready-made query strategies and cleaner abstractions for pool-based sampling. Still, it is worth implementing the loop once yourself, because it makes the mechanics obvious and lets you customize the acquisition logic.

Common Pitfalls

The most common mistake is starting with a seed set that is too small or too narrow. If the seed set misses one class or covers only one region of the feature space, the SVM margin is not informative yet.

Another pitfall is forgetting preprocessing. SVMs are sensitive to feature scale, so use StandardScaler or an equivalent step inside a pipeline. Otherwise your margin-based query scores may reflect units rather than genuine uncertainty.

People also confuse uncertainty with usefulness. Points near the margin can include noisy or mislabeled cases. In some domains it helps to mix uncertainty sampling with diversity sampling so you do not query ten nearly identical records.
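One possible way to mix the two is to shortlist the lowest-margin points, cluster the shortlist, and query one point per cluster. The sketch below uses KMeans for the clustering step; the function name diverse_margin_query and the candidate_factor default are assumptions, and other diversity criteria would work as well. Features should be scaled before clustering, for the same reason they are scaled for the SVM.

```python
import numpy as np
from sklearn.cluster import KMeans


def diverse_margin_query(decision_scores, X_pool, query_size,
                         candidate_factor=5, seed=0):
    """Hypothetical helper: return pool positions that are uncertain and spread out."""
    margins = np.abs(decision_scores)

    # Shortlist the most uncertain points (a few times the query size).
    n_candidates = min(len(margins), query_size * candidate_factor)
    candidates = np.argsort(margins)[:n_candidates]
    if n_candidates <= query_size:
        return candidates

    # Group the shortlist into query_size clusters of similar points.
    clusters = KMeans(
        n_clusters=query_size, n_init=10, random_state=seed
    ).fit_predict(X_pool[candidates])

    # From each cluster, keep the single point closest to the boundary.
    picked = []
    for cluster_id in range(query_size):
        members = candidates[clusters == cluster_id]
        if len(members) > 0:
            picked.append(members[np.argmin(margins[members])])
    return np.array(picked)


# Random stand-ins for pool features and SVM scores (illustration only).
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 4))
scores = rng.normal(size=200)

picked = diverse_margin_query(scores, X_pool, query_size=10)
print(picked)
```

The returned values are positions within the pool arrays, so in the loop above they would be mapped back through unlabeled_idx before updating the labeled set.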

Finally, do not report the model score on the growing labeled pool as if it were a real evaluation. Always keep a separate validation or test set, because the active learner is choosing training examples adaptively.

Summary

Pool-based active learning keeps a large unlabeled pool and queries the most informative items.
For SVMs, the absolute value of the decision_function score is a practical uncertainty measure: the closer to zero, the less certain the model.
A simple loop is train, score the pool, query low-margin items, label them, and retrain.
Use scaling, a representative seed set, and a real holdout test set.
Add a stopping rule so the process matches your annotation budget and business goal.