Clustering values by their proximity in Python machine learning?

Introduction

If you want to group values by how close they are, the right tool depends on what "close" means in your problem. For simple one-dimensional data, a sorted threshold rule may be enough, while more general clustering problems often fit a distance-based algorithm such as DBSCAN.

Start by deciding what kind of clustering you need

Many questions about "clustering by proximity" really mean one of two tasks:

  1. group nearby values in a sorted one-dimensional list
  2. run a general clustering algorithm that can handle arbitrary shapes and noise

For one-dimensional numeric values, a manual threshold can be perfectly valid and easier to explain than a full machine-learning algorithm.

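Suppose neighboring values in sorted order that differ by at most 2 should belong to the same group. Because each value is compared only with its predecessor, a group can span more than 2 in total: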

python
def cluster_by_gap(values, max_gap):
    if not values:
        return []

    values = sorted(values)
    clusters = [[values[0]]]

    for value in values[1:]:
        if value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])

    return clusters


values = [1, 2, 3, 10, 11, 20, 21, 22]
print(cluster_by_gap(values, max_gap=2))

For this input the function prints [[1, 2, 3], [10, 11], [20, 21, 22]]. The rule works well when the data is already one-dimensional and the gap threshold is clear.
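For larger inputs, the same gap rule can be sketched with NumPy instead of a Python loop. This is a minimal sketch, assuming NumPy is installed; cluster_by_gap_np is a hypothetical helper name, not a library function. It sorts the values, finds the positions where the difference between neighbors exceeds max_gap, and splits the array there.

```python
import numpy as np


def cluster_by_gap_np(values, max_gap):
    """Group sorted values, starting a new group where the gap exceeds max_gap."""
    arr = np.sort(np.asarray(values, dtype=float))
    if arr.size == 0:
        return []

    # np.diff gives the gap between each value and its predecessor;
    # a split index is the position of the value that follows a large gap.
    split_at = np.flatnonzero(np.diff(arr) > max_gap) + 1
    return [chunk.tolist() for chunk in np.split(arr, split_at)]


groups = cluster_by_gap_np([1, 2, 3, 10, 11, 20, 21, 22], max_gap=2)
print(groups)
```

Note that this version returns floats because the input is converted to a float array; the grouping itself matches the loop-based rule.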

Use DBSCAN when the distance rule should define the clusters

If you want a proper clustering algorithm based on distance, DBSCAN is often a strong fit. It groups points that are within an eps radius of one another and can also mark isolated points as noise.

For one-dimensional values in scikit-learn, reshape the data into a two-dimensional array with one feature column:

python
import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([1, 2, 3, 10, 11, 20, 21, 22], dtype=float).reshape(-1, 1)

model = DBSCAN(eps=2.0, min_samples=1)
labels = model.fit_predict(values)

print(labels)

You can then group the original values by label:

python
clusters = {}
for label, value in zip(labels, values.flatten()):
    clusters.setdefault(int(label), []).append(float(value))

print(clusters)

With min_samples=1, every point belongs to some cluster. If you increase min_samples, isolated points may receive the noise label -1.
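To make the effect of min_samples concrete, here is a small sketch that runs DBSCAN twice on the same data. The extra value 40 is an assumption added purely for illustration, so that there is one isolated point to observe.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same values as before, plus one isolated point (40) added for illustration.
values_with_outlier = np.array(
    [1, 2, 3, 10, 11, 20, 21, 22, 40], dtype=float
).reshape(-1, 1)

# min_samples=1: every point is a core point, so 40 becomes its own cluster.
labels_min1 = DBSCAN(eps=2.0, min_samples=1).fit_predict(values_with_outlier)

# min_samples=2: a point needs at least one other point within eps
# (scikit-learn counts the point itself), so 40 is labeled -1.
labels_min2 = DBSCAN(eps=2.0, min_samples=2).fit_predict(values_with_outlier)

print(labels_min1)
print(labels_min2)
```

The only label that changes between the two runs is the one for 40: a singleton cluster in the first run, noise in the second.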

Why DBSCAN is often better than k-means here

k-means requires you to choose the number of clusters in advance. If the real question is "group points that are close together," that requirement is often awkward.

DBSCAN matches the idea of proximity more directly because you specify:

  • how close points must be, via eps
  • how many nearby points are needed to form a dense region, via min_samples

That makes it especially useful when the number of groups is unknown beforehand.
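The contrast can be sketched by running both algorithms on the earlier values. The choice of n_clusters=2 is a deliberately wrong guess made for illustration, and random_state=0 is only there to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

values = np.array([1, 2, 3, 10, 11, 20, 21, 22], dtype=float).reshape(-1, 1)

# k-means returns exactly as many clusters as requested; guessing k=2 here
# forces two of the three visibly separate groups to share a label.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(values)

# DBSCAN is given only a distance rule and finds the group count itself.
dbscan_labels = DBSCAN(eps=2.0, min_samples=1).fit_predict(values)

print("k-means clusters:", len(set(kmeans_labels.tolist())))
print("DBSCAN clusters:", len(set(dbscan_labels.tolist())))
```

k-means can of course be rerun with n_clusters=3, but that requires already knowing the answer that DBSCAN derives from eps.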

Practical example with noisy data

Here is a slightly more realistic example:

python
import numpy as np
from sklearn.cluster import DBSCAN

raw = np.array([0.0, 0.3, 0.6, 5.0, 5.2, 9.8, 50.0]).reshape(-1, 1)

model = DBSCAN(eps=0.5, min_samples=2)
labels = model.fit_predict(raw)

for value, label in zip(raw.flatten(), labels):
    print(f"value={value:.1f}, label={label}")

With those parameters, 0.0, 0.3, and 0.6 form one cluster and 5.0 and 5.2 form another, while the isolated points 9.8 and 50.0 have no neighbor within eps and receive the noise label -1.

That behavior is often exactly what people want when they say "cluster by proximity."
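Once labels are available, it is usually convenient to separate real clusters from noise. The helper below is a hypothetical sketch in plain Python (split_clusters_and_noise is not a scikit-learn function). The labels are hard-coded to what eps=0.5 and min_samples=2 should assign to these values, so the snippet runs without scikit-learn.

```python
from collections import defaultdict


def split_clusters_and_noise(values, labels):
    """Return ({label: [values]}, [noise values]) from DBSCAN-style labels."""
    clusters = defaultdict(list)
    noise = []
    for value, label in zip(values, labels):
        if label == -1:  # scikit-learn's DBSCAN uses -1 for noise
            noise.append(value)
        else:
            clusters[label].append(value)
    return dict(clusters), noise


# Labels hard-coded for illustration: two clusters and two noise points.
sample_values = [0.0, 0.3, 0.6, 5.0, 5.2, 9.8, 50.0]
sample_labels = [0, 0, 0, 1, 1, -1, -1]

clusters, noise = split_clusters_and_noise(sample_values, sample_labels)
print(clusters)
print(noise)
```

Keeping noise in its own list avoids accidentally treating -1 as just another cluster id in later processing.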

When not to use machine-learning clustering

If the rule is simply "start a new group whenever the sorted gap exceeds X," then a manual grouping function is often the best answer. It is faster to explain, easier to test, and does not introduce tuning parameters that are irrelevant to the real requirement.

Machine-learning clustering earns its keep when:

  • the data has more than one dimension
  • you want density-based grouping
  • you want noise detection
  • the cluster count is unknown
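As a sketch of the multi-dimensional case, the same DBSCAN call works unchanged on two feature columns. The points below are made up for illustration: two tight groups and one distant outlier.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two tight groups and one far-away outlier.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0],
    [50.0, 50.0],
])

# eps is measured as Euclidean distance by default.
labels_2d = DBSCAN(eps=1.5, min_samples=2).fit_predict(points)
print(labels_2d)
```

A sorted gap rule has no direct equivalent here, because points in two dimensions have no single sort order to scan.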

Common Pitfalls

The biggest mistake is reaching for a full clustering library before defining what "close" should mean. If the business rule is a simple gap threshold on sorted numbers, a custom grouping function may be the clearest solution.

Another issue is feeding one-dimensional values into scikit-learn without reshaping them into a two-dimensional array. scikit-learn estimators expect input of shape (n_samples, n_features), and a plain Python list has no reshape method, so convert it first, for example with np.asarray(values).reshape(-1, 1).
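A minimal sketch of the conversion, assuming the values start out as a plain Python list:

```python
import numpy as np

values = [1, 2, 3, 10, 11]  # a plain Python list has no reshape method

# Convert to an array first, then add the single feature column.
X = np.asarray(values, dtype=float).reshape(-1, 1)
print(X.shape)

# ravel() flattens back to one dimension when the raw values are needed again.
flat = X.ravel()
print(flat.shape)
```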

Developers also misuse k-means for problems that are really about distance connectivity. k-means optimizes centroids, not "cluster anything within this proximity radius."

Finally, DBSCAN parameters matter. An eps that is too small will fragment groups; an eps that is too large will merge clusters that should stay separate.
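The sensitivity to eps can be sketched by sweeping a few values over the earlier data and counting the resulting clusters. The three eps values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([1, 2, 3, 10, 11, 20, 21, 22], dtype=float).reshape(-1, 1)

cluster_counts = {}
for eps in (0.5, 2.0, 10.0):
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(values)
    # With min_samples=1 there is no noise, so every label is a cluster.
    cluster_counts[eps] = len(set(labels.tolist()))

print(cluster_counts)
```

With eps=0.5 every value ends up alone, with eps=2.0 the three expected groups appear, and with eps=10.0 the gaps of 7 and 9 between groups are bridged and everything merges into one cluster.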

Summary

  • For simple one-dimensional data, a sorted gap-threshold rule may be enough.
  • DBSCAN is a strong choice when clusters should be defined by proximity rather than a fixed cluster count.
  • One-dimensional scikit-learn input should usually be reshaped to (n, 1).
  • Use eps to control neighborhood distance and min_samples to control density.
  • Pick the simplest method that matches the real definition of "close" in your problem.
