Clustering values by their proximity in Python machine learning?
Introduction
If you want to group values by how close they are, the right tool depends on what "close" means in your problem. For simple one-dimensional data, a sorted threshold rule may be enough, while more general clustering problems often fit a distance-based algorithm such as DBSCAN.
Start by deciding what kind of clustering you need
Many questions about "clustering by proximity" really mean one of two tasks:
- group nearby values in a sorted one-dimensional list
- run a general clustering algorithm that can handle arbitrary shapes and noise
For one-dimensional numeric values, a manual threshold can be perfectly valid and easier to explain than a full machine-learning algorithm.
Suppose values that differ by at most 2 should belong to the same group:
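A minimal sketch of such a rule is shown here. The function name group_by_gap and the sample values are illustrative assumptions, not part of any library. The sketch compares each value only with the previous one in sorted order, so a group can span more than 2 in total as long as no single gap exceeds the threshold.

```python
def group_by_gap(values, max_gap=2):
    """Group values so that consecutive sorted values differ by at most max_gap."""
    groups = []
    for value in sorted(values):
        # Extend the current group if the gap to its last member is small enough.
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)
        else:
            # Otherwise the gap is too large (or this is the first value): new group.
            groups.append([value])
    return groups


values = [1, 2, 4, 10, 11, 13, 20]  # hypothetical sample data
groups = group_by_gap(values, max_gap=2)
print(groups)  # [[1, 2, 4], [10, 11, 13], [20]]
```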
This produces intuitive groups when the data is already one-dimensional and the gap rule is clear.
Use DBSCAN when the distance rule should define the clusters
If you want a proper clustering algorithm based on distance, DBSCAN is often a strong fit. It groups points that are within an eps radius of one another and can also mark isolated points as noise.
For one-dimensional values in scikit-learn, reshape the data into a two-dimensional array with one feature column:
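The following is a small sketch of that step, assuming NumPy and scikit-learn are installed. The sample values and the choice of eps=2.0 with min_samples=1 are illustrative assumptions, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([1.0, 2.0, 3.5, 10.0, 11.0, 12.5, 20.0])  # hypothetical data

# scikit-learn expects shape (n_samples, n_features); here there is one feature.
X = values.reshape(-1, 1)

# eps is the neighborhood radius; min_samples=1 means no point is left as noise.
model = DBSCAN(eps=2.0, min_samples=1)
labels = model.fit_predict(X)

print(X.shape)  # (7, 1)
print(labels)   # [0 0 0 1 1 1 2]
```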
You can then group the original values by label:
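One possible way to do this is to collect the values into a dictionary keyed by cluster label. The data and parameters below are the same illustrative assumptions as before, repeated so the snippet runs on its own.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import DBSCAN

values = [1.0, 2.0, 3.5, 10.0, 11.0, 12.5, 20.0]  # hypothetical data
X = np.array(values).reshape(-1, 1)
labels = DBSCAN(eps=2.0, min_samples=1).fit_predict(X)

# Map each cluster label to the original values assigned to it.
clusters = defaultdict(list)
for value, label in zip(values, labels):
    clusters[int(label)].append(value)

print(dict(clusters))
# {0: [1.0, 2.0, 3.5], 1: [10.0, 11.0, 12.5], 2: [20.0]}
```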
With min_samples=1, every point belongs to some cluster. If you increase min_samples, isolated points may receive the noise label -1.
Why DBSCAN is often better than k-means here
k-means requires you to choose the number of clusters in advance. If the real question is "group points that are close together," that requirement is often awkward.
DBSCAN matches the idea of proximity more directly because you specify:
- how close points must be, via eps
- how many nearby points are needed to form a dense region, via min_samples
That makes it especially useful when the number of groups is unknown beforehand.
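To make the contrast concrete, here is a rough sketch that runs both algorithms on the same made-up values. The data, n_clusters=2, eps=0.6, and min_samples=2 are all assumptions chosen for illustration. With only two clusters allowed, k-means is expected to give the distant outlier its own cluster and merge the two nearby groups, while DBSCAN separates the groups and flags the outlier.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

values = [0.0, 0.2, 0.3, 5.0, 5.1, 5.3, 50.0]  # hypothetical data with one outlier
X = np.array(values).reshape(-1, 1)

# k-means needs the cluster count up front; 2 is a guess made for this sketch.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs a radius and a density threshold instead of a cluster count.
dbscan_labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)

# k-means label numbering is arbitrary, so compare groupings rather than ids.
print("k-means:", kmeans_labels)
print("DBSCAN: ", dbscan_labels)  # [ 0  0  0  1  1  1 -1]
```

Asking k-means for three clusters would separate the groups in this case, but that only works if the right count is known in advance.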
Practical example with noisy data
Here is a slightly more realistic example:
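The sketch below uses made-up values and assumed parameters (eps=0.6, min_samples=3). Note that in scikit-learn, min_samples counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a group near 0, a group near 5, and one distant value.
values = np.array([0.0, 0.2, 0.3, 0.5, 5.0, 5.1, 5.3, 50.0])
X = values.reshape(-1, 1)

# A point is a core point if at least 3 points (itself included) lie within 0.6.
labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(X)

for label in sorted(set(labels.tolist())):
    members = values[labels == label].tolist()
    name = "noise" if label == -1 else f"cluster {label}"
    print(name, members)
# noise [50.0]
# cluster 0 [0.0, 0.2, 0.3, 0.5]
# cluster 1 [5.0, 5.1, 5.3]
```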
With those parameters, values around 0 and 5 can form clusters, while a point like 50.0 may be treated as noise.
That behavior is often exactly what people want when they say "cluster by proximity."
When not to use machine-learning clustering
If the rule is simply "start a new group whenever the sorted gap exceeds X," then a manual grouping function is often the best answer. It is faster to explain, easier to test, and does not introduce tuning parameters that are irrelevant to the real requirement.
Machine-learning clustering earns its keep when:
- the data has more than one dimension
- you want density-based grouping
- you want noise detection
- the cluster count is unknown
Common Pitfalls
The biggest mistake is reaching for a full clustering library before defining what "close" should mean. If the business rule is a simple gap threshold on sorted numbers, a custom grouping function may be the clearest solution.
Another issue is feeding one-dimensional values into scikit-learn without reshaping them into a two-dimensional array. Many scikit-learn estimators expect shape (n_samples, n_features), so plain lists often need reshape(-1, 1).
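As a quick illustration, the sketch below passes a plain list to DBSCAN, which scikit-learn is expected to reject with a ValueError, and then repeats the call with reshaped input. The values and parameters are assumptions for demonstration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

values = [1.0, 2.0, 3.5, 10.0]  # hypothetical data

try:
    # A flat list is one-dimensional, so input validation should fail here.
    DBSCAN(eps=2.0, min_samples=1).fit(values)
    rejected = False
except ValueError as exc:
    rejected = True
    print("1-D input rejected:", type(exc).__name__)

# Reshape to (n_samples, 1) so each value is one sample with one feature.
X = np.asarray(values).reshape(-1, 1)
labels = DBSCAN(eps=2.0, min_samples=1).fit_predict(X)
print(X.shape, labels)  # (4, 1) [0 0 0 1]
```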
Developers also misuse k-means for problems that are really about distance connectivity. k-means optimizes centroids, not "cluster anything within this proximity radius."
Finally, DBSCAN parameters matter. An eps that is too small will fragment groups; an eps that is too large will merge clusters that should stay separate.
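A small experiment can make this sensitivity visible. The sketch below uses made-up values with two tight groups and counts the clusters found for three assumed eps settings. min_samples=1 is used so that no point is labeled as noise and the cluster count is simply the number of distinct labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two groups with internal spacing 0.4, separated by 3.2.
values = np.array([1.0, 1.4, 1.8, 5.0, 5.4, 5.8])
X = values.reshape(-1, 1)

cluster_counts = {}
for eps in (0.2, 0.5, 4.0):
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(X)
    cluster_counts[eps] = len(set(labels.tolist()))

print(cluster_counts)
# {0.2: 6, 0.5: 2, 4.0: 1}
```

Here eps=0.2 is smaller than the spacing inside each group, so every value becomes its own cluster, while eps=4.0 bridges the gap between the groups and merges everything.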
Summary
- For simple one-dimensional data, a sorted gap-threshold rule may be enough.
- DBSCAN is a strong choice when clusters should be defined by proximity rather than a fixed cluster count.
- One-dimensional scikit-learn input should usually be reshaped to (n, 1).
- Use eps to control neighborhood distance and min_samples to control density.
- Pick the simplest method that matches the real definition of "close" in your problem.

