Finding K-nearest neighbors and its implementation

Introduction

K-nearest neighbors, usually shortened to KNN, is one of the simplest machine learning algorithms to understand and implement. Instead of learning explicit model parameters, it stores the training examples and predicts by comparing a new sample to the closest known samples.

How KNN works

KNN is an instance-based algorithm. During training, it mostly just keeps the dataset. During prediction, it computes the distance between the new point and every labeled point in the training set, picks the k closest neighbors, and uses them to make a decision.

For classification, the model usually takes a majority vote. For regression, it averages the neighbor values.
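The regression case can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `knn_regress` and the toy data are assumptions made for the example, and `math.dist` (Python 3.8+) supplies the Euclidean distance.

```python
import math


def knn_regress(train_points, train_targets, query, k=3):
    """Predict a numeric value as the mean target of the k nearest points."""
    # Pair each training point's distance to the query with its target value,
    # then sort so the closest points come first.
    scored = sorted(
        (math.dist(query, point), target)
        for point, target in zip(train_points, train_targets)
    )
    nearest = [target for _, target in scored[:k]]
    return sum(nearest) / len(nearest)


# Hypothetical one-dimensional data: the target is roughly 10 times the feature.
points = [[1.0], [2.0], [3.0], [10.0]]
targets = [10.0, 20.0, 30.0, 100.0]

prediction = knn_regress(points, targets, [2.2], k=3)
print(prediction)  # mean of 20.0, 30.0 and 10.0
```

The far-away point at 10.0 has no influence here because it is not among the three nearest neighbors.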

Suppose we have points labeled red or blue. If the three nearest neighbors to a new point are red, red, and blue, then a KNN classifier with k = 3 predicts red.

Distance metrics matter

The distance function defines what "near" means. Euclidean distance is the common default for continuous numeric features.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean([1.0, 2.0], [4.0, 6.0]))
```

Other distance metrics such as Manhattan distance can work better in some problems. The correct choice depends on the data. If your features use different scales, distance calculations can become misleading, which is why feature scaling is often essential.
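For comparison, Manhattan distance sums the absolute differences per coordinate instead of squaring them. The sketch below puts both metrics side by side on the same pair of points; the helper names are chosen for this example only.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


p, q = [1.0, 2.0], [4.0, 6.0]
print(manhattan(p, q))  # 3 + 4
print(euclidean(p, q))  # sqrt(9 + 16)
```

Because Euclidean distance squares each difference, a single large coordinate gap weighs more heavily than it does under Manhattan distance.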

A small implementation in Python

Here is a simple KNN classifier from scratch.

```python
from collections import Counter
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class KNNClassifier:
    def __init__(self, k=3):
        self.k = k
        self.features = []
        self.labels = []

    def fit(self, features, labels):
        # "Training" only stores the data; the real work happens at prediction time.
        self.features = features
        self.labels = labels

    def predict_one(self, point):
        # Distance from the query point to every stored training point.
        distances = []
        for features, label in zip(self.features, self.labels):
            distances.append((euclidean(point, features), label))

        # Keep the k closest points and return their most common label.
        distances.sort(key=lambda item: item[0])
        nearest_labels = [label for _, label in distances[: self.k]]
        return Counter(nearest_labels).most_common(1)[0][0]

    def predict(self, points):
        return [self.predict_one(point) for point in points]


X_train = [
    [1.0, 1.0],
    [1.5, 1.2],
    [4.0, 4.5],
    [5.0, 5.0],
]
y_train = ["red", "red", "blue", "blue"]

model = KNNClassifier(k=3)
model.fit(X_train, y_train)
print(model.predict([[1.2, 1.1], [4.8, 4.7]]))  # ['red', 'blue']
```

This implementation is intentionally small so the logic is visible: compute distances, sort, keep the first k, and vote.
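Sorting every distance is more work than the vote needs, since only the k smallest matter. One possible variation, sketched here as a standalone function rather than a change to the class, uses the standard library's `heapq.nsmallest` to select the k closest points without a full sort. The function name and the reuse of the same toy data are choices made for this example.

```python
import heapq
import math
from collections import Counter


def predict_one(train_points, train_labels, query, k=3):
    """Vote among the k nearest points without sorting the full list."""
    nearest = heapq.nsmallest(
        k,
        zip(train_points, train_labels),
        key=lambda pair: math.dist(query, pair[0]),  # distance to the query
    )
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


X_train = [[1.0, 1.0], [1.5, 1.2], [4.0, 4.5], [5.0, 5.0]]
y_train = ["red", "red", "blue", "blue"]

result = predict_one(X_train, y_train, [1.2, 1.1], k=3)
print(result)
```

Every training point is still visited once, so this trims the sorting overhead but does not change the fact that each query scans the whole dataset.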

Choosing a good value of `k`

A small k makes the model sensitive to noise. A large k smooths the decision boundary but may blur meaningful local structure. There is no universal best value, so teams usually choose k with validation data or cross-validation.
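As a rough sketch of choosing k by validation, the code below scores a few candidate values with leave-one-out cross-validation: each point is predicted from all the others, and the fraction of correct predictions is the score. The dataset is made up for illustration (two clusters plus one deliberately mislabeled point), and the helper names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(points, labels), key=lambda pair: math.dist(query, pair[0]))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    correct = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_points, rest_labels, point, k) == label:
            correct += 1
    return correct / len(points)


# Toy data: a red cluster, a blue cluster, and one "blue" point sitting
# inside the red cluster to act as label noise.
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
     [4.0, 4.0], [4.2, 3.9], [3.8, 4.1], [4.1, 4.3], [1.0, 1.2]]
y = ["red", "red", "red", "red", "blue", "blue", "blue", "blue", "blue"]

scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first k with the highest accuracy
print(scores, best_k)
```

On this toy data, k = 1 is misled by the noisy point while larger values vote it down, which is the sensitivity to noise described above. On real data, the same comparison would normally run on a held-out split or k-fold cross-validation.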

Odd values are often used in binary classification because they rule out tied votes. Ties can still happen when there are more than two classes, or when several training points are equally distant from the query at the k-th position.
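When a tie does occur, the classifier needs some rule to break it. The sketch below shows one possible rule, assumed here for illustration: drop the farthest neighbor and vote again until one label leads. Other rules, such as weighting votes by distance, are equally valid.

```python
from collections import Counter


def vote(nearest_labels):
    """Majority vote over labels ordered nearest-first.

    On a tie, discard the farthest remaining neighbor and vote again,
    so the decision falls back to the closer points.
    """
    labels = list(nearest_labels)
    while labels:
        counts = Counter(labels).most_common()
        # A single label, or a strict lead for the top label, settles the vote.
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        labels.pop()  # tie: drop the farthest neighbor
    raise ValueError("no neighbors to vote on")


print(vote(["red", "blue", "blue", "red"]))  # 2-2 tie, retried with three labels
print(vote(["a", "b", "c"]))                 # three-way tie, ends at the nearest
```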

Why scaling is important

Imagine one feature is age and another is annual income. If age ranges from 18 to 70 and income ranges from 20,000 to 200,000, the income feature will dominate Euclidean distance unless you normalize the data, because a difference of a few thousand in income outweighs even the largest possible age gap.

A common preprocessing step is standardization or min-max scaling. Without it, KNN may produce poor predictions even when the implementation is otherwise correct.
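To make the effect concrete, here is a small sketch of standardization using only the standard library. The three (age, income) rows are invented for the example, and `standardize` is a hypothetical helper, not a library function.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Hypothetical (age, annual income) rows.
people = [[25, 40_000], [60, 42_000], [27, 90_000]]

# Raw distances: the 35-year age gap barely registers next to income.
raw_age_gap = math.dist(people[0], people[1])
raw_income_gap = math.dist(people[0], people[2])
print(raw_age_gap, raw_income_gap)

# After standardization both gaps are measured in standard deviations.
scaled = standardize(people)
scaled_age_gap = math.dist(scaled[0], scaled[1])
scaled_income_gap = math.dist(scaled[0], scaled[2])
print(scaled_age_gap, scaled_income_gap)
```

In a real pipeline the means and standard deviations would be computed on the training data only and then reused to transform new points.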

Efficient implementations

The simple algorithm checks every training point for every prediction, which is easy to understand but not always fast. On large datasets, spatial indexes such as KD-trees and ball trees, or approximate nearest-neighbor search over a vector index, can reduce query time.

If you want a production-ready implementation, scikit-learn is usually the better choice.

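```python
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as in the from-scratch example.
X_train = [[1.0, 1.0], [1.5, 1.2], [4.0, 4.5], [5.0, 5.0]]
y_train = ["red", "red", "blue", "blue"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[1.2, 1.1]]))
```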

Common Pitfalls

The most common problem is forgetting feature scaling. Distance-based algorithms are extremely sensitive to mismatched feature ranges.

Another issue is choosing k arbitrarily without validation. A value that looks reasonable on paper may perform poorly on real data.

It is also easy to underestimate prediction cost. KNN shifts work from training time to query time, so it can become slow when the dataset is large and predictions are frequent.

Summary

KNN predicts by comparing a new sample to the closest labeled training samples.
Classification uses neighbor votes, while regression usually uses averaging.
The distance metric and feature scaling strongly influence model quality.
A small from-scratch implementation is useful for learning, but optimized libraries are better for real workloads.
Choose k with validation rather than guessing.
