Machine Learning
Nearest-Neighbor Algorithm
Data Classification
Python Programming
k-Nearest Neighbors

How can I classify data with the nearest-neighbor algorithm using Python?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

k-Nearest Neighbors is one of the easiest classification algorithms to implement, but good results still depend on preprocessing and validation discipline. Because KNN predicts based on distance, feature scale and class balance strongly affect performance. A robust approach builds a pipeline with scaling, cross-validation, and metric reporting rather than relying on one quick fit call.

Build a Reliable Baseline

Start with a train and test split and a simple KNN model. Use stratified splitting for classification tasks so class distribution is preserved.

python
1from sklearn.datasets import load_iris
2from sklearn.model_selection import train_test_split
3from sklearn.neighbors import KNeighborsClassifier
4from sklearn.metrics import accuracy_score
5
6X, y = load_iris(return_X_y=True)
7
8X_train, X_test, y_train, y_test = train_test_split(
9    X,
10    y,
11    test_size=0.2,
12    random_state=42,
13    stratify=y,
14)
15
16knn = KNeighborsClassifier(n_neighbors=5)
17knn.fit(X_train, y_train)
18
19pred = knn.predict(X_test)
20print("accuracy:", accuracy_score(y_test, pred))

This gives a baseline but should not be final in production.

Scale Features Before Distance-Based Learning

Distance metrics are sensitive to feature magnitude. A feature with larger scale can dominate neighbor selection even when less informative.

Use a pipeline with standard scaling.

python
1from sklearn.pipeline import Pipeline
2from sklearn.preprocessing import StandardScaler
3
4pipe = Pipeline([
5    ("scaler", StandardScaler()),
6    ("knn", KNeighborsClassifier(n_neighbors=5)),
7])
8
9pipe.fit(X_train, y_train)
10print("scaled accuracy:", pipe.score(X_test, y_test))

Pipelines also reduce training-serving skew by ensuring the same transforms are applied at inference.

Tune Hyperparameters Systematically

KNN quality depends on n_neighbors, distance metric, and weighting rule. Tune with cross-validation instead of fixed defaults.

python
1from sklearn.model_selection import GridSearchCV
2
3param_grid = {
4    "knn__n_neighbors": [3, 5, 7, 11, 15],
5    "knn__weights": ["uniform", "distance"],
6    "knn__metric": ["minkowski", "manhattan"],
7}
8
9search = GridSearchCV(
10    estimator=pipe,
11    param_grid=param_grid,
12    cv=5,
13    scoring="accuracy",
14    n_jobs=-1,
15)
16
17search.fit(X_train, y_train)
18print("best params:", search.best_params_)
19print("best cv score:", search.best_score_)

Use the best estimator for final holdout evaluation.

Evaluate Beyond Accuracy

Accuracy can hide poor performance on minority classes. Report confusion matrix and class-level metrics.

python
1from sklearn.metrics import confusion_matrix, classification_report
2
3best = search.best_estimator_
4y_pred = best.predict(X_test)
5
6print(confusion_matrix(y_test, y_pred))
7print(classification_report(y_test, y_pred))

For imbalanced datasets, focus on precision, recall, and F1 rather than accuracy alone.

Performance and Deployment Considerations

KNN has low training cost but can have high prediction cost because inference compares each sample against many training points.

For larger datasets:

  • reduce feature dimensions
  • limit training set size with representative sampling
  • benchmark latency under realistic request volume

Save the final pipeline and reuse it consistently.

python
1import joblib
2
3joblib.dump(best, "knn_model.joblib")
4loaded = joblib.load("knn_model.joblib")
5print(loaded.predict(X_test[:5]))

This keeps preprocessing and model parameters synchronized between training and serving.

Data Leakage Prevention

Always split data before fitting scalers or feature transforms. Fitting transformations on full dataset leaks test information and inflates metrics. Pipeline plus cross-validation helps avoid this, but you still need consistent experiment scripts and seed control. Keep training logs with data split version and parameter settings so results are reproducible.

When KNN Is a Good Fit

KNN works well when:

  • dataset is moderate in size
  • decision boundary is nonlinear
  • feature engineering is available
  • interpretability of local neighborhood behavior is useful

KNN may be less suitable when low-latency inference on very large datasets is required.

Common Pitfalls

A common pitfall is training KNN without scaling numerical features. Another is picking k by intuition without validation. Teams often report only accuracy and miss class-specific failures. Inference cost is frequently ignored until deployment load tests. Finally, inconsistent preprocessing between notebook experiments and production code causes unexplained prediction drift. KNN looks simple enough to skip rigor, which is exactly why these mistakes show up so often.

Summary

  • Start with a simple KNN baseline and stratified split.
  • Scale features before fitting distance-based models.
  • Tune k, metric, and weighting with cross-validation.
  • Evaluate with class-level metrics, not only accuracy.
  • Benchmark prediction latency for realistic dataset sizes.
  • Deploy the full preprocessing-plus-model pipeline as one artifact.

Course illustration
Course illustration

All Rights Reserved.