Choose the closest k points from given n points
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Given n points in a d-dimensional space and a query point Q, finding the k closest points is a fundamental problem in computational geometry, machine learning (k-NN), recommendation systems, and spatial search. The naive approach computes all n distances and selects the k smallest, running in O(n log n) with sorting or O(n log k) with a max-heap. For repeated queries, spatial data structures like KD-trees and ball trees reduce query time to O(log n) on average.
Problem Definition
- Input: n points P =
{P1, P2, ..., Pn}, a query point Q, and an integer k - Output: The k points from P closest to Q
- Distance metric: Typically Euclidean, but the approach generalizes to Manhattan, Cosine, and others
Method 1: Sort All Distances — O(n log n)
The simplest approach — compute all distances, sort, take the first k:
Time: O(n log n). Space: O(n). Simple but does unnecessary work when k << n.
Method 2: Max-Heap — O(n log k)
Use a max-heap of size k. For each point, push it onto the heap and pop the farthest if the heap exceeds size k:
Time: O(n log k). Space: O(k). Optimal when k << n because the heap stays small.
Method 3: Quickselect — O(n) Average
Quickselect (partitioning) finds the k-th smallest element in O(n) average time:
Time: O(n) average, O(n^2) worst case. Space: O(1) (in-place). The result is unordered — sort the k results if needed.
Method 4: KD-Tree — O(log n) Per Query
For repeated queries on the same point set, build a KD-tree once in O(n log n), then query in O(log n):
scikit-learn Implementation
Comparison of Methods
| Method | Build Time | Query Time | Best When |
| Sort | — | O(n log n) | Single query, small n |
| Max-Heap | — | O(n log k) | Single query, k << n |
| Quickselect | — | O(n) avg | Single query, unordered result OK |
| KD-Tree | O(n log n) | O(log n) avg | Many queries on same data |
| Ball Tree | O(n log n) | O(log n) avg | High dimensions (d > 20) |
Handling High Dimensions
KD-trees degrade to O(n) in high dimensions (d > 20). Alternatives:
For very large datasets (millions of points), use approximate nearest neighbors:
Common Pitfalls
- Comparing squared vs actual distances: When only ranking points (not needing actual distances), skip the
sqrtoperation — squared distances preserve the ordering and are faster to compute. - KD-tree in high dimensions: KD-trees are efficient for d < 20 but degrade to brute-force speed in higher dimensions. Use ball trees, approximate methods (FAISS, Annoy), or locality-sensitive hashing instead.
- Mutable input with quickselect: Quickselect rearranges the input array in-place. Copy the array first (
points[:]) if you need to preserve the original order. - Tie-breaking: When multiple points have the same distance to the query, the choice of which k to return is arbitrary. Be consistent (e.g., use index as a tiebreaker) if determinism matters.
- Distance metric mismatch: Euclidean distance works well for numeric features on similar scales. For text, use cosine similarity. For geographic coordinates, use Haversine distance. Choosing the wrong metric gives meaningless results.
Summary
- For a single query: use a max-heap for O(n log k) or quickselect for O(n) average
- For repeated queries on the same data: build a KD-tree for O(log n) per query
- Use
scipy.spatial.KDTreeorsklearn.neighbors.NearestNeighborsfor production implementations - Skip
sqrtwhen only comparing distances — squared Euclidean is sufficient for ranking - For high dimensions (d > 20) or millions of points, use FAISS or approximate nearest neighbor libraries

