Decision boundaries for nearest centroid
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
In the realm of machine learning, classification tasks often require an understanding of how algorithms differentiate between data points. One method, the Nearest Centroid Classifier (NCC), leverages the notion of centroids to make these classifications. Let's explore how NCC operates and, in particular, how decision boundaries are established in this context.
Nearest Centroid Classifier Overview
The Nearest Centroid Classifier is a type of lazy learning algorithm used mainly for classification. It functions by computing the centroid (mean vector) of each class in the training dataset and assigning new data points to the class whose centroid they are closest to in terms of distance—usually Euclidean.
Centroids and Decision Boundaries
Centroid Calculation
For each class in the dataset, the centroid is calculated as:
where is the number of samples in class , and is an instance in that class. The centroid represents the average position of all points within a particular class.
Decision Boundary Construction
Decision boundaries define the regions in the feature space where the classification decision changes. For NCC, they are formed at the equidistant line (or hyperplane, in higher dimensions) between any two centroids.
For two centroids and , the decision boundary is composed of points for which the following holds:
Given Euclidean distance, this can be expanded and simplified into a linear equation in terms of and the centroids:
which simplifies to:
This is the equation of a hyperplane midway between the centroids, with a normal vector .
Example: 2D Feature Space
Consider a binary classification problem with two classes, represented in a 2D feature space. Suppose we have the following centroids:
• Class 1 centroid, • Class 2 centroid,
The equation of the decision boundary is:
Simplifying further:
Thus, is the normal vector, with our decision boundary being a line that determines how new points are classified based on proximity to these centroids.
Key Points Table
| Aspect | Explanation |
| Centroid Calculation | Average vector (mean) of class points: |
| Decision Boundary | Equidistant region between centroids in a feature space. |
| Distance Method | Typically Euclidean distance is used for simplicity and effectiveness. |
| Complexity | Efficient with low computational cost; scales well with data size. |
| Performance | Ideal for simple, well-separated data; may not capture complex patterns. |
Further Considerations
Advantages and Limitations
Advantages:
• Simplicity: NCC is easy to implement and understand. • Efficiency: Especially beneficial for large datasets with few classes.
Limitations:
• Assumes sphere-like distribution around centroids. May not perform well with elongated or skewed distributions. • Sensitive to outliers, as they can significantly affect the centroid calculation. • Does not inherently handle overlapping classes well.
Enhancements
To address the limitations of NCC, several strategies may be employed:
• Weighted Centroids: Prioritize certain features based on their significance to improve robustness. • Robust Mean Estimation: Use median or trimmed mean to compute centroids less sensitive to outliers. • Dimensionality Reduction: Pre-process data through PCA or LDA to focus classification on the most informative features.
Conclusion
The Nearest Centroid Classifier, with its straightforward approach, is particularly suitable for quick initial classifications and deployments. Understanding how it carves decision boundaries is crucial for leveraging its speed and simplicity while being aware of scenarios where its assumptions might be violated or its outputs need refinement.

