Implementing PCA with NumPy
Introduction
Principal Component Analysis, or PCA, reduces dimensionality by projecting data onto directions of greatest variance. Implementing PCA with NumPy is a useful exercise because it shows the full workflow directly: center the data, compute a covariance structure, extract principal directions, and project the original samples into the new coordinate system.
The Core Steps
For a data matrix where rows are samples and columns are features, PCA usually follows this sequence:
- center the features by subtracting column means
- compute the covariance matrix
- find eigenvalues and eigenvectors
- sort components by descending eigenvalue
- project the data onto the top components
Each step is short in NumPy, but the order matters.
A Working NumPy Implementation
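A NumPy implementation should use np.linalg.eigh rather than the generic np.linalg.eig routine. The covariance matrix is symmetric, and eigh exploits that symmetry: it returns real eigenvalues in ascending order together with orthonormal eigenvectors stored as columns.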
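As a minimal sketch of the steps listed earlier, one possible implementation looks like this. The function name `pca`, the parameter `n_components`, and the synthetic data are assumptions chosen for illustration; the four return values are named to match the discussion in the following sections.

```python
import numpy as np

def pca(X, n_components):
    """PCA via covariance eigendecomposition (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)

    # 1. Center each feature by subtracting its column mean.
    mean = X.mean(axis=0)
    X_centered = X - mean

    # 2. Feature covariance matrix; rowvar=False treats columns as variables.
    cov = np.cov(X_centered, rowvar=False)

    # 3. eigh is meant for symmetric matrices and returns eigenvalues in
    #    ascending order, with eigenvectors as the columns of the second output.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Reorder so the largest-variance direction comes first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # 5. Keep the top directions and project the centered data onto them.
    components = eigenvectors[:, :n_components]
    transformed = X_centered @ components

    return transformed, components, eigenvalues, mean


# Synthetic example: 200 samples, 3 correlated features (illustrative data only).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

transformed, components, eigenvalues, mean = pca(X, n_components=2)
print(transformed.shape)  # (200, 2)
```

In this sketch `eigenvalues` holds all eigenvalues in descending order, not only the selected ones, so that variance ratios can be computed against the total later.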
Why Centering Matters
PCA looks for directions of variance around the mean. If you skip centering, the first component can be dominated by the absolute magnitude of the features rather than their variation around the dataset's center.
That is why subtracting the feature means is not optional. It is part of the definition of the method in standard PCA workflows.
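A small experiment can make the effect visible. The sketch below uses synthetic data (an assumption for illustration) whose spread is mostly along the first feature but whose mean sits far from the origin along the second. The helper `top_direction` is a hypothetical name, not a NumPy function.

```python
import numpy as np

rng = np.random.default_rng(0)
# Standard deviation 2.0 along feature 0 and 0.5 along feature 1,
# with the whole cloud shifted to a mean of about 50 on feature 1.
X = rng.normal(size=(500, 2)) * np.array([2.0, 0.5]) + np.array([0.0, 50.0])

def top_direction(M):
    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector with the largest eigenvalue.
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]

n = X.shape[0]

# Without centering: second-moment matrix around the origin.
uncentered = top_direction(X.T @ X / (n - 1))

# With centering: the actual covariance matrix.
Xc = X - X.mean(axis=0)
centered = top_direction(Xc.T @ Xc / (n - 1))

print("uncentered:", np.round(np.abs(uncentered), 3))
print("centered:  ", np.round(np.abs(centered), 3))
```

Without centering, the leading direction points almost entirely along feature 1, toward the mean. After centering, it points along feature 0, where the data actually varies. Absolute values are printed because the sign of an eigenvector is arbitrary.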
Interpreting the Results
The returned pieces have different roles:
- `transformed` is the data in the lower-dimensional component space
- `components` contains the principal directions
- `eigenvalues` measure how much variance each component explains
- `mean` is needed if you later want to transform new data consistently
If you want explained variance ratios, divide each selected eigenvalue by the sum of all eigenvalues.
This tells you how much of the dataset's variance is preserved by the chosen components.
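As a short sketch of that calculation, assume the eigenvalues are already sorted in descending order. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical eigenvalues of a 3-feature covariance matrix, sorted descending.
eigenvalues = np.array([4.0, 1.5, 0.5])
n_components = 2

# Share of total variance captured by each selected component.
explained_variance_ratio = eigenvalues[:n_components] / eigenvalues.sum()

# Running total: variance preserved by the first k components together.
cumulative = np.cumsum(explained_variance_ratio)

print(explained_variance_ratio)
print(cumulative)
```

Here the total variance is 6.0, so the first component explains 4.0 / 6.0 (about 67%) and the first two together explain 5.5 / 6.0 (about 92%).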
Covariance PCA vs SVD
For learning, covariance plus eigendecomposition is clear. In production numerical work, many libraries implement PCA through Singular Value Decomposition, which can be more stable and efficient, especially for high-dimensional data.
That does not make the covariance approach wrong. It just means NumPy PCA by eigendecomposition is best viewed as the transparent educational implementation.
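For comparison, here is a hedged sketch of the SVD route using np.linalg.svd. The function name `pca_svd` and the test data are assumptions for illustration. This is not a reproduction of any particular library's internals.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD of the centered data (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean

    # Economy-size SVD: Xc = U @ diag(S) @ Vt, singular values sorted descending.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Rows of Vt are the principal directions; transpose so they become columns.
    components = Vt[:n_components].T

    # Squared singular values, scaled by (n - 1), equal the covariance eigenvalues.
    eigenvalues = S**2 / (X.shape[0] - 1)

    transformed = Xc @ components
    return transformed, components, eigenvalues, mean


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
transformed, components, eigenvalues, mean = pca_svd(X, n_components=2)

# Cross-check against the covariance eigenvalues (eigvalsh returns ascending order).
cov_eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.allclose(eigenvalues, cov_eigenvalues))
```

The SVD version never forms the covariance matrix explicitly, which is one reason it tends to behave better numerically. The two approaches agree on the eigenvalues, while individual components may differ in sign.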
Scaling Before PCA
Centering is required, but scaling is a separate decision. If one feature is measured on a much larger numeric scale than another, PCA may mostly reflect that scale difference instead of meaningful shared structure.
For that reason, many workflows standardize features to unit variance before PCA, especially when columns represent different physical units. Whether you should do that depends on the meaning of the raw feature magnitudes in your problem.
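If you do decide to standardize, a minimal sketch might look like the following. The helper name `standardize` and the example features (a length in millimetres next to a ratio) are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std, mean, std


rng = np.random.default_rng(0)
# Two features on very different numeric scales (synthetic data).
length_mm = rng.normal(loc=5000.0, scale=800.0, size=300)
ratio = 0.0001 * length_mm + rng.normal(scale=0.05, size=300)
X = np.column_stack([length_mm, ratio])

Z, mean, std = standardize(X)

print(np.cov(X, rowvar=False).round(4))  # dominated by the large-scale feature
print(np.cov(Z, rowvar=False).round(4))  # equals the correlation matrix of X
```

After standardization the covariance matrix is the correlation matrix, so PCA on `Z` weighs each feature equally regardless of its original units. Keep `mean` and `std` so that new data can be scaled the same way.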
Common Pitfalls
- Forgetting to center the data before computing the covariance matrix.
- Mixing up rows and columns and computing covariance over the wrong axis.
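- Using the eigenvectors without sorting them by descending eigenvalue. np.linalg.eigh returns eigenvalues in ascending order, so the first columns are the lowest-variance directions.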
- Expecting principal components to be meaningful without considering feature scaling.
- Recomputing means separately for training and prediction data instead of reusing the original training mean.
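The last pitfall is easy to avoid by separating fitting from transforming. The sketch below uses hypothetical helper names `fit_pca` and `transform` and synthetic data. The key point is that `transform` subtracts the stored training mean and never recomputes one from the new data.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Learn the principal directions and the feature means from training data."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train - mean, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Indices of the n_components largest eigenvalues, largest first.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return eigenvectors[:, order], mean

def transform(X, components, mean):
    """Project data using the training mean and the training components."""
    return (np.asarray(X, dtype=float) - mean) @ components


rng = np.random.default_rng(1)
X_train = rng.normal(loc=10.0, size=(100, 4))
X_new = rng.normal(loc=10.0, size=(5, 4))

components, mean = fit_pca(X_train, n_components=2)
Z_new = transform(X_new, components, mean)  # reuses the training mean
print(Z_new.shape)  # (5, 2)
```

Centering `X_new` with its own mean would shift the new samples relative to the training samples in component space, so the two projections would no longer be comparable.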
Summary
- PCA with NumPy is a sequence of centering, covariance calculation, eigendecomposition, sorting, and projection.
- `np.linalg.eigh` is appropriate for the symmetric covariance matrix.
- Centering is essential, not optional.
- Eigenvalues tell you how much variance each component explains.
- For production pipelines, library PCA implementations often use SVD, but the NumPy version is excellent for understanding the method.

