Implementing PCA with NumPy

Introduction

Principal Component Analysis, or PCA, reduces dimensionality by projecting data onto directions of greatest variance. Implementing PCA with NumPy is a useful exercise because it shows the full workflow directly: center the data, compute a covariance structure, extract principal directions, and project the original samples into the new coordinate system.

The Core Steps

For a data matrix where rows are samples and columns are features, PCA usually follows this sequence:

center the features by subtracting column means
compute the covariance matrix
find eigenvalues and eigenvectors
sort components by descending eigenvalue
project the data onto the top components

Each step is short in NumPy, but the order matters.
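
In matrix notation, the sequence above can be written as follows. The symbols are chosen here for illustration and do not come from a particular library: X is the n × d data matrix, \bar{x} is the vector of column means, and W_k holds the first k eigenvectors as columns. The 1/(n-1) factor matches the default normalization of np.cov.

```latex
\begin{aligned}
X_c &= X - \mathbf{1}\bar{x}^{\top}, \\
C &= \frac{1}{n-1}\, X_c^{\top} X_c, \\
C v_i &= \lambda_i v_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, \\
Z &= X_c W_k, \qquad W_k = [\, v_1 \;\cdots\; v_k \,].
\end{aligned}
```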

A Working NumPy Implementation

python

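import numpy as np


def pca(X, n_components):
    """Project X (rows are samples, columns are features) onto its top principal components."""
    X = np.asarray(X, dtype=float)

    # Center each feature on its column mean.
    mean = np.mean(X, axis=0)
    X_centered = X - mean

    # rowvar=False tells np.cov that columns, not rows, are the variables.
    covariance = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)

    # Reorder so the highest-variance direction comes first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Keep the leading eigenvectors (as columns) and project the centered data.
    components = eigenvectors[:, :n_components]
    transformed = X_centered @ components

    return transformed, components, eigenvalues[:n_components], mean


X = np.array([
    [2.5, 2.4],
    [0.5, 0.7],
    [2.2, 2.9],
    [1.9, 2.2],
    [3.1, 3.0],
    [2.3, 2.7],
    [2.0, 1.6],
    [1.0, 1.1],
    [1.5, 1.6],
    [1.1, 0.9],
])

reduced, components, values, mean = pca(X, n_components=1)
print(reduced[:3])  # first three samples in the 1-D component space
print(values)       # variance along the selected component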

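This version uses np.linalg.eigh because the covariance matrix is symmetric. eigh exploits that symmetry and returns real eigenvalues in ascending order, whereas the generic np.linalg.eig makes no ordering guarantee and can return complex values when rounding breaks the symmetry.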
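
One way to gain confidence in an implementation like this is to check a few properties that PCA should satisfy. The sketch below uses synthetic data and a mixing matrix invented for illustration. It checks that eigh returns ascending eigenvalues, that the directions are orthonormal, and that the projected columns are uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical synthetic data: 200 samples, 3 correlated features.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

X_centered = X - X.mean(axis=0)
covariance = np.cov(X_centered, rowvar=False)
raw_eigenvalues, raw_eigenvectors = np.linalg.eigh(covariance)

# eigh returns eigenvalues in ascending order, so reverse both arrays.
print(np.all(np.diff(raw_eigenvalues) >= 0))
eigenvalues = raw_eigenvalues[::-1]
eigenvectors = raw_eigenvectors[:, ::-1]

scores = X_centered @ eigenvectors

# 1. The principal directions are orthonormal.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(3)))

# 2. The variance of each score column equals the matching eigenvalue.
print(np.allclose(scores.var(axis=0, ddof=1), eigenvalues))

# 3. The score columns are uncorrelated: their covariance is diagonal.
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigenvalues)))
```

Note that ddof=1 is used for the variance so that it matches the default normalization of np.cov.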

Why Centering Matters

PCA looks for directions of variance around the mean. If you skip centering, the first component can be dominated by the absolute magnitude of the features rather than their variation around the dataset's center.

That is why subtracting the feature means is not optional. It is part of the definition of the method in standard PCA workflows.
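
The effect of skipping the centering step can be illustrated with a small experiment. This is a sketch on made-up data: the points sit far from the origin and vary mostly along the direction [1, -1]. The helper top_direction is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: spread along [1, -1], large offset along [1, 1].
direction = np.array([1.0, -1.0]) / np.sqrt(2.0)
t = rng.normal(scale=1.0, size=(500, 1))
noise = rng.normal(scale=0.1, size=(500, 2))
X = 100.0 + t * direction + noise


def top_direction(M):
    """Eigenvector for the largest eigenvalue of M^T M / (n - 1)."""
    second_moment = M.T @ M / (M.shape[0] - 1)
    _, vectors = np.linalg.eigh(second_moment)
    return vectors[:, -1]  # eigh sorts ascending, so the last column is the top one


uncentered = top_direction(X)
centered = top_direction(X - X.mean(axis=0))

mean_direction = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

# Absolute cosine similarity, because eigenvector signs are arbitrary.
print(abs(uncentered @ mean_direction))  # close to 1: follows the mean
print(abs(centered @ direction))         # close to 1: follows the spread
```

Without centering, the leading direction points from the origin toward the data's mean. With centering, it recovers the direction along which the data actually varies.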

Interpreting the Results

The returned pieces have different roles:

transformed is the data in the lower-dimensional component space
components contains the principal directions
eigenvalues measure how much variance each component explains
mean is needed if you later want to transform new data consistently

If you want explained variance ratios, divide each selected eigenvalue by the sum of all eigenvalues.

python

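# Total variance is the sum of all eigenvalues of the covariance matrix.
X_centered = X - np.mean(X, axis=0)
all_eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))
total_variance = np.sum(all_eigenvalues)
print(values / total_variance)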

This tells you how much of the dataset's variance is preserved by the chosen components.
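
A common use of these ratios is to pick the number of components from a cumulative variance target. The sketch below assumes synthetic data and a 95% threshold, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 300 samples, 4 features with decreasing spread.
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

X_centered = X - X.mean(axis=0)
# eigvalsh returns ascending eigenvalues; reverse to get descending order.
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]

ratios = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed target, chosen only for illustration
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(ratios)
print(cumulative)
print(n_components)
```

np.searchsorted works here because the cumulative sums are non-decreasing. It returns the first index at which the threshold is reached, and adding one converts that index into a component count.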

Covariance PCA vs SVD

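For learning, covariance plus eigendecomposition is clear. In production numerical work, many libraries implement PCA through Singular Value Decomposition (SVD) of the centered data matrix. This avoids forming the covariance matrix explicitly, a step that can lose precision because it squares the condition number of the data, and it can be more efficient, especially for high-dimensional data.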

That does not make the covariance approach wrong. It just means NumPy PCA by eigendecomposition is best viewed as the transparent educational implementation.
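
To see how the two routes relate, the following sketch computes PCA with np.linalg.svd and compares it against the covariance eigendecomposition. The data is synthetic and invented for this example. It relies on the identity that the squared singular values of the centered data, divided by n - 1, equal the covariance eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 100 samples, 4 features with well-separated spreads.
X = rng.normal(size=(100, 4)) * np.array([4.0, 2.0, 1.0, 0.5])

X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Thin SVD: X_centered = U @ diag(S) @ Vt, singular values sorted descending.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components_svd = Vt.T                     # columns are principal directions
eigenvalues_svd = S**2 / (n_samples - 1)  # matches np.cov's default ddof=1
scores_svd = U * S                        # same as X_centered @ components_svd

# Reference: covariance eigendecomposition, reordered to descending.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]

print(np.allclose(eigenvalues_svd, eigenvalues))
# Directions agree up to sign, so compare absolute values of the dot products.
print(np.allclose(np.abs(np.sum(components_svd * eigenvectors, axis=0)), 1.0))
print(np.allclose(scores_svd, X_centered @ components_svd))
```

The sign of each direction is arbitrary in both methods, so the two results can differ by a factor of -1 per component without either being wrong.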

Scaling Before PCA

Centering is required, but scaling is a separate decision. If one feature is measured on a much larger numeric scale than another, PCA may mostly reflect that scale difference instead of meaningful shared structure.

For that reason, many workflows standardize features to unit variance before PCA, especially when columns represent different physical units. Whether you should do that depends on the meaning of the raw feature magnitudes in your problem.
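
The sketch below illustrates the scale effect on made-up data: two correlated features whose numeric ranges differ by a factor of a million. The helper first_component is a hypothetical name used only here.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: two correlated features on very different numeric scales.
base = rng.normal(size=500)
feature_a = 1000.0 * (base + 0.3 * rng.normal(size=500))
feature_b = 0.001 * (base + 0.3 * rng.normal(size=500))
X = np.column_stack([feature_a, feature_b])


def first_component(M):
    """Top principal direction of M (rows are samples)."""
    centered = M - M.mean(axis=0)
    _, vectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    return vectors[:, -1]  # last column belongs to the largest eigenvalue


raw = first_component(X)

# Standardize: zero mean and unit sample variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
standardized = first_component(X_std)

print(np.abs(raw))           # almost entirely the large-scale feature
print(np.abs(standardized))  # both features weighted about equally
```

On the raw data the first component is essentially the large-scale feature alone. After standardization both features contribute, which reflects their shared structure rather than their units.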

Common Pitfalls

Forgetting to center the data before computing the covariance matrix.
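Mixing up rows and columns and computing covariance over the wrong axis. np.cov treats rows as variables by default, so pass rowvar=False when rows are samples.
Using unsorted eigenvectors. np.linalg.eigh returns eigenvalues in ascending order, so taking the first columns without reordering selects the lowest-variance directions.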
Expecting principal components to be meaningful without considering feature scaling.
Recomputing means separately for training and prediction data instead of reusing the original training mean.
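
The last pitfall can be made concrete with a small fit-and-transform split. This is a minimal sketch on synthetic data. fit_pca and transform are hypothetical helper names introduced for this example, not part of NumPy.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Return the training mean and the top principal directions (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_train - mean, rowvar=False))
    order = np.argsort(eigenvalues)[::-1]
    return mean, eigenvectors[:, order[:n_components]]


def transform(X_new, mean, components):
    """Project new samples using the stored training mean (hypothetical helper)."""
    return (np.asarray(X_new, dtype=float) - mean) @ components


rng = np.random.default_rng(5)
scales = np.array([3.0, 1.0, 0.1])
X_train = rng.normal(size=(200, 3)) * scales + 10.0
X_new = rng.normal(size=(5, 3)) * scales + 10.0

mean, components = fit_pca(X_train, n_components=2)

# Correct: center the new data with the training mean.
Z_new = transform(X_new, mean, components)

# Pitfall: centering with the new batch's own mean shifts every score.
Z_wrong = (X_new - X_new.mean(axis=0)) @ components

print(Z_new.shape)                  # (5, 2)
print(np.allclose(Z_new, Z_wrong))  # False
```

The two results differ by a constant offset per component, which is enough to make scores from different batches incomparable.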

Summary

PCA with NumPy is a sequence of centering, covariance calculation, eigendecomposition, sorting, and projection.
np.linalg.eigh is appropriate for the symmetric covariance matrix.
Centering is essential, not optional.
Eigenvalues tell you how much variance each component explains.
For production pipelines, library PCA implementations often use SVD, but the NumPy version is excellent for understanding the method.