Linear Algebra for Deep Learning

Topics Covered

Vectors and Embeddings

Why Dense Vectors Instead of One-Hot Encodings

The Dot Product as a Similarity Measure

Vector Norms and Normalization

The Curse of Dimensionality

Matrix Operations and Transformations

Matrix Multiplication as a Linear Transformation

Batch Matrix Multiplication

The Jacobian: A Matrix of Partial Derivatives

Why Matrix Shape Matters for Memory and Compute

Eigendecomposition and SVD

Eigenvectors and Eigenvalues

From Eigendecomposition to SVD

Low-Rank Approximation via Truncated SVD

PCA as Truncated SVD

Low-Rank Approximation and the Curse of Dimensionality

Why Weight Matrices Are Approximately Low-Rank

The Intrinsic Dimension of Neural Network Tasks

Reconstruction Error and When Low-Rank Fails

Practice: Implement SVD-Based Dimensionality Reduction

Every neural network, whether it classifies images or generates text, ultimately operates on vectors. Understanding why vectors are the right abstraction unlocks intuition for attention, embeddings, and the geometry of learned representations.

A vector is a point in space, or equivalently, an arrow from the origin to that point. In deep learning, vectors represent things: a word, a document, an image patch, a user's taste profile. The dimensionality of the vector (how many numbers it contains) determines how rich a description it can encode.

Why Dense Vectors Instead of One-Hot Encodings

Before word embeddings, the dominant representation for words was one-hot encoding: a vector of length V (vocabulary size, often 50,000+) with a single 1 at the word's index. One-hot vectors have a fatal flaw, they encode no similarity. The vectors for "cat" and "dog" are exactly as far apart as the vectors for "cat" and "quantum". Every word is equidistant from every other word.

Concretely, imagine a 5-word vocabulary where cat = [0, 0, 1, 0, 0], dog = [0, 1, 0, 0, 0], and quantum = [1, 0, 0, 0, 0]. The dot product of any two distinct one-hot vectors is always 0, and their Euclidean distance is always 2\sqrt{2}. Cat-vs-dog and cat-vs-quantum score identically, the representation has no way to encode that the first pair is more related.

Dense embeddings fix this by mapping each word to a learned vector of 100-300 dimensions where geometric proximity encodes semantic proximity. For contrast, a trained 3-dimensional sketch might give cat = [0.82, 0.14, -0.33], dog = [0.74, 0.21, -0.12], and quantum = [-0.22, 0.65, 0.41]. Cosine similarity gives cat-vs-dog approximately 0.97 and cat-vs-quantum approximately 0.05, the geometry now reflects meaning. "Cat" ends up near "dog", "kitten", and "feline". This emerges from training, the network adjusts embeddings so that words appearing in similar contexts end up in similar locations.

Individual dimensions are learned, not hand-designed. Directionally they end up capturing latent features like is-alive, formality, or tense, but no single dimension is interpretable in isolation. Meaning lives in the vector's overall direction, not in any one coordinate.

Embedding space
-2-1012-2-1012kingqueenmanwomancatdogfishrunwalkjumphappysadangry
Word embeddings cluster by semantic similarity: cat, dog, and kitten land near each other while unrelated words are distant.

The Dot Product as a Similarity Measure

The dot product of two vectors aa and bb is defined as the sum of their element-wise products. For vectors in Rn\mathbb{R}^n, this equals abcos(θ)\|a\|\|b\|\cos(\theta), where θ\theta is the angle between them. This geometric interpretation is what makes dot products so useful in deep learning.

When two vectors are parallel (θ=0\theta = 0), their dot product equals the product of their magnitudes, maximum similarity. When they are orthogonal (θ=90\theta = 90 degrees), the dot product is zero, no relationship. When they point in opposite directions, the dot product is negative.

This is exactly how attention works in transformers. The query vector QQ and key vector KK for each token pair are compared via dot product, producing an attention score that determines how much each token should attend to every other token. High dot product means strong attention. We will implement this exact QKQK^\top computation from scratch in the Attention Mechanism lesson.

python
1import numpy as np
2# Dot product and cosine similarity
3a = np.array([0.8, 0.1, -0.3])
4b = np.array([0.7, 0.2, -0.1])
5dot = a @ b                                    # 0.61
6cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 0.93
Key Insight

The dot product is doing geometry, not just arithmetic. When a transformer computes Q times K-transpose, it is asking: for each query vector, which key vectors point in the same direction? The angle between vectors, not their magnitude, encodes the semantic relationship. This is why embedding vectors are often L2-normalized before similarity comparisons.

Common Pitfall

If your embedding retrieval returns seemingly random results, check that you normalized your vectors before computing cosine similarity. Unnormalized dot products are dominated by vector magnitude, not direction — a long vector pointing the wrong way scores higher than a short vector pointing the right way.

Vector Norms and Normalization

The L2L_2 norm of a vector vv is the square root of the sum of squares of its components, Euclidean distance from the origin. Normalizing a vector (dividing by its L2L_2 norm) projects it onto the unit sphere, so all vectors have magnitude 1.

Why normalize? When using dot products as similarity scores, unnormalized vectors conflate two signals: direction (semantic similarity) and magnitude (frequency or confidence). A common word like "the" gets updated more often during training and ends up with a large magnitude, artificially inflating its similarity scores. Normalization isolates the directional signal. This is why cosine similarity (dot product after normalization) is the standard metric for embedding similarity in retrieval systems.

The Curse of Dimensionality

Embeddings live in high-dimensional spaces, BERT uses 768 dimensions, GPT-3 uses 12,288. High dimensionality brings a counterintuitive problem: most of the volume of a high-dimensional space is concentrated near the surface, not the interior.

In 2D, a circle with radius 0.99 contains 98% of the area of the unit disk. In 1,000 dimensions, a sphere with radius 0.99 contains only (0.99)1000(0.99)^{1000} which is essentially 0% of the volume of the unit hypersphere. Almost all the volume is in a thin shell near the surface.

Distances concentrate as dimensionality grows
Low-d (2D disk)High-d: distances concentrate
Schematic. Left: 20 points uniform in a 2D disk — distances from the center vary widely. Right: in high-d, uniformly sampled points concentrate on a thin shell of nearly equal distance from the center. The radial axis represents distance from the centroid, so all points landing at one radius means all pairwise distances are nearly equal. In 768-d BERT space, the 1st nearest neighbor and the 100th differ by less than 0.05 in cosine similarity.

A more direct way to see the same effect: plot how much of the mass sits inside radius r for different dimensions. In low-d, mass spreads across all radii. In high-d, the curve stays flat for almost all r and only rises at the very edge, which quantifies what the scatter schematic above is trying to convey.

Where mass lives in a d-dimensional unit ball
00.200.400.600.80100.200.400.600.801radius r (distance from center)fraction of mass with ‖x‖ ≤ r50% of massd = 2d = 10d = 100
Fraction of uniformly sampled points with distance ≤ r from the center. In d=2, half the mass is inside r ≈ 0.71. In d=10, half is inside r ≈ 0.93. In d=100, half is inside r ≈ 0.993 — virtually every point lands in a razor-thin shell near the surface. This is the curse of dimensionality: the interior of high-dimensional spaces is almost empty, which is why nearly all pairs of random high-d vectors end up at similar distances.

The practical consequence follows directly from the shell picture. If almost every vector sits at nearly the same distance from the center, then almost every pair of random vectors has nearly the same dot product. The spread of similarity scores collapses: the "meaningful" signal still exists, but it is drowning in a narrow band of near-identical noise.

Concretely, in a 768-dimensional BERT embedding space, the cosine similarity between a query's 1st nearest neighbor and its 100th nearest neighbor often differs by less than 0.05. Read that carefully: the 100 closest results to any query are essentially tied. The ordering is still correct (top-1 really is more relevant than top-100), but the numerical margin separating them is tiny.

This is counterintuitively what makes approximate nearest neighbor search work. You might expect that razor-thin margins demand more precision, not less. The opposite is true: if exact search only beats a slightly-wrong answer by 0.02 in cosine similarity, then a method like HNSW or FAISS that occasionally returns the 3rd nearest neighbor instead of the 1st loses almost nothing in retrieval quality, while running 10-100x faster. Exact brute-force comparison is paying for precision that the geometry itself has already eroded.

Dimensionality reduction (PCA, SVD) helps for a related reason. Fewer dimensions means less concentration on the shell, so margins between neighbors widen. PCA also drops axes with low variance, which are typically noise; after projection, the remaining axes carry more of the real semantic signal per unit of compute. This is why reducing 768-d embeddings to 128-d often improves retrieval quality, not just speed.

A neural network layer is, at its core, a matrix multiplication followed by a nonlinearity. Understanding what matrix multiplication does geometrically, not just how to compute it, is the key to understanding why networks can learn complex functions.

Matrix Multiplication as a Linear Transformation

Given a vector xRnx \in \mathbb{R}^n and a matrix WW of shape (m,n)(m, n), the product WxWx produces a vector in Rm\mathbb{R}^m. This is a linear transformation: it takes a point in one space and maps it to a point in another, possibly higher- or lower-dimensional space.

What can a linear transformation do geometrically? It can rotate, scale, shear, and project. What can it NOT do? It cannot translate (shift) and it cannot curve, a linear map always sends the origin to the origin and always maps straight lines to straight lines.

Matrix transform — [[2, 1], [0.5, 1.5]]
-3-2-10123-3-2-10123xyλ₁ = 2.50λ₂ = 1.00
A 2x2 matrix stretches and rotates a unit square, showing how linear transformations warp space.

This is why activations are necessary. A stack of linear layers without activations collapses to a single linear layer. If L1(x)=W1xL_1(x) = W_1 x and L2(x)=W2xL_2(x) = W_2 x, then L2(L1(x))=W2(W1x)=(W2W1)xL_2(L_1(x)) = W_2 (W_1 x) = (W_2 W_1) x, which is just a single matrix W=W2W1W = W_2 W_1. No matter how many linear layers you stack, you get one linear transformation. Nonlinear activations (ReLU, GELU) break this collapse and allow the network to approximate nonlinear functions.

Batch Matrix Multiplication

In practice, neural networks process batches of inputs simultaneously. If xx has shape (batch_size,n)(\text{batch\_size}, n), and WW has shape (n,m)(n, m), then x@Wx @ W produces an output of shape (batch_size,m)(\text{batch\_size}, m). Each row of xx is transformed independently by WW.

python
1import torch
2# Batch matmul: x has shape (batch, seq_len, d_model), W has shape (d_model, d_ff)
3x = torch.randn(32, 128, 512)    # (batch=32, seq=128, d_model=512)
4W = torch.randn(512, 2048)        # (d_model=512, d_ff=2048)
5out = x @ W                       # (32, 128, 2048)
6# Equivalent using einsum (explicit shape annotation):
7out2 = torch.einsum('bsd,df->bsf', x, W)  # same result
Common Pitfall

Shape errors are the #1 source of PyTorch bugs. When you see 'RuntimeError: mat1 and mat2 shapes cannot be multiplied', it means your matrix dimensions don't align for multiplication. The rule: (batch, M, K) @ (batch, K, N) = (batch, M, N). The inner dimensions (K) must match. Print x.shape before every matrix multiply until this becomes automatic.

This is why GPUs are central to deep learning: batch matrix multiplication is embarrassingly parallel. Each row of x can be processed simultaneously, and GPUs can execute thousands of parallel multiply-accumulate operations. A modern A100 GPU can perform roughly 312 teraFLOPS of float16 operations per second, almost all of which go toward matrix multiplications during training.

The Jacobian: A Matrix of Partial Derivatives

When a function ff maps a vector xRnx \in \mathbb{R}^n to a vector yRmy \in \mathbb{R}^m, its derivative is not a single number but a matrix, the Jacobian JJ of shape (m,n)(m, n). Entry JijJ_{ij} is the partial derivative yi/xj\partial y_i / \partial x_j.

For a linear layer y=Wxy = Wx, the Jacobian with respect to xx is WW itself. This is what makes backpropagation tractable: the gradient of the loss with respect to the layer's input is computed by multiplying the upstream gradient by WW^\top (the transpose of WW). We build full Jacobian intuition, including how vector-Jacobian products drive reverse-mode AD, in the Calculus and Automatic Differentiation lesson.

Interview Tip

The transpose appears in backpropagation for a specific geometric reason: the transpose of a linear transformation undoes the transformation in the gradient direction. If the forward pass compresses a 1000-dimensional space to 100 dimensions (via W of shape 100x1000), the backward pass must expand a 100-dimensional gradient back to 1000 dimensions (via W^T of shape 1000x100). The network learns both the forward compression AND the backward gradient routing simultaneously.

Why Matrix Shape Matters for Memory and Compute

A linear layer with nn input features and mm output features has an n×mn \times m weight matrix. For a layer with 4096 inputs and 4096 outputs (typical for a GPT-3 feed-forward layer), that is 4096×4096=16.74096 \times 4096 = 16.7 million parameters, about 67MB in float32. GPT-3 has 96 such layers. Weight storage alone is why large models require hundreds of gigabytes of GPU memory.

This is the motivation for LoRA (Low-Rank Adaptation): instead of storing the full weight update ΔW\Delta W of shape (n,m)(n, m), represent it as a product of two small matrices AA (n×rn \times r) and BB (r×mr \times m) where rr is much smaller than nn or mm. With r=8r=8 and n=m=4096n=m=4096, LoRA stores only 2×4096×8=65,5362 \times 4096 \times 8 = 65{,}536 parameters instead of 16.7 million, a 256x reduction. The math behind this is SVD, covered in the next section.

Interview Tip

When a paper says 'rank-r approximation', think 'r independent directions of variation.' A rank-16 LoRA adapter means the fine-tuning update has 16 degrees of freedom — enough for most task adaptations, not enough to overwrite the pretrained knowledge.

python
1# LoRA forward pass: frozen W plus low-rank adaptation
2# W: (4096, 4096), A: (4096, 8), B: (8, 4096)
3out = x @ W + x @ A @ B   # x @ W is frozen; x @ A @ B is the learned delta
4# At inference, merge: W_new = W + A @ B, then out = x @ W_new (zero overhead)

Eigendecomposition is the answer to a single question: are there directions in space that a linear transformation stretches or compresses but does not rotate? These directions, eigenvectors, reveal the intrinsic geometry of a transformation. SVD generalizes this idea to non-square matrices and is the mathematical engine behind PCA, LoRA, and latent factor models.

Level Expectations

Beginner: understand dot product as similarity and matrix multiply as transformation.

Intermediate: implement SVD and verify the low-rank approximation error.

Advanced: read Aghajanyan et al. (2020) 'Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning' to understand why LoRA works theoretically.

Eigenvectors and Eigenvalues

A vector vv is an eigenvector of matrix AA if Av=λvA v = \lambda v for some scalar λ\lambda. The transformation AA, applied to vv, just scales vv by λ\lambda. It does not change vv's direction. λ\lambda is the corresponding eigenvalue.

Why does this matter? The eigenvectors of the covariance matrix of a dataset are the principal directions of variance (this is PCA). The eigenvectors of the Hessian (matrix of second derivatives) reveal the curvature of the loss landscape, which determines whether minima are sharp or flat and how optimizers navigate the surface. We explore this Hessian eigenvalue analysis in detail in the Optimization lesson, where eigenvalues of the Hessian explain sharp versus flat minima and the generalization gap. The eigenvectors of a graph adjacency matrix encode community structure. Eigenanalysis distills the most important structural information from a matrix.

For a symmetric positive semi-definite matrix (like a covariance matrix), all eigenvalues are non-negative and the eigenvectors are orthogonal. This means PCA produces orthogonal principal components, each capturing an independent axis of variance in the data.

From Eigendecomposition to SVD

Eigendecomposition works on square matrices. SVD generalizes to any matrix of shape (m,n)(m, n):

X=UΣVX = U \Sigma V^\top

where UU is orthogonal (m×mm \times m), Σ\Sigma is diagonal (m×nm \times n) with non-negative values (singular values), and VV^\top is orthogonal (n×nn \times n). The singular values in Σ\Sigma are the square roots of the eigenvalues of XXX^\top X, sorted in descending order.

The geometric interpretation: SVD decomposes any linear transformation into three steps, a rotation/reflection (VV^\top), a scaling along axes (Σ\Sigma), and another rotation/reflection (UU). The singular values are the scaling factors. Large singular values indicate directions where the transformation has large effect; small singular values indicate near-null directions.

SVD: A = U Σ Vᵀ
A(m×n)=U(m×r)Σ(r×r diag)Vᵀ(r×n)
Every matrix factors as orthogonal U, diagonal Σ (singular values), and orthogonal Vᵀ. Truncating small singular values in Σ gives the best low-rank approximation.

Low-Rank Approximation via Truncated SVD

The Eckart-Young theorem is one of the most powerful results in linear algebra: the best rank-kk approximation of a matrix XX (in terms of Frobenius norm) is obtained by keeping only the top kk singular values:

Xk=U:,:kΣ:k,:kV:k,:X_k = U_{:,:k} \, \Sigma_{:k,:k} \, V_{:k,:}^\top

Setting singular values below a threshold to zero gives you the mathematically optimal compression for a given rank budget. The fraction of variance explained is i=1kσi2/iσi2\sum_{i=1}^{k} \sigma_i^2 \,/\, \sum_i \sigma_i^2.

python
1import numpy as np
2X = np.random.randn(1000, 768)  # e.g., 1000 embeddings of dim 768
3U, S, Vt = np.linalg.svd(X, full_matrices=False)
4k = 50  # rank-50 reconstruction
5X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
6variance_explained = np.sum(S[:k]**2) / np.sum(S**2)
7# Typically 85-95% of variance in first 50 components for text embeddings

This is the math behind LoRA. The hypothesis is that fine-tuning weight updates have low intrinsic rank, a rank-kk SVD approximation captures most of the useful adaptation. Empirically, k=8k=8 or k=16k=16 often suffices for style and domain adaptation tasks.

Common Pitfall

SVD is exact. It tells you the best possible rank-k approximation. But LoRA does not actually compute SVD of the weight update; it learns low-rank matrices A and B directly via gradient descent. The initialization of A as Gaussian noise and B as zeros means the product A*B starts at zero (no perturbation to pretrained weights) and grows during training. This is not the same as truncated SVD of the optimal update, but it works in practice because gradient descent finds a useful low-rank subspace.

PCA as Truncated SVD

Principal Component Analysis (PCA) can be derived entirely from SVD. Given a centered data matrix XX of shape (nsamples,nfeatures)(n_{\text{samples}}, n_{\text{features}}), PCA finds the kk directions of maximum variance. These are exactly the first kk right singular vectors of XX (the first kk rows of VV^\top).

The projection of XX onto the first kk principal components is XV:,:kX V_{:, :k}, giving a shape (nsamples,k)(n_{\text{samples}}, k), a dimensionality-reduced representation. The proportion of variance explained by each component is σi2/jσj2\sigma_i^2 / \sum_j \sigma_j^2.

Why does this matter for deep learning? Embeddings are often high-dimensional (768, 1536, 3072 dims). Before clustering, retrieval, or visualization, projecting to lower dimensions (50-100) via PCA removes noise dimensions and speeds up downstream computation without significant loss of information.

Low-rank approximation is the bridge between abstract linear algebra and practical deep learning engineering. Understanding when and why it works, and when it fails, is essential for designing efficient models and understanding the limits of techniques like LoRA, PCA, and compressed attention.

Why Weight Matrices Are Approximately Low-Rank

The empirical observation that drives most of modern parameter-efficient fine-tuning is striking: the weight matrices of large pre-trained language models have most of their meaningful structure in a small number of dimensions, even though the matrices are full-rank in the strict mathematical sense.

Why? Neural networks learn to solve tasks using structured transformations, rotating and scaling features in a way that separates classes or predicts tokens. These structured transformations tend to involve correlated changes across weight entries. When you update a GPT attention layer to focus more on subject-verb agreement, the change has a specific semantic direction. It does not require independently adjusting every one of the 4096×4096=16.74096 \times 4096 = 16.7M entries. The update is "low-rank" in the sense that it can be well-expressed as a sum of a few rank-1 outer products.

The Intrinsic Dimension of Neural Network Tasks

Research has shown that the intrinsic dimension of language model fine-tuning, the minimum dimension of a subspace needed to achieve near-optimal performance, is remarkably small. Aghajanyan et al. (2021) found that many NLP tasks can be solved by fine-tuning in as few as 100-1000 dimensions out of a billion-parameter space.

This has practical consequences beyond LoRA. It suggests that full fine-tuning is almost always parameter-wasteful for adaptation tasks. The model does not need all its degrees of freedom to adapt to a new domain; it just needs to move in the right low-dimensional subspace of parameter space.

Key Insight

The intrinsic dimension of a fine-tuning task is not a property of the model size. It is a property of the task. Adapting GPT-3 to write formal emails requires moving in a small subspace of its parameter space. Adapting it to perform competitive-level mathematics requires a much larger subspace because it requires changing reasoning patterns encoded across many independent weight dimensions.

Reconstruction Error and When Low-Rank Fails

The reconstruction error of a rank-kk approximation XkX_k is bounded by the (k+1)(k+1)-th singular value: XXk2=σk+1\|X - X_k\|_2 = \sigma_{k+1}. For a matrix with a sharp singular value drop, σk+1\sigma_{k+1} is small and rank-kk gives an excellent approximation. For a matrix with a flat spectrum, every singular value carries similar weight and there is no good low-rank approximation.

In practice, this means low-rank methods fail when:

  • The task requires learning many independent new capabilities simultaneously
  • The domain shift is extreme (e.g., English model adapting to a low-resource language without any cognate vocabulary)
  • The weight matrix in question has meaningful structure across many independent directions (some MLP layers have flatter spectra than attention layers)

Practice: Implement SVD-Based Dimensionality Reduction

Implement the truncated SVD operation: given a matrix X and target rank k, return the rank-k reconstruction and the fraction of variance explained. This is the core computation behind PCA, collaborative filtering, and the analysis of LoRA weight updates.

Loading problem...