Deep Learning Foundations
Neural Network Foundations
Representation Learning
Generative Models Beyond Language
Vision and Modern Self-Supervised Learning
Practical Training Decisions
Linear Algebra for Deep Learning
Every neural network, whether it classifies images or generates text, ultimately operates on vectors. Understanding why vectors are the right abstraction unlocks intuition for attention, embeddings, and the geometry of learned representations.
A vector is a point in space, or equivalently, an arrow from the origin to that point. In deep learning, vectors represent things: a word, a document, an image patch, a user's taste profile. The dimensionality of the vector (how many numbers it contains) determines how rich a description it can encode.
Why Dense Vectors Instead of One-Hot Encodings
Before word embeddings, the dominant representation for words was one-hot encoding: a vector of length V (vocabulary size, often 50,000+) with a single 1 at the word's index. One-hot vectors have a fatal flaw, they encode no similarity. The vectors for "cat" and "dog" are exactly as far apart as the vectors for "cat" and "quantum". Every word is equidistant from every other word.
Concretely, imagine a 5-word vocabulary where cat = [0, 0, 1, 0, 0], dog = [0, 1, 0, 0, 0], and quantum = [1, 0, 0, 0, 0]. The dot product of any two distinct one-hot vectors is always 0, and their Euclidean distance is always . Cat-vs-dog and cat-vs-quantum score identically, the representation has no way to encode that the first pair is more related.
Dense embeddings fix this by mapping each word to a learned vector of 100-300 dimensions where geometric proximity encodes semantic proximity. For contrast, a trained 3-dimensional sketch might give cat = [0.82, 0.14, -0.33], dog = [0.74, 0.21, -0.12], and quantum = [-0.22, 0.65, 0.41]. Cosine similarity gives cat-vs-dog approximately 0.97 and cat-vs-quantum approximately 0.05, the geometry now reflects meaning. "Cat" ends up near "dog", "kitten", and "feline". This emerges from training, the network adjusts embeddings so that words appearing in similar contexts end up in similar locations.
Individual dimensions are learned, not hand-designed. Directionally they end up capturing latent features like is-alive, formality, or tense, but no single dimension is interpretable in isolation. Meaning lives in the vector's overall direction, not in any one coordinate.
Embedding space
The Dot Product as a Similarity Measure
The dot product of two vectors and is defined as the sum of their element-wise products. For vectors in , this equals , where is the angle between them. This geometric interpretation is what makes dot products so useful in deep learning.
When two vectors are parallel (), their dot product equals the product of their magnitudes, maximum similarity. When they are orthogonal ( degrees), the dot product is zero, no relationship. When they point in opposite directions, the dot product is negative.
This is exactly how attention works in transformers. The query vector and key vector for each token pair are compared via dot product, producing an attention score that determines how much each token should attend to every other token. High dot product means strong attention. We will implement this exact computation from scratch in the Attention Mechanism lesson.
The dot product is doing geometry, not just arithmetic. When a transformer computes Q times K-transpose, it is asking: for each query vector, which key vectors point in the same direction? The angle between vectors, not their magnitude, encodes the semantic relationship. This is why embedding vectors are often L2-normalized before similarity comparisons.
If your embedding retrieval returns seemingly random results, check that you normalized your vectors before computing cosine similarity. Unnormalized dot products are dominated by vector magnitude, not direction — a long vector pointing the wrong way scores higher than a short vector pointing the right way.
Vector Norms and Normalization
The norm of a vector is the square root of the sum of squares of its components, Euclidean distance from the origin. Normalizing a vector (dividing by its norm) projects it onto the unit sphere, so all vectors have magnitude 1.
Why normalize? When using dot products as similarity scores, unnormalized vectors conflate two signals: direction (semantic similarity) and magnitude (frequency or confidence). A common word like "the" gets updated more often during training and ends up with a large magnitude, artificially inflating its similarity scores. Normalization isolates the directional signal. This is why cosine similarity (dot product after normalization) is the standard metric for embedding similarity in retrieval systems.
The Curse of Dimensionality
Embeddings live in high-dimensional spaces, BERT uses 768 dimensions, GPT-3 uses 12,288. High dimensionality brings a counterintuitive problem: most of the volume of a high-dimensional space is concentrated near the surface, not the interior.
In 2D, a circle with radius 0.99 contains 98% of the area of the unit disk. In 1,000 dimensions, a sphere with radius 0.99 contains only which is essentially 0% of the volume of the unit hypersphere. Almost all the volume is in a thin shell near the surface.
Distances concentrate as dimensionality grows
A more direct way to see the same effect: plot how much of the mass sits inside radius r for different dimensions. In low-d, mass spreads across all radii. In high-d, the curve stays flat for almost all r and only rises at the very edge, which quantifies what the scatter schematic above is trying to convey.
Where mass lives in a d-dimensional unit ball
The practical consequence follows directly from the shell picture. If almost every vector sits at nearly the same distance from the center, then almost every pair of random vectors has nearly the same dot product. The spread of similarity scores collapses: the "meaningful" signal still exists, but it is drowning in a narrow band of near-identical noise.
Concretely, in a 768-dimensional BERT embedding space, the cosine similarity between a query's 1st nearest neighbor and its 100th nearest neighbor often differs by less than 0.05. Read that carefully: the 100 closest results to any query are essentially tied. The ordering is still correct (top-1 really is more relevant than top-100), but the numerical margin separating them is tiny.
This is counterintuitively what makes approximate nearest neighbor search work. You might expect that razor-thin margins demand more precision, not less. The opposite is true: if exact search only beats a slightly-wrong answer by 0.02 in cosine similarity, then a method like HNSW or FAISS that occasionally returns the 3rd nearest neighbor instead of the 1st loses almost nothing in retrieval quality, while running 10-100x faster. Exact brute-force comparison is paying for precision that the geometry itself has already eroded.
Dimensionality reduction (PCA, SVD) helps for a related reason. Fewer dimensions means less concentration on the shell, so margins between neighbors widen. PCA also drops axes with low variance, which are typically noise; after projection, the remaining axes carry more of the real semantic signal per unit of compute. This is why reducing 768-d embeddings to 128-d often improves retrieval quality, not just speed.
A neural network layer is, at its core, a matrix multiplication followed by a nonlinearity. Understanding what matrix multiplication does geometrically, not just how to compute it, is the key to understanding why networks can learn complex functions.
Matrix Multiplication as a Linear Transformation
Given a vector and a matrix of shape , the product produces a vector in . This is a linear transformation: it takes a point in one space and maps it to a point in another, possibly higher- or lower-dimensional space.
What can a linear transformation do geometrically? It can rotate, scale, shear, and project. What can it NOT do? It cannot translate (shift) and it cannot curve, a linear map always sends the origin to the origin and always maps straight lines to straight lines.
Matrix transform — [[2, 1], [0.5, 1.5]]
This is why activations are necessary. A stack of linear layers without activations collapses to a single linear layer. If and , then , which is just a single matrix . No matter how many linear layers you stack, you get one linear transformation. Nonlinear activations (ReLU, GELU) break this collapse and allow the network to approximate nonlinear functions.
Batch Matrix Multiplication
In practice, neural networks process batches of inputs simultaneously. If has shape , and has shape , then produces an output of shape . Each row of is transformed independently by .
Shape errors are the #1 source of PyTorch bugs. When you see 'RuntimeError: mat1 and mat2 shapes cannot be multiplied', it means your matrix dimensions don't align for multiplication. The rule: (batch, M, K) @ (batch, K, N) = (batch, M, N). The inner dimensions (K) must match. Print x.shape before every matrix multiply until this becomes automatic.
This is why GPUs are central to deep learning: batch matrix multiplication is embarrassingly parallel. Each row of x can be processed simultaneously, and GPUs can execute thousands of parallel multiply-accumulate operations. A modern A100 GPU can perform roughly 312 teraFLOPS of float16 operations per second, almost all of which go toward matrix multiplications during training.
The Jacobian: A Matrix of Partial Derivatives
When a function maps a vector to a vector , its derivative is not a single number but a matrix, the Jacobian of shape . Entry is the partial derivative .
For a linear layer , the Jacobian with respect to is itself. This is what makes backpropagation tractable: the gradient of the loss with respect to the layer's input is computed by multiplying the upstream gradient by (the transpose of ). We build full Jacobian intuition, including how vector-Jacobian products drive reverse-mode AD, in the Calculus and Automatic Differentiation lesson.
The transpose appears in backpropagation for a specific geometric reason: the transpose of a linear transformation undoes the transformation in the gradient direction. If the forward pass compresses a 1000-dimensional space to 100 dimensions (via W of shape 100x1000), the backward pass must expand a 100-dimensional gradient back to 1000 dimensions (via W^T of shape 1000x100). The network learns both the forward compression AND the backward gradient routing simultaneously.
Why Matrix Shape Matters for Memory and Compute
A linear layer with input features and output features has an weight matrix. For a layer with 4096 inputs and 4096 outputs (typical for a GPT-3 feed-forward layer), that is million parameters, about 67MB in float32. GPT-3 has 96 such layers. Weight storage alone is why large models require hundreds of gigabytes of GPU memory.
This is the motivation for LoRA (Low-Rank Adaptation): instead of storing the full weight update of shape , represent it as a product of two small matrices () and () where is much smaller than or . With and , LoRA stores only parameters instead of 16.7 million, a 256x reduction. The math behind this is SVD, covered in the next section.
When a paper says 'rank-r approximation', think 'r independent directions of variation.' A rank-16 LoRA adapter means the fine-tuning update has 16 degrees of freedom — enough for most task adaptations, not enough to overwrite the pretrained knowledge.
Eigendecomposition is the answer to a single question: are there directions in space that a linear transformation stretches or compresses but does not rotate? These directions, eigenvectors, reveal the intrinsic geometry of a transformation. SVD generalizes this idea to non-square matrices and is the mathematical engine behind PCA, LoRA, and latent factor models.
Beginner: understand dot product as similarity and matrix multiply as transformation.
Intermediate: implement SVD and verify the low-rank approximation error.
Advanced: read Aghajanyan et al. (2020) 'Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning' to understand why LoRA works theoretically.
Eigenvectors and Eigenvalues
A vector is an eigenvector of matrix if for some scalar . The transformation , applied to , just scales by . It does not change 's direction. is the corresponding eigenvalue.
Why does this matter? The eigenvectors of the covariance matrix of a dataset are the principal directions of variance (this is PCA). The eigenvectors of the Hessian (matrix of second derivatives) reveal the curvature of the loss landscape, which determines whether minima are sharp or flat and how optimizers navigate the surface. We explore this Hessian eigenvalue analysis in detail in the Optimization lesson, where eigenvalues of the Hessian explain sharp versus flat minima and the generalization gap. The eigenvectors of a graph adjacency matrix encode community structure. Eigenanalysis distills the most important structural information from a matrix.
For a symmetric positive semi-definite matrix (like a covariance matrix), all eigenvalues are non-negative and the eigenvectors are orthogonal. This means PCA produces orthogonal principal components, each capturing an independent axis of variance in the data.
From Eigendecomposition to SVD
Eigendecomposition works on square matrices. SVD generalizes to any matrix of shape :
where is orthogonal (), is diagonal () with non-negative values (singular values), and is orthogonal (). The singular values in are the square roots of the eigenvalues of , sorted in descending order.
The geometric interpretation: SVD decomposes any linear transformation into three steps, a rotation/reflection (), a scaling along axes (), and another rotation/reflection (). The singular values are the scaling factors. Large singular values indicate directions where the transformation has large effect; small singular values indicate near-null directions.
SVD: A = U Σ Vᵀ
Low-Rank Approximation via Truncated SVD
The Eckart-Young theorem is one of the most powerful results in linear algebra: the best rank- approximation of a matrix (in terms of Frobenius norm) is obtained by keeping only the top singular values:
Setting singular values below a threshold to zero gives you the mathematically optimal compression for a given rank budget. The fraction of variance explained is .
This is the math behind LoRA. The hypothesis is that fine-tuning weight updates have low intrinsic rank, a rank- SVD approximation captures most of the useful adaptation. Empirically, or often suffices for style and domain adaptation tasks.
SVD is exact. It tells you the best possible rank-k approximation. But LoRA does not actually compute SVD of the weight update; it learns low-rank matrices A and B directly via gradient descent. The initialization of A as Gaussian noise and B as zeros means the product A*B starts at zero (no perturbation to pretrained weights) and grows during training. This is not the same as truncated SVD of the optimal update, but it works in practice because gradient descent finds a useful low-rank subspace.
PCA as Truncated SVD
Principal Component Analysis (PCA) can be derived entirely from SVD. Given a centered data matrix of shape , PCA finds the directions of maximum variance. These are exactly the first right singular vectors of (the first rows of ).
The projection of onto the first principal components is , giving a shape , a dimensionality-reduced representation. The proportion of variance explained by each component is .
Why does this matter for deep learning? Embeddings are often high-dimensional (768, 1536, 3072 dims). Before clustering, retrieval, or visualization, projecting to lower dimensions (50-100) via PCA removes noise dimensions and speeds up downstream computation without significant loss of information.
Low-rank approximation is the bridge between abstract linear algebra and practical deep learning engineering. Understanding when and why it works, and when it fails, is essential for designing efficient models and understanding the limits of techniques like LoRA, PCA, and compressed attention.
Why Weight Matrices Are Approximately Low-Rank
The empirical observation that drives most of modern parameter-efficient fine-tuning is striking: the weight matrices of large pre-trained language models have most of their meaningful structure in a small number of dimensions, even though the matrices are full-rank in the strict mathematical sense.
Why? Neural networks learn to solve tasks using structured transformations, rotating and scaling features in a way that separates classes or predicts tokens. These structured transformations tend to involve correlated changes across weight entries. When you update a GPT attention layer to focus more on subject-verb agreement, the change has a specific semantic direction. It does not require independently adjusting every one of the M entries. The update is "low-rank" in the sense that it can be well-expressed as a sum of a few rank-1 outer products.
The Intrinsic Dimension of Neural Network Tasks
Research has shown that the intrinsic dimension of language model fine-tuning, the minimum dimension of a subspace needed to achieve near-optimal performance, is remarkably small. Aghajanyan et al. (2021) found that many NLP tasks can be solved by fine-tuning in as few as 100-1000 dimensions out of a billion-parameter space.
This has practical consequences beyond LoRA. It suggests that full fine-tuning is almost always parameter-wasteful for adaptation tasks. The model does not need all its degrees of freedom to adapt to a new domain; it just needs to move in the right low-dimensional subspace of parameter space.
The intrinsic dimension of a fine-tuning task is not a property of the model size. It is a property of the task. Adapting GPT-3 to write formal emails requires moving in a small subspace of its parameter space. Adapting it to perform competitive-level mathematics requires a much larger subspace because it requires changing reasoning patterns encoded across many independent weight dimensions.
Reconstruction Error and When Low-Rank Fails
The reconstruction error of a rank- approximation is bounded by the -th singular value: . For a matrix with a sharp singular value drop, is small and rank- gives an excellent approximation. For a matrix with a flat spectrum, every singular value carries similar weight and there is no good low-rank approximation.
In practice, this means low-rank methods fail when:
- The task requires learning many independent new capabilities simultaneously
- The domain shift is extreme (e.g., English model adapting to a low-resource language without any cognate vocabulary)
- The weight matrix in question has meaningful structure across many independent directions (some MLP layers have flatter spectra than attention layers)
Practice: Implement SVD-Based Dimensionality Reduction
Implement the truncated SVD operation: given a matrix X and target rank k, return the rank-k reconstruction and the fraction of variance explained. This is the core computation behind PCA, collaborative filtering, and the analysis of LoRA weight updates.
Loading problem...