Word Embeddings
Why does "bank" mean something different in "river bank" and "bank account"? Your brain answers this question using context: the words surrounding "bank" tell you which sense is active. This intuition is at the core of every word representation method in NLP, and it was articulated long before neural networks by linguist John Firth in 1957: "You shall know a word by the company it keeps."
That sentence is the best-known statement of the distributional hypothesis: words with similar meanings appear in similar contexts. "Dog" and "cat" both appear near "pet", "feed", "chase", and "fur". They rarely appear near "engine" or "invoice". This distributional similarity is the signal that word embedding models learn to capture.
Before embeddings, NLP represented words as one-hot vectors: a vector of vocabulary size with a single 1 and all other entries 0. One-hot encoding has two fatal problems. First, every pair of distinct words has a dot product of exactly 0, so the model has no signal that "dog" and "cat" are more related than "dog" and "invoice". Second, a 100,000-word vocabulary produces 100,000-dimensional vectors that are 99.999% zeros, wasting memory and making learning slow.
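Both problems are easy to see in a few lines of Python. The sketch below uses a made-up three-word vocabulary; the helper names `one_hot` and `dot` are illustrative and do not come from any library.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy vocabulary, chosen only for illustration.
vocab = ["dog", "cat", "invoice"]
word_to_index = {word: i for i, word in enumerate(vocab)}

dog = one_hot(word_to_index["dog"], len(vocab))
cat = one_hot(word_to_index["cat"], len(vocab))
invoice = one_hot(word_to_index["invoice"], len(vocab))

# Problem 1: every pair of distinct words looks equally unrelated.
dog_cat = dot(dog, cat)          # 0
dog_invoice = dot(dog, invoice)  # 0

# Problem 2: with a realistic vocabulary, almost every entry is zero.
big_vocab_size = 100_000
zero_fraction = (big_vocab_size - 1) / big_vocab_size

print(dog_cat, dog_invoice, zero_fraction)
```

The dot product cannot distinguish "cat" from "invoice" as a neighbor of "dog", and only one entry in 100,000 carries any information.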
One-hot vs. dense embeddings
Dense embeddings fix both problems. Instead of 100,000 dimensions with a single 1, a dense embedding uses 100-300 dimensions, and typically every position is nonzero. The dimensions are learned: the model discovers which directions in this space are meaningful. Words that appear in similar contexts end up close together in this space, so the geometry of the space reflects semantic relationships.
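To make "close together" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-picked for illustration and were not learned by any model; real embeddings would have 100-300 dimensions and come from training.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)


# Hypothetical dense vectors (assumed values, not trained embeddings).
embeddings = {
    "dog": [0.8, 0.6, 0.1],
    "cat": [0.7, 0.7, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine(embeddings["dog"], embeddings["cat"])
sim_dog_invoice = cosine(embeddings["dog"], embeddings["invoice"])

# Unlike one-hot vectors, dense vectors give graded similarities.
print(f"dog vs cat:     {sim_dog_cat:.3f}")
print(f"dog vs invoice: {sim_dog_invoice:.3f}")
```

With these assumed vectors, "dog" scores much higher against "cat" than against "invoice", which is the kind of graded signal one-hot vectors cannot provide.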
The distributional hypothesis is an empirical claim, not a definition. It says that observing word co-occurrence statistics is sufficient to recover semantic similarity. This is remarkable: you can learn that "dog" and "wolf" are related without any labeled data, dictionaries, or human annotation, just by counting which words appear near each other in billions of sentences.
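The counting step can be sketched on a toy scale. The example below assumes whitespace tokenization, a symmetric window of two words, and a three-sentence corpus invented for illustration; `cooccurrence_counts` and `context_of` are hypothetical helper names.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count (word, context_word) pairs within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def context_of(word, counts):
    """Collect the context counts for a single word."""
    return {ctx: n for (w, ctx), n in counts.items() if w == word}


# Toy corpus, far smaller than anything used in practice.
corpus = [
    "the dog chased the cat",
    "the wolf chased the deer",
    "send the invoice today",
]

counts = cooccurrence_counts(corpus, window=2)

dog_context = context_of("dog", counts)
wolf_context = context_of("wolf", counts)
invoice_context = context_of("invoice", counts)

print("dog:    ", dog_context)
print("wolf:   ", wolf_context)
print("invoice:", invoice_context)
```

Even in this tiny corpus, "dog" and "wolf" end up with the same context counts, while "invoice" keeps different company. No labels were involved at any point.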
GloVe (Global Vectors, Pennington et al., 2014) makes the connection explicit. It starts from the word co-occurrence matrix, a matrix where entry (i, j) counts how often word i appears near word j in a large corpus, and fits word vectors whose dot products approximate the logarithm of those counts. The learned embeddings are in effect low-dimensional factors of this matrix. GloVe shows that the geometry of a word embedding space is, to a large extent, compressed co-occurrence statistics.
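In the GloVe paper, this takes the form of a weighted least-squares objective: for each pair of words with a nonzero count, the dot product of a word vector and a context vector, plus two bias terms, should match the log of the co-occurrence count. The sketch below computes that loss for a single pair. The function names are illustrative, the defaults `x_max=100` and `alpha=0.75` follow the values reported in the paper, and a real implementation would sum this term over all nonzero entries and minimize it with gradient descent.

```python
import math


def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weighting function f(x): down-weights rare pairs, caps frequent ones at 1."""
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0


def glove_pair_loss(w_i, w_j_ctx, b_i, b_j_ctx, count):
    """Weighted squared error between the model's score and log(count).

    w_i      : vector for word i
    w_j_ctx  : context vector for word j
    b_i      : bias for word i
    b_j_ctx  : bias for context word j
    count    : co-occurrence count X_ij (must be > 0)
    """
    prediction = sum(a * b for a, b in zip(w_i, w_j_ctx)) + b_i + b_j_ctx
    error = prediction - math.log(count)
    return glove_weight(count) * error ** 2


# Hypothetical two-dimensional vectors, chosen so one case fits exactly.
perfect = glove_pair_loss([1.0, 0.0], [math.log(10), 0.0], 0.0, 0.0, count=10)
imperfect = glove_pair_loss([1.0, 0.0], [0.0, 1.0], 0.0, 0.0, count=10)

print(perfect, imperfect)
```

The loss is zero when the score equals the log count, and grows as the vectors drift away from what the co-occurrence statistics imply.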
This insight has a practical implication for modern AI systems. When you use a pretrained embedding model, whether word2vec, GloVe, or the embeddings inside a transformer, you are using a model that has compressed distributional statistics from a massive corpus into a compact vector space. Every downstream task that uses these embeddings inherits whatever distributional regularities were present in the training data, including biases. Words like "doctor" and "nurse" may encode occupational gender associations that appeared in the training corpus, even if those associations are stereotypes.