Word Embeddings

Topics Covered

The Distributional Hypothesis

Word2Vec, CBOW and Skip-gram

CBOW, Predicting the Center Word from Context

Skip-gram, Predicting Context from the Center Word

Two Embedding Matrices

FastText and Subword Embeddings

Negative Sampling

Negative Sampling Distribution

Subsampling Frequent Words

Embedding Properties and Limitations

The Polysemy Problem

What Word Embeddings Are Still Good For

Why does "bank" mean something different in "river bank" and "bank account"? Your brain answers this question using context: the words surrounding "bank" tell you which sense is active. This intuition is at the core of every word representation method in NLP, and it was articulated long before neural word embeddings by the linguist J. R. Firth in 1957: "You shall know a word by the company it keeps."

That sentence is the best-known statement of the distributional hypothesis, which says that words with similar meanings tend to appear in similar contexts. "Dog" and "cat" both appear near "pet", "feed", "chase", and "fur". They rarely appear near "engine" or "invoice". This distributional similarity is the signal that word embedding models learn to capture.

Before embeddings, NLP represented words as one-hot vectors: a vector of vocabulary size with a single 1 and all other entries 0. One-hot has two fatal problems. First, every pair of distinct words has dot product exactly 0, so the model has no signal that "dog" and "cat" are more related than "dog" and "invoice". Second, a 100,000-word vocabulary produces 100,000-dimensional vectors that are 99.999% zeros, wasting memory and making learning slow.
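The two problems can be checked directly. The following is a minimal sketch using NumPy; the four-word vocabulary and the `one_hot` helper are assumptions made up for illustration, not part of any library.

```python
import numpy as np

# Toy vocabulary, assumed purely for illustration.
vocab = ["dog", "cat", "invoice", "engine"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Problem 1: every pair of distinct words has dot product exactly 0,
# so "dog" is no closer to "cat" than it is to "invoice".
dog_cat = float(one_hot("dog") @ one_hot("cat"))
dog_invoice = float(one_hot("dog") @ one_hot("invoice"))
dog_dog = float(one_hot("dog") @ one_hot("dog"))

# Problem 2: with a 100,000-word vocabulary, all but one entry is zero.
large_vocab_size = 100_000
zero_fraction = 1 - 1 / large_vocab_size

print(dog_cat, dog_invoice, dog_dog)
print(f"{zero_fraction:.3%} of entries are zero")
```

The dot product only distinguishes "same word" from "different word", which is why one-hot vectors carry no notion of relatedness.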

One-hot vs. dense embeddings

One-hot vectors are long, sparse, and have zero similarity with every other word. Dense embeddings pack the same information into a short continuous vector where similar words sit nearby.

Dense embeddings fix both problems. Instead of 100,000 dimensions with a single 1, a dense embedding uses 100-300 dimensions in which essentially every position is nonzero. The dimensions are learned: the model discovers which directions in this space are meaningful. Words that appear in similar contexts end up close together, so the geometry of the space reflects semantic relationships.
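Mechanically, a dense embedding is a row of a learned matrix, and looking up a row is equivalent to multiplying the one-hot vector by that matrix. The sketch below illustrates this with NumPy. The sizes, the word id, and the random matrix `embedding_matrix` are assumptions for illustration; in a real model the matrix is learned, and a random matrix like this one carries no semantic information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: a 1,000-word vocabulary and 100-dimensional embeddings.
vocab_size, embed_dim = 1000, 100

# Stand-in for a trained embedding matrix: one row per word.
# Random values here; training is what would make nearby rows meaningful.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

word_id = 42  # hypothetical index of some word in the vocabulary

# Route 1: multiply the one-hot vector by the matrix.
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
via_matmul = one_hot @ embedding_matrix

# Route 2: read the row directly, which is what embedding layers do in practice.
via_lookup = embedding_matrix[word_id]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(np.allclose(via_matmul, via_lookup))
print(cosine_similarity(via_lookup, via_lookup))
```

Unlike one-hot vectors, two different rows of this matrix generally have a nonzero cosine similarity, which gives training something to adjust.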

Key Insight

The distributional hypothesis is an empirical claim, not a definition. It says that observing word co-occurrence statistics is sufficient to recover much of semantic similarity. This is remarkable: you can learn that "dog" and "wolf" are related without any labeled data, dictionaries, or human annotation, just by counting which words appear near each other in billions of sentences.

GloVe (Global Vectors, Pennington et al., 2014) makes the connection explicit. It starts from the word co-occurrence matrix, a matrix where entry (i, j) counts how often word i appears near word j in a large corpus, and fits word vectors whose dot products approximate the logarithm of those counts. The learned embeddings act as low-dimensional factors of this matrix. GloVe suggests that the geometry of a word embedding space is, to a large extent, compressed co-occurrence statistics.
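To make the idea of "compressed co-occurrence statistics" concrete, the sketch below builds a co-occurrence matrix from a tiny corpus and compresses it with a truncated SVD of the log counts. This is not GloVe's training procedure, which fits a weighted least-squares objective by gradient descent; SVD is swapped in here as a simpler stand-in. The corpus, the window size, and the names `cooccurrence_matrix` and `embeddings` are assumptions for illustration.

```python
import numpy as np

# Tiny corpus, assumed purely for illustration.
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "feed the dog",
    "feed the cat",
]

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word appears within `window` positions of another."""
    vocab = sorted({word for s in sentences for word in s.split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        tokens = s.split()
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[index[word], index[tokens[ctx_pos]]] += 1
    return vocab, index, counts

vocab, index, counts = cooccurrence_matrix(corpus)

# Compress the log counts into k dimensions with a truncated SVD.
# log1p keeps zero counts at zero instead of producing -inf.
k = 2
U, S, _ = np.linalg.svd(np.log1p(counts))
embeddings = U[:, :k] * np.sqrt(S[:k])  # one k-dimensional row per word

print(vocab)
print(counts[index["cat"], index["chased"]])
print(embeddings.shape)
```

With a corpus this small the resulting vectors are not meaningful; the point is only the pipeline of counting co-occurrences and then reducing dimensionality.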

This insight has a practical implication for modern AI systems. When you use a pretrained embedding model, whether word2vec, GloVe, or the embeddings inside a transformer, you are using a model that has compressed distributional statistics from a massive corpus into a compact vector space. Every downstream task that uses these embeddings inherits whatever distributional regularities were present in the training data, including biases. Words like "doctor" and "nurse" may encode occupational gender associations that appeared in the training corpus, even if those associations are stereotypes.
