Contrastive Learning

Topics Covered

Contrastive Loss and InfoNCE

SimCLR: Simple Framework for Contrastive Learning

MoCo: Decoupling Batch Size from Negatives

CLIP, Multimodal Embeddings

What Emerges from CLIP Training

Why CLIP Works at Scale

Sentence Transformers

Hard Negatives and Temperature

Supervised learning has a data bottleneck. To learn that a photo of a cat is similar to another photo of a cat, you need labels. Labels cost money, time, and human attention. Contrastive learning sidesteps this by asking a different question: not "what is this?" but "is this more like that, or more like that other thing?"

The contrastive objective trains an encoder to produce embeddings where similar examples are close together and dissimilar examples are far apart. No class labels are required. You only need pairs: positive pairs (two examples that should be similar) and negative pairs (examples that should be dissimilar). Positive pairs come for free from the structure of the data. A photo and its augmented version should be similar. An image and its caption should be similar. A sentence and its paraphrase should be similar. Everything else is treated as a negative.

The original contrastive loss (Hadsell et al., 2006) operates on pairs. For a positive pair (anchor, positive), minimize their embedding distance. For a negative pair (anchor, negative), push the distance above a margin. This works but has two problems: negative pairs require careful sampling, and the margin hyperparameter is finicky.
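A minimal sketch of the pairwise margin loss may make the two cases concrete. It uses plain Python rather than PyTorch so the arithmetic is visible; the function names, the toy 2-d vectors, and the margin of 1.0 are illustrative assumptions, not values from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(a, b, is_positive, margin=1.0):
    """Margin-based contrastive loss for a single pair.

    Positive pair: penalize the squared distance.
    Negative pair: penalize only when the distance is inside the margin.
    """
    d = euclidean(a, b)
    if is_positive:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


anchor = [0.0, 0.0]
positive = [0.3, 0.4]       # distance 0.5 from the anchor
near_negative = [0.6, 0.0]  # distance 0.6, inside the margin
far_negative = [3.0, 4.0]   # distance 5.0, outside the margin

loss_pos = pairwise_contrastive_loss(anchor, positive, True)
loss_near = pairwise_contrastive_loss(anchor, near_negative, False)
loss_far = pairwise_contrastive_loss(anchor, far_negative, False)
```

The far negative contributes zero loss, which hints at why sampling matters: pairs that are already beyond the margin teach the encoder nothing.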

InfoNCE (van den Oord et al., 2018) is a more robust formulation. It operates on batches rather than pairs. For a batch of $N$ examples, each example is positive with its designated partner and negative with all other examples in the batch. The loss for one positive pair $(i, j)$ is:

$$\mathcal{L}_i = -\log \dfrac{\exp(\text{sim}(i, j) / \tau)}{\sum_{k=1}^{N} \exp(\text{sim}(i, k) / \tau)}$$

where $\text{sim}$ is typically cosine similarity, $\tau$ is a temperature parameter, and the sum runs over all $N$ examples in the batch (including $j$). This looks like a cross-entropy loss where the "correct class" is the designated positive partner and the "incorrect classes" are all other batch members.
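To see the cross-entropy reading of the formula, here is a small sketch that evaluates the loss for a single anchor. The similarity values and the temperature of 0.1 are made up for illustration, and the helper name `info_nce_single` is hypothetical.

```python
import math


def info_nce_single(similarities, positive_index, temperature=0.1):
    """InfoNCE loss for one anchor.

    similarities: similarity of the anchor to each of the N candidates
    positive_index: position of the designated positive among them
    """
    logits = [s / temperature for s in similarities]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


# Candidate 0 is the positive; the other three are negatives.
loss_good = info_nce_single([0.9, 0.2, 0.1, -0.3], positive_index=0)

# If the encoder scores every candidate the same, the loss equals log(N).
loss_uninformed = info_nce_single([0.5, 0.5, 0.5, 0.5], positive_index=0)
```

The second case gives a useful reference point: a loss near log(N) means the encoder is doing no better than guessing among the batch.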

[Figure: Contrastive loss: pull positives in, push negatives out. The anchor is pulled toward its positive pair in embedding space while all negatives are pushed away. Repeated across the batch, this shapes the geometry of the embedding.]

Key Insight

InfoNCE has a theoretical connection to mutual information. Minimizing the InfoNCE loss maximizes a lower bound on the mutual information between the representations of the two views of the data. This is why the "Info" in InfoNCE stands for information: contrastive learning is really a mutual information maximization problem in disguise. Understanding this connection helps explain why the method generalizes across modalities: mutual information between an image and its caption is a meaningful quantity that the encoder learns to maximize.

The batch size matters enormously for InfoNCE. With $N=256$, each anchor has 255 negatives. With $N=4096$ (the batch size used in SimCLR), each anchor has 4095 negatives under the formulation above. More negatives means a harder classification problem, which means the encoder must produce higher-quality representations to solve it. This is why contrastive learning methods almost universally benefit from large batch sizes, unlike most deep learning tasks, where larger batches mainly reduce gradient noise and improve hardware throughput.
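One way to quantify "harder" is the loss of an encoder that cannot distinguish candidates at all, which is log(N) for a batch of N. The sketch below simply tabulates that baseline; the helper name is an illustrative assumption.

```python
import math


def chance_level_loss(batch_size):
    """InfoNCE loss when every candidate receives the same score: log(N)."""
    return math.log(batch_size)


for n in (256, 4096):
    print(f"N={n}: {n - 1} negatives per anchor, "
          f"chance-level loss {chance_level_loss(n):.3f}")
# N=256: 255 negatives per anchor, chance-level loss 5.545
# N=4096: 4095 negatives per anchor, chance-level loss 8.318
```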

Common Pitfall

If your contrastive training produces embeddings where everything is equally similar (all cosine similarities near 1.0), you have representation collapse. The model learned to map all inputs to the same point. Common causes: (1) temperature too high (loss provides no gradient signal), (2) no hard negatives in the batch, (3) learning rate too high causing the encoder to jump to a degenerate solution. Fix: lower temperature to 0.05-0.1, ensure batch diversity, and reduce learning rate.
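A quick diagnostic for this symptom is the average cosine similarity over all pairs in a batch of embeddings. The sketch below is one possible check, with made-up 2-d embeddings; what counts as "too close to 1.0" depends on your data and is left as a judgment call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs in a batch."""
    sims = [
        cosine(embeddings[i], embeddings[j])
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    return sum(sims) / len(sims)


collapsed = [[1.0, 1.0], [1.01, 0.99], [0.99, 1.01]]          # nearly identical
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # evenly spread

collapsed_score = mean_pairwise_cosine(collapsed)  # close to 1.0
spread_score = mean_pairwise_cosine(spread)        # -1/3 for these four points
```

Logging this value on a held-out batch during training can surface collapse early, before it shows up in downstream metrics.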

SimCLR: Simple Framework for Contrastive Learning

SimCLR (Chen et al., 2020) demonstrated that a simple contrastive framework, without specialized architectures or memory banks, can match or exceed supervised pretraining. The key ingredients are a carefully designed augmentation pipeline and a projection head. Each image passes through two random augmentations drawn from: random crop with resize, random color jitter (brightness, contrast, saturation, hue), and Gaussian blur. The two augmented views form a positive pair. The augmentation composition matters: random crop plus color jitter is by far the most important combination. Removing either one substantially degrades performance.

The projection head is a small MLP (2-layer, 128-dimensional output) applied after the encoder backbone. The contrastive loss is computed on the projection head's output, but downstream tasks use the representation before the projection head. This counterintuitive result (Chen et al., 2020) suggests that the head's output discards information that is useful for downstream tasks but unhelpful for the contrastive objective. A common interpretation is that the projection head acts as an information filter: it strips away details (like color distribution or exact spatial arrangement) that the augmentations train the contrastive loss to ignore, so the layer before the head is under less pressure to discard them and retains more of that detail.

MoCo: Decoupling Batch Size from Negatives

MoCo (He et al., 2020) solves the batch size problem differently: instead of needing 4,096 or more samples in each batch, it maintains a queue of recent encoded representations and uses a momentum-updated encoder. The queue stores encoded representations from the last several mini-batches, providing thousands of negatives without requiring them all in a single batch. The momentum encoder is a slowly updated copy of the main encoder (updated as $\theta_k \leftarrow m \cdot \theta_k + (1-m) \cdot \theta_q$ with $m=0.999$), which ensures that queue entries encoded at different training steps remain consistent. This decouples batch size from the number of negatives, making contrastive learning feasible on hardware that cannot fit 4,096-sample batches in GPU memory.
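The two mechanisms can be sketched in a few lines. This is a toy illustration in plain Python, not MoCo's actual implementation: parameters are flat lists of floats, the queue holds placeholder tuples instead of embeddings, and the names `momentum_update` and `NegativeQueue` are hypothetical.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """Fixed-size FIFO of key embeddings; the oldest entries are evicted first."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, batch_of_keys):
        self.items.extend(batch_of_keys)

    def __len__(self):
        return len(self.items)


theta_q = [1.0, -2.0]
theta_k = [0.0, 0.0]
# One update moves the key encoder 0.1% of the way toward the query encoder.
theta_k = momentum_update(theta_k, theta_q)

queue = NegativeQueue(max_size=6)
for step in range(4):
    # Each toy mini-batch contributes two keys, tagged with the step number.
    queue.enqueue([(step, 0), (step, 1)])
# The queue now holds the keys from steps 1, 2, and 3; step 0 was evicted.
```

With m close to 1, the key encoder changes very little between steps, which is what keeps old queue entries comparable to fresh ones.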

Here is the InfoNCE loss in PyTorch. This is the core computation shared by SimCLR, MoCo, and CLIP:

```python
import torch
import torch.nn.functional as F


def info_nce_loss(z1, z2, temperature=0.07):
    """InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).

    z1, z2: (batch_size, embed_dim) embeddings. They are L2-normalized
    here, so the dot products below are cosine similarities.
    """
    batch_size = z1.shape[0]

    # Cosine similarity matrix: (batch_size, batch_size).
    # Normalizing first and using one matmul avoids materializing a
    # (batch_size, batch_size, embed_dim) intermediate tensor.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim_matrix = (z1 @ z2.T) / temperature

    # Diagonal entries are positive pairs
    labels = torch.arange(batch_size, device=z1.device)

    # Cross-entropy: positive pair should score highest in each row
    loss = F.cross_entropy(sim_matrix, labels)
    return loss
```
Level Expectations

Beginner: understand that contrastive learning pulls similar pairs together and pushes dissimilar pairs apart in embedding space.

Intermediate: implement InfoNCE loss in PyTorch and fine-tune a sentence-transformer on a custom domain (the accompanying notebook exercise).

Advanced: read 'Understanding Contrastive Representation Learning through Alignment and Uniformity' (Wang and Isola, 2020) to understand the two properties that make contrastive representations good: alignment (similar inputs map nearby) and uniformity (embeddings spread evenly on the hypersphere).

CLIP, Multimodal Embeddings

CLIP (Contrastive Language-Image Pretraining, Radford et al., 2021) applies the contrastive objective to a multimodal problem: align images and text in a shared embedding space. The core idea is conceptually simple. Take 400 million (image, caption) pairs from the internet. Train two encoders, one for images, one for text, so that an image and its caption are close together in the shared space, and unrelated images and captions are far apart.

 
```
Image encoder (ViT or ResNet) -> image embedding (512 dims)
Text encoder (Transformer)    -> text embedding (512 dims)

Loss: InfoNCE over NxN similarity matrix (N = batch size)
```

The training procedure for a batch of $N$ (image, text) pairs creates an $N \times N$ similarity matrix. The diagonal entries are the positive pairs (each image with its own caption). All off-diagonal entries are negatives. The InfoNCE loss is applied symmetrically: for each image, classify which of the $N$ texts is its caption; for each text, classify which of the $N$ images is its subject. The loss is the average of these two directions.
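The symmetric loss can be sketched over a small similarity matrix. This is a plain-Python illustration under stated assumptions: the 3x3 similarity values are made up, the temperature is fixed at 0.07 rather than learned, and the function names are hypothetical rather than taken from any CLIP codebase.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(sim, temperature=0.07):
    """sim[i][j] is the cosine similarity of image i and text j."""
    logits = [[v / temperature for v in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose: texts as rows
    image_to_text = cross_entropy_rows(logits)      # pick the caption per image
    text_to_image = cross_entropy_rows(logits_t)    # pick the image per caption
    return 0.5 * (image_to_text + text_to_image)


# Matched pairs on the diagonal score highest.
aligned = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.1, 0.7],
]
# Same rows in the wrong order: the diagonal no longer holds the best scores.
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = symmetric_contrastive_loss(aligned)
loss_shuffled = symmetric_contrastive_loss(shuffled)
```

Averaging the two directions means neither encoder can rely on the other to do all the discriminating.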

[Figure: CLIP dual encoder: image and text into a shared embedding space. An image encoder and text encoder are trained jointly so matched (image, caption) pairs land nearby in the shared space. Cross-modal retrieval falls out for free.]

What Emerges from CLIP Training

After training on 400 million pairs, the shared embedding space has remarkable properties. Zero-shot image classification becomes possible: to classify an image into categories {cat, dog, car, airplane}, encode each category as a text string ("a photo of a cat", "a photo of a dog", etc.), compute cosine similarity between the image embedding and each text embedding, and take the highest similarity as the predicted class. No classifier training required.
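The classification step itself is just a nearest-neighbor lookup over prompt embeddings. The sketch below uses made-up 3-d vectors as stand-ins for real encoder outputs (no CLIP model is loaded), and `zero_shot_classify` is an illustrative name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the prompt most similar to the image, plus all scores."""
    scores = {
        prompt: cosine(image_embedding, emb)
        for prompt, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Placeholder vectors standing in for text-encoder outputs.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Placeholder vector standing in for the image-encoder output.
image_embedding = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(image_embedding, text_embeddings)
```

Adding a new class only requires encoding one more prompt, which is what makes the classifier "zero-shot".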

On ImageNet, zero-shot CLIP matches the accuracy of a supervised ResNet-50 trained on 1.28 million labeled examples. The model never saw an ImageNet label during training. It learned to recognize objects from internet captions alone.

Interview Tip

CLIP's zero-shot capability depends critically on prompt engineering. Radford et al. report that 'a photo of a {class}' outperforms the bare '{class}' by about 1.3 percentage points on ImageNet, and that ensembling many prompt templates ('a photo of a {class}', 'a photo of the small {class}', 'a photo of a large {class}') adds roughly another 3.5 points. This is one of the earliest demonstrations that prompt format matters for model performance, a lesson that would scale dramatically with LLMs.
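Prompt ensembling is typically done in embedding space: the text embeddings for several templates of one class are averaged into a single class vector. The sketch below shows that averaging step with made-up template embeddings; the function names are hypothetical and no text encoder is actually called.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_prompt_embeddings(template_embeddings):
    """Average the normalized embeddings of several prompts for one class,
    then renormalize so the result works with cosine similarity."""
    normalized = [l2_normalize(v) for v in template_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[d] for v in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(mean)


# Placeholder embeddings for three templates of the same class.
cat_templates = [
    [0.9, 0.2, 0.0],  # stands in for 'a photo of a cat'
    [0.8, 0.1, 0.3],  # stands in for 'a photo of the small cat'
    [1.0, 0.0, 0.1],  # stands in for 'a photo of a large cat'
]
cat_classifier = ensemble_prompt_embeddings(cat_templates)
```

Because the averaged vectors can be cached, the ensemble costs nothing extra at inference time compared with a single prompt.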

Why CLIP Works at Scale

Two factors explain CLIP's quality. First, 400 million pairs is a far larger multimodal dataset than any curated alternative (like MS-COCO, with roughly 330K images). Natural internet captions are noisy, and many do not precisely describe their images, but scale compensates for noise. Second, the contrastive objective is a strong learning signal. The image encoder must produce embeddings that are discriminative enough to identify the correct caption among $N$ candidates. With $N=32{,}768$ (the batch size CLIP uses with a large compute budget), the task is extremely hard and forces the encoder to capture fine-grained semantic content.

Paper Study
Learning Transferable Visual Models From Natural Language Supervision

Radford et al., 2021

Key questions while reading
  1. Why does CLIP use a contrastive objective rather than image captioning?
  2. What is the computational cost comparison between CLIP and a supervised ViT trained on ImageNet?
  3. How does the prompt template affect zero-shot classification performance?

Sentence Transformers

BERT produces contextual token embeddings: each token gets a different vector depending on its context. But BERT was not designed to produce a single embedding that represents an entire sentence. Naively pooling all token embeddings (averaging or taking the [CLS] token) produces sentence representations that perform poorly on semantic similarity tasks. The problem is that BERT's training objective (masked language modeling and next sentence prediction) does not directly optimize for sentence-level similarity.

Sentence-BERT (SBERT, Reimers & Gurevych, 2019) fixes this by fine-tuning BERT on sentence pairs so that the pooled embeddings themselves become comparable. The architecture is a Siamese network: two BERT encoders sharing the same weights, each receiving one sentence from a pair, with mean pooling applied to produce a single sentence embedding. The original paper trains with classification, regression, and triplet objectives, and later sentence-transformer models use contrastive objectives directly. In each case, fine-tuning trains these embeddings so that paraphrase pairs are close and unrelated sentence pairs are far apart.
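Mean pooling has one detail worth spelling out: padding tokens must be excluded from the average. A minimal sketch, assuming token embeddings arrive as a list of vectors with a 0/1 attention mask (the toy values below are made up):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vec[d]
    # Guard against an all-padding input.
    return [s / max(count, 1) for s in summed]


token_embeddings = [
    [1.0, 2.0],      # real token
    [3.0, 4.0],      # real token
    [100.0, 100.0],  # padding token, must not affect the result
]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
# sentence_embedding == [2.0, 3.0]
```

Averaging without the mask would let padding dominate short sentences in a padded batch.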

The result is dramatic. Finding the most similar sentence among 10,000 candidates using vanilla BERT requires passing the query jointly with each candidate through the cross-encoder: 10,000 forward passes. SBERT pre-computes embeddings for all 10,000 candidates once, then answers any query with a single forward pass to encode the query followed by 10,000 cheap dot products. Measured in transformer forward passes per query, the saving is roughly 10,000x for 10,000 candidates.
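The pre-compute-then-lookup pattern can be sketched as follows. The corpus embeddings are made-up unit vectors standing in for cached encoder outputs, and `top_k` is an illustrative helper, not a sentence-transformers API.

```python
def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_embedding, corpus_embeddings, k=2):
    """Rank pre-computed corpus embeddings by dot product with the query."""
    scored = [
        (dot(query_embedding, emb), idx)
        for idx, emb in enumerate(corpus_embeddings)
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Computed once, offline, and cached.
corpus_embeddings = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
    [-1.0, 0.0],
]
# Computed at query time with a single encoder forward pass.
query_embedding = [0.8, 0.6]

nearest = top_k(query_embedding, corpus_embeddings)  # indices of the best matches

# Forward passes per query for 10,000 candidates.
candidates = 10_000
cross_encoder_passes = candidates  # one joint pass per (query, candidate) pair
bi_encoder_passes = 1              # only the query needs encoding
```

In production the linear scan is usually replaced by an approximate nearest-neighbor index, but the encoding cost argument is the same.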

The original SBERT models are trained on Natural Language Inference (NLI) data (SNLI and MultiNLI), which contains sentence pairs labeled as entailment, contradiction, or neutral. In a contrastive setup, entailing pairs serve as positive examples and contradicting pairs as negative examples. This fine-tuning pushes the model to encode not just surface similarity but semantic entailment, making the embedding geometry reflect logical relationships, not just word overlap.

Interview Tip

Sentence transformers have largely replaced word embeddings for text similarity in production systems. A common default for general English text similarity is 'all-MiniLM-L6-v2' from the sentence-transformers library. It produces 384-dimensional embeddings, is small enough to run on CPU (the library's documentation lists roughly 14,000 sentences per second, measured on a V100 GPU; CPU throughput is considerably lower), and outperforms averaged Word2Vec and GloVe embeddings on standard semantic similarity benchmarks by a wide margin.

Modern sentence transformer training uses Multiple Negatives Ranking Loss, which is essentially InfoNCE applied to sentence pairs. In a batch of $N$ (query, positive) pairs, all other positive sentences in the batch serve as negatives for each query. This avoids the need for explicitly labeled negative pairs: you only need a dataset of similar sentence pairs, and the batch structure provides negatives automatically.

The key insight about sentence transformers is that they decouple encoding from comparison. A BERT cross-encoder reads both sentences together and attends across them. It can capture very fine-grained interactions but must be run for every query-document pair at inference time. A bi-encoder (like SBERT) encodes each sentence independently. It cannot capture cross-sentence interactions but supports pre-computation and approximate nearest-neighbor search. For retrieval at scale, bi-encoders are the standard first stage, with cross-encoders applied as a re-ranking step on the top-k results.
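The two-stage pattern described above can be sketched end to end. Everything here is a toy stand-in: the embeddings are made up, and `toy_cross_score` replaces a real cross-encoder with a simple shared-word count purely so the example runs without a model.

```python
def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def toy_cross_score(query, document):
    """Stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def retrieve_then_rerank(query, query_embedding, corpus, corpus_embeddings,
                         cross_score, k=3):
    """Stage 1: cheap dot-product retrieval over cached embeddings.
    Stage 2: rerank only the top-k with the expensive pairwise scorer."""
    first_stage = sorted(
        range(len(corpus)),
        key=lambda i: dot(query_embedding, corpus_embeddings[i]),
        reverse=True,
    )[:k]
    reranked = sorted(
        first_stage,
        key=lambda i: cross_score(query, corpus[i]),
        reverse=True,
    )
    return first_stage, reranked


corpus = [
    "how to reset a password",
    "password strength tips",
    "reset the router",
    "cooking pasta",
]
# Placeholder bi-encoder embeddings for the corpus and the query.
corpus_embeddings = [[0.7, 0.7], [1.0, 0.4], [0.2, 0.9], [-0.5, -0.5]]
query = "reset my password"
query_embedding = [0.8, 0.5]

first_stage, reranked = retrieve_then_rerank(
    query, query_embedding, corpus, corpus_embeddings, toy_cross_score
)
# The bi-encoder ranks document 1 first; reranking promotes document 0.
```

The expensive scorer runs k times per query instead of once per document in the corpus, which is the point of the split.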

Hard Negatives and Temperature

Two choices control the character of contrastive training more than any others: the difficulty of your negatives and the temperature parameter. Getting these right is the difference between a system that learns meaningful embeddings and one that learns only to separate obviously unrelated examples.

Hard negatives are examples that are semantically similar to the anchor but should not be considered a match. In a medical image retrieval system, an X-ray showing pneumonia is a hard negative for a query about COVID-19 lung involvement: both show lung abnormalities, but they require different treatment. Easy negatives are X-rays of broken bones, which are trivially unrelated to lung conditions. The key insight is that easy negatives provide almost no learning signal once the model has learned basic domain boundaries. Once the encoder knows that lungs and bones look different, the easy negatives are already far away and contribute near-zero gradient. All the learning that matters happens on hard negatives.

[Figure: Hard negatives are where the learning signal lives. The plot shows an anchor, its positives (same class), hard negatives, and easy negatives. Easy negatives (far from the anchor) produce near-zero loss and waste gradient. The training signal comes from hard negatives, the points just outside the positive cluster that the model currently gets wrong. Mining them explicitly is how many modern contrastive systems stay sample-efficient.]

Hard negative mining is the practice of explicitly finding challenging negatives to include in training batches. The most common approach is offline mining: after each epoch, run inference on the full dataset, find the nearest neighbors of each anchor that are not positive matches, and include those as explicit negatives. Online hard negative mining does this within each batch: after computing embeddings for the batch, identify the hardest negatives (highest similarity to anchor among the non-positive examples) and upweight them in the loss.
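The selection step of online mining can be sketched over an in-batch similarity matrix. This example only finds the hardest negative per anchor; how those negatives are then upweighted in the loss is left out. The matrix values are made up, and positives are assumed to sit on the diagonal.

```python
def hardest_negatives(sim):
    """For each anchor i, return the index of its most similar non-positive.

    sim[i][j] is the similarity of anchor i to candidate j; the positive
    for anchor i is assumed to be candidate i (the diagonal).
    """
    result = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        result.append(max(candidates)[1])
    return result


sim = [
    [0.9, 0.7, 0.1],  # anchor 0: candidate 1 is a close negative
    [0.2, 0.8, 0.3],  # anchor 1: candidate 2 is the closest negative
    [0.6, 0.0, 0.5],  # anchor 2: candidate 0 outscores the positive (0.6 > 0.5)
]

hard = hardest_negatives(sim)
# hard == [1, 2, 0]
```

Anchor 2 is the kind of case that carries the most signal: the model currently ranks a negative above the true positive.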

Interview Tip

When fine-tuning a sentence-transformer for your domain, the single most impactful decision is often your choice of hard negatives, not the model architecture or training hyperparameters. A hard negative is a pair that LOOKS similar but should NOT be matched, for example two questions about different topics that share keywords. Mining these from your own data (using a first-pass retrieval model to find near-misses) can improve retrieval quality by something on the order of 10-20% versus random negatives, though the gain depends heavily on the dataset.

Common Pitfall

False negatives are a training failure mode in contrastive learning. If hard negative mining selects examples that are actually semantically equivalent to the anchor but happen to carry a different label, the loss pushes apart examples that belong together, working against the positive pairs that pull them together. This is called false negative contamination. Symptoms: training loss decreases but retrieval quality does not improve, and the nearest neighbors of an anchor in embedding space are often correct answers that were labeled as negatives. Always inspect hard negatives manually before using them in training.

The temperature parameter $\tau$ in the InfoNCE loss controls how the similarity distribution over the batch is sharpened or flattened. A small temperature ($\tau = 0.07$, used in CLIP) makes the softmax very sharp, so the loss concentrates on the hardest cases in the batch. A large temperature ($\tau = 1.0$) produces a flat distribution where all negatives receive roughly equal gradient. The effect of temperature is equivalent to scaling all similarity scores by $1/\tau$ before applying softmax.
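A short numeric sketch shows the sharpening effect on one row of similarities. The similarity values are made up; the two temperatures are the ones mentioned above.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities after scaling by 1 / temperature."""
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Positive first, then one hard negative and two easy negatives.
sims = [0.8, 0.7, 0.2, -0.1]

sharp = softmax_with_temperature(sims, 0.07)
flat = softmax_with_temperature(sims, 1.0)
```

At a temperature of 0.07 nearly all the probability mass sits on the positive and the one hard negative, while at 1.0 the easy negatives still hold a sizeable share.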

[Figure: Temperature in softmax: from flat to peaked. Probability (y-axis) against class index (x-axis) for τ = 0.1 (sharp), τ = 1.0, and τ = 5.0 (flat). Low temperature sharpens the distribution so that one class dominates. High temperature flattens it so that all classes get similar mass. InfoNCE uses temperature as a hardness knob.]

Too low a temperature creates a problem. When temperature is very small, the loss gradients are dominated entirely by the hardest negatives, examples that score almost as high as the positive pair. If those hard negatives are actually false negatives (semantically equivalent to the anchor despite a different label), very low temperature accelerates the model toward a bad local minimum. Too high a temperature and the model gets small, uniform gradients from all negatives, learning slowly and converging to coarse representations.
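The claim about gradients follows from differentiating the loss with respect to each similarity score. Writing $s_{ik}$ for $\text{sim}(i, k)$ and $p_{ik}$ for the softmax probability assigned to candidate $k$ (notation introduced here for the derivation):

```latex
\mathcal{L}_i = -\frac{s_{ij}}{\tau} + \log \sum_{k=1}^{N} \exp\!\left(\frac{s_{ik}}{\tau}\right),
\qquad
p_{ik} = \frac{\exp(s_{ik}/\tau)}{\sum_{l=1}^{N} \exp(s_{il}/\tau)}

\frac{\partial \mathcal{L}_i}{\partial s_{ij}} = -\frac{1}{\tau}\,(1 - p_{ij}),
\qquad
\frac{\partial \mathcal{L}_i}{\partial s_{ik}} = \frac{1}{\tau}\, p_{ik} \quad (k \neq j)
```

Each negative is pushed away in proportion to its softmax probability. A small $\tau$ concentrates that probability on the highest-scoring negatives, so they receive almost all of the gradient; a large $\tau$ spreads it nearly evenly.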

The sweet spot, typically $\tau \in [0.05, 0.2]$ for normalized embeddings, provides gradients that are concentrated on hard cases but not so extreme that the entire batch loss is determined by one or two noisy hard negatives. The learnable temperature used in CLIP starts at 0.07 and is learned during training, which allows the model to discover the optimal sharpening automatically.

Contrastive learning is the training paradigm behind many systems you will build in later courses. The embedding models used in RAG retrieval (Course 3) are trained with contrastive objectives on query-document pairs. Many vision-language models, including systems in the style of GPT-4V, build on the kind of contrastive image-text alignment that CLIP introduced. And the reward models in RLHF (Course 2) use contrastive-style preference learning to rank model outputs. The transformer encoders that power all of these systems are the subject of Course 2, where you will build the self-attention mechanism from scratch and see how it creates the representations that contrastive learning trains.