Self-Supervised Learning Beyond Contrastive
Contrastive learning gave self-supervised vision its first big win. SimCLR and MoCo proved that you could train image encoders from unlabeled data and match, or nearly match, supervised ImageNet pretraining on downstream tasks. That was the good news. The bad news, which anyone who actually trained these systems quickly discovered, is that the contrastive recipe depends on one ingredient that is painful to supply at scale: a very large pool of informative negative samples.
This section explains why the field moved on. The story is not "contrastive learning was wrong." Contrastive methods still work, and CLIP's contrastive objective is the foundation of modern multimodal models. The story is that for pure visual representation learning, researchers discovered around 2020-2021 that you can drop the negatives entirely and still learn excellent features, provided you add the right architectural ingredient to prevent the obvious failure mode. That discovery, and the wave of methods that followed, is what this lesson is about.
The Negative-Sample Tax
InfoNCE works by making a positive pair score higher than all negative pairs in the batch. The denominator of the softmax sums over every negative, which means the number of negatives directly determines how hard the discrimination task is. SimCLR found that representation quality keeps improving as you push the batch size from 256 to 4096 to 8192. Bigger batch, more negatives, harder task, better features.
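To make the role of the denominator concrete, here is a minimal NumPy sketch of an in-batch InfoNCE loss. It is a simplified, one-directional version written for illustration, not SimCLR's exact formulation; the function name `info_nce`, the temperature of 0.1, and the random stand-in embeddings are all assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """In-batch InfoNCE: row i of z_a is the positive for row i of z_b.

    Every other row of z_b acts as a negative for row i.
    """
    # L2-normalize so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    # Numerically stable log-softmax over each row; the row sum is the
    # denominator that runs over the positive and all N - 1 negatives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                      # stand-in embeddings
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))
unrelated = info_nce(z, rng.normal(size=z.shape))
print(f"aligned views: {aligned:.3f}, unrelated views: {unrelated:.3f}")
```

Growing the batch adds columns to `logits`, which is exactly how a larger batch makes the discrimination task harder.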
This creates a compute wall. Every negative has to be encoded on every step. A batch of 4096 images through a ResNet-50 does not fit on a single V100. The SimCLR paper trained on Cloud TPUs, using up to 128 cores. MoCo, which actually appeared a few months before SimCLR, attacked the same problem from the other side: building on the earlier memory-bank idea, it keeps a queue of past key encodings to decouple the pool of negatives from the batch size, so you could get 65,536 negatives without 65,536 forward passes per step. But MoCo introduced its own complications: the momentum encoder, the queue management, the distinction between the key and query networks. The architecture started to feel like a collection of workarounds.
BYOL: collapse prevention without negatives
The deeper question is philosophical. Contrastive loss says: "learn representations where positives are closer than negatives." But you do not actually care about the negatives. You care about the positives being close and about the representation space being non-collapsed. The negatives are a means to an end. A clever researcher asked: is there a way to prevent collapse without negatives?
The Trivial Solution and Why It Matters
Imagine you delete the negative term from the InfoNCE loss and just minimize the distance between positive pairs. What happens? The encoder immediately learns the trivial solution: map every input to the same constant vector. Now every positive pair has distance zero. The loss is minimized. The representation is useless.
This is called representation collapse, and it is the first thing every self-supervised researcher runs into when they try to eliminate negatives. Collapse is not subtle. It happens almost instantly if you do not prevent it. The negatives in InfoNCE exist precisely to prevent collapse: they force the encoder to produce different outputs for different inputs, because if it did not, the loss would punish it for failing to discriminate.
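The argument can be checked numerically. The sketch below is a toy illustration under stated assumptions: `collapsed_encoder` is a hypothetical degenerate encoder that ignores its input, and the two "views" are random arrays standing in for augmented images. A positive-only loss is perfectly happy with the collapsed encoder, while an in-batch InfoNCE loss leaves it stuck at log N, the value you get by guessing.

```python
import numpy as np

def positive_only_loss(z_a, z_b):
    # Mean squared distance between the two views of each image.
    return float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))

def in_batch_info_nce(z_a, z_b, temperature=0.1):
    # Simplified InfoNCE: diagonal entries are positives, the rest negatives.
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def collapsed_encoder(x):
    # Hypothetical degenerate encoder: same constant vector for every input.
    return np.ones((x.shape[0], 4))

rng = np.random.default_rng(0)
view_a = rng.normal(size=(6, 32))   # stand-ins for two augmentations
view_b = rng.normal(size=(6, 32))

z_a, z_b = collapsed_encoder(view_a), collapsed_encoder(view_b)
no_negatives = positive_only_loss(z_a, z_b)   # collapse is a perfect solution
with_negatives = in_batch_info_nce(z_a, z_b)  # collapse is stuck at log(N)
print(no_negatives, with_negatives, np.log(6))
```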
So the question becomes: is there a way to prevent collapse without explicitly pushing negatives apart? For several years, the answer was assumed to be no. Then BYOL came out and broke that assumption.
The key insight of non-contrastive SSL is that collapse prevention does not have to come from negatives in the loss. It can come from the structure of the training setup instead. BYOL uses an EMA target network and a predictor head. SimSiam uses a stop-gradient and a predictor. MAE sidesteps the problem because a constant output cannot satisfy its masked reconstruction objective. Each method finds a different way to make collapse an unstable equilibrium rather than a stable minimum.
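A rough sketch of the two asymmetries named above, the predictor head and the EMA target, might look like the following. This is a forward-pass illustration only, assuming toy linear encoders and a SimSiam-style negative cosine loss; the names `neg_cosine` and `ema_update` and the value `tau=0.99` are choices made for this example, not the published settings. NumPy has no autodiff, so the stop-gradient appears only as a comment.

```python
import numpy as np

def neg_cosine(p, z):
    # z comes from the target branch. In an autodiff framework it would be
    # wrapped in a stop-gradient (detach) so no gradient flows through it.
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return float(-np.mean(np.sum(p * z, axis=1)))

def ema_update(target, online, tau=0.99):
    # Target weights trail the online weights:
    # target <- tau * target + (1 - tau) * online
    return {k: tau * target[k] + (1.0 - tau) * online[k] for k in target}

rng = np.random.default_rng(0)
online = {"w": rng.normal(size=(32, 8))}      # toy linear encoder
target = {"w": online["w"].copy()}            # starts as a copy of online
predictor = rng.normal(size=(8, 8))           # exists only on the online side
w_before = target["w"].copy()

x_a, x_b = rng.normal(size=(4, 32)), rng.normal(size=(4, 32))
p_a = (x_a @ online["w"]) @ predictor         # online branch: encode, then predict
z_b = x_b @ target["w"]                       # target branch: encode only
loss = neg_cosine(p_a, z_b)

online["w"] = online["w"] + 0.1               # pretend an optimizer step happened
target = ema_update(target, online)           # target moves 1 percent of the way
```

Setting `tau=0` makes the target an exact copy of the online weights after every step, which is the SimSiam end of the spectrum.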
The Landscape of Methods
The years since 2020 have produced a rich taxonomy of self-supervised methods for vision. Knowing the names helps you read papers without getting lost. Here is the lay of the land.
Contrastive methods (require negatives):
- SimCLR (Chen et al., 2020) — InfoNCE over in-batch negatives, massive batch sizes.
- MoCo v1/v2 (He et al., 2020) — momentum encoder plus a queue of past keys as negatives.
- MoCo v3 (Chen et al., 2021) — MoCo applied to Vision Transformers, with InfoNCE.
Non-contrastive methods (no negatives):
- BYOL (Grill et al., 2020) — EMA target network, predictor head, stop-gradient.
- SimSiam (Chen and He, 2020) — BYOL without the EMA, just stop-gradient and a predictor.
- Barlow Twins (Zbontar et al., 2021) — decorrelation-based collapse prevention.
- VICReg (Bardes et al., 2022) — variance, invariance, covariance regularization.
Masked image modeling (reconstruction-based, no pairs at all):
- BEiT (Bao et al., 2021) — predict discrete tokens of masked patches.
- MAE (He et al., 2021) — reconstruct pixels of masked patches, 75 percent mask ratio.
- SimMIM (Xie et al., 2021) — concurrent work to MAE, simpler reconstruction target.
Self-distillation methods (student-teacher with shared architecture):
- DINO (Caron et al., 2021) — self-distillation with centering and sharpening.
- iBOT (Zhou et al., 2021) — DINO plus masked image modeling on patch tokens.
- DINOv2 (Oquab et al., 2023) — scaled DINO with curated data, current state of the art.
Each family makes different trade-offs. Contrastive methods are conceptually simple but compute-hungry. Non-contrastive methods drop the large-batch requirement but can be unstable to train. Masked modeling is architecturally elegant and scales especially well with model size. Self-distillation tends to produce the strongest features for dense prediction tasks. You will see these names everywhere in modern vision, and you should know which family each belongs to.
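The 75 percent mask ratio listed for MAE above is easy to picture in code. The following is a minimal sketch of per-image random patch masking, assuming a 224x224 image cut into 16x16 patches; the helper name `random_patch_mask` is hypothetical and the sketch leaves out the encoder and decoder entirely.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return (visible_idx, masked_idx) for one image."""
    rng = rng or np.random.default_rng()
    num_keep = int(num_patches * (1.0 - mask_ratio))
    perm = rng.permutation(num_patches)       # random order of patch indices
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
visible, masked = random_patch_mask(196, rng=np.random.default_rng(0))
print(len(visible), "visible patches,", len(masked), "to reconstruct")
```

Only the visible patches would be passed through the encoder, which is a large part of why a high mask ratio keeps pretraining cheap.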
The Collapse Fixed Point, Explained More Carefully
When researchers talk about "representation collapse" in self-supervised learning, they usually mean one of two specific failure modes. Knowing the difference is useful because different methods collapse in different ways and the debugging steps are different.
The first type is complete collapse, where the encoder maps every input to the same constant vector. This is the failure mode that InfoNCE is designed to prevent by pushing negatives apart, and it is what happens if you naively drop the negative term from InfoNCE. Complete collapse is easy to diagnose: the feature covariance matrix is essentially zero because no feature varies across inputs, linear probe accuracy is at chance level, and the encoder outputs are nearly identical across different inputs. You can see this by printing a few feature vectors from the encoder and noticing they are all nearly the same.
The second type is partial collapse or dimensional collapse, where the encoder uses only a few dimensions of its output space and the rest are essentially constant. This is more subtle and can be missed if you only check the mean of features. A model with dimensional collapse might still produce different outputs for different inputs, but the variation lives in a low-dimensional subspace of the nominal feature space. Dimensional collapse hurts downstream performance because the effective dimensionality of the features is much smaller than it looks.
The diagnostic for dimensional collapse is to compute the singular values of the feature covariance matrix. A healthy representation has many non-trivial singular values. A dimensionally collapsed representation has a few large singular values and many near-zero ones. Visualizing the singular value spectrum is a quick way to check whether a self-supervised model is producing high-quality features.
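The diagnostic described above takes only a few lines of NumPy. The sketch below assumes you already have a matrix of encoder outputs with one row per image; here random data stands in for real features, and the collapsed case is simulated by projecting 4-dimensional noise into 64 dimensions. The entropy-based `effective_rank` summary is one convenient choice for this example, not something the methods in this lesson prescribe.

```python
import numpy as np

def covariance_spectrum(features):
    """Singular values of the feature covariance matrix, largest first."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    return np.linalg.svd(cov, compute_uv=False)

def effective_rank(singular_values, eps=1e-12):
    # Exponential of the entropy of the normalized spectrum: close to the
    # feature dimension when variance is spread out, small when it is not.
    p = singular_values / (singular_values.sum() + eps)
    return float(np.exp(-np.sum(p * np.log(p + eps))))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(2000, 64))
# Simulated dimensional collapse: 64-d features that live in a 4-d subspace.
collapsed = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 64))

healthy_rank = effective_rank(covariance_spectrum(healthy))
collapsed_rank = effective_rank(covariance_spectrum(collapsed))
print(f"healthy: {healthy_rank:.1f} of 64, collapsed: {collapsed_rank:.1f} of 64")
```

Note that the collapsed features still differ from input to input, so a check on individual feature vectors would not catch the problem; only the spectrum does.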
Methods like Barlow Twins and VICReg are specifically designed to prevent dimensional collapse by including a decorrelation term in the loss. BYOL and SimSiam rely on the predictor head and stop-gradient to prevent both complete and dimensional collapse implicitly, which is more elegant but less directly controllable. Understanding these failure modes helps you pick the right method and helps you diagnose training runs that produce mediocre features.
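To see how an explicit loss term can target each failure mode, here is a sketch of VICReg-style variance and covariance terms in NumPy. It omits the invariance term and the loss weights, and the test inputs (a constant matrix for complete collapse, one column repeated 16 times for redundancy) are synthetic stand-ins chosen for this example.

```python
import numpy as np

def variance_term(z, gamma=1.0, eps=1e-4):
    # Hinge on the per-dimension standard deviation: any dimension whose
    # std falls below gamma is penalized, which rules out complete collapse.
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_term(z):
    # Sum of squared off-diagonal covariance entries divided by the
    # dimension: penalizes dimensions that carry redundant information.
    n, d = z.shape
    centered = z - z.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(1000, 16))
constant = np.ones((1000, 16))                          # complete collapse
redundant = np.repeat(rng.normal(size=(1000, 1)), 16, axis=1)  # one real dimension

for name, z in [("healthy", healthy), ("constant", constant), ("redundant", redundant)]:
    print(f"{name:>9}: variance={variance_term(z):.3f}, covariance={covariance_term(z):.3f}")
```

The variance term fires on the constant features and the covariance term fires on the redundant ones, which is the sense in which these methods control collapse directly.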
A Closer Look at Why SimCLR Hit a Wall
It is worth understanding the specific numbers that made the contrastive approach feel unsustainable. SimCLR's scaling curve, reported in the original paper and in follow-up ablations, shows roughly the following pattern on ImageNet linear probe accuracy. At batch size 256 you get about 60 percent. At batch size 1024 you get about 65 percent. At batch size 4096 you get about 68-69 percent. At batch size 8192 you get about 70 percent. Each step up in batch size gives diminishing returns, and the gap narrows with longer training, but the curve keeps climbing, and the practical best number requires batch sizes that only a TPU pod can provide.
The memory math is where it hurts. A ResNet-50 with ImageNet-scale inputs needs roughly 2 GB of activation memory per batch of 32 during training (for the standard 224x224 resolution with typical augmentations). A batch of 4096 is 128 such chunks, so that is 256 GB of activations, far beyond any single GPU. SimCLR's authors spread each batch across many TPU cores, so every core held only a slice of the activations, but that kind of hardware is not accessible to most researchers. Gradient accumulation does not rescue you either, because InfoNCE needs all of its negatives encoded in the same step. The method was practical at Google but not at smaller labs.
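The arithmetic is simple enough to script. This sketch takes the rough figure of 2 GB per 32 images from the paragraph above and scales it linearly; the 32 GB figure assumes the larger V100 variant, and the helper name `activation_memory_gb` is made up for this example.

```python
GB_PER_32_IMAGES = 2.0    # rough ResNet-50 estimate at 224x224, from the text
V100_MEMORY_GB = 32.0     # assumption: the larger 32 GB V100 variant

def activation_memory_gb(batch_size, gb_per_chunk=GB_PER_32_IMAGES, chunk=32):
    # Activation memory scales linearly with the number of images in flight.
    return batch_size / chunk * gb_per_chunk

for batch_size in (256, 1024, 4096, 8192):
    need = activation_memory_gb(batch_size)
    print(f"batch {batch_size:>5}: ~{need:.0f} GB, "
          f"{need / V100_MEMORY_GB:.1f}x one 32 GB V100")
```

This counts one view per image; SimCLR encodes two augmented views of every image, so the real footprint is larger still.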
MoCo addressed this by introducing a queue of past encodings as negatives, so you did not need a huge batch size on every step. The queue could be 65,536 entries long while the actual batch size was 256. But MoCo introduced the momentum encoder to keep the queue entries consistent (otherwise they would go stale as the query network changed), and that added its own hyperparameters and failure modes. MoCo v3 carried the momentum encoder over to Vision Transformers, dropping the queue in favor of large in-batch negatives, and became the main contrastive baseline through 2021, but the field was already looking for simpler alternatives.
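The queue itself is a plain FIFO buffer. Below is a minimal sketch using Python's `collections.deque`, with `(step, index)` tuples standing in for real key vectors so the staleness is visible; the class name `NegativeQueue` is hypothetical, and this is not MoCo's actual implementation, which stores tensors on the accelerator.

```python
from collections import deque

class NegativeQueue:
    """FIFO pool of past key encodings, in the spirit of MoCo's queue."""

    def __init__(self, max_size):
        # A deque with maxlen drops its oldest entries once it is full.
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

    def __len__(self):
        return len(self.keys)

queue = NegativeQueue(max_size=65536)
batch_size = 256
for step in range(300):
    # Stand-in for encoded keys: (step, index) tuples instead of real vectors.
    queue.enqueue([(step, i) for i in range(batch_size)])

oldest_step = queue.keys[0][0]
newest_step = queue.keys[-1][0]
print(f"{len(queue)} negatives, spanning steps {oldest_step} to {newest_step}")
```

With a batch of 256, a full queue holds keys from 256 different training steps. That age spread is the staleness problem the momentum encoder exists to manage.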
What the Non-Contrastive Revolution Actually Proved
The BYOL paper (Grill et al., 2020) landed with a bombshell result: 74 percent ImageNet linear probe accuracy with no negatives, using a batch size of 4096 on TPUs but crucially not requiring the batch size to be large for quality (later runs at batch size 256 reached similar quality). The paper was initially controversial. Several researchers suggested that batch normalization in the projector was secretly providing negative signal: because batch norm couples the outputs of different examples in the same batch, perhaps the decorrelating effect of BN was what prevented collapse. This hypothesis was tested carefully in follow-up work, including a report from the BYOL authors showing that BYOL still trains without batch statistics.
The SimSiam paper (Chen and He, 2020) drove the point home by showing that stop-gradient alone, without EMA and without batch norm in specific places, was enough to prevent collapse, as long as the predictor head was present. This made the "batch norm is secretly contrastive" hypothesis hard to sustain. Non-contrastive SSL was real. The mechanism was architectural asymmetry (predictor head) plus stop-gradient, and the EMA in BYOL was a helpful addition but not the core ingredient.
This was a genuinely surprising result for the field. It meant that the "positive versus negative" framing, which had felt fundamental to contrastive learning, was actually a contingent choice. You could learn useful features from positive pairs alone if you set up the loss correctly. The implications went beyond just reducing compute cost: it opened the door to a whole family of new methods that did not have to fit the contrastive mold.
Why This Matters for Downstream Work
If you are using a pretrained vision encoder today, you are almost certainly using features from one of the methods in this lesson. CLIP gets you language-aligned features. MAE gives you a strong ViT initialization for fine-tuning. DINOv2 gives you the strongest features for frozen-feature transfer, dense prediction, and anything where you do not want to fine-tune at all. Each one is the answer to a different question, and knowing which question each method answers is the difference between picking the right pretrained backbone and wasting weeks on the wrong one.
The practical guidance most teams need is simple. If you have a downstream task and a specific compute budget, pick the method whose strengths match your needs. If you need frozen features out of the box for multiple tasks, reach for DINOv2. If you have labeled fine-tuning data and plenty of compute to run end-to-end training, reach for MAE. If you need text-aligned features for retrieval or generation, reach for CLIP. If you are doing research on pretraining objectives or need the most compute-efficient setup possible, reach for SimSiam. Knowing the landscape lets you skip past the "which paper did I read last week" heuristic and pick based on actual requirements.