CNNs and Residual Learning
When a fully connected network processes an image, it treats every pixel as an independent input. A 224x224 RGB image (224 x 224 x 3) becomes a vector of 150,528 values, and the first hidden layer requires 150,528 weights per neuron. A hidden layer of 1,000 neurons needs about 150 million weights. Before the network has learned anything, you have already spent your parameter budget on just describing what a pixel is.
This is the wrong inductive bias. A horizontal edge detector at one pixel location and a horizontal edge detector at another should use the exact same weights, because they are detecting the same thing in different locations. Convolution formalizes this insight.
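The arithmetic behind these numbers can be checked in a few lines. This is only a sketch of the counting argument; the variable names are illustrative, and biases are ignored.

```python
# Weight count for a fully connected first layer on a 224x224 RGB image.
height, width, channels = 224, 224, 3
hidden_neurons = 1_000

inputs_per_neuron = height * width * channels      # one weight per input value
fc_weights = inputs_per_neuron * hidden_neurons    # biases not counted

print(f"inputs per neuron: {inputs_per_neuron:,}")
print(f"first-layer weights: {fc_weights:,}")
```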
[Figure: Convolution: a filter slides, multiplies, and sums]
The Mechanics of Convolution
A convolutional layer has a small filter (also called a kernel), typically a small square such as 3x3 in modern networks. This filter slides across the input image, visiting every position. At each position, it computes an element-wise product between the filter weights and the overlapping input patch, sums the results, and writes one output value. The complete set of output values forms the feature map.
A 3x3 filter has 9 weights. Those 9 weights are reused at every spatial position in the image. For a 224x224 image, the filter visits roughly 50,000 positions but uses only 9 parameters. This is weight sharing, and it is why CNNs are so parameter-efficient compared to fully connected layers on image data.
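The slide, multiply, and sum procedure can be sketched with explicit loops in NumPy. This is a minimal illustration assuming a single-channel input, stride 1, and no padding; the function name `cross_correlate2d` and the averaging filter are made up for the example, and real frameworks use much faster implementations.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]          # overlapping input patch
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # hypothetical 3x3 averaging filter
feature_map = cross_correlate2d(image, kernel)
print(feature_map.shape)  # (3, 3)
```

The same 9 kernel values are applied at all 9 output positions, which is the weight sharing in miniature.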
A single filter detects a single type of feature (a horizontal edge, a specific color gradient, a corner). A convolutional layer uses many filters in parallel, typically 32, 64, 128, or more. Each filter produces its own feature map. The output of the layer is a stack of feature maps, one per filter. This is why the output of a convolutional layer has three dimensions: height x width x channels, where channels equals the number of filters.
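To make the stacking concrete, the sketch below applies several filters to one input and collects the results along a channel axis. It assumes a single-channel 8x8 input and four random 3x3 filters (both arbitrary choices for illustration) and uses NumPy's `sliding_window_view` to extract patches.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))        # single-channel input (size is arbitrary)
filters = rng.standard_normal((4, 3, 3))   # 4 filters, each 3x3

# Every 3x3 patch of the image: shape (6, 6, 3, 3) with stride 1 and no padding.
patches = sliding_window_view(image, (3, 3))

# Dot each patch with each filter; one feature map per filter, stacked last.
feature_maps = np.einsum("ijkl,fkl->ijf", patches, filters)
print(feature_maps.shape)  # (6, 6, 4): height x width x channels
```

Note that this follows the height x width x channels layout used in the text; PyTorch tensors put the channel axis before height and width.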
What Convolution Computes
Convolution is a dot product between the filter and each local patch of the input. A filter that looks like a horizontal edge (positive weights on top row, negative weights on bottom row) produces high responses where horizontal edges exist in the input and near-zero responses elsewhere. Learning the filter weights is equivalent to learning which local patterns to detect.
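As a small illustration of the edge-detector description, the sketch below applies a hand-written horizontal edge filter to a toy image whose top half is bright and bottom half is dark. The image and filter values are assumptions chosen for the example, not learned weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Positive weights on the top row, negative weights on the bottom row.
edge_filter = np.array([[ 1.0,  1.0,  1.0],
                        [ 0.0,  0.0,  0.0],
                        [-1.0, -1.0, -1.0]])

image = np.zeros((6, 6))
image[:3, :] = 1.0   # bright top half, dark bottom half: one horizontal edge

patches = sliding_window_view(image, (3, 3))                 # (4, 4, 3, 3)
response = np.einsum("ijkl,kl->ij", patches, edge_filter)    # dot product per patch
print(response)
```

Rows of the response whose patches straddle the edge are large; rows whose patches lie entirely inside the flat bright or flat dark region are zero.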
The receptive field of a neuron in a convolutional layer is the region of the input that affects its value. For the first convolutional layer, the receptive field equals the filter size (3x3). After two convolutional layers with 3x3 filters, the receptive field grows to 5x5. After 10 such layers, it is 21x21, because each additional stride-1 3x3 layer adds 2 pixels to each dimension. Deep networks build large receptive fields through composition of small filters.
What most deep learning frameworks call convolution is actually cross-correlation. True convolution flips the filter before sliding. For learning purposes this does not matter: the network learns whichever set of weights produces correct outputs, flipped or not. PyTorch's nn.Conv2d implements cross-correlation and calls it convolution. The terminology is settled by convention, not mathematical precision.
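The growth of the receptive field can be computed with a short recurrence. The helper below is a sketch (the name `receptive_field` and the `(kernel_size, stride)` layer encoding are assumptions for illustration); it ignores dilation and padding effects at the image border.

```python
def receptive_field(layers):
    """Receptive field side length for a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1          # jump = input pixels between adjacent outputs so far
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

for depth in (1, 2, 10):
    side = receptive_field([(3, 1)] * depth)
    print(f"{depth} layer(s) of 3x3, stride 1: {side}x{side}")
```

With stride 1 throughout, this reduces to 1 + 2 * depth for 3x3 filters; strided layers make the field grow faster because `jump` increases.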
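The relationship between the two operations is easy to check in one dimension, where NumPy provides both: `np.convolve` flips the filter, `np.correlate` does not. The signal and filter values below are arbitrary illustrations.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])   # asymmetric, so flipping changes the result

correlated = np.correlate(signal, kernel, mode="valid")   # slide without flipping
convolved = np.convolve(signal, kernel, mode="valid")     # flip, then slide

# True convolution equals cross-correlation with the flipped kernel.
flipped_match = np.allclose(convolved, np.correlate(signal, kernel[::-1], mode="valid"))
print(correlated, convolved, flipped_match)
```

Because a learned filter can simply take on the flipped values, a network trained with either operation can represent the same function.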
Stride and Padding
Two hyperparameters control the output spatial dimensions:
- Stride: How many pixels the filter moves per step. Stride 1 moves one pixel at a time (dense sampling). Stride 2 moves two pixels, halving the spatial dimensions. Strided convolutions replace pooling in some modern architectures.
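- Padding: Whether to add zeros around the input border. Padding of 0 (valid padding) reduces spatial dimensions; padding of (kernel - 1) / 2 for an odd kernel size at stride 1 (same padding, e.g., padding 1 for a 3x3 filter) preserves spatial dimensions.
The output size along each spatial dimension is output_size = floor((input + 2*padding - kernel) / stride) + 1. Memorize this formula; you will use it constantly.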
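The formula translates directly into a helper function. This is a sketch; the name `conv_output_size` is an assumption, and it covers one spatial dimension without dilation.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel) // stride + 1

same = conv_output_size(224, kernel=3, stride=1, padding=1)     # same padding
strided = conv_output_size(224, kernel=3, stride=2, padding=1)  # stride 2
valid = conv_output_size(224, kernel=3, stride=1, padding=0)    # valid padding
print(same, strided, valid)
```

Integer division (`//`) implements the floor in the formula for the non-negative values that arise here.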
When debugging CNNs, print feature map shapes after every layer. The most common bug is a spatial dimension collapsing to 1x1 too early (too much striding or pooling) or not collapsing at all (no pooling before the classifier). A healthy CNN feature map progression for a 224x224 input looks like: 224 -> 112 -> 56 -> 28 -> 14 -> 7 -> global_avg_pool -> classifier. If your shapes deviate from this pattern, check your stride and pooling configuration.
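One way to print shapes after every layer is to run a dummy input through the model one layer at a time. The sketch below assumes the model is an `nn.Sequential`; the toy architecture (five stride-2 3x3 convolutions, the channel widths, and a 10-class classifier) is hypothetical and chosen only to reproduce the 224 -> 7 progression.

```python
import torch
import torch.nn as nn

# Hypothetical toy CNN: five stride-2 convs, global average pooling, linear classifier.
channels = [3, 16, 32, 64, 128, 256]
layers = []
for c_in, c_out in zip(channels[:-1], channels[1:]):
    layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels[-1], 10)]
model = nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)   # dummy batch: (batch, channels, height, width)
shapes = []
with torch.no_grad():
    for layer in model:
        x = layer(x)
        shapes.append((layer.__class__.__name__, tuple(x.shape)))
        print(f"{layer.__class__.__name__:<18} {tuple(x.shape)}")
```

For models that are not a plain `nn.Sequential`, forward hooks registered with `register_forward_hook` serve the same purpose.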
For a 224x224 input with a 3x3 filter, stride 1, and same padding, the output is 224x224. With stride 2, the output is 112x112. Understanding this arithmetic is necessary for designing architectures where dimensions must match (e.g., skip connections in ResNets require matching spatial dimensions).
nn.Conv2d in Practice
Here is a convolutional layer in PyTorch with shape annotations:
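A minimal sketch of such a layer follows. The 64 filters, kernel_size=3, and padding=1 come from the text; the 3-channel RGB input, the batch of one image, and the 224x224 size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# 64 filters of size 3x3 over an assumed 3-channel (RGB) input.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
y = conv(x)                        # (1, 64, 224, 224): same height and width, 64 channels

print(tuple(conv.weight.shape))    # (64, 3, 3, 3): out_channels, in_channels, kH, kW
print(tuple(y.shape))

# Weights (3 * 3 * 3 * 64) plus one bias per filter (64).
num_params = sum(p.numel() for p in conv.parameters())
print(num_params)
```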
The padding=1 with kernel_size=3 preserves spatial dimensions (same padding). Each of the 64 filters produces one channel of the output feature map. The total parameter count is in_channels x 3 x 3 x 64 weights plus 64 biases (1,792 for a 3-channel RGB input), tiny compared to a fully connected equivalent.