CNNs and Residual Learning

Topics Covered

Convolution as Feature Detection

The Mechanics of Convolution

What Convolution Computes

Stride and Padding

nn.Conv2d in Practice

Pooling and Hierarchical Features

The Feature Hierarchy

Average Pooling and Global Average Pooling

Residual Connections

The Skip Connection

The Residual Stream View

Depth vs Width

What Depth Buys You

What Width Buys You

EfficientNet and Compound Scaling

Depth and Transformers

When a fully connected network processes an image, it treats every pixel as an independent input. A $224 \times 224$ RGB image becomes a vector of 150,528 values ($224 \times 224 \times 3$), and the first hidden layer requires 150,528 weights per neuron. A hidden layer of 1,000 neurons needs about 150 million weights. Before the network has learned anything, you have already spent your parameter budget on just describing what a pixel is.

This is the wrong inductive bias. A horizontal edge detector at pixel $(50, 50)$ and a horizontal edge detector at pixel $(150, 150)$ should use the exact same weights, because they are detecting the same thing in different locations. Convolution formalizes this insight.

Figure: Convolution: a filter slides, multiplies, and sums

Input (5×5), every row: 0 0 9 9 9
Kernel (3×3), rows: -1 0 1 / -2 0 2 / -1 0 1
Output at position (0, 0): -1·0 + 0·0 + 1·9 + -2·0 + 0·0 + 2·9 + -1·0 + 0·0 + 1·9 = 36
At every valid position, the 3×3 kernel overlaps a patch of the input. The element-wise products are summed into a single output value, then the kernel slides one step. Here the kernel is a Sobel-x edge detector: it responds strongly on the left, where the image transitions 0 → 9 (a vertical edge), and returns 0 over the uniform right side. The output is a feature map of "how much edge is present at each location".
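The sliding computation in the figure can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming the figure's input is a 5×5 grid whose rows are all 0 0 9 9 9; the helper name `cross_correlate2d` is made up for illustration and is not a library function.

```python
def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1   # number of valid vertical positions
    out_w = len(image[0]) - kw + 1  # number of valid horizontal positions
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Element-wise product of the kernel with the patch at (i, j)
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Assumed input from the figure: five identical rows, 0 on the left, 9 on the right
image = [[0, 0, 9, 9, 9] for _ in range(5)]
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

feature_map = cross_correlate2d(image, sobel_x)
for row in feature_map:
    print(row)  # each row prints [36, 36, 0]
```

The two left columns of the output overlap the 0 → 9 transition and give 36; the right column sees only 9s, so the positive and negative weights cancel to 0.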

The Mechanics of Convolution

A convolutional layer has a small filter (also called a kernel), typically $3 \times 3$ or $5 \times 5$ in modern networks. This filter slides across the input image at every position. At each position, it computes an element-wise product between the filter weights and the overlapping input patch, sums the results, and writes one output value. The complete set of output values forms the feature map.

A $3 \times 3$ filter has 9 weights. Those 9 weights are reused at every spatial position in the image. For a $224 \times 224$ image, the filter visits roughly 50,000 positions, but with only 9 parameters. This is weight sharing, and it is why CNNs are so parameter-efficient compared to fully connected layers on image data.
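The numbers behind this comparison are easy to check. The sketch below is arithmetic only; the variable names are illustrative, and the 9-weight figure counts a single input channel (an RGB input would need $3 \times 3 \times 3 = 27$ weights per filter).

```python
height, width, channels = 224, 224, 3
hidden_units = 1_000
kernel_size = 3

# Fully connected: every hidden unit has one weight per input value
fc_inputs = height * width * channels
fc_weights = fc_inputs * hidden_units

# Convolution: one small filter, reused at every position
filter_weights = kernel_size * kernel_size  # single input channel
valid_positions = (height - kernel_size + 1) * (width - kernel_size + 1)
same_padding_positions = height * width

print(f"fully connected inputs:   {fc_inputs:,}")
print(f"fully connected weights:  {fc_weights:,}")
print(f"filter weights:           {filter_weights}")
print(f"positions (no padding):   {valid_positions:,}")
print(f"positions (same padding): {same_padding_positions:,}")
```

Either way of counting positions lands near 50,000, which is where the "roughly 50,000" figure comes from.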

A single filter detects a single type of feature (a horizontal edge, a specific color gradient, a corner). A convolutional layer uses many filters in parallel, typically 32, 64, 128, or more. Each filter produces its own feature map. The output of the layer is a stack of feature maps, one per filter. This is why the output of a convolutional layer has three dimensions: height x width x channels, where channels equals the number of filters.

What Convolution Computes

Convolution is a dot product between the filter and each local patch of the input. A filter that looks like a horizontal edge (positive weights on top row, negative weights on bottom row) produces high responses where horizontal edges exist in the input and near-zero responses elsewhere. Learning the filter weights is equivalent to learning which local patterns to detect.

The receptive field of a neuron in a convolutional layer is the region of the input that affects its value. For the first convolutional layer, the receptive field equals the filter size ($3 \times 3$). After two convolutional layers with $3 \times 3$ filters, the receptive field grows to $5 \times 5$. After 10 layers, it is $21 \times 21$. Deep networks build large receptive fields through composition of small filters.
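The growth pattern above follows a simple rule: each extra stride-1 layer with a $k \times k$ filter adds $k - 1$ pixels to the receptive field. A small sketch, assuming stride 1, no dilation, and the same kernel size in every layer (the function name `receptive_field` is illustrative):

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the receptive field after stacking stride-1 convolutions.

    Assumes every layer uses the same kernel size, stride 1, and no dilation.
    """
    return 1 + num_layers * (kernel_size - 1)


for n in (1, 2, 10):
    side = receptive_field(n)
    print(f"{n:>2} layer(s): {side} x {side}")
```

Strides and pooling make the receptive field grow faster than this, so the formula is a lower bound for most real architectures.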

Key Insight

What most deep learning frameworks call convolution is actually cross-correlation. True convolution flips the filter before sliding. For learning purposes this does not matter: the network learns whichever set of weights produces correct outputs, flipped or not. PyTorch's nn.Conv2d implements cross-correlation and calls it convolution. The terminology is settled by convention, not mathematical precision.
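The difference between the two operations is only a 180° flip of the kernel, which a short pure-Python sketch can make concrete. The helper names here (`cross_correlate2d`, `flip180`, `true_convolve2d`) are made up for illustration, and the toy image is an assumption chosen to contain one vertical edge.

```python
def cross_correlate2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def flip180(kernel):
    """Reverse the rows and the columns of the kernel."""
    return [row[::-1] for row in kernel[::-1]]


def true_convolve2d(image, kernel):
    """Mathematical convolution: flip the kernel, then cross-correlate."""
    return cross_correlate2d(image, flip180(kernel))


image = [[0, 0, 9, 9, 9] for _ in range(5)]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

xcorr = cross_correlate2d(image, sobel_x)
conv = true_convolve2d(image, sobel_x)
print(xcorr[0])  # [36, 36, 0]
print(conv[0])   # [-36, -36, 0]: flipping Sobel-x negates it

# A network using true convolution would simply learn the flipped kernel
relearned = true_convolve2d(image, flip180(sobel_x))
print(relearned == xcorr)  # True
```

Because the weights are learned rather than hand-set, either definition reaches the same set of possible outputs.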

Stride and Padding

Two hyperparameters control the output spatial dimensions:

  • Stride: How many pixels the filter moves per step. Stride 1 moves one pixel at a time (dense sampling). Stride 2 moves two pixels, halving the spatial dimensions. Strided convolutions replace pooling in some modern architectures.
  • Padding: Whether to add zeros around the input border. Padding of 0 (valid padding) reduces spatial dimensions; padding of $(k-1)/2$ for an odd kernel size $k$ (same padding) preserves spatial dimensions at stride 1.

Interview Tip

output_size = floor((input + 2*padding - kernel) / stride) + 1. Memorize this — you will use it constantly.
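The formula translates directly into a one-line helper. This is a sketch for a single spatial dimension; the function name `conv_output_size` is illustrative, and Python's `//` plays the role of `floor` for the non-negative values that occur here.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
print(conv_output_size(224, kernel_size=3, stride=1, padding=0))  # 222
```

The same function applies to height and width separately when they differ.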

Interview Tip

When debugging CNNs, print feature map shapes after every layer. The most common bug is a spatial dimension collapsing to 1x1 too early (too much striding or pooling) or not collapsing at all (no pooling before the classifier). A healthy CNN feature map progression for a 224x224 input looks like: 224 -> 112 -> 56 -> 28 -> 14 -> 7 -> global_avg_pool -> classifier. If your shapes deviate from this pattern, check your stride and pooling configuration.
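One way to follow this advice is to run a dummy input through the network layer by layer and print each shape. The network below is a hypothetical toy stack, not an architecture from this article: five stride-2 convolutions chosen only so that the spatial size follows the 224 -> 7 progression, with activations and normalization left out for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical toy network; channel widths and class count are arbitrary
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),     # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),    # 112 -> 56
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),    # 56 -> 28
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # 28 -> 14
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # 14 -> 7
    nn.AdaptiveAvgPool2d(1),                                  # global average pool: 7 -> 1
    nn.Flatten(),                                             # (N, 256, 1, 1) -> (N, 256)
    nn.Linear(256, 10),                                       # classifier
)

x = torch.randn(1, 3, 224, 224)  # dummy batch of one RGB image
shapes = []
with torch.no_grad():
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<18} {tuple(x.shape)}")
```

Iterating over an `nn.Sequential` works because the layers run in order; for models with branching, forward hooks are the usual alternative.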

For a $224 \times 224$ input with a $3 \times 3$ filter, stride 1, and same padding, the output is $224 \times 224$. With stride 2, the output is $112 \times 112$. Understanding this arithmetic is necessary for designing architectures where dimensions must match (e.g., skip connections in ResNets require matching spatial dimensions).

nn.Conv2d in Practice

Here is a convolutional layer in PyTorch with shape annotations:

```python
import torch
import torch.nn as nn

# 3 input channels (RGB), 64 output channels, 3x3 kernel, same padding
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
y = conv(x)                       # (1, 64, 224, 224): same spatial size
# Parameters: 64 filters * (3 * 3 * 3 weights + 1 bias) = 1,792
```

The padding=1 with kernel_size=3 preserves spatial dimensions (same padding). Each of the 64 filters produces one channel of the output feature map. The total parameter count is $64 \times (3 \times 3 \times 3 + 1) = 1{,}792$, tiny compared to a fully connected equivalent.
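The parameter count can be confirmed by asking the layer itself. The sketch below also estimates the "fully connected equivalent" under one possible reading, stated here as an assumption: a dense layer mapping the flattened $3 \times 224 \times 224$ input to the flattened $64 \times 224 \times 224$ output. That number is computed by arithmetic only, since a layer of that size would not fit in memory.

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# Weight tensor is (out_channels, in_channels, kernel_h, kernel_w); bias is (out_channels,)
weight_shape = tuple(conv.weight.shape)
bias_shape = tuple(conv.bias.shape)
num_params = sum(p.numel() for p in conv.parameters())

print(weight_shape)  # (64, 3, 3, 3)
print(bias_shape)    # (64,)
print(num_params)    # 1792

# Assumed dense equivalent: one weight per (input value, output value) pair
dense_weights = (3 * 224 * 224) * (64 * 224 * 224)
print(f"dense weights: {dense_weights:,}")
print(f"ratio: {dense_weights / num_params:.2e}")
```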