Backpropagation in the Pooling (Subsampling) Layer of a CNN

Introduction

Pooling layers reduce the spatial size of a feature map and have no trainable weights, so their backward pass differs from that of a convolution layer. Backpropagation through pooling amounts to routing the upstream gradient back to the correct input locations, and the exact rule depends on whether the layer uses max pooling or average pooling.

Pooling changes shape, not learned parameters

A pooling layer takes a local window such as 2 x 2 and reduces it to one value. Common choices are:

  • max pooling: keep the largest value
  • average pooling: keep the mean of the window

Because there are no learned filters, the pooling layer does not need weight gradients. It only needs to compute the gradient of the loss with respect to its input activations.

That is why the backward rule is simpler than the one for convolution. It still matters, because every layer before the pooling layer receives its gradient only through it.

Backpropagation through max pooling

In max pooling, only the input element that won the max operation receives the upstream gradient. Every other value in that window gets zero.

Suppose the forward window is:

```python
import numpy as np

x = np.array([[1.0, 3.0],
              [2.0, 0.5]])

pooled = np.max(x)
print(pooled)
```

The max is 3.0, located at position (0, 1). If the upstream gradient for this pooled output is 2.5, then the backward gradient for the input window is:

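```python
# Continues from the previous snippet (reuses np and x)
grad_out = 2.5  # upstream gradient for the pooled output
grad_x = np.zeros_like(x)

# Locate the winning element and route the whole gradient to it
max_index = np.unravel_index(np.argmax(x), x.shape)
grad_x[max_index] = grad_out

print(grad_x)
```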

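The result is zero everywhere except position (0, 1), which receives the full 2.5. If several elements tie for the maximum, np.argmax returns the first one, so this snippet still sends the gradient to a single location.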

In actual implementations, frameworks store either the argmax indices or a mask from the forward pass so the backward step knows where to send the gradient.
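The same idea can be extended from one window to a whole feature map. The following is a minimal sketch, assuming a single 2D channel, non-overlapping windows (stride equal to the window size), and input dimensions divisible by the window size. The function names maxpool2d_forward and maxpool2d_backward are illustrative, not taken from any framework.

```python
import numpy as np

def maxpool2d_forward(x, k=2):
    """Non-overlapping k x k max pooling on a 2D array (stride = k)."""
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "sketch assumes H and W divisible by k"
    out = np.zeros((h // k, w // k), dtype=x.dtype)
    # argmax[i, j] is the flat index (0 .. k*k-1) of the winner in window (i, j)
    argmax = np.zeros((h // k, w // k), dtype=np.int64)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            argmax[i, j] = np.argmax(window)
            out[i, j] = window.flat[argmax[i, j]]
    return out, argmax

def maxpool2d_backward(grad_out, argmax, x_shape, k=2):
    """Send each upstream gradient to the position saved in argmax."""
    grad_x = np.zeros(x_shape, dtype=grad_out.dtype)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            di, dj = divmod(int(argmax[i, j]), k)  # flat index -> (row, col)
            grad_x[i * k + di, j * k + dj] = grad_out[i, j]
    return grad_x

x = np.array([[1.0, 3.0, 0.0, 2.0],
              [2.0, 0.5, 4.0, 1.0],
              [5.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0, 6.0]])
out, argmax = maxpool2d_forward(x)

grad_out = np.array([[2.5, 1.0],
                     [-1.0, 0.5]])
grad_x = maxpool2d_backward(grad_out, argmax, x.shape)
print(out)
print(grad_x)
```

Each of the four upstream gradients lands on exactly one input position, and the remaining twelve entries of grad_x stay zero. Real frameworks vectorize this and handle batches and channels, but the bookkeeping is the same.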

Backpropagation through average pooling

Average pooling is different because the output depends equally on every element in the pooling window. So the upstream gradient is distributed evenly across the inputs in that region.

Example:

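```python
import numpy as np

x = np.array([[1.0, 3.0],
              [2.0, 0.5]])

grad_out = 2.0  # upstream gradient for the pooled output
# Every element of the window gets an equal share of the gradient
grad_x = np.full_like(x, grad_out / x.size)

print(grad_x)
```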

Because the 2 x 2 window has four elements, each one receives 2.0 / 4 = 0.5.

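This matches the derivative of the mean: each of the n inputs in the window contributes 1/n to the output, so each input receives the upstream gradient multiplied by 1/n.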
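For a full feature map, the even split can be written without loops. This is a minimal sketch, assuming one 2D channel, non-overlapping k x k windows, and dimensions divisible by k. The names avgpool2d_forward and avgpool2d_backward are illustrative.

```python
import numpy as np

def avgpool2d_forward(x, k=2):
    """Non-overlapping k x k average pooling on a 2D array."""
    h, w = x.shape
    # View the map as (H/k, k, W/k, k) blocks and average inside each block
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def avgpool2d_backward(grad_out, k=2):
    """Spread each upstream gradient evenly over its k x k window."""
    # Repeat every gradient k times along both axes, then scale by 1 / (k * k)
    upsampled = np.repeat(np.repeat(grad_out, k, axis=0), k, axis=1)
    return upsampled / (k * k)

x = np.arange(16, dtype=np.float64).reshape(4, 4)
out = avgpool2d_forward(x)

grad_out = np.array([[2.0, 4.0],
                     [-8.0, 0.0]])
grad_x = avgpool2d_backward(grad_out)
print(out)
print(grad_x)
```

Unlike max pooling, the backward pass here does not need the input values at all, only the window size, because the derivative of the mean is the same constant for every element.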

Stride and overlapping windows matter

If pooling windows do not overlap, each input activation belongs to at most one output window, so it receives gradient from at most one pooled output and nothing needs to be accumulated.

If the pooling windows overlap, an input activation may influence more than one pooled output. Then its input gradient becomes the sum of the contributions from all relevant output windows.

That summation behavior is exactly the same principle as the rest of backpropagation: if multiple downstream paths depend on the same value, their gradients add together.
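The accumulation can be seen in a small sketch of max pooling with a 2 x 2 window and stride 1 on a 3 x 3 input, where all four windows share the center element. The function name maxpool2d_backward_strided is illustrative, and for brevity the sketch recomputes each argmax from the saved input instead of storing indices during the forward pass.

```python
import numpy as np

def maxpool2d_backward_strided(x, grad_out, k=2, stride=1):
    """Max-pooling backward pass that supports overlapping windows."""
    grad_x = np.zeros_like(x)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            r, c = i * stride, j * stride
            window = x[r:r + k, c:c + k]
            di, dj = np.unravel_index(np.argmax(window), window.shape)
            # += rather than =: an input shared by several windows
            # sums the gradients coming from each of them
            grad_x[r + di, c + dj] += grad_out[i, j]
    return grad_x

x = np.array([[1.0, 2.0, 0.0],
              [0.0, 9.0, 3.0],
              [4.0, 1.0, 5.0]])
grad_out = np.array([[1.0, 2.0],
                     [3.0, 4.0]])

grad_x = maxpool2d_backward_strided(x, grad_out)
print(grad_x)
```

The center value 9.0 wins in all four windows, so its gradient is 1 + 2 + 3 + 4 = 10 and every other entry is zero. Using plain assignment instead of += would silently keep only the last contribution.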

Why max pooling needs forward-pass bookkeeping

With max pooling, you cannot reconstruct the winning location from the pooled output alone. Many different input windows could produce the same max value. So during the forward pass, implementations usually record the argmax index or a mask.

Conceptually:

  1. forward pass finds the maximum
  2. forward pass records where the maximum came from
  3. backward pass sends the upstream gradient back to that position

Without that saved location, the layer would not know which input element should receive the gradient.
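A short sketch makes the ambiguity concrete: two different windows produce the same pooled value but need their gradient in different places. The helper max_mask is a hypothetical name for the kind of one-hot mask a forward pass might save.

```python
import numpy as np

a = np.array([[3.0, 1.0],
              [0.0, 2.0]])
b = np.array([[0.0, 2.0],
              [1.0, 3.0]])

# Same pooled value, so the output alone cannot tell the two windows apart
assert a.max() == b.max() == 3.0

def max_mask(window):
    """One-hot mask marking the (first) maximum of the window."""
    mask = np.zeros_like(window)
    mask[np.unravel_index(np.argmax(window), window.shape)] = 1.0
    return mask

grad_out = 2.5
grad_a = max_mask(a) * grad_out  # gradient goes to position (0, 0)
grad_b = max_mask(b) * grad_out  # gradient goes to position (1, 1)
print(grad_a)
print(grad_b)
```

Both windows pool to 3.0, yet the gradient belongs at (0, 0) for the first and at (1, 1) for the second, which is why the location has to be recorded while the input is still available.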

Common Pitfalls

The most common mistake is assuming pooling has trainable parameters and therefore needs weight updates. Pooling layers usually do not have weights at all.

Another common issue is thinking max-pooling gradients should be shared among all inputs in the window. That is true for average pooling, not for max pooling.

People also forget that overlapping pooling windows cause gradients to accumulate at shared input positions.

Finally, if you implement the layer yourself, you must save the argmax information during the forward pass for max pooling. Recomputing it later from pooled outputs alone is not reliable.
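One way to catch these mistakes in a hand-written layer is to compare its backward pass with a finite-difference estimate. The following is a minimal sketch under several assumptions: a single 2D channel, non-overlapping 2 x 2 windows, and an input with distinct values so that no ties occur. The function names are illustrative, and the mask is built from the saved input together with the pooled output, not from the pooled output alone.

```python
import numpy as np

def maxpool_forward(x, k=2):
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def upsample(a, k=2):
    # Repeat each entry over its k x k window
    return np.repeat(np.repeat(a, k, axis=0), k, axis=1)

def maxpool_backward(x, grad_out, k=2):
    # Mask is True where the input equals the max of its own window
    mask = (x == upsample(maxpool_forward(x, k), k))
    return mask * upsample(grad_out, k)

rng = np.random.default_rng(0)
x = rng.permutation(16).astype(np.float64).reshape(4, 4)  # distinct values
grad_out = rng.standard_normal((2, 2))

analytic = maxpool_backward(x, grad_out)

# Central differences on the scalar loss L = sum(grad_out * pool(x)),
# chosen so that dL/dpool equals grad_out
eps = 1e-3
numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += eps
    xm[idx] -= eps
    diff = maxpool_forward(xp) - maxpool_forward(xm)
    numeric[idx] = np.sum(grad_out * diff) / (2 * eps)

print(np.allclose(analytic, numeric))
```

If the backward pass shared the gradient across the window or dropped contributions, the two arrays would disagree. Note that the equality-based mask would give the full gradient to every tied element, so this sketch is only meant for inputs without ties.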

Summary

  • Pooling layers usually have no trainable weights, only input gradients to compute.
  • Max pooling routes the upstream gradient to the input element that won the max.
  • Average pooling distributes the upstream gradient evenly across the window.
  • Overlapping windows cause gradient contributions to add together.
  • Max pooling backward passes rely on argmax or mask information saved during the forward pass.
