Layer Normalization in PyTorch

Introduction

Layer normalization is a normalization technique that operates across feature dimensions for each sample independently. In PyTorch it is implemented by torch.nn.LayerNorm, and it is especially common in transformers, language models, and other architectures where batch-dependent normalization is inconvenient or unstable.

What Layer Normalization Does

For each sample, layer normalization computes the mean and variance across the chosen feature dimensions, normalizes the activations, and then optionally applies learnable scale and bias parameters.
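
In symbols, the per-sample computation can be written as follows. The notation is an assumption for illustration: x is one sample's values over the normalized dimensions, γ and β are the learnable scale and bias, and ε is a small constant added for numerical stability (the eps argument of nn.LayerNorm, 1e-5 by default).

```latex
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta
```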

The practical effect is that each sample's representation is rescaled in a stable, predictable way without depending on the rest of the mini-batch.

That is the key difference from batch normalization.
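
To make the description above concrete, here is a minimal sketch that reproduces the computation by hand and compares it with nn.LayerNorm. The tensor sizes and the names manual and reference are illustrative choices, and the comparison assumes a freshly constructed layer whose scale and bias are still at their initial values of one and zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6)  # 4 samples, 6 features each (sizes chosen for illustration)

# Per-sample statistics over the feature dimension (the last one).
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased estimator

eps = 1e-5  # default eps of nn.LayerNorm
manual = (x - mean) / torch.sqrt(var + eps)

# A new nn.LayerNorm starts with weight = 1 and bias = 0,
# so its output should agree with the manual computation.
reference = nn.LayerNorm(6)(x)
matches = torch.allclose(manual, reference, atol=1e-5)
print(matches)
```

No value from another row enters the statistics for a given row, which is what makes the result independent of the rest of the mini-batch.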

Basic PyTorch Example

For a tensor shaped like batch_size x hidden_size, you usually normalize the last dimension.

python
import torch
import torch.nn as nn

x = torch.tensor([
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0]
])

layer_norm = nn.LayerNorm(3)
y = layer_norm(x)

print(y)
print("Row means:", y.mean(dim=1))
print("Row variances:", y.var(dim=1, unbiased=False))

Each row is normalized independently. One row does not affect the other.
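
The independence claim can be checked directly. The sketch below changes the second row to arbitrary values (chosen only for illustration) and confirms that the normalized first row stays the same.

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(3)

x = torch.tensor([
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
])
y = layer_norm(x)

# Replace the second row with very different values.
x_changed = x.clone()
x_changed[1] = torch.tensor([-5.0, 0.0, 500.0])
y_changed = layer_norm(x_changed)

# The first row's output does not depend on the second row.
first_row_unchanged = torch.allclose(y[0], y_changed[0])
print(first_row_unchanged)

# Both original rows normalize to roughly the same values,
# because the second row is just the first row scaled by 10.
rows_alike = torch.allclose(y[0], y[1], atol=1e-4)
print(rows_alike)
```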

Understanding normalized_shape

The most important argument is normalized_shape. It should match the trailing dimensions you want to normalize.

Examples:

  • shape batch x hidden uses nn.LayerNorm(hidden)
  • shape batch x sequence x hidden often still uses nn.LayerNorm(hidden)
  • shape batch x channels x height x width can normalize several trailing dimensions at once, for example nn.LayerNorm([channels, height, width]), if that is what you want

Sequence example:

python
import torch
import torch.nn as nn

x = torch.randn(2, 4, 8)  # batch, sequence, hidden
norm = nn.LayerNorm(8)
y = norm(x)

print(x.shape)
print(y.shape)

The tensor shape stays the same. Layer normalization changes the values, not the structure.
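
For the four-dimensional case in the list above, a sketch of normalizing over several trailing dimensions might look like this. The sizes (3 channels, 4 x 5 spatial) are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 4, 5)  # batch, channels, height, width

# Normalize over channels, height, and width together for each sample.
norm = nn.LayerNorm([3, 4, 5])
y = norm(x)

# One mean per sample, taken over all 3 * 4 * 5 = 60 values.
per_sample_mean = y.mean(dim=(1, 2, 3))

print(y.shape)
print(per_sample_mean)    # both entries should be close to zero
print(norm.weight.shape)  # the affine weight has the same shape as normalized_shape
```

Note that the learnable scale and bias take the shape of normalized_shape, so normalizing over more dimensions also means more affine parameters.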

LayerNorm Inside a Model

A common use case is a residual block or transformer-style sublayer.

python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, hidden_size * 2)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(hidden_size * 2, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        return self.norm(x + residual)


x = torch.randn(3, 8)
model = FeedForwardBlock(hidden_size=8)
output = model(x)
print(output.shape)

This pattern is not a full transformer block, but it demonstrates the common idea: transform, add a residual connection, then normalize.
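
Because nn.Linear and nn.LayerNorm both act on the last dimension, the same block should also accept sequence-shaped input. The sketch below repeats the block definition so that it runs on its own; the input sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    # Same block as above, repeated so this snippet is self-contained.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear1 = nn.Linear(hidden_size, hidden_size * 2)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(hidden_size * 2, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        transformed = self.linear2(self.activation(self.linear1(x)))
        return self.norm(transformed + x)  # residual add, then normalize

torch.manual_seed(0)
block = FeedForwardBlock(hidden_size=8)

tokens = torch.randn(2, 5, 8)  # batch, sequence, hidden
out = block(tokens)
print(out.shape)

# Each token is handled independently: running the block on the first
# token alone gives the same result as slicing it out of the full output.
first_token_only = block(tokens[:, :1, :])
same = torch.allclose(out[:, :1, :], first_token_only, atol=1e-5)
print(same)
```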

LayerNorm Versus BatchNorm

Batch normalization uses statistics across the mini-batch. Layer normalization uses statistics from each sample independently.

That makes layer normalization attractive when:

  • batch size is small
  • sequence lengths vary
  • training and inference should behave the same (layer normalization keeps no running statistics)
  • the architecture is token- or embedding-centric

This is one reason transformers rely heavily on layer normalization rather than batch normalization.
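
The batch dependence can be seen in a small experiment. This is a sketch under stated assumptions: nn.BatchNorm1d is left in its default training mode, and the companion samples (others_a, others_b) are random values made up for the comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = 4
layer_norm = nn.LayerNorm(features)
batch_norm = nn.BatchNorm1d(features)  # training mode by default

sample = torch.randn(1, features)
others_a = torch.randn(3, features)
others_b = torch.randn(3, features) * 10 + 5  # very different batch statistics

# The same sample placed in two different mini-batches.
batch_a = torch.cat([sample, others_a])
batch_b = torch.cat([sample, others_b])

ln_same = torch.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0], atol=1e-6)
bn_same = torch.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0], atol=1e-6)

print("LayerNorm output identical across batches:", ln_same)
print("BatchNorm output identical across batches:", bn_same)
```

The layer-normalized sample is the same in both batches, while the batch-normalized one changes with its neighbors.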

Affine Parameters

By default, nn.LayerNorm includes learnable affine parameters. That means after normalization, the model can still learn a useful rescaling and shift.

python
import torch.nn as nn

norm = nn.LayerNorm(8, elementwise_affine=True)

This is usually what you want. Setting elementwise_affine=False removes the learnable weight and bias, so the layer only normalizes and cannot learn a per-feature scale or shift.
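
A quick way to see the difference is to inspect the parameters of both variants. The variable names here are illustrative.

```python
import torch
import torch.nn as nn

with_affine = nn.LayerNorm(8)  # elementwise_affine=True is the default
without_affine = nn.LayerNorm(8, elementwise_affine=False)

# With affine parameters: one weight and one bias per normalized feature,
# initialized to ones and zeros respectively.
print(with_affine.weight.shape, with_affine.bias.shape)

n_with = sum(p.numel() for p in with_affine.parameters())
n_without = sum(p.numel() for p in without_affine.parameters())
print(n_with, n_without)
```

At initialization both layers produce the same output, since a weight of one and a bias of zero leave the normalized values unchanged.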

Common Pitfalls

The biggest mistake is passing the wrong normalized_shape. It must match the trailing dimensions that should be normalized, not the batch dimension.
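
As a sketch of what this mistake looks like in practice, the snippet below passes the batch size instead of the hidden size. The sizes are illustrative; when normalized_shape does not match the trailing dimensions, the call fails with a RuntimeError.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 8)  # batch of 2, hidden size 8

wrong = nn.LayerNorm(2)  # batch size passed by mistake
try:
    wrong(x)
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

right = nn.LayerNorm(8)  # matches the trailing (hidden) dimension
print(right(x).shape)
```

Be aware that no error is raised when the wrong value happens to equal the size of the last dimension, so a shape check alone does not prove that the intended dimensions are being normalized.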

Another issue is expecting layer normalization to behave like batch normalization. It does not aggregate statistics across different samples in the batch.

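People also often check the variance with the default unbiased estimator of torch.var and then think the layer is wrong. nn.LayerNorm computes variance with the biased estimator, so pass unbiased=False when sanity-checking the output; otherwise the result is inflated by a factor of n / (n - 1), where n is the number of normalized elements.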
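
The effect is easy to reproduce with a three-element row, where the two estimators differ by a factor of 3 / 2. This is a minimal sketch using the same values as the first example.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0]])
y = nn.LayerNorm(3)(x)

biased = y.var(dim=-1, unbiased=False)  # matches what the layer normalizes to
unbiased = y.var(dim=-1)                # torch.var default: unbiased estimator

print(biased)    # close to 1.0 (slightly below because of eps)
print(unbiased)  # close to 1.5
```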

Finally, layer normalization improves training stability, but it will not rescue a bad optimizer setup, broken tensor shapes, or a fundamentally poor model design.

Summary

  • nn.LayerNorm normalizes each sample across chosen feature dimensions.
  • It is especially common in transformers and small-batch sequence models.
  • normalized_shape is the key argument and usually matches the trailing feature dimensions.
  • Layer normalization differs from batch normalization because it does not depend on other samples in the batch.
  • It helps stability, but it still needs a sound model and training setup around it.
