Check the total number of parameters in a PyTorch model

Introduction

Counting parameters in a PyTorch model is useful for debugging, reporting model size, and estimating memory and compute cost. The core operation is simple because every learnable weight tensor is exposed through the module API.

The details matter when you want more than one number. In practice, you usually care about total parameters, trainable parameters, and sometimes a per-layer breakdown.

Total Parameters in One Line

Every nn.Module exposes its parameters through model.parameters(). Each parameter tensor knows how many scalar values it contains through numel().

python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyNet()
# numel() is the number of scalar values in each parameter tensor.
total_params = sum(p.numel() for p in model.parameters())
print(total_params)  # 212

For this model, the count includes both weights and biases: 144 for fc1 plus 68 for fc2, 212 in total.

Count Only Trainable Parameters

Some tensors may be frozen by setting requires_grad to False. If you are fine-tuning a model, this distinction matters more than the raw total.

python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyNet()
# Freeze the first layer so its weight and bias receive no gradients.
for param in model.fc1.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print("total:", total)          # total: 212
print("trainable:", trainable)  # trainable: 68

This is the number most papers and training logs mean when they say “trainable parameters.”

How Parameter Counts Are Derived

It helps to know what the numbers mean.

For a linear layer nn.Linear(in_features, out_features):

  • weight parameters are in_features * out_features
  • bias parameters are out_features when bias is enabled

So nn.Linear(8, 16) has 8 * 16 + 16 = 144 parameters.

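For a 2D convolution such as nn.Conv2d with the default groups=1, the count is:

  • out_channels * in_channels * kernel_height * kernel_width weight parameters
  • plus out_channels if bias is enabled

With groups greater than 1, each filter sees only in_channels / groups input channels, so the weight count is divided by groups.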

Knowing the formula lets you sanity-check large models quickly.
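The formulas above can be checked against what PyTorch actually allocates. The sketch below is only an illustration: the helper name count and the chosen layer sizes are assumptions, not part of any PyTorch API.

```python
import torch.nn as nn


def count(module: nn.Module) -> int:
    """Total number of scalar values across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())


# Linear: in_features * out_features weights, plus out_features biases.
linear = nn.Linear(8, 16)
linear_expected = 8 * 16 + 16

# Conv2d (groups=1): out_channels * in_channels * kH * kW weights,
# plus out_channels biases.
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5)
conv_expected = 32 * 3 * 5 * 5 + 32

print(count(linear), linear_expected)  # 144 144
print(count(conv), conv_expected)      # 2432 2432
```

If the hand-computed number and the measured number disagree, the usual suspects are a disabled bias, a non-default groups value, or a layer with a different shape than you assumed.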

When totals look wrong, inspect named parameters.

python
for name, param in model.named_parameters():
    print(name, param.shape, param.numel(), param.requires_grad)

A slightly nicer summary prints an aligned table with one row per parameter:

python
summary = []
for name, param in model.named_parameters():
    summary.append((name, tuple(param.shape), param.numel()))

for name, shape, count in summary:
    print(f"{name:20} {str(shape):18} {count}")

This is useful for catching mistakes such as accidentally duplicated heads, unexpectedly large embeddings, or layers that were supposed to be frozen but are still trainable.
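When a per-parameter listing is too fine-grained, one option is to aggregate counts per direct submodule. The following is a minimal sketch, assuming the TinyNet model from earlier; params_per_child is a hypothetical helper name, not a PyTorch function.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


def params_per_child(model: nn.Module) -> dict:
    """Map each direct submodule name to (total, trainable) parameter counts."""
    result = {}
    for name, child in model.named_children():
        total = sum(p.numel() for p in child.parameters())
        trainable = sum(p.numel() for p in child.parameters() if p.requires_grad)
        result[name] = (total, trainable)
    return result


model = TinyNet()
model.fc1.requires_grad_(False)  # freeze the first layer

per_child = params_per_child(model)
for name, (total, trainable) in per_child.items():
    print(f"{name:10} total={total:6d} trainable={trainable:6d}")
```

Two caveats with this approach: parameters registered directly on the top-level module are not attributed to any child, and a tensor shared between two children is counted under each of them, so the per-child numbers need not add up to the model total.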

Parameters Versus Buffers

Not every tensor inside a model is a parameter. PyTorch also has buffers, such as BatchNorm running statistics, that are part of the model state but are not optimized by gradient descent.

That means:

  • model.parameters() returns learnable parameter tensors
  • model.buffers() returns registered non-parameter state
  • model.state_dict() contains both parameters and buffers

If you compare a parameter count with the size of state_dict(), the numbers will not always match. That is normal.
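The gap between the two numbers can be seen on a small model that contains a BatchNorm layer. This is a sketch under the assumption of a default nn.BatchNorm1d, which registers running_mean, running_var, and num_batches_tracked as buffers; the layer sizes are arbitrary.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.Linear(16, 4),
)

# Learnable tensors: Linear weights and biases, BatchNorm scale and shift.
n_params = sum(p.numel() for p in model.parameters())

# Non-learnable state: running mean, running variance, batch counter.
n_buffers = sum(b.numel() for b in model.buffers())

# state_dict() holds both kinds of tensors.
n_state = sum(t.numel() for t in model.state_dict().values())

print(n_params, n_buffers, n_state)  # 244 33 277
```

Here the state_dict total exceeds the parameter count by exactly the number of buffer elements.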

Why the Count Matters

Parameter count is not the same as accuracy, but it is still a strong operational signal.

A larger model usually means:

  • more memory for weights and optimizer state
  • more compute per forward and backward pass
  • higher risk of overfitting on small datasets

A smaller model is often easier to deploy, especially on mobile or edge hardware. That is why model reports frequently include parameter counts alongside latency and accuracy.
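To make the point about optimizer state concrete, the sketch below counts the extra elements that Adam keeps after one step. It assumes default torch.optim.Adam settings, under which each parameter gets an exp_avg and an exp_avg_sq tensor of the same shape; other optimizers and options store different amounts of state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Adam allocates its state lazily, so run one training step first.
loss = model(torch.randn(2, 8)).sum()
loss.backward()
optimizer.step()

n_params = sum(p.numel() for p in model.parameters())

# Count only the two moment estimates stored per parameter.
n_adam_state = sum(
    state["exp_avg"].numel() + state["exp_avg_sq"].numel()
    for state in optimizer.state.values()
)

print(n_params, n_adam_state)  # 212 424
```

In this setup the optimizer roughly doubles the parameter count in additional state, which is why training memory grows faster than the parameter total alone suggests.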

Common Pitfalls

A common mistake is counting only trainable parameters and then comparing that number with a paper that reports total parameters. Make sure you compare like with like.

Another mistake is forgetting that shared parameters are counted once in model.parameters(). If two modules reference the same tensor, PyTorch does not duplicate it in the iterator.
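Weight tying is a typical case where this shows up. The sketch below uses a hypothetical TiedModel in which the output projection reuses the embedding matrix; the class name and sizes are assumptions chosen for illustration.

```python
import torch.nn as nn


class TiedModel(nn.Module):
    def __init__(self, vocab_size: int = 100, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size, bias=False)
        # Both modules now reference the same (vocab_size, dim) tensor.
        self.out.weight = self.embed.weight

    def forward(self, tokens):
        return self.out(self.embed(tokens))


model = TiedModel()

# model.parameters() yields the shared tensor only once.
counted_once = sum(p.numel() for p in model.parameters())

# Summing per child visits the shared tensor twice.
counted_per_module = sum(
    p.numel() for child in model.children() for p in child.parameters()
)

print(counted_once, counted_per_module)  # 1600 3200
```

A formula-based estimate that adds up each layer independently would give 3200 here, while the model actually stores 1600 values.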

People also confuse parameter count with model file size. File size depends on dtype, buffers, serialization overhead, and sometimes optimizer state, not just the raw parameter total.
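A rough in-memory size can be estimated by multiplying each tensor's element count by its element size. The sketch below shows how dtype and buffers change the byte total; tensor_bytes is a hypothetical helper, and the result ignores serialization overhead, so it only approximates a saved file's size.

```python
import torch
import torch.nn as nn


def tensor_bytes(tensors) -> int:
    """Sum of numel * bytes-per-element over an iterable of tensors."""
    return sum(t.numel() * t.element_size() for t in tensors)


model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.Linear(16, 4))

fp32_param_bytes = tensor_bytes(model.parameters())  # 244 values * 4 bytes
buffer_bytes = tensor_bytes(model.buffers())         # two float32 vectors + one int64 counter

model.half()  # cast floating-point parameters and buffers to float16
fp16_param_bytes = tensor_bytes(model.parameters())  # 244 values * 2 bytes

print(fp32_param_bytes, buffer_bytes, fp16_param_bytes)  # 976 136 488
```

The parameter count stays at 244 throughout, while the byte total halves with the dtype change.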

Finally, if you freeze layers during fine-tuning, re-run the count after freezing. Old totals in notebook output are easy to misread.
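One way to avoid stale numbers is to wrap the count in a small function and call it again after every change to requires_grad. The following is a minimal sketch; count_parameters is a hypothetical helper, not a built-in PyTorch function.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> dict:
    """Return total, trainable, and frozen parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {"total": total, "trainable": trainable, "frozen": total - trainable}


model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

before = count_parameters(model)
model[0].requires_grad_(False)  # freeze the first linear layer
after = count_parameters(model)

print(before)  # {'total': 212, 'trainable': 212, 'frozen': 0}
print(after)   # {'total': 212, 'trainable': 68, 'frozen': 144}
```

Logging the returned dictionary next to each training run makes it clear which configuration a reported number belongs to.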

Summary

  • Count total parameters with sum(p.numel() for p in model.parameters()).
  • Count trainable parameters by filtering on p.requires_grad.
  • Use named_parameters() when you need a per-layer breakdown.
  • Buffers are part of the model state but are not parameters.
  • Parameter count is a practical proxy for memory, compute cost, and deployment difficulty.
