Difference between TensorFlow's tf.keras.layers.Dense and PyTorch's torch.nn.Linear?

Introduction

tf.keras.layers.Dense and torch.nn.Linear represent the same basic mathematical building block: an affine transformation from input features to output features. The differences that matter in practice come from framework conventions, especially activation handling, parameter layout, layer construction, and how models are assembled around the layer.

Core Sections

The shared operation is the same

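Both layers compute an output of the form y = xW + b, where x has shape (batch_size, in_features), W is a learned weight matrix that maps in_features to out_features, and b is an optional bias vector of length out_features. PyTorch documents the same operation as y = xW^T + b because it stores the weight matrix transposed.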

In TensorFlow:

python
import tensorflow as tf

# Only the output size is given; in_features (3) is inferred from the input.
layer = tf.keras.layers.Dense(4, use_bias=True)
x = tf.constant([[1.0, 2.0, 3.0]])
y = layer(x)
print(y.shape)  # (1, 4)

In PyTorch:

python
import torch

# Both in_features (3) and out_features (4) are given up front.
layer = torch.nn.Linear(3, 4, bias=True)
x = torch.tensor([[1.0, 2.0, 3.0]])
y = layer(x)
print(y.shape)  # torch.Size([1, 4])

In both cases, the result has shape batch_size x out_features. So the short answer is that they are functionally equivalent layers in different ecosystems.
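To make the shared operation concrete without depending on either framework, here is a minimal plain-Python sketch of the affine map applied to a batch. The helper name `affine` and the weight values are illustrative assumptions, not part of either library.

```python
def affine(x, w, b):
    """Compute x @ w + b for a batch.

    x: list of rows, shape (batch_size, in_features)
    w: list of rows, shape (in_features, out_features)
    b: list of length out_features
    """
    out_features = len(b)
    return [
        [sum(row[i] * w[i][j] for i in range(len(row))) + b[j]
         for j in range(out_features)]
        for row in x
    ]


# Illustrative values: 3 input features, 4 output features.
x = [[1.0, 2.0, 3.0]]
w = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
b = [0.5, 0.5, 0.5, 0.5]

y = affine(x, w, b)
print(len(y), len(y[0]))  # batch_size, out_features
print(y)
```

Both Dense and Linear perform this computation on the last axis of the input; they differ only in how W is stored and configured.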

Keras can bundle activation, PyTorch usually keeps it separate

One of the biggest practical differences is that Keras Dense can include an activation directly in the layer definition.

python
layer = tf.keras.layers.Dense(4, activation="relu")

PyTorch Linear only performs the affine transform. Activation is normally added explicitly as another module or function.

python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(3, 4),
    torch.nn.ReLU(),  # activation is its own module
)

Neither design is more correct. Keras optimizes for concise model definitions, while PyTorch emphasizes explicit composition.

Input shape handling is more implicit in Keras

Keras often delays weight creation until the layer first sees input, or until the model provides an input shape earlier in the graph.

python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),  # supplies in_features for the Dense layer
    tf.keras.layers.Dense(4),
])

PyTorch requires in_features and out_features when you create the layer.

python
model = torch.nn.Sequential(
    torch.nn.Linear(3, 4)
)

This difference influences debugging. PyTorch forces dimension choices earlier, while Keras can infer some of them later. That convenience is useful, but it can also hide dimension mistakes until build time.
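The trade-off can be illustrated with a framework-free sketch. The two classes below are hypothetical stand-ins, not the real Keras or PyTorch implementations: `EagerLinear` fixes both dimensions at construction, while `LazyDense` allocates its weights on the first call.

```python
class EagerLinear:
    """Hypothetical stand-in for a layer that needs both sizes up front."""

    def __init__(self, in_features, out_features):
        self.in_features = in_features
        # Weights exist immediately, stored as (in_features, out_features).
        self.weight = [[0.0] * out_features for _ in range(in_features)]

    def __call__(self, row):
        if len(row) != self.in_features:
            raise ValueError(
                f"expected {self.in_features} features, got {len(row)}"
            )
        return [sum(v * w_i[j] for v, w_i in zip(row, self.weight))
                for j in range(len(self.weight[0]))]


class LazyDense:
    """Hypothetical stand-in for a layer that infers in_features on first use."""

    def __init__(self, units):
        self.units = units
        self.weight = None  # not built yet

    def __call__(self, row):
        if self.weight is None:
            # Build step: whatever width arrives first becomes in_features.
            self.weight = [[0.0] * self.units for _ in range(len(row))]
        if len(row) != len(self.weight):
            raise ValueError(
                f"expected {len(self.weight)} features, got {len(row)}"
            )
        return [sum(v * w_i[j] for v, w_i in zip(row, self.weight))
                for j in range(self.units)]


eager = EagerLinear(3, 4)
lazy = LazyDense(4)

wrong_input = [1.0, 2.0]  # 2 features where 3 were intended

try:
    eager(wrong_input)
    eager_error = None
except ValueError as exc:
    eager_error = str(exc)

# Accepted without complaint: the layer builds itself for 2 features.
lazy_output = lazy(wrong_input)
print(eager_error)
print(len(lazy.weight), len(lazy_output))
```

In this sketch the eager layer rejects the malformed input immediately, while the lazy layer silently adopts the wrong width. Declaring an explicit Input shape in Keras narrows that gap. PyTorch also ships torch.nn.LazyLinear, which defers in_features in a similar way.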

The stored weight shapes are transposed between frameworks

This is the most important detail when porting weights manually.

In Keras, the kernel shape is typically:

  • (in_features, out_features)

In PyTorch, the weight tensor shape is:

  • (out_features, in_features)

That means weight transfer usually needs a transpose.

python
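# TensorFlow or Keras: keras_layer is assumed to be an already built Dense layer
kernel, bias = keras_layer.get_weights()
print(kernel.shape)  # (in_features, out_features)

python
# PyTorch: torch_layer is assumed to be an existing torch.nn.Linear
print(torch_layer.weight.shape)  # (out_features, in_features)
print(torch_layer.bias.shape)    # (out_features,)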

If you copy the values directly without transposing, the converted model will either fail shape checks or produce incorrect results.
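The silent failure case is worth seeing. When in_features equals out_features, an untransposed copy passes every shape check and still gives wrong outputs. The sketch below uses plain Python lists in place of real layers; the function names and values are illustrative assumptions.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def keras_style_forward(row, kernel, bias):
    """y = x @ kernel + b, with kernel stored as (in_features, out_features)."""
    return [sum(row[i] * kernel[i][j] for i in range(len(row))) + bias[j]
            for j in range(len(bias))]


def torch_style_forward(row, weight, bias):
    """y = x @ weight.T + b, with weight stored as (out_features, in_features)."""
    return [sum(row[i] * weight[j][i] for i in range(len(row))) + bias[j]
            for j in range(len(bias))]


# Illustrative square kernel: in_features == out_features == 2.
kernel = [[1.0, 2.0],
          [3.0, 4.0]]
bias = [0.0, 0.0]
x = [1.0, 1.0]

reference = keras_style_forward(x, kernel, bias)
ported = torch_style_forward(x, transpose(kernel), bias)  # correct port
naive = torch_style_forward(x, kernel, bias)              # copied as-is

print(reference, ported, naive)
```

Here `ported` matches `reference`, while `naive` does not, even though no shape error is raised. With real layers the same idea applies: transpose the Keras kernel (for example `kernel.T` on the NumPy array returned by `get_weights()`) before copying it into `torch_layer.weight`. The bias vector needs no change.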

Bias and initialization are similar but not identical in style

Both frameworks include bias by default: Keras uses use_bias=True and PyTorch uses bias=True. The default initializers differ, though. Keras Dense defaults to a Glorot uniform kernel and a zero bias, while PyTorch Linear draws both weight and bias from a uniform distribution bounded by 1/sqrt(in_features). Two freshly created layers with the same input and output dimensions will therefore not produce numerically similar outputs until you align initialization.

This matters when comparing framework behavior. If two models disagree at startup, the layer type itself may not be the issue; initialization, dtype, random seeds, or activation placement may be the real cause.
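A quick calculation shows how far apart the starting weights can be. The sketch below assumes the documented defaults (Glorot uniform for the Keras kernel, a uniform range of 1/sqrt(in_features) for the PyTorch weight); verify them against the versions you have installed. The helper names are hypothetical.

```python
import math


def glorot_uniform_limit(fan_in, fan_out):
    """Bound of the Glorot (Xavier) uniform distribution."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def torch_linear_bound(fan_in):
    """Assumed bound of torch.nn.Linear's default init: 1 / sqrt(fan_in)."""
    return 1.0 / math.sqrt(fan_in)


fan_in, fan_out = 8, 16
keras_limit = glorot_uniform_limit(fan_in, fan_out)
torch_limit = torch_linear_bound(fan_in)

print(f"Keras kernel range:   +/-{keras_limit:.4f}")
print(f"PyTorch weight range: +/-{torch_limit:.4f}")
```

For an 8-to-16 layer the two ranges differ noticeably, so matching seeds alone will not make the outputs agree. Copying weights from one model to the other is the more reliable way to compare behavior.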

The bigger difference is the surrounding training model

Dense in Keras is part of a high-level model system that integrates tightly with Model, Sequential, compile, and fit.

python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
# compile wires up the optimizer and loss; fit then runs the training loop.
model.compile(optimizer="adam", loss="mse")

PyTorch Linear usually lives inside a more explicit training loop.

python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)

# The optimizer and loss are plain objects that your own loop calls.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

So although the layers are conceptually the same, the surrounding framework makes them feel different to use.
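What fit automates and a hand-written PyTorch loop spells out is the same cycle: forward pass, loss, gradients, parameter update. The framework-free sketch below walks through that cycle for a one-input, one-output linear model with mean squared error. The data, learning rate, and variable names are illustrative assumptions, the gradients are derived by hand instead of by autograd, and the update is plain gradient descent, not Adam.

```python
# Toy data generated from y = 2x + 1 (illustrative values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0   # parameters of a 1-in, 1-out linear layer
lr = 0.05         # learning rate (assumed)
n = len(xs)

losses = []
for epoch in range(2000):
    # Forward pass: prediction = x * w + b
    preds = [x * w + b for x in xs]
    # Loss: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
    losses.append(loss)
    # Backward pass: gradients of the MSE with respect to w and b
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
    # Update step: plain gradient descent
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3), losses[-1] < losses[0])
```

In PyTorch each commented step is a line you write yourself inside the loop; in Keras the same steps run inside fit.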

Common Pitfalls

  • Forgetting that Keras can include activation inside Dense while PyTorch requires a separate activation often causes mismatched model definitions.
  • Transferring weights between frameworks without transposing the matrix is a common source of incorrect outputs.
  • Comparing the two layers without aligning initializers, seeds, or dtype can make equivalent models look inconsistent.
  • Assuming Keras and PyTorch infer input dimensions the same way leads to avoidable shape errors during porting.
  • Focusing on layer names instead of the actual affine operation can obscure the fact that the underlying math is the same.

Summary

  • tf.keras.layers.Dense and torch.nn.Linear perform the same core affine transformation.
  • Keras can attach activation inside the layer, while PyTorch generally composes activation separately.
  • Keras often infers input shape later, while PyTorch asks for in_features up front.
  • Weight tensor layout differs, so manual porting usually requires a transpose.
  • Most differences are framework-design choices rather than differences in model capability.
