Is the L1 regularization in Keras/TensorFlow really L1 regularization?

Introduction

Yes, Keras and TensorFlow implement the standard L1 penalty: a constant multiplied by the sum of the absolute values of the weights. The confusion usually comes from how that penalty interacts with gradient-based training, which does not always produce the exactly sparse solutions people associate with Lasso.

What Keras Adds to the Loss

When you apply tf.keras.regularizers.L1, Keras computes a penalty proportional to sum(abs(weights)) and adds it to the model loss.

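```python
import tensorflow as tf

# Dense layer with 3 inputs and 2 units; every kernel weight starts at 1.0.
layer = tf.keras.layers.Dense(
    2,
    use_bias=False,
    kernel_initializer="ones",
    kernel_regularizer=tf.keras.regularizers.L1(0.01),
)

x = tf.ones((1, 3))
_ = layer(x)  # calling the layer builds the kernel

# The only entry in layer.losses is the L1 penalty on the kernel.
print(layer.losses[0].numpy())
```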

Because the kernel is initialized to ones and has 3 x 2 = 6 weights, the regularization term is 0.01 * 6 = 0.06. That is exactly the L1 norm scaled by the regularization strength.

So from a loss-definition perspective, the answer is straightforward: yes, the penalty is really L1 regularization.
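One way to check this independently is to call the regularizer directly on a tensor and compare the result with a hand-computed value. The sketch below assumes the regularizer object is callable on a tensor, which is how Keras layers use it. The weight values are arbitrary and chosen only to include mixed signs and a zero.

```python
import numpy as np
import tensorflow as tf

# Arbitrary example weights (an assumption for illustration, not from a real model).
w = np.array([[0.5, -1.5], [2.0, 0.0], [-0.25, 1.0]], dtype=np.float32)
strength = 0.01

# Penalty as Keras computes it.
regularizer = tf.keras.regularizers.L1(strength)
keras_penalty = float(regularizer(tf.constant(w)))

# Penalty computed by hand: strength * sum of absolute values.
manual_penalty = strength * float(np.abs(w).sum())

print("keras: ", keras_penalty)
print("manual:", manual_penalty)
```

The two numbers should agree up to float32 rounding. Negative weights contribute their magnitude, and the zero weight contributes nothing.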

Why It May Not Look Like Classical Lasso

In many statistics texts, L1 regularization is associated with coefficients becoming exactly zero. That often happens in optimization methods designed specifically for L1 objectives, such as coordinate descent or proximal updates.

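Neural networks in Keras are usually trained with optimizers such as SGD or Adam, which follow gradients or subgradients of the full loss. The absolute value is not differentiable at zero, so the framework uses a subgradient there (TensorFlow's gradient of abs is sign, which is 0 at exactly zero). That still gives you the correct penalty, but a plain gradient step with a finite learning rate tends to overshoot zero and leave small weights hovering around it instead of landing on it. You should therefore not expect the same sparse path that a classical linear-model solver produces.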

In other words:

The objective really includes an L1 term
The optimization method may not drive weights to exact zero as aggressively as specialized Lasso solvers

That difference is about optimization behavior, not about whether the penalty is genuinely L1.
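The difference can be illustrated outside Keras with a small NumPy sketch. It minimizes the same L1-penalized least-squares objective twice: once with plain subgradient steps (roughly what SGD does with an L1 regularizer) and once with proximal steps that use soft-thresholding. The synthetic data, the `soft_threshold` helper, and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Synthetic regression problem: only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=n)

lam = 0.1   # L1 strength
lr = 0.01   # step size
w_sub = np.full(d, 0.1)   # weights trained with subgradient steps
w_prox = np.full(d, 0.1)  # weights trained with proximal steps

# Objective for both: (1 / 2n) * ||X w - y||^2 + lam * ||w||_1
for _ in range(2000):
    # Subgradient step: add lam * sign(w) to the data gradient.
    grad = X.T @ (X @ w_sub - y) / n
    w_sub = w_sub - lr * (grad + lam * np.sign(w_sub))

    # Proximal step: gradient step on the data term, then soft-threshold.
    grad = X.T @ (X @ w_prox - y) / n
    w_prox = soft_threshold(w_prox - lr * grad, lr * lam)

exact_zeros_sub = int(np.sum(w_sub == 0.0))
exact_zeros_prox = int(np.sum(w_prox == 0.0))
print("exact zeros, subgradient:", exact_zeros_sub)
print("exact zeros, proximal:   ", exact_zeros_prox)
print("largest irrelevant |w|, subgradient:", np.abs(w_sub[3:]).max())
```

Both runs optimize the same objective. The subgradient weights for the irrelevant features typically end up tiny but nonzero, while the proximal update can set them to exactly zero.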

Inspect the Regularization Loss During Training

You can see the penalty being added by looking at model.losses.

```python
import tensorflow as tf

model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(
            4,
            activation="relu",
            kernel_regularizer=tf.keras.regularizers.L1(1e-4),
        ),
        tf.keras.layers.Dense(1),
    ]
)

# Random data, used only to produce a loss value.
x = tf.random.normal((8, 3))
y = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    predictions = model(x)
    data_loss = tf.reduce_mean(tf.square(y - predictions))  # mean squared error
    reg_loss = tf.add_n(model.losses)  # sum of the layers' regularization penalties
    total_loss = data_loss + reg_loss

print("data_loss:", float(data_loss))
print("reg_loss:", float(reg_loss))
print("total_loss:", float(total_loss))
```

This makes the mechanism visible: the model loss is the prediction loss plus the sum of regularization penalties collected from the layers. When you train with model.fit, Keras adds these penalties to the compiled loss for you.

What to Expect in Practice

L1 still encourages sparsity. Weights often shrink toward zero more strongly than with L2, and some may become exactly zero depending on the optimizer, learning rate, initialization, and problem structure. But if your expectation is "every irrelevant feature becomes a hard zero just because I added regularizers.L1," neural-network training may disappoint you.

If exact sparsity matters a lot, you may need one of these additional strategies:

Stronger regularization
Pruning after training
An optimizer or algorithm designed for sparse solutions
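As a rough illustration of the pruning option, the sketch below zeroes every weight whose magnitude falls under a threshold. The helper name `prune_small_weights`, the threshold, and the example kernel are hypothetical and chosen for illustration. This is simple magnitude pruning, not a specific library feature.

```python
import numpy as np

def prune_small_weights(weights, threshold=1e-3):
    """Return a copy of `weights` with entries below `threshold` in magnitude set to 0."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < threshold] = 0.0
    return pruned

# Made-up kernel resembling L1-trained weights: a few large, several tiny.
kernel = np.array([[0.8, -0.0004], [0.0002, -1.2], [0.0009, 0.05]])

pruned_kernel = prune_small_weights(kernel)
sparsity = float(np.mean(pruned_kernel == 0.0))

print(pruned_kernel)
print("fraction of exact zeros:", sparsity)
```

With a Keras layer, the same idea could be applied to the arrays returned by `layer.get_weights()` and written back with `layer.set_weights(...)`. It is worth re-evaluating the model afterwards, since pruning changes its predictions.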

Common Pitfalls

Confusing "not many exact zeros appeared" with "the penalty is not really L1."
Comparing neural-network training behavior to specialized linear Lasso solvers as if they used the same optimization method.
Choosing an L1 coefficient that is too small to have any meaningful effect.
Forgetting that feature scaling influences how strongly the regularizer affects different weights.
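The feature-scaling point can be made concrete with a small NumPy sketch. It fits an unregularized least-squares weight to one feature, then to the same feature multiplied by 100, and compares the L1 penalty each fitted weight would incur. The data and the value of `lam` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3.0 * x[:, 0]  # target depends linearly on the single feature

# Least-squares weight for the raw feature and for the feature scaled by 100.
w_raw = np.linalg.lstsq(x, y, rcond=None)[0]
w_scaled = np.linalg.lstsq(100.0 * x, y, rcond=None)[0]

lam = 0.01  # L1 strength
penalty_raw = lam * np.abs(w_raw).sum()
penalty_scaled = lam * np.abs(w_scaled).sum()

print("weight (raw feature):   ", w_raw[0], "penalty:", penalty_raw)
print("weight (scaled feature):", w_scaled[0], "penalty:", penalty_scaled)
```

Both fits give the same predictions, but the weight on the scaled feature is 100 times smaller and so is its penalty. With unscaled inputs, the same coefficient therefore presses much harder on some weights than on others.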

Summary

Keras and TensorFlow do implement a true L1 penalty based on absolute weight values.
The penalty is added to the training loss through the layer regularizer mechanism.
Gradient-based optimizers can produce behavior that looks less sparse than classical Lasso implementations.
If you need stronger or exact sparsity, regularization alone may not be enough.