Add L2 regularization when using high level tf.layers

Introduction

When you use the older high-level tf.layers API, adding L2 regularization is a two-step job. You attach a regularizer to the layer weights, and then you make sure those regularization losses are actually included in the total loss used for training.

Attach a Regularizer to the Layer

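In TensorFlow 1.x style code, tf.layers.dense and similar layers (written as tf.compat.v1.layers.dense in the examples here) accept a kernel_regularizer. The regularizer is applied to the layer's weight tensor, and the resulting penalty is added to the graph's regularization-loss collection, tf.compat.v1.GraphKeys.REGULARIZATION_LOSSES.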

Example:

python
import tensorflow as tf

# The tf.compat.v1 layers API builds a static graph, so turn eager mode off.
tf.compat.v1.disable_eager_execution()

features = tf.compat.v1.placeholder(tf.float32, shape=[None, 20])
labels = tf.compat.v1.placeholder(tf.float32, shape=[None, 1])

# Each kernel_regularizer registers an L2 penalty on that layer's weight matrix.
hidden = tf.compat.v1.layers.dense(
    features,
    units=64,
    activation=tf.nn.relu,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4)
)

logits = tf.compat.v1.layers.dense(
    hidden,
    units=1,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4)
)

The regularizers do not change the values the layers output. They only create penalty terms computed from the layer kernels.
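To make the size of such a penalty concrete, the following NumPy sketch computes what an L2 regularizer with coefficient 1e-4 would contribute for one kernel. It assumes the convention used by tf.keras.regularizers.l2, which is the coefficient times the sum of squared weights. The kernel values are made up for illustration.

```python
import numpy as np

# Hypothetical 3x2 kernel; in real code the layer creates and owns this tensor.
kernel = np.array([[0.5, -1.0],
                   [2.0, 0.0],
                   [-0.5, 1.5]])
coefficient = 1e-4

# Assumed convention: penalty = coefficient * sum of squared weights.
penalty = coefficient * np.sum(np.square(kernel))
print(penalty)
```

Some older TensorFlow 1.x regularizers, such as tf.contrib.layers.l2_regularizer, use half of that sum, so the same coefficient is not directly comparable between the two.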

Add the Regularization Term to Training Loss

This is the part many people miss. Declaring a kernel_regularizer has no effect on training unless your training code adds the collected regularization losses to the main objective.

python
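# Data-fitting term: binary cross-entropy computed from the raw logits.
data_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
)

# Sum of every penalty registered through the kernel_regularizer arguments.
reg_loss = tf.compat.v1.losses.get_regularization_loss()
total_loss = data_loss + reg_loss

# Minimize the combined objective, not data_loss alone.
train_op = tf.compat.v1.train.AdamOptimizer(1e-3).minimize(total_loss)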

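If you are already using the tf.compat.v1.losses helpers, you can also let TensorFlow build the combined loss for you. get_total_loss sums the losses those helpers registered plus the regularization losses. A loss built by hand, for example with tf.reduce_mean, is not included unless you register it with tf.compat.v1.losses.add_loss.

python
# Registers a cross-entropy loss in the graph's loss collection.
data_loss = tf.compat.v1.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits)
# Registered losses plus the collected regularization losses.
total_loss = tf.compat.v1.losses.get_total_loss(add_regularization_losses=True)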

The important check is simple: optimize total_loss, not only data_loss.

What L2 Regularization Changes

L2 regularization penalizes large weights. In practice, that nudges the optimizer toward smaller parameter values, which often improves generalization when a model starts memorizing the training set.
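A short NumPy sketch of why the penalty pulls weights toward zero: the gradient of coefficient * sum(w**2) with respect to w is 2 * coefficient * w, so a gradient step on the penalty alone multiplies every weight by the same factor below one. The weights, learning rate, and coefficient are arbitrary illustrative values, and the data-loss gradient is left out on purpose.

```python
import numpy as np

weights = np.array([3.0, -2.0, 0.5])
coefficient = 0.1     # illustrative; far larger than typical values
learning_rate = 0.5

# Gradient of coefficient * sum(w**2) with respect to w.
penalty_grad = 2.0 * coefficient * weights

# One plain gradient-descent step on the penalty term only.
updated = weights - learning_rate * penalty_grad

# Every weight is scaled by (1 - 2 * learning_rate * coefficient) = 0.9.
print(updated)
```

In a real training step this shrinkage competes with the data-loss gradient, so weights only stay large where the data justifies them.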

The regularization strength matters:

  • Too small, and it does almost nothing
  • Too large, and the model underfits

For dense layers, values such as 1e-5, 1e-4, or 1e-3 are common starting points, but the correct choice depends on model size, optimizer, and dataset scale.
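The effect of the coefficient can be seen on a small linear least-squares problem, where L2 regularization has a closed-form solution (ridge regression). This is a stand-in for the neural-network case, not the tf.layers code path. The synthetic data, the ridge_weights helper, and the coefficient grid are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

def ridge_weights(X, y, coefficient):
    """Minimize ||X w - y||^2 + coefficient * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + coefficient * np.eye(n_features), X.T @ y)

coefficients = [0.0, 1e-3, 1.0, 100.0, 1e4]
norms = [np.linalg.norm(ridge_weights(X, y, c)) for c in coefficients]
for c, n in zip(coefficients, norms):
    print(f"coefficient={c:g}  weight norm={n:.4f}")
```

The smallest coefficients leave the solution almost unchanged, while the largest one shrinks the weights close to zero, which corresponds to the underfitting regime.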

Another practical benefit shows up with correlated features. Instead of letting one weight grow very large while the rest stay near zero, the penalty favors spreading influence across them when that fits the data, because several moderate weights cost less than one large one.
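The correlated-feature behavior can be sketched with a closed-form L2-regularized least-squares fit. Two input columns are exact copies, so without a penalty any split of the weight between them fits equally well. With an L2 penalty the unique solution splits it evenly. The data and coefficient are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
signal = rng.normal(size=100)
X = np.column_stack([signal, signal])   # two perfectly correlated features
y = 2.0 * signal                        # any w with w[0] + w[1] == 2 fits exactly

coefficient = 1e-3
# Closed-form minimizer of ||X w - y||^2 + coefficient * ||w||^2.
w = np.linalg.solve(X.T @ X + coefficient * np.eye(2), X.T @ y)
print(w)  # both entries are equal and close to 1.0

# Penalty (before scaling) for two exact fits: even split vs. one large weight.
even_split = np.sum(np.square([1.0, 1.0]))
one_large = np.sum(np.square([2.0, 0.0]))
print(even_split, one_large)
```

The even split has half the squared-weight cost of the single large weight, which is why the regularized solution prefers it.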

A Full Minimal Example

Here is a compact training graph using the old layers API:

python
import numpy as np
import tensorflow as tf

# Graph-mode API: placeholders and sessions need eager execution turned off.
tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, shape=[None, 2])
y = tf.compat.v1.placeholder(tf.float32, shape=[None, 1])

net = tf.compat.v1.layers.dense(
    x, 16, activation=tf.nn.relu,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4)
)
pred = tf.compat.v1.layers.dense(
    net, 1,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4)
)

# Mean squared error plus the collected L2 penalties.
data_loss = tf.reduce_mean(tf.square(pred - y))
loss = data_loss + tf.compat.v1.losses.get_regularization_loss()
train_op = tf.compat.v1.train.AdamOptimizer(0.01).minimize(loss)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    batch_x = np.array([[0.0, 0.0], [1.0, 1.0]], dtype=np.float32)
    batch_y = np.array([[0.0], [1.0]], dtype=np.float32)
    # One optimizer step on the regularized loss.
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})

Migration Note

tf.layers belongs to the older TensorFlow 1.x style graph API. In modern TensorFlow, the same idea is usually expressed with tf.keras.layers.Dense and a kernel_regularizer. The principle is unchanged: regularize the weights and ensure the regularization loss participates in optimization.
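A minimal sketch of that modern form, assuming TensorFlow 2.x with eager execution and a hand-written training step. The layer sizes and the two-row batch are illustrative only. When a model is trained through compile() and fit(), Keras adds the regularization losses for you. In a custom loop they are exposed as model.losses and need to be summed into the objective explicitly.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(1,
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])
optimizer = tf.keras.optimizers.Adam(0.01)

batch_x = np.array([[0.0, 0.0], [1.0, 1.0]], dtype=np.float32)
batch_y = np.array([[0.0], [1.0]], dtype=np.float32)

with tf.GradientTape() as tape:
    pred = model(batch_x, training=True)
    data_loss = tf.reduce_mean(tf.square(pred - batch_y))
    # model.losses holds one penalty tensor per regularized kernel.
    reg_loss = tf.add_n(model.losses)
    total_loss = data_loss + reg_loss

# Differentiate the combined loss so the penalty influences the update.
grads = tape.gradient(total_loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

Here model.losses plays the role that tf.compat.v1.losses.get_regularization_loss() plays in the graph-mode code.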

Common Pitfalls

  • Setting kernel_regularizer and then minimizing only the data loss.
  • Regularizing everything indiscriminately. Bias terms and batch-normalization parameters are often left unregularized.
  • Using a regularization coefficient that is far too large for the model scale.
  • Mixing tf.layers, tf.compat.v1.layers, and Keras code without checking how the total loss is assembled.

Summary

  • Add L2 regularization through the layer's kernel_regularizer argument.
  • Make sure the collected regularization losses are included in the optimized loss.
  • Tune the coefficient empirically rather than guessing once and leaving it fixed.
  • In old tf.layers code, tf.compat.v1.losses.get_regularization_loss() is the key helper.
  • In modern code, the same pattern exists in tf.keras, even though the API surface is different.
