About tf.nn.softmax_cross_entropy_with_logits_v2

Introduction

tf.nn.softmax_cross_entropy_with_logits (named softmax_cross_entropy_with_logits_v2 in TF 1.x) computes the softmax cross-entropy loss between logits and labels in a single, numerically stable operation. Fusing the softmax activation and the cross-entropy loss avoids the numerical problems that can occur when the two are computed separately.

The Math

Cross-entropy loss measures the difference between two probability distributions:

H(y, ŷ) = -Σᵢ yᵢ log(ŷᵢ)

Where:

  • yᵢ is the true probability of class i (y is usually a one-hot encoded vector)
  • ŷᵢ is the predicted probability of class i, produced by the softmax function

The softmax function converts the logits zᵢ to probabilities:

ŷᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
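
The two formulas can be checked by hand on a small example. The sketch below uses only Python's standard library; the helper names softmax and cross_entropy are illustrative and are not TensorFlow APIs.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (naive form, fine for small values)."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(labels, probs):
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i); terms with y_i == 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(labels, probs) if y > 0)

logits = [2.0, 1.0, 0.1]   # raw scores for three classes
labels = [1.0, 0.0, 0.0]   # one-hot: the true class is class 0

probs = softmax(logits)
loss = cross_entropy(labels, probs)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(loss, 4))                # 0.417
```

With a one-hot label the sum collapses to a single term, so the loss is just the negative log of the probability assigned to the true class.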

Basic Usage

python
import tensorflow as tf

# Logits: raw, unnormalized output from the last layer
logits = tf.constant([[2.0, 1.0, 0.1]])

# Labels: one-hot encoded true class
labels = tf.constant([[1.0, 0.0, 0.0]])

# Compute loss
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(loss)  # tf.Tensor([0.4170299], shape=(1,), dtype=float32)

Parameters

  • labels: A tensor with the same shape and dtype as logits. Each vector along the class dimension holds the true class distribution, either one-hot or soft labels (probabilities that sum to 1).
  • logits: The unscaled output of the model, typically the last layer before softmax is applied. They can take arbitrary real values.
  • axis: The class dimension. Defaults to -1, the last dimension.
  • name: An optional name for the operation.

Why Use This Instead of Manual Computation?

Computing softmax and cross-entropy separately can give inaccurate or undefined results when logits are large:

python
import tensorflow as tf

logits = tf.constant([[100.0, 0.0]])  # one very large logit
labels = tf.constant([[0.0, 1.0]])    # true class is the low-scoring one

# BAD: numerically fragile
probs = tf.nn.softmax(logits)          # second probability underflows to (nearly) 0
loss = -tf.reduce_sum(labels * tf.math.log(probs + 1e-10))  # epsilon avoids log(0) but caps the loss near 23

# GOOD: numerically stable (uses log-sum-exp trick internally)
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)  # about 100

The fused operation uses the log-sum-exp trick to avoid overflow and underflow.
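
The trick itself is easy to illustrate outside TensorFlow. The following is a pure-Python sketch of the idea, not TensorFlow's actual kernel; naive_loss and stable_loss are hypothetical helper names. It relies on the identity log softmax(z)ᵢ = zᵢ - logsumexp(z), with the maximum logit subtracted before exponentiating.

```python
import math

def naive_loss(labels, logits):
    """Softmax then log, computed separately; exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return -sum(y * math.log(e / total) for y, e in zip(labels, exps) if y > 0)

def stable_loss(labels, logits):
    """Cross-entropy via log-sum-exp: shift by max(logits) before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # log softmax(z)_i = z_i - logsumexp(z), so no tiny probability is ever formed.
    return -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))

labels = [0.0, 1.0]
logits = [1000.0, 0.0]

try:
    naive_loss(labels, logits)
    naive_failed = False
except OverflowError:
    naive_failed = True  # math.exp(1000.0) exceeds the float range

print(naive_failed)                 # True
print(stable_loss(labels, logits))  # 1000.0
```

Because every exponent is at most zero after the shift, the largest term is exp(0) = 1 and the sum can never overflow.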

v1 vs v2 vs Current

python
# TF 1.x, original function: gradients flow into logits only (labels are treated as constants)
loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
# Deprecated in TF 1.x in favor of _v2

# TF 1.x, _v2: gradients flow into both logits and labels
# (in TF 2.x this name exists only as tf.compat.v1.nn.softmax_cross_entropy_with_logits_v2)
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)

# TF 2.x: the base name now has the _v2 behavior (gradients can flow into labels)
loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=tf.stop_gradient(labels),  # only needed if labels depend on trainable variables
    logits=logits,
)

In TF2, always use tf.nn.softmax_cross_entropy_with_logits (without _v2). If the labels are computed from trainable variables and should not receive gradients, wrap them in tf.stop_gradient.
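
Whether gradients reach the labels matters only because the loss really does depend on them: it is linear in y, so ∂H/∂yᵢ = -log ŷᵢ. The standard-library sketch below checks that derivative with finite differences; log_softmax and loss are illustrative helpers, not TensorFlow functions.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def loss(labels, logits):
    """Cross-entropy between a label distribution and softmax(logits)."""
    return -sum(y * ls for y, ls in zip(labels, log_softmax(logits)))

logits = [2.0, 1.0, 0.1]
labels = [0.7, 0.2, 0.1]

# Analytic gradient with respect to each label entry: -log softmax(z)_i
analytic = [-ls for ls in log_softmax(logits)]

# Central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(labels)):
    up = labels.copy()
    up[i] += eps
    down = labels.copy()
    down[i] -= eps
    numeric.append((loss(up, logits) - loss(down, logits)) / (2 * eps))

print([round(g, 4) for g in analytic])
print([round(g, 4) for g in numeric])
```

For fixed labels such as one-hot targets from a dataset this gradient is never used. It becomes relevant when the labels are themselves produced by a differentiable computation, for example a teacher network trained jointly with the student.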

Using in a Model

With tf.keras

python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)  # no softmax: output raw logits
])

# Option 1: Use from_logits=True in the loss (one-hot labels)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

# Option 2 (instead of Option 1): Use SparseCategoricalCrossentropy for integer labels
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

Custom Training Loop

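python
# Assumes `model` is a Keras model that outputs raw logits (as defined above)
# and that `labels` are one-hot float tensors with the same shape as the logits.
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        # The op returns one loss per example; average over the batch
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
        )
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss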

Soft Labels and Label Smoothing

The function supports soft labels (non-one-hot distributions):

python
# Soft labels (knowledge distillation); logits as defined in Basic Usage
soft_labels = tf.constant([[0.7, 0.2, 0.1]])
loss = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

# Label smoothing
def smooth_labels(labels, smoothing=0.1):
    n_classes = tf.cast(tf.shape(labels)[-1], tf.float32)
    return labels * (1 - smoothing) + smoothing / n_classes

smoothed = smooth_labels(tf.constant([[1.0, 0.0, 0.0]]), 0.1)
# [[0.9333, 0.0333, 0.0333]]
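
To see what smoothing does to the loss value, the same arithmetic can be reproduced with plain Python lists. This is a sketch under the assumption of a single example with three classes; soft_cross_entropy is a hypothetical helper, and smooth_labels mirrors the TensorFlow version above.

```python
import math

def smooth_labels(labels, smoothing=0.1):
    """Blend one-hot labels with a uniform distribution over the classes."""
    n_classes = len(labels)
    return [y * (1 - smoothing) + smoothing / n_classes for y in labels]

def soft_cross_entropy(labels, logits):
    """Cross-entropy against an arbitrary label distribution, via log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - lse) for y, z in zip(labels, logits))

logits = [2.0, 1.0, 0.1]
one_hot = [1.0, 0.0, 0.0]
smoothed = smooth_labels(one_hot, 0.1)

print([round(y, 4) for y in smoothed])       # [0.9333, 0.0333, 0.0333]
print(soft_cross_entropy(one_hot, logits))   # about 0.417
print(soft_cross_entropy(smoothed, logits))  # larger than the one-hot loss
```

The smoothed labels still sum to 1, and the loss rises because a little probability mass is now expected on the classes the model scores lowest.
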

Related Functions

| Function | Labels format | Use case |
| --- | --- | --- |
| softmax_cross_entropy_with_logits | One-hot, e.g. [0, 1, 0] | Multi-class, soft labels |
| sparse_softmax_cross_entropy_with_logits | Integer, e.g. 2 | Multi-class, integer labels |
| sigmoid_cross_entropy_with_logits | Multi-hot, e.g. [1, 0, 1] | Multi-label (multiple true classes) |

python
# Sparse version: takes integer labels directly
labels_sparse = tf.constant([0])  # class 0
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels_sparse, logits=logits
)
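
The sparse and one-hot variants compute the same quantity; the sparse form simply picks out the log-probability of the labeled class instead of multiplying by a one-hot vector. A standard-library sketch of that equivalence (dense_loss and sparse_loss are illustrative names, not TensorFlow functions):

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def dense_loss(one_hot, logits):
    """Cross-entropy with a one-hot (or soft) label vector."""
    return -sum(y * ls for y, ls in zip(one_hot, log_softmax(logits)))

def sparse_loss(class_index, logits):
    """Cross-entropy with an integer class label: index into log-softmax."""
    return -log_softmax(logits)[class_index]

logits = [2.0, 1.0, 0.1]
for k in range(len(logits)):
    one_hot = [1.0 if i == k else 0.0 for i in range(len(logits))]
    print(k, round(dense_loss(one_hot, logits), 4), round(sparse_loss(k, logits), 4))
```

The integer form avoids materializing a one-hot tensor, which is why it is usually preferred when the number of classes is large.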

Common Pitfalls

  • Do not apply softmax before this function: The function expects raw logits, not probabilities. Applying softmax first computes softmax(softmax(logits)), which gives wrong results.
  • Labels must sum to 1: For correct cross-entropy, each label vector should sum to 1. One-hot labels naturally satisfy this.
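  • Shape mismatch: Labels and logits must have the same shape. The function operates on the last dimension by default (axis=-1).
  • v2 naming: In TF2 the base function already has the _v2 behavior, and the _v2 name survives only under tf.compat.v1. Gradients can flow into labels, so wrap labels in tf.stop_gradient if they depend on trainable variables and should be treated as constants.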
  • Keras from_logits: When using CategoricalCrossentropy in Keras, always set from_logits=True if your model outputs raw logits. The default from_logits=False assumes probabilities.
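
The first pitfall is easy to reproduce numerically. The sketch below, using only the standard library and hypothetical helper names, feeds probabilities into a loss that expects logits and compares the result with the correct value.

```python
import math

def softmax(values):
    """Stable softmax: subtract the maximum before exponentiating."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_logits(labels, logits):
    """Softmax cross-entropy that, like the TensorFlow op, expects raw logits."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(labels, probs) if y > 0)

logits = [2.0, 1.0, 0.1]
labels = [1.0, 0.0, 0.0]

correct = loss_from_logits(labels, logits)
# Mistake: probabilities passed where logits are expected, so softmax runs twice
double = loss_from_logits(labels, softmax(logits))

print(round(correct, 4))  # 0.417
print(round(double, 4))   # noticeably larger
```

Because probabilities lie in [0, 1], the second softmax flattens them toward a uniform distribution, so the doubled loss stays well above zero no matter how confident the model is.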

Summary

  • tf.nn.softmax_cross_entropy_with_logits computes softmax + cross-entropy in one numerically stable step
  • Pass raw logits (not softmax output) and one-hot or soft labels
  • In TF2, the base function has the _v2 behavior (gradients can flow into labels), so there is no need for the _v2 variant
  • In Keras, use CategoricalCrossentropy(from_logits=True) for the equivalent
  • Use sparse_softmax_cross_entropy_with_logits for integer labels instead of one-hot
