In TensorFlow, what is the difference between sampled_softmax_loss and softmax_cross_entropy_with_logits?

Introduction

softmax_cross_entropy_with_logits and sampled_softmax_loss are both used to train multi-class models, but they target different output-space sizes. Full softmax computes the loss exactly over every class, while sampled softmax approximates that objective by comparing the true class against only a sampled set of negatives.

Full Softmax Uses All Classes

With full softmax, the model produces one logit per class and normalizes across the entire output space. That is exact, which makes it the simplest and most faithful loss when the number of classes is manageable.

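```python
import tensorflow as tf

# One example, three classes: one logit per class.
logits = tf.constant([[2.0, 0.5, -1.0]], dtype=tf.float32)
# One-hot target distribution: the true class is index 0.
labels = tf.constant([[1.0, 0.0, 0.0]], dtype=tf.float32)

# Returns one loss value per example (shape [batch_size]).
loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=labels,
    logits=logits
)

print(loss.numpy())
```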

This is ideal when your output space is small enough that computing all logits is cheap.
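To make the "normalizes across the entire output space" step concrete, here is a minimal pure-Python sketch that recomputes the loss for the same logits by hand. The helper name `full_softmax_cross_entropy` is hypothetical and exists only for illustration; it is not a TensorFlow function.

```python
import math


def full_softmax_cross_entropy(logits, true_class):
    """Exact cross-entropy for one example (hypothetical helper, not a TF API)."""
    # Log of the normalizer: the sum runs over EVERY class.
    # Subtracting the max logit first keeps exp() numerically stable.
    max_logit = max(logits)
    log_normalizer = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    # loss = -log p(true_class) = log_normalizer - logit of the true class
    return log_normalizer - logits[true_class]


loss = full_softmax_cross_entropy([2.0, 0.5, -1.0], true_class=0)
print(round(loss, 4))
```

For these three logits the result should be close to 0.2413. The cost of the sum inside `log_normalizer` grows linearly with the number of classes, which is the part that becomes expensive for very large output spaces.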

Sampled Softmax Uses a Subset of Classes

sampled_softmax_loss exists for very large output spaces such as language-model vocabularies or huge recommendation catalogs. Instead of comparing against every class, it compares the true class against a sampled set of negatives.

```python
import tensorflow as tf

num_classes = 10000
embedding_dim = 64
batch_size = 4

# Output-layer parameters: one weight row and one bias per class.
weights = tf.Variable(tf.random.normal([num_classes, embedding_dim]))
biases = tf.Variable(tf.zeros([num_classes]))
# Hidden activations that feed the output layer.
inputs = tf.random.normal([batch_size, embedding_dim])
# Integer class IDs with shape [batch_size, num_true].
labels = tf.constant([[1], [5], [42], [999]], dtype=tf.int64)

# Returns one loss value per example (shape [batch_size]).
loss = tf.nn.sampled_softmax_loss(
    weights=weights,
    biases=biases,
    labels=labels,
    inputs=inputs,
    num_sampled=100,
    num_classes=num_classes
)

print(tf.reduce_mean(loss).numpy())
```

This is cheaper because each training step scores only the true classes and the 100 sampled negatives instead of normalizing over all 10,000 classes.
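A back-of-the-envelope count shows the size of the saving for the sizes used in the example above. This is only a rough sketch: it counts multiply-adds in the output layer's forward pass and ignores the sampling step itself, the backward pass, and everything below the output layer.

```python
# Sizes taken from the sampled-softmax example above.
num_classes = 10000
embedding_dim = 64
batch_size = 4
num_sampled = 100

# Full softmax: every example is scored against every class.
full_cost = batch_size * num_classes * embedding_dim

# Sampled softmax: every example is scored against its true class
# plus the sampled negatives.
sampled_cost = batch_size * (1 + num_sampled) * embedding_dim

print(full_cost, sampled_cost, round(full_cost / sampled_cost, 1))
```

Under these assumptions the output layer does roughly 99 times less scoring work per step, and the gap widens as `num_classes` grows while `num_sampled` stays fixed.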

The APIs Are Structurally Different

The input signatures already show that these losses are not interchangeable.

softmax_cross_entropy_with_logits expects:

  • a full logits tensor with shape batch_size x num_classes
  • labels representing the target distribution

sampled_softmax_loss expects:

  • output-layer weights
  • output-layer biases
  • hidden inputs
  • integer labels

That difference exists because sampled softmax constructs the comparison set internally rather than consuming a full matrix of logits.
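The following pure-Python sketch illustrates why the function needs the output-layer parameters rather than finished logits: it has to pick the candidate classes first and then score only those. This is a conceptual illustration, not TensorFlow's implementation. It assumes uniform negative sampling, whereas TensorFlow's default sampler is log-uniform and the library also adjusts logits for sampling probability; both details are left out here. The name `sampled_softmax_sketch` is hypothetical.

```python
import math
import random


def sampled_softmax_sketch(hidden, weights, biases, true_class, num_sampled, rng):
    """Conceptual sketch only (hypothetical helper, not TensorFlow's code)."""
    num_classes = len(weights)
    # Assumption: negatives are drawn uniformly, excluding the true class.
    pool = [c for c in range(num_classes) if c != true_class]
    negatives = rng.sample(pool, num_sampled)
    candidates = [true_class] + negatives

    # Score ONLY the candidates: logit = hidden . weights[c] + biases[c]
    logits = [
        sum(h * w for h, w in zip(hidden, weights[c])) + biases[c]
        for c in candidates
    ]

    # Softmax cross-entropy over the small candidate set; true class is index 0.
    max_logit = max(logits)
    log_normalizer = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_normalizer - logits[0], candidates


rng = random.Random(0)
num_classes, dim = 1000, 8
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_classes)]
biases = [0.0] * num_classes
hidden = [rng.gauss(0, 1) for _ in range(dim)]

loss, candidates = sampled_softmax_sketch(
    hidden, weights, biases, true_class=42, num_sampled=20, rng=rng
)
print(len(candidates), loss)
```

Only 21 of the 1,000 weight rows are touched in this run, which is the reason a precomputed full logit matrix would defeat the purpose.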

Exact Loss Versus Approximate Training

A common misunderstanding is that sampled softmax is just a faster implementation of the exact same loss. It is not. It is an approximation designed to reduce training cost when full softmax is too expensive.

That approximation quality depends on things such as:

  • number of sampled negatives
  • negative-sampling distribution
  • class-frequency structure in the data

If num_sampled is too small, training quality can suffer. If it is too large, the performance benefit shrinks.
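Both the number of negatives and the sampling distribution can be controlled from the call site. The sketch below passes an explicit `sampled_values` tuple from `tf.random.uniform_candidate_sampler` instead of relying on the default log-uniform sampler. The wrapper name `loss_with_uniform_negatives` and the chosen sizes are assumptions made for illustration; whether uniform sampling suits a given dataset depends on its class-frequency structure.

```python
import tensorflow as tf

num_classes = 10000
embedding_dim = 64

weights = tf.Variable(tf.random.normal([num_classes, embedding_dim]))
biases = tf.Variable(tf.zeros([num_classes]))
inputs = tf.random.normal([4, embedding_dim])
labels = tf.constant([[1], [5], [42], [999]], dtype=tf.int64)


def loss_with_uniform_negatives(num_sampled):
    """Hypothetical wrapper: sampled softmax with uniformly drawn negatives."""
    # Returns (sampled_candidates, true_expected_count, sampled_expected_count).
    sampled_values = tf.random.uniform_candidate_sampler(
        true_classes=labels,
        num_true=1,
        num_sampled=num_sampled,
        unique=True,
        range_max=num_classes,
    )
    per_example = tf.nn.sampled_softmax_loss(
        weights=weights,
        biases=biases,
        labels=labels,
        inputs=inputs,
        num_sampled=num_sampled,
        num_classes=num_classes,
        sampled_values=sampled_values,
    )
    return tf.reduce_mean(per_example)


for n in (10, 100, 1000):
    print(n, float(loss_with_uniform_negatives(n)))
```

The printed values are not comparable quality scores, because each setting optimizes a slightly different approximate objective. In practice the effect of `num_sampled` is better judged by evaluating the trained model with the full softmax.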

Training-Time Trick Versus Evaluation-Time Need

Sampled softmax is mainly a training optimization. If you need exact probabilities or exact top-class comparisons across all classes at evaluation time, you still need the full output computation.

That is why sampled softmax does not make the large output space disappear. It reduces the cost of learning, not the cost of every possible downstream task.
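A common pattern is to train with the sampled loss and evaluate with full logits built from the same output-layer parameters. The sketch below shows only the evaluation side. The variable names and sizes mirror the earlier example, and random values stand in for trained parameters.

```python
import tensorflow as tf

num_classes = 10000
embedding_dim = 64

# Stand-ins for trained output-layer parameters and hidden activations.
weights = tf.Variable(tf.random.normal([num_classes, embedding_dim]))
biases = tf.Variable(tf.zeros([num_classes]))
inputs = tf.random.normal([4, embedding_dim])
labels = tf.constant([[1], [5], [42], [999]], dtype=tf.int64)

# Evaluation: score every class, giving a [batch_size, num_classes] matrix.
full_logits = tf.matmul(inputs, weights, transpose_b=True) + biases

# Exact loss over all classes; labels are flattened to shape [batch_size].
eval_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.squeeze(labels, axis=1),
    logits=full_logits,
)

# Exact top-class prediction also needs the full logit matrix.
top_class = tf.argmax(full_logits, axis=1)

print(full_logits.shape, eval_loss.shape, top_class.shape)
```

The matrix multiplication here has the full per-example cost that sampled softmax avoids during training, so it is usually reserved for validation and inference.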

When to Use Which Loss

Use full softmax when:

  • the number of classes is moderate
  • you want the exact objective
  • computing all logits is affordable

Use sampled softmax when:

  • the output space is very large
  • the output layer dominates training cost
  • an approximate objective is acceptable

As a rough rule, small or medium classification problems should usually stay with full softmax because it is simpler and exact.

Sparse Full Softmax Is Still Full Softmax

TensorFlow also offers sparse_softmax_cross_entropy_with_logits, which often gets confused with sampled softmax. It still computes the exact full-softmax objective; it simply accepts integer class IDs instead of one-hot labels.

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]], dtype=tf.float32)
# Integer class ID (shape [batch_size]) instead of a one-hot row.
labels = tf.constant([0], dtype=tf.int32)

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels,
    logits=logits
)

print(loss.numpy())
```

That is still full softmax, just with a different label format.
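A quick way to check this is to compute both losses on the same logits and compare them. The sketch below uses two made-up examples and converts the integer IDs to one-hot rows with `tf.one_hot`; the two results should agree up to floating-point error.

```python
import tensorflow as tf

# Two made-up examples with three classes each.
logits = tf.constant([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]], dtype=tf.float32)
class_ids = tf.constant([0, 2], dtype=tf.int32)

# Integer-label variant.
sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=class_ids,
    logits=logits,
)

# One-hot variant built from the same class IDs.
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=tf.one_hot(class_ids, depth=3),
    logits=logits,
)

max_gap = tf.reduce_max(tf.abs(sparse_loss - dense_loss))
print(float(max_gap))
```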

Common Pitfalls

The biggest mistake is assuming sampled softmax and full softmax are numerically equivalent. They are not; one is exact and the other is approximate.

Another common issue is trying to pass precomputed full logits into sampled_softmax_loss. That API expects output-layer parameters and hidden activations instead.

People also use sampled softmax on small class counts where the exact loss is simpler and perfectly affordable.

Finally, do not forget that exact evaluation over all classes may still require full output computation even if training used sampled negatives.

Summary

  • softmax_cross_entropy_with_logits computes the exact loss across all classes.
  • sampled_softmax_loss approximates that objective with sampled negatives.
  • Sampled softmax is mainly a training-time optimization for very large output spaces.
  • The two APIs differ because sampled softmax does not consume a full logit matrix.
  • Use full softmax for manageable class counts and sampled softmax when the output layer becomes the bottleneck.
