How is total loss calculated over multiple classes in Keras?

Introduction

In multi-class classification, Keras does not compute one separate training objective per class and then treat them as unrelated numbers. It computes a per-sample classification loss from the model’s predicted distribution, reduces those sample losses across the batch, and then adds any extra regularization or weighted output losses that apply.

What Happens for One Sample

For a softmax classifier with one-hot labels, the usual loss is categorical cross-entropy. Keras compares the true target distribution to the predicted class probabilities and measures how much probability the model assigned to the correct class.

If the true class is represented by a one-hot vector, only the correct class contributes to the loss for that sample. In practice, that means the loss becomes the negative log of the predicted probability assigned to the correct class.

python
import tensorflow as tf

y_true = tf.constant([[0.0, 1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.7, 0.2]])

loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)
print(float(loss))

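In this example, the true class is the second one (index 1) and the model assigned it probability 0.7, so the per-sample loss is -log(0.7), roughly 0.357. Because the batch holds a single sample, the default mean reduction returns that same value.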
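The same number can be reproduced by hand, which makes the role of the one-hot vector explicit. The following is a minimal sketch in plain Python, not Keras's own implementation: the helper name `categorical_cross_entropy` is hypothetical, and the sketch leaves out the small clipping Keras applies to avoid taking log(0).

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Hand-written cross-entropy for one sample: -sum(y_true[c] * log(y_pred[c]))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]
y_pred = [0.1, 0.7, 0.2]

# Each class contributes -y_true[c] * log(y_pred[c]); the zeros in the
# one-hot vector cancel every term except the one for the true class.
terms = [-t * math.log(p) for t, p in zip(y_true, y_pred)]
loss = categorical_cross_entropy(y_true, y_pred)

print(terms)
print(loss, -math.log(0.7))
```

Only the middle entry of `terms` is non-zero, so the sum over classes collapses to the negative log of the probability given to the true class.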

From Per-Class Scores to Batch Loss

The model outputs one score or probability per class, but Keras reduces that vector into one loss value per sample. Then it reduces those per-sample losses across the batch, usually by taking the mean.

You can inspect the unreduced per-sample values by constructing the loss with reduction="none".

python
import tensorflow as tf

y_true = tf.constant([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

y_pred = tf.constant([
    [0.8, 0.1, 0.1],
    [0.2, 0.5, 0.3],
])

loss_fn = tf.keras.losses.CategoricalCrossentropy(reduction="none")
per_sample_loss = loss_fn(y_true, y_pred)
print(per_sample_loss.numpy())
print(tf.reduce_mean(per_sample_loss).numpy())

That is usually the easiest way to understand how Keras gets from class probabilities to the scalar loss displayed during training.
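To check those two printed values without TensorFlow, the two reduction steps can be written out in plain Python. This is a sketch of the arithmetic only, assuming mean reduction and ignoring Keras's internal clipping of probabilities.

```python
import math

y_true = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
y_pred = [
    [0.8, 0.1, 0.1],
    [0.2, 0.5, 0.3],
]

# Step 1: collapse each row of class probabilities into one loss per sample.
per_sample_loss = [
    -sum(t * math.log(p) for t, p in zip(true_row, pred_row))
    for true_row, pred_row in zip(y_true, y_pred)
]

# Step 2: reduce across the batch by taking the mean.
batch_loss = sum(per_sample_loss) / len(per_sample_loss)

print([round(v, 4) for v in per_sample_loss])  # [0.2231, 0.6931]
print(round(batch_loss, 4))                    # 0.4581
```

The per-sample values are -log(0.8) and -log(0.5), and the scalar loss is their average.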

Sparse Labels, Logits, and Other Variants

If your labels are integer class IDs instead of one-hot vectors, use SparseCategoricalCrossentropy. If the model outputs raw logits instead of probabilities, pass from_logits=True so Keras applies the softmax itself as part of a numerically stable loss computation.

python
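import tensorflow as tf

# `model` is assumed to be an existing tf.keras.Model whose last layer
# outputs raw logits (no softmax activation).
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)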

This does not change the conceptual process. Keras still computes one classification loss per sample and then reduces it.
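A hand-written version shows why the sparse and one-hot variants agree. The sketch below is illustrative rather than Keras's implementation; the helper names `log_softmax`, `sparse_ce_from_logits`, and `one_hot_ce_from_logits` and the logit values are made up for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def sparse_ce_from_logits(class_id, logits):
    """Loss for an integer label: minus the log-probability of that class."""
    return -log_softmax(logits)[class_id]

def one_hot_ce_from_logits(one_hot, logits):
    """Loss for a one-hot label: minus the label-weighted sum of log-probabilities."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))

logits = [2.0, 0.5, -1.0]  # hypothetical raw model outputs for one sample
sparse_loss = sparse_ce_from_logits(0, logits)
one_hot_loss = one_hot_ce_from_logits([1.0, 0.0, 0.0], logits)

print(sparse_loss, one_hot_loss)
```

Both calls return the same value because an integer label simply indexes the entry that the one-hot vector would have selected.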

Class Weights and Sample Weights

When some classes matter more than others, Keras can weight examples before the final batch reduction. That changes the contribution of each training example to the total loss without changing the model architecture.

python
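# `model`, `x_train`, and `y_train` are assumed to exist already.
# The keys of class_weight are class indices.
model.fit(
    x_train,
    y_train,
    class_weight={0: 1.0, 1: 2.5, 2: 1.5},
    epochs=5,
)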

With class weights, misclassifying a sample from a heavier class increases the batch loss more than misclassifying a sample from a lighter class.
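The effect on the batch loss can be worked through by hand. The sketch below uses made-up per-sample losses and assumes the default reduction, in which the weighted losses are summed and divided by the number of samples in the batch rather than by the sum of the weights; check the reduction setting of your Keras version if the numbers need to match exactly.

```python
class_weight = {0: 1.0, 1: 2.5, 2: 1.5}

# Hypothetical batch: integer labels and the unweighted per-sample losses
# that cross-entropy would have produced for them.
labels = [0, 1, 2, 1]
per_sample_loss = [0.2, 0.9, 0.4, 0.1]

# Each sample inherits the weight of its true class.
sample_weight = [class_weight[label] for label in labels]
weighted = [w * l for w, l in zip(sample_weight, per_sample_loss)]

# Assumed default reduction: weighted sum divided by the batch size
# (not by the sum of the weights).
unweighted_batch_loss = sum(per_sample_loss) / len(per_sample_loss)
weighted_batch_loss = sum(weighted) / len(weighted)

print(sample_weight)
print(unweighted_batch_loss, weighted_batch_loss)
```

The second sample belongs to class 1 and has the largest raw loss, so its weight of 2.5 accounts for most of the increase in the batch loss.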

Regularization and Multi-Output Models

If your layers include regularizers, those penalties are added to the data loss. If your model has multiple outputs, Keras computes one loss per output and combines them, optionally using loss_weights.

That is an important distinction. Multiple classes inside one softmax output are not the same as multiple outputs. A single multi-class head still produces one classification loss per sample.
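As a rough picture of how the pieces add up for a multi-output model, the following sketch combines two already-reduced output losses with an L2 penalty. All names and numbers here (`category`, `sentiment`, the kernel values, the factor 0.01) are hypothetical, and the code mirrors the arithmetic rather than calling Keras.

```python
# Hypothetical batch-reduced losses for a two-output model.
output_losses = {"category": 0.46, "sentiment": 0.30}
loss_weights = {"category": 1.0, "sentiment": 0.5}

# Hypothetical L2 penalty: factor * sum of squared kernel weights.
l2_factor = 0.01
kernel = [0.5, -1.0, 2.0]
regularization_loss = l2_factor * sum(w * w for w in kernel)

# Weighted sum of the per-output losses, then add the penalty.
data_loss = sum(loss_weights[name] * value for name, value in output_losses.items())
total_loss = data_loss + regularization_loss

print(data_loss, regularization_loss, total_loss)
```

Each entry in `output_losses` is itself the result of the per-sample and per-batch reduction described earlier; the number of classes inside an output never adds extra terms to this sum.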

Common Pitfalls

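  • Mixing one-hot labels with SparseCategoricalCrossentropy typically fails with a shape mismatch or produces meaningless values, because that loss expects integer class IDs.
  • Forgetting from_logits=True when the model outputs raw logits makes Keras treat those logits as if they were probabilities, so the reported loss no longer equals the cross-entropy of the softmax distribution.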
  • Assuming Keras sums losses over classes independently can lead to wrong manual calculations.
  • Ignoring class weights or sample weights makes it harder to explain why the displayed loss changed.
  • Confusing multi-class classification with multi-output models causes many debugging mistakes.
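The third pitfall is easy to see with numbers. The sketch below, written in plain Python as an illustration only, compares categorical cross-entropy with the result of treating each class as an independent yes/no problem and summing three binary cross-entropies.

```python
import math

y_true = [0.0, 1.0, 0.0]
y_pred = [0.1, 0.7, 0.2]

# Categorical cross-entropy: one value per sample, driven by the true class.
categorical = -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

# A different objective: treat each class as its own yes/no problem and
# add up three independent binary cross-entropies.
summed_binary = -sum(
    t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    for t, p in zip(y_true, y_pred)
)

print(round(categorical, 4))    # 0.3567
print(round(summed_binary, 4))  # 0.6852
```

A manual calculation that sums over classes this way will not match the value Keras reports for a softmax output trained with categorical cross-entropy.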

Summary

  • For one multi-class output, Keras computes one classification loss per sample from the predicted class distribution.
  • Those per-sample losses are then reduced across the batch, usually by averaging.
  • One-hot labels and sparse integer labels use different cross-entropy variants.
  • Class weights, sample weights, and regularization can change the final scalar loss.
  • Multiple classes in one output are different from multiple outputs with separate losses.
