How is total loss calculated over multiple classes in Keras?

Introduction

In a multi-class Keras model, loss is not calculated by adding one separate loss per class output node in an ad hoc way. Instead, Keras computes a per-sample classification loss from the full predicted class distribution, then reduces those sample losses across the batch, and finally adds any extra regularization or multi-output loss terms if the model defines them.

Per-Sample Cross-Entropy for Multi-Class Classification

If you use categorical_crossentropy, each training example has:

  • a one-hot target vector
  • a predicted probability distribution over classes

The per-sample loss is:

-sum(y_true_i * log(y_pred_i))

Because the target is one-hot, every term except the one for the correct class is zero, so this reduces to the negative log of the probability assigned to the correct class.

Example:

python
import tensorflow as tf

y_true = tf.constant([[0.0, 1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.7, 0.2]])

loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
print(loss.numpy())

In that example, the model assigned probability 0.7 to the correct class, so the loss is -log(0.7), which is about 0.357.
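
The same number can be reproduced by hand. The following is a minimal plain-Python sketch of the formula above, not Keras's internal implementation (which also clips probabilities to avoid log(0)); the helper name `categorical_ce` is made up for illustration.

```python
import math

def categorical_ce(y_true, y_pred):
    """Cross-entropy for one sample: -sum(y_true_i * log(y_pred_i))."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]   # one-hot target, correct class is index 1
y_pred = [0.1, 0.7, 0.2]   # predicted probabilities

loss = categorical_ce(y_true, y_pred)
print(round(loss, 4))            # 0.3567
print(round(-math.log(0.7), 4))  # 0.3567, the same value
```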

What Happens Over Multiple Classes

The important detail is that the loss is already computed across all classes for each sample. Keras does not compute one independent classification loss per class and then average those blindly. The softmax output and the cross-entropy formula work together as one multi-class objective.

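That is why the full predicted distribution matters, not just the output node for the correct class. Softmax normalizes the outputs so they sum to 1, so raising the score of a wrong class lowers the probability of the correct class and therefore raises the loss.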
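
To make that coupling concrete, here is a small plain-Python sketch with made-up logits. The `softmax` helper is written out for illustration and assumes a typical softmax output layer; it is not code taken from Keras.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

correct = 1  # index of the true class

# Same logit for the correct class, but a larger logit for a wrong class.
probs_a = softmax([1.0, 2.0, 0.5])
probs_b = softmax([3.0, 2.0, 0.5])

loss_a = -math.log(probs_a[correct])
loss_b = -math.log(probs_b[correct])

print(loss_a < loss_b)  # True: the wrong class took probability mass away
```

The logit of the correct class is identical in both cases; only the competing class changed, yet the loss went up.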

If you use integer class labels instead of one-hot labels, sparse_categorical_crossentropy computes the same idea with a different target representation:

python
import tensorflow as tf

y_true = tf.constant([1])  # integer class index instead of a one-hot vector
y_pred = tf.constant([[0.1, 0.7, 0.2]])

loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
print(loss.numpy())
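
One way to see that the two losses agree is to expand the integer label into a one-hot vector by hand. This is a plain-Python sketch with hypothetical helper names (`one_hot`, `categorical_ce`, `sparse_ce`), not the Keras implementation.

```python
import math

def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_ce(y_true, y_pred):
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

def sparse_ce(label, y_pred):
    # With an integer label, only the correct class's probability is needed.
    return -math.log(y_pred[label])

y_pred = [0.1, 0.7, 0.2]
label = 1

dense_loss = categorical_ce(one_hot(label, 3), y_pred)
sparse_loss = sparse_ce(label, y_pred)
print(math.isclose(dense_loss, sparse_loss))  # True
```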

Batch Reduction

Once Keras has a loss value for each sample in the batch, it reduces them, typically by averaging:

python
import tensorflow as tf

y_true = tf.constant([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

y_pred = tf.constant([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
])

per_sample = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
print("per-sample:", per_sample.numpy())
print("batch mean:", tf.reduce_mean(per_sample).numpy())

This reduced value is the batch loss that the optimizer minimizes at each training step.

During training, the loss shown in the Keras logs is a running average of the batch losses seen so far in the epoch, and the value at the end of the epoch is the epoch-level loss.
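
As a rough sketch of that bookkeeping, the snippet below averages per-sample losses within each batch and then keeps a running mean across batches. The loss values are made up, the batches are assumed to be the same size, and the running mean is a simplification of what Keras tracks for its logs.

```python
# Per-sample losses for three hypothetical batches of two samples each.
batches = [
    [0.22, 0.51],
    [0.40, 0.30],
    [0.10, 0.27],
]

running_total = 0.0
for step, per_sample in enumerate(batches, start=1):
    batch_loss = sum(per_sample) / len(per_sample)  # batch reduction (mean)
    running_total += batch_loss
    running_mean = running_total / step             # average over batches so far
    print(f"step {step}: batch={batch_loss:.3f} running={running_mean:.3f}")

epoch_loss = running_mean  # after the last batch, the epoch-level value
```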

Extra Terms Can Increase the Total Loss

The reported total loss may include more than the classification loss alone. Two common additions are:

  • regularization losses added by layer arguments such as kernel_regularizer
  • multiple output losses in multi-head models

For example:

python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(4,)),  # four input features
    keras.layers.Dense(
        3,
        activation="softmax",
        # L2 penalty on the kernel; Keras adds it to the reported loss
        kernel_regularizer=keras.regularizers.l2(1e-3),
    ),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
)

Here the total loss reported by Keras includes the data loss plus the regularization term.
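
The arithmetic behind that statement can be sketched without Keras. The kernel weights and the data loss below are made-up numbers; the only assumption about Keras is that `keras.regularizers.l2(1e-3)` contributes 1e-3 times the sum of the squared kernel weights.

```python
l2_factor = 1e-3

# Hypothetical 4x3 kernel of the Dense layer (4 inputs, 3 classes).
kernel = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.4],
    [-0.6, 0.1, 0.2],
    [0.0, -0.3, 0.7],
]

# L2 penalty: factor times the sum of squared weights.
reg_loss = l2_factor * sum(w * w for row in kernel for w in row)

data_loss = 0.367  # hypothetical mean cross-entropy for the batch
total_loss = data_loss + reg_loss

print(f"regularization: {reg_loss:.5f}")
print(f"total loss:     {total_loss:.5f}")
```

In an actual model, the collected regularization terms can be inspected through `model.losses`.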

If a model has multiple outputs, Keras computes each output loss and combines them, optionally with loss weights.
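
A minimal sketch of that combination, assuming a hypothetical two-head model whose outputs are named `category` and `sentiment`; the loss values and weights are invented. In Keras, the weights would be supplied through the `loss_weights` argument of `model.compile`.

```python
# Hypothetical per-output losses for one batch.
output_losses = {"category": 0.90, "sentiment": 0.40}

# Hypothetical weights, as one might pass via loss_weights in model.compile.
loss_weights = {"category": 1.0, "sentiment": 0.5}

regularization = 0.002  # sum of any layer regularization terms (made up)

# Weighted sum of the output losses, plus regularization.
total_loss = sum(
    loss_weights[name] * value for name, value in output_losses.items()
) + regularization

print(round(total_loss, 3))  # 1.102
```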

Common Pitfalls

The biggest mistake is thinking "total loss over multiple classes" means one separate binary loss per class for a standard softmax classifier. For ordinary multi-class classification, the loss is one cross-entropy over the full class distribution per sample.
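
The difference shows up numerically. The sketch below, in plain Python, compares a single categorical cross-entropy with the average of three independent per-class binary cross-entropies on the same prediction.

```python
import math

y_true = [0.0, 1.0, 0.0]
y_pred = [0.1, 0.7, 0.2]

# One loss over the whole distribution (softmax classifier).
categorical = -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

# One binary loss per class, then averaged (multi-label style).
per_class_binary = [
    -(t * math.log(p) + (1 - t) * math.log(1 - p))
    for t, p in zip(y_true, y_pred)
]
binary_mean = sum(per_class_binary) / len(per_class_binary)

print(round(categorical, 4))  # 0.3567
print(round(binary_mean, 4))  # 0.2284
```

The per-class binary form suits multi-label problems with sigmoid outputs, where classes are not mutually exclusive; it is not what a softmax classifier trained with categorical cross-entropy computes.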

Another common issue is mixing up categorical_crossentropy and sparse_categorical_crossentropy. The first expects one-hot labels, and the second expects integer labels.

It is also easy to forget that regularization losses and multi-output loss terms can make the reported total loss larger than the raw classification cross-entropy alone.

Summary

  • In Keras multi-class classification, each sample gets one cross-entropy loss computed from the full class distribution.
  • Keras then reduces those sample losses across the batch, usually by averaging.
  • categorical_crossentropy uses one-hot labels; sparse_categorical_crossentropy uses integer labels.
  • Reported total loss may also include regularization losses or additional output losses.
  • The "multiple classes" part is already handled inside the cross-entropy formula, not by separate ad hoc class-by-class loss calculations.
