How to do point-wise categorical crossentropy loss in Keras?

Introduction

Point-wise categorical crossentropy means you want one loss value per example instead of a single reduced scalar for the whole batch. That is useful when you need custom weighting, masking, debugging, or a custom training loop. In Keras, the core trick is simple: set the loss reduction to NONE, then reduce the result yourself only after you have applied whatever per-example logic you need.

Getting Per-Sample Crossentropy

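For one-hot labels, use tf.keras.losses.CategoricalCrossentropy. By default, Keras reduces the batch to one scalar. To keep each sample loss, set reduction=tf.keras.losses.Reduction.NONE. The enum belongs to the Keras 2 API that ships with TensorFlow 2.x; Keras 3 (the default from TensorFlow 2.16 onward) expects the string reduction="none" instead, so check which version you have installed.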

python
import tensorflow as tf

# One-hot labels: one row per example, one column per class.
y_true = tf.constant([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
], dtype=tf.float32)

# Predicted probabilities (each row sums to 1).
y_pred = tf.constant([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.25, 0.65],
], dtype=tf.float32)

# reduction=NONE keeps one loss value per example.
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=False,
    reduction=tf.keras.losses.Reduction.NONE,
)

point_losses = loss_fn(y_true, y_pred)
print(point_losses.numpy())

The returned tensor has shape (batch_size,). Each element is the categorical crossentropy for one row of the batch. That is the point-wise result most people are looking for.
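
As a cross-check that does not need TensorFlow, the sketch below recomputes the three values by hand. It assumes each row of y_pred already sums to 1, so the loss for a row is just the negative log of the probability assigned to the true class. The helper name per_sample_cce is made up for this illustration and is not part of Keras.

```python
import math

y_true = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
y_pred = [
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.25, 0.65],
]


def per_sample_cce(true_rows, pred_rows):
    """Return one crossentropy value per row: -sum(t * log(p))."""
    return [
        -sum(t * math.log(p) for t, p in zip(t_row, p_row))
        for t_row, p_row in zip(true_rows, pred_rows)
    ]


manual_losses = per_sample_cce(y_true, y_pred)
print([round(v, 4) for v in manual_losses])  # [0.2231, 0.3567, 0.4308]
```

The values Keras prints should agree with these up to float32 rounding and the small epsilon it uses to clip probabilities.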

Logits Versus Probabilities

A common source of confusion is the from_logits flag. If your model output already passed through softmax, use from_logits=False. If your final layer produces raw scores and you intentionally skipped softmax, use from_logits=True.

python
# Raw scores from a final layer with no softmax.
logits = tf.constant([
    [3.0, 1.0, 0.5],
    [0.2, 2.5, 0.1],
], dtype=tf.float32)

y_true_logits = tf.constant([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
], dtype=tf.float32)

# from_logits=True tells the loss to apply softmax internally.
loss_fn_logits = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True,
    reduction=tf.keras.losses.Reduction.NONE,
)

print(loss_fn_logits(y_true_logits, logits).numpy())

Do not apply softmax yourself and also set from_logits=True. That double-handles the outputs and produces the wrong numbers.
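
To see how far off the double-handled value can be, here is a small plain-Python sketch of the same arithmetic for the first row of logits above. The helpers softmax and cce_from_logits are written for this illustration only; they mimic what from_logits=True does conceptually and are not the Keras implementation.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cce_from_logits(one_hot, logits):
    """Crossentropy that applies softmax internally, like from_logits=True."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [3.0, 1.0, 0.5]
one_hot = [1.0, 0.0, 0.0]

# Correct usage: raw scores go straight into the loss.
correct = cce_from_logits(one_hot, logits)

# Mistake: softmax applied by hand, then the result is treated as logits again.
double_softmax = cce_from_logits(one_hot, softmax(logits))

print("correct:", round(correct, 4))
print("double softmax:", round(double_softmax, 4))
```

In this example the second softmax flattens the distribution, so the loss for a confident, correct prediction comes out several times larger than it should.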

Using Point-Wise Losses for Custom Weighting

Once you have a vector of per-sample losses, you can apply sample weights or custom masks before computing a final scalar for backpropagation. This is a common pattern when some examples should count more than others.

python
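# One weight per example; point_losses comes from the first example above.
weights = tf.constant([1.0, 2.0, 0.5], dtype=tf.float32)
weighted_losses = point_losses * weights
# Normalize by the total weight, not by the batch size.
weighted_mean = tf.reduce_sum(weighted_losses) / tf.reduce_sum(weights)

print("weighted mean:", float(weighted_mean))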

This manual reduction is especially useful when weights depend on runtime logic rather than a static dataset column.

Sequence Models and Token-Level Loss

For sequence tasks, "point-wise" often really means one loss per token or one loss per time step. In that case, you usually want the loss function to operate on the last axis and keep the batch and sequence dimensions intact.

Here is a token-level example with a mask:

python
# Shape (batch=2, time=3, classes=2): one-hot label per token.
token_true = tf.constant([
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
], dtype=tf.float32)

token_pred = tf.constant([
    [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],
    [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]],
], dtype=tf.float32)

# 1.0 marks a real token, 0.0 marks padding to ignore.
mask = tf.constant([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
], dtype=tf.float32)

# One loss per token, shape (2, 3); only the class axis is reduced.
token_losses = tf.keras.losses.categorical_crossentropy(token_true, token_pred)
# Average over the unmasked tokens only.
masked_mean = tf.reduce_sum(token_losses * mask) / tf.reduce_sum(mask)

print("token losses shape:", token_losses.shape)
print("masked mean:", float(masked_mean))

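The functional helper tf.keras.losses.categorical_crossentropy is handy here because it reduces only over the last (class) axis. For inputs of shape (batch, time, classes) it returns losses of shape (batch, time), here (2, 3), which lines up with the mask.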

Integrating With model.fit

If you only need simple per-example weighting, Keras already supports sample_weight in model.fit. You do not need a custom training loop for that. But if you need dynamic weighting, custom masking, or extra logging of the unreduced losses, a custom train_step is often cleaner.

python
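import tensorflow as tf

class WeightedModel(tf.keras.Model):
    def train_step(self, data):
        # Handles both (x, y) and (x, y, sample_weight) batches.
        x, y, sample_weight = tf.keras.utils.unpack_x_y_sample_weight(data)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            # Unreduced loss: one value per example.
            losses = tf.keras.losses.categorical_crossentropy(y, y_pred)
            if sample_weight is None:
                sample_weight = tf.ones_like(losses)
            sample_weight = tf.cast(sample_weight, losses.dtype)
            # Weighted mean; divide_no_nan guards against all-zero weights.
            loss = tf.math.divide_no_nan(
                tf.reduce_sum(losses * sample_weight),
                tf.reduce_sum(sample_weight),
            )

        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        return {"loss": loss}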

The important pattern is unchanged: produce point-wise losses first, then reduce them exactly the way your problem requires.
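
For the simpler case mentioned at the start of this section, where a fixed weight per example is enough, the built-in sample_weight argument of model.fit is usually all you need. The sketch below is a minimal illustration on made-up toy data; the model shape, the optimizer, and the weight values are arbitrary assumptions chosen only to keep it short.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

# Toy data (made up for this sketch): 8 examples, 4 features, 3 classes.
x = rng.random((8, 4)).astype("float32")
class_ids = rng.integers(0, 3, size=8)
y = tf.keras.utils.to_categorical(class_ids, num_classes=3)

# Static per-example weights: the first four examples count double.
sample_weight = np.array([2.0] * 4 + [1.0] * 4, dtype="float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")

# Keras applies the weights to the per-example losses before reducing them.
history = model.fit(x, y, sample_weight=sample_weight, epochs=1, verbose=0)
print(history.history["loss"])
```

Because the weights are passed as data, this route fits cases where they can be computed ahead of time; weights that depend on the model's current predictions are where a custom train_step earns its keep.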

Common Pitfalls

The first pitfall is mixing label formats. CategoricalCrossentropy expects one-hot encoded labels. If your labels are integer class ids such as 0, 1, and 2, use SparseCategoricalCrossentropy instead.
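
As a rough illustration of the integer-label case, the sketch below uses the functional helper tf.keras.losses.sparse_categorical_crossentropy, which returns unreduced per-example values, and compares it with the one-hot version of the same labels. The label and prediction values are reused from the first example purely for convenience.

```python
import tensorflow as tf

# Integer class ids: one id per example instead of a one-hot row.
y_true_ids = tf.constant([0, 1, 2])

y_pred = tf.constant([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.25, 0.65],
], dtype=tf.float32)

# The sparse helper takes the ids directly and returns one loss per example.
sparse_losses = tf.keras.losses.sparse_categorical_crossentropy(y_true_ids, y_pred)

# One-hot encoding the same ids and using the dense loss gives matching values.
dense_losses = tf.keras.losses.categorical_crossentropy(
    tf.one_hot(y_true_ids, depth=3), y_pred
)

print(sparse_losses.numpy())
print(dense_losses.numpy())
```

The two printed vectors should agree, which is a quick way to confirm that only the label format differs between the two losses.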

Another pitfall is misusing from_logits. Probabilities require from_logits=False, while raw output scores require from_logits=True.

People sometimes expect reduction=NONE to preserve every dimension, including the class axis. It does not: the class axis is always summed out, so a standard classification batch gives one value per example, not one value per class. If you need token-level results for sequences, use higher-rank inputs or the functional loss helper.

Finally, remember that gradient descent still needs a scalar objective. Point-wise losses are for inspection and custom weighting, but before backpropagation you still reduce them to one scalar.

Summary

  • Use reduction=tf.keras.losses.Reduction.NONE to keep one categorical crossentropy value per sample.
  • Match from_logits to the actual form of your model outputs.
  • Apply custom weights or masks after computing the unreduced loss vector.
  • For sequence work, use token-level unreduced losses and reduce them with a mask.
  • Always end training with a scalar reduced loss, even if you start from point-wise values.

