Categorical Focal Loss in Keras

Introduction

Categorical focal loss is useful when a multi-class model sees many easy examples and only a small number of difficult or minority-class examples. Instead of letting confident predictions dominate the optimization signal, focal loss reduces the weight of those easy cases so the model spends more effort on mistakes that still matter.

Why Plain Cross-Entropy Can Struggle

Standard categorical cross-entropy treats each example according to its predicted probability for the true class. On imbalanced datasets, the majority class can overwhelm training simply because it appears so often. The model gets rewarded over and over for solving the same easy cases.

Focal loss adds a focusing term. When the model already predicts the correct class confidently, the loss contribution shrinks. When the prediction is poor, the example keeps a strong gradient signal. Two parameters usually control the behavior:

`gamma` controls how aggressively easy examples are down-weighted.
`alpha` adds class weighting.

If `gamma` is 0, the focusing term equals 1 and focal loss reduces to cross-entropy scaled by `alpha`. As `gamma` increases, the loss focuses more strongly on hard examples.
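To make the effect concrete, the sketch below computes the per-example term `alpha * (1 - p)^gamma * (-log p)`, where `p` is the predicted probability of the true class. The batch composition (800 easy examples at `p = 0.95`, 50 hard ones at `p = 0.30`) and the helper names `focal_term` and `easy_share` are assumptions made up for illustration, not values from any real dataset.

```python
import math

def focal_term(p_true, gamma, alpha=1.0):
    """Per-example focal loss given the predicted probability of the true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# Hypothetical batch: (number of examples, probability assigned to the true class).
groups = {"easy": (800, 0.95), "hard": (50, 0.30)}

def easy_share(gamma):
    """Fraction of the summed loss that comes from the easy group."""
    totals = {name: n * focal_term(p, gamma) for name, (n, p) in groups.items()}
    return totals["easy"] / sum(totals.values())

for gamma in (0.0, 1.0, 2.0):
    print(f"gamma={gamma}: easy examples contribute {easy_share(gamma):.1%} of the loss")
```

With `gamma=0` the easy group still accounts for a large part of the total loss simply because it is so numerous. As `gamma` grows, its share shrinks toward zero.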

A Keras Implementation

For multi-class classification with one-hot targets and softmax predictions, a custom Keras loss can be written directly with TensorFlow:

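```python
import tensorflow as tf

def categorical_focal_loss(alpha=0.25, gamma=2.0):
    # alpha may be a scalar or a per-class list; it broadcasts over the class axis.
    alpha = tf.constant(alpha, dtype=tf.float32)

    def loss(y_true, y_pred):
        # Clip probabilities so log() never sees exactly 0 or 1.
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        # With one-hot targets, only the true class has a non-zero term.
        cross_entropy = -y_true * tf.math.log(y_pred)
        # (1 - p)^gamma shrinks the loss for confident, correct predictions.
        focal_weight = alpha * tf.pow(1.0 - y_pred, gamma)
        # Sum over classes to get one loss value per example.
        return tf.reduce_sum(focal_weight * cross_entropy, axis=-1)

    return loss
```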

This implementation assumes the final model layer already applies softmax. If you want to work from logits, modify the implementation accordingly rather than mixing probability-based code with raw logits.
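One possible way to adapt the loss to logits is sketched below. It is an assumption-laden variant rather than a canonical implementation: the name `categorical_focal_loss_from_logits` is hypothetical, and the sketch relies on `tf.nn.log_softmax` to obtain log-probabilities in a numerically stable way instead of clipping.

```python
import tensorflow as tf

def categorical_focal_loss_from_logits(alpha=0.25, gamma=2.0):
    # Hypothetical helper: same loss as the probability version, but fed raw logits.
    alpha = tf.constant(alpha, dtype=tf.float32)

    def loss(y_true, logits):
        y_true = tf.cast(y_true, tf.float32)
        logits = tf.cast(logits, tf.float32)
        # log_softmax avoids computing log(softmax(x)) in two unstable steps.
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        probs = tf.exp(log_probs)
        focal_weight = alpha * tf.pow(1.0 - probs, gamma)
        # One loss value per example, summed over the class axis.
        return -tf.reduce_sum(focal_weight * y_true * log_probs, axis=-1)

    return loss

# Example call on two hand-written rows of logits with one-hot targets.
loss_fn = categorical_focal_loss_from_logits(alpha=0.25, gamma=2.0)
y_true = tf.constant([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
logits = tf.constant([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
print(loss_fn(y_true, logits).numpy())
```

With this variant, the final `Dense` layer would be created without `activation="softmax"`, and predictions would need an explicit softmax at inference time.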

Training a Multi-Class Model with Focal Loss

Here is a small runnable example. It assumes `categorical_focal_loss` from the previous section is already defined:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(42)

# Synthetic, imbalanced three-class data with one-hot targets.
x = rng.normal(size=(300, 10)).astype("float32")
y_index = rng.choice([0, 1, 2], size=300, p=[0.8, 0.15, 0.05])
y = tf.keras.utils.to_categorical(y_index, num_classes=3)

model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ]
)

model.compile(
    optimizer="adam",
    loss=categorical_focal_loss(alpha=0.25, gamma=2.0),
    metrics=["accuracy"],
)

model.fit(x, y, epochs=5, batch_size=16, verbose=2)
```

This is appropriate for single-label multi-class classification. Each sample belongs to exactly one class, and the targets are one-hot encoded.

Choosing `alpha` and `gamma`

Do not treat the common defaults as universal truths. A value of `gamma=2.0` is a popular starting point, not a law. If you raise it too much, the model may stop learning from easier examples that still provide useful signal. If you keep it too low, the loss behaves almost like ordinary cross-entropy.

`alpha` can be a scalar, but in heavily imbalanced problems it is often more useful as a per-class weight vector. That lets rare classes receive additional emphasis even before the focusing term is applied.
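As one illustration of a per-class vector, the sketch below derives weights from inverse class frequency and normalizes them to sum to 1. This heuristic and the helper name `inverse_frequency_alpha` are assumptions chosen for the example; other weighting schemes may suit a given dataset better, and the result should still be checked against validation metrics.

```python
import numpy as np

def inverse_frequency_alpha(y_index, num_classes):
    """Hypothetical helper: per-class weights proportional to 1 / class count."""
    counts = np.bincount(y_index, minlength=num_classes).astype("float64")
    counts = np.maximum(counts, 1.0)  # avoid division by zero for absent classes
    weights = 1.0 / counts
    return (weights / weights.sum()).astype("float32")

# Same synthetic label distribution as the training example.
rng = np.random.default_rng(42)
y_index = rng.choice([0, 1, 2], size=300, p=[0.8, 0.15, 0.05])

alpha_vec = inverse_frequency_alpha(y_index, num_classes=3)
print(alpha_vec)
```

A vector of shape `(num_classes,)` broadcasts across the class axis of `y_pred`, so it can be passed to the loss factory in place of the scalar, for example `categorical_focal_loss(alpha=alpha_vec.tolist(), gamma=2.0)`.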

Focal loss is not the only answer to imbalance. Better sampling, cleaner labels, threshold tuning, and class-weighted baselines should still be part of the evaluation process. A complicated loss cannot rescue a poor dataset.

When You Should Not Use It

Focal loss is not a good default for every classification task. If your dataset is reasonably balanced and cross-entropy already performs well, focal loss adds complexity without clear benefit. It can also overemphasize mislabeled examples, because noisy examples look hard by definition.

For multi-label classification, categorical focal loss is usually the wrong formulation. Multi-label problems normally need an independent binary loss per class rather than a softmax over mutually exclusive classes.
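For comparison, a minimal sketch of a per-label binary focal loss is shown below, assuming sigmoid outputs and multi-hot targets. The function name `binary_focal_loss` and the choice to weight positives by `alpha` and negatives by `1 - alpha` are assumptions for this sketch. Recent TensorFlow releases also provide `tf.keras.losses.BinaryFocalCrossentropy`, which may be preferable to a hand-written version.

```python
import tensorflow as tf

def binary_focal_loss(alpha=0.25, gamma=2.0):
    # Hypothetical helper: each label is treated as an independent binary problem.
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        # p_t is the probability assigned to the correct outcome for each label.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        # Positives are weighted by alpha, negatives by (1 - alpha).
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        per_label = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
        # Average over labels to get one loss value per example.
        return tf.reduce_mean(per_label, axis=-1)

    return loss

# One example with two labels: the first is present, the second is absent.
loss_fn = binary_focal_loss()
print(loss_fn(tf.constant([[1.0, 0.0]]), tf.constant([[0.9, 0.2]])).numpy())
```

The model's last layer would then use `activation="sigmoid"` rather than softmax, since labels are no longer mutually exclusive.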

Common Pitfalls

Using categorical focal loss with integer labels instead of one-hot encoded targets.
Passing logits into a loss implementation that expects softmax probabilities.
Applying the categorical version to a multi-label problem.
Copying gamma and alpha from a tutorial without checking validation behavior.
Assuming focal loss replaces the need for class weighting, resampling, or label cleanup.
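The first pitfall can be avoided either by one-hot encoding the labels up front or by converting them inside the loss. The sketch below takes the second route. The name `sparse_categorical_focal_loss` is hypothetical, and the sketch assumes integer class indices plus softmax probabilities as input.

```python
import tensorflow as tf

def sparse_categorical_focal_loss(num_classes, alpha=0.25, gamma=2.0):
    # Hypothetical helper: accepts integer class indices instead of one-hot rows.
    alpha = tf.constant(alpha, dtype=tf.float32)

    def loss(y_true, y_pred):
        # Keras may pass labels with shape (batch,) or (batch, 1); flatten both.
        labels = tf.cast(tf.reshape(y_true, [-1]), tf.int32)
        y_onehot = tf.one_hot(labels, depth=num_classes, dtype=tf.float32)
        y_pred = tf.clip_by_value(tf.cast(y_pred, tf.float32), 1e-7, 1.0 - 1e-7)
        focal_weight = alpha * tf.pow(1.0 - y_pred, gamma)
        return -tf.reduce_sum(focal_weight * y_onehot * tf.math.log(y_pred), axis=-1)

    return loss

# Integer labels for two examples, with softmax-style probability rows.
loss_fn = sparse_categorical_focal_loss(num_classes=3)
labels = tf.constant([[2], [0]])
probs = tf.constant([[1 / 3, 1 / 3, 1 / 3], [0.9, 0.05, 0.05]])
print(loss_fn(labels, probs).numpy())
```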

Summary

Categorical focal loss is designed for imbalanced multi-class classification.
It down-weights easy examples so harder cases contribute more to learning.
In Keras, use it with one-hot labels and softmax probabilities.
Tune gamma and alpha based on validation results, not folklore.
Use it when class imbalance is a real problem, not as a universal replacement for cross-entropy.
