Categorical Focal Loss in Keras
Introduction
Categorical focal loss is useful when a multi-class model sees many easy examples and only a small number of difficult or minority-class examples. Instead of letting confident predictions dominate the optimization signal, focal loss reduces the weight of those easy cases so the model spends more effort on mistakes that still matter.
Why Plain Cross-Entropy Can Struggle
Standard categorical cross-entropy penalizes each example by the negative log of the probability the model assigns to the true class, with no adjustment for how easy the example is. On imbalanced datasets, the majority class can overwhelm training simply because it appears so often. Each easy example adds only a small loss, but many of them together dominate the gradient.
Focal loss adds a focusing term. When the model already predicts the correct class confidently, the loss contribution shrinks. When the prediction is poor, the example keeps a strong gradient signal. Two parameters usually control the behavior:
- gamma controls how aggressively easy examples are down-weighted
- alpha adds class weighting
If gamma is 0, focal loss reduces to a weighted form of cross-entropy. As gamma increases, the loss focuses more strongly on hard examples.
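To make the effect of gamma concrete, the sketch below evaluates the per-example focal term -alpha * (1 - p)^gamma * log(p), where p is the predicted probability of the true class. The helper name `focal_term` and the probe probabilities are illustrative assumptions, not part of any library.

```python
import math

def focal_term(p, gamma=2.0, alpha=1.0):
    """Per-example focal loss for true-class probability p (0 < p <= 1)."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

for p in (0.9, 0.5, 0.1):
    ce = focal_term(p, gamma=0.0)  # gamma=0 gives plain (alpha-weighted) cross-entropy
    fl = focal_term(p, gamma=2.0)
    print(f"p={p:.1f}  cross-entropy={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.2f}")
```

The ratio column equals (1 - p)^gamma, so with gamma=2 the confident prediction (p=0.9) keeps about 1% of its cross-entropy loss while the poor prediction (p=0.1) keeps about 81%.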
A Keras Implementation
For multi-class classification with one-hot targets and softmax predictions, a custom Keras loss can be written directly with TensorFlow:
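A minimal sketch of such a loss, assuming one-hot targets and softmax outputs. The function name `categorical_focal_loss`, the default values, and the clipping constant are illustrative choices rather than a Keras API:

```python
import tensorflow as tf

def categorical_focal_loss(gamma=2.0, alpha=0.25):
    """Build a focal loss for one-hot targets and softmax probabilities.

    alpha may be a scalar or a per-class list of weights (an assumption
    of this sketch, applied by broadcasting over the class axis).
    """
    alpha_t = tf.constant(alpha, dtype=tf.float32)

    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.cast(y_pred, tf.float32)
        # Clip to avoid log(0) on saturated softmax outputs.
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        # Per-class cross-entropy; only the true class is non-zero.
        cross_entropy = -y_true * tf.math.log(y_pred)
        # Focusing term shrinks the loss for confident predictions.
        modulating = tf.pow(1.0 - y_pred, gamma)
        # Sum over classes to get one loss value per example.
        return tf.reduce_sum(alpha_t * modulating * cross_entropy, axis=-1)

    return loss_fn

# Quick check on a single example whose true class is predicted at 0.8.
y_true = tf.constant([[0.0, 1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.8, 0.1]])
print(categorical_focal_loss(gamma=2.0, alpha=1.0)(y_true, y_pred).numpy())
```

The function returns one value per example and leaves batch averaging to Keras, which is the usual contract for a custom loss passed to `model.compile`.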
This implementation assumes the final model layer already applies softmax. If you want to work from logits, modify the implementation accordingly rather than mixing probability-based code with raw logits.
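If the final layer emits raw logits instead, one way to adapt the loss is to compute log-probabilities with `tf.nn.log_softmax`, which is numerically safer than applying softmax and then taking a logarithm. A minimal sketch under that assumption; the name `categorical_focal_loss_from_logits` is hypothetical:

```python
import tensorflow as tf

def categorical_focal_loss_from_logits(gamma=2.0, alpha=0.25):
    """Focal loss for one-hot targets and raw (pre-softmax) logits."""
    alpha_t = tf.constant(alpha, dtype=tf.float32)

    def loss_fn(y_true, logits):
        y_true = tf.cast(y_true, tf.float32)
        logits = tf.cast(logits, tf.float32)
        # Log-probabilities computed in one stable step.
        log_probs = tf.nn.log_softmax(logits, axis=-1)
        probs = tf.exp(log_probs)
        modulating = tf.pow(1.0 - probs, gamma)
        # y_true masks out every class except the true one.
        return -tf.reduce_sum(alpha_t * y_true * modulating * log_probs, axis=-1)

    return loss_fn

# Logits chosen so that softmax reproduces the probabilities [0.1, 0.8, 0.1].
logits = tf.math.log(tf.constant([[0.1, 0.8, 0.1]]))
y_true = tf.constant([[0.0, 1.0, 0.0]])
print(categorical_focal_loss_from_logits(gamma=2.0, alpha=1.0)(y_true, logits).numpy())
```

With this variant the model's last `Dense` layer should have no activation; applying softmax in the model and again in the loss would silently distort the result.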
Training a Multi-Class Model with Focal Loss
Here is a small runnable example:
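A minimal end-to-end sketch, assuming synthetic two-feature data with a roughly 80/15/5 class split. The data, layer sizes, and alpha values are made up for illustration, and the loss is redefined so the script runs on its own:

```python
import numpy as np
import tensorflow as tf

def categorical_focal_loss(gamma=2.0, alpha=1.0):
    """Focal loss for one-hot targets and softmax probabilities (sketch)."""
    alpha_t = tf.constant(alpha, dtype=tf.float32)

    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(tf.cast(y_pred, tf.float32), 1e-7, 1.0 - 1e-7)
        cross_entropy = -y_true * tf.math.log(y_pred)
        modulating = tf.pow(1.0 - y_pred, gamma)
        return tf.reduce_sum(alpha_t * modulating * cross_entropy, axis=-1)

    return loss_fn

# Synthetic, imbalanced 3-class data; purely illustrative.
rng = np.random.default_rng(0)
num_classes = 3
labels = rng.choice(num_classes, size=600, p=[0.80, 0.15, 0.05])
centers = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 2.0]])
features = (centers[labels] + rng.normal(scale=1.0, size=(600, 2))).astype("float32")

# The categorical loss expects one-hot targets, not integer labels.
targets = tf.keras.utils.to_categorical(labels, num_classes=num_classes)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # probabilities, not logits
])

model.compile(
    optimizer="adam",
    loss=categorical_focal_loss(gamma=2.0, alpha=[0.2, 0.3, 0.5]),  # assumed weights
    metrics=["accuracy"],
)

history = model.fit(features, targets, epochs=5, batch_size=32,
                    validation_split=0.2, verbose=0)
print("final training loss:", history.history["loss"][-1])
```

Accuracy alone can look healthy on data this skewed, so per-class recall on the validation split is a more informative check than the aggregate metric.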
This is appropriate for single-label multi-class classification. Each sample belongs to exactly one class, and the targets are one-hot encoded.
Choosing alpha and gamma
Do not treat the common defaults as universal truths. gamma=2.0 is a popular starting point, not a law. If you raise it too much, the model may stop learning from easier examples that still provide useful signal. If you keep it too low, the loss behaves almost like ordinary cross-entropy.
alpha can be a scalar, but a single value applied uniformly to every class only rescales the overall loss. In heavily imbalanced problems it is often more useful as a per-class weight vector. That lets rare classes receive additional emphasis even before the focusing term is applied.
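One common heuristic for building a per-class alpha vector is inverse class frequency, normalized so the weights sum to 1. This is an assumption for illustration, not something the focal loss formulation prescribes; the helper `inverse_frequency_alpha` and the label counts are made up.

```python
from collections import Counter

def inverse_frequency_alpha(labels, num_classes):
    """Return per-class weights proportional to 1 / class count, summing to 1."""
    counts = Counter(labels)
    # Classes absent from the sample are treated as count 1 to avoid dividing by zero.
    inverse = [1.0 / max(counts.get(c, 0), 1) for c in range(num_classes)]
    total = sum(inverse)
    return [w / total for w in inverse]

# Hypothetical integer labels: 80 / 15 / 5 examples per class.
labels = [0] * 80 + [1] * 15 + [2] * 5
alpha = inverse_frequency_alpha(labels, num_classes=3)
print([round(a, 3) for a in alpha])
```

The rarest class receives the largest weight. Whether that weighting actually helps is an empirical question for the validation set, since strong class weights combined with a high gamma can over-correct.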
Focal loss is not the only answer to imbalance. Better sampling, cleaner labels, threshold tuning, and class-weighted baselines should still be part of the evaluation process. A complicated loss cannot rescue a poor dataset.
When You Should Not Use It
Focal loss is not a good default for every classification task. If your dataset is reasonably balanced and cross-entropy already performs well, focal loss adds complexity without clear benefit. It can also overemphasize mislabeled examples because noisy examples look hard by definition.
For multi-label classification, categorical focal loss is usually the wrong formulation. Multi-label problems normally need an independent binary loss per class rather than a softmax over mutually exclusive classes.
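For contrast, a multi-label variant applies the focal term to each class independently on sigmoid outputs. The sketch below is one such formulation, assuming multi-hot targets and per-class sigmoid probabilities; `binary_focal_loss` is a hypothetical name, and the scalar alpha here balances positive against negative labels.

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Focal loss for multi-hot targets and per-class sigmoid probabilities."""
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(tf.cast(y_pred, tf.float32), 1e-7, 1.0 - 1e-7)
        # p_t is the probability assigned to the correct decision for each class.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        # alpha weights positive labels, (1 - alpha) weights negative labels.
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        per_class = -alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t)
        # Average over classes so each example yields one loss value.
        return tf.reduce_mean(per_class, axis=-1)

    return loss_fn

# One example with two active labels out of three.
y_true = tf.constant([[1.0, 0.0, 1.0]])
y_pred = tf.constant([[0.9, 0.2, 0.3]])
print(binary_focal_loss()(y_true, y_pred).numpy())
```

Each class is scored on its own yes/no decision, so several labels can be active at once, which a softmax over mutually exclusive classes cannot represent.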
Common Pitfalls
- Using categorical focal loss with integer labels instead of one-hot encoded targets.
- Passing logits into a loss implementation that expects softmax probabilities.
- Applying the categorical version to a multi-label problem.
- Copying gamma and alpha from a tutorial without checking validation behavior.
- Assuming focal loss replaces the need for class weighting, resampling, or label cleanup.
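The first two pitfalls can be caught early with a quick shape and range check before training. A small sketch, assuming integer class labels that need converting; the variable names and sample values are illustrative:

```python
import numpy as np
import tensorflow as tf

num_classes = 4
int_labels = np.array([0, 2, 3, 1])

# Convert integer labels to the one-hot format the categorical loss expects.
one_hot = tf.one_hot(int_labels, depth=num_classes).numpy()
print(one_hot.shape)  # (4, 4)

def looks_like_probabilities(values, atol=1e-5):
    """Heuristic check: non-negative rows that each sum to 1."""
    values = np.asarray(values)
    return bool(np.all(values >= 0) and np.allclose(values.sum(axis=-1), 1.0, atol=atol))

# Raw logits generally fail the check; their softmax passes it.
logits = np.array([[2.0, -1.0, 0.5, 0.0]], dtype="float32")
probs = tf.nn.softmax(logits, axis=-1).numpy()
print(looks_like_probabilities(logits), looks_like_probabilities(probs))
```

The check is only a heuristic: it flags obvious logits, but it cannot tell a softmax output from sigmoid outputs that happen to sum to 1.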
Summary
- Categorical focal loss is designed for imbalanced multi-class classification.
- It down-weights easy examples so harder cases contribute more to learning.
- In Keras, use it with one-hot labels and softmax probabilities.
- Tune gamma and alpha based on validation results, not folklore.
- Use it when class imbalance is a real problem, not as a universal replacement for cross-entropy.

