Cost and activation functions for multiple independent labels
Introduction
In a multi-label problem with independent labels, each output is its own yes-or-no decision. An example is tagging an image with both beach and sunset, or classifying an email as both important and finance. Because the labels are not mutually exclusive, the final layer and loss function should treat each label independently rather than forcing the probabilities to compete.
Use Sigmoid for the Output Layer
For independent labels, the standard activation is sigmoid on each output unit.
Each sigmoid output is interpreted independently as the probability of one label being present. That is exactly what you want when multiple labels can be true at the same time.
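To make the independence concrete, here is a minimal framework-free sketch in plain Python. The logit values are made up for illustration; in Keras the same effect would typically come from a final Dense layer with a sigmoid activation.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores (logits) for three labels on one example.
logits = [2.0, 1.5, -3.0]
probs = [sigmoid(z) for z in logits]

# Each probability depends only on its own logit, so several labels can
# be likely at the same time and the values need not sum to 1.
print([round(p, 3) for p in probs])
print("sum of probabilities:", round(sum(probs), 3))
```

Here the first two labels both come out as likely, and the probabilities sum to more than one, which a softmax layer could never produce.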
Use Binary Crossentropy as the Loss
The matching loss is binary crossentropy, computed independently for each output dimension and then aggregated, typically by averaging over the labels and over the batch.
This is the standard answer for independent multi-label classification in Keras and similar frameworks.
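The per-label computation can be sketched in plain Python as follows. This is an illustration of the formula rather than any framework's implementation; the function name, the clipping constant eps, and the example probabilities are assumptions made for the sketch.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary crossentropy over the label positions of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1]            # labels 0 and 2 are present
good_probs = [0.9, 0.1, 0.8]  # confident and correct on every label
bad_probs = [0.2, 0.9, 0.3]   # wrong on every label

print(binary_crossentropy(y_true, good_probs))
print(binary_crossentropy(y_true, bad_probs))
```

Each label contributes its own term to the loss, so getting one label wrong is penalized regardless of how the other labels are predicted.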
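If you prefer to output logits instead of probabilities, remove the sigmoid from the final layer and pass from_logits=True to the binary crossentropy loss.
The two setups are mathematically equivalent as long as the flag matches what the layer actually emits. Applying sigmoid in the layer and also setting from_logits=True would squash the outputs twice, and the logits form is usually the more numerically stable of the two.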
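The equivalence can be checked numerically with a small sketch. The function names are hypothetical, and the logits version uses a commonly used stable rewrite of the same formula, max(z, 0) - z*y + log(1 + exp(-|z|)); this is an illustration, not the code of any particular library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probs(y, p):
    """Binary crossentropy for one label, given a probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bce_from_logits(y, z):
    """Same loss computed directly from the logit in a stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

for z in (-3.0, 0.0, 2.5):
    for y in (0, 1):
        a = bce_from_probs(y, sigmoid(z))
        b = bce_from_logits(y, z)
        print(z, y, round(a, 6), round(b, 6))

# The logits form stays finite even for extreme scores, where taking
# the log of a rounded probability would break down.
print(bce_from_logits(1, -800.0))
```

Both columns agree for ordinary logits, and the last line shows why working from logits is often preferred when scores can become very large in magnitude.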
Why Softmax Is Usually Wrong Here
Softmax assumes the outputs compete and sum to one. That makes sense for single-label multiclass tasks such as choosing exactly one digit from 0 through 9.
It does not fit independent labels because softmax pushes probability mass away from other outputs. If one label becomes more likely, the others must become less likely even if several should all be active together.
That is why a softmax output layer paired with categorical crossentropy is usually wrong for independent multi-label outputs.
Use softmax only when exactly one class should be correct for each example.
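The competition between softmax outputs is easy to see with a small plain-Python comparison. The logits are invented for illustration: the first two labels both have strong evidence.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [3.0, 3.0, -2.0]   # two labels with strong evidence
raised = [5.0, 3.0, -2.0]   # same, but label 0 is even more certain

soft = softmax(logits)
sig = [sigmoid(z) for z in logits]
print("softmax:", [round(p, 3) for p in soft])  # shares one unit of mass
print("sigmoid:", [round(p, 3) for p in sig])   # each label scored alone

# Raising one logit lowers the softmax probability of the others,
# while the sigmoid probability of label 1 does not move at all.
print(softmax(raised)[1] < soft[1])
print(sigmoid(raised[1]) == sig[1])
```

With softmax, two equally strong labels can each get at most about half of the probability mass, so neither looks confident even though both should be active.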
Label Format Matters
Binary crossentropy expects each label position to be represented independently, usually with 0 or 1.
For example, assuming five possible labels, a target batch of [1, 0, 1, 0, 0] and [0, 1, 0, 1, 1] means that the first example has labels 0 and 2, and the second example has labels 1, 3, and 4.
This is different from single-label classification, where targets are often one class index such as 3.
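Label data often arrives as lists of label indices rather than as vectors. A small helper such as the hypothetical to_multi_hot below converts them into the multi-hot format; the label count of five is an assumption chosen to fit the indices mentioned above.

```python
def to_multi_hot(label_indices, num_labels):
    """Turn a list of active label indices into a 0/1 vector."""
    vec = [0] * num_labels
    for i in label_indices:
        vec[i] = 1
    return vec

num_labels = 5                      # assumed size of the label set
label_sets = [[0, 2], [1, 3, 4]]    # active labels per example
targets = [to_multi_hot(s, num_labels) for s in label_sets]

for row in targets:
    print(row)
```

Unlike a one-hot vector, a multi-hot row may contain any number of ones, including none.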
Thresholding Predictions
After training, sigmoid outputs are probabilities, not final label decisions. You still need a threshold to convert them into predicted labels.
0.5 is a common starting point, but it is not sacred. Many real systems tune thresholds per label based on validation precision and recall.
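A minimal sketch of the thresholding step, first with a single 0.5 cutoff and then with per-label cutoffs. The probabilities and the tuned threshold values are made up for illustration; in practice they would come from the model and from validation data.

```python
def apply_thresholds(probs, thresholds):
    """Convert per-label probabilities into 0/1 decisions."""
    return [int(p >= t) for p, t in zip(probs, thresholds)]

probs = [0.91, 0.42, 0.07, 0.55]   # hypothetical sigmoid outputs

default = apply_thresholds(probs, [0.5] * len(probs))
# Hypothetical tuned values: a lower cutoff for a label where recall
# matters, a higher one for a label where false positives are costly.
tuned = apply_thresholds(probs, [0.5, 0.3, 0.5, 0.7])

print(default)
print(tuned)
```

The same probabilities lead to different label sets once the cutoffs differ, which is why the threshold is part of the model's behavior and not an afterthought.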
Handle Imbalanced Labels Deliberately
Multi-label problems often have rare labels. In that case, using the basic sigmoid plus binary crossentropy setup is still correct, but you may need:
- class weighting
- threshold tuning per label
- alternative metrics such as precision, recall, or F1
The output activation and loss stay the same. The evaluation strategy becomes more important because raw binary accuracy can be misleading when most labels are absent most of the time.
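Class weighting can take several forms. One common variant, sketched here under stated assumptions, multiplies the positive term of the binary crossentropy by a per-label weight so that missing a rare label costs more. The function name, the pos_weights parameter, and the weight of 19 (roughly negatives divided by positives for a label present in 1 of 20 examples) are illustrative choices, not a prescribed recipe.

```python
import math

def weighted_bce(y_true, y_prob, pos_weights, eps=1e-7):
    """Binary crossentropy with a per-label weight on the positive term."""
    total = 0.0
    for y, p, w in zip(y_true, y_prob, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [0, 1]        # label 1 is the rare one and is present here
y_prob = [0.1, 0.1]    # the model misses the rare label

plain = weighted_bce(y_true, y_prob, [1.0, 1.0])      # ordinary BCE
weighted = weighted_bce(y_true, y_prob, [1.0, 19.0])  # rare label upweighted

print(plain, weighted)
```

With all weights set to 1 the function reduces to ordinary binary crossentropy, so the activation and the loss family stay the same; only the penalty for missed positives changes.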
A Minimal Working Example
Here is a full small example:
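The sketch below is a framework-free version in plain Python, so that every step is visible: multi-hot targets, one sigmoid per label, binary crossentropy, gradient descent, and thresholding. The toy data, learning rate, and step count are assumptions chosen for illustration. In Keras, the same pattern would typically be a final Dense layer with a sigmoid activation, compiled with the binary_crossentropy loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: 2 input features, 3 independent labels (multi-hot targets).
# label 0 = feature 0, label 1 = feature 1, label 2 = feature 0 OR feature 1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
Y = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0]]

num_features, num_labels = 2, 3
W = [[0.0] * num_features for _ in range(num_labels)]  # one weight row per label
b = [0.0] * num_labels

def predict_proba(x):
    """One sigmoid per label: each output is an independent probability."""
    return [sigmoid(sum(w * xi for w, xi in zip(W[j], x)) + b[j])
            for j in range(num_labels)]

def mean_bce():
    """Binary crossentropy averaged over all examples and label positions."""
    total = 0.0
    for x, y in zip(X, Y):
        for p, t in zip(predict_proba(x), y):
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / (len(X) * num_labels)

initial_loss = mean_bce()
lr = 1.0
for _ in range(2000):
    grad_W = [[0.0] * num_features for _ in range(num_labels)]
    grad_b = [0.0] * num_labels
    for x, y in zip(X, Y):
        probs = predict_proba(x)
        for j in range(num_labels):
            err = probs[j] - y[j]  # gradient of BCE w.r.t. the logit
            grad_b[j] += err / len(X)
            for k in range(num_features):
                grad_W[j][k] += err * x[k] / len(X)
    for j in range(num_labels):
        b[j] -= lr * grad_b[j]
        for k in range(num_features):
            W[j][k] -= lr * grad_W[j][k]

final_loss = mean_bce()
# Threshold the probabilities at 0.5 to get final label decisions.
predicted = [[int(p >= 0.5) for p in predict_proba(x)] for x in X]

print("loss before:", round(initial_loss, 4), "after:", round(final_loss, 4))
print("predicted labels:", predicted)
```

Each label has its own weights and its own sigmoid, so the third example can be assigned all three labels at once.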
This is the standard multi-label template when labels are independent.
Common Pitfalls
The biggest mistake is using softmax with categorical crossentropy for a task where multiple labels can be true at once. Another is feeding labels as a single class index instead of a multi-hot vector. Developers also forget that sigmoid outputs are probabilities and still need thresholding. In imbalanced datasets, relying only on binary accuracy can hide poor rare-label performance. The core setup stays simple, but evaluation and threshold selection often decide whether the model is actually useful.
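The accuracy pitfall can be illustrated with a made-up rare label that appears in 1 of 20 examples and a model that never predicts it. The helper name and the data are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for one label, from 0/1 lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1] + [0] * 19   # rare label: present in 1 of 20 examples
y_pred = [0] * 20         # model that never predicts the label

accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print("binary accuracy:", accuracy)
print("precision, recall, F1:", precision, recall, f1)
```

Binary accuracy looks high here even though the label is never found, while recall and F1 expose the failure immediately.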
Summary
- For multiple independent labels, use sigmoid outputs.
- Pair sigmoid with binary crossentropy, or use logits plus binary crossentropy with from_logits=True.
- Do not use softmax when several labels can be true simultaneously.
- Represent targets as multi-hot label vectors.
- Convert sigmoid probabilities to labels with a threshold.
- Tune metrics and thresholds carefully when labels are imbalanced.

