Cost and activation functions for multiple independent labels

Introduction

In a multi-label problem with independent labels, each output is its own yes-or-no decision. An example is tagging an image with both beach and sunset, or classifying an email as both important and finance. Because the labels are not mutually exclusive, the final layer and loss function should treat each label independently rather than forcing the probabilities to compete.

Use Sigmoid for the Output Layer

For independent labels, the standard activation is sigmoid on each output unit.

python

import tensorflow as tf

num_labels = 5

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_labels, activation="sigmoid")
])

Each sigmoid output is interpreted independently as the probability of one label being present. That is exactly what you want when multiple labels can be true at the same time.
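To make the independence concrete, here is a minimal NumPy sketch. The logit values are made up for illustration. It shows that sigmoid probabilities do not have to sum to one, and that raising one logit leaves the other labels untouched.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation scores (logits) for one example with 5 labels.
logits = np.array([2.0, -1.0, 1.5, -3.0, 0.0])
probs = sigmoid(logits)

# Raise only the first logit and recompute.
bumped = logits.copy()
bumped[0] = 5.0
probs_bumped = sigmoid(bumped)

print(probs.round(3))   # one probability per label
print(probs.sum())      # not constrained to equal 1
print(np.allclose(probs[1:], probs_bumped[1:]))  # other labels are unchanged
```

Several labels can sit above 0.5 at once, which is the behavior a multi-label output layer needs.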

Use Binary Crossentropy as the Loss

The matching loss is binary crossentropy. It is computed independently for each output dimension and then averaged across labels and examples.

python

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["binary_accuracy"]
)

This is the standard answer for independent multi-label classification in Keras and similar frameworks.
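If it helps to see what the loss computes, the following NumPy sketch spells it out by hand. The function name, the sample values, and the clipping constant `eps` are assumptions for illustration, not the framework's exact internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Return (per-label losses, per-example mean over the label axis)."""
    # Clip to avoid log(0); the exact constant is an assumption.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_label = -(y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_label, per_label.mean(axis=-1)

# One hypothetical example with 5 labels.
y_true = np.array([[1.0, 0.0, 1.0, 0.0, 0.0]])
y_prob = np.array([[0.9, 0.1, 0.6, 0.2, 0.5]])

per_label, per_example = binary_crossentropy(y_true, y_prob)
print(per_label.round(4))    # each label contributes its own loss term
print(per_example.round(4))  # the terms are averaged for each example
```

Each label's term depends only on that label's target and prediction, so a confident mistake on one label does not change the loss for the others.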

If you prefer to output logits instead of probabilities, remove the sigmoid and set from_logits=True:

python

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    # No activation on the last layer: the model outputs raw logits.
    tf.keras.layers.Dense(num_labels)
])

# The loss applies the sigmoid internally.
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn)

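This is mathematically equivalent to the sigmoid version and is usually more numerically stable, because the loss applies the sigmoid internally. The model then outputs logits, so apply a sigmoid at inference time to recover probabilities.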
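The equivalence can be checked numerically. The sketch below uses hand-written NumPy helpers (the function names are hypothetical, not framework APIs) to compare the loss computed from probabilities with a commonly used stable form computed directly from logits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(y, p, eps=1e-7):
    """BCE on probabilities, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_from_logits(y, z):
    """Stable BCE on logits: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

y = np.array([1.0, 0.0, 1.0, 0.0])
z = np.array([2.0, -1.0, -0.5, 3.0])

loss_probs = bce_from_probs(y, sigmoid(z))
loss_logits = bce_from_logits(y, z)
print(np.allclose(loss_probs, loss_logits))  # same values for moderate logits

# For an extreme, confidently wrong logit the two paths diverge:
# the probability path is capped by clipping, the logit path is not.
extreme = bce_from_logits(np.array([0.0]), np.array([100.0]))
clipped = bce_from_probs(np.array([0.0]), sigmoid(np.array([100.0])))
print(extreme, clipped)
```

The divergence at extreme logits is one reason the logits formulation is often preferred for training.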

Why Softmax Is Usually Wrong Here

Softmax assumes the outputs compete and sum to one. That makes sense for single-label multiclass tasks such as choosing exactly one digit from 0 through 9.

It does not fit independent labels because softmax pushes probability mass away from other outputs. If one label becomes more likely, the others must become less likely even if several should all be active together.

That is why this is usually wrong for independent multi-label outputs:

python

tf.keras.layers.Dense(num_labels, activation="softmax")

Use softmax only when exactly one class should be correct for each example.
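A small NumPy comparison illustrates the competition effect. The logits are invented so that two labels are strongly supported at the same time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    shifted = z - z.max()  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

# Hypothetical logits: labels 0 and 1 are both strongly supported.
logits = np.array([4.0, 4.0, -2.0, -2.0, -2.0])

sig = sigmoid(logits)
soft = softmax(logits)

print(sig.round(3))   # both active labels score close to 1
print(soft.round(3))  # the two active labels split the mass and stay below 0.5
```

With softmax, neither of the two correct labels clears a 0.5 threshold, even though the model is confident about both.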

Label Format Matters

Binary crossentropy expects each label position to be represented independently, usually with 0 or 1.

Example target:

python

y = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
]

Counting positions from zero, the first example has labels 0 and 2, and the second example has labels 1, 3, and 4.

This is different from single-label classification, where targets are often one class index such as 3.
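If your raw data stores each example as a list of label indices, a conversion step is needed. The helper below is a minimal sketch; `to_multi_hot` is a hypothetical name, and it assumes zero-based indices smaller than `num_labels`.

```python
import numpy as np

def to_multi_hot(label_indices, num_labels):
    """Convert per-example lists of label indices into a multi-hot array."""
    y = np.zeros((len(label_indices), num_labels), dtype="float32")
    for row, indices in enumerate(label_indices):
        idx = np.asarray(list(indices), dtype=int)
        y[row, idx] = 1.0  # an empty list leaves the row all zeros
    return y

y = to_multi_hot([[0, 2], [1, 3, 4]], num_labels=5)
print(y)
```

An example with no labels becomes an all-zero row, which is a valid target in a multi-label setup.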

Thresholding Predictions

After training, sigmoid outputs are probabilities, not final label decisions. You still need a threshold to convert them into predicted labels.

python

import numpy as np

probs = np.array([[0.91, 0.12, 0.74, 0.18, 0.03]])
pred = (probs >= 0.5).astype("int32")
print(pred)  # [[1 0 1 0 0]]

0.5 is a common starting point, but it is not sacred. Many real systems tune thresholds per label based on validation precision and recall.
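One simple way to tune thresholds per label is a grid search on validation data. The sketch below picks the threshold with the best F1 for each label. The candidate grid, the choice of F1 as the criterion, and the tiny validation arrays are all assumptions for illustration.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 for one label; returns 0.0 when there is nothing to score."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_thresholds(y_true, probs, candidates=None):
    """For each label, pick the candidate threshold with the best F1."""
    if candidates is None:
        candidates = np.linspace(0.1, 0.9, 9)
    num_labels = y_true.shape[1]
    best = np.full(num_labels, 0.5)
    for j in range(num_labels):
        scores = [f1_score(y_true[:, j], (probs[:, j] >= t).astype(int))
                  for t in candidates]
        best[j] = candidates[int(np.argmax(scores))]  # first best on ties
    return best

# Hypothetical validation set: 6 examples, 2 labels.
y_val = np.array([[1, 0], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]])
p_val = np.array([[0.85, 0.22], [0.75, 0.12], [0.35, 0.36],
                  [0.25, 0.05], [0.95, 0.33], [0.45, 0.15]])

thresholds = tune_thresholds(y_val, p_val)
pred = (p_val >= thresholds).astype(int)  # broadcasts one threshold per label
print(thresholds)
print(pred)
```

In this toy data the second label never scores above 0.5, so a single global threshold would miss it entirely. In practice, tune on a held-out validation split rather than the training data.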

Handle Imbalanced Labels Deliberately

Multi-label problems often have rare labels. In that case, using the basic sigmoid plus binary crossentropy setup is still correct, but you may need:

per-label loss weighting, for example extra weight on the positive examples of rare labels
threshold tuning per label
alternative metrics such as precision, recall, or F1

The output activation and loss stay the same. The evaluation strategy becomes more important because raw binary accuracy can be misleading when most labels are absent most of the time.
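As a rough sketch of loss weighting, the NumPy code below scales the positive term of binary crossentropy by a per-label weight. The `weighted_bce` helper is hypothetical, and setting the weight to the ratio of negatives to positives is one common heuristic, not the only option.

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """BCE where positive targets of label j are scaled by pos_weight[j]."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_label = -(pos_weight * y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_label.mean()

# Hypothetical training targets: label 1 is present in 1 of 8 examples.
y_train = np.array([[1, 0], [1, 0], [0, 0], [1, 1],
                    [0, 0], [1, 0], [0, 0], [1, 0]], dtype="float64")

# Heuristic (an assumption): weight = negatives / positives per label.
# This divides by zero if a label has no positives at all.
pos_rate = y_train.mean(axis=0)
pos_weight = (1.0 - pos_rate) / pos_rate

# A model that mostly ignores the rare label.
y_prob = np.tile([0.6, 0.05], (8, 1))

plain = weighted_bce(y_train, y_prob, pos_weight=np.ones(2))
weighted = weighted_bce(y_train, y_prob, pos_weight)
print(pos_weight)
print(plain, weighted)
```

The weighted loss penalizes the missed rare positive much more heavily, which pushes training to pay attention to that label.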

A Minimal Working Example

Here is a full small example:

python

import numpy as np
import tensorflow as tf

x = np.random.rand(100, 20).astype("float32")
y = np.random.randint(0, 2, size=(100, 5)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(5, activation="sigmoid")
])

model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=2, batch_size=16, verbose=0)

This is the standard multi-label template when labels are independent.

Common Pitfalls

The biggest mistake is using softmax with categorical crossentropy for a task where multiple labels can be true at once. Another is feeding labels as a single class index instead of a multi-hot vector. Developers also forget that sigmoid outputs are probabilities and still need thresholding. In imbalanced datasets, relying only on binary accuracy can hide poor rare-label performance. The core setup stays simple, but evaluation and threshold selection often decide whether the model is actually useful.
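To illustrate the accuracy pitfall, the sketch below computes per-label accuracy, precision, and recall by hand in NumPy. The data is invented: the model handles the common label perfectly and never predicts the rare one.

```python
import numpy as np

def per_label_report(y_true, y_pred):
    """Return per-label accuracy, precision, and recall as arrays."""
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    accuracy = (y_true == y_pred).mean(axis=0)
    # Guard against division by zero when a label has no predictions/positives.
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return accuracy, precision, recall

# 10 hypothetical examples, 2 labels; label 1 is present only once.
y_true = np.zeros((10, 2), dtype=int)
y_true[:5, 0] = 1
y_true[0, 1] = 1

# The model is perfect on label 0 and never predicts label 1.
y_pred = np.zeros((10, 2), dtype=int)
y_pred[:5, 0] = 1

accuracy, precision, recall = per_label_report(y_true, y_pred)
overall_accuracy = (y_true == y_pred).mean()

print(overall_accuracy)  # high overall binary accuracy
print(accuracy)          # the rare label still looks fine by accuracy
print(recall)            # recall exposes that the rare label is never found
```

Overall binary accuracy looks strong here even though the rare label is missed every time, which is why per-label precision and recall are worth tracking.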

Summary

For multiple independent labels, use sigmoid outputs.
Pair sigmoid with binary crossentropy, or use logits plus binary crossentropy with from_logits=True.
Do not use softmax when several labels can be true simultaneously.
Represent targets as multi-hot label vectors.
Convert sigmoid probabilities to labels with a threshold.
Tune metrics and thresholds carefully when labels are imbalanced.