Implementing Binary Cross Entropy loss gives different answer than TensorFlow's

Introduction

If your manual binary cross-entropy calculation differs from TensorFlow’s, the formula itself is usually not the issue. The mismatch almost always comes from one of a few implementation details: logits versus probabilities, reduction mode, label smoothing, shape broadcasting, or numerical stability near 0 and 1.

So the right comparison is not “my formula versus TensorFlow,” but “my formula under the exact same assumptions TensorFlow is using.” Once those assumptions match, the numbers usually line up.

Start With the Probability Formula

For probabilities p and binary labels y, the textbook binary cross-entropy is:

-(y * log(p) + (1 - y) * log(1 - p))

A simple manual implementation looks like this:

python

import tensorflow as tf


def manual_bce_probs(y_true, y_pred, eps=1e-7):
    # Clip so that log() never sees exactly 0 or 1.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    return -(y_true * tf.math.log(y_pred) +
             (1.0 - y_true) * tf.math.log(1.0 - y_pred))


y_true = tf.constant([[1.0], [0.0], [1.0]])
y_pred = tf.constant([[0.9], [0.2], [0.8]])

manual = manual_bce_probs(y_true, y_pred)
print(manual.numpy())                  # per-element losses
print(tf.reduce_mean(manual).numpy())  # mean over all elements

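The averaged value should match tf.keras.losses.BinaryCrossentropy(from_logits=False) with its default reduction, up to small floating-point differences. The unreduced tensor will not match a reduced scalar, so compare like with like.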

Logits Versus Probabilities Is the Biggest Source of Error

TensorFlow can compute binary cross-entropy from either probabilities or logits. Those are not interchangeable.

If your model outputs raw logits, do not feed them into the probability formula above. TensorFlow uses a numerically stable logit-based form instead.

python

import tensorflow as tf

y_true = tf.constant([[1.0], [0.0], [1.0]])
logits = tf.constant([[2.0], [-1.0], [1.5]])

loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true, logits=logits)
print(loss.numpy())

The equivalent Keras loss is:

python

keras_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(keras_loss(y_true, logits).numpy())

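If you compare a logit-based TensorFlow loss to a manual probability-based formula applied to the same numbers, the answers should differ, because the two functions interpret their inputs differently: one expects raw scores and applies the sigmoid internally, the other expects values that are already probabilities.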
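To make the distinction concrete, here is a minimal NumPy sketch (NumPy rather than TensorFlow, so it runs without TensorFlow installed). The helper names bce_from_probs, bce_from_logits, and sigmoid are illustrative and are not TensorFlow APIs. The logit-based helper assumes the stable form given in the tf.nn.sigmoid_cross_entropy_with_logits documentation, max(x, 0) - x * z + log(1 + exp(-|x|)).

```python
import numpy as np


def bce_from_probs(y, p, eps=1e-7):
    # Textbook formula on probabilities, with clipping to avoid log(0).
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def bce_from_logits(y, x):
    # Stable logit form: max(x, 0) - x * y + log(1 + exp(-|x|)).
    return np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


y = np.array([1.0, 0.0, 1.0])
logits = np.array([2.0, -1.0, 1.5])

from_logits = bce_from_logits(y, logits)          # correct for raw scores
via_sigmoid = bce_from_probs(y, sigmoid(logits))  # same quantity, via probabilities
misuse = bce_from_probs(y, logits)                # logits wrongly treated as probabilities

print(from_logits)
print(via_sigmoid)
print(misuse)
```

The first two results agree to floating-point precision, while the third is a different number entirely. If a manual implementation disagrees with TensorFlow, checking which of these three cases it falls into is a reasonable first step.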

Reduction Changes the Final Number

Another frequent mismatch is reduction. TensorFlow losses can produce:

one loss per element
a sum
an average over the batch

If your manual code computes elementwise BCE but TensorFlow returns a reduced scalar, the two outputs will not match even when the underlying per-example values do.

python

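# Reuses y_true, y_pred, and manual_bce_probs from the first example.
loss_obj = tf.keras.losses.BinaryCrossentropy(
    from_logits=False,
    # String form of Reduction.SUM_OVER_BATCH_SIZE; assumed here to be the
    # more portable spelling across Keras versions.
    reduction="sum_over_batch_size",
)

print(loss_obj(y_true, y_pred).numpy())
print(tf.reduce_mean(manual_bce_probs(y_true, y_pred)).numpy())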

When debugging, compare elementwise to elementwise first, then compare reduced values.
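The three kinds of output can be sketched in plain NumPy as follows. This is an illustration of the arithmetic only, not of Keras internals; bce_elementwise is a hypothetical helper, and the data is the same as in the first example.

```python
import numpy as np


def bce_elementwise(y, p, eps=1e-7):
    # Unreduced binary cross-entropy on probabilities.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


y_true = np.array([[1.0], [0.0], [1.0]])
y_pred = np.array([[0.9], [0.2], [0.8]])

per_element = bce_elementwise(y_true, y_pred)  # one loss per element, shape (3, 1)
total = per_element.sum()                      # a sum
mean = per_element.mean()                      # an average over the batch

print(per_element.ravel())
print(total)
print(mean)
```

These roughly correspond to the "none", "sum", and "sum_over_batch_size" reductions. The three printed results differ by construction, so a mismatch at this level says nothing about whether the per-element formula is right.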

Shape and Axis Semantics Matter

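For binary classification, shape mismatches can silently change the calculation through broadcasting. For example, a manual elementwise formula applied to y_true of shape (batch,) and y_pred of shape (batch, 1) broadcasts to (batch, batch): every label is paired with every prediction, no exception is raised, and the mean is taken over the wrong set of values.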

A good debugging habit is to print shapes explicitly.

python

print(y_true.shape)
print(y_pred.shape)

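If you are doing multi-label classification, check the order of averaging. Keras BinaryCrossentropy first averages over the last axis and then applies the reduction to the resulting per-sample values. With a plain mean this gives the same number as flattening everything first, but with a sum reduction or per-sample weights it does not: summing per-sample means differs from summing every element by a factor of the last-axis length.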
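The broadcasting trap can be reproduced in a few lines of NumPy. This is a sketch with a hypothetical bce_elementwise helper; TensorFlow tensors follow the same broadcasting rules for elementwise arithmetic.

```python
import numpy as np


def bce_elementwise(y, p, eps=1e-7):
    # Unreduced binary cross-entropy on probabilities.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


y_flat = np.array([1.0, 0.0, 1.0])       # shape (3,)
y_col = y_flat.reshape(-1, 1)            # shape (3, 1)
p_col = np.array([[0.9], [0.2], [0.8]])  # shape (3, 1)

wrong = bce_elementwise(y_flat, p_col)   # silently broadcasts to (3, 3)
right = bce_elementwise(y_col, p_col)    # stays (3, 1)

print(wrong.shape, wrong.mean())
print(right.shape, right.mean())
```

Both calls run without error, but only the second pairs each label with its own prediction. Reshaping both arrays to the same explicit shape before computing the loss removes the ambiguity.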

Other Details That Can Affect the Result

TensorFlow’s BinaryCrossentropy also supports options such as label smoothing. If that is enabled, the effective targets differ from the raw 0 and 1 values you may be using manually.
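As a sketch of what label smoothing does to the targets, the snippet below assumes the rule Keras documents for binary labels, y * (1 - label_smoothing) + 0.5 * label_smoothing. The function names smooth_binary_labels and bce_elementwise are illustrative only.

```python
import numpy as np


def smooth_binary_labels(y, label_smoothing):
    # Pull hard 0/1 targets toward 0.5 by the smoothing amount.
    return y * (1.0 - label_smoothing) + 0.5 * label_smoothing


def bce_elementwise(y, p, eps=1e-7):
    # Unreduced binary cross-entropy on probabilities.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])

smoothed = smooth_binary_labels(y_true, label_smoothing=0.1)

print(smoothed)                                  # targets near 0.95 and 0.05
print(bce_elementwise(y_true, y_pred).mean())    # loss against hard labels
print(bce_elementwise(smoothed, y_pred).mean())  # loss against smoothed labels
```

The two means differ, so a manual implementation that uses hard labels cannot match a TensorFlow loss configured with a non-zero label_smoothing unless it applies the same transformation first.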

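Likewise, clipping matters. A naive implementation without clipping can hit log(0) and produce infinities or NaNs. With from_logits=False, Keras clips the probabilities to a small epsilon (1e-7 by default) before taking logs, and the from_logits=True path avoids the problem by working on the logits directly.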
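The failure mode is easy to trigger. The NumPy sketch below compares a naive formula with a clipped one on predictions that sit exactly on the boundary; naive_bce and clipped_bce are hypothetical names, and eps=1e-7 is assumed to mirror the Keras default.

```python
import numpy as np


def naive_bce(y, p):
    # No clipping: log(0) and 0 * log(0) can appear.
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def clipped_bce(y, p, eps=1e-7):
    # Keep probabilities strictly inside (0, 1) before taking logs.
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


y = np.array([1.0, 0.0])
p = np.array([1.0, 1.0])  # a perfect prediction and a confidently wrong one

# Silence NumPy's divide-by-zero and invalid-value warnings for the demo.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = naive_bce(y, p)

clipped = clipped_bce(y, p)

print(naive)    # contains nan and inf
print(clipped)  # finite values
```

Note that even the perfect prediction breaks the naive version, because 0 * log(0) evaluates to NaN rather than 0. Near the boundaries, a manual implementation only matches TensorFlow if it clips with the same epsilon.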

Common Pitfalls

A common mistake is forgetting whether the model output is a sigmoid probability or a raw logit. That single mismatch explains a large fraction of BCE discrepancies.

Another issue is comparing a scalar TensorFlow loss to an unreduced manual tensor and concluding that the formula is wrong.

Developers also often ignore shape broadcasting. A calculation that runs without an exception is not automatically the calculation you intended.

Finally, do not omit clipping in manual probability-based code if you want a fair comparison near the boundaries.

Summary

Binary cross-entropy only matches when both implementations use the same assumptions.
The most important distinction is from_logits=True versus from_logits=False.
Compare unreduced values first, then compare reduced scalars.
Watch shape broadcasting and optional features such as label smoothing.
Use clipping or a stable logit-based formula to avoid numerical edge-case mismatches.