NaN values in loss in Keras model

Introduction

When the loss in a Keras model becomes NaN, training has usually hit numerical instability rather than a mysterious framework bug. The right way to debug it is to narrow the problem down systematically: verify the data, verify the loss-target pairing, then check optimization settings and model outputs until you find the first place where numbers stop being finite.

Start with the Data

A model cannot recover from bad input values. The first step is to check whether the training tensors already contain NaN or infinite values.

python

import numpy as np

print(np.isnan(x_train).any())
print(np.isinf(x_train).any())
print(np.isnan(y_train).any())
print(np.isinf(y_train).any())

Do the same after preprocessing, not just on the raw dataset. It is common for normalization, division, or log transforms to introduce invalid values.

For example, dividing by a standard deviation of zero or taking the logarithm of zero can create invalid numbers before the model ever sees the data.
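As a minimal sketch of guarding those two transforms, the helpers below standardize columns without dividing by zero and clamp inputs before taking a logarithm. The names `safe_standardize` and `safe_log` and the `eps` default are illustrative assumptions, not part of Keras or NumPy.

```python
import numpy as np

def safe_standardize(x, eps=1e-8):
    """Standardize columns; constant columns become zeros instead of NaN."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    # Replace a (near-)zero std with 1.0 so constant columns map to 0, not 0/0.
    std = np.where(std < eps, 1.0, std)
    return (x - mean) / std

def safe_log(x, eps=1e-8):
    """Log transform that clamps inputs to eps so log(0) never yields -inf."""
    return np.log(np.maximum(x, eps))

# The second column is constant, so its standard deviation is exactly zero.
features = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
scaled = safe_standardize(features)
logged = safe_log(np.array([0.0, 1.0, 10.0]))
print(np.isfinite(scaled).all(), np.isfinite(logged).all())
```

Whether clamping is appropriate depends on the data; for count-like features, `np.log1p` is a common alternative to a clamped logarithm.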

Check the Loss and Output Layer Pairing

A very common source of NaN loss is pairing an output layer with a loss function that does not match it.

Examples of sensible pairings:

binary classification: sigmoid plus binary cross-entropy
multi-class classification: softmax plus categorical cross-entropy (or sparse categorical cross-entropy when labels are integer class indices)
regression: linear output plus MSE or MAE

A clean binary example:

python

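import tensorflow as tf

model = tf.keras.Sequential([
    # An explicit Input layer replaces the older input_shape argument on Dense.
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    # Sigmoid keeps the output in (0, 1), which binary cross-entropy expects.
    tf.keras.layers.Dense(1, activation="sigmoid")
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)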

If the labels, output shape, or loss semantics do not line up, training can diverge quickly.
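One way to catch label problems before training is to validate the label array against what the loss expects. The sketch below uses plain NumPy; the helper names `check_binary_labels` and `check_sparse_labels` are hypothetical, and the checks assume hard 0/1 labels for the binary case and integer class indices for the sparse multi-class case.

```python
import numpy as np

def check_binary_labels(y):
    """Return True if every label is 0 or 1, as hard binary labels should be."""
    y = np.asarray(y)
    return bool(np.isin(y, [0, 1]).all())

def check_sparse_labels(y, num_classes):
    """Return True if labels are integer class indices in [0, num_classes)."""
    y = np.asarray(y)
    if not np.issubdtype(y.dtype, np.integer):
        return False
    return bool(((y >= 0) & (y < num_classes)).all())

ok_binary = check_binary_labels([0, 1, 1, 0])
bad_binary = check_binary_labels([-1, 1, 1, -1])           # -1/+1 encoding
ok_sparse = check_sparse_labels(np.array([0, 2, 1]), 3)
bad_sparse = check_sparse_labels(np.array([0, 2, 3]), 3)   # index 3 is out of range
print(ok_binary, bad_binary, ok_sparse, bad_sparse)
```

A -1/+1 label encoding or a class index equal to the number of classes is easy to miss by eye, and either can produce an invalid loss or a runtime error depending on the backend.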

Learning Rate Is Often the Real Culprit

If the optimizer steps are too large, weights can explode and produce infinities or NaN during forward or backward passes.

Try lowering the learning rate first.

python

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(optimizer=optimizer, loss="binary_crossentropy")

This is one of the fastest debugging moves because an overly aggressive learning rate is both common and easy to test.
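To see how step size alone can push values out of range, here is a toy sketch that is deliberately not Keras: plain gradient descent on f(w) = w^2 in float32. The function name `first_non_finite_step` and the quadratic objective are assumptions chosen for illustration.

```python
import numpy as np

def first_non_finite_step(learning_rate, steps=500):
    """Run gradient descent on f(w) = w**2 in float32.

    Returns the first step at which the loss is inf or NaN,
    or None if the loss stays finite for all steps.
    """
    w = np.float32(1.0)
    lr = np.float32(learning_rate)
    # Silence overflow warnings; the overflow is the point of the demo.
    with np.errstate(over="ignore", invalid="ignore"):
        for step in range(steps):
            loss = w * w
            if not np.isfinite(loss):
                return step
            grad = np.float32(2.0) * w
            w = w - lr * grad
    return None

diverged_at = first_non_finite_step(1.5)   # each update multiplies w by -2
stable = first_non_finite_step(0.1)        # each update multiplies w by 0.8
print(diverged_at, stable)
```

With the large step the weight doubles in magnitude every update until the loss overflows; with the small step it shrinks toward zero. Real networks are far more complex, but the failure mode is the same.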

Watch for Exploding Activations and Gradients

Even with clean data, certain model architectures can produce extremely large activations or gradients.

A practical mitigation is gradient clipping:

python

optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-4,
    clipnorm=1.0
)
model.compile(optimizer=optimizer, loss="binary_crossentropy")

Clipping does not solve every issue, but it can stabilize models that would otherwise blow up during backpropagation.

Batch normalization, careful weight initialization, and a shallower network can also help when the network itself is numerically unstable.
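To make the clipping step concrete, the NumPy sketch below rescales a single gradient array so its L2 norm does not exceed a threshold, which is the idea behind norm-based clipping. The helper `clip_by_norm` is a hypothetical illustration, not the optimizer's actual implementation.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

large = np.array([300.0, -400.0])   # L2 norm is 500
small = np.array([0.3, 0.4])        # L2 norm is 0.5
clipped_large = clip_by_norm(large, 1.0)
clipped_small = clip_by_norm(small, 1.0)
print(np.linalg.norm(clipped_large), np.linalg.norm(clipped_small))
```

Clipping only bounds finite gradients. If a gradient already contains inf or NaN, rescaling cannot repair it, so clipping works best alongside a lower learning rate rather than instead of one.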

Use a Smaller Reproducible Training Step

When the loss becomes NaN, reduce the problem to a tiny batch and inspect intermediate outputs.

python

batch_x = x_train[:8]
batch_y = y_train[:8]

pred = model(batch_x, training=False)
print(tf.reduce_min(pred).numpy(), tf.reduce_max(pred).numpy())
print(tf.math.reduce_any(tf.math.is_nan(pred)).numpy())

If predictions are already invalid before training, the issue is in model construction or input scaling. If predictions are fine before training but invalid after one optimizer step, the issue is often optimization-related.
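The same "find the first failure" idea can be applied to any ordered sequence of intermediate arrays. The sketch below is a small NumPy helper under that assumption; `first_non_finite` and the stage names are hypothetical, and in a real project the stages would be your own preprocessing outputs or layer activations converted to NumPy.

```python
import numpy as np

def first_non_finite(stages):
    """Return the name of the first stage containing NaN or inf, else None.

    stages is an ordered list of (name, array) pairs.
    """
    for name, values in stages:
        if not np.isfinite(np.asarray(values, dtype=np.float64)).all():
            return name
    return None

raw = np.array([0.0, 1.0, 4.0])
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(raw)              # log(0) is -inf
    centered = logged - logged.mean() # the -inf spreads to every element

culprit = first_non_finite([("raw", raw), ("logged", logged), ("centered", centered)])
print(culprit)
```

Here the centered array is entirely invalid, but the helper points at the log transform, which is where the first non-finite value appeared.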

Add a Finite-Value Check During Training

A custom callback can stop training as soon as the loss goes invalid.

python

import numpy as np
import tensorflow as tf

class StopOnNaN(tf.keras.callbacks.Callback):
    def on_batch_end(self, batch, logs=None):
        logs = logs or {}  # logs can be None, so guard before calling .get()
        loss = logs.get("loss")
        # np.isfinite is False for NaN and for +/-inf.
        if loss is not None and not np.isfinite(loss):
            self.model.stop_training = True
            print(f"Stopped on invalid loss at batch {batch}: {loss}")

model.fit(x_train, y_train, callbacks=[StopOnNaN()])

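This is useful because it prevents long training runs from hiding where the first invalid value appeared. Keras also ships a built-in tf.keras.callbacks.TerminateOnNaN callback that stops training when the loss becomes NaN, if you do not need a custom message.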

Mixed Precision and Custom Losses Need Extra Care

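If you are using mixed precision or a custom loss function, inspect that code carefully. With mixed precision, float16 can only represent magnitudes up to 65504, so large activations or losses overflow to infinity much sooner than in float32. Custom math such as manual logarithms, divisions, or exponentials can create invalid values very easily.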

Bad example:

python

def unsafe_loss(y_true, y_pred):
    return -tf.reduce_mean(y_true * tf.math.log(y_pred))

If y_pred contains zero, the logarithm evaluates to negative infinity, and multiplying that by a zero label gives NaN, which then contaminates the mean. Built-in losses usually handle numerical stability much better than ad hoc formulas.
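The effect is easy to reproduce outside TensorFlow. The NumPy sketch below mirrors the unsafe formula and a clipped variant; the function names and the `eps` value of 1e-7 are assumptions for illustration. In a TensorFlow loss, the analogous fix would be applying `tf.clip_by_value` to `y_pred` before `tf.math.log`.

```python
import numpy as np

def unsafe_cross_entropy(y_true, y_pred):
    """Direct port of the unsafe formula: log(0) is -inf, and 0 * -inf is NaN."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.mean(y_true * np.log(y_pred))

def clipped_cross_entropy(y_true, y_pred, eps=1e-7):
    """Same formula, with predictions clamped to [eps, 1] before the log."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(y_true * np.log(y_pred))

# A fully confident and correct prediction still breaks the unsafe version.
y_true = np.array([0.0, 1.0])
y_pred = np.array([0.0, 1.0])

unsafe_value = unsafe_cross_entropy(y_true, y_pred)
clipped_value = clipped_cross_entropy(y_true, y_pred)
print(np.isnan(unsafe_value), np.isfinite(clipped_value))
```

Clipping slightly biases the loss near 0 and 1, which is usually an acceptable trade for a loss that stays finite.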

Common Pitfalls

The most common mistake is assuming the model architecture is the problem before checking whether the data already contains NaN or infinite values. Another is using an overly large learning rate and then debugging everything except the optimizer step size. Developers also often mismatch the output layer and loss function, which can make the training objective numerically unstable or simply incorrect. A final issue is writing custom losses with unsafe math instead of relying on numerically stable built-in implementations.

Summary

NaN loss usually means numerical instability somewhere in the training pipeline.
Check the data first for invalid values.
Verify that the output layer, label format, and loss function match the task.
Lower the learning rate and consider gradient clipping if optimization is unstable.
Use small reproducible batches and finite-value checks to find the first point of failure.