Restoring saved TensorFlow model to evaluate on test set

Introduction

Evaluating on a test set is only useful if the restored model behaves exactly like the model that was trained. In TensorFlow that usually means loading the saved model, rebuilding the same input pipeline, and running evaluation without accidentally changing preprocessing or metric configuration.

Load the Saved Model Correctly

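If the model was saved with Keras, the simplest path is tf.keras.models.load_model. A full saved model contains the architecture, weights, optimizer state, and compile configuration, so evaluation can often happen immediately after loading. Note that under Keras 3 (the default from TensorFlow 2.16 onward), load_model accepts only .keras files and legacy .h5 files; a SavedModel directory like the one in the example below loads this way only with the older Keras 2 API.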

python

import tensorflow as tf
import numpy as np

# Example test data
x_test = np.random.rand(128, 10).astype("float32")
y_test = np.random.randint(0, 2, size=(128, 1)).astype("float32")

model = tf.keras.models.load_model("artifacts/binary_classifier")

loss, accuracy = model.evaluate(x_test, y_test, batch_size=32, verbose=0)
print(f"loss={loss:.4f}")
print(f"accuracy={accuracy:.4f}")

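This works well when the model was saved with model.save(...) and compiled with a single accuracy metric, which is why evaluate returns exactly two values here. If the model contains custom layers, losses, or metrics, pass them through custom_objects during loading so Keras can reconstruct them from the saved configuration.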

python

import tensorflow as tf

class F1Score(tf.keras.metrics.Metric):
    def __init__(self, name="f1_score", **kwargs):
        super().__init__(name=name, **kwargs)
        self.tp = self.add_weight(name="tp", initializer="zeros")
        self.fp = self.add_weight(name="fp", initializer="zeros")
        self.fn = self.add_weight(name="fn", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.cast(y_pred > 0.5, tf.float32)
        y_true = tf.cast(y_true, tf.float32)
        self.tp.assign_add(tf.reduce_sum(y_true * y_pred))
        self.fp.assign_add(tf.reduce_sum((1 - y_true) * y_pred))
        self.fn.assign_add(tf.reduce_sum(y_true * (1 - y_pred)))

    def result(self):
        precision = self.tp / (self.tp + self.fp + 1e-7)
        recall = self.tp / (self.tp + self.fn + 1e-7)
        return 2 * precision * recall / (precision + recall + 1e-7)

    def reset_state(self):
        self.tp.assign(0.0)
        self.fp.assign(0.0)
        self.fn.assign(0.0)

model = tf.keras.models.load_model(
    "artifacts/model_with_custom_metric",
    custom_objects={"F1Score": F1Score},
)
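Before trusting test metrics, it can help to confirm that a save/load round trip reproduces the model's predictions. The following is a minimal sketch, not part of the original workflow: the small architecture, the file name roundtrip_check.keras, and the probe data are all hypothetical stand-ins for a real trained model.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical stand-in for a trained model; the architecture is illustrative only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A fixed batch of probe inputs to compare outputs before and after reloading.
x_probe = np.random.rand(16, 10).astype("float32")
before = model.predict(x_probe, verbose=0)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumed file name; the .keras extension selects the single-file Keras format.
    path = os.path.join(tmp_dir, "roundtrip_check.keras")
    model.save(path)
    restored = tf.keras.models.load_model(path)

after = restored.predict(x_probe, verbose=0)
predictions_match = np.allclose(before, after, atol=1e-6)
print(f"predictions match after reload: {predictions_match}")
```

If the predictions differ, the problem lies in saving or loading rather than in the test data, which narrows down where to look.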

Reuse the Same Test Input Pipeline

Loading the model is only half of the job. The test set must go through the same preprocessing used during training. If training normalized images to the range 0..1, tokenized text with a specific vocabulary, or one-hot encoded labels, the evaluation code must do the same.

Using tf.data keeps that logic explicit and reproducible:

python

import tensorflow as tf

# Assumes x_test, y_test, and model are defined as in the earlier example.

def preprocess(features, label):
    # Must match the scaling applied during training.
    features = tf.cast(features, tf.float32) / 255.0
    return features, label

test_ds = (
    tf.data.Dataset.from_tensor_slices((x_test, y_test))
    .map(preprocess)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

results = model.evaluate(test_ds, return_dict=True, verbose=0)
print(results)

The return_dict=True option is useful because it ties each metric value to its metric name. That matters when a model tracks more than just loss and accuracy.
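To make the difference concrete, here is a small sketch that evaluates the same model both ways. The model, the random data, and the choice of an AUC metric named "auc" are assumptions made for illustration; the model is untrained, so the values themselves are meaningless.

```python
import numpy as np
import tensorflow as tf

# Hypothetical model that tracks two metrics in addition to the loss.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)

x_demo = np.random.rand(64, 10).astype("float32")
y_demo = np.random.randint(0, 2, size=(64, 1)).astype("float32")

# Without return_dict, the caller has to remember the order of the values.
as_list = model.evaluate(x_demo, y_demo, verbose=0)
print(as_list)

# With return_dict=True, each value is keyed by its metric name.
as_dict = model.evaluate(x_demo, y_demo, return_dict=True, verbose=0)
for name, value in as_dict.items():
    print(f"{name}={value:.4f}")
```

Looking values up by name also keeps downstream code working if a metric is added or reordered later.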

If You Only Saved Weights

Some projects save checkpoints with model.save_weights(...) instead of a full model. In that case TensorFlow cannot infer the architecture for you. You must rebuild the model in code, compile it with the same loss and metrics, and then restore the weights.

python

import tensorflow as tf

# Assumes x_test and y_test are defined as in the earlier example.

def build_model():
    # Must match the architecture that produced the checkpoint.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

model = build_model()
# Under Keras 3, weight files must end in .weights.h5; a TensorFlow
# checkpoint prefix like this one is the older Keras 2 convention.
model.load_weights("checkpoints/ckpt")

print(model.evaluate(x_test, y_test, verbose=0))

This pattern is common in research code, but it is more fragile because the architecture and compile settings now live in two places.
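The rebuild-then-restore flow can be checked end to end with a short sketch. The build_demo_model function and the file name demo.weights.h5 are hypothetical; a freshly initialized model stands in for a trained one.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def build_demo_model():
    # Hypothetical architecture; the same function is used for saving and restoring.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

trained = build_demo_model()  # stands in for a model that has been trained

with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumed file name; the .weights.h5 suffix is accepted by current Keras releases.
    path = os.path.join(tmp_dir, "demo.weights.h5")
    trained.save_weights(path)

    # A fresh instance starts with different random weights until they are restored.
    rebuilt = build_demo_model()
    rebuilt.load_weights(path)

weights_match = all(
    np.array_equal(a, b)
    for a, b in zip(trained.get_weights(), rebuilt.get_weights())
)
print(f"weights restored exactly: {weights_match}")
```

Keeping a single build function that both the training script and the evaluation script import is one way to stop the two definitions from drifting apart.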

Common Pitfalls

The most common failure is evaluating with different preprocessing than training. A model can look broken when the real problem is that the test features were not scaled, tokenized, padded, or ordered the same way.
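A cheap guard against this is a range check on the features right before evaluation. The sketch below assumes training scaled pixel values to 0..1; the helper names scale_pixels and looks_scaled are hypothetical, and the check catches only gross mismatches such as a forgotten division by 255.

```python
import numpy as np

def scale_pixels(features):
    """Hypothetical preprocessing shared by training and evaluation."""
    return features.astype("float32") / 255.0

def looks_scaled(features, low=0.0, high=1.0):
    """Return True if every value falls inside the expected range."""
    return bool(features.min() >= low and features.max() <= high)

# Raw 8-bit values from 0 to 255, standing in for unprocessed test images.
raw_test = np.arange(256).reshape(16, 16)

print(looks_scaled(raw_test))                # False: raw values reach 255
print(looks_scaled(scale_pixels(raw_test)))  # True: values now lie in 0..1
```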

Another frequent issue is loading weights into a model definition that has drifted. Renaming layers, changing the input shape, or changing the number of output units after training will cause load failures or, worse, produce a model that loads without complaint but no longer matches the one that was trained.
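One way to spot structural drift early is to compare weight shapes between the definition used for training and the current one. This is a sketch under assumptions: the build function and layer sizes are hypothetical, and matching shapes are necessary but not sufficient, since a changed activation function, for example, would go unnoticed.

```python
import tensorflow as tf

def build(units):
    # Hypothetical model definition parameterized by hidden layer width.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

def weight_shapes(model):
    """List the shape of every weight array, in layer order."""
    return [w.shape for w in model.get_weights()]

trained_like = build(32)   # definition used at training time
same_definition = build(32)
drifted = build(64)        # hidden layer was widened after training

print(weight_shapes(trained_like) == weight_shapes(same_definition))  # True
print(weight_shapes(trained_like) == weight_shapes(drifted))          # False
```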

Custom objects are another trap. If the saved model used a custom layer, metric, or loss and Keras cannot find its definition at load time, for example because it was not passed through custom_objects, loading will fail. Keep those definitions in importable modules instead of notebook cells.
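An alternative to passing custom_objects on every load is to register the class with Keras when its module is imported. The sketch below uses tf.keras.utils.register_keras_serializable; the Scale layer, the package name "demo", and the file name are hypothetical, and in a real project the class would live in a module that both the training and evaluation scripts import.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

@tf.keras.utils.register_keras_serializable(package="demo")
class Scale(tf.keras.layers.Layer):
    """Hypothetical custom layer that multiplies its input by a constant."""

    def __init__(self, factor=1.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, inputs):
        return inputs * self.factor

    def get_config(self):
        # Include constructor arguments so the layer can be rebuilt on load.
        config = super().get_config()
        config.update({"factor": self.factor})
        return config

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    Scale(0.5),
    tf.keras.layers.Dense(1),
])

x_probe = np.random.rand(8, 4).astype("float32")
before = model.predict(x_probe, verbose=0)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_custom_layer.keras")
    model.save(path)
    # No custom_objects argument: the registered class is looked up by name.
    restored = tf.keras.models.load_model(path)

after = restored.predict(x_probe, verbose=0)
custom_layer_restored = np.allclose(before, after, atol=1e-6)
print(f"custom layer restored: {custom_layer_restored}")
```

Registration only takes effect once the defining module has been imported, so the evaluation script still needs that import before calling load_model.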

Finally, be careful when evaluating after load_weights. If you did not call compile, Keras has no loss or metrics configured for evaluation. Prediction still works, but model.evaluate raises an error until the model is compiled.
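The following sketch shows that behavior on a small, hypothetical model with random data. The exact exception type differs between Keras versions, so the example catches both RuntimeError and ValueError; treat that pair as an assumption about the versions in use.

```python
import numpy as np
import tensorflow as tf

# Hypothetical model left uncompiled, as it would be after load_weights alone.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

x_demo = np.random.rand(32, 10).astype("float32")
y_demo = np.random.randint(0, 2, size=(32, 1)).astype("float32")

# Prediction needs no loss or metrics, so it works without compile.
predictions = model.predict(x_demo, verbose=0)

try:
    model.evaluate(x_demo, y_demo, verbose=0)
    evaluate_failed = False
except (RuntimeError, ValueError) as err:
    evaluate_failed = True
    print(f"evaluate before compile failed: {type(err).__name__}")

# After compiling with a loss and metrics, evaluation runs normally.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
results = model.evaluate(x_demo, y_demo, return_dict=True, verbose=0)
print(results)
```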

Summary

Prefer tf.keras.models.load_model when you saved the full Keras model.
Recreate the exact preprocessing pipeline before running test evaluation.
Use return_dict=True to make metric output easier to read and less error-prone.
If you only saved weights, rebuild and compile the model before calling evaluate.
Treat custom layers, metrics, and losses as part of the saved model contract.