
Save Keras model at specific epochs

Introduction

Keras provides checkpoint saving during training through callbacks. To save at specific epochs, the cleanest options are either the built-in ModelCheckpoint callback, which can save after every epoch, or a small custom callback when you need your own schedule, such as every fifth epoch.

Save Every Epoch with ModelCheckpoint

The built-in ModelCheckpoint callback can save the model after each epoch:

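```python
import tensorflow as tf

# Assumes `model` is already built and compiled, and x_train / y_train are loaded.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="checkpoints/model_epoch_{epoch:02d}.keras",
    save_weights_only=False,  # the default: save the full model, not just weights
    save_freq="epoch",        # save at the end of every epoch
)

model.fit(
    x_train,
    y_train,
    epochs=10,
    callbacks=[checkpoint],
)
```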

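The {epoch:02d} placeholder inserts the 1-based epoch number, zero-padded to two digits, into the filename, so each epoch produces a separate saved model (model_epoch_01.keras, model_epoch_02.keras, and so on).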
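A self-contained sketch of the same callback on a toy problem makes the naming visible. The random data, the one-layer model, and the temporary directory are illustrative assumptions, not part of the example above; three epochs should leave three files, and any of them can be reloaded by name with tf.keras.models.load_model().

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Illustrative setup: random regression data and a one-layer model.
rng = np.random.default_rng(0)
x_train = rng.random((64, 4)).astype("float32")
y_train = x_train.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Temporary directory so the demo does not write into the working directory.
ckpt_dir = tempfile.mkdtemp()
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(ckpt_dir, "model_epoch_{epoch:02d}.keras"),
    save_freq="epoch",
)
model.fit(x_train, y_train, epochs=3, batch_size=16,
          callbacks=[checkpoint], verbose=0)

saved_files = sorted(os.listdir(ckpt_dir))
print(saved_files)  # one .keras file per epoch

# Reload the model exactly as it was at the end of epoch 2.
epoch2_model = tf.keras.models.load_model(
    os.path.join(ckpt_dir, "model_epoch_02.keras")
)
```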

Save Only the Best Epoch

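If you do not want a checkpoint for every epoch, only the best one according to a validation metric, set save_best_only=True. This needs validation data in fit() (validation_data or validation_split) so that val_loss exists: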

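```python
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="best_model.keras",
    monitor="val_loss",    # requires validation data in fit()
    save_best_only=True,   # overwrite the file only when val_loss improves
    mode="min",            # lower val_loss is better
)
```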

That is not "specific epochs" in the fixed-number sense, but it is often what people really want when they say they need checkpointing during training.
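A minimal runnable sketch, assuming a toy regression task: validation_split=0.2 supplies the val_loss that the callback monitors, and the single best_model.keras file is rewritten only when val_loss improves. The data, model, and paths are illustrative.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Illustrative data and model.
rng = np.random.default_rng(0)
x_train = rng.random((80, 4)).astype("float32")
y_train = x_train.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

best_path = os.path.join(tempfile.mkdtemp(), "best_model.keras")
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=best_path,
    monitor="val_loss",
    save_best_only=True,
    mode="min",
)

# validation_split holds out the last 20% of samples, which produces val_loss.
history = model.fit(x_train, y_train, epochs=5, batch_size=16,
                    validation_split=0.2, callbacks=[checkpoint], verbose=0)

best_epoch = int(np.argmin(history.history["val_loss"])) + 1  # 1-based
best_model = tf.keras.models.load_model(best_path)
print(f"Best val_loss at epoch {best_epoch}")
```

Without validation data, val_loss never appears in the logs, so the callback has nothing to compare; Keras typically logs a warning and skips saving.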

Save Every N Epochs with a Custom Callback

If the requirement is something like "save at epochs 5, 10, 15," write a tiny callback:

```python
import os

import tensorflow as tf


class SaveEveryNEpochs(tf.keras.callbacks.Callback):
    """Save the full model every `interval` epochs."""

    def __init__(self, interval, prefix="checkpoint"):
        super().__init__()
        self.interval = interval
        self.prefix = prefix

    def on_epoch_end(self, epoch, logs=None):
        # Keras passes a 0-based epoch index; convert to 1-based for the schedule.
        current_epoch = epoch + 1
        if current_epoch % self.interval == 0:
            path = f"{self.prefix}_epoch_{current_epoch:02d}.keras"
            # Make sure the target directory (e.g. "checkpoints/") exists.
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            self.model.save(path)
            print(f"Saved model to {path}")


# Saves checkpoints/model_epoch_05.keras, _10, _15 and _20.
callback = SaveEveryNEpochs(interval=5, prefix="checkpoints/model")

model.fit(
    x_train,
    y_train,
    epochs=20,
    callbacks=[callback],
)
```

This gives you exact control over the save schedule.
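When the epochs you care about are not evenly spaced, the same idea works with an explicit set of epoch numbers instead of a modulo check. SaveAtEpochs is a hypothetical name for this sketch, and the toy data and model exist only so it runs on its own.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


class SaveAtEpochs(tf.keras.callbacks.Callback):
    """Hypothetical callback: save the full model at the listed 1-based epochs."""

    def __init__(self, epochs, prefix="checkpoint"):
        super().__init__()
        self.epochs = set(epochs)
        self.prefix = prefix

    def on_epoch_end(self, epoch, logs=None):
        current_epoch = epoch + 1  # Keras passes a 0-based index
        if current_epoch in self.epochs:
            self.model.save(f"{self.prefix}_epoch_{current_epoch:02d}.keras")


# Illustrative data and model.
rng = np.random.default_rng(0)
x_train = rng.random((64, 4)).astype("float32")
y_train = x_train.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

ckpt_dir = tempfile.mkdtemp()
callback = SaveAtEpochs(epochs=[1, 3, 6], prefix=os.path.join(ckpt_dir, "model"))
model.fit(x_train, y_train, epochs=6, batch_size=16,
          callbacks=[callback], verbose=0)

saved_files = sorted(os.listdir(ckpt_dir))
```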

Save Weights Only When Full Models Are Too Heavy

Sometimes you only want weights:

```python
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    # Keras 3 requires weights-only filepaths to end in .weights.h5.
    filepath="weights_epoch_{epoch:02d}.weights.h5",
    save_weights_only=True,
    save_freq="epoch",
)
```

Weights-only checkpoints are smaller because they leave out the model's configuration, and they are sufficient as long as you rebuild the same architecture in code before calling load_weights().
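A sketch of the round trip, assuming the architecture lives in a small build function (build_model is a hypothetical helper): train with weights-only checkpoints, rebuild the model, and load one epoch's weights into it. The restored model's predictions should then match the trained model's.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def build_model():
    """Hypothetical helper: rebuilds the architecture, which weights files do not store."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


# Illustrative data.
rng = np.random.default_rng(0)
x_train = rng.random((64, 4)).astype("float32")
y_train = x_train.sum(axis=1, keepdims=True)

ckpt_dir = tempfile.mkdtemp()
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(ckpt_dir, "weights_epoch_{epoch:02d}.weights.h5"),
    save_weights_only=True,
    save_freq="epoch",
)
model = build_model()
model.fit(x_train, y_train, epochs=2, batch_size=16,
          callbacks=[checkpoint], verbose=0)

# Recreate the architecture, then load the epoch-2 weights into it.
restored = build_model()
restored.load_weights(os.path.join(ckpt_dir, "weights_epoch_02.weights.h5"))

# Epoch 2 was the last epoch, so both models should hold identical weights.
matches = np.allclose(model.predict(x_train, verbose=0),
                      restored.predict(x_train, verbose=0))
```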

File Naming Matters

A good filename pattern saves time later. Useful pieces to include:

  • the epoch number (essential)
  • a validation metric (optional)
  • an experiment name or timestamp (optional)

For example:

```python
filepath="runs/exp1_epoch_{epoch:02d}_valloss_{val_loss:.4f}.keras"
```

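That makes it much easier to see what each file represents without reopening logs. Any metric named in the pattern must exist when the checkpoint is written, so {val_loss} only works when fit() receives validation data.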
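Combining a run name with Keras's own placeholders needs a little care if you build the string with an f-string: literal braces must be doubled so that ModelCheckpoint still sees {epoch} and {val_loss}. A sketch, where the run name and timestamp format are assumptions; the final .format() call imitates how ModelCheckpoint fills the template from the 1-based epoch and the metrics in its logs.

```python
import time

# Hypothetical run name: experiment label plus a start timestamp.
run_name = f"exp1_{time.strftime('%Y%m%d-%H%M%S')}"

# Doubled braces survive the f-string as {epoch:02d} and {val_loss:.4f},
# leaving them for ModelCheckpoint to fill in at save time.
filepath = f"runs/{run_name}_epoch_{{epoch:02d}}_valloss_{{val_loss:.4f}}.keras"

# Imitate the substitution for epoch 3 with val_loss = 0.123456.
example_name = filepath.format(epoch=3, val_loss=0.123456)
print(example_name)  # runs/exp1_<timestamp>_epoch_03_valloss_0.1235.keras
```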

Watch Disk Usage

Frequent full-model saving can consume disk space quickly, especially with large neural networks. If you save every epoch during a long run, plan for cleanup or keep only the checkpoints that matter.

That is one reason save_best_only=True is so attractive in many workflows.
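One way to plan for cleanup is a pruning step that keeps only the most recent few epoch checkpoints. The helper below is a hypothetical sketch, not part of Keras; it assumes filenames contain _epoch_<number> as in the patterns above and compares epochs numerically, which matters once runs pass 99 epochs.

```python
import os
import re
import tempfile

# Matches the epoch number in names such as model_epoch_07.keras.
EPOCH_PATTERN = re.compile(r"_epoch_(\d+)")


def prune_checkpoints(directory, keep=3, suffix=".keras"):
    """Hypothetical helper: keep the `keep` highest-epoch checkpoints, delete the rest.

    Epochs are compared as integers, so epoch 100 counts as newer than epoch 99
    even though "100" sorts before "99" as text.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    found = []
    for name in os.listdir(directory):
        match = EPOCH_PATTERN.search(name)
        if match and name.endswith(suffix):
            found.append((int(match.group(1)), name))
    found.sort()
    for _, name in found[:-keep]:
        os.remove(os.path.join(directory, name))
    return [name for _, name in found[-keep:]]


# Demo with empty placeholder files instead of real checkpoints.
demo_dir = tempfile.mkdtemp()
for epoch in [1, 2, 9, 10, 100]:
    open(os.path.join(demo_dir, f"model_epoch_{epoch:02d}.keras"), "w").close()

kept = prune_checkpoints(demo_dir, keep=2)
print(kept)  # ['model_epoch_10.keras', 'model_epoch_100.keras']
```

You could call such a helper from a custom callback's on_epoch_end after each save. Files without _epoch_ in the name, such as best_model.keras, are left alone.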

Saving Helps with Training Resumption Too

Checkpoint files are not only for later evaluation. They also let you resume interrupted training. If you save full models or matching weights at planned epochs, a long run lost to a crash does not necessarily force you back to epoch one.

That is often the practical reason teams choose a checkpoint interval such as every five epochs even when they also keep a separate best-model checkpoint.
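A sketch of manual resumption on toy data: reload the last full checkpoint with tf.keras.models.load_model() and pass initial_epoch to fit() so that training and filenames continue where they stopped. The "crash" is simulated by stopping after three of six planned epochs. Keras also provides tf.keras.callbacks.BackupAndRestore, which automates resumption for interrupted jobs; the manual version only makes the moving parts visible.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Illustrative data.
rng = np.random.default_rng(0)
x_train = rng.random((64, 4)).astype("float32")
y_train = x_train.sum(axis=1, keepdims=True)

ckpt_dir = tempfile.mkdtemp()
ckpt_pattern = os.path.join(ckpt_dir, "model_epoch_{epoch:02d}.keras")

# First run: pretend the job died after epoch 3 of a planned 6.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=3, batch_size=16, verbose=0,
          callbacks=[tf.keras.callbacks.ModelCheckpoint(ckpt_pattern)])

# Restart: reload the last full checkpoint (a .keras file also carries the
# optimizer state) and continue. initial_epoch=3 makes Keras run epochs 4-6,
# so new files continue the numbering instead of overwriting epochs 1-3.
resumed = tf.keras.models.load_model(ckpt_pattern.format(epoch=3))
history = resumed.fit(x_train, y_train, initial_epoch=3, epochs=6,
                      batch_size=16, verbose=0,
                      callbacks=[tf.keras.callbacks.ModelCheckpoint(ckpt_pattern)])

saved_files = sorted(os.listdir(ckpt_dir))
```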

Common Pitfalls

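The biggest pitfall is expecting ModelCheckpoint to save every fifth epoch on its own. Its save_freq argument accepts either "epoch" or an integer, and an integer counts batches, not epochs, so save_freq=5 saves every five batches. The built-in callback handles every-epoch saving cleanly, but other schedules are usually easier with a custom callback.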
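If you do want ModelCheckpoint itself on a fixed epoch interval, one workaround is to convert epochs to batches yourself. A sketch, assuming steps_per_epoch is known exactly (here 64 samples with batch size 16 gives 4): the saves should then land on the last batch of epochs 2, 4, and 6. Because the save happens at a batch boundary, validation metrics are not available for monitor or the filename, and the schedule drifts if the real number of batches per epoch differs from your calculation, which is why a custom callback is often simpler.

```python
import math
import os
import tempfile

import numpy as np
import tensorflow as tf

n_samples, batch_size, every_n_epochs = 64, 16, 2
steps_per_epoch = math.ceil(n_samples / batch_size)  # 4 batches per epoch here

# Illustrative data and model.
rng = np.random.default_rng(0)
x_train = rng.random((n_samples, 4)).astype("float32")
y_train = x_train.sum(axis=1, keepdims=True)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

ckpt_dir = tempfile.mkdtemp()
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(ckpt_dir, "model_epoch_{epoch:02d}.keras"),
    # An integer save_freq counts batches, so N epochs = N * steps_per_epoch.
    save_freq=every_n_epochs * steps_per_epoch,
)
model.fit(x_train, y_train, epochs=6, batch_size=batch_size,
          callbacks=[checkpoint], verbose=0)

saved_files = sorted(os.listdir(ckpt_dir))
```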

Another issue is mixing model and weights file conventions. If you save weights only, rebuild the matching architecture in code and call load_weights() on it; tf.keras.models.load_model() expects a full model file such as .keras, not a .weights.h5 file.

People also forget about disk usage. Saving a full model at every epoch can create many large files very quickly.

Summary

  • Use ModelCheckpoint(save_freq="epoch") to save after every epoch.
  • Use save_best_only=True when you only care about the best validation checkpoint.
  • Write a small custom callback when you need a schedule such as every N epochs.
  • Save weights only when full model files are heavier than you need and you can rebuild the architecture in code.
  • Use clear filenames so checkpoints are easy to identify later.
