Keras not training on entire dataset
Introduction
If Keras appears not to train on the full dataset, the problem is usually not that model.fit() randomly ignores samples. The usual causes are configuration choices around steps_per_epoch, generators, repeated datasets, partial batches, or input pipelines that do not actually expose every example.
Start with How the Data Enters fit()
Keras behaves differently depending on the input type:
- NumPy arrays
- Python generators
- tf.data.Dataset
- sequence objects such as keras.utils.Sequence
With plain NumPy arrays, fit() typically knows the dataset length and will iterate through all samples each epoch. Problems are more common when you use generators or datasets, because then Keras depends on the pipeline configuration to know how much data constitutes one epoch.
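One way to confirm this for in-memory arrays is to count the batches Keras actually runs. The sketch below is only an illustration: BatchCounter is a hypothetical helper callback, the one-layer model is a toy, and the sizes (100 samples, batch size 32) are assumptions.

```python
import math

import numpy as np
from tensorflow import keras


class BatchCounter(keras.callbacks.Callback):
    """Hypothetical helper: counts training batches across all epochs."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def on_train_batch_end(self, batch, logs=None):
        self.count += 1


# Toy data (assumed sizes): 100 samples, 4 features each.
num_samples, batch_size = 100, 32
x = np.random.rand(num_samples, 4).astype("float32")
y = np.random.rand(num_samples, 1).astype("float32")

model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

counter = BatchCounter()
model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0, callbacks=[counter])

# With arrays, Keras infers the epoch length: ceil(100 / 32) = 4 batches,
# where the last batch holds the remaining 4 samples.
expected_batches = math.ceil(num_samples / batch_size)
print(counter.count, expected_batches)
```

If the two numbers differ for your own data, the cause is more likely one of the configuration issues described in the following sections than fit() itself.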
steps_per_epoch Is the First Thing to Check
If you pass a generator or dataset and set steps_per_epoch too small, training stops the epoch early even though more samples exist.
For example, with steps_per_epoch=100, exactly 100 batches are consumed per epoch. If the dataset contains 120 batches, the last 20 are not seen in that epoch.
The fix is to set steps_per_epoch correctly or let Keras infer it when possible.
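The arithmetic behind a correct value can be sketched in a few lines. The function name steps_for_full_epoch and the sample counts are assumptions chosen to match the 120-batch scenario; train_generator in the final comment is a placeholder for your own generator.

```python
import math


def steps_for_full_epoch(num_samples: int, batch_size: int) -> int:
    """Batches needed to see every sample once (the last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


# Assumed sizes for illustration: 3,840 samples at batch size 32 is 120 batches.
num_samples, batch_size = 3840, 32
full_steps = steps_for_full_epoch(num_samples, batch_size)

# A hard-coded steps_per_epoch=100 would leave this many batches unseen per epoch.
configured_steps = 100
skipped_batches = max(0, full_steps - configured_steps)
print(full_steps, skipped_batches)

# The computed value is what you would pass to fit(), for example:
# model.fit(train_generator, steps_per_epoch=full_steps, epochs=10)
```

Using ceil rather than integer division matters when the sample count is not a multiple of the batch size, because the final partial batch still counts as a step.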
tf.data.Dataset Can Repeat Forever
Another common issue is a dataset pipeline that calls .repeat() without careful epoch boundaries.
Called without a count, .repeat() makes the dataset infinite. Keras then needs steps_per_epoch because there is no natural end. If the number is wrong, your notion of "one epoch over the full dataset" no longer matches what the training loop is doing.
A safer pipeline is finite: shuffle and batch the data without .repeat(), so that one pass over the dataset is exactly one epoch.
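A minimal sketch of a finite pipeline, assuming in-memory toy arrays of 100 samples and a batch size of 32 (both invented for illustration):

```python
import numpy as np
import tensorflow as tf

# Toy data (assumed sizes): 100 samples, 4 features each.
x = np.random.rand(100, 4).astype("float32")
y = np.random.rand(100, 1).astype("float32")

# Finite pipeline: no .repeat(), so the dataset ends after one full pass.
dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(buffer_size=len(x))  # buffer covers the whole set for a full shuffle
    .batch(32)                    # drop_remainder defaults to False
)

# The length is known, so fit() can infer the steps per epoch on its own.
num_batches = int(dataset.cardinality().numpy())
print(num_batches)

# Usage sketch: model.fit(dataset, epochs=10)  # no steps_per_epoch needed
```

Checking cardinality() before training is a cheap way to see how many batches Keras will treat as one epoch.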
If you do need .repeat(), calculate the step count deliberately.
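One possible way to do that is sketched below, again with assumed toy sizes. Batching before repeating keeps each pass aligned with the step count, so the first steps_per_epoch batches cover every sample exactly once.

```python
import math

import numpy as np
import tensorflow as tf

num_samples, batch_size = 100, 32  # assumed sizes for illustration
x = np.random.rand(num_samples, 4).astype("float32")
y = np.random.rand(num_samples, 1).astype("float32")

dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(buffer_size=num_samples)
    .batch(batch_size)
    .repeat()  # no count: the dataset never ends
)

# An infinite dataset reports a special cardinality value.
is_infinite = int(dataset.cardinality().numpy()) == tf.data.INFINITE_CARDINALITY

# One pass over the data is ceil(100 / 32) = 4 batches, so that is one epoch.
steps_per_epoch = math.ceil(num_samples / batch_size)
print(is_infinite, steps_per_epoch)

# Usage sketch: model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=10)
```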
Partial Batches and drop_remainder
Some pipelines discard the last incomplete batch. In tf.data, that happens when .batch() is called with drop_remainder=True.
If the dataset size is not divisible by the batch size (32, say), the remaining examples are dropped each epoch. That is not necessarily wrong, but it does mean the entire dataset is not being used.
When you want every sample, keep drop_remainder=False, which is the default.
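The effect is easy to measure by counting the examples that come out of one pass. This sketch assumes 100 toy samples and a batch size of 32; samples_per_epoch is a hypothetical helper, not part of tf.data.

```python
import numpy as np
import tensorflow as tf

# Assumed toy data: 100 samples, which is not divisible by 32.
x = np.arange(100, dtype="float32").reshape(100, 1)
base = tf.data.Dataset.from_tensor_slices(x)


def samples_per_epoch(dataset):
    """Hypothetical helper: count examples yielded in one pass over a batched dataset."""
    return sum(int(batch.shape[0]) for batch in dataset)


seen_with_drop = samples_per_epoch(base.batch(32, drop_remainder=True))
seen_without_drop = samples_per_epoch(base.batch(32))  # drop_remainder=False by default
print(seen_with_drop, seen_without_drop)
```

With these sizes, dropping the remainder yields three full batches (96 samples), while the default keeps a fourth batch holding the last 4 samples.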
Generators Must Report Length Correctly
If you use a custom generator or Sequence, make sure its length matches the real number of batches.
If __len__() underreports the number of batches, Keras never asks for the rest.
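A minimal sketch of a Sequence whose length covers the final partial batch. The class name ArraySequence and the toy data are assumptions for illustration; a real Sequence would usually load data from disk rather than slice arrays.

```python
import math

import numpy as np
from tensorflow import keras


class ArraySequence(keras.utils.Sequence):
    """Hypothetical Sequence that serves (x, y) batches from in-memory arrays."""

    def __init__(self, x, y, batch_size):
        super().__init__()
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # ceil, not floor: floor division would silently drop the final partial batch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size  # slicing clips at the end of the array
        return self.x[start:end], self.y[start:end]


x = np.random.rand(100, 4).astype("float32")
y = np.random.rand(100, 1).astype("float32")
seq = ArraySequence(x, y, batch_size=32)

# Sanity check: the batches reachable through __len__ should cover every sample.
total_served = sum(len(seq[i][0]) for i in range(len(seq)))
print(len(seq), total_served)
```

Comparing total_served with the real sample count before calling fit() catches an underreporting __len__() early.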
Validation Splits and Shuffling Can Confuse Diagnosis
Sometimes the model is training on the full training set, but part of the original data has been reserved for validation.
With validation_split=0.2 passed to fit(), for example, only 80 percent of the data is used for training. That is expected behavior, but it can look like missing samples if you forgot the split exists.
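To make the split visible, you can perform it yourself. The sketch below mirrors the documented behavior of validation_split, which holds out the last fraction of the arrays before any shuffling; the 1,000-sample toy data is an assumption.

```python
import numpy as np

# Assumed toy data: 1,000 samples.
x = np.random.rand(1000, 4).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

validation_split = 0.2
split_at = int(len(x) * (1 - validation_split))

# Hold out the last 20 percent, as validation_split would.
x_train, x_val = x[:split_at], x[split_at:]
y_train, y_val = y[:split_at], y[split_at:]
print(len(x_train), len(x_val))

# Usage sketch: passing the held-out part explicitly keeps the split visible.
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
```

Because the held-out samples come from the end of the arrays, data that is sorted by class or by time should be shuffled before splitting.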
Shuffling can also make it harder to notice which samples were seen, even though the real issue is elsewhere.
Common Pitfalls
- Setting steps_per_epoch too small is one of the most common reasons not all batches are used.
- Using .repeat() without understanding that the dataset becomes infinite makes epoch semantics easy to misread.
- Enabling drop_remainder=True discards the final partial batch every epoch.
- Returning the wrong value from __len__() in a custom sequence causes Keras to stop early.
- Forgetting about validation_split can make it seem as though part of the dataset vanished from training.
Summary
- Keras usually trains on the full dataset unless the input pipeline tells it otherwise.
- Check steps_per_epoch first when using generators or tf.data.Dataset.
- Be careful with .repeat() and with partial-batch dropping.
- Custom generators must report their length accurately.
- If training still looks incomplete, inspect the data pipeline rather than assuming model.fit() is skipping samples on its own.

