Video classification using many to many LSTM in TensorFlow
Introduction
An LSTM can model a video as a sequence of frame features over time. The important design choice is whether you want one label for the whole video or one label per frame (timestep). A many-to-many LSTM is the right fit when the output is also a sequence, while whole-video classification is often many-to-one instead.
Many-to-One Versus Many-to-Many
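Suppose each video is represented as a sequence of frame embeddings with shape (timesteps, feature_dim).
For example, a 20-frame clip where each frame has been encoded into a 128-dimensional feature vector becomes a tensor of shape (20, 128), or (batch_size, 20, 128) once clips are batched.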
There are two common label patterns:
- many-to-one: one class for the entire clip
- many-to-many: one class for each timestep
If the goal is “classify the whole video as running, jumping, or walking,” many-to-one is usually the natural formulation.
If the goal is “label each frame or short step in the sequence,” many-to-many is appropriate.
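The two label patterns differ only in the shape of the target array. The sketch below uses random NumPy data to make that concrete; the batch size of 4 and the 3 classes are illustrative assumptions, not values from a real dataset.

```python
import numpy as np

# Illustrative sizes: 4 clips, 20 frames each, 128 features per frame, 3 classes.
batch_size, timesteps, feature_dim, num_classes = 4, 20, 128, 3

# Frame embeddings: one feature vector per frame of every clip.
features = np.random.rand(batch_size, timesteps, feature_dim).astype("float32")

# Many-to-one: a single integer class per clip.
clip_labels = np.random.randint(0, num_classes, size=(batch_size,))

# Many-to-many: one integer class per timestep of every clip.
frame_labels = np.random.randint(0, num_classes, size=(batch_size, timesteps))

print(features.shape)      # (4, 20, 128)
print(clip_labels.shape)   # (4,)
print(frame_labels.shape)  # (4, 20)
```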
What Makes an LSTM Many-to-Many in TensorFlow
In Keras, the crucial switch is return_sequences=True.
With return_sequences=False, the layer returns only the final output vector. With return_sequences=True, it returns an output at every timestep.
That is why many-to-many models use return_sequences=True at least on the LSTM layer that feeds the timestep-wise output head.
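The effect of the switch is easiest to see by comparing output shapes. This is a minimal sketch on a random batch; the 64 LSTM units and the input sizes are arbitrary choices for illustration.

```python
import tensorflow as tf

# Random batch shaped (batch, timesteps, features).
x = tf.random.uniform((2, 20, 128))

# Only the final timestep's output is returned.
last_only = tf.keras.layers.LSTM(64, return_sequences=False)(x)

# One output vector is returned for every timestep.
every_step = tf.keras.layers.LSTM(64, return_sequences=True)(x)

print(tuple(last_only.shape))   # (2, 64)
print(tuple(every_step.shape))  # (2, 20, 64)
```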
A Minimal Many-to-Many Example
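A simple many-to-many model takes a sequence of frame features, runs it through an LSTM with return_sequences=True, and applies a softmax classifier at every timestep.
The output shape is (batch_size, timesteps, num_classes).
So the target tensor must also have one label per timestep, typically shaped like (batch_size, timesteps) for integer class labels with a sparse categorical cross-entropy loss, or (batch_size, timesteps, num_classes) for one-hot labels.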
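One possible version of such a model is sketched below. The layer width (64 units), the 20 x 128 input, and the 3 classes are assumptions chosen for illustration; the structure is what matters.

```python
import tensorflow as tf

# Illustrative sizes: 20 frames per clip, 128 features per frame, 3 classes.
timesteps, feature_dim, num_classes = 20, 128, 3

model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, feature_dim)),
    # return_sequences=True keeps one output vector per timestep.
    tf.keras.layers.LSTM(64, return_sequences=True),
    # The same Dense classifier is applied independently at every timestep.
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(num_classes, activation="softmax")
    ),
])

# Sparse loss: targets are integer class ids, one per timestep.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

print(model.output_shape)  # (None, 20, 3)
```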
Example Training Data Shape
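A toy run on random data is a quick way to confirm that the shapes line up: inputs are shaped (num_videos, timesteps, feature_dim) and labels are shaped (num_videos, timesteps).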
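A toy run of this kind might look like the following. The data is random, so the model learns nothing meaningful; the point is only to check shapes end to end. All sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

num_videos, timesteps, feature_dim, num_classes = 32, 20, 128, 3

# Random stand-ins for real frame features and per-timestep labels.
x_train = np.random.rand(num_videos, timesteps, feature_dim).astype("float32")
y_train = np.random.randint(0, num_classes, size=(num_videos, timesteps))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, feature_dim)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(x_train, y_train, epochs=2, batch_size=8, verbose=0)

# One probability distribution per timestep for the first clip.
probs = model.predict(x_train[:1], verbose=0)
frame_predictions = probs.argmax(axis=-1)

print(probs.shape)              # (1, 20, 3)
print(frame_predictions.shape)  # (1, 20)
```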
This is a true many-to-many setup because the model predicts a class distribution at each timestep.
Whole-Video Classification Usually Uses Many-to-One
A lot of questions use the phrase “video classification,” but actually mean one class per clip. In that case, you usually want the last LSTM output only.
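A minimal many-to-one sketch, using the same illustrative sizes as before (20 frames, 128 features, 3 classes, 64 LSTM units), leaves return_sequences at its default of False:

```python
import tensorflow as tf

timesteps, feature_dim, num_classes = 20, 128, 3

model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, feature_dim)),
    # Default return_sequences=False: only the final output vector is kept.
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Targets here are shaped (batch_size,), one integer class per clip.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

print(model.output_shape)  # (None, 3)
```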
This produces one prediction per video, not one prediction per frame.
So before building the model, decide whether your label is attached to the whole clip or to each timestep.
Where the Frame Features Come From
In real video pipelines, raw frames are often too heavy to feed directly into an LSTM. A common pattern is:
- extract frames
- run each frame through a CNN or vision backbone
- feed the resulting feature vectors into the LSTM
That means the LSTM models temporal structure, while the CNN models spatial structure.
For example, you might use a pretrained image model to encode each frame into a feature vector of length 128 or 512, then train the LSTM on those sequences.
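The CNN-then-LSTM pattern can also be expressed as a single Keras model by wrapping a per-frame encoder in TimeDistributed. The sketch below uses a small untrained CNN as a stand-in for a pretrained backbone; the 32 x 32 frame size, layer widths, and class count are assumptions made to keep the example light.

```python
import tensorflow as tf

# Illustrative sizes: 20 RGB frames of 32 x 32 pixels, 3 classes.
timesteps, height, width, channels = 20, 32, 32, 3
num_classes = 3

# Per-frame spatial encoder (a stand-in for a pretrained image model).
frame_encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(height, width, channels)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),  # 128-d frame feature
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, height, width, channels)),
    # Apply the encoder to each frame: (batch, 20, 128).
    tf.keras.layers.TimeDistributed(frame_encoder),
    # Model temporal structure across the frame features: (batch, 20, 64).
    tf.keras.layers.LSTM(64, return_sequences=True),
    # One class distribution per timestep: (batch, 20, 3).
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

print(model.output_shape)  # (None, 20, 3)
```

Precomputing the frame features once and training only the LSTM on them is often cheaper than training this combined model, since the encoder does not have to run on every epoch.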
Padding and Variable-Length Videos
Real videos often have different lengths. Keras can handle this with padding plus masking.
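As a rough sketch, two clips of different lengths can be zero-padded with NumPy and passed through a Masking layer. The clip lengths (12 and 20 frames), the padding value of 0.0, and the layer sizes are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

feature_dim, num_classes = 128, 3

# Two clips of different lengths, already encoded as frame features.
clips = [np.random.rand(12, feature_dim), np.random.rand(20, feature_dim)]

# Zero-pad every clip at the end up to the longest length in the batch.
max_len = max(len(clip) for clip in clips)
padded = np.zeros((len(clips), max_len, feature_dim), dtype="float32")
for i, clip in enumerate(clips):
    padded[i, :len(clip)] = clip

model = tf.keras.Sequential([
    # None allows any sequence length.
    tf.keras.Input(shape=(None, feature_dim)),
    # Timesteps whose features all equal mask_value are flagged as padding.
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# The output still has an entry for padded positions; ignore those downstream.
probs = model.predict(padded, verbose=0)
print(probs.shape)  # (2, 20, 3)
```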
Masking helps the LSTM ignore padded timesteps rather than treating them as real frames.
This is important if clips are batched to a common length.
Common Pitfalls
One common mistake is building a many-to-many model when the dataset has only one label per video. That causes shape mismatches or, worse, a model that solves the wrong task.
Another issue is forgetting return_sequences=True. Without it, the LSTM outputs only the last timestep and cannot feed a timestep-wise classifier correctly.
Developers also sometimes feed raw frames directly into an LSTM without a spatial encoder, which usually makes training harder and less efficient than using frame features.
Finally, be careful with target shape. For many-to-many classification, the labels must be aligned per timestep, not per video.
Summary
- A many-to-many LSTM is appropriate when you need one prediction per timestep in the video sequence.
- In Keras, return_sequences=True is the key setting that enables timestep-wise outputs.
- Whole-video classification is usually many-to-one, not many-to-many.
- In practice, frame features from a CNN are often a better LSTM input than raw pixels.
- Match the model output shape to the labeling scheme before training.

