Video classification using many to many LSTM in TensorFlow

Introduction

An LSTM can model a video as a sequence of frame features over time. The important design choice is whether you want one label for the whole video or one label per frame or timestep. A many-to-many LSTM is the right fit when the output is also a sequence, while whole-video classification is often many-to-one instead.

Many-to-One Versus Many-to-Many

Suppose each video is represented as a sequence of frame embeddings:

text

(timesteps, feature_dim)

For example, a 20-frame clip where each frame has been encoded into a 128-dimensional feature vector becomes:

text

(20, 128)

There are two common label patterns:

many-to-one: one class for the entire clip
many-to-many: one class for each timestep

If the goal is “classify the whole video as running, jumping, or walking,” many-to-one is usually the natural formulation.

If the goal is “label each frame or short step in the sequence,” many-to-many is appropriate.
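As a rough illustration of the two label patterns, the sketch below builds placeholder arrays with NumPy. The batch size of 4 and the three classes are assumptions chosen only for the example; the point is the difference in label shape, not the values.

```python
import numpy as np

# Illustrative sizes (assumed for this sketch).
batch_size, timesteps, feature_dim = 4, 20, 128

# The input is the same in both formulations: one feature vector per frame.
clips = np.zeros((batch_size, timesteps, feature_dim), dtype="float32")

# many-to-one: a single integer class id per clip.
clip_labels = np.zeros((batch_size,), dtype="int32")

# many-to-many: one integer class id per timestep of every clip.
frame_labels = np.zeros((batch_size, timesteps), dtype="int32")

print(clips.shape, clip_labels.shape, frame_labels.shape)
```

Only the label array changes between the two setups, which is why the choice has to be made from the dataset rather than from the input data.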

What Makes an LSTM Many-to-Many in TensorFlow

In Keras, the crucial switch is return_sequences=True.

python

import tensorflow as tf

lstm = tf.keras.layers.LSTM(64, return_sequences=True)

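With return_sequences=False (the default), the layer returns only the final output vector, shaped (batch_size, units). With return_sequences=True, it returns an output at every timestep, shaped (batch_size, timesteps, units).

That is why many-to-many models use return_sequences=True on every LSTM layer in the stack: intermediate layers need it so the next LSTM still receives a sequence, and the last one needs it to feed the timestep-wise output head.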
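A quick way to see the difference is to run the same dummy batch through two LSTM layers and compare output shapes. This is a minimal sketch; the batch of 2 clips and the random values are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

# Dummy batch (assumed sizes): 2 clips, 20 timesteps, 128 features per frame.
batch = tf.constant(np.random.rand(2, 20, 128).astype("float32"))

# Returns only the output at the final timestep.
last_only = tf.keras.layers.LSTM(64, return_sequences=False)(batch)

# Returns one output vector per timestep.
per_step = tf.keras.layers.LSTM(64, return_sequences=True)(batch)

print(last_only.shape)  # (2, 64)
print(per_step.shape)   # (2, 20, 64)
```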

A Minimal Many-to-Many Example

Here is a simple model that takes a sequence of frame features and predicts a class for every timestep.

python

import tensorflow as tf

timesteps = 20
feature_dim = 128
num_classes = 5

inputs = tf.keras.Input(shape=(timesteps, feature_dim))
# Masking() uses mask_value=0.0 by default, so all-zero timesteps count as padding.
x = tf.keras.layers.Masking()(inputs)
# return_sequences=True keeps one 64-dimensional output per timestep.
x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
# Apply the same softmax classifier independently at every timestep.
outputs = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(num_classes, activation="softmax")
)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.summary()

The output shape is:

text

(batch_size, timesteps, num_classes)

So the target tensor must also have one label per timestep, typically shaped like:

text

(batch_size, timesteps)
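The (batch_size, timesteps) target shape applies to integer labels, which is what sparse_categorical_crossentropy expects. If one-hot targets are preferred, the loss would be categorical_crossentropy and the targets gain a class axis. The sketch below shows the conversion with NumPy; the random labels are an assumption for illustration.

```python
import numpy as np

num_classes = 5

# Integer targets, shape (batch_size, timesteps), for
# sparse_categorical_crossentropy.
y_sparse = np.random.randint(0, num_classes, size=(32, 20))

# One-hot targets, shape (batch_size, timesteps, num_classes), for
# categorical_crossentropy. Indexing an identity matrix by class id
# picks the matching one-hot row for every timestep.
y_one_hot = np.eye(num_classes, dtype="float32")[y_sparse]

print(y_sparse.shape, y_one_hot.shape)
```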

Example Training Data Shape

Here is a runnable toy training example:

python

import numpy as np
import tensorflow as tf

timesteps = 20
feature_dim = 128
num_classes = 5

# Random stand-in data: 32 clips with one integer label per timestep.
x_train = np.random.rand(32, timesteps, feature_dim).astype("float32")
y_train = np.random.randint(0, num_classes, size=(32, timesteps)).astype("int32")

inputs = tf.keras.Input(shape=(timesteps, feature_dim))
x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
outputs = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(num_classes, activation="softmax")
)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1, verbose=0)

This is a true many-to-many setup because the model predicts a class distribution at each timestep.
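To turn those per-timestep distributions into labels, one option is an argmax over the class axis. The sketch below uses a random softmax tensor as a stand-in for the model's predictions (an assumption, so the snippet runs without a trained model); with a real model the array would come from model.predict.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model predictions, shape (batch_size, timesteps, num_classes).
logits = rng.normal(size=(2, 20, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Most likely class at every timestep, shape (batch_size, timesteps).
frame_classes = probs.argmax(axis=-1)

# Probability assigned to that class, same shape.
frame_confidence = probs.max(axis=-1)

print(frame_classes.shape, frame_confidence.shape)
```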

Whole-Video Classification Usually Uses Many-to-One

A lot of questions use the phrase “video classification,” but actually mean one class per clip. In that case, you usually want the last LSTM output only.

python

import tensorflow as tf

inputs = tf.keras.Input(shape=(20, 128))
# return_sequences=False keeps only the final LSTM output for the clip.
x = tf.keras.layers.LSTM(64, return_sequences=False)(inputs)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)

This produces one prediction per video, not one prediction per frame.

So before building the model, decide whether your label is attached to the whole clip or to each timestep.

Where the Frame Features Come From

In real video pipelines, raw frames are often too heavy to feed directly into an LSTM. A common pattern is:

extract frames
run each frame through a CNN or vision backbone
feed the resulting feature vectors into the LSTM

That means the LSTM models temporal structure, while the CNN models spatial structure.

For example, you might use a pretrained image model to encode each frame into a feature vector of length 128 or 512, then train the LSTM on those sequences.
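The pipeline above might be sketched as follows. MobileNetV2 is used here only as one possible backbone, and the 96x96 frame size and random frames are assumptions for illustration. The sketch passes weights=None so it runs offline; a real pipeline would typically load pretrained weights instead. With this backbone the feature length is 1280 rather than 128 or 512.

```python
import numpy as np
import tensorflow as tf

# weights=None keeps the sketch offline; weights="imagenet" would load
# pretrained weights in a real pipeline.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights=None, pooling="avg"
)

# One clip of 20 RGB frames, assumed already resized and scaled to [-1, 1].
frames = np.random.uniform(-1.0, 1.0, size=(20, 96, 96, 3)).astype("float32")

# Each frame is an independent image for the CNN, so the time axis
# serves as the batch axis here.
features = backbone.predict(frames, verbose=0)  # (20, 1280)

# Add a batch axis to get the (batch, timesteps, feature_dim) layout
# that the LSTM expects.
lstm_input = features[np.newaxis, ...]  # (1, 20, 1280)
print(lstm_input.shape)
```

Extracting features once and caching them keeps the backbone out of the LSTM training loop, at the cost of not fine-tuning the spatial encoder.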

Padding and Variable-Length Videos

Real videos often have different lengths. Keras can handle this with padding plus masking.

python

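import tensorflow as tf

# shape=(None, 128) leaves the time axis unspecified, so padded length can vary per batch.
inputs = tf.keras.Input(shape=(None, 128))
x = tf.keras.layers.Masking(mask_value=0.0)(inputs)
x = tf.keras.layers.LSTM(64, return_sequences=True)(x)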

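Masking helps the LSTM ignore padded timesteps rather than treating them as real frames. A timestep is masked only when every feature in it equals mask_value, so padding should consist of vectors filled entirely with that value.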

This is important if clips are batched to a common length.
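Batching to a common length can be sketched with plain NumPy. The helper name pad_clips and the clip lengths are hypothetical, introduced only for this example; it pads each clip with all-zero vectors at the end and also returns a boolean array marking the real timesteps.

```python
import numpy as np

def pad_clips(clips, pad_value=0.0):
    """Pad a list of (length_i, feature_dim) arrays to a common length.

    Hypothetical helper for illustration, not part of Keras.
    """
    max_len = max(len(c) for c in clips)
    feature_dim = clips[0].shape[1]
    batch = np.full((len(clips), max_len, feature_dim), pad_value, dtype="float32")
    valid = np.zeros((len(clips), max_len), dtype=bool)
    for i, clip in enumerate(clips):
        batch[i, : len(clip)] = clip   # real frames first
        valid[i, : len(clip)] = True   # remaining timesteps stay padding
    return batch, valid

# Two clips of different lengths (assumed sizes).
clips = [np.ones((12, 128), dtype="float32"), np.ones((20, 128), dtype="float32")]
batch, valid = pad_clips(clips)
print(batch.shape, valid.sum(axis=1))  # (2, 20, 128) [12 20]
```

The padded batch can then be fed to a model that starts with a Masking layer using the same pad value. Per-timestep labels need padding to the same length as well.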

Common Pitfalls

One common mistake is building a many-to-many model when the dataset has only one label per video. That causes shape mismatches or, worse, a model that solves the wrong task.

Another issue is forgetting return_sequences=True. Without it, the LSTM outputs only the last timestep and cannot feed a timestep-wise classifier correctly.

Developers also sometimes feed raw frames directly into an LSTM without a spatial encoder, which usually makes training harder and less efficient than using frame features.

Finally, be careful with target shape. For many-to-many classification, the labels must be aligned per timestep, not per video.
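A small check before training can catch this class of mistake early. The helper below, check_targets, is a hypothetical name for a strict sketch that assumes integer (sparse) labels: the target shape should equal the model output shape with the class axis removed.

```python
def check_targets(output_shape, target_shape):
    """Raise ValueError if sparse integer targets do not match the model output.

    Hypothetical helper for illustration; assumes the last output axis is the class axis.
    """
    expected = tuple(output_shape[:-1])  # drop the class axis
    if tuple(target_shape) != expected:
        raise ValueError(
            f"expected targets shaped {expected}, got {tuple(target_shape)}"
        )

check_targets((32, 20, 5), (32, 20))  # many-to-many: passes
check_targets((32, 5), (32,))         # many-to-one: passes

try:
    # Per-video labels paired with a per-timestep model output.
    check_targets((32, 20, 5), (32,))
except ValueError as err:
    print(err)
```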

Summary

A many-to-many LSTM is appropriate when you need one prediction per timestep in the video sequence.
In Keras, return_sequences=True is the key setting that enables timestep-wise outputs.
Whole-video classification is usually many-to-one, not many-to-many.
In practice, frame features from a CNN are often a better LSTM input than raw pixels.
Match the model output shape to the labeling scheme before training.