Implementing a many-to-many LSTM in TensorFlow?

Introduction

A many-to-many LSTM takes a sequence as input and produces a sequence as output. The most common version is sequence labeling, where every input time step has a corresponding output time step. The core implementation detail in TensorFlow is therefore making the recurrent layer return the full sequence of hidden states instead of only the last one.

Understand the Output Shape First

Before writing model code, define the shapes clearly. For aligned sequence labeling, the tensor shapes usually look like this:

input: (batch, timesteps, features)
output: (batch, timesteps, classes) for classification
output: (batch, timesteps, target_features) for regression

That is different from a many-to-one classifier, where the output would be only (batch, classes). In Keras, the key switch is return_sequences=True on the recurrent layer.
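As a quick check of the shapes above, the sketch below calls two untrained LSTM layers on a dummy batch. The sizes (2 samples, 5 time steps, 4 features, 8 units) and the name x_demo are arbitrary values chosen for illustration, not values from the original text.

```python
import numpy as np
from tensorflow import keras

# Illustrative sizes only: 2 samples, 5 time steps, 4 features per step.
batch, timesteps, features = 2, 5, 4
x_demo = np.zeros((batch, timesteps, features), dtype="float32")

# return_sequences=True keeps one output vector per time step.
seq_out = keras.layers.LSTM(8, return_sequences=True)(x_demo)

# The default (return_sequences=False) keeps only the last step's output.
last_out = keras.layers.LSTM(8)(x_demo)

print(seq_out.shape)   # (2, 5, 8)
print(last_out.shape)  # (2, 8)
```

Only the first layer preserves the time axis, which is what a per-step output layer needs.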

A Minimal Many-to-Many Model

For a simple sequence labeling problem, one LSTM followed by a Dense layer works because Keras applies the dense layer to each time step when the input is three-dimensional.

```python
import numpy as np
from tensorflow import keras

num_samples = 256
timesteps = 12
input_features = 4
num_classes = 3

# Random stand-in data: one integer class label per time step.
x = np.random.rand(num_samples, timesteps, input_features).astype("float32")
y = np.random.randint(0, num_classes, size=(num_samples, timesteps))

model = keras.Sequential([
    keras.layers.Input(shape=(timesteps, input_features)),
    # Skip time steps whose features are all 0.0 (the padding value).
    keras.layers.Masking(mask_value=0.0),
    # Keep the output for every time step, not just the last one.
    keras.layers.LSTM(64, return_sequences=True),
    # Applied to each time step: output is (batch, timesteps, num_classes).
    keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(x, y, epochs=3, batch_size=32)
```

This is a true many-to-many model because the prediction keeps the time dimension: its shape is (batch, timesteps, num_classes). If you remove return_sequences=True, the LSTM collapses the sequence into one final vector, and the output no longer lines up with the per-step targets.

Use `TimeDistributed` Only When It Adds Clarity

Older examples often wrap the output layer with TimeDistributed. In current TensorFlow Keras, a plain Dense after an LSTM with sequence output is usually enough: Dense operates on the last axis of its input, so the same weights are applied independently at every time step.

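```python
# Reuses timesteps, input_features, and num_classes from the minimal model.
inputs = keras.Input(shape=(timesteps, input_features))
# Named h rather than x so the training array x is not overwritten.
h = keras.layers.LSTM(64, return_sequences=True)(inputs)
outputs = keras.layers.TimeDistributed(
    keras.layers.Dense(num_classes, activation="softmax")
)(h)
model = keras.Model(inputs, outputs)
```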

This is valid, but it is not mandatory for the simple dense-per-step case. Use it when it makes the intended per-step transformation clearer to your team.
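One way to convince yourself that the two forms agree is to copy the weights from a plain Dense layer into a TimeDistributed(Dense) wrapper and compare the outputs. This is a minimal sketch: the array h is random data standing in for LSTM sequence output, and the names plain, wrapped_layer, and same are made up for the illustration.

```python
import numpy as np
from tensorflow import keras

# Random stand-in for LSTM sequence output: (batch=2, timesteps=5, units=8).
h = np.random.rand(2, 5, 8).astype("float32")

# Plain Dense: acts on the last axis, so each time step is transformed alone.
dense = keras.layers.Dense(3)
plain = dense(h)

# TimeDistributed(Dense): call once to build it, then copy the same weights in.
wrapped_layer = keras.layers.TimeDistributed(keras.layers.Dense(3))
wrapped_layer(h)
wrapped_layer.set_weights(dense.get_weights())
wrapped = wrapped_layer(h)

# With identical weights, both layers should give the same per-step outputs.
same = np.allclose(np.asarray(plain), np.asarray(wrapped), atol=1e-5)
print(plain.shape, same)
```

If same is True, the wrapper adds readability rather than different behavior for this dense-per-step case.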

Stack LSTMs Carefully

If you stack multiple recurrent layers, all intermediate LSTM layers must also return sequences. Otherwise the next recurrent layer has nothing sequence-shaped to consume.

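```python
# Reuses timesteps, input_features, and num_classes from the minimal model.
inputs = keras.Input(shape=(timesteps, input_features))
# Every LSTM that feeds another LSTM must return the full sequence.
h = keras.layers.LSTM(128, return_sequences=True)(inputs)
h = keras.layers.Dropout(0.2)(h)
# The last LSTM also returns sequences because the output is per time step.
h = keras.layers.LSTM(64, return_sequences=True)(h)
outputs = keras.layers.Dense(num_classes, activation="softmax")(h)
model = keras.Model(inputs, outputs)
```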

This is a common place where shape errors appear. If the second LSTM says it expected three dimensions but received two, an earlier layer stopped returning the full sequence.
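The failure mode can be reproduced on purpose. The sketch below, with arbitrary layer sizes, leaves return_sequences off on the first LSTM and catches the resulting error. It assumes Keras reports the rank mismatch as a ValueError; the exact message wording varies between versions, so it is printed rather than quoted here.

```python
from tensorflow import keras

inputs = keras.Input(shape=(12, 4))

# return_sequences defaults to False, so this output is (batch, 32): no time axis.
collapsed = keras.layers.LSTM(32)(inputs)

try:
    # The second LSTM needs (batch, timesteps, features) and gets (batch, 32).
    keras.layers.LSTM(16, return_sequences=True)(collapsed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Setting return_sequences=True on the first layer makes the second call succeed.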

Handle Variable-Length Sequences

Real sequence problems often have different lengths per example. Pad them to a common length and use masking so the padded steps do not affect training.

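```python
from tensorflow import keras

# Two sequences with 2 features per step but different lengths (2 and 3 steps).
sequences = [
    [[1.0, 0.2], [0.5, 0.7]],
    [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
]

# Pad at the end with zeros (the default value) up to the longest sequence.
x = keras.utils.pad_sequences(sequences, padding="post", dtype="float32")
print(x.shape)  # (2, 3, 2)
```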

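Masking is especially important for many-to-many tasks because every time step contributes to the loss. If padded steps are not masked, the loss and gradients include positions that carry no real data, so the model is partly trained on padding. The Masking layer skips a step only when all of its features equal mask_value, so the padding value must match it.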
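To make the effect concrete, here is a NumPy-only sketch of what excluding padded steps does to an averaged loss. The per-step loss values are invented for the example, and the normalization Keras applies internally may differ from this simple mean, so treat it as a conceptual illustration rather than a description of the library's internals.

```python
import numpy as np

# Two padded sequences, 3 time steps, 2 features; the first has only 2 real steps.
x = np.array([
    [[1.0, 0.2], [0.5, 0.7], [0.0, 0.0]],
    [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
], dtype="float32")

# A step is treated as padding when every feature equals the pad value 0.0.
mask = np.any(x != 0.0, axis=-1)  # shape (2, 3), False at the padded step

# Hypothetical per-step losses; the padded step happens to have a large loss.
per_step_loss = np.array([
    [0.5, 0.3, 2.0],
    [0.4, 0.6, 0.2],
])

# Averaging over all 6 steps lets the padded step dominate.
naive_mean = per_step_loss.mean()

# Averaging over the 5 real steps ignores it.
masked_mean = (per_step_loss * mask).sum() / mask.sum()

print(round(float(naive_mean), 4), round(float(masked_mean), 4))  # 0.6667 0.4
```

The gap between the two means is the part of the training signal that would otherwise come from padding.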

Sequence-to-Sequence Is a Different Many-to-Many Pattern

Some people say many-to-many when they mean encoder-decoder translation, where the output length may differ from the input length. That architecture is still sequence-to-sequence, but it is not the same as the aligned time-step labeling model shown above.

When input and output lengths differ, you typically need an encoder-decoder setup, teacher forcing during training, and a decoding loop at inference time. Do not force that problem into a single aligned LSTM unless the task truly has one label per input step.
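For contrast with the aligned model, the following is a minimal sketch of an encoder-decoder wiring for a regression-style target, trained with teacher forcing. All names and sizes (src_len, tgt_len, units, the all-zero start frame) are assumptions made for the illustration, the data is random, and the step-by-step decoding loop needed at inference time is deliberately left out.

```python
import numpy as np
from tensorflow import keras

# Illustrative sizes: input and output lengths differ on purpose.
src_len, src_features = 7, 4
tgt_len, tgt_features = 5, 3
units = 16

# Encoder: read the whole source sequence and keep only the final LSTM state.
enc_in = keras.Input(shape=(src_len, src_features))
_, state_h, state_c = keras.layers.LSTM(units, return_state=True)(enc_in)

# Decoder: start from the encoder state and emit one vector per target step.
dec_in = keras.Input(shape=(tgt_len, tgt_features))
dec_seq = keras.layers.LSTM(units, return_sequences=True)(
    dec_in, initial_state=[state_h, state_c]
)
dec_out = keras.layers.Dense(tgt_features)(dec_seq)

seq2seq = keras.Model([enc_in, dec_in], dec_out)
seq2seq.compile(optimizer="adam", loss="mse")

# Random stand-in data.
sources = np.random.rand(8, src_len, src_features).astype("float32")
targets = np.random.rand(8, tgt_len, tgt_features).astype("float32")

# Teacher forcing: the decoder input is the target shifted right by one step,
# with an all-zero frame standing in for a start token.
decoder_inputs = np.concatenate(
    [np.zeros_like(targets[:, :1]), targets[:, :-1]], axis=1
)

seq2seq.fit([sources, decoder_inputs], targets, epochs=1, verbose=0)
print(seq2seq.output_shape)  # (None, 5, 3)
```

The output length follows the decoder input (5 steps), not the source (7 steps), which is exactly what the aligned single-LSTM model cannot express.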

Common Pitfalls

The biggest mistake is forgetting return_sequences=True, which turns the model into many-to-one: the output shrinks to (batch, classes) and no longer lines up with per-step targets. Another common issue is using the wrong target shape. For per-step classification with sparse_categorical_crossentropy, the target should usually be (batch, timesteps), with one integer class label per time step, not one scalar per sample.
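The target shape can be checked without building a model. This small sketch feeds integer labels of shape (batch, timesteps) and made-up uniform predictions of shape (batch, timesteps, classes) to the Keras loss function; the sizes and values are arbitrary.

```python
import numpy as np
from tensorflow import keras

batch, timesteps, num_classes = 2, 3, 4

# One integer class label per time step: shape (batch, timesteps).
y_true = np.array([[0, 1, 3], [2, 2, 1]])

# Made-up predictions: a uniform distribution over the 4 classes at every step.
y_pred = np.full((batch, timesteps, num_classes), 0.25, dtype="float32")

# The loss is computed per time step, so the result keeps the time axis.
per_step = keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
print(per_step.shape)  # (2, 3)
```

Each entry is the cross-entropy for one time step, which is why the labels need a time axis of their own.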

Masking is also easy to skip, especially with padded data. That produces models that appear to train but spend capacity fitting padding artifacts.

Finally, be precise about task type. Sequence labeling, tagging, and frame-wise regression are aligned many-to-many tasks. Translation and summarization are not the same wiring pattern.

Summary

A many-to-many LSTM keeps the time dimension from input through output.
In Keras, return_sequences=True is the critical setting.
A Dense layer after the LSTM can produce one prediction per time step.
Use masking when sequences are padded to a common length.
Distinguish aligned sequence labeling from encoder-decoder sequence-to-sequence problems.
