Introduction
TensorFlow provides recurrent neural network (RNN) layers through tf.keras.layers: SimpleRNN, LSTM, and GRU. For sequence-to-sequence (Seq2Seq) models, you build an encoder-decoder architecture: set return_state=True on the encoder's RNN layer and pass its final state to the decoder as initial_state. TensorFlow 2.x uses Keras as its primary API. The legacy tf.contrib.seq2seq module was removed along with the rest of tf.contrib, and tf.nn.dynamic_rnn survives only as the deprecated tf.compat.v1.nn.dynamic_rnn. Modern Seq2Seq implementations use the functional or subclassing API, with attention from tf.keras.layers.Attention or tfa.seq2seq (TensorFlow Addons, which is no longer actively developed).
SimpleRNN
import tensorflow as tf

# Basic SimpleRNN: a single recurrent layer
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 32)),  # (timesteps, features)
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Key parameters:
#   units=64            number of hidden units
#   activation='tanh'   default activation function
#   return_sequences    False: output shape (batch, units)
#                       True:  output shape (batch, timesteps, units)
#   return_state        also return the final hidden state

# return_sequences=True for stacking RNN layers
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100, 32)),
    tf.keras.layers.SimpleRNN(64, return_sequences=True),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(10)
])
SimpleRNN applies the recurrence h_t = tanh(W_x * x_t + W_h * h_{t-1} + b). Because the gradient is multiplied through W_h and the tanh derivative at every step, it tends to vanish on long sequences. As a rule of thumb, use LSTM or GRU for sequences longer than roughly 20-30 timesteps.
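The recurrence above can be sketched in plain NumPy to make the role of return_sequences concrete. This is a minimal illustration, not the Keras implementation: the function name simple_rnn_forward and the random weights are assumptions made for the example, and it uses the row-vector convention (x_t @ W_x) rather than the column-vector form in the formula.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b):
    """Apply h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b) over one sequence.

    x has shape (timesteps, features); the result has shape (timesteps, units).
    """
    units = W_h.shape[0]
    h = np.zeros(units)  # initial state h_0 is all zeros
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes matching the example above: 100 timesteps, 32 features, 64 units
rng = np.random.default_rng(0)
timesteps, features, units = 100, 32, 64
x = rng.normal(size=(timesteps, features))
W_x = rng.normal(scale=0.1, size=(features, units))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(x, W_x, W_h, b)  # like return_sequences=True
last_state = all_states[-1]                      # like return_sequences=False
print(all_states.shape, last_state.shape)
```

Keeping every row of all_states corresponds to return_sequences=True; keeping only the last row corresponds to the default, which is why a stacked RNN layer needs the former.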
LSTM (Long Short-Term Memory)
# LSTM: handles long-term dependencies
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 10)),  # (timesteps, features)
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(1)
])

# Bidirectional LSTM: processes the sequence in both directions
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 10)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1)
])

# LSTM with return_state (needed for Seq2Seq encoder)
inputs = tf.keras.Input(shape=(50, 10))
lstm_out, state_h, state_c = tf.keras.layers.LSTM(
    128, return_state=True
)(inputs)
# lstm_out: final output (batch, 128); equals state_h when return_sequences=False
# state_h: final hidden state (batch, 128)
# state_c: final cell state (batch, 128)
LSTM adds a cell state that carries information across many timesteps via gating mechanisms (forget gate, input gate, output gate). return_state=True returns both the hidden state and the cell state, which is essential for initializing the Seq2Seq decoder.
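To show how the gates update the two states, here is a minimal NumPy sketch of a single LSTM timestep. The function name lstm_step, the weight names W, U, b, and the gate ordering (input, forget, candidate, output) are assumptions of this sketch, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep.

    W: (features, 4 * units), U: (units, 4 * units), b: (4 * units,)
    Returns the new hidden state h and cell state c, each of shape (units,).
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)                   # input, forget, candidate, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates lie in (0, 1)
    g = np.tanh(g)                                # candidate cell values
    c = f * c_prev + i * g                        # forget old content, write new content
    h = o * np.tanh(c)                            # expose a gated view of the cell
    return h, c

# Hypothetical sizes matching the example above: 50 timesteps, 10 features, 128 units
rng = np.random.default_rng(0)
timesteps, features, units = 50, 10, 128
x = rng.normal(size=(timesteps, features))
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in x:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

After the loop, h and c play the roles of state_h and state_c: the pair a Seq2Seq encoder hands to the decoder.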
GRU (Gated Recurrent Unit)
# GRU: simpler than LSTM, often comparable performance
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 10)),  # (timesteps, features)
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1)
])

# GRU with dropout and recurrent dropout
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50, 10)),
    tf.keras.layers.GRU(
        128,
        dropout=0.2,            # Input dropout
        recurrent_dropout=0.2,  # Recurrent connection dropout (disables the cuDNN kernel)
        return_sequences=True
    ),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1)
])
GRU has two gates (reset and update) instead of LSTM's three, making it faster to train with fewer parameters. Performance is often similar to LSTM — try both and compare on your dataset.
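The parameter difference can be estimated with a little arithmetic. The sketch below counts trainable weights for one layer, assuming the Keras defaults as I understand them (use_bias=True, and reset_after=True for GRU, which adds a second bias vector per block). The helper names are made up for this example.

```python
def lstm_param_count(units, input_dim):
    # Four weight blocks (input gate, forget gate, cell candidate, output gate),
    # each with input weights, recurrent weights, and one bias vector.
    return 4 * (units * (input_dim + units) + units)

def gru_param_count(units, input_dim, reset_after=True):
    # Three weight blocks (update gate, reset gate, candidate state).
    # With reset_after=True each block carries two bias vectors instead of one.
    bias = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias)

# Layers from the examples above: 128 units on 10 input features
print(lstm_param_count(128, 10))  # 71168
print(gru_param_count(128, 10))   # 53760
```

For these sizes the GRU has roughly a quarter fewer parameters than the LSTM. You can compare the numbers against model.summary() for your own layers.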
Seq2Seq Encoder-Decoder
import tensorflow as tf

# num_encoder_features, num_decoder_features, and the *_data arrays are
# placeholders for your own dataset.

# Encoder
encoder_inputs = tf.keras.Input(shape=(None, num_encoder_features))
encoder_lstm = tf.keras.layers.LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder, initialized with encoder states
decoder_inputs = tf.keras.Input(shape=(None, num_decoder_features))
decoder_lstm = tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(num_decoder_features, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Full model
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Training with teacher forcing (decoder input = shifted target)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=64,
    epochs=100
)
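The encoder processes the input sequence and compresses it into a fixed-size context: the final hidden state and, for an LSTM, the final cell state. The decoder starts from this context and generates the output sequence. During training, "teacher forcing" feeds the true previous token as decoder input, so the whole target sequence is processed in one pass; at inference time the decoder runs one timestep at a time on its own predictions.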
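Teacher forcing comes down to how the decoder arrays are built: the decoder input is the target sequence without its last token, and the decoder target is the same sequence without its first token. A minimal sketch for one sequence, assuming integer token ids with made-up START and STOP ids and a hypothetical helper name:

```python
import numpy as np

def make_decoder_arrays(target_ids, num_tokens):
    """Build one-hot decoder input and target from a full target sequence.

    target_ids includes the start and stop tokens. Both returned arrays have
    shape (len(target_ids) - 1, num_tokens).
    """
    one_hot = np.eye(num_tokens, dtype="float32")[target_ids]
    decoder_input = one_hot[:-1]   # start token ... last real token
    decoder_target = one_hot[1:]   # first real token ... stop token
    return decoder_input, decoder_target

START, STOP = 0, 1                     # assumed ids for this example
target_ids = [START, 5, 7, 3, STOP]
dec_in, dec_tgt = make_decoder_arrays(target_ids, num_tokens=8)

print(dec_in.argmax(-1))   # [0 5 7 3]
print(dec_tgt.argmax(-1))  # [5 7 3 1]
```

At each position the decoder sees the true previous token and is trained to predict the next one. Stacking such pairs over a padded batch gives arrays shaped like decoder_input_data and decoder_target_data.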
Seq2Seq with Attention
import tensorflow as tf

# max_encoder_len, max_decoder_len, encoder_features, decoder_features, and
# vocab_size are placeholders for your own dataset.

# Encoder
encoder_inputs = tf.keras.Input(shape=(max_encoder_len, encoder_features))
encoder_lstm = tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
encoder_output, state_h, state_c = encoder_lstm(encoder_inputs)

# Decoder
decoder_inputs = tf.keras.Input(shape=(max_decoder_len, decoder_features))
decoder_lstm = tf.keras.layers.LSTM(256, return_sequences=True)
decoder_output = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])

# Attention layer: attends to encoder outputs at each decoder step
# (inputs are [query, value]; the decoder output is the query)
attention = tf.keras.layers.Attention()
context = attention([decoder_output, encoder_output])

# Concatenate attention context with decoder output
concat = tf.keras.layers.Concatenate()([decoder_output, context])
output = tf.keras.layers.Dense(vocab_size, activation='softmax')(concat)

model = tf.keras.Model([encoder_inputs, decoder_inputs], output)
# sparse_categorical_crossentropy expects integer token ids as targets
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
Attention allows the decoder to focus on different parts of the encoder output at each timestep, rather than relying on a single fixed-length context vector. This typically improves results on longer sequences, where a single vector becomes an information bottleneck.
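The computation behind dot-product attention is short enough to write out. The NumPy sketch below handles one example without a batch dimension and uses random arrays in place of real LSTM outputs. It mirrors what tf.keras.layers.Attention does with its default settings as I understand them (unscaled dot-product scores, with the value also serving as the key); the function name is made up for the example.

```python
import numpy as np

def dot_product_attention(query, value):
    """query: (dec_steps, units), value: (enc_steps, units).

    Returns the context (dec_steps, units) and the weights (dec_steps, enc_steps).
    """
    scores = query @ value.T                               # similarity of each decoder step to each encoder step
    scores = scores - scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over encoder steps
    context = weights @ value                              # weighted average of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
decoder_output = rng.normal(size=(7, 256))    # 7 decoder steps (stand-in values)
encoder_output = rng.normal(size=(12, 256))   # 12 encoder steps (stand-in values)

context, weights = dot_product_attention(decoder_output, encoder_output)
print(context.shape, weights.shape)
```

Each row of weights sums to 1 and says how much that decoder step draws from each encoder step. The context has the same shape as the decoder output, which is why the two can be concatenated along the last axis.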
Inference (Prediction) Loop
import numpy as np

# Reuses encoder_inputs, encoder_states, decoder_inputs, decoder_lstm, and
# decoder_dense from the Seq2Seq Encoder-Decoder model (the one without attention).
# max_decoder_len, start_token_index, and stop_token_index are placeholders.

# Encoder model for inference: input sequence -> [state_h, state_c]
encoder_model = tf.keras.Model(encoder_inputs, encoder_states)

# Decoder model for inference (one step at a time)
decoder_state_input_h = tf.keras.Input(shape=(256,))
decoder_state_input_c = tf.keras.Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_out, dec_h, dec_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
dec_out = decoder_dense(dec_out)  # token probabilities rather than raw LSTM output
decoder_model = tf.keras.Model(
    [decoder_inputs] + decoder_states_inputs,
    [dec_out, dec_h, dec_c]
)

# Inference loop (greedy decoding)
def decode_sequence(input_seq):
    h, c = encoder_model.predict(input_seq, verbose=0)
    target_seq = np.zeros((1, 1, num_decoder_features))
    target_seq[0, 0, start_token_index] = 1.0

    decoded = []
    for _ in range(max_decoder_len):
        output, h, c = decoder_model.predict([target_seq, h, c], verbose=0)
        token_index = int(np.argmax(output[0, -1, :]))
        if token_index == stop_token_index:
            break
        decoded.append(token_index)
        # Feed the predicted token back in as the next decoder input
        target_seq = np.zeros((1, 1, num_decoder_features))
        target_seq[0, 0, token_index] = 1.0

    return decoded
Common Pitfalls
Not setting return_sequences=True when stacking RNN layers: Each RNN layer expects a 3D input (batch, timesteps, features). Without return_sequences=True, the layer outputs 2D (batch, units), and the next RNN layer raises a shape error. Only the final RNN layer in a stack can use return_sequences=False.
Using recurrent_dropout > 0 with cuDNN: TensorFlow's cuDNN-optimized LSTM/GRU kernels do not support recurrent_dropout. Setting it to a nonzero value makes the layer fall back to the slower generic implementation. TensorFlow 2.x typically logs a warning when this happens, but it is easy to miss among the startup messages. The same fallback applies if you change activation, recurrent_activation, unroll, or use_bias from their defaults. Use regular dropout for GPU training or accept the performance cost.
Forgetting teacher forcing during training: The Seq2Seq decoder must receive the ground truth previous token during training (teacher forcing). Feeding the decoder's own predictions during training causes slow convergence because early predictions are random noise.
Ignoring sequence padding and masking: Variable-length sequences must be padded and masked so the model does not learn from padding tokens. Use tf.keras.layers.Masking or pass mask_zero=True to the embedding layer, and ensure downstream layers propagate the mask.
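Using legacy tf.nn.dynamic_rnn or tf.contrib.seq2seq: tf.contrib was removed in TensorFlow 2.x, and tf.nn.dynamic_rnn is available only as the deprecated tf.compat.v1.nn.dynamic_rnn. Use tf.keras.layers.LSTM/GRU with return_state=True for encoders and the functional API for Seq2Seq architectures. For beam search decoding, tfa.seq2seq.BeamSearchDecoder from TensorFlow Addons is an option, but Addons is no longer actively developed, so check that it supports your TensorFlow version first.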
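To illustrate the padding and masking pitfall, the NumPy sketch below pads a batch of token-id sequences and then averages a per-token loss over real tokens only. It shows the idea rather than how Keras propagates masks internally. The helper names, the use of 0 as the padding id, and the loss values are assumptions of this example.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length token-id lists into a (batch, max_len) array."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype="int64")
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch

def masked_mean(per_token_loss, token_ids, pad_id=0):
    """Average the loss over non-padding positions only."""
    mask = (token_ids != pad_id).astype(per_token_loss.dtype)
    return (per_token_loss * mask).sum() / mask.sum()

batch = pad_batch([[4, 9, 2], [7, 3], [5]])
print(batch)

# Pretend every real token has loss 1.0 and every padded position has loss 100.0
per_token_loss = np.ones(batch.shape, dtype="float64")
per_token_loss[batch == 0] = 100.0

print(per_token_loss.mean())               # 34.0, dominated by padding
print(masked_mean(per_token_loss, batch))  # 1.0, padding ignored
```

Without the mask, the three padded positions dominate the average. That is the kind of distortion mask_zero=True or a Masking layer is meant to prevent.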
Summary
Use tf.keras.layers.LSTM or GRU for recurrent layers — SimpleRNN suffers from vanishing gradients
Set return_sequences=True when stacking RNN layers; use return_state=True for Seq2Seq encoders
Build Seq2Seq by passing encoder final states as initial_state to the decoder LSTM
Add tf.keras.layers.Attention between encoder and decoder for better long-sequence performance
Use Bidirectional wrapper for tasks where future context matters (classification, NER)
During inference, run the decoder one step at a time in a loop, feeding each output as the next input