API Reference for RNN and Seq2Seq Models in TensorFlow

Introduction

TensorFlow provides recurrent neural network (RNN) layers through tf.keras.layers: SimpleRNN, LSTM, and GRU. For sequence-to-sequence (Seq2Seq) models, you build an encoder-decoder architecture: create the encoder RNN with return_state=True and pass its final state to the decoder as initial_state. TensorFlow 2.x uses Keras as the primary API; tf.contrib.seq2seq was removed along with the rest of tf.contrib, and tf.nn.dynamic_rnn remains only under tf.compat.v1. Modern Seq2Seq implementations use the functional or subclassing API with attention mechanisms from tf.keras.layers.Attention or tfa.seq2seq (TensorFlow Addons, which is no longer actively developed).

SimpleRNN

python
import tensorflow as tf

# Basic SimpleRNN: single hidden layer
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(100, 32)),  # (timesteps, features)
    tf.keras.layers.Dense(10, activation='softmax')
])

# Key parameters:
# units=64          - number of hidden units
# activation='tanh' - default activation function
# return_sequences  - False: output shape (batch, units)
#                     True:  output shape (batch, timesteps, units)
# return_state      - also return the final hidden state

# return_sequences=True for stacking RNN layers
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, return_sequences=True, input_shape=(100, 32)),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(10)
])

SimpleRNN applies the recurrence h_t = tanh(W_x * x_t + W_h * h_{t-1} + b). Because the gradient is multiplied through this recurrence at every timestep, it tends to vanish on long sequences. As a rule of thumb, prefer LSTM or GRU once sequences exceed roughly 20-30 timesteps.
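
The recurrence can be sketched in plain NumPy to make the two output modes concrete. This is a minimal illustration rather than Keras's implementation: the weight names follow the formula above, while the helper simple_rnn_forward and the random initialization are assumptions made for the example.

```python
import numpy as np

def simple_rnn_forward(x, W_x, W_h, b):
    """Run h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b) over one sequence.

    x has shape (timesteps, features); the result holds every hidden
    state and has shape (timesteps, units).
    """
    units = W_h.shape[0]
    h = np.zeros(units)              # h_0 starts at zero
    states = []
    for x_t in x:                    # iterate over timesteps
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, features, units = 100, 32, 64
x = rng.normal(size=(timesteps, features))
W_x = rng.normal(scale=0.1, size=(features, units))
W_h = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(x, W_x, W_h, b)  # analogue of return_sequences=True
last_state = all_states[-1]                      # analogue of return_sequences=False
print(all_states.shape, last_state.shape)
```

The full array of states is what a stacked RNN layer consumes, while the last state alone is what a Dense classification head would receive.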

LSTM (Long Short-Term Memory)

python
# LSTM: handles long-term dependencies
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(50, 10)),
    tf.keras.layers.Dense(1)
])

# Bidirectional LSTM: processes the sequence in both directions
model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True),
        input_shape=(50, 10)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1)
])

# LSTM with return_state (needed for Seq2Seq encoder)
inputs = tf.keras.Input(shape=(50, 10))
lstm_out, state_h, state_c = tf.keras.layers.LSTM(
    128, return_state=True
)(inputs)
# lstm_out: final output (batch, 128)
# state_h:  final hidden state (batch, 128)
# state_c:  final cell state (batch, 128)

LSTM adds a cell state that carries information across many timesteps via gating mechanisms (forget gate, input gate, output gate). return_state=True returns both the hidden state and the cell state, which is essential for initializing the Seq2Seq decoder.
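
To show what the gates do to the two states, here is a single LSTM step in NumPy. It is a sketch under stated assumptions: the function lstm_step, the stacked weight layout, and the input/forget/candidate/output block order are choices made for this example and are not taken from the Keras source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step for a single example.

    W: (features, 4*units), U: (units, 4*units), b: (4*units,).
    The four blocks are read as input gate, forget gate, candidate,
    output gate (an ordering assumed for this sketch).
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g     # forget part of the old cell state, write new content
    h = o * np.tanh(c)         # output gate decides how much of the cell is exposed
    return h, c

features, units = 10, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(50, features)):
    h, c = lstm_step(x_t, h, c, W, U, b)
# h and c play the roles of state_h and state_c
print(h.shape, c.shape)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is how information survives across many timesteps.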

GRU (Gated Recurrent Unit)

python
# GRU: simpler than LSTM, often comparable performance
model = tf.keras.Sequential([
    tf.keras.layers.GRU(128, input_shape=(50, 10)),
    tf.keras.layers.Dense(1)
])

# GRU with dropout and recurrent dropout
model = tf.keras.Sequential([
    tf.keras.layers.GRU(
        128,
        dropout=0.2,           # Input dropout
        recurrent_dropout=0.2, # Recurrent connection dropout
        return_sequences=True,
        input_shape=(50, 10)
    ),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1)
])

GRU has two gates (reset and update) instead of LSTM's three, making it faster to train with fewer parameters. Performance is often similar to LSTM — try both and compare on your dataset.
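
The parameter difference can be estimated with simple arithmetic. The helpers below are hypothetical names for this sketch, and they assume the usual layout of one input matrix, one recurrent matrix, and bias vectors per block, with the Keras GRU default reset_after=True adding a second bias vector per block.

```python
def lstm_params(units, input_dim):
    # 4 blocks (3 gates + candidate), each with input weights,
    # recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)

def gru_params(units, input_dim, reset_after=True):
    # 3 blocks (2 gates + candidate); reset_after=True keeps separate
    # input and recurrent bias vectors
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

print(lstm_params(128, 10))                    # 71168
print(gru_params(128, 10))                     # 53760
print(gru_params(128, 10, reset_after=False))  # 53376
```

For the 128-unit layers on 10 input features used above, the GRU comes out at roughly three quarters of the LSTM's parameter count.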

Seq2Seq Encoder-Decoder

python
import tensorflow as tf

# Encoder
encoder_inputs = tf.keras.Input(shape=(None, num_encoder_features))
encoder_lstm = tf.keras.layers.LSTM(256, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: initialized with encoder states
decoder_inputs = tf.keras.Input(shape=(None, num_decoder_features))
decoder_lstm = tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(num_decoder_features, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Full model
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Training: teacher forcing (decoder input = shifted target)
model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=64,
    epochs=100
)

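The encoder processes the input sequence and summarizes it in a context vector: for an LSTM, the final hidden state and final cell state together. The decoder starts from this context and generates the output sequence one timestep at a time. During training, "teacher forcing" feeds the true previous token as decoder input, so the decoder input is the target sequence shifted by one position.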
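
The shifted decoder input used for teacher forcing can be built as follows. This is a minimal sketch with a hypothetical five-token vocabulary; the token ids, the one_hot helper, and the single-example batch are all assumptions for illustration.

```python
import numpy as np

def one_hot(ids, depth):
    """Turn a list of token ids into a (len(ids), depth) one-hot array."""
    out = np.zeros((len(ids), depth), dtype="float32")
    out[np.arange(len(ids)), ids] = 1.0
    return out

# Hypothetical toy vocabulary: 0 = <start>, 1 = <end>, 2..4 = ordinary tokens
START, END, VOCAB = 0, 1, 5
target_ids = [START, 2, 3, 4, END]

decoder_input_ids = target_ids[:-1]    # <start>, 2, 3, 4
decoder_target_ids = target_ids[1:]    # 2, 3, 4, <end>

# Add a batch dimension: (1, timesteps, VOCAB)
decoder_input_data = one_hot(decoder_input_ids, VOCAB)[np.newaxis]
decoder_target_data = one_hot(decoder_target_ids, VOCAB)[np.newaxis]
print(decoder_input_data.shape, decoder_target_data.shape)
```

At every position the decoder sees the true token from the previous step and is trained to predict the token that follows it.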

Seq2Seq with Attention

python
import tensorflow as tf

# Encoder
encoder_inputs = tf.keras.Input(shape=(max_encoder_len, encoder_features))
encoder_lstm = tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
encoder_output, state_h, state_c = encoder_lstm(encoder_inputs)

# Decoder
decoder_inputs = tf.keras.Input(shape=(max_decoder_len, decoder_features))
decoder_lstm = tf.keras.layers.LSTM(256, return_sequences=True)
decoder_output = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])

# Attention layer: attends to encoder outputs at each decoder step
# (inputs are [query, value]; here the decoder output is the query)
attention = tf.keras.layers.Attention()
context = attention([decoder_output, encoder_output])

# Concatenate attention context with decoder output
concat = tf.keras.layers.Concatenate()([decoder_output, context])
output = tf.keras.layers.Dense(vocab_size, activation='softmax')(concat)

model = tf.keras.Model([encoder_inputs, decoder_inputs], output)
# sparse_categorical_crossentropy expects integer token ids as targets
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

Attention allows the decoder to focus on different parts of the encoder output at each timestep, rather than relying on a single fixed-length context vector. This dramatically improves performance on longer sequences.
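
What the attention layer computes can be sketched in NumPy for a single example. The sketch assumes unscaled dot-product scoring with the encoder outputs serving as both keys and values; the function name dot_product_attention and the random inputs are illustrative only.

```python
import numpy as np

def dot_product_attention(query, value):
    """query: (dec_steps, units), value: (enc_steps, units).

    Returns (context, weights) with shapes (dec_steps, units) and
    (dec_steps, enc_steps).
    """
    scores = query @ value.T                        # similarity per (decoder, encoder) pair
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over encoder steps
    context = weights @ value                       # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
decoder_output = rng.normal(size=(7, 256))    # 7 decoder steps
encoder_output = rng.normal(size=(12, 256))   # 12 encoder steps
context, weights = dot_product_attention(decoder_output, encoder_output)
print(context.shape, weights.shape)
```

Each row of weights is a distribution over encoder positions, so each decoder step gets its own context vector instead of sharing one fixed summary.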

Inference (Prediction) Loop

python
import numpy as np

# Reuses encoder_inputs, encoder_states, decoder_inputs, decoder_lstm, and
# decoder_dense from the basic Seq2Seq Encoder-Decoder example (no attention).

# Encoder model for inference: input sequence -> final [h, c] states
encoder_model = tf.keras.Model(encoder_inputs, encoder_states)

# Decoder model for inference (one step at a time)
decoder_state_input_h = tf.keras.Input(shape=(256,))
decoder_state_input_c = tf.keras.Input(shape=(256,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_out, dec_h, dec_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
dec_probs = decoder_dense(dec_out)  # project LSTM output to token probabilities
decoder_model = tf.keras.Model(
    [decoder_inputs] + decoder_states_inputs,
    [dec_probs, dec_h, dec_c]
)

# Inference loop (greedy decoding)
def decode_sequence(input_seq):
    h, c = encoder_model.predict(input_seq, verbose=0)
    target_seq = np.zeros((1, 1, num_decoder_features))
    target_seq[0, 0, start_token_index] = 1.0

    decoded = []
    for _ in range(max_decoder_len):
        output, h, c = decoder_model.predict([target_seq, h, c], verbose=0)
        token_index = int(np.argmax(output[0, -1, :]))
        if token_index == stop_token_index:
            break
        decoded.append(token_index)
        # Feed the predicted token back in as the next decoder input
        target_seq = np.zeros((1, 1, num_decoder_features))
        target_seq[0, 0, token_index] = 1.0

    return decoded

Common Pitfalls

  • Not setting return_sequences=True when stacking RNN layers: Each RNN layer expects a 3D input (batch, timesteps, features). Without return_sequences=True, the layer outputs 2D (batch, features), and the next RNN layer raises a shape error. Only the final RNN layer in a stack can use return_sequences=False.
  • Using recurrent_dropout > 0 with cuDNN: TensorFlow's cuDNN-optimized LSTM/GRU kernels do not support recurrent_dropout. Setting it makes the layer fall back to the slower generic implementation; TensorFlow 2.x only logs a warning about this, which is easy to miss. Use regular dropout for GPU training or accept the performance cost.
  • Forgetting teacher forcing during training: The Seq2Seq decoder must receive the ground truth previous token during training (teacher forcing). Feeding the decoder's own predictions during training causes slow convergence because early predictions are random noise.
  • Ignoring sequence padding and masking: Variable-length sequences must be padded and masked so the model does not learn from padding tokens. Use tf.keras.layers.Masking or pass mask_zero=True to the embedding layer, and ensure downstream layers propagate the mask.
  • Using legacy tf.nn.dynamic_rnn or tf.contrib.seq2seq: tf.contrib was removed in TensorFlow 2.x, and tf.nn.dynamic_rnn remains only under tf.compat.v1. Use tf.keras.layers.LSTM/GRU with return_state=True for encoders and the functional API for Seq2Seq architectures. For beam search decoding, tfa.seq2seq.BeamSearchDecoder from TensorFlow Addons is an option, but note that TensorFlow Addons is no longer actively developed.
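
The effect of masking on the loss, mentioned in the padding pitfall above, can be illustrated in NumPy. This is a sketch of the idea only: Keras applies masks internally when they are propagated, and the function masked_cross_entropy, the pad id of 0, and the toy data are assumptions made for the example.

```python
import numpy as np

def masked_cross_entropy(probs, targets, pad_id=0):
    """probs: (batch, timesteps, vocab) predicted probabilities.
    targets: (batch, timesteps) integer token ids, padded with pad_id.
    """
    mask = (targets != pad_id).astype("float32")
    # Probability the model assigned to the correct token at each position
    picked = np.take_along_axis(probs, targets[..., np.newaxis], axis=-1)[..., 0]
    loss = -np.log(picked + 1e-9) * mask   # zero out padded positions
    return loss.sum() / mask.sum()         # average over real tokens only

targets = np.array([[3, 1, 2, 0, 0],
                    [2, 2, 0, 0, 0]])
probs = np.full((2, 5, 4), 0.25)   # uniform predictions over a 4-token vocabulary
loss = masked_cross_entropy(probs, targets)
print(loss)
```

Because padded positions are multiplied by zero and excluded from the denominator, predictions at those positions have no influence on the loss.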

Summary

  • Use tf.keras.layers.LSTM or GRU for recurrent layers — SimpleRNN suffers from vanishing gradients
  • Set return_sequences=True when stacking RNN layers; use return_state=True for Seq2Seq encoders
  • Build Seq2Seq by passing encoder final states as initial_state to the decoder LSTM
  • Add tf.keras.layers.Attention between encoder and decoder for better long-sequence performance
  • Use Bidirectional wrapper for tasks where future context matters (classification, NER)
  • During inference, run the decoder one step at a time in a loop, feeding each output as the next input
