Sequence Models (Pre-Transformer)

A feedforward neural network processes one input and produces one output. It has no memory. You cannot ask it "what did you see three inputs ago?" because by design, every input is treated independently. This works for classifying images or tabular data, but it fails for text. The meaning of "bank" depends on what came before it. The grammatical form of a verb depends on the subject earlier in the sentence. Language is inherently sequential.

The Recurrent Neural Network (RNN) solves this by maintaining a hidden state, a vector that summarizes everything the network has seen so far. At each time step $t$, the RNN takes two inputs: the current token $x_t$ and the previous hidden state $h_{t-1}$. It produces one output: the new hidden state $h_t$. The hidden state is then passed forward to the next time step, carrying information about the history of the sequence.

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

In NumPy, the entire forward step is one line of linear algebra:

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One time step of a vanilla RNN."""
    h = np.tanh(W_h @ h_prev + W_x @ x + b)
    return h
```

The same weight matrices $W_h$ and $W_x$ are used at every time step. This weight sharing is what makes an RNN a recurrence: the same transformation is applied to each input in the context of the accumulated history. A sentence of length 100 passes through 100 applications of the same set of weights.

Figure: RNN unrolled through time, with the hidden state threading the sequence. The same weights are applied at every time step. The hidden state $h_t$ carries information from all past tokens; each new token updates it.

To visualize an RNN, "unroll" it through time: draw one copy of the cell for each time step, connected by arrows representing the hidden state flow. The unrolled view makes it clear that an RNN over a sequence of length $T$ is really a very deep feedforward network, $T$ layers deep, where all layers share the same weights. This weight sharing is what gives RNNs parameter efficiency: a sentence of length 100 uses the same number of parameters as a sentence of length 10.
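The unrolled view can be written as an explicit loop. The following is a minimal sketch rather than a reference implementation: `rnn_forward`, the sizes (hidden size 8, input size 4), and the random initialization are assumptions chosen for illustration. It restates `rnn_step` so the block runs on its own, then shows that a length-10 and a length-100 sequence use exactly the same parameters.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One time step of a vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def rnn_forward(xs, h0, W_h, W_x, b):
    """Unroll the RNN over xs of shape (T, input_dim).

    Returns all hidden states stacked into shape (T, hidden_dim).
    """
    h = h0
    states = []
    for x in xs:                          # one "layer" per time step
        h = rnn_step(h, x, W_h, W_x, b)   # the same weights at every step
        states.append(h)
    return np.stack(states)

# Illustrative sizes and initialization (assumptions, not from the text).
rng = np.random.default_rng(0)
hidden_dim, input_dim = 8, 4
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
h0 = np.zeros(hidden_dim)

n_params = W_h.size + W_x.size + b.size   # 64 + 32 + 8

short_states = rnn_forward(rng.normal(size=(10, input_dim)), h0, W_h, W_x, b)
long_states = rnn_forward(rng.normal(size=(100, input_dim)), h0, W_h, W_x, b)

print(short_states.shape, long_states.shape, n_params)
# prints: (10, 8) (100, 8) 104
```

The parameter count depends only on the hidden and input sizes. Sequence length changes how many times the loop body runs, not how many weights exist.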

Key Insight

The unrolled RNN is a deep feedforward network where every layer shares weights. This perspective immediately reveals the vanishing gradient problem: backpropagating through 100 layers (one per time step) multiplies the gradient by the same weight matrix $W_h$, and by a $\tanh$ derivative that is at most 1, 100 times. If the largest singular value of $W_h$ is less than 1, gradients shrink exponentially with distance; if it is well above 1, they can grow exponentially instead. This is not a bug. It is the direct consequence of sharing weights across time, and it is why vanilla RNNs struggle with long-range dependencies.
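The shrinkage can be observed numerically. The sketch below is illustrative only: the hidden size (16), the input size (4), the random inputs, and the choice to rescale `W_h` so that its largest singular value is 0.9 are all assumptions made for the demonstration. It tracks the Jacobian of $h_t$ with respect to $h_0$, which is the product of the per-step Jacobians $\mathrm{diag}(1 - h_t^2)\,W_h$ that backpropagation through time multiplies together.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim, T = 16, 4, 100   # illustrative sizes (assumptions)

W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_h *= 0.9 / np.linalg.norm(W_h, 2)     # rescale: largest singular value = 0.9
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
J = np.eye(hidden_dim)                  # running Jacobian d h_t / d h_0
norms = []
for t in range(T):
    x = rng.normal(size=input_dim)
    h = np.tanh(W_h @ h + W_x @ x + b)
    # Jacobian of one step: diag(1 - h_t^2) @ W_h  (tanh'(a) = 1 - tanh(a)^2)
    step_jac = (1.0 - h**2)[:, None] * W_h
    J = step_jac @ J
    norms.append(np.linalg.norm(J, 2))  # spectral norm after t + 1 steps

for t in (0, 9, 49, 99):
    print(f"steps={t + 1:3d}  ||d h_t / d h_0|| = {norms[t]:.3e}")
```

Each per-step Jacobian has spectral norm at most 0.9 here, so the norm after $t$ steps is bounded by $0.9^t$, roughly $2.7 \times 10^{-5}$ at $t = 100$. In practice the $\tanh$ derivative pushes it lower still, which is why a loss at step 100 barely moves the weights' response to step 1.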

Bidirectional RNNs

Standard RNNs process sequences left to right: the hidden state at position $t$ only sees positions $1$ through $t$. For tasks like named entity recognition, where what comes after a word is as important as what came before it, a bidirectional RNN runs two separate RNNs: one forward and one backward over the same sequence. The two hidden states at each position are concatenated to form the final representation.

The bidirectional design doubles the parameter count but often improves performance substantially on understanding tasks, where the whole input is available up front. It is unsuitable for generation (future tokens do not exist yet when the current one is being generated), but for classification or sequence labeling, it is the standard extension of vanilla RNNs.
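A bidirectional pass can be sketched with two independent parameter sets and a reversed copy of the input. This is a minimal sketch under stated assumptions: `init_params`, `rnn_forward`, `birnn_forward`, and the sizes used are hypothetical names and values chosen for illustration, and the cell is the vanilla $\tanh$ RNN from earlier rather than an LSTM.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Run a vanilla RNN over xs (T, input_dim); return states (T, hidden_dim)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return np.stack(states)

def init_params(rng, hidden_dim, input_dim):
    """Hypothetical helper: small random weights for one direction."""
    return (rng.normal(scale=0.3, size=(hidden_dim, hidden_dim)),
            rng.normal(scale=0.3, size=(hidden_dim, input_dim)),
            np.zeros(hidden_dim))

def birnn_forward(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each position."""
    h_fwd = rnn_forward(xs, *fwd_params)
    # Read the sequence right to left, then flip back so row t aligns with token t.
    h_bwd = rnn_forward(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
hidden_dim, input_dim, T = 8, 4, 5          # illustrative sizes (assumptions)
fwd_params = init_params(rng, hidden_dim, input_dim)
bwd_params = init_params(rng, hidden_dim, input_dim)   # separate weights

xs = rng.normal(size=(T, input_dim))
out = birnn_forward(xs, fwd_params, bwd_params)
print(out.shape)   # (T, 2 * hidden_dim)

# Changing only the LAST token alters the backward half of position 0,
# while the forward half of position 0 is untouched.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
out_changed = birnn_forward(xs_changed, fwd_params, bwd_params)
```

The forward half of each row depends only on tokens up to that position, and the backward half only on tokens from that position onward. Together they give every position a view of the whole sequence, which is why the whole input must be known before the pass starts.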

Interview Tip

When a paper says 'we use a bidirectional LSTM with hidden size 512', they typically mean two separate LSTMs (one reading left-to-right, one reading right-to-left), each with hidden size 512, whose hidden states are concatenated at each position, giving a 1024-dimensional representation per token. This is the architecture BERT replaced: BERT's bidirectional attention achieves the same 'see both directions' property but with full parallelism across positions.

Teacher Forcing

Training a sequence-to-sequence RNN requires deciding what to feed as input at each time step. The model's own output from the previous step is often wrong early in training, and feeding wrong tokens into subsequent steps would compound errors and make learning very slow. Teacher forcing instead feeds the ground-truth previous token at each step, regardless of what the model predicted. This makes training stable but creates an exposure bias: the model never sees its own mistakes during training, then has to consume them at inference time. Scheduled sampling addresses this by replacing ground-truth tokens with model-predicted tokens with a probability that increases as training progresses.
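The difference between the three regimes comes down to one line: which token is fed into the next step. The sketch below is a toy illustration under stated assumptions, not a training loop. The decoder, its sizes, the `bos` token id 0, and the names `decode_inputs` and `linear_schedule` are all hypothetical, and the linear ramp is just one possible schedule. A real training loop would also compute a loss against `target` at each step.

```python
import numpy as np

def decode_inputs(targets, params, sample_prob, rng, bos=0):
    """Run a toy RNN decoder and record which token is fed at each step.

    sample_prob = 0.0 -> pure teacher forcing (always feed the ground truth)
    sample_prob = 1.0 -> free running (always feed the model's own prediction)
    in between        -> scheduled sampling (a random mix of the two)
    """
    E, W_h, W_x, b, W_out = params
    h = np.zeros(W_h.shape[0])
    prev = bos
    fed, preds = [], []
    for target in targets:
        fed.append(prev)
        h = np.tanh(W_h @ h + W_x @ E[prev] + b)
        pred = int(np.argmax(W_out @ h))       # greedy prediction for this step
        preds.append(pred)
        use_model = rng.random() < sample_prob
        prev = pred if use_model else int(target)
    return fed, preds

def linear_schedule(step, total_steps, max_prob=0.5):
    """Hypothetical schedule: ramp the sampling probability from 0 to max_prob."""
    return max_prob * min(1.0, step / total_steps)

# Toy decoder with illustrative sizes (assumptions).
rng = np.random.default_rng(0)
vocab, emb_dim, hidden_dim = 5, 4, 6
params = (rng.normal(size=(vocab, emb_dim)),                   # E: embeddings
          rng.normal(scale=0.3, size=(hidden_dim, hidden_dim)),  # W_h
          rng.normal(scale=0.3, size=(hidden_dim, emb_dim)),     # W_x
          np.zeros(hidden_dim),                                # b
          rng.normal(size=(vocab, hidden_dim)))                # W_out

targets = [3, 1, 4, 2, 1]
fed_tf, preds_tf = decode_inputs(targets, params, 0.0, rng)   # teacher forcing
fed_fr, preds_fr = decode_inputs(targets, params, 1.0, rng)   # free running

print("teacher forcing feeds:", fed_tf)
print("free running feeds:   ", fed_fr)
```

Under teacher forcing the fed sequence is the ground truth shifted right by one position, whatever the model predicted. In free-running mode it is the model's own predictions shifted right, which is the situation the model faces at inference time.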

Level Expectations

Beginner: understand that RNNs process sequences one step at a time with a hidden state that serves as memory.

Intermediate: implement an LSTM cell from scratch (the CodeExecutor problem) and understand how the gates control information flow.

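Advanced: read 'Attention Is All You Need' (Vaswani et al., 2017), the paper that displaced RNNs as the default architecture for sequence modeling, and understand why self-attention handles long-range dependencies more directly than recurrence: any two positions are connected in a single step instead of through a chain of hidden-state updates.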
