Sequence Models (Pre-Transformer)
A feedforward neural network processes one input and produces one output. It has no memory. You cannot ask it "what did you see three inputs ago?" because by design, every input is treated independently. This works for classifying images or tabular data, but it fails for text. The meaning of "bank" depends on what came before it. The grammatical form of a verb depends on the subject earlier in the sentence. Language is inherently sequential.
The Recurrent Neural Network (RNN) solves this by maintaining a hidden state $h_t$, a vector that summarizes everything the network has seen so far. At each time step $t$, the RNN takes two inputs: the current token $x_t$ and the previous hidden state $h_{t-1}$. It produces one output: the new hidden state $h_t$. The hidden state is then passed forward to the next time step, carrying information about the history of the sequence.
In NumPy, the entire forward step is one line of linear algebra:
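A minimal sketch of that step, assuming the standard tanh update $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$. The names `rnn_step`, `W_xh`, `W_hh`, `b_h` and the sizes are illustrative assumptions, not taken from the text.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step (assumed form): h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Illustrative sizes and random weights (assumptions for the demo).
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
b_h = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)   # current input vector (e.g. a token embedding)
h_prev = np.zeros(hidden_size)      # previous hidden state, zero at the start
h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
print(h_t.shape)  # (3,)
```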
The same weight matrices, $W_{xh}$ (input-to-hidden) and $W_{hh}$ (hidden-to-hidden), are used at every time step. This weight sharing is what makes an RNN a recurrence: the same transformation is applied to each input in the context of the accumulated history. A sentence of length 100 passes through 100 applications of the same set of weights.
RNN unrolled through time: hidden state threads the sequence
To visualize an RNN, "unroll" it through time: draw one copy of the cell for each time step, connected by arrows representing the hidden state flow. The unrolled view makes it clear that an RNN over a sequence of length $T$ is really a very deep feedforward network, $T$ layers deep, where all layers share the same weights. This weight sharing is what gives RNNs parameter efficiency: a sentence of length 100 uses the same number of parameters as a sentence of length 10.
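The unrolled view can be sketched as a plain loop. This is an illustrative example assuming a tanh RNN cell; `rnn_forward`, the weight names, and the sizes are hypothetical choices. It runs the same weights over sequences of length 10 and 100 and shows that the parameter count does not change.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence; returns hidden states of shape (T, hidden)."""
    h = h0
    hs = []
    for x_t in xs:  # one "layer" of the unrolled network per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # same weights at every step
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
n_params = W_xh.size + W_hh.size + b_h.size  # independent of sequence length

for T in (10, 100):
    xs = rng.normal(size=(T, input_size))
    hs = rnn_forward(xs, np.zeros(hidden_size), W_xh, W_hh, b_h)
    print(T, hs.shape, n_params)  # 24 parameters in both cases
```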
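The unrolled RNN is a deep feedforward network where every layer shares weights. This perspective immediately reveals the vanishing gradient problem: backpropagating through 100 layers (one per time step) multiplies gradients by the same recurrent weight matrix 100 times, along with the activation derivatives, which for tanh are at most 1. If the largest singular value of that matrix is less than 1, gradients shrink exponentially with the number of steps; if the matrix instead amplifies some directions by more than 1, gradients along them can grow exponentially (the exploding gradient problem). This is not a bug. It is the direct consequence of sharing weights across time, and it is why vanilla RNNs struggle with long-range dependencies.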
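The exponential effect is easy to see numerically. The sketch below is a simplification under stated assumptions: it ignores the tanh derivative (which would only shrink gradients further) and uses a scaled orthogonal matrix as a hypothetical recurrent weight matrix, so every singular value equals the scale factor. The function name `grad_norm_after` is illustrative.

```python
import numpy as np

def grad_norm_after(W_hh, steps, rng):
    """Norm of a unit gradient vector after `steps` multiplications by W_hh.T."""
    g = rng.normal(size=W_hh.shape[0])
    g /= np.linalg.norm(g)       # start from a unit-norm gradient
    for _ in range(steps):
        g = W_hh.T @ g           # one backward step through time (linear part only)
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
# Orthogonal matrix: all singular values are exactly 1 before scaling.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

small = grad_norm_after(0.9 * Q, 100, rng)  # scales like 0.9**100: vanishes
large = grad_norm_after(1.1 * Q, 100, rng)  # scales like 1.1**100: explodes
print(small, large)
```

With a scale of 0.9 the gradient norm after 100 steps is on the order of $10^{-5}$; with 1.1 it is on the order of $10^{4}$.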
Bidirectional RNNs
Standard RNNs process sequences left to right, so the hidden state at position $t$ only sees positions $1$ through $t$. For tasks like named entity recognition, where knowing what comes after a word is as important as what came before it, a bidirectional RNN runs two separate RNNs: one forward and one backward over the same sequence. The two hidden states at each position are concatenated to form the final representation.
The bidirectional design doubles the parameter count but typically improves performance on understanding tasks, because the representation at each position draws on both left and right context. It is impractical for generation tasks (you cannot process future tokens before generating the current one), but for classification or sequence labeling, it is the standard extension of vanilla RNNs.
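A minimal sketch of the bidirectional construction, assuming a vanilla tanh RNN in each direction. The helper names (`run_rnn`, `init_params`, `bidirectional_rnn`) and the sizes are hypothetical. The backward pass reads the reversed sequence, and its outputs are reversed again so both halves line up by position before concatenation.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Vanilla tanh RNN over xs of shape (T, input); returns (T, hidden)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

def init_params(rng, input_size, hidden_size):
    """Random illustrative weights for one direction."""
    return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

def bidirectional_rnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)               # reads left to right
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # reads right to left, re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2 * hidden)

rng = np.random.default_rng(0)
T, input_size, hidden_size = 5, 4, 3
xs = rng.normal(size=(T, input_size))
fwd = init_params(rng, input_size, hidden_size)  # two separate parameter sets,
bwd = init_params(rng, input_size, hidden_size)  # hence double the parameters
out = bidirectional_rnn(xs, fwd, bwd)
print(out.shape)  # (5, 6)
```

Changing the last input leaves the forward half of earlier positions untouched but alters their backward half, which is how each position comes to depend on the tokens that follow it.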
When a paper says 'we use a bidirectional LSTM with hidden size 512', they mean two separate LSTMs — one reading left-to-right, one reading right-to-left — whose hidden states are concatenated at each position, giving a 1024-dimensional representation per token. This is the architecture BERT replaced: BERT's bidirectional attention achieves the same 'see both directions' property but with full parallelism.
Teacher Forcing
Training a sequence-to-sequence RNN requires deciding what to feed as input at each time step. Early in training, the model's own output from the previous step is often wrong, and feeding wrong tokens into subsequent steps would compound errors and make learning very slow. Teacher forcing instead feeds the ground-truth previous token at each step, regardless of what the model predicted. This makes training stable but creates an exposure bias: the model never sees its own mistakes during training, then sees them at inference time. Scheduled sampling addresses this by gradually replacing ground-truth tokens with model-predicted tokens as training progresses.
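The difference between the two regimes comes down to which token is fed as the next input. The sketch below illustrates only that choice, with loss computation and weight updates omitted. The toy decoder `step_fn`, its weights, the `bos_id` start token, and the `sampling_prob` parameter are all hypothetical names for this example.

```python
import numpy as np

def decode_training_pass(targets, step_fn, h0, bos_id, sampling_prob, rng):
    """Run a decoder over `targets`, choosing each step's input token.

    sampling_prob = 0.0 is pure teacher forcing; larger values feed the
    model's own prediction more often (scheduled sampling).
    Returns (inputs actually fed, model predictions).
    """
    h, prev_token = h0, bos_id
    fed_inputs, predictions = [], []
    for target in targets:
        fed_inputs.append(prev_token)
        logits, h = step_fn(prev_token, h)
        pred = int(np.argmax(logits))
        predictions.append(pred)
        # Choose the next input: the model's own token or the ground truth.
        use_model_token = rng.random() < sampling_prob
        prev_token = pred if use_model_token else int(target)
    return fed_inputs, predictions

# Toy decoder with random weights (illustrative only).
rng = np.random.default_rng(0)
vocab_size, hidden_size = 6, 8
E = rng.normal(scale=0.1, size=(vocab_size, hidden_size))      # token embeddings
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_out = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

def step_fn(token_id, h):
    h_new = np.tanh(E[token_id] + W_hh @ h)
    return W_out @ h_new, h_new

targets = [3, 1, 4, 1, 5]
h0 = np.zeros(hidden_size)
forced_inputs, _ = decode_training_pass(targets, step_fn, h0, 0, 0.0, rng)
free_inputs, free_preds = decode_training_pass(targets, step_fn, h0, 0, 1.0, rng)
print(forced_inputs)  # [0, 3, 1, 4, 1]: start token, then ground truth shifted by one
```

In a scheduled-sampling setup, `sampling_prob` would start near 0 and be raised over the course of training; the exact schedule is a design choice not specified here.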
Beginner: understand that RNNs process sequences one step at a time with a hidden state that serves as memory.
Intermediate: implement an LSTM cell from scratch (the CodeExecutor problem) and understand how the gates control information flow.
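Advanced: read 'Attention Is All You Need' (Vaswani et al., 2017), the paper that led Transformers to largely displace RNNs in language modeling, and understand why self-attention handles long-range dependencies better than recurrence: any two positions are connected in a single step rather than through a chain of hidden states, and all positions can be processed in parallel during training.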