The Perceptron to the MLP

Topics Covered

Single Neuron and Decision Boundaries

The Neuron as a Weighted Vote

What the Decision Boundary Actually Is

The Fundamental Limitation

Activation Functions

Why Linear Layers Collapse: A Proof

Sigmoid, The Original, and Its Problems

Tanh, Better, But Still Saturates

ReLU, The Workhorse That Changed Everything

GELU, What Transformers Use

SwiGLU, The Modern LLM Standard

Multilayer Perceptrons

The Architecture

Forward Pass as Function Composition

The Universal Approximation Theorem

Weight Matrices as Learned Features

Backpropagation from Scratch

The Setup

Step 1: Backward Through the Loss

Step 2: Backward Through the Second Layer

Step 3: Backward Through ReLU

Step 4: Backward Through the First Layer

Why This Works

Loss Functions: Cross-Entropy and Softmax

Numerical Gradient Checking

Practice: Implement It Yourself

A neural network is not magic. It is a sequence of matrix multiplications and nonlinearities. But before you can understand what happens in a 70-billion-parameter model, you need to understand what a single neuron does. Start there.

The Neuron as a Weighted Vote

A single neuron takes a vector of inputs $x$, computes a weighted sum, adds a bias, and passes the result through an activation function:

$$\text{output} = \phi(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b) = \phi(w^\top x + b)$$

The weights $w$ control how much each input contributes. The bias $b$ shifts the decision threshold. Without the bias, every decision boundary must pass through the origin, a severe constraint. The activation function $\phi$ decides whether the neuron "fires."
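The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `neuron` and the example weights are assumptions chosen for the demo, and the sigmoid is just one possible choice for the activation.

```python
import math

def sigmoid(z):
    """Logistic function, one common choice for the activation phi."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, phi=sigmoid):
    """Compute phi(w . x + b) for equal-length sequences x and w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted sum plus bias
    return phi(z)

# Illustrative values (assumed for this sketch, not taken from the text).
w = [2.0, -1.0]
b = -0.5

print(neuron([1.0, 0.5], w, b))  # weighted sum is 1.0, so the output is above 0.5
print(neuron([0.0, 1.0], w, b))  # weighted sum is -1.5, so the output is below 0.5
```

With a sigmoid, an output above 0.5 corresponds to a positive weighted sum, which is the sense in which the neuron "fires."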

What the Decision Boundary Actually Is

For a two-class classification problem with two input features, the neuron draws a line (a hyperplane in higher dimensions) defined by:

$$w^\top x + b = 0$$

Points on one side get a positive weighted sum (neuron fires). Points on the other side get a negative weighted sum (neuron doesn't fire). Changing $w$ rotates the boundary. Changing $b$ shifts it in or out from the origin.
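To make the geometry concrete, here is a small sketch that reports which side of the boundary a point falls on and how far the boundary sits from the origin. The helper names `side` and `boundary_offset` and the numbers are assumptions for illustration. The offset uses the standard fact that the hyperplane $w^\top x + b = 0$ lies at signed distance $-b / \lVert w \rVert$ from the origin, measured along $w$.

```python
import math

def side(x, w, b):
    """Return 1 if w . x + b > 0 (neuron fires), else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

def boundary_offset(w, b):
    """Signed distance from the origin to the boundary, measured along w."""
    return -b / math.sqrt(sum(wi * wi for wi in w))

w = [1.0, 1.0]  # illustrative weights: boundary is the line x1 + x2 = -b

print(side([2.0, 2.0], w, -3.0))   # 2 + 2 - 3 > 0, positive side
print(side([1.0, 1.0], w, -3.0))   # 1 + 1 - 3 < 0, negative side

# Changing only the bias moves the line toward or away from the origin.
print(boundary_offset(w, -3.0), boundary_offset(w, -1.0))
```

Scaling $w$ and $b$ by the same positive constant leaves the boundary where it is, which is why only the direction of $w$ and the ratio $b / \lVert w \rVert$ matter for the geometry.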

Figure: A single neuron is a hyperplane. (Scatter plot with axes x₁ and x₂; legend: Class A (y = 1), Class B (y = 0).)
The neuron computes w·x + b and applies a sigmoid. The decision boundary w·x + b = 0 is a line in 2D, a plane in 3D, a hyperplane in general. Rotating the weights rotates the line; changing the bias shifts it. This is exactly logistic regression in geometric form, and it cannot separate XOR.

A single neuron with a sigmoid activation is exactly logistic regression. The weights are what you learn. The decision boundary is the geometry of those weights.

The Fundamental Limitation

Here is the catch: a single neuron can only separate linearly separable classes. If you need to classify XOR, where $(0,0)$ and $(1,1)$ are class 0 but $(0,1)$ and $(1,0)$ are class 1, a single hyperplane cannot do it. No matter how you rotate or shift the line, the two classes remain interlocked.
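A quick way to see the limitation is to search over many candidate lines and count how many of the four points each one classifies correctly. The sketch below does this on a coarse grid; the grid range, the helper names, and the AND dataset used for contrast are all assumptions made for the demo. A grid search is only suggestive, but the result matches a short algebraic argument: XOR would require $b \le 0$, $w_1 + w_2 + b \le 0$, $w_1 + b > 0$, and $w_2 + b > 0$. Adding the first two gives $w_1 + w_2 + 2b \le 0$, adding the last two gives $w_1 + w_2 + 2b > 0$, a contradiction.

```python
from itertools import product

# Four input points with their labels.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(x, w1, w2, b):
    """Single neuron with a hard threshold: 1 if w . x + b > 0, else 0."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0

def best_accuracy(dataset, grid):
    """Most points classified correctly by any (w1, w2, b) on the grid."""
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        correct = sum(predict(x, w1, w2, b) == y for x, y in dataset)
        best = max(best, correct)
    return best

grid = [i * 0.25 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25

print(best_accuracy(AND, grid))  # 4: AND is linearly separable
print(best_accuracy(XOR, grid))  # 3: some point is always misclassified
```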

This is not just an academic limitation. Real-world data is rarely linearly separable. Spam emails don't cluster neatly on one side of a hyperplane in feature space. The solution is to stack neurons into layers. But first, let's understand activation functions: they are what allows stacking to produce nonlinear behavior.

Key Insight

The perceptron (1957) only used a step function. It either fired or didn't. The modern neuron uses smooth, differentiable activations. This small change makes the entire field possible: you can't do gradient descent through a step function because its gradient is zero everywhere except at the discontinuity, where it is undefined, so the weights never receive a usable update signal.
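The difference is easy to check numerically. The sketch below estimates the derivative of a step function and of a sigmoid with a central finite difference; the function names, the sample points, and the step size `eps` are assumptions chosen for illustration.

```python
import math

def step(z):
    """Hard threshold used by the original perceptron."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Smooth, differentiable alternative."""
    return 1.0 / (1.0 + math.exp(-z))

def numeric_grad(f, z, eps=1e-4):
    """Central finite-difference estimate of df/dz at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Away from z = 0 the step function is flat, so its slope is exactly zero,
# while the sigmoid always has a small positive slope to follow.
for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, numeric_grad(step, z), numeric_grad(sigmoid, z))
```

At every sampled point the step function reports a slope of zero, so a gradient-based update would leave the weights unchanged, whereas the sigmoid gives a nonzero slope that tells the weights which way to move.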