The Perceptron to the MLP
A neural network is not magic. It is a sequence of matrix multiplications and nonlinearities. But before you can understand what happens in a 70-billion-parameter model, you need to understand what a single neuron does. Start there.
The Neuron as a Weighted Vote
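A single neuron takes a vector of inputs $\mathbf{x} = (x_1, \dots, x_n)$, computes a weighted sum, adds a bias, and passes the result through an activation function:
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(\mathbf{w}^\top \mathbf{x} + b)$$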
The weights control how much each input contributes. The bias shifts the decision threshold. Without the bias, every decision boundary must pass through the origin, a severe constraint. The activation function decides whether the neuron "fires."
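The weighted-sum-plus-bias computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `neuron` and the sample values for `x`, `w`, and `b` are assumptions chosen for the example, and sigmoid is just one possible choice of activation.

```python
import math

def sigmoid(z):
    """Smooth activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)

# Hypothetical values, chosen only for illustration.
x = [1.0, 2.0]     # inputs
w = [0.5, -0.25]   # weights: how much each input contributes
b = 0.0            # bias: shifts the decision threshold

# The weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
print(neuron(x, w, b))

# Raising the bias pushes the same input toward "firing".
print(neuron(x, w, b=3.0))
```

With these values the first call sits exactly on the threshold, so the output is 0.5. Increasing the bias alone moves the output toward 1 without touching the weights.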
What the Decision Boundary Actually Is
For a two-class classification problem with two input features, the neuron draws a line (a hyperplane in higher dimensions) defined by:
$$w_1 x_1 + w_2 x_2 + b = 0$$
Points on one side get a positive weighted sum (neuron fires). Points on the other side get a negative weighted sum (neuron doesn't fire). Changing the weights $\mathbf{w}$ rotates the boundary. Changing the bias $b$ shifts it in or out from the origin.
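To make the geometry concrete, the sketch below classifies a single point by the sign of its weighted sum while the weights and bias change. The helper names `weighted_sum` and `fires`, and all numeric values, are hypothetical and exist only for this illustration.

```python
def weighted_sum(x, w, b):
    """Signed score of x relative to the boundary w . x + b = 0."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def fires(x, w, b):
    """The neuron fires when the point lies on the positive side."""
    return weighted_sum(x, w, b) > 0

point = [0.5, 0.5]

# Boundary x1 + x2 - 0.5 = 0: the point is on the positive side.
print(fires(point, [1.0, 1.0], b=-0.5))   # True  (1.0 - 0.5 > 0)

# Same weights, different bias: the line shifts away and the point changes side.
print(fires(point, [1.0, 1.0], b=-1.5))   # False (1.0 - 1.5 < 0)

# Same bias, different weights: the line rotates and the point changes side.
print(fires(point, [1.0, -1.0], b=-0.5))  # False (0.0 - 0.5 < 0)

# With b = 0 the origin always scores exactly 0, whatever the weights are,
# so every bias-free boundary passes through the origin.
print(weighted_sum([0.0, 0.0], [1.0, 1.0], b=0.0))   # 0.0
print(weighted_sum([0.0, 0.0], [1.0, -1.0], b=0.0))  # 0.0
```

The last two lines show why dropping the bias is so restrictive: the origin is stuck on the boundary no matter how the weights are set.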
A single neuron is a hyperplane
This is the same model as logistic regression: a single neuron with a sigmoid activation is logistic regression. The weights are what you learn. The decision boundary is the geometry of those weights.
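The correspondence with logistic regression can be written out in a short worked step. The notation ($\sigma$ for the sigmoid, $\mathbf{w}$ for the weights, $b$ for the bias, $\hat{y}$ for the output) is chosen here for illustration. The point is that thresholding the sigmoid output at one half gives back exactly the linear boundary:

```latex
\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x} + b),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

% sigma is strictly increasing and sigma(0) = 1/2, so
\hat{y} > \tfrac{1}{2}
\iff
\mathbf{w}^\top \mathbf{x} + b > 0
```

The sigmoid changes how confident the output is as a point moves away from the boundary, but it does not bend the boundary itself, which stays a hyperplane.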
The Fundamental Limitation
Here is the catch: a single neuron can only separate linearly separable classes. If you need to classify XOR, where $(0, 0)$ and $(1, 1)$ are class 0 but $(0, 1)$ and $(1, 0)$ are class 1, a single hyperplane cannot do it. No matter how you rotate or shift the line, the two classes remain interlocked.
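One way to see the limitation in practice is to train a single step-activation neuron with the classic perceptron update rule on two tiny datasets: AND, which is linearly separable, and XOR, with (0, 0) and (1, 1) as class 0 and (0, 1) and (1, 0) as class 1. The sketch below is an illustration under assumed settings (zero initial weights, learning rate 1, at most 100 passes). The function names are hypothetical.

```python
def predict(x, w, b):
    """Step-activation neuron: outputs 1 if the weighted sum is positive."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

def train_perceptron(data, epochs=100, lr=1.0):
    """Perceptron rule: nudge the weights toward every misclassified point."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in data:
            err = target - predict(x, w, b)
            if err != 0:
                w[0] += lr * err * x[0]
                w[1] += lr * err * x[1]
                b += lr * err
                errors += 1
        if errors == 0:  # a full pass with no mistakes: a separating line was found
            break
    return w, b

def accuracy(data, w, b):
    """Fraction of points the neuron classifies correctly."""
    return sum(predict(x, w, b) == t for x, t in data) / len(data)

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

for name, data in (("AND", AND), ("XOR", XOR)):
    w, b = train_perceptron(data)
    print(name, "accuracy:", accuracy(data, w, b))
```

AND reaches perfect accuracy, while XOR cannot, however long training runs. The reason is algebraic rather than a matter of training time: the two class-1 points require w2 + b > 0 and w1 + b > 0, which sum to w1 + w2 + 2b > 0, while the two class-0 points require b <= 0 and w1 + w2 + b <= 0, which sum to w1 + w2 + 2b <= 0. Both cannot hold at once.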
This is not just an academic limitation. Real-world data is rarely linearly separable. Spam emails don't cluster neatly on one side of a hyperplane in feature space. The solution is to stack neurons into layers. But first, let's understand activation functions: they are what allows stacking to produce nonlinear behavior.
The original perceptron (1957) used only a step function: it either fired or didn't. The modern neuron uses smooth, differentiable activations. This small change makes the entire field possible: you can't do gradient descent through a step function because its gradient is zero everywhere except at the discontinuity, where it is undefined, so no error signal ever reaches the weights.
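The difference between the two activations shows up directly when their slopes are estimated numerically. The sketch below uses a central finite difference. The helper name `numerical_gradient`, the step size `h`, and the sample points are assumptions made for this illustration.

```python
import math

def step(z):
    """Perceptron-style activation: fires or doesn't."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Smooth, differentiable activation."""
    return 1.0 / (1.0 + math.exp(-z))

def numerical_gradient(f, z, h=1e-5):
    """Central-difference estimate of df/dz at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-2.0, -0.5, 0.5, 2.0):
    print(
        f"z={z:+.1f}",
        "step slope:", numerical_gradient(step, z),
        "sigmoid slope:", round(numerical_gradient(sigmoid, z), 4),
    )
```

Away from zero the step function reports a slope of exactly 0.0, so a gradient-based update would leave the weights unchanged. The sigmoid reports a positive slope at every sample point, which is what lets the error signal move the weights.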