Batch Gradient Descent for Logistic Regression

Introduction

Batch gradient descent is one of the simplest optimization strategies for logistic regression: compute the gradient using the entire training set, then update the weights once per iteration. It is conceptually clean and easy to implement, which makes it a good way to understand how logistic regression is trained.

Logistic regression in one equation

For binary classification, logistic regression predicts a probability using the sigmoid function:

score = X @ w + b
probability = sigmoid(score)

The sigmoid squashes the score into the range from 0 to 1, which lets us interpret the output as the probability of class 1.

python

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

What makes batch gradient descent "batch"

The word batch here means the gradient is computed from all training examples before any parameter update happens. If there are m examples, every step uses all m rows.

That is different from:

stochastic gradient descent, which updates after one example,
and mini-batch gradient descent, which updates after a small chunk of examples.

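Because every step uses the full dataset, the updates are deterministic and follow the exact gradient of the training loss. The cost is that each update requires a full pass over all m rows, which can be slow on large datasets.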
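The difference between the three variants comes down to how many rows feed each update. The sketch below is one way to express that with a single batch_size parameter; the helper names gradient and one_epoch are hypothetical and chosen only for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gradient(X, y, w, b):
    """Average log-loss gradient over whatever rows are passed in."""
    error = sigmoid(X @ w + b) - y
    return X.T @ error / len(y), error.mean()


def one_epoch(X, y, w, b, learning_rate, batch_size, rng):
    """One pass over the data; batch_size selects the variant.

    batch_size == len(y)     -> batch gradient descent (1 update per pass)
    batch_size == 1          -> stochastic gradient descent (m updates per pass)
    1 < batch_size < len(y)  -> mini-batch gradient descent
    """
    order = rng.permutation(len(y))  # shuffle row order for this pass
    n_updates = 0
    for start in range(0, len(y), batch_size):
        rows = order[start:start + batch_size]
        dw, db = gradient(X[rows], y[rows], w, b)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        n_updates += 1
    return w, b, n_updates


X = np.array([[0.2, 1.1], [1.0, 1.0], [1.3, 0.2], [2.0, 1.0], [2.2, 1.5]])
y = np.array([0, 0, 0, 1, 1])
rng = np.random.default_rng(0)

for batch_size in (len(y), 2, 1):
    _, _, n_updates = one_epoch(X, y, np.zeros(2), 0.0, 0.1, batch_size, rng)
    print(batch_size, n_updates)
```

With five rows, a batch size of 5 gives one update per pass, a batch size of 2 gives three, and a batch size of 1 gives five.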

A complete NumPy implementation

The code below trains logistic regression with batch gradient descent:

python

import numpy as np


def sigmoid(z):
    # Map raw scores to probabilities in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.1, epochs=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # one weight per feature
    b = 0.0                   # bias (intercept)

    for epoch in range(epochs):
        # Forward pass over the full training set.
        linear = X @ w + b
        y_hat = sigmoid(linear)

        # Gradient of the mean binary cross-entropy loss.
        error = y_hat - y
        dw = (X.T @ error) / n_samples
        db = np.sum(error) / n_samples

        # One parameter update per full pass: this is the "batch" step.
        w -= learning_rate * dw
        b -= learning_rate * db

        if epoch % 100 == 0:
            # Loss of the predictions made before this update; 1e-9 guards against log(0).
            loss = -np.mean(y * np.log(y_hat + 1e-9) + (1 - y) * np.log(1 - y_hat + 1e-9))
            print(epoch, round(loss, 4))

    return w, b


# Toy dataset: five examples, two features.
X = np.array([
    [0.2, 1.1],
    [1.0, 1.0],
    [1.3, 0.2],
    [2.0, 1.0],
    [2.2, 1.5],
])

y = np.array([0, 0, 0, 1, 1])

weights, bias = fit_logistic_regression(X, y)
print(weights, bias)

This is the essential training loop: predict, compute error, compute gradient, update parameters.
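Once training has produced a weight vector and a bias, predictions reuse the same forward pass. The sketch below assumes a decision threshold of 0.5; the function names predict_proba and predict and the parameter values are hypothetical and are not the output of the training run above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(X, w, b):
    """Probability of class 1 for each row of X."""
    return sigmoid(X @ w + b)


def predict(X, w, b, threshold=0.5):
    """Hard 0/1 labels obtained by thresholding the probability."""
    return (predict_proba(X, w, b) >= threshold).astype(int)


# Hypothetical parameters, chosen by hand for illustration only.
w = np.array([2.0, 0.0])
b = -3.0

X_new = np.array([[1.0, 1.0], [2.0, 1.0]])
print(predict_proba(X_new, w, b))  # scores are -1 and 1, so roughly [0.269, 0.731]
print(predict(X_new, w, b))        # [0 1]
```

The threshold is a design choice: 0.5 treats both kinds of error equally, and a different value may suit problems where false positives and false negatives have different costs.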

Why the gradient looks this way

For logistic regression with binary cross-entropy loss, the gradient simplifies nicely. Once you compute y_hat - y, the rest is just averaging the feature-weighted errors across the dataset.
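For readers who want to see where that simplification comes from, here is a short derivation sketch. It writes m for the number of examples (n_samples in the code), x_i for one row of X, and uses the mean binary cross-entropy as the loss, matching the quantities named dw and db in the implementation.

```latex
\begin{aligned}
z_i &= x_i^\top w + b, \qquad \hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \\
L(w, b) &= -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big], \\
\sigma'(z) &= \sigma(z)\,\big(1 - \sigma(z)\big)
  \;\;\Longrightarrow\;\;
  \frac{\partial}{\partial z_i} \Big( -\big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big] \Big) = \hat{y}_i - y_i, \\
\frac{\partial L}{\partial w} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i = \frac{1}{m} X^\top (\hat{y} - y), \\
\frac{\partial L}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i).
\end{aligned}
```

The factor of the sigmoid derivative cancels against the denominators produced by differentiating the logarithms, which is why only the plain error term remains.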

That simplicity is one reason logistic regression is such a useful teaching model. The optimization logic is much easier to inspect than in deeper neural networks.
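One way to inspect the optimization logic directly is a finite-difference check: compare the closed-form gradient with a numerical estimate of the loss slope. The sketch below is an illustration under stated assumptions; the helper names, the step size eps, and the parameter values are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(X, y, w, b):
    """Mean binary cross-entropy (no stabilizer; inputs here stay well inside (0, 1))."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def analytic_gradient(X, y, w, b):
    """Closed-form gradient: averaged feature-weighted errors."""
    error = sigmoid(X @ w + b) - y
    return X.T @ error / len(y), np.sum(error) / len(y)


def numeric_gradient(X, y, w, b, eps=1e-6):
    """Central-difference estimate of the same gradient."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * eps)
    db = (loss(X, y, w, b + eps) - loss(X, y, w, b - eps)) / (2 * eps)
    return dw, db


X = np.array([[0.2, 1.1], [1.0, 1.0], [1.3, 0.2], [2.0, 1.0], [2.2, 1.5]])
y = np.array([0, 0, 0, 1, 1])
w = np.array([0.5, -0.3])  # arbitrary test point
b = 0.1

dw_a, db_a = analytic_gradient(X, y, w, b)
dw_n, db_n = numeric_gradient(X, y, w, b)
print(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
```

If the two gradients disagree by more than a tiny amount, the gradient code or the loss code likely contains a bug.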

When batch gradient descent is a good fit

Batch gradient descent is a good choice when:

the dataset fits comfortably in memory,
you want stable, deterministic updates,
and training speed is not your main bottleneck.

It is especially useful for education, prototypes, and smaller tabular datasets. On very large datasets, mini-batch methods usually win because they trade some smoothness for much better throughput.

Common Pitfalls

A common mistake is forgetting the bias term. Without an intercept, the decision boundary is forced to pass through the origin, so if you only optimize the weight vector the model may fit poorly even when the rest of the code looks correct.
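An alternative that makes the bias harder to forget is to fold it into the weight vector by appending a column of ones to the features. The sketch below only demonstrates that the two formulations give the same scores; the weight and bias values are hypothetical.

```python
import numpy as np

X = np.array([[0.2, 1.1], [1.0, 1.0], [2.0, 1.0]])
w = np.array([1.5, -0.5])  # hypothetical weights
b = -2.0                   # hypothetical bias

# Append a column of ones so the last weight plays the role of the bias.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.append(w, b)

scores_explicit = X @ w + b
scores_folded = X_aug @ w_aug
print(np.allclose(scores_explicit, scores_folded))  # True
```

With this layout a single gradient expression updates weights and bias together, at the cost of one extra column in memory.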

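Another issue is numerical instability in the loss computation. In floating point the sigmoid can saturate to exactly 0 or 1, and log(0) evaluates to -inf, which turns the reported loss into inf or NaN. Small stabilizers such as 1e-9 are commonly added inside the logarithm to avoid this.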
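A different way to avoid log(0), swapped in here instead of the 1e-9 stabilizer, is to compute the loss from the raw scores using the identity that the per-example cross-entropy equals log(1 + exp(z)) - y * z. The sketch below uses np.logaddexp for the first term; the function name log_loss_from_scores is hypothetical.

```python
import numpy as np


def log_loss_from_scores(z, y):
    """Mean binary cross-entropy computed from raw scores z = X @ w + b.

    np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflowing,
    so no probability is ever passed to log directly.
    """
    return np.mean(np.logaddexp(0.0, z) - y * z)


# Extreme scores that would saturate the sigmoid to exactly 0 or 1.
z = np.array([-800.0, 0.0, 800.0])
y = np.array([0.0, 1.0, 1.0])

print(log_loss_from_scores(z, y))  # finite: only the middle example contributes log(2)
```

The two confidently correct examples contribute zero loss and the undecided one contributes log(2), so the mean is log(2) / 3.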

Be careful with feature scaling too. If one feature is measured in tiny decimals and another is in huge raw counts, gradient descent can converge very slowly or behave erratically.
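A common remedy is to standardize each feature to zero mean and unit variance before training. The sketch below is a minimal version of that idea; the function names fit_scaler and transform and the example values are hypothetical.

```python
import numpy as np


def fit_scaler(X):
    """Compute per-feature mean and standard deviation from the training data."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return mean, std


def transform(X, mean, std):
    """Apply the training statistics to any dataset (training or new)."""
    return (X - mean) / std


# Hypothetical data: one feature in tiny decimals, one in large raw counts.
X = np.array([
    [0.002, 15000.0],
    [0.004, 52000.0],
    [0.001, 8000.0],
    [0.005, 91000.0],
])

mean, std = fit_scaler(X)
X_scaled = transform(X, mean, std)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

The statistics should come from the training set only and then be reused for any data the model sees later, so that new inputs are on the same scale the weights were learned for.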

Finally, a learning rate that is too large can make the loss bounce or diverge, while a learning rate that is too small can make training look frozen.
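A simple diagnostic is to record the loss at every step for a few candidate learning rates and compare the curves. The sketch below is an illustration only: the function name loss_history, the candidate rates, and the epoch count are hypothetical, and on a small toy dataset like this one a large rate may not visibly diverge.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_history(X, y, learning_rate, epochs=200):
    """Run batch gradient descent and record the log loss before every update."""
    w = np.zeros(X.shape[1])
    b = 0.0
    history = []
    for _ in range(epochs):
        z = X @ w + b
        # Loss computed from scores, which stays finite for extreme values.
        history.append(np.mean(np.logaddexp(0.0, z) - y * z))
        error = sigmoid(z) - y
        w -= learning_rate * (X.T @ error) / len(y)
        b -= learning_rate * np.mean(error)
    return np.array(history)


X = np.array([[0.2, 1.1], [1.0, 1.0], [1.3, 0.2], [2.0, 1.0], [2.2, 1.5]])
y = np.array([0, 0, 0, 1, 1])

for lr in (0.001, 0.1, 1.0):
    history = loss_history(X, y, lr)
    increases = int(np.sum(np.diff(history) > 0))  # steps where the loss went up
    print(f"lr={lr}: first={history[0]:.4f} last={history[-1]:.4f} increases={increases}")
```

A curve that barely moves from its starting value suggests the rate is too small, while a nonzero count of loss increases is a sign of the bouncing behavior that a too-large rate can cause.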

Summary

Batch gradient descent updates logistic regression parameters using the full dataset each step.
The training loop is: predict, compute error, compute gradient, update weights and bias.
It is easy to implement and useful for understanding optimization.
It is stable on small and medium datasets but can be slow on large ones.
Watch the learning rate, feature scaling, and numerical stability when implementing it from scratch.