Linear Regression
Closed Form Solution
Machine Learning
Statistical Modelling
Analytical Methods

Implementation of Linear Regression Closed Form Solution

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Linear regression is one of the most fundamental algorithms in machine learning and statistics, used to model the relationship between a dependent variable and one or more independent variables. Unlike iterative methods such as gradient descent, the closed-form solution computes the optimal coefficients directly in a single step using matrix algebra. This article explains the derivation, implementation, and practical considerations of the closed-form solution.

Linear Regression Basics

Linear regression fits a line (or hyperplane in higher dimensions) to data by minimizing the sum of squared errors between observed and predicted values.

For univariate regression, the model is:

y=β0+β1x+ϵy = \beta_0 + \beta_1 x + \epsilon

where yy is the dependent variable, xx is the independent variable, β0\beta_0 is the intercept, β1\beta_1 is the slope, and ϵ\epsilon is the error term.

For multivariate regression with mm features, the model generalizes to:

y=Xw+ϵ\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}

where:

  • y\mathbf{y} is the n×1n \times 1 vector of observed outputs
  • X\mathbf{X} is the n×(m+1)n \times (m+1) design matrix (with a column of ones for the intercept)
  • w\mathbf{w} is the (m+1)×1(m+1) \times 1 vector of coefficients [β0,β1,,βm]T[\beta_0, \beta_1, \ldots, \beta_m]^T

The Normal Equation

The closed-form solution minimizes the sum of squared residuals yXw2\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2. Taking the gradient with respect to w\mathbf{w} and setting it to zero yields the normal equation:

w=(XTX)1XTy\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

Derivation

Start with the cost function:

J(w)=(yXw)T(yXw)J(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w})

Expanding:

J(w)=yTy2wTXTy+wTXTXwJ(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}

Taking the gradient and setting it to zero:

wJ=2XTy+2XTXw=0\nabla_\mathbf{w} J = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} = 0

Solving for w\mathbf{w}:

XTXw=XTy\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}

w=(XTX)1XTy\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Step-by-Step Computation

Given nn data points with mm features:

  1. Construct the design matrix X\mathbf{X} by prepending a column of ones:

X=[1x11x121x21x221xn1xn2]\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}

  1. Compute XTX\mathbf{X}^T\mathbf{X}, a (m+1)×(m+1)(m+1) \times (m+1) square matrix
  2. Compute (XTX)1(\mathbf{X}^T\mathbf{X})^{-1}
  3. Compute XTy\mathbf{X}^T\mathbf{y}, a (m+1)×1(m+1) \times 1 vector
  4. Multiply: w=(XTX)1XTy\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Implementation

Python with NumPy

python
1import numpy as np
2
3def linear_regression_closed_form(X, y):
4    """
5    X: numpy array of shape (n, m) - feature matrix
6    y: numpy array of shape (n,) - target vector
7    Returns: coefficient vector w of shape (m+1,)
8    """
9    n = X.shape[0]
10    # Add intercept column
11    X_aug = np.column_stack([np.ones(n), X])
12    # Normal equation
13    w = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
14    return w
15
16# Example usage
17X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
18y = np.array([3, 7, 11, 15])
19w = linear_regression_closed_form(X, y)
20print(f"Intercept: {w[0]:.4f}, Coefficients: {w[1:]}")

More Numerically Stable Version

Using np.linalg.lstsq avoids explicitly computing the matrix inverse, which can be numerically unstable:

python
1def linear_regression_stable(X, y):
2    n = X.shape[0]
3    X_aug = np.column_stack([np.ones(n), X])
4    w, residuals, rank, sv = np.linalg.lstsq(X_aug, y, rcond=None)
5    return w

This uses the SVD (Singular Value Decomposition) internally, which is more robust when XTX\mathbf{X}^T\mathbf{X} is nearly singular.

Advantages and Disadvantages

AspectClosed-Form SolutionGradient Descent
ComputationSingle matrix operationIterative, many steps
HyperparametersNoneLearning rate, iterations
ComplexityO(m2n+m3)O(m^2 n + m^3)O(mnk)O(mnk) per epoch
Large mmImpractical (matrix inverse)Scales well
Large nnScales wellEach epoch is O(mn)O(mn)
Numerical stabilityCan be poorGenerally stable

The closed-form solution is ideal when the number of features mm is small (up to a few thousand). For high-dimensional problems, iterative methods like gradient descent or stochastic gradient descent are preferred because the O(m3)O(m^3) cost of matrix inversion becomes prohibitive.

Regularized Variant: Ridge Regression

When XTX\mathbf{X}^T\mathbf{X} is singular or nearly singular (multicollinearity), the inverse does not exist or is numerically unstable. Ridge regression adds a regularization term:

w=(XTX+λI)1XTy\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

The λI\lambda\mathbf{I} term ensures the matrix is invertible and shrinks the coefficients toward zero, reducing overfitting. Here λ>0\lambda > 0 is the regularization strength.

Summary

The closed-form solution for linear regression provides exact coefficients in one computation step via the normal equation w=(XTX)1XTy\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}. It requires no hyperparameter tuning and is the preferred method when the feature count is manageable. For numerical stability, use SVD-based solvers rather than explicit matrix inversion. When features are highly correlated or numerous, ridge regression or iterative methods are better alternatives.


Course illustration
Course illustration

All Rights Reserved.