Implementation of Linear Regression Closed Form Solution

Linear Regression

Closed Form Solution

Machine Learning

Statistical Modelling

Analytical Methods

Implementation of Linear Regression Closed Form Solution

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Linear regression is one of the most fundamental algorithms in machine learning and statistics, used to model the relationship between a dependent variable and one or more independent variables. Unlike iterative methods such as gradient descent, the closed-form solution computes the optimal coefficients directly in a single step using matrix algebra. This article explains the derivation, implementation, and practical considerations of the closed-form solution.

Linear Regression Basics

Linear regression fits a line (or hyperplane in higher dimensions) to data by minimizing the sum of squared errors between observed and predicted values.

For univariate regression, the model is:

$y = \beta_0 + \beta_1 x + \epsilon$

where $y$ is the dependent variable, $x$ is the independent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.

For multivariate regression with $m$ features, the model generalizes to:

$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$

where:

$\mathbf{y}$ is the $n \times 1$ vector of observed outputs
$\mathbf{X}$ is the $n \times (m+1)$ design matrix (with a column of ones for the intercept)
$\mathbf{w}$ is the $(m+1) \times 1$ vector of coefficients $[\beta_0, \beta_1, \ldots, \beta_m]^T$

The Normal Equation

The closed-form solution minimizes the sum of squared residuals $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ . Taking the gradient with respect to $\mathbf{w}$ and setting it to zero yields the normal equation:

$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$

Derivation

Start with the cost function:

$J(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w})$

Expanding:

$J(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}$

Taking the gradient and setting it to zero:

$\nabla_\mathbf{w} J = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w} = 0$

Solving for $\mathbf{w}$ :

$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$

$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Step-by-Step Computation

Given $n$ data points with $m$ features:

Construct the design matrix $\mathbf{X}$ by prepending a column of ones:

$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}$

Compute $\mathbf{X}^T\mathbf{X}$ , a $(m+1) \times (m+1)$ square matrix
Compute $(\mathbf{X}^T\mathbf{X})^{-1}$
Compute $\mathbf{X}^T\mathbf{y}$ , a $(m+1) \times 1$ vector
Multiply: $\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Implementation

Python with NumPy

python

1import numpy as np
2
3def linear_regression_closed_form(X, y):
4    """
5    X: numpy array of shape (n, m) - feature matrix
6    y: numpy array of shape (n,) - target vector
7    Returns: coefficient vector w of shape (m+1,)
8    """
9    n = X.shape[0]
10    # Add intercept column
11    X_aug = np.column_stack([np.ones(n), X])
12    # Normal equation
13    w = np.linalg.inv(X_aug.T @ X_aug) @ X_aug.T @ y
14    return w
15
16# Example usage
17X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
18y = np.array([3, 7, 11, 15])
19w = linear_regression_closed_form(X, y)
20print(f"Intercept: {w[0]:.4f}, Coefficients: {w[1:]}")

More Numerically Stable Version

Using np.linalg.lstsq avoids explicitly computing the matrix inverse, which can be numerically unstable:

python

1def linear_regression_stable(X, y):
2    n = X.shape[0]
3    X_aug = np.column_stack([np.ones(n), X])
4    w, residuals, rank, sv = np.linalg.lstsq(X_aug, y, rcond=None)
5    return w

This uses the SVD (Singular Value Decomposition) internally, which is more robust when $\mathbf{X}^T\mathbf{X}$ is nearly singular.

Advantages and Disadvantages

Aspect	Closed-Form Solution	Gradient Descent
Computation	Single matrix operation	Iterative, many steps
Hyperparameters	None	Learning rate, iterations
Complexity	$O(m^2 n + m^3)$	$O(mnk)$ per epoch
Large $m$	Impractical (matrix inverse)	Scales well
Large $n$	Scales well	Each epoch is $O(mn)$
Numerical stability	Can be poor	Generally stable

The closed-form solution is ideal when the number of features $m$ is small (up to a few thousand). For high-dimensional problems, iterative methods like gradient descent or stochastic gradient descent are preferred because the $O(m^3)$ cost of matrix inversion becomes prohibitive.

Regularized Variant: Ridge Regression

When $\mathbf{X}^T\mathbf{X}$ is singular or nearly singular (multicollinearity), the inverse does not exist or is numerically unstable. Ridge regression adds a regularization term:

$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$

The $\lambda\mathbf{I}$ term ensures the matrix is invertible and shrinks the coefficients toward zero, reducing overfitting. Here $\lambda > 0$ is the regularization strength.

Summary

The closed-form solution for linear regression provides exact coefficients in one computation step via the normal equation $\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ . It requires no hyperparameter tuning and is the preferred method when the feature count is manageable. For numerical stability, use SVD-based solvers rather than explicit matrix inversion. When features are highly correlated or numerous, ridge regression or iterative methods are better alternatives.