Plotting Residuals vs Predicted Values in Python

Introduction

A residuals-versus-predicted plot is one of the fastest ways to check whether a regression model is behaving sensibly. The plot places predicted (fitted) values on the horizontal axis and residuals, defined as actual values minus predicted values, on the vertical axis. A good fit usually produces residuals scattered randomly around zero with no visible pattern.

What Residuals Represent

For each observation:

  • predicted value = what the model estimated
  • residual = actual minus predicted

If the model is well specified, the residuals should not systematically depend on the fitted value.

```python
residual = y_true - y_pred
```

That simple quantity is the foundation of the plot.
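
As a small worked example, the sketch below applies the formula to a handful of made-up numbers. The arrays are illustrative values only, not taken from any dataset.

```python
import numpy as np

# Illustrative values only (not from a real dataset)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

# Positive residual: the model under-predicted; negative: it over-predicted
residuals = y_true - y_pred
print(residuals)  # the four residuals are 0.5, -0.5, 0.5 and -1.0
```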

Create A Residual Plot With scikit-learn And Matplotlib

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic one-feature regression problem with Gaussian noise
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Residuals on the held-out test set: actual minus predicted
y_pred = model.predict(X_test)
residuals = y_test - y_pred

plt.axhline(0, color="black", linestyle="--")  # zero reference line
plt.scatter(y_pred, residuals, alpha=0.7)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted")
plt.show()
```

This is often enough for a first diagnostic check.

What A Good Plot Looks Like

A healthy residual plot usually shows:

  • points centered around zero
  • no strong curved pattern
  • roughly similar vertical spread across the range of fitted values

That does not prove the model is perfect, but it suggests the linear fit is at least not obviously violating the linearity and constant-variance assumptions.
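
The visual criteria above can be backed by a rough numeric check. The sketch below is one possible approach, not a standard test: it splits observations into three equal-sized groups by predicted value and compares the mean and spread of the residuals in each group. The three-group split and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=1, noise=15, random_state=0)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

# Sort observations by predicted value and split into three equal-sized groups
order = np.argsort(y_pred)
group_stats = {}
for name, idx in zip(["low", "middle", "high"], np.array_split(order, 3)):
    group = residuals[idx]
    group_stats[name] = (group.mean(), group.std(ddof=1))
    print(f"{name:>6}: mean={group.mean():7.2f}  std={group.std(ddof=1):6.2f}")
```

For a healthy fit, the group means should sit near zero and the standard deviations should be of similar size. Large differences between groups are a prompt to look at the plot more closely, not a verdict on their own.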

What Problem Patterns Look Like

Several visual patterns are warnings:

  • curve shape: the relationship may be nonlinear
  • funnel shape: residual variance may change with prediction size
  • clusters: missing groups or interactions may exist
  • isolated large points: potential outliers or influential observations

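If residuals trend upward or downward as predictions increase, the model is systematically missing structure. For ordinary least squares with an intercept, in-sample residuals are uncorrelated with the fitted values by construction, so a linear trend like this typically shows up on held-out data or with other model types.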
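
To make the funnel pattern concrete, the sketch below simulates data whose noise grows with the input and then measures how strongly the absolute residual rises with the prediction. The simulated data, the noise model, and the use of a Spearman rank correlation as an indicator are all assumptions chosen for illustration; this is not a formal heteroscedasticity test.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
# Noise standard deviation is proportional to x, which produces a funnel
y = 2.0 * x + rng.normal(scale=0.5 * x)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))
residuals = y - y_pred

# A clearly positive rank correlation means residual size grows with the prediction
rho, p_value = spearmanr(y_pred, np.abs(residuals))
print(f"Spearman correlation between prediction and |residual|: {rho:.2f}")
```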

Example Of A Nonlinear Misspecification

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # seeded so the figure is reproducible

# Quadratic ground truth with Gaussian noise
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

# Deliberately misspecified: a straight line fitted to curved data
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
residuals = y - y_pred

plt.axhline(0, color="black", linestyle="--")
plt.scatter(y_pred, residuals, alpha=0.7)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot with nonlinear pattern")
plt.show()
```

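Because the true relationship is quadratic, a linear model leaves a visible U-shaped pattern in the residuals. Since x is symmetric around zero here, the fitted slope is close to zero and the predicted values span only a narrow range on the horizontal axis, but the curve remains clear.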
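
One way to confirm that the curve comes from the missing quadratic term is to refit with that term included and compare the residuals. The sketch below is a minimal illustration under that assumption: it uses `PolynomialFeatures` in a pipeline, and the correlation between the residuals and x squared serves only as a crude numeric stand-in for the visible curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(x, y)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(x, y)

results = {}
for name, fitted in [("linear", linear), ("quadratic", quadratic)]:
    residuals = y - fitted.predict(x)
    # Correlation of residuals with x**2: high means the curve is still there
    curve = np.corrcoef(x[:, 0] ** 2, residuals)[0, 1]
    results[name] = (residuals.std(), curve)
    print(f"{name:>9}: residual std={residuals.std():.2f}, corr with x^2={curve:.2f}")
```

With the quadratic term in the model, the residual spread should drop to roughly the noise level and the correlation with x squared should vanish.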

Use The Plot As A Diagnostic, Not A Final Verdict

A residual plot is one diagnostic tool, not a proof of correctness. It works best alongside:

  • residual histogram or Q-Q plot
  • leverage or influence checks
  • domain knowledge about feature construction
  • train-test evaluation metrics

A random-looking residual plot is reassuring, but it does not guarantee the model is the right one for the business problem.
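
As an example of a companion diagnostic, the sketch below computes the points of a normal Q-Q plot for the residuals with `scipy.stats.probplot`. Using SciPy here is an assumption; the same check can be done with other libraries. The returned correlation `r` is only a rough summary of how straight the Q-Q line is.

```python
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# probplot returns the Q-Q points and a least-squares line fitted through them
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(f"Q-Q line correlation r = {r:.3f}")  # near 1 suggests roughly normal residuals
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the Q-Q plot directly instead of only returning the numbers.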

Statsmodels Also Makes This Easy

If you are already using Statsmodels for regression summaries, the residual calculation is equally straightforward.

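```python
import statsmodels.api as sm

# Reuses X_train, X_test, y_train, y_test from the scikit-learn example above
X_const = sm.add_constant(X_train)
model = sm.OLS(y_train, X_const).fit()

# has_constant="add" forces the intercept column even if a test column looks constant
pred = model.predict(sm.add_constant(X_test, has_constant="add"))
resid = y_test - pred
```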

You can then plot pred versus resid in the same way with Matplotlib.

Common Pitfalls

The most common mistake is plotting residuals against the wrong quantity, such as the original input feature when the diagnostic you want is residuals versus fitted values. Another is forgetting the horizontal zero line, which makes interpretation harder. Developers also sometimes overinterpret a small sample of noisy points and declare the model broken without checking whether the apparent pattern is stable. Finally, a residual plot can suggest a problem, but fixing the model still requires domain-informed feature or model changes.
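
A small helper can guard against the first two pitfalls by always plotting against predicted values and always drawing the zero line. The function below is a hypothetical sketch: the name `plot_residuals` and its signature are not from any library. The shape check is an extra guard, added here as an assumption, against silently broadcasting a column vector against a flat array.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_residuals(y_true, y_pred, ax=None):
    """Scatter residuals (actual minus predicted) against predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        # e.g. (n, 1) vs (n,) would broadcast to an (n, n) array of differences
        raise ValueError("y_true and y_pred must have the same shape")
    if ax is None:
        _, ax = plt.subplots()
    ax.axhline(0, color="black", linestyle="--")  # zero reference line
    ax.scatter(y_pred, y_true - y_pred, alpha=0.7)
    ax.set_xlabel("Predicted values")
    ax.set_ylabel("Residuals")
    return ax
```

A call such as `plot_residuals(y_test, model.predict(X_test))` followed by `plt.show()` would then produce the same kind of figure as the earlier examples.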

Summary

  • A residuals-versus-predicted plot checks whether regression errors behave randomly around zero.
  • Residuals are computed as actual values minus predicted values.
  • Curves, funnels, and clusters often indicate model misspecification or changing variance.
  • Matplotlib plus scikit-learn or Statsmodels is enough to generate the plot.
  • Use the plot as a diagnostic aid, not as the only measure of model quality.
