Introduction
Scikit-learn's LinearRegression does not provide confidence intervals for its coefficients because it is designed for prediction, not statistical inference. To get them, either compute them manually from the variance-covariance matrix of the coefficient estimates or use statsmodels.OLS, which provides them out of the box. This article shows both approaches: manual computation with sklearn and the simpler statsmodels alternative.
The Problem
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
print(model.coef_)       # approximately [1.99]
print(model.intercept_)  # approximately 0.05

# No built-in confidence interval method
# model.conf_int()  # AttributeError
Method 1: Manual Confidence Intervals with sklearn
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats

X = np.array([[1, 3], [2, 5], [3, 7], [4, 6], [5, 8],
              [6, 9], [7, 10], [8, 11], [9, 12], [10, 14]])
y = np.array([2.1, 4.2, 5.8, 7.1, 9.3, 10.5, 12.8, 14.1, 16.2, 18.0])

model = LinearRegression().fit(X, y)

# Step 1: Get predictions and residuals
y_pred = model.predict(X)
residuals = y - y_pred
n = len(y)
p = X.shape[1] + 1  # Number of parameters (features + intercept)

# Step 2: Estimate variance of residuals (MSE)
mse = np.sum(residuals ** 2) / (n - p)

# Step 3: Compute variance-covariance matrix of coefficients
# Add intercept column to X
X_with_intercept = np.column_stack([np.ones(n), X])
cov_matrix = mse * np.linalg.inv(X_with_intercept.T @ X_with_intercept)

# Step 4: Extract standard errors (square root of diagonal)
se = np.sqrt(np.diag(cov_matrix))

# Step 5: Compute confidence intervals using t-distribution
alpha = 0.05  # 95% confidence
t_value = stats.t.ppf(1 - alpha / 2, df=n - p)

coefficients = np.concatenate([[model.intercept_], model.coef_])
ci_lower = coefficients - t_value * se
ci_upper = coefficients + t_value * se

print("Parameter | Estimate | Std Error | 95% CI")
print("-" * 55)
names = ["Intercept"] + [f"X{i+1}" for i in range(X.shape[1])]
for name, coef, err, lo, hi in zip(names, coefficients, se, ci_lower, ci_upper):
    print(f"{name:12s} | {coef:8.4f} | {err:9.4f} | [{lo:.4f}, {hi:.4f}]")
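In matrix notation, the five steps above correspond to the standard OLS results below, given here as a reference sketch. The symbols are notation introduced for this summary rather than names from the code: $\hat{\beta}$ is the coefficient vector including the intercept, $\hat{\sigma}^2$ corresponds to `mse`, and $X$ corresponds to `X_with_intercept`.

```latex
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n - p} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
\widehat{\operatorname{Cov}}(\hat{\beta}) &= \hat{\sigma}^2 \, (X^\top X)^{-1} \\
\operatorname{SE}(\hat{\beta}_j) &= \sqrt{\widehat{\operatorname{Cov}}(\hat{\beta})_{jj}} \\
\mathrm{CI}_{1-\alpha}(\beta_j) &= \hat{\beta}_j \pm t_{1-\alpha/2,\; n-p} \, \operatorname{SE}(\hat{\beta}_j)
\end{aligned}
```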
Method 2: Using statsmodels (Recommended)
import statsmodels.api as sm
import numpy as np

X = np.array([[1, 3], [2, 5], [3, 7], [4, 6], [5, 8],
              [6, 9], [7, 10], [8, 11], [9, 12], [10, 14]])
y = np.array([2.1, 4.2, 5.8, 7.1, 9.3, 10.5, 12.8, 14.1, 16.2, 18.0])

# Add constant for intercept
X_sm = sm.add_constant(X)

model = sm.OLS(y, X_sm).fit()

# Confidence intervals are built in
print(model.conf_int(alpha=0.05))
# With NumPy inputs this prints a (3, 2) array: one row per parameter
# (const, x1, x2), with columns for the lower and upper 95% bounds.
# Pass X as a pandas DataFrame to get a labeled DataFrame back instead.

# Full summary with p-values, R-squared, etc.
print(model.summary())
statsmodels.OLS provides confidence intervals, p-values, t-statistics, R-squared, and residual diagnostics in a single summary() call.
Prediction Confidence Intervals
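Intervals can also be computed for predictions, not just coefficients. statsmodels reports both the confidence interval for the mean response and the prediction interval for individual observations: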
# statsmodels prediction intervals
predictions = model.get_prediction(X_sm)
summary_frame = predictions.summary_frame(alpha=0.05)

print(summary_frame[["mean", "mean_ci_lower", "mean_ci_upper",
                     "obs_ci_lower", "obs_ci_upper"]])
# mean_ci = confidence interval for the mean response
# obs_ci = prediction interval for individual observations
# Manual prediction intervals with sklearn
# Reuses X, y, n, p, mse, and cov_matrix from Method 1
from scipy import stats
from sklearn.linear_model import LinearRegression

model_sklearn = LinearRegression().fit(X, y)

X_new = np.array([[6, 10]])
y_pred = model_sklearn.predict(X_new)

X_new_with_intercept = np.column_stack([np.ones(len(X_new)), X_new])
# cov_matrix already includes the mse factor, so add mse only once for the noise term
pred_var = mse + X_new_with_intercept @ cov_matrix @ X_new_with_intercept.T
pred_se = np.sqrt(np.diag(pred_var))

t_val = stats.t.ppf(0.975, df=n - p)
pi_lower = y_pred - t_val * pred_se
pi_upper = y_pred + t_val * pred_se

print(f"Prediction: {y_pred[0]:.2f} [{pi_lower[0]:.2f}, {pi_upper[0]:.2f}]")
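The difference between the two kinds of interval comes down to one extra variance term. The following is a minimal sketch, using only NumPy and SciPy, of a helper that returns both intervals for new rows under the classical OLS assumptions. The function name `ols_intervals` and its return layout are hypothetical, chosen for this illustration.

```python
import numpy as np
from scipy import stats


def ols_intervals(X, y, X_new, alpha=0.05):
    """Return (y_hat, mean_ci, pred_int) for each row of X_new under classical OLS."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])                  # design matrix with intercept
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    p = A.shape[1]

    beta, *_ = np.linalg.lstsq(A, y, rcond=None)          # least-squares coefficients
    resid = y - A @ beta
    mse = resid @ resid / (n - p)
    cov = mse * np.linalg.inv(A.T @ A)

    y_hat = A_new @ beta
    # Variance of the fitted mean at each new row: x' Cov(beta) x
    var_mean = np.einsum("ij,jk,ik->i", A_new, cov, A_new)
    se_mean = np.sqrt(var_mean)
    se_pred = np.sqrt(mse + var_mean)                     # noise variance added once

    t_val = stats.t.ppf(1 - alpha / 2, df=n - p)
    mean_ci = np.column_stack([y_hat - t_val * se_mean, y_hat + t_val * se_mean])
    pred_int = np.column_stack([y_hat - t_val * se_pred, y_hat + t_val * se_pred])
    return y_hat, mean_ci, pred_int


X = np.array([[1, 3], [2, 5], [3, 7], [4, 6], [5, 8],
              [6, 9], [7, 10], [8, 11], [9, 12], [10, 14]])
y = np.array([2.1, 4.2, 5.8, 7.1, 9.3, 10.5, 12.8, 14.1, 16.2, 18.0])

y_hat, mean_ci, pred_int = ols_intervals(X, y, np.array([[6, 10]]))
print("mean CI:            ", mean_ci[0])
print("prediction interval:", pred_int[0])
```

Both intervals are centered on the same point prediction; the prediction interval is wider because it also accounts for the noise in a single new observation.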
Bootstrapped Confidence Intervals
When OLS assumptions such as normally distributed or homoscedastic errors are violated, the bootstrap provides non-parametric confidence intervals by refitting the model on resampled data:
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample
import numpy as np

n_bootstrap = 1000
coefs = np.zeros((n_bootstrap, X.shape[1]))

for i in range(n_bootstrap):
    X_boot, y_boot = resample(X, y, random_state=i)
    model = LinearRegression().fit(X_boot, y_boot)
    coefs[i] = model.coef_

# 95% confidence intervals from bootstrap distribution
ci_lower = np.percentile(coefs, 2.5, axis=0)
ci_upper = np.percentile(coefs, 97.5, axis=0)

for j in range(X.shape[1]):
    print(f"X{j+1}: [{ci_lower[j]:.4f}, {ci_upper[j]:.4f}]")
Bootstrap works with any sklearn estimator, not just linear regression, and does not require normality assumptions.
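To illustrate the point about other estimators, here is a sketch of a generic percentile-bootstrap helper applied to `Ridge`. The helper name `bootstrap_coef_ci`, its parameters, and the synthetic data are assumptions made for this example; it also assumes the estimator exposes `coef_` and `intercept_` after fitting, which holds for sklearn's linear models but not for every estimator.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge


def bootstrap_coef_ci(estimator, X, y, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for the intercept and coefficients."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_bootstrap, X.shape[1] + 1))
    for b in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        fitted = clone(estimator).fit(X[idx], y[idx])
        draws[b, 0] = fitted.intercept_
        draws[b, 1:] = fitted.coef_
    lower = np.percentile(draws, 100 * alpha / 2, axis=0)
    upper = np.percentile(draws, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


# Synthetic data, generated only for this illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = (1.0 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=200))

lower, upper = bootstrap_coef_ci(Ridge(alpha=1.0), X_demo, y_demo, n_bootstrap=500)
for name, lo, hi in zip(["Intercept", "X1", "X2"], lower, upper):
    print(f"{name:10s} [{lo:.4f}, {hi:.4f}]")
```

For a regularized model such as Ridge, these intervals describe the variability of the shrunken estimates, so they should be read with that bias in mind.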
Common Pitfalls
Assuming sklearn provides confidence intervals: LinearRegression is a prediction tool, not a statistical inference tool. It does not compute standard errors, p-values, or confidence intervals. Use statsmodels.OLS for inference.
Forgetting to add the intercept column for manual computation: The variance-covariance matrix requires an X matrix with a column of ones for the intercept. Without it, the standard errors are wrong.
Confusing confidence intervals with prediction intervals: Confidence intervals describe uncertainty about the mean response. Prediction intervals describe uncertainty about individual future observations and are always wider.
Violating OLS assumptions: The classical OLS confidence intervals assume linearity, independence, homoscedasticity, and normally distributed errors. If the errors are heteroscedastic, use robust standard errors (sm.OLS(y, X_sm).fit(cov_type='HC3')); if they are clearly non-normal, consider the bootstrap.
Using too few bootstrap resamples: A bootstrap with 100 resamples gives unstable intervals. Use at least 1,000 resamples (10,000 for publication-quality results).
Summary
Sklearn's LinearRegression does not provide confidence intervals; use statsmodels.OLS for built-in support
For manual computation, calculate the variance-covariance matrix from MSE * inv(X'X) and apply the t-distribution
Use model.conf_int() in statsmodels for coefficient CIs and model.get_prediction().summary_frame() for mean-response CIs and prediction intervals
Bootstrap confidence intervals work with any estimator and do not require OLS assumptions
Always distinguish between confidence intervals (mean response) and prediction intervals (individual observations)