How to use XGBoost algorithm for regression in R?

Introduction

XGBoost (eXtreme Gradient Boosting) is a gradient-boosted tree library that performs well on tabular regression tasks. It offers built-in regularization, native handling of missing values, and parallel tree construction. In R, the xgboost package provides the xgb.train() and xgboost() functions. For regression, set objective = "reg:squarederror" (squared error loss) or "reg:squaredlogerror" (squared error on log(1 + y), which requires targets greater than -1). The workflow is: convert the data to an xgb.DMatrix, tune hyperparameters with cross-validation (xgb.cv), train the model, and evaluate it with RMSE or MAE.

Installation and Setup

r
# Install the packages used in this tutorial
install.packages(c("xgboost", "caret", "Metrics"))

library(xgboost)
library(caret)       # For the caret workflow at the end
library(Metrics)     # Optional helpers for RMSE, MAE

Basic Regression Example

r
library(xgboost)

# Use the built-in mtcars dataset
data(mtcars)

# Predict mpg from the other features
set.seed(42)
train_idx <- sample(seq_len(nrow(mtcars)), floor(0.8 * nrow(mtcars)))
train_data <- mtcars[train_idx, ]
test_data <- mtcars[-train_idx, ]

# Separate features and target
X_train <- as.matrix(train_data[, -1])  # All columns except mpg
y_train <- train_data$mpg

X_test <- as.matrix(test_data[, -1])
y_test <- test_data$mpg

# Create DMatrix objects
dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest <- xgb.DMatrix(data = X_test, label = y_test)

# Train model
model <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eta = 0.1,           # Learning rate
    max_depth = 6,       # Maximum tree depth
    subsample = 0.8,     # Row sampling
    colsample_bytree = 0.8  # Column sampling
  ),
  data = dtrain,
  nrounds = 100,
  # Datasets to evaluate after each round
  # (some newer package releases name this argument `evals`)
  watchlist = list(train = dtrain, test = dtest),
  verbose = 0
)

# Predict on the held-out rows and compute RMSE
predictions <- predict(model, dtest)
rmse <- sqrt(mean((predictions - y_test)^2))
cat("RMSE:", rmse, "\n")

The objective = "reg:squarederror" parameter tells XGBoost to minimize squared error, the usual choice for predicting a continuous target.
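To make the squared-error objective more concrete, here is a minimal base-R sketch of the two quantities a gradient-boosting round needs from the loss: its first and second derivatives with respect to the prediction. It assumes the common convention of a half squared error, 0.5 * (pred - label)^2, so the gradient is simply the residual. The function name squared_error_grad_hess is made up for illustration and is not part of the xgboost package.

```r
# Hypothetical helper (not an xgboost function): gradient and Hessian of the
# loss 0.5 * (pred - label)^2 with respect to the prediction.
squared_error_grad_hess <- function(pred, label) {
  list(
    grad = pred - label,          # first derivative: the residual
    hess = rep(1, length(pred))   # second derivative: constant 1
  )
}

# Small made-up example
pred  <- c(20.0, 25.0, 15.0)
label <- c(21.0, 22.5, 15.0)

gh <- squared_error_grad_hess(pred, label)
print(gh$grad)

# RMSE, computed the same way as in the example above
rmse_sketch <- sqrt(mean((pred - label)^2))
cat("RMSE:", rmse_sketch, "\n")
```

Each new tree is fitted to push predictions in the direction that shrinks these residuals, which is why the reported RMSE falls as rounds are added.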

Cross-Validation for Hyperparameter Tuning

r
library(xgboost)

params <- list(
  objective = "reg:squarederror",
  eta = 0.1,
  max_depth = 6,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# 5-fold cross-validation
cv_result <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 500,
  nfold = 5,
  metrics = "rmse",
  early_stopping_rounds = 20,  # Stop if no improvement for 20 rounds
  verbose = 1
)

# Optimal number of rounds
best_nrounds <- cv_result$best_iteration
cat("Best iteration:", best_nrounds, "\n")
cat("Best RMSE:", min(cv_result$evaluation_log$test_rmse_mean), "\n")

# Train final model with optimal nrounds
final_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = best_nrounds
)

xgb.cv performs k-fold cross-validation and reports the train and test metrics for each round, averaged over the folds. early_stopping_rounds limits overfitting by stopping once the test metric has failed to improve for the given number of consecutive rounds.
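The stopping rule can be sketched in a few lines of base R. This is only an illustration of the idea, not xgboost's internal implementation; the helper early_stop_round and the validation errors are made up.

```r
# Hypothetical helper: scan per-round validation errors and return the best
# round seen before `patience` consecutive rounds pass without improvement.
early_stop_round <- function(val_error, patience) {
  best_round <- 1
  best_error <- val_error[1]
  for (i in seq_along(val_error)) {
    if (val_error[i] < best_error) {
      best_error <- val_error[i]
      best_round <- i
    } else if (i - best_round >= patience) {
      break  # no improvement for `patience` rounds, so stop scanning
    }
  }
  best_round
}

# Made-up validation RMSE per boosting round
val_rmse <- c(5.0, 4.2, 3.9, 3.7, 3.8, 3.75, 3.9, 3.6)

stop_at <- early_stop_round(val_rmse, patience = 3)
cat("Best round before stopping:", stop_at, "\n")
```

Note the trade-off this exposes: with a short patience the scan stops before it ever sees the lower error in the last round, so a small early_stopping_rounds value can end training too soon on a noisy metric.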

Grid Search for Hyperparameters

r
library(xgboost)

# Define search grid
grid <- expand.grid(
  eta = c(0.01, 0.05, 0.1),
  max_depth = c(3, 6, 9),
  subsample = c(0.7, 0.8, 1.0)
)

best_rmse <- Inf
best_params <- NULL
best_nrounds <- NULL

for (i in seq_len(nrow(grid))) {
  params <- list(
    objective = "reg:squarederror",
    eta = grid$eta[i],
    max_depth = grid$max_depth[i],
    subsample = grid$subsample[i],
    colsample_bytree = 0.8
  )

  # Reset the seed so every combination is scored on the same folds
  set.seed(42)
  cv <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = 300,
    nfold = 5,
    metrics = "rmse",
    early_stopping_rounds = 15,
    verbose = 0
  )

  # Keep the combination with the lowest cross-validated RMSE
  min_rmse <- min(cv$evaluation_log$test_rmse_mean)
  if (min_rmse < best_rmse) {
    best_rmse <- min_rmse
    best_params <- params
    best_nrounds <- cv$best_iteration
  }
}

cat("Best RMSE:", best_rmse, "\n")
cat("Best params:", paste(names(best_params), best_params, sep = "="), "\n")
cat("Best nrounds:", best_nrounds, "\n")

Feature Importance

r
library(xgboost)

# Get feature importance
importance <- xgb.importance(model = final_model)
print(importance)
# Illustrative output only; the actual values depend on the fitted model
#    Feature       Gain      Cover  Frequency
# 1:      wt 0.48231    0.28312    0.25000
# 2:    disp 0.21123    0.19876    0.20000
# 3:      hp 0.15432    0.22341    0.18000

# Plot feature importance
xgb.plot.importance(importance, top_n = 10)

Gain is the share of the model's total loss reduction that comes from splits on a feature. Cover is the relative number of observations affected by those splits. Frequency is the share of all splits in the trees that use the feature.
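As a rough illustration of how Gain and Frequency relate to the individual splits, the base-R sketch below aggregates a made-up table with one row per split. The data frame and its column names are assumptions for this example, not the structure xgboost returns, and Cover is left out for brevity.

```r
# Made-up split records: one row per split across all trees
splits <- data.frame(
  feature = c("wt", "wt", "disp", "hp", "wt", "disp"),
  gain    = c(40, 20, 15, 10, 10, 5)
)

# Sum the gain and count the splits per feature
total_gain  <- tapply(splits$gain, splits$feature, sum)
split_count <- table(splits$feature)

# Normalize both so that each column sums to 1 across features
importance_sketch <- data.frame(
  feature   = names(total_gain),
  gain      = as.numeric(total_gain) / sum(total_gain),
  frequency = as.numeric(split_count[names(total_gain)]) / nrow(splits)
)

# Sort by gain, largest first
importance_sketch <- importance_sketch[order(-importance_sketch$gain), ]
print(importance_sketch)
```

A feature can rank high on Frequency but low on Gain when it is used in many splits that each improve the loss only slightly, which is why Gain is usually the more informative column.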

Regression Objective Functions

r
# Standard squared error (L2 loss): the most common choice
params <- list(objective = "reg:squarederror")

# Squared log error: penalizes relative rather than absolute differences,
# useful for targets spanning a large range (e.g., prices); labels must be > -1
params <- list(objective = "reg:squaredlogerror")

# Absolute error (L1 loss): robust to outliers (needs a recent xgboost release)
params <- list(objective = "reg:absoluteerror")

# Pseudo-Huber loss: a smooth, differentiable alternative to absolute error
params <- list(objective = "reg:pseudohubererror")

# Quantile regression (needs a recent xgboost release)
params <- list(
  objective = "reg:quantileerror",
  quantile_alpha = 0.5  # Median regression
)

# Gamma regression: for strictly positive targets
params <- list(objective = "reg:gamma")

# Tweedie regression: for non-negative targets with many exact zeros
params <- list(objective = "reg:tweedie", tweedie_variance_power = 1.5)
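To see how some of these objectives differ in practice, the following base-R sketch writes several of the losses out by hand and applies them to one exact, one slightly wrong, and one badly wrong prediction. The function names are made up for this example, and the formulas follow the usual textbook definitions (with delta = 1 for the pseudo-Huber loss); treat it as an illustration rather than a copy of the library's code.

```r
# Per-observation losses written out by hand (illustrative, not xgboost code)
squared_error     <- function(y, p) 0.5 * (p - y)^2
squared_log_error <- function(y, p) 0.5 * (log1p(p) - log1p(y))^2
absolute_error    <- function(y, p) abs(p - y)
pseudo_huber      <- function(y, p, delta = 1) {
  delta^2 * (sqrt(1 + ((p - y) / delta)^2) - 1)
}
# Pinball (quantile) loss; alpha = 0.5 gives half the absolute error
pinball <- function(y, p, alpha = 0.5) {
  ifelse(y >= p, alpha * (y - p), (1 - alpha) * (p - y))
}

y <- c(10, 10, 10)   # true values
p <- c(10, 12, 40)   # exact, slightly off, far off

losses <- rbind(
  squared        = squared_error(y, p),
  squared_log    = squared_log_error(y, p),
  absolute       = absolute_error(y, p),
  pseudo_huber   = pseudo_huber(y, p),
  pinball_median = pinball(y, p, alpha = 0.5)
)
print(round(losses, 3))
```

The last column shows why the choice matters: the squared loss grows quadratically with the error, so a single outlier dominates training, while the absolute, pseudo-Huber, and pinball losses grow roughly linearly.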

Complete Workflow with caret

r
library(caret)
library(xgboost)

data(mtcars)
set.seed(42)

# caret handles DMatrix conversion and cross-validation
train_control <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = FALSE
)

# xgbTree expects all seven tuning parameters in the grid
tune_grid <- expand.grid(
  nrounds = c(100, 200),
  max_depth = c(3, 6),
  eta = c(0.05, 0.1),
  gamma = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample = 0.8
)

model <- train(
  mpg ~ .,
  data = mtcars,
  method = "xgbTree",
  trControl = train_control,
  tuneGrid = tune_grid
)

print(model$bestTune)

# In-sample predictions; use held-out rows for an honest error estimate
predictions <- predict(model, mtcars)

The caret package wraps XGBoost with a unified interface, handling data conversion, cross-validation, and grid search automatically.

Common Pitfalls

  • Not using xgb.DMatrix: Passing a raw data frame to xgb.train fails. XGBoost requires a numeric matrix wrapped in xgb.DMatrix. Convert factors to numeric columns first (for example, one-hot encoding with model.matrix()), then call as.matrix() if needed before creating the DMatrix.
  • Overfitting without early stopping: XGBoost can memorize training data if nrounds is too high. Always use xgb.cv with early_stopping_rounds to find the optimal number of boosting rounds, or use a validation set in the watchlist.
  • Using deprecated reg:linear objective: The objective "reg:linear" was renamed to "reg:squarederror" in XGBoost 0.90. Using the old name triggers a deprecation warning. Always use "reg:squarederror" for standard regression.
  • Keeping thousands of irrelevant features: XGBoost handles sparse data well, but including thousands of irrelevant features still slows training. Check feature importance after an initial model, keep the top features, and retrain.
  • Not scaling the learning rate with nrounds: A high eta (0.3) with many rounds overfits quickly. A low eta (0.01) with few rounds underfits. The rule of thumb is to lower eta and increase nrounds with early stopping for best results.
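The factor conversion mentioned in the first pitfall can be sketched with base R's model.matrix(). The data frame below is made up for illustration; the resulting matrix is the kind of input xgb.DMatrix expects.

```r
# Made-up data frame with one numeric and one factor column
cars <- data.frame(
  hp   = c(110, 93, 175, 105),
  gear = factor(c("four", "three", "four", "five"))
)

# `- 1` drops the intercept so every level of `gear` gets its own 0/1 column
X <- model.matrix(~ . - 1, data = cars)

print(colnames(X))
print(X)
```

With several factor columns, model.matrix() fully expands only the first and drops one reference level from each of the others. That is usually acceptable for tree models, but it is worth checking colnames(X) before training.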

Summary

  • Use objective = "reg:squarederror" for standard regression in XGBoost
  • Convert data to xgb.DMatrix with numeric matrices before training
  • Use xgb.cv with early_stopping_rounds to find optimal nrounds and prevent overfitting
  • Tune eta, max_depth, subsample, and colsample_bytree via grid search
  • Check xgb.importance() to understand which features drive predictions
  • Use caret::train(method = "xgbTree") for an integrated workflow with automatic cross-validation
