Combining random forests built with different training sets in R

Introduction

Combining random forests trained on different datasets can make sense, but only if you are clear about what "combine" means. In R, you can sometimes merge forest objects directly, but in many practical cases the safer approach is to combine predictions rather than trying to fuse the internals of independently trained models.

Two Different Meanings of “Combining”

There are two common scenarios:

The forests were trained for the same target with the same predictors and compatible settings.
The forests were trained on different feature spaces, preprocessing pipelines, or label definitions.

In the first case, direct combination may be possible. In the second case, combining predictions is usually the better idea.

Direct Combination With Compatible Forests

The randomForest package provides a combine function for compatible forests.

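library(randomForest)

set.seed(1)
# Shuffle before splitting: iris is sorted by species, so rows 1:75 contain no
# virginica, and randomForest() stops on empty classes in the response.
idx <- sample(nrow(iris))
train_a <- iris[idx[1:75], ]
train_b <- iris[idx[76:150], ]

# Fit one forest per partition with the same formula and settings
rf_a <- randomForest(Species ~ ., data = train_a, ntree = 50)
rf_b <- randomForest(Species ~ ., data = train_b, ntree = 50)

# Pool the trees of both forests into a single randomForest object
rf_combined <- combine(rf_a, rf_b)
print(rf_combined)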

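This pools the trees from both fitted forests into one larger forest object. That works when the modeling problem and feature space are consistent across both fits. The combined object no longer carries an out-of-bag error rate or confusion matrix, so its accuracy has to be measured on held-out data.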
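A minimal sketch for inspecting the combined object, assuming a shuffled split of iris so that each half is expected to contain all three species. The object names are illustrative. It checks the tree count and whether the out-of-bag summaries survive the merge.

```r
library(randomForest)

set.seed(1)
# Shuffle rows so both halves are expected to contain every species
idx <- sample(nrow(iris))
rf_a <- randomForest(Species ~ ., data = iris[idx[1:75], ], ntree = 50)
rf_b <- randomForest(Species ~ ., data = iris[idx[76:150], ], ntree = 50)

rf_combined <- combine(rf_a, rf_b)

# Tree count of the merged forest (sum of the inputs' ntree)
rf_combined$ntree

# Out-of-bag summaries are dropped by combine()
is.null(rf_combined$err.rate)
is.null(rf_combined$confusion)
```

Because the out-of-bag components are gone, any accuracy figure for the merged forest has to come from a separate test set.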

When Direct Combination Is Reasonable

Direct combination makes the most sense when:

The response variable is the same.
The predictors are the same and in the same format.
The forests are the same type (both classification or both regression) and were trained with compatible settings.
You are conceptually just adding more trees trained on additional data partitions.

If those assumptions hold, combining the forests is similar to building one larger ensemble in pieces.
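Some of these assumptions can be checked mechanically before merging. The sketch below uses only base R. The helper `same_schema` is a hypothetical name, not part of randomForest, and it compares only column names, column classes, and factor levels of the training frames. It says nothing about whether the data were preprocessed the same way.

```r
# Hypothetical helper: TRUE when two training frames share the same
# column names, column classes, and factor levels.
same_schema <- function(a, b) {
  if (!identical(names(a), names(b))) return(FALSE)
  same_class  <- identical(lapply(a, class), lapply(b, class))
  same_levels <- identical(lapply(a, levels), lapply(b, levels))
  same_class && same_levels
}

# Two partitions of the same data set: schemas match
train_a <- iris[seq(1, 150, by = 2), ]
train_b <- iris[seq(2, 150, by = 2), ]
same_schema(train_a, train_b)   # TRUE

# A renamed predictor breaks compatibility
train_c <- train_b
names(train_c)[1] <- "sepal_len"
same_schema(train_a, train_c)   # FALSE
```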

Prediction-Level Ensembling Is More Flexible

If the models were trained on meaningfully different datasets or pipelines, average or vote on predictions instead of merging the forest objects.

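library(randomForest)

set.seed(1)
# Hold out 50 rows for testing; train on the other 100
idx <- sample(seq_len(nrow(iris)), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

# Two forests fitted independently; rf_2 sees only an 80-row subsample
rf_1 <- randomForest(Species ~ ., data = train, ntree = 100)
rf_2 <- randomForest(Species ~ ., data = train[sample(nrow(train), 80), ], ntree = 100)

# Class-probability matrices: one row per test case, one column per class
pred_1 <- predict(rf_1, test, type = "prob")
pred_2 <- predict(rf_2, test, type = "prob")

# Average the probabilities, then pick the most probable class per row
avg_prob <- (pred_1 + pred_2) / 2
final_pred <- colnames(avg_prob)[max.col(avg_prob, ties.method = "first")]
head(final_pred)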

This approach is robust because each model stays independent and the ensembling happens only at the prediction level.

Why Prediction Ensembling Is Often Better

Prediction-level ensembling is easier to reason about when:

Training sets differ in size or distribution.
The forests come from different resampling schemes.
You want to compare model diversity explicitly.
The fitted objects are not structurally compatible for direct combination.

Instead of forcing the internals together, you let the models vote or average their confidence.
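When the training sets differ in size, one option is a weighted average of the class probabilities. The sketch below is base R only and uses toy matrices in place of real `predict(..., type = "prob")` output. The helper `average_probs` and the choice of weights are assumptions for illustration, not a recommendation from the randomForest package.

```r
# Hypothetical helper: weighted average of class-probability matrices.
# Columns are aligned by class name so a different column order
# cannot silently mix up classes.
average_probs <- function(prob_list, weights = rep(1, length(prob_list))) {
  stopifnot(length(prob_list) == length(weights),
            all(weights >= 0), sum(weights) > 0)
  classes <- colnames(prob_list[[1]])
  weights <- weights / sum(weights)          # normalise to sum to 1
  avg <- matrix(0, nrow = nrow(prob_list[[1]]), ncol = length(classes),
                dimnames = list(NULL, classes))
  for (i in seq_along(prob_list)) {
    p <- prob_list[[i]]
    stopifnot(setequal(colnames(p), classes), nrow(p) == nrow(avg))
    avg <- avg + weights[i] * p[, classes, drop = FALSE]
  }
  avg
}

# Toy probabilities for two models and two test cases
p1 <- matrix(c(0.9, 0.1,
               0.4, 0.6), nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("yes", "no")))
p2 <- matrix(c(0.3, 0.7,
               0.8, 0.2), nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("no", "yes")))   # different column order

# Weight the first model three times as heavily as the second
avg <- average_probs(list(p1, p2), weights = c(3, 1))
pred <- colnames(avg)[max.col(avg, ties.method = "first")]
pred   # "yes" "no"
```

Equal weights reduce this to the plain average. Whether weighting by training-set size helps is an empirical question to settle on held-out data.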

Be Careful About Data Leakage and Comparability

Combining forests is not automatically better. If the training sets overlap with each other or with the evaluation data, if preprocessing differs, or if the label construction changed, the ensemble can become harder to interpret rather than stronger.

A useful checklist is:

Same target definition.
Same predictor meaning.
Same preprocessing.
Same evaluation dataset.

If those are not aligned, the ensemble's predictions and its measured performance become hard to interpret.
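The "same evaluation dataset" item can be made concrete by setting the test rows aside first and scoring every model, and the ensemble, on exactly those rows. This is a sketch under assumed settings: the split sizes, seed, and the `accuracy` helper are illustrative choices.

```r
library(randomForest)

set.seed(42)
test_idx <- sample(nrow(iris), 50)
test <- iris[test_idx, ]     # one shared evaluation set
pool <- iris[-test_idx, ]    # remaining rows, used only for training

# Two forests trained on different random subsets of the training pool
rf_1 <- randomForest(Species ~ ., data = pool[sample(nrow(pool), 70), ], ntree = 100)
rf_2 <- randomForest(Species ~ ., data = pool[sample(nrow(pool), 70), ], ntree = 100)

# Fraction of test rows whose predicted label matches the true label
accuracy <- function(pred, truth) mean(as.character(pred) == as.character(truth))

# Prediction-level ensemble: average class probabilities
avg_prob <- (predict(rf_1, test, type = "prob") +
             predict(rf_2, test, type = "prob")) / 2
ens_pred <- colnames(avg_prob)[max.col(avg_prob, ties.method = "first")]

results <- c(
  rf_1     = accuracy(predict(rf_1, test), test$Species),
  rf_2     = accuracy(predict(rf_2, test), test$Species),
  ensemble = accuracy(ens_pred, test$Species)
)
results
```

Because no test row appears in either training subset, the three numbers are directly comparable. The ensemble is not guaranteed to score highest.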

Common Pitfalls

Trying to directly combine forest objects that were trained with incompatible feature spaces or targets.
Assuming more trees automatically means a better model without validating on held-out data.
Mixing preprocessing pipelines and then averaging predictions as if they were directly comparable.
Ignoring calibration differences when averaging class probabilities.
Treating “trained on different data” as a benefit without checking whether the data distributions actually complement each other.

Summary

Combining random forests can mean merging forest objects or ensembling predictions.
Direct combination is viable only when the forests are truly compatible.
Prediction-level ensembling is more flexible and often safer.
Validate the combined result on held-out data instead of assuming it improved performance.
Clear alignment of features, targets, and preprocessing matters more than the act of combining by itself.