Combining random forests built with different training sets in R
Introduction
Combining random forests trained on different datasets can make sense, but only if you are clear about what "combine" means. In R, you can sometimes merge forest objects directly, but in many practical cases the safer approach is to combine predictions rather than trying to fuse the internals of independently trained models.
Two Different Meanings of “Combining”
There are two common scenarios:
- The forests were trained for the same target with the same predictors and compatible settings.
- The forests were trained on different feature spaces, preprocessing pipelines, or label definitions.
In the first case, direct combination may be possible. In the second case, combining predictions is usually the better idea.
Direct Combination With Compatible Forests
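The randomForest package provides combine(), which takes two or more fitted randomForest objects and returns a single object of the same class.
The result pools the trees from all input forests into one larger forest. That can work when the modeling problem and feature space are consistent across the fits. Two caveats apply. First, the error-tracking components of the combined object (err.rate and confusion for classification, mse and rsq for regression) are set to NULL, so it carries no out-of-bag error estimate. Second, the package's documented example combines forests grown on the same data, so when the training rows differ, treat the remaining in-sample components (such as the out-of-bag votes) as unreliable and use the combined object only to predict on new data.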
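A minimal sketch of direct combination, assuming the randomForest package is installed. The iris data, the equal-sized 50/50 split, and the names rf_a, rf_b, and rf_all are illustrative choices. The two partitions are deliberately the same size: partitions with different row counts may not combine cleanly, because the in-sample components are merged row by row.

```r
library(randomForest)

set.seed(42)

# Illustrative data: split iris into two disjoint, equal-sized partitions.
idx    <- sample(nrow(iris), nrow(iris) / 2)
part_a <- iris[idx, ]
part_b <- iris[-idx, ]

# Same response, same predictors, same settings in both fits.
rf_a <- randomForest(Species ~ ., data = part_a, ntree = 100)
rf_b <- randomForest(Species ~ ., data = part_b, ntree = 100)

# Pool the trees of both forests into a single randomForest object.
rf_all <- randomForest::combine(rf_a, rf_b)

rf_all$ntree                       # total number of trees across both fits

# Use the combined forest only for predictions on data passed explicitly.
pred_all <- predict(rf_all, newdata = iris)
head(pred_all)
```

The namespace prefix in randomForest::combine() guards against another attached package exporting a function with the same name.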
When Direct Combination Is Reasonable
Direct combination makes the most sense when:
- The response variable is the same.
- The predictors are the same and in the same format.
- The forests were trained with compatible settings.
- You are conceptually just adding more trees trained on additional data partitions.
If those assumptions hold, combining the forests resembles building one larger ensemble in pieces, with the difference that each tree has only seen the rows of its own partition.
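The conditions above can be partly checked in code before attempting a merge. The helper below is a hypothetical sketch, not part of the randomForest package: check_forest_compat and its messages are invented for illustration, and it only inspects the type, importance, classes, and mtry components of the fitted objects. It cannot verify that the predictors carry the same meaning or were preprocessed the same way.

```r
library(randomForest)

# Hypothetical helper: returns a character vector of detected problems
# (empty when none of the checked components differ).
check_forest_compat <- function(a, b) {
  problems <- character(0)
  if (!identical(a$type, b$type))
    problems <- c(problems, "different model type")
  if (!identical(rownames(a$importance), rownames(b$importance)))
    problems <- c(problems, "different predictors or predictor order")
  if (!identical(a$classes, b$classes))
    problems <- c(problems, "different class levels")
  if (!identical(a$mtry, b$mtry))
    problems <- c(problems, "different mtry")
  problems
}

set.seed(1)
idx  <- sample(nrow(iris), nrow(iris) / 2)
rf_a <- randomForest(Species ~ ., data = iris[idx, ],  ntree = 50)
rf_b <- randomForest(Species ~ ., data = iris[-idx, ], ntree = 50)

# A forest fit on a reduced feature space, for contrast.
rf_c <- randomForest(Species ~ Sepal.Length + Sepal.Width,
                     data = iris[-idx, ], ntree = 50)

compat_ab <- check_forest_compat(rf_a, rf_b)
compat_ac <- check_forest_compat(rf_a, rf_c)
print(compat_ab)
print(compat_ac)
```

An empty result is a necessary condition for merging, not a sufficient one.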
Prediction-Level Ensembling Is More Flexible
If the models were trained on meaningfully different datasets or pipelines, average their predicted probabilities (or take a majority vote on their predicted classes) instead of merging the forest objects.
This approach is robust because each model stays independent and the models are joined only at the prediction level.
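A minimal sketch of prediction-level ensembling for a classification problem, assuming the randomForest package is installed. The iris data, the split, and all object names are illustrative. Equal weighting of the two models is one simple choice among several.

```r
library(randomForest)

set.seed(7)

# Hold out rows for scoring; train two forests on different halves of the rest.
test_idx <- sample(nrow(iris), 30)
new_data <- iris[test_idx, ]
train    <- iris[-test_idx, ]
half     <- sample(nrow(train), nrow(train) / 2)

rf_a <- randomForest(Species ~ ., data = train[half, ],  ntree = 100)
rf_b <- randomForest(Species ~ ., data = train[-half, ], ntree = 100)

# Class probabilities: one row per observation, one column per class.
p_a <- predict(rf_a, newdata = new_data, type = "prob")
p_b <- predict(rf_b, newdata = new_data, type = "prob")

# Both matrices must list the classes in the same column order.
stopifnot(identical(colnames(p_a), colnames(p_b)))

# Unweighted average of the two probability matrices.
p_avg <- (p_a + p_b) / 2

# Predicted class = column with the highest averaged probability.
pred <- colnames(p_avg)[max.col(p_avg, ties.method = "first")]
table(predicted = pred, actual = new_data$Species)
```

Setting ties.method = "first" makes tie-breaking deterministic. The default in max.col() breaks ties at random.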
Why Prediction Ensembling Is Often Better
Prediction-level ensembling is easier to reason about when:
- Training sets differ in size or distribution.
- The forests come from different resampling schemes.
- You want to compare model diversity explicitly.
- The fitted objects are not structurally compatible for direct combination.
Instead of forcing the internals together, you let the models vote on the class or average their predicted probabilities.
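When the training sets differ in size, one option is to weight each model's probabilities rather than average them equally. The sketch below uses base R only. weighted_prob_average is a hypothetical helper, the probability matrices are toy values rather than real model output, and weighting by training-set size is an assumption made for illustration, not a recommendation from the text above.

```r
# Hypothetical helper: weighted average of class-probability matrices.
# prob_list: list of matrices (rows = observations, named columns = classes).
# weights:   one non-negative weight per matrix; normalised internally.
weighted_prob_average <- function(prob_list, weights) {
  stopifnot(length(prob_list) == length(weights),
            all(weights >= 0), sum(weights) > 0)
  classes <- colnames(prob_list[[1]])
  w   <- weights / sum(weights)
  out <- matrix(0, nrow = nrow(prob_list[[1]]), ncol = length(classes),
                dimnames = list(NULL, classes))
  for (i in seq_along(prob_list)) {
    # Select columns by name so a different class order cannot go unnoticed.
    out <- out + w[i] * prob_list[[i]][, classes, drop = FALSE]
  }
  out
}

# Toy probabilities for two observations from two models.
p_a <- matrix(c(0.9, 0.1,
                0.4, 0.6), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("no", "yes")))
p_b <- matrix(c(0.5, 0.5,
                0.2, 0.8), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("no", "yes")))

# Illustrative weights, e.g. model A trained on three times as many rows.
p_w <- weighted_prob_average(list(p_a, p_b), weights = c(3, 1))
p_w
```

Whether any weighting scheme actually helps should be checked on held-out data.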
Be Careful About Data Leakage and Comparability
Combining forests is not automatically better. If the training sets overlap with each other or with the evaluation data, if preprocessing differs between models, or if the label construction changed, the ensemble can become harder to interpret rather than stronger, and its measured performance can be overly optimistic.
A useful checklist is:
- Same target definition.
- Same predictor meaning.
- Same preprocessing.
- Same evaluation dataset.
If those are not aligned, it becomes hard to tell whether a change in performance comes from the ensemble itself or from the mismatch.
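A minimal sketch of the "same evaluation dataset" point, assuming the randomForest package is installed. Each individual forest and the averaged ensemble are scored on one held-out set that none of them saw during training. The data, split sizes, and the accuracy helper are illustrative.

```r
library(randomForest)

set.seed(123)

# One held-out set shared by every model; never used for training.
test_idx <- sample(nrow(iris), 30)
test     <- iris[test_idx, ]
train    <- iris[-test_idx, ]
half     <- sample(nrow(train), nrow(train) / 2)

rf_a <- randomForest(Species ~ ., data = train[half, ],  ntree = 100)
rf_b <- randomForest(Species ~ ., data = train[-half, ], ntree = 100)

# Illustrative metric: share of correctly predicted labels.
accuracy <- function(pred, truth) {
  mean(as.character(pred) == as.character(truth))
}

# Ensemble prediction from averaged class probabilities.
p_avg    <- (predict(rf_a, newdata = test, type = "prob") +
             predict(rf_b, newdata = test, type = "prob")) / 2
pred_avg <- colnames(p_avg)[max.col(p_avg, ties.method = "first")]

results <- c(
  model_a  = accuracy(predict(rf_a, newdata = test), test$Species),
  model_b  = accuracy(predict(rf_b, newdata = test), test$Species),
  ensemble = accuracy(pred_avg, test$Species)
)
print(results)
```

On a small, easy dataset such as iris the three numbers will often be close. The point is the procedure, not the specific values.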
Common Pitfalls
- Trying to directly combine forest objects that were trained with incompatible feature spaces or targets.
- Assuming more trees automatically means a better model without validating on held-out data.
- Mixing preprocessing pipelines and then averaging predictions as if they were directly comparable.
- Ignoring calibration differences when averaging class probabilities.
- Treating “trained on different data” as a benefit without checking whether the data distributions actually complement each other.
Summary
- Combining random forests can mean merging forest objects or ensembling predictions.
- Direct combination is viable only when the forests are truly compatible.
- Prediction-level ensembling is more flexible and often safer.
- Validate the combined result on held-out data instead of assuming it improved performance.
- Clear alignment of features, targets, and preprocessing matters more than the act of combining by itself.

