caret combine the stratified createMultiFolds repeatedCV and groupKFold

Introduction

In caret, createMultiFolds() and groupKFold() solve different validation problems. The first gives repeated, roughly stratified folds based on the outcome, while the second prevents leakage by keeping all rows from the same group together.

Why these helpers do not combine automatically

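createMultiFolds(y, k, times) works at the row level. It samples rows within outcome classes so that each training split roughly preserves the outcome distribution, and it repeats the k-fold partition the requested number of times. groupKFold(group, k) also returns row indices, but it assigns whole groups to folds, which enforces a stronger rule: all records that share a group must travel together.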

When observations are clustered by patient, account, device, or site, group integrity usually matters more than perfect row-level stratification. If rows from the same group land in both training and validation sets, you get leakage and an overly optimistic score.
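To make the leakage concrete, here is a minimal base-R sketch. The patient ids and the cyclic fold assignment are made up for illustration; the assignment stands in for any row-level split that ignores groups, and it is not what caret produces.

```r
# Toy data: 4 hypothetical patients with 3 rows each.
patient_id <- rep(c("P1", "P2", "P3", "P4"), each = 3)

# A row-level fold assignment that ignores patients: rows cycle through 4 folds.
row_fold <- rep_len(1:4, length(patient_id))

# Hold out fold 1 and train on everything else.
holdout_rows <- which(row_fold == 1)
train_rows   <- which(row_fold != 1)

# Patients that appear on both sides of the split are leaked.
leaked <- intersect(patient_id[train_rows], patient_id[holdout_rows])
leaked
```

Any patient listed in leaked contributes rows to both the training and the holdout set, which is exactly the situation grouped folds are meant to rule out.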

The practical consequence is that there is no built-in one-liner in caret that gives you repeated, grouped, and stratified folds at the same time. The usual solution is to build custom resampling indices and pass them into trainControl(index = ..., indexOut = ...).
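The shape of those two arguments can be sketched without caret. The index list below is hand-written for a hypothetical six-row data set; each element holds training row numbers, and the holdout rows are derived as the complement. To my understanding, caret falls back to this complement when indexOut is not supplied, so an explicit indexOut is mainly needed when the holdout set should be something else.

```r
# Hypothetical data set with 6 rows and three hand-written training sets.
n_rows <- 6
index <- list(
  Fold1.Rep1 = c(3L, 4L, 5L, 6L),
  Fold2.Rep1 = c(1L, 2L, 5L, 6L),
  Fold3.Rep1 = c(1L, 2L, 3L, 4L)
)

# Holdout rows for each resample: every row that is not in the training set.
index_out <- lapply(index, function(train_rows) {
  setdiff(seq_len(n_rows), train_rows)
})

str(index_out)
```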

Build grouped folds first

Start by deciding what each group represents. Then create folds on the group ids, not on the raw rows.

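library(caret)

set.seed(42)

# Toy data: 12 patients with 3 rows each; every patient carries a single label.
dat <- data.frame(
  patient_id = rep(paste0("P", 1:12), each = 3),
  x1 = rnorm(36),
  x2 = runif(36),
  # Stored as a factor so a classification model receives a factor outcome.
  y = factor(rep(c("yes", "no", "yes", "no"), each = 9))
)

# Split on patient ids so a patient never straddles training and holdout rows.
group_folds <- groupKFold(dat$patient_id, k = 4)
str(group_folds, max.level = 1)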

Each element in group_folds is a vector of training row indices. All rows from a patient stay together.

If grouping is your only requirement, you can already use those indices in trainControl.

Add repetition with custom indices

To imitate repeated cross-validation, generate grouped folds multiple times and store every training split in one list.

make_grouped_repeats <- function(groups, k, repeats) {
  out <- list()

  for (r in seq_len(repeats)) {
    # Reshuffle the group order so each repeat starts from a different arrangement.
    unique_groups <- sample(unique(groups))
    shuffled <- factor(groups, levels = unique_groups)
    folds <- groupKFold(shuffled, k = k)

    # Label every training split with its fold and repeat number.
    names(folds) <- paste0("Fold", seq_along(folds), ".Rep", r)
    out <- c(out, folds)
  }

  out
}

index <- make_grouped_repeats(dat$patient_id, k = 4, repeats = 3)

This creates repeated grouped splits, but it is not yet stratified. For many grouped problems, that is acceptable because leakage prevention is the priority.
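For readers who want to see the same structure without calling groupKFold(), the following is a base-R sketch of repeated grouped folds. The function name make_grouped_repeats_base and the deal-into-folds scheme are assumptions made for illustration, not caret's implementation; the scheme simply shuffles the groups and deals them into k folds of near-equal size.

```r
# Hypothetical helper: repeated grouped folds in base R.
make_grouped_repeats_base <- function(groups, k, repeats) {
  group_ids <- unique(groups)
  stopifnot(k <= length(group_ids))
  out <- list()

  for (r in seq_len(repeats)) {
    # Shuffle fold labels so each group lands in one of k near-equal folds.
    fold_of_group <- setNames(
      sample(rep_len(seq_len(k), length(group_ids))),
      group_ids
    )

    for (f in seq_len(k)) {
      holdout_groups <- names(fold_of_group)[fold_of_group == f]
      # Training rows are all rows whose group is not held out in this fold.
      out[[paste0("Fold", f, ".Rep", r)]] <- which(!groups %in% holdout_groups)
    }
  }

  out
}

set.seed(1)
groups <- rep(paste0("P", 1:12), each = 3)
index_base <- make_grouped_repeats_base(groups, k = 4, repeats = 3)
length(index_base)
```

The result has the same shape as the index list built above: one named vector of training rows per fold and repeat.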

Approximate stratification at the group level

If each group can be assigned a single class label, you can stratify groups rather than rows. That works well when each patient or account belongs to one target class.

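library(dplyr)

# One row per patient with that patient's label.
group_labels <- dat %>%
  distinct(patient_id, y)

# Guard: this approach assumes every patient has exactly one label.
stopifnot(!anyDuplicated(group_labels$patient_id))

# Stratify and repeat over patients, not rows.
group_fold_ids <- createMultiFolds(group_labels$y, k = 4, times = 3)

# Expand each set of training patients back to row indices in dat.
index <- lapply(group_fold_ids, function(train_group_rows) {
  train_groups <- group_labels$patient_id[train_group_rows]
  which(dat$patient_id %in% train_groups)
})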

Now the repetition and approximate stratification happen at the group table, and the row indices are expanded afterward. This is the safest way to combine the ideas when each group has one stable label.

If a single group contains mixed outcome classes, perfect stratification and strict grouping can conflict. In that case, keep grouping intact and accept only approximate balance, because leakage is the larger modeling error.
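One way to get that approximate balance is to give each mixed group a single stand-in label and stratify on it. The sketch below uses the majority class per group, which is an assumption chosen for illustration (a "has any positive row" rule would be another option); dat_mixed and majority_label are hypothetical names.

```r
# Hypothetical data in which patients carry mixed outcome labels.
dat_mixed <- data.frame(
  patient_id = rep(c("P1", "P2", "P3"), each = 3),
  y = c("yes", "yes", "no",
        "no",  "no",  "no",
        "yes", "no",  "yes"),
  stringsAsFactors = FALSE
)

# Majority class within one group; ties go to the alphabetically first label.
majority_label <- function(labels) names(which.max(table(labels)))

# One stand-in label per patient, ready to be stratified at the group level.
lab <- tapply(dat_mixed$y, dat_mixed$patient_id, majority_label)
group_labels_mixed <- data.frame(
  patient_id = names(lab),
  y = as.vector(lab),
  stringsAsFactors = FALSE
)

group_labels_mixed
```

The resulting table can stand in for a one-label-per-group table when stratifying groups, with the understanding that row-level class balance will only be approximate.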

Use the indices in train()

Once you have the training indices, pass them directly to trainControl.

# Folds come from index; caret does not draw its own.
ctrl <- trainControl(
  method = "cv",
  index = index,
  savePredictions = "final",
  # classProbs = TRUE expects a factor outcome whose levels are valid R names.
  classProbs = TRUE
)

fit <- train(
  y ~ x1 + x2,
  data = dat,
  method = "glm",
  family = binomial(),
  trControl = ctrl
)

When index is supplied, caret uses your custom resampling plan instead of generating its own folds. That is why method = "cv" is enough here: the repeats come from the index list, not from method = "repeatedcv".

Common Pitfalls

The biggest mistake is trying to use createMultiFolds() on row-level outcomes and groupKFold() on row-level groups independently, then assuming the results can be merged afterward. They describe different split geometries, so naïve merging usually breaks group integrity.

Another issue is demanding perfect stratification when groups have mixed labels internally. In that situation, there may be no mathematically clean split that satisfies both rules. Document the tradeoff and favor no leakage over ideal class balance.

It is also easy to forget that custom resampling indices replace the built-in repeated CV logic. If you pass index, you are responsible for creating the repeated structure yourself and naming or tracking the resamples clearly.
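A small bookkeeping sketch can help with that tracking. It assumes the resamples follow a Fold<k>.Rep<r> naming pattern like the one used for the indices in this article; the names below are written by hand for illustration.

```r
# Hypothetical resample names following the Fold<k>.Rep<r> pattern.
resample_names <- c("Fold1.Rep1", "Fold2.Rep1", "Fold1.Rep2", "Fold2.Rep2")

# Split each name into its fold part and its repeat part.
parts <- do.call(rbind, strsplit(resample_names, ".", fixed = TRUE))
resample_key <- data.frame(
  resample = resample_names,
  fold = parts[, 1],
  rep = parts[, 2],
  stringsAsFactors = FALSE
)

# Every repeat should contribute the same number of folds.
table(resample_key$rep)
```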

Finally, verify the split quality before training. Check that no group appears in both train and holdout sets for the same fold, and inspect class counts across folds so you understand how much imbalance remains.
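Both checks can be sketched in a few lines of base R. The groups, labels, and two-fold index below are toy values made up for the example; with real data you would pass in your own index list, group vector, and outcome.

```r
# Toy inputs: 4 hypothetical patients with 2 rows each and one label per patient.
groups <- rep(c("P1", "P2", "P3", "P4"), each = 2)
y <- factor(rep(c("yes", "no", "yes", "no"), each = 2), levels = c("no", "yes"))

# Training rows per resample; each fold holds out two whole patients.
index <- list(
  Fold1 = which(!groups %in% c("P1", "P2")),
  Fold2 = which(!groups %in% c("P3", "P4"))
)

# Check 1: number of groups that appear in both training and holdout rows.
leaked_groups <- sapply(index, function(train_rows) {
  holdout_rows <- setdiff(seq_along(groups), train_rows)
  length(intersect(groups[train_rows], groups[holdout_rows]))
})

# Check 2: class counts in each holdout set (rows are resamples).
holdout_counts <- t(sapply(index, function(train_rows) table(y[-train_rows])))

leaked_groups
holdout_counts
```

A nonzero entry in leaked_groups means the split is not safe to use, while holdout_counts shows how far each holdout set drifts from the overall class balance.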

Summary

createMultiFolds() gives repeated stratified splits at the row level.
groupKFold() preserves group boundaries to prevent leakage.
caret does not provide a single helper that combines both automatically.
Use custom index values in trainControl() to implement repeated grouped resampling.
If stratification matters, do it at the group level when each group has one stable label.
