Classification Report with Nested Cross Validation in scikit-learn: Average and Individual Values

Introduction

With nested cross validation, you usually tune hyperparameters in an inner loop and estimate generalization in an outer loop. The tricky part is that classification_report produces per-class metrics for one set of predictions, so you need to decide whether you want one report per outer fold, an average across folds, or a single report built from all outer test predictions combined.

What to Average in Nested Cross Validation

There are two common outputs:

  • Individual fold reports, which show how metrics vary from one outer split to another.
  • An averaged summary, which combines the outer-fold results into one table.

Those are not identical. If you compute a report on each fold and then average the numbers, you get a mean of fold-level metrics. If you concatenate all outer-fold predictions and then call classification_report once, you get a global report over all held-out predictions.

Both can be valid. The fold-by-fold view is better for stability analysis. The global report is easier to read and usually closer to what people expect in a final summary.
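To see why the two summaries differ, here is a minimal sketch using invented counts for a single class across two outer folds. The numbers and the helper name `precision` are assumptions made for illustration only; they do not come from a real model.

```python
# Toy outer-fold results for one class, as (true_positives, false_positives).
# These counts are invented purely for illustration.
folds = [(9, 1), (1, 1)]


def precision(tp, fp):
    """Precision from raw counts."""
    return tp / (tp + fp)


# Option 1: compute precision per fold, then average the fold values.
fold_precisions = [precision(tp, fp) for tp, fp in folds]
mean_of_folds = sum(fold_precisions) / len(fold_precisions)

# Option 2: pool the counts from all folds, then compute precision once.
total_tp = sum(tp for tp, _ in folds)
total_fp = sum(fp for _, fp in folds)
pooled = precision(total_tp, total_fp)

print(f"mean of fold precisions: {mean_of_folds:.3f}")
print(f"pooled precision:        {pooled:.3f}")
```

The fold average treats every fold equally, while the pooled value lets folds with more predictions for the class carry more weight, which is why the two numbers diverge when folds differ in size or difficulty.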

Collect Predictions from the Outer Folds

The clean pattern is:

  1. Split data with an outer StratifiedKFold.
  2. Run GridSearchCV or another tuner on the training portion only.
  3. Predict on the outer test fold.
  4. Store y_true, y_pred, and, if you want per-fold metrics, the report dictionary.

Example:

python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Outer loop estimates generalization; inner loop tunes hyperparameters.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Keeping the scaler in the pipeline means it is refit on each training split.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVC())
])

param_grid = {
    "model__C": [0.1, 1.0, 10.0],
    "model__kernel": ["linear", "rbf"]
}

all_true = []      # true labels pooled over all outer test folds
all_pred = []      # matching held-out predictions
fold_reports = []  # one report dictionary per outer fold

for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Tune on the outer training portion only.
    search = GridSearchCV(
        pipeline,
        param_grid=param_grid,
        cv=inner_cv,
        n_jobs=-1
    )
    search.fit(X_train, y_train)

    # Predict with the refit best estimator on the untouched outer test fold.
    y_pred = search.predict(X_test)
    all_true.extend(y_test)
    all_pred.extend(y_pred)
    fold_reports.append(classification_report(y_test, y_pred, output_dict=True))

This avoids leakage because the scaler is refit on the training portion of every split, and hyperparameter tuning only ever sees the outer training data.

Build Individual and Average Reports

Once you have fold_reports, you can inspect each fold directly or average selected metrics across folds:

python
# With integer class labels, the report dictionary uses string keys.
labels = ["0", "1"]
metrics = ["precision", "recall", "f1-score"]

for label in labels:
    print(f"Class {label}")
    for metric in metrics:
        # One value per outer fold for this class and metric.
        values = [report[label][metric] for report in fold_reports]
        print(metric, np.mean(values))

If you want one overall report from all held-out predictions, call classification_report once at the end:

python
print(classification_report(all_true, all_pred))

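This second report is often the most useful final artifact because every prediction came from a model that never saw its corresponding outer test sample during tuning or fitting. Keep in mind that each outer fold may have selected different hyperparameters, so the report describes the whole tuning procedure rather than one fixed model.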
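If only the pooled report is needed, the manual loop can be shortened. The sketch below is one possible alternative, not the approach used above: it passes the whole GridSearchCV object to cross_val_predict, which refits the search on each outer training split and returns the held-out predictions in the original sample order. The variable name `pooled_pred` is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# The inner search is itself an estimator, so it can be cross-validated.
search = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("model", SVC())]),
    param_grid={"model__C": [0.1, 1.0, 10.0], "model__kernel": ["linear", "rbf"]},
    cv=inner_cv,
)

# cross_val_predict clones and refits the search on each outer training
# split, then predicts the matching outer test split.
pooled_pred = cross_val_predict(search, X, y, cv=outer_cv)

print(classification_report(y, pooled_pred))
```

The trade-off is that this shortcut does not expose per-fold reports or the hyperparameters chosen in each fold, so the explicit loop remains the better choice when fold-level detail matters.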

Choose the Right Summary for Your Goal

Use the global report when the goal is a compact final evaluation. Use fold-level averages when the goal is to understand variance across splits. In many projects, it is worth keeping both:

  • global report for the headline result
  • fold averages and standard deviations for stability

That combination gives you a more honest story than a single optimistic score.
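A mean and standard deviation per class can be computed from the stored report dictionaries. The following is a minimal sketch: the `fold_reports` values are made-up placeholders rather than real results, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, stdev

# Hypothetical per-fold report dictionaries, trimmed to the keys used here.
# The numbers are made up for illustration, not real results.
fold_reports = [
    {"0": {"f1-score": 0.90}, "1": {"f1-score": 0.96}},
    {"0": {"f1-score": 0.94}, "1": {"f1-score": 0.98}},
    {"0": {"f1-score": 0.92}, "1": {"f1-score": 0.97}},
]


def summarize(reports, label, metric):
    """Return (mean, sample standard deviation) of one metric across folds."""
    values = [report[label][metric] for report in reports]
    return mean(values), stdev(values)


for label in ("0", "1"):
    avg, spread = summarize(fold_reports, label, "f1-score")
    print(f"class {label}: f1 = {avg:.3f} +/- {spread:.3f}")
```

With only a handful of outer folds the standard deviation is itself a rough estimate, so it is best read as an indication of stability rather than a precise interval.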

If class balance changes noticeably between folds, weighted and macro averages can also tell different stories. Read those fields carefully instead of treating them as interchangeable.
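As a small illustration of how the two averages can diverge, the sketch below uses an invented, imbalanced set of ten labels and compares the macro and weighted F1 scores with scikit-learn's f1_score. The labels and predictions are assumptions chosen to make the gap visible.

```python
from sklearn.metrics import f1_score

# Invented labels: eight samples of class 0 and two of class 1.
y_true = [0] * 8 + [1] * 2
# The classifier gets every class-0 sample right but misses one class-1 sample.
y_pred = [0] * 8 + [1, 0]

# Macro: unweighted mean of per-class F1, so the rare class counts equally.
macro = f1_score(y_true, y_pred, average="macro")
# Weighted: per-class F1 weighted by support, so the common class dominates.
weighted = f1_score(y_true, y_pred, average="weighted")

print(f"macro F1:    {macro:.3f}")
print(f"weighted F1: {weighted:.3f}")
```

Here the weighted score is higher because the weaker result on the rare class is diluted by the well-predicted majority class.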

Common Pitfalls

  • Computing classification_report on the inner cross-validation results and presenting it as the final estimate. The outer loop is the evaluation loop that matters.
  • Scaling or preprocessing before the split instead of inside a pipeline. That leaks information from test folds into training.
  • Averaging per-fold metrics and assuming they are identical to a report built from all outer predictions. They answer related but different questions.
  • Ignoring class imbalance. Macro, weighted, and per-class metrics can diverge significantly when one class is rare.
  • Reporting only a single mean score without any fold-level variation. Nested cross validation is valuable partly because it exposes instability.

Summary

  • In nested cross validation, tune in the inner loop and evaluate in the outer loop.
  • Store predictions from each outer test fold if you want a trustworthy overall report.
  • Use per-fold reports to inspect variability and average metrics across outer splits.
  • Use one final classification_report(all_true, all_pred) for a compact held-out summary.
  • Keep preprocessing inside a pipeline to avoid leakage.
