Classification Report with Nested Cross Validation in scikit-learn: Average vs. Individual Values
Introduction
With nested cross validation, you usually tune hyperparameters in an inner loop and estimate generalization in an outer loop. The tricky part is that classification_report produces per-class metrics for one set of predictions, so you need to decide whether you want one report per outer fold, an average across folds, or a single report built from all outer test predictions combined.
What to Average in Nested Cross Validation
There are two common outputs:
- Individual fold reports, which show how metrics vary from one outer split to another.
- An averaged summary, which combines the outer-fold results into one table.
Those are not identical. If you compute a report on each fold and then average the numbers, you get a mean of fold-level metrics. If you concatenate all outer-fold predictions and then call classification_report once, you get a global report over all held-out predictions.
Both can be valid. The fold-by-fold view is better for stability analysis. The global report is easier to read and usually closer to what people expect in a final summary.
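To make the difference concrete, here is a minimal sketch with two hypothetical outer folds. The labels and predictions are invented for illustration, and the class mix is deliberately more uneven than a stratified split would normally produce. It compares the mean of per-fold recall with the recall computed on the pooled predictions.

```python
from sklearn.metrics import recall_score

# Hypothetical held-out labels and predictions from two outer folds.
fold_a_true, fold_a_pred = [1, 1, 1, 0], [1, 1, 1, 0]
fold_b_true, fold_b_pred = [1, 0, 0, 0], [0, 0, 0, 0]

# Mean of fold-level metrics: every fold counts equally.
recall_a = recall_score(fold_a_true, fold_a_pred)
recall_b = recall_score(fold_b_true, fold_b_pred)
mean_of_folds = (recall_a + recall_b) / 2

# Pooled metric: every held-out sample counts equally.
pooled = recall_score(fold_a_true + fold_b_true, fold_a_pred + fold_b_pred)

print(f"mean of fold recalls: {mean_of_folds:.2f}")
print(f"pooled recall:        {pooled:.2f}")
```

The fold with a single positive sample pulls the fold-level mean down much more than it pulls the pooled value, which is why the two summaries should be labeled clearly when reported.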
Collect Predictions from the Outer Folds
The clean pattern is:
- Split data with an outer StratifiedKFold.
- Run GridSearchCV or another tuner on the training portion only.
- Predict on the outer test fold.
- Store y_true, y_pred, and, if you want per-fold metrics, the report dictionary.
Example:
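The original listing is not reproduced here, so the following is a minimal sketch of the steps above rather than a definitive implementation. The dataset (load_breast_cancer), the LogisticRegression model, the parameter grid, the fold counts, and the f1_macro scoring are all assumptions chosen for illustration. The names all_true, all_pred, and fold_reports match the ones used later in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data; substitute your own X and y.
X, y = load_breast_cancer(return_X_y=True)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# The scaler lives inside the pipeline, so it is refit on each training split.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}  # illustrative grid

all_true, all_pred, fold_reports = [], [], []

for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune hyperparameters on the outer training portion only.
    search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="f1_macro")
    search.fit(X[train_idx], y[train_idx])

    # Outer loop: predict on data the tuned model has never seen.
    y_pred = search.predict(X[test_idx])

    all_true.extend(y[test_idx])
    all_pred.extend(y_pred)
    fold_reports.append(
        classification_report(
            y[test_idx], y_pred, output_dict=True, zero_division=0
        )
    )

print(f"collected {len(fold_reports)} fold reports, {len(all_pred)} predictions")
```
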
Fitting the scaler and tuning the model only on each outer training portion, for example by wrapping both in a pipeline that is passed to the tuner, avoids leakage into the outer test fold.
Build Individual and Average Reports
Once you have fold_reports, you can inspect each fold directly or average selected metrics across folds:
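A minimal sketch of that averaging step follows. To keep it self-contained, fold_reports is rebuilt here from three small hypothetical folds rather than from a real nested run, and summarize is a helper name introduced only for this example.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical (y_true, y_pred) pairs standing in for three outer folds.
folds = [
    ([0, 0, 1, 1], [0, 0, 1, 0]),
    ([0, 1, 1, 1], [0, 1, 1, 1]),
    ([0, 0, 0, 1], [0, 1, 0, 1]),
]
fold_reports = [
    classification_report(t, p, output_dict=True, zero_division=0)
    for t, p in folds
]

def summarize(reports, label, metric):
    """Return mean and sample standard deviation of one metric across folds."""
    values = np.array([r[label][metric] for r in reports])
    return values.mean(), values.std(ddof=1)

# Per-class rows and the averaged rows are nested dictionaries.
for label in ("0", "1", "macro avg", "weighted avg"):
    mean, std = summarize(fold_reports, label, "f1-score")
    print(f"{label:>12} f1-score: {mean:.3f} +/- {std:.3f}")

# "accuracy" is stored as a plain float, not a nested dictionary.
accuracies = np.array([r["accuracy"] for r in fold_reports])
print(f"    accuracy: {accuracies.mean():.3f} +/- {accuracies.std(ddof=1):.3f}")
```

Class labels appear as string keys in the dictionary output, so integer classes 0 and 1 are looked up as "0" and "1".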
If you want one overall report from all held-out predictions, call classification_report once at the end:
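As a rough sketch, the pooled report can be built like this. The per-fold labels and predictions are hypothetical stand-ins for what the outer loop would collect.

```python
from sklearn.metrics import classification_report

# Hypothetical (y_true, y_pred) pairs standing in for three outer folds.
folds = [
    ([0, 0, 1, 1], [0, 0, 1, 0]),
    ([0, 1, 1, 1], [0, 1, 1, 1]),
    ([0, 0, 0, 1], [0, 1, 0, 1]),
]

# Concatenate the held-out labels and predictions from every outer fold.
all_true, all_pred = [], []
for y_true, y_pred in folds:
    all_true.extend(y_true)
    all_pred.extend(y_pred)

# One text report for reading, one dictionary for programmatic use.
print(classification_report(all_true, all_pred, digits=3, zero_division=0))
global_report = classification_report(
    all_true, all_pred, output_dict=True, zero_division=0
)
```
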
This second report is often the most useful final artifact because every prediction came from a model that never saw the corresponding outer test sample during tuning or training.
Choose the Right Summary for Your Goal
Use the global report when the goal is a compact final evaluation. Use fold-level averages when the goal is to understand variance across splits. In many projects, it is worth keeping both:
- global report for the headline result
- fold averages and standard deviations for stability
That combination gives you a more honest story than a single optimistic score.
If class balance changes noticeably between folds, weighted and macro averages can also tell different stories. Read those fields carefully instead of treating them as interchangeable.
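A small sketch with invented, imbalanced labels shows how far the two averages can drift apart. The 8-to-2 class ratio and the single missed minority sample are assumptions made up for this illustration.

```python
from sklearn.metrics import classification_report

# Hypothetical imbalanced labels: 8 samples of class 0, 2 of class 1.
y_true = [0] * 8 + [1] * 2
# The classifier gets every majority sample right and misses one minority sample.
y_pred = [0] * 8 + [0, 1]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Macro averaging weights both classes equally;
# weighted averaging weights each class by its support.
macro_recall = report["macro avg"]["recall"]
weighted_recall = report["weighted avg"]["recall"]

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The weighted figure looks better only because the majority class dominates it. The macro figure exposes the weaker minority-class recall.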
Common Pitfalls
- Computing classification_report on the inner cross-validation results and presenting it as the final estimate. The outer loop is the evaluation loop that matters.
- Scaling or preprocessing before the split instead of inside a pipeline. That leaks information from test folds into training.
- Averaging per-fold metrics and assuming they are identical to a report built from all outer predictions. They answer related but different questions.
- Ignoring class imbalance. Macro, weighted, and per-class metrics can diverge significantly when one class is rare.
- Reporting only a single mean score without any fold-level variation. Nested cross validation is valuable partly because it exposes instability.
Summary
- In nested cross validation, tune in the inner loop and evaluate in the outer loop.
- Store predictions from each outer test fold if you want a trustworthy overall report.
- Use per-fold reports to inspect variability and average metrics across outer splits.
- Use one final classification_report(all_true, all_pred) for a compact held-out summary.
- Keep preprocessing inside a pipeline to avoid leakage.

