How to Plot a PR Curve Over 10 Folds of Cross-Validation in Scikit-Learn



Introduction

A precision-recall (PR) curve is often more informative than an ROC curve when the positive class is rare. The tricky part with cross-validation is deciding what exactly to plot across folds without producing a misleading average.

A Better Cross-Validation Pattern for PR Curves

For ROC curves, people often interpolate fold curves onto a common grid and then average them. Precision-recall curves do not behave as cleanly under that kind of averaging because precision can jump sharply as thresholds move.
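
To see why pointwise averaging is awkward, it helps to look at how precision moves along a single curve. The sketch below uses made-up labels and scores (they are not from the article's dataset) and only relies on precision_recall_curve. Recall never increases as the threshold rises, while precision moves in both directions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical toy data: five samples, listed from highest to lowest score.
y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Check the direction of each step along the curve.
steps = np.diff(precision)
precision_goes_up = bool(np.any(steps > 0))
precision_goes_down = bool(np.any(steps < 0))

for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
print("precision rises somewhere:", precision_goes_up)
print("precision falls somewhere:", precision_goes_down)
```

Because precision is not monotonic in recall, interpolating fold curves onto a shared recall grid can smooth away exactly the jumps that matter.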

A practical approach in scikit-learn is:

train on each fold
collect out-of-fold prediction scores for the held-out data
plot each fold lightly if you want variability
build one pooled precision-recall curve from all held-out predictions combined

That pooled out-of-fold curve answers the most useful question: how does the model behave on unseen examples across the full dataset?
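
If the per-fold curves are not needed, the pooled part of this recipe can be sketched more compactly with cross_val_predict, which returns one held-out prediction per sample. The dataset and estimator here are placeholders chosen only to make the sketch runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Hypothetical imbalanced data, used only for illustration.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each row of oof_proba comes from the fold model that did not train on that row.
oof_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
oof_scores = oof_proba[:, 1]

# One pooled curve and one pooled average precision over all held-out rows.
precision, recall, _ = precision_recall_curve(y, oof_scores)
pooled_ap = average_precision_score(y, oof_scores)
print(f"Pooled out-of-fold AP: {pooled_ap:.3f}")
```

This shortcut does not tell you which fold each prediction came from, so an explicit loop over the splits is still the way to go when you want to draw the individual fold curves.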

Example with 10-Fold Stratified Cross-Validation

The example below uses an imbalanced synthetic dataset, a scaling-plus-logistic-regression pipeline, and StratifiedKFold so each fold keeps a similar class ratio.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay, average_precision_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 15% positives.
X, y = make_classification(
    n_samples=1200,
    n_features=20,
    n_informative=6,
    n_redundant=2,
    weights=[0.85, 0.15],
    random_state=42,
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

all_true = []
all_scores = []

# One shared Axes, so every fold curve and the pooled curve land on the same plot.
fig, ax = plt.subplots(figsize=(8, 6))

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Refit on this fold's training split, then score only the held-out split.
    model.fit(X_train, y_train)
    y_score = model.predict_proba(X_test)[:, 1]

    all_true.append(y_test)
    all_scores.append(y_score)

    # Pass ax=ax explicitly; without it, each call opens a new figure.
    PrecisionRecallDisplay.from_predictions(
        y_test,
        y_score,
        name=f"Fold {fold}",
        ax=ax,
        alpha=0.2,
        lw=1,
    )

# Pool the held-out labels and scores from all folds into one curve.
all_true = np.concatenate(all_true)
all_scores = np.concatenate(all_scores)
precision, recall, _ = precision_recall_curve(all_true, all_scores)
ap = average_precision_score(all_true, all_scores)

ax.plot(recall, precision, color="black", lw=2.5, label=f"Out-of-fold AP = {ap:.3f}")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title("10-fold cross-validated precision-recall curve")
ax.legend(loc="lower left")
ax.grid(True)
fig.tight_layout()
plt.show()
```

This produces thin fold-level curves plus one bold pooled curve. The bold curve is usually the one you want to discuss in a report because it is based entirely on held-out predictions.

Why Out-of-Fold Predictions Are Useful

Each prediction in the pooled curve comes from a model that did not train on that sample. That makes the final PR curve a realistic summary of cross-validated generalization.

It also avoids a common mistake: plotting a PR curve from predictions made on the full training set after one final fit. That curve is optimistic because the model has already seen those examples.
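
The size of that optimism is easy to check. The sketch below swaps in a random forest, which is not the article's model and is used only because flexible models make the gap obvious, and compares average precision on in-sample predictions with average precision on out-of-fold predictions. The data is again a hypothetical synthetic set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Hypothetical imbalanced data, used only for illustration.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Optimistic: fit once on everything, then score the same rows.
in_sample_scores = model.fit(X, y).predict_proba(X)[:, 1]
in_sample_ap = average_precision_score(y, in_sample_scores)

# Realistic: every score comes from a model that never saw that row.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
oof_scores = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
oof_ap = average_precision_score(y, oof_scores)

print(f"In-sample AP:   {in_sample_ap:.3f}")
print(f"Out-of-fold AP: {oof_ap:.3f}")
```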

If you also want a scalar summary, report the cross-validated average precision alongside the curve. You can compute it once from the pooled out-of-fold scores, or compute one value per fold and report the mean and standard deviation. The two numbers generally differ, so state which one you are reporting.
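
Both summaries can be sketched in a few lines. This assumes the built-in "average_precision" scorer name for cross_val_score and uses placeholder data and a placeholder estimator.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

# Hypothetical imbalanced data, used only for illustration.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Option 1: one AP per fold, summarized by mean and standard deviation.
fold_ap = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"Per-fold AP: {fold_ap.mean():.3f} +/- {fold_ap.std():.3f}")

# Option 2: a single AP from the pooled out-of-fold scores.
oof_scores = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
pooled_ap = average_precision_score(y, oof_scores)
print(f"Pooled AP:   {pooled_ap:.3f}")
```

The per-fold spread gives a sense of variability, while the pooled value matches the bold curve in the plot.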

Common Pitfalls

Averaging precision-recall points across folds as if PR curves behave like ROC curves. The result can be hard to interpret.
Using plain KFold on an imbalanced dataset instead of StratifiedKFold.
Plotting curves from in-sample predictions rather than held-out fold predictions.
Forgetting that some estimators expose decision_function instead of predict_proba.
Comparing models only by one PR curve without also checking class imbalance, threshold behavior, and average precision.
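
For the decision_function pitfall, one possible workaround is a small helper that picks whichever scoring method the estimator offers. The name positive_class_scores is hypothetical and not part of scikit-learn; the sketch assumes a fitted binary classifier where higher scores mean "more likely positive".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def positive_class_scores(estimator, X):
    """Return one ranking score per row for the positive class.

    Hypothetical helper, not a scikit-learn function.
    """
    if hasattr(estimator, "predict_proba"):
        return estimator.predict_proba(X)[:, 1]
    if hasattr(estimator, "decision_function"):
        return estimator.decision_function(X)
    raise TypeError("Estimator exposes neither predict_proba nor decision_function")


# Hypothetical imbalanced data, used only for illustration.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# LogisticRegression has predict_proba; LinearSVC only has decision_function.
for estimator in (LogisticRegression(max_iter=1000), LinearSVC()):
    estimator.fit(X_train, y_train)
    scores = positive_class_scores(estimator, X_test)
    ap = average_precision_score(y_test, scores)
    print(f"{type(estimator).__name__}: AP = {ap:.3f}")
```

PR curves and average precision depend only on how the scores rank the samples, so unnormalized decision values work as well as probabilities here.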

Summary

For PR curves under cross-validation, pooled out-of-fold predictions are usually more informative than a naive pointwise fold average.
StratifiedKFold keeps the class ratio roughly the same in each of the 10 folds.
Plot fold curves lightly if you want variability, and highlight one pooled held-out curve for the main result.
Use average_precision_score as a compact summary alongside the curve.
Always compute PR metrics on held-out predictions, not on data used for fitting.