ROC Curve with scikit-learn in Python

Introduction

A ROC curve shows how a binary classifier trades off true positive rate against false positive rate as the decision threshold changes. In scikit-learn, the important implementation detail is that you should pass probability scores or decision scores to roc_curve, not hard class predictions.

What a ROC Curve Measures

For each threshold, the model produces:

  • true positive rate, also called recall or sensitivity
  • false positive rate, which measures how often negative examples are incorrectly flagged as positive

A classifier with strong separation pushes the curve toward the top-left corner. A random model tends to follow the diagonal.

The area under that curve, usually called AUC, summarizes ranking quality in one number.
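To make the two rates concrete, here is a minimal sketch that computes them by hand for a single threshold. The helper name rates_at_threshold and the toy labels and scores are made up for illustration; it assumes label 1 is the positive class and that a score at or above the threshold counts as a positive prediction.

```python
import numpy as np


def rates_at_threshold(y_true, scores, threshold):
    """Return (tpr, fpr) when scores >= threshold are predicted positive."""
    y_true = np.asarray(y_true)
    predicted_positive = np.asarray(scores) >= threshold

    true_positives = np.sum(predicted_positive & (y_true == 1))
    false_positives = np.sum(predicted_positive & (y_true == 0))

    tpr = true_positives / np.sum(y_true == 1)   # share of positives caught
    fpr = false_positives / np.sum(y_true == 0)  # share of negatives flagged
    return float(tpr), float(fpr)


# Toy data, made up for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the operating point up and to the right.
for threshold in (0.8, 0.4, 0.35, 0.1):
    print(threshold, rates_at_threshold(y_true, scores, threshold))
```

Each printed pair is one point on the ROC curve. Sweeping the threshold over every distinct score traces the whole curve.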

Train a Simple Binary Classifier

Here is a runnable example using logistic regression.

python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    random_state=42,
)

# Hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

The fitted model can now produce scores for the positive class.
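Which column of predict_proba holds the positive class depends on the order of model.classes_. The sketch below rebuilds the same synthetic setup so it runs on its own, then looks up the column by label instead of assuming it. Treating label 1 as the positive class is an assumption of this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic setup as the example above.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Columns of predict_proba follow the order of model.classes_.
print("Classes:", model.classes_)
proba = model.predict_proba(X_test[:3])
print("Probabilities for the first three test rows:")
print(proba)

# Look up the positive-class column by label rather than hard-coding it.
# Label 1 as the positive class is an assumption of this sketch.
positive_index = list(model.classes_).index(1)
positive_scores = proba[:, positive_index]
print("Positive-class scores:", positive_scores)
```

With 0/1 labels the positive class ends up in column 1, which is why the common idiom is predict_proba(X)[:, 1].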

Use Scores, Not Predicted Labels

This is the most important part. roc_curve expects continuous scores.

python
from sklearn.metrics import roc_curve, roc_auc_score

# Column 1 holds the predicted probability of the positive class (label 1)
scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print("AUC:", auc)
print("First few thresholds:", thresholds[:5])

Why not use model.predict(X_test)? Because hard labels collapse the model output to one threshold only. A ROC curve needs the full score ranking so it can evaluate many thresholds.
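The difference is easy to see by counting the points roc_curve returns for each kind of input. This is a small self-contained sketch on the same synthetic data; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard labels contain only the values 0 and 1, so the "curve" has at most
# three points: the two corners plus one operating point.
fpr_labels, tpr_labels, thr_labels = roc_curve(y_test, model.predict(X_test))

# Probabilities give roc_curve many distinct thresholds to sweep.
fpr_scores, tpr_scores, thr_scores = roc_curve(
    y_test, model.predict_proba(X_test)[:, 1]
)

print("Points from hard labels:", len(thr_labels))
print("Points from probabilities:", len(thr_scores))
```

The label-based result is still a valid ROC point, but it describes a single threshold rather than the model's ranking.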

For models without predict_proba, use decision_function if available.
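As one example, LinearSVC exposes decision_function but not predict_proba. The sketch below is an illustration under that setup, with max_iter raised as a precaution against convergence warnings. The decision scores are signed distances from the separating hyperplane rather than probabilities, which is fine because ROC only uses their ordering.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

svm = LinearSVC(random_state=42, max_iter=10000)
svm.fit(X_train, y_train)

# Larger values mean "more likely positive"; they are not bounded to [0, 1].
svm_scores = svm.decision_function(X_test)

svm_fpr, svm_tpr, svm_thresholds = roc_curve(y_test, svm_scores)
svm_auc = roc_auc_score(y_test, svm_scores)
print("LinearSVC AUC:", svm_auc)
```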

Plot the ROC Curve

Use Matplotlib for a quick visualization.

python
import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label=f"Logistic regression AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random baseline")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()

The dashed diagonal is the random baseline. A better model stays above that line.

Interpret AUC Carefully

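AUC answers a ranking question: how well does the model place positive examples above negative ones? Equivalently, it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. It does not tell you whether the chosen threshold is good for your business objective.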

That means you can have:

  • a decent AUC but an unsuitable production threshold
  • an excellent AUC on a balanced test set that says little about precision once the positive class is rare in production

ROC is a useful diagnostic, not a complete evaluation strategy.
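The ranking view of AUC can be checked directly: count the share of positive-negative pairs in which the positive example gets the higher score, with ties counted as half. The helper pairwise_auc below is a hypothetical name written for this sketch, and the toy data is made up. It compares every pair, so it is only practical for small inputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]

    # diff[i, j] > 0 means positive i outranks negative j.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))


# Toy data, made up for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))   # 0.75: three of four pairs ranked correctly
print(roc_auc_score(y_true, scores))  # 0.75
```

Nothing in this calculation involves a decision threshold, which is why a good AUC cannot by itself validate the threshold used in production.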

Compare Multiple Models

A ROC plot is especially useful for comparing models on the same test set.

python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_scores = rf.predict_proba(X_test)[:, 1]
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_scores)
rf_auc = roc_auc_score(y_test, rf_scores)

plt.plot(fpr, tpr, label=f"Logistic AUC = {auc:.3f}")
plt.plot(rf_fpr, rf_tpr, label=f"Random forest AUC = {rf_auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random baseline")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

This helps you compare ranking behavior rather than just a single threshold-dependent accuracy number.

Multiclass ROC Needs a Different Setup

ROC is naturally binary. For multiclass classification in scikit-learn, you usually evaluate one class versus the rest and compute ROC separately for each class.

That means if your problem has more than two classes, you need a one-vs-rest or similar strategy before applying the usual ROC workflow.
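A one-vs-rest evaluation might look like the following sketch. It uses the iris dataset purely as a convenient three-class example, which is an assumption of this illustration rather than part of the workflow above. Each class is binarized in turn, and roc_auc_score with multi_class="ovr" reports the macro average of the per-class values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# One 0/1 indicator column per class: "this class" versus "the rest".
y_test_bin = label_binarize(y_test, classes=clf.classes_)

curves = {}
per_class_auc = {}
for i, cls in enumerate(clf.classes_):
    fpr_i, tpr_i, _ = roc_curve(y_test_bin[:, i], proba[:, i])
    curves[int(cls)] = (fpr_i, tpr_i)
    per_class_auc[int(cls)] = roc_auc_score(y_test_bin[:, i], proba[:, i])

# Built-in shortcut: macro average of the one-vs-rest AUCs.
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")

print("Per-class AUC:", per_class_auc)
print("Macro one-vs-rest AUC:", macro_auc)
```

The entries in curves can be plotted one per class in the same way as the binary curve.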

When ROC Is Not the Best Metric

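If the positive class is rare and you care a lot about precision, a precision-recall curve may be more informative than ROC. Because the false positive rate is measured against the large negative class, a ROC curve can still look strong while the absolute number of false positives is high enough to be operationally expensive.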

So the right sequence is often:

  1. use ROC and AUC to understand score ranking
  2. use precision-recall and threshold metrics for operational tradeoffs
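The two steps above can be sketched on an imbalanced dataset as follows. The class weights (roughly 5% positives) and the operational rule (keep recall at or above 0.8, then maximize precision) are hypothetical choices made only for illustration; a real threshold rule should come from your own cost tradeoffs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (illustrative choice).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Step 1: ranking quality.
roc_auc = roc_auc_score(y_test, scores)

# Step 2: precision-recall view and a threshold chosen from it.
ap = average_precision_score(y_test, scores)
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

# precision[i] and recall[i] belong to pr_thresholds[i]; the final point
# has no threshold, so it is dropped before indexing.
meets_recall = recall[:-1] >= 0.8
best = int(np.argmax(np.where(meets_recall, precision[:-1], -1.0)))
chosen_threshold = pr_thresholds[best]

print("ROC AUC:", roc_auc)
print("Average precision:", ap)
print("Threshold:", chosen_threshold)
print("Precision and recall there:", precision[best], recall[best])
```

On imbalanced data the average precision is often noticeably lower than the ROC AUC, which is a reminder that the two numbers answer different questions.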

Common Pitfalls

  • Passing model.predict(...) into roc_curve instead of probability or decision scores.
  • Computing ROC on training data and calling the result performance. Use a held-out set.
  • Treating a good AUC as proof that the production threshold is correct.
  • Applying the binary recipe directly to a multiclass problem without a one-vs-rest setup.
  • Ignoring class imbalance and threshold costs when choosing the final model.
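The training-data pitfall is easy to reproduce. In this sketch a random forest with default settings scores its own training rows almost perfectly, while the held-out AUC is lower. The exact numbers depend on the data and the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# AUC on rows the model has already seen is optimistic.
train_auc = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])

# AUC on held-out rows is the number to report.
test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

print("Train AUC:", train_auc)
print("Test AUC:", test_auc)
```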

Summary

  • In scikit-learn, build ROC curves from continuous model scores, not hard labels.
  • Use predict_proba(...)[:, 1] or decision_function for the positive-class score.
  • roc_curve gives you the threshold sweep, and roc_auc_score summarizes ranking quality.
  • Plot ROC to compare models, but choose thresholds with additional metrics.
  • For multiclass problems, use a one-vs-rest style evaluation rather than the plain binary recipe.
