Best way to combine probabilistic classifiers in scikit-learn

Introduction

When several classifiers expose predict_proba, the best way to combine them depends on whether you want a simple average or a learned combiner. In scikit-learn, the two main answers are soft voting and stacking. Soft voting is the right baseline because it is easy to implement and directly uses the predicted class probabilities.

Start with Soft Voting

VotingClassifier with voting='soft' averages the predicted probabilities from each base model and picks the class with the highest final probability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('nb', GaussianNB()),
    ],
    voting='soft',
)

ensemble.fit(X_train, y_train)
print(ensemble.predict_proba(X_test[:3]))
print(ensemble.score(X_test, y_test))
```

This is the simplest standard solution in scikit-learn and is usually the first thing to try.
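
To make the averaging concrete, the sketch below recomputes the soft-voting output by hand. It assumes no weights are set, so a plain mean applies, and it reads the fitted base models from the ensemble's `estimators_` attribute. The names `manual`, `built_in`, and `manual_labels` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
).fit(X, y)

# Average the per-model probabilities by hand (equal weights assumed).
manual = np.mean([est.predict_proba(X) for est in ensemble.estimators_], axis=0)
built_in = ensemble.predict_proba(X)

# The predicted class is the argmax of the averaged probabilities.
manual_labels = ensemble.classes_[np.argmax(manual, axis=1)]

print(np.allclose(manual, built_in))
print(np.array_equal(manual_labels, ensemble.predict(X)))
```

Both checks should agree with the built-in methods, which shows that soft voting is nothing more than an average followed by an argmax.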

Why Soft Voting Fits Probabilistic Models

Hard voting only looks at predicted labels. Soft voting uses the full confidence information.

Suppose one classifier predicts class A with probability 0.51 and another predicts class A with probability 0.99. Hard voting treats those as equal votes. Soft voting lets the stronger confidence influence the result.

That is why probabilistic classifiers should normally be combined with soft voting rather than majority vote.
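
A small numeric sketch can make the difference visible. The probabilities below are hypothetical and chosen only for illustration: two classifiers barely prefer class A, while a third strongly prefers class B.

```python
import numpy as np

classes = np.array(["A", "B"])

# Hypothetical predicted probabilities [P(A), P(B)] from three classifiers.
probas = np.array([
    [0.51, 0.49],  # barely prefers A
    [0.51, 0.49],  # barely prefers A
    [0.01, 0.99],  # strongly prefers B
])

# Hard voting: each model casts one vote for its most likely class.
votes = np.argmax(probas, axis=1)
hard_winner = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities, then pick the largest.
mean_proba = probas.mean(axis=0)
soft_winner = classes[mean_proba.argmax()]

print(hard_winner)  # A (two votes to one)
print(soft_winner)  # B (mean P(B) is about 0.66)
```

Hard voting discards how close the two weak votes were, so the single confident model is outvoted. Soft voting keeps that information.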

Weighted Soft Voting

If one model is consistently better than the others, give it more influence with weights.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('nb', GaussianNB()),
    ],
    voting='soft',
    weights=[2, 3, 1],
)

ensemble.fit(X, y)
print(ensemble.predict_proba(X[:2]))
```

The weights should come from cross-validation, not intuition. Guessed weights can over-trust a poorly calibrated model, and weights tuned on a single small validation split can overfit to that split.
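
One way to choose weights by cross-validation is to treat `weights` as a hyperparameter and search over a few candidates. The sketch below uses `GridSearchCV` with log loss as the score. The candidate weight vectors and the reduced `n_estimators=50` are arbitrary illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)

# Candidate weight vectors (illustrative choices only).
param_grid = {"weights": [[1, 1, 1], [2, 1, 1], [1, 2, 1], [1, 1, 2], [2, 3, 1]]}

# Score each candidate with 5-fold cross-validated log loss.
search = GridSearchCV(ensemble, param_grid, scoring="neg_log_loss", cv=5)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # mean log loss of the best candidate
```

If the best candidate is only marginally better than equal weights, the unweighted ensemble is usually the safer choice.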

Stacking Learns the Combination Rule

If you want the ensemble to learn how to combine the base predictions, use stacking.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('nb', GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',
)

stack.fit(X, y)
print(stack.predict(X[:5]))
```

Stacking is more flexible than soft voting because the final estimator can learn patterns such as "trust model A for one class and model B for another." The cost is more complexity and a higher risk of overfitting if the dataset is small.
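
To see what the final estimator actually learns from, it can help to inspect the meta-features. The sketch below assumes the same iris setup as above and sets `cv=5` explicitly. `StackingClassifier` trains the final estimator on cross-validated predictions of the base models, and `transform` exposes the base-model probabilities it passes along.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=5,
).fit(X, y)

# transform() returns the base-model probabilities fed to the final estimator:
# 2 models x 3 classes = 6 columns for this dataset.
meta_features = stack.transform(X)
print(meta_features.shape)

# One row of coefficients per class, one column per meta-feature.
print(stack.final_estimator_.coef_.shape)
```

Each coefficient row shows how much the combiner relies on each base model's probability for a given class, which is where per-class trust in a particular model would show up.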

Calibration Matters

Averaging probabilities only makes sense when those probabilities are reasonably calibrated. Some classifiers rank examples well but output confidence values that are too extreme or too conservative.

If probability quality matters, calibrate the base learners first.

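```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# LinearSVC has no predict_proba; the calibration wrapper adds one.
base = LinearSVC()
calibrated = CalibratedClassifierCV(base)
calibrated.fit(X, y)
print(calibrated.predict_proba(X[:2]))
```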

This step is especially important if you evaluate the ensemble with log loss or if downstream decisions depend on the probability values themselves rather than just the winning class.
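
As a rough sketch of how calibration and soft voting fit together, the example below wraps two base learners in `CalibratedClassifierCV` before averaging and then scores the ensemble with log loss. The choice of base models, the default sigmoid calibration, and `max_iter=10000` are assumptions made for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        # Wrapping gives LinearSVC a calibrated predict_proba.
        ("svc", CalibratedClassifierCV(LinearSVC(max_iter=10000), cv=5)),
        # Recalibrate Naive Bayes before its probabilities are averaged.
        ("nb", CalibratedClassifierCV(GaussianNB(), cv=5)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Log loss penalizes confident wrong probabilities, not just wrong labels.
proba = ensemble.predict_proba(X_test)
print(log_loss(y_test, proba))
```

Comparing this log loss against the same ensemble without the calibration wrappers is one way to check whether calibration helps on a given dataset.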

A Practical Decision Rule

A good working rule is:

- Use soft voting as the baseline.
- Add weights if validation clearly shows unequal model quality.
- Move to stacking when you need a learned combiner and have enough data to support it.

Do not skip straight to a complex ensemble. A single strong classifier often beats a poorly designed ensemble.
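
A simple way to check this is to score every base model and the ensemble under the same cross-validation scheme. The sketch below assumes the iris data and the three base models used earlier. The `candidates` and `results` dictionaries are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("nb", GaussianNB()),
]
candidates = dict(base_models)
candidates["soft_vote"] = VotingClassifier(estimators=base_models, voting="soft")

# Mean 5-fold accuracy for each base model and for the ensemble.
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

If the ensemble does not clearly beat the best single model, the extra complexity is hard to justify.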

Common Pitfalls

- Averaging probabilities from models whose outputs are badly calibrated.
- Combining several very similar models and expecting meaningful diversity gains.
- Choosing voting weights by intuition instead of cross-validation.
- Using stacking on a small dataset where the meta-model does not have enough signal.
- Assuming an ensemble must outperform the best individual classifier.
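
For the diversity pitfall, one rough check is to measure how often pairs of base models predict the same label on held-out data. The sketch below uses out-of-fold predictions from `cross_val_predict`. Pairwise agreement is only a crude proxy for diversity, and the model set is the illustrative one used throughout.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

models = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=42),
    "nb": GaussianNB(),
}

# Out-of-fold label predictions for each model.
preds = {name: cross_val_predict(m, X, y, cv=5) for name, m in models.items()}

# Fraction of samples on which each pair of models predicts the same label.
agreement = {
    (a, b): float((preds[a] == preds[b]).mean())
    for a, b in combinations(preds, 2)
}
for pair, rate in agreement.items():
    print(pair, round(rate, 3))
```

Models that agree on nearly every sample also make nearly the same mistakes, so averaging them has little room to improve on any one of them.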

Summary

- In scikit-learn, soft voting is the standard starting point for combining probabilistic classifiers.
- Weighted soft voting is useful when some base models are demonstrably stronger.
- Stacking is more flexible because the combination rule is learned.
- Probability calibration matters if the numeric probabilities themselves drive decisions.
- Start simple, validate carefully, and only add complexity when the data justifies it.