Adapting binary stacking example to multiclass


Introduction

Stacking for multiclass classification follows the same core idea as binary stacking: train several base models, then train a meta-model on their predictions. The main difference is that each base model now produces one score or probability per class, so the meta-model consumes a wider feature set, and the training pipeline must still guard against leakage.

What Changes from Binary to Multiclass

In binary classification, a base model often produces one probability for the positive class. In multiclass classification with k classes, that same base model typically produces k probabilities.

That means if you have:

m base models
k classes

then the meta-model can receive roughly m * k probability features per sample.

This is the key conceptual shift when adapting a binary stacking example.
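As a rough illustration of that shift, the sketch below fits the same two-model stack on a binary subset of iris and on the full three-class data, then inspects the shape of the meta-features through StackingClassifier.transform. The helper name meta_feature_shape and the choice of base models are assumptions made for this example only. In the binary case scikit-learn keeps a single probability column per base model, while in the multiclass case it keeps all k columns.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def meta_feature_shape(X, y):
    """Fit a two-model stack and return the shape of its meta-features.

    Hypothetical helper for this illustration, not part of scikit-learn.
    """
    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method="predict_proba",
    )
    stack.fit(X, y)
    # transform() returns the features the final estimator sees.
    return stack.transform(X).shape


# Binary problem: keep only classes 0 and 1 (100 rows).
binary_mask = y < 2
binary_shape = meta_feature_shape(X[binary_mask], y[binary_mask])

# Multiclass problem: all three classes (150 rows).
multi_shape = meta_feature_shape(X, y)

print(binary_shape)  # (100, 2): m = 2 models, one column each
print(multi_shape)   # (150, 6): m * k = 2 * 3 columns
```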

Use Probabilities, Not Hard Labels

For multiclass stacking, class probabilities are usually more informative than predicted labels because they preserve uncertainty.

With scikit-learn, the simplest route is StackingClassifier using predict_proba.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Base models. SVC needs probability=True to expose predict_proba.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]

# The meta-model is trained on the base models' class probabilities.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)

stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # accuracy on the held-out test set
```

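This is the cleanest adaptation in scikit-learn because StackingClassifier builds the meta-model's training features with internal cross-validation (controlled by its cv parameter, 5-fold by default), so the final estimator is fit on out-of-fold predictions rather than on predictions for rows the base models have already seen.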

Avoid Training-Set Leakage

The meta-model must not be trained on base-model predictions generated from the same rows the base models trained on. That gives overly optimistic features and inflated results.

The right approach is:

split data into folds
train each base model on training folds
generate predictions for the held-out fold
combine those out-of-fold predictions into meta-features
train the meta-model on that meta-feature matrix

StackingClassifier helps automate that process. If you write the stacking manually, you must implement this carefully yourself.
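The five steps above can be sketched with an explicit fold loop. This is a minimal illustration rather than a reference implementation: the names base_models and oof, the choice of base models, and the 5-fold split are assumptions made for the example. It also assumes every class appears in every training fold (which stratified splitting gives on iris), so each predict_proba call returns the same number of columns in the same class order.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# One block of n_classes columns per base model, filled fold by fold.
oof = np.zeros((X.shape[0], len(base_models) * n_classes))

# Step 1: split the data into folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, holdout_idx in cv.split(X, y):
    for j, model in enumerate(base_models):
        # Step 2: train a fresh copy of the base model on the training folds.
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        # Steps 3-4: predict only the held-out rows and store them
        # in this model's column block.
        cols = slice(j * n_classes, (j + 1) * n_classes)
        oof[holdout_idx, cols] = fitted.predict_proba(X[holdout_idx])

# Step 5: train the meta-model on the out-of-fold meta-features.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(oof, y)
```

Every row of oof is filled exactly once, by a model copy that never trained on that row, which is what keeps the meta-features honest.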

What Manual Multiclass Stacking Looks Like

A manual stacking workflow usually collects predict_proba outputs from each base estimator and concatenates them column-wise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

base1 = RandomForestClassifier(n_estimators=50, random_state=1)
base2 = LogisticRegression(max_iter=1000, random_state=1)

# Shape demonstration only: the base models are fit and scored on the same rows.
base1.fit(X, y)
base2.fit(X, y)

# Two (n_samples, 3) probability matrices side by side -> (n_samples, 6).
meta_features = np.hstack([
    base1.predict_proba(X),
    base2.predict_proba(X),
])

meta_model = LogisticRegression(max_iter=1000, random_state=1)
meta_model.fit(meta_features, y)
```

This demonstrates the data shape, but for a real pipeline you must replace same-data predictions with out-of-fold predictions to avoid leakage.
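One way to make that replacement, sketched here under assumptions, is scikit-learn's cross_val_predict with method="predict_proba", which returns out-of-fold probabilities for every training row. The train/test split, the variable names train_meta and test_meta, and the step of refitting the base models on all training rows before scoring the test set are choices made for this illustration, not the only valid design.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=1),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold probabilities for the training rows: each row is predicted
# by a model copy that did not see it during fitting.
train_meta = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    for model in base_models
])

meta_model = LogisticRegression(max_iter=1000).fit(train_meta, y_train)

# For inference, refit each base model on all training rows, then build
# the test meta-features with the same column order.
test_meta = np.hstack([
    model.fit(X_train, y_train).predict_proba(X_test)
    for model in base_models
])

test_accuracy = meta_model.score(test_meta, y_test)
print(train_meta.shape, test_meta.shape, test_accuracy)
```

Leakage is only a concern for the rows used to train the meta-model, so refitting the base models on the full training set for test-time predictions does not reintroduce it.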

Choose a Meta-Model That Supports Multiclass

Your meta-model must support multiclass targets. Common choices include:

logistic regression
gradient boosting
small neural network

Start simple. Logistic regression is often a good default meta-model because it is fast, well-understood, and easy to debug.

Class Imbalance and Calibration

In multiclass problems, poor probability calibration can hurt stacking quality. If one base model is consistently overconfident, its probability columns may dominate the meta-model for the wrong reasons.

If probabilities are unstable, consider:

calibrating base estimators
comparing predict_proba versus decision_function
using stratified folds

Multiclass stacking often succeeds or fails on these details rather than on the high-level algorithm choice.
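As a hedged sketch of the first and third options, a base estimator can be wrapped in CalibratedClassifierCV and the stack can be given an explicit StratifiedKFold splitter. The specific estimators, the sigmoid calibration method, and the fold counts below are assumptions chosen for illustration, and whether calibration actually helps depends on the data. Here LinearSVC has no predict_proba of its own, so the calibration wrapper is also what lets it feed probabilities into the stack.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Map the SVM's decision scores to calibrated class probabilities.
calibrated_svc = CalibratedClassifierCV(
    make_pipeline(StandardScaler(), LinearSVC()),
    method="sigmoid",
    cv=3,
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", calibrated_svc),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    # Stratified folds keep class proportions similar in every split.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)

stack.fit(X, y)
probabilities = stack.predict_proba(X[:5])
print(probabilities.shape)  # (5, 3): one probability per class
```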

Evaluation Still Matters

Do not judge your multiclass stack by accuracy alone if the classes are uneven or the misclassification costs differ. Check metrics that match the problem.

Examples:

macro F1
weighted F1
confusion matrix
per-class recall

A stack can improve one dominant class while hurting minority classes, so inspect more than one summary metric.
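The metrics listed above are all available in sklearn.metrics. The sketch below computes them for a small stack on iris; the dataset, split, and base models are assumptions chosen to keep the example self-contained, and on a balanced toy dataset like this one the metrics will largely agree with each other.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)

# Macro F1 weights every class equally; weighted F1 weights by class support.
macro_f1 = f1_score(y_test, y_pred, average="macro")
weighted_f1 = f1_score(y_test, y_pred, average="weighted")

# average=None returns one recall value per class.
per_class_recall = recall_score(y_test, y_pred, average=None)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
print("per-class recall:", per_class_recall)
print(cm)
print(classification_report(y_test, y_pred))
```

Running the same report for each base model on its own gives a baseline, which makes it easier to see whether the stack helps every class or only the dominant one.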

Common Pitfalls

The biggest mistake is feeding hard class labels from base models into the meta-model and losing probability information. Another is training the meta-model on predictions from models that already saw those same rows during fitting, which creates leakage. Teams also forget that the number of meta-features grows as m * k, so regularization and the choice of meta-estimator matter more, as does feature scaling if you stack unbounded decision_function scores rather than probabilities. Finally, reporting only accuracy can hide the fact that the stack did not actually improve multiclass behavior where it matters.

Summary

Multiclass stacking extends binary stacking by using one prediction feature per class per base model.
Use probabilities rather than hard labels when possible.
Prevent leakage by training the meta-model on out-of-fold base predictions.
Choose a multiclass-capable meta-model such as logistic regression.
Evaluate with metrics that reflect multiclass performance, not just overall accuracy.