Adapting binary stacking example to multiclass

Introduction

Stacking for multiclass classification follows the same core idea as binary stacking: train several base models, then train a meta-model on their predictions. The main difference is that each base model now produces one score or probability per class, so the meta-model consumes a wider feature set and the training pipeline has to guard against leakage.

What Changes from Binary to Multiclass

In binary classification, a base model often produces one probability for the positive class. In multiclass classification with k classes, that same base model typically produces k probabilities.

That means if you have:

  • m base models
  • k classes

then the meta-model can receive roughly m * k probability features per sample.

This is the key conceptual shift when adapting a binary stacking example.
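As a quick check of that m * k count, the sketch below fits two base models on the Iris data and stacks their probability outputs side by side. The choice of models and the variable names (base_models, probas, meta_features) are illustrative assumptions, and the models are fitted on all rows only so the shapes can be inspected.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, k = 3 classes

# Two illustrative base models (m = 2); any classifier with predict_proba would do.
base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]

# Fitted on all rows purely to inspect shapes; this is not a leakage-safe setup.
probas = [model.fit(X, y).predict_proba(X) for model in base_models]

m = len(base_models)
k = len(np.unique(y))
meta_features = np.hstack(probas)  # one block of k columns per base model

print(probas[0].shape)      # (150, 3)
print(meta_features.shape)  # (150, 6), i.e. (n_samples, m * k)
```

Each block of k columns sums to 1 per row, so one column per base model is redundant. That is one reason the count is described as "roughly" m * k.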

Use Probabilities, Not Hard Labels

For multiclass stacking, class probabilities are usually more informative than predicted labels because they preserve uncertainty.

With scikit-learn, the simplest route is StackingClassifier with stack_method="predict_proba", so that every base model passes its class probabilities to the meta-model.

python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Base models. SVC needs probability=True to expose predict_proba,
# and it is scaled inside its own pipeline.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]

# The meta-model (final_estimator) is trained on the base models' class probabilities.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)

stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # mean accuracy on the held-out test split

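This is usually the cleanest adaptation in scikit-learn because StackingClassifier handles out-of-fold stacking internally: it builds the meta-model's training features from cross-validated predictions of the base models (5-fold by default), then refits each base model on the full training set for use at prediction time.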

Avoid Training-Set Leakage

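The meta-model must not be trained on base-model predictions generated from the same rows the base models trained on. Those in-sample predictions look more accurate than the base models really are, so the meta-model learns to over-trust them and the stack's reported results are inflated.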

The right approach is:

  1. split data into folds
  2. train each base model on training folds
  3. generate predictions for the held-out fold
  4. combine those out-of-fold predictions into meta-features
  5. train the meta-model on that meta-feature matrix

StackingClassifier helps automate that process. If you write the stacking manually, you must implement this carefully yourself.
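The five steps above can be sketched as an explicit fold loop. This is a minimal illustration rather than a reference implementation: the base models, the fold count, and names such as oof and holdout_idx are assumptions made for the example, and it relies on stratified folds so that every class appears in each training fold (otherwise the predict_proba columns would not line up across folds).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
n_classes = len(np.unique(y))

# Illustrative base models; both expose predict_proba.
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# One block of n_classes columns per base model, filled in fold by fold.
oof = np.zeros((X.shape[0], len(base_models) * n_classes))

# Step 1: split the data into (stratified) folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, holdout_idx in skf.split(X, y):
    for j, model in enumerate(base_models):
        # Step 2: train a fresh copy of the base model on the training folds.
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        # Steps 3-4: predict the held-out fold and store it in this model's columns.
        cols = slice(j * n_classes, (j + 1) * n_classes)
        oof[holdout_idx, cols] = fitted.predict_proba(X[holdout_idx])

# Step 5: train the meta-model on the out-of-fold meta-feature matrix.
meta_model = LogisticRegression(max_iter=1000).fit(oof, y)
print(oof.shape)  # (150, 6)
```

Every row lands in exactly one held-out fold, so each row of oof comes from models that never saw that row during fitting.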

What Manual Multiclass Stacking Looks Like

A manual stacking workflow usually collects predict_proba outputs from each base estimator and concatenates them column-wise.

python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

base1 = RandomForestClassifier(n_estimators=50, random_state=1)
base2 = LogisticRegression(max_iter=1000, random_state=1)

# Both base models are fitted and scored on the same rows:
# fine for showing the data shape, but leaky.
base1.fit(X, y)
base2.fit(X, y)

# Concatenate the per-class probabilities column-wise: shape (n_samples, 2 * 3).
meta_features = np.hstack([
    base1.predict_proba(X),
    base2.predict_proba(X),
])

meta_model = LogisticRegression(max_iter=1000, random_state=1)
meta_model.fit(meta_features, y)

This demonstrates the data shape, but for a real pipeline you must replace same-data predictions with out-of-fold predictions to avoid leakage.
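One way to make that replacement is scikit-learn's cross_val_predict with method="predict_proba", which returns out-of-fold probabilities for every training row. The sketch below is an assumption-laden outline of the full train-then-predict flow (the split sizes, fold count, and variable names meta_train and meta_test are illustrative): the meta-model learns from out-of-fold features, and the base models are then refitted on the whole training set to build features for unseen data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=1),
    LogisticRegression(max_iter=1000),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Out-of-fold probabilities: each training row is scored by a model that never saw it.
meta_train = np.hstack([
    cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")
    for model in base_models
])
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# For new data, refit each base model on the full training set,
# then stack its probabilities in the same column order.
meta_test = np.hstack([
    model.fit(X_train, y_train).predict_proba(X_test) for model in base_models
])

test_accuracy = meta_model.score(meta_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

The column order of meta_test has to match meta_train, which is why both are built by iterating over the same base_models list.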

Choose a Meta-Model That Supports Multiclass

Your meta-model must support multiclass targets. Common choices include:

  • logistic regression
  • gradient boosting
  • small neural network

Start simple. Logistic regression is often a good default meta-model because it is fast, well-understood, and easy to debug.
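Part of what makes a logistic regression meta-model easy to debug is that its coefficients can be read directly. The sketch below, which assumes an illustrative pair of base models ("rf" and "nb"), fits a stack on the Iris data and inspects the fitted meta-model through the final_estimator_ attribute.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)
stack.fit(X, y)

# final_estimator_ is the fitted meta-model. For multiclass logistic regression,
# coef_ has one row per class and one column per meta-feature.
coef = stack.final_estimator_.coef_
print(coef.shape)  # (3, 6)

# Columns follow the estimators list: rf's three class columns, then nb's.
for class_label, row in zip(stack.classes_, coef):
    print(class_label, row.round(2))
```

A large weight on one base model's column for a given class suggests the meta-model leans on that model for that class, which is a useful starting point when a stack behaves unexpectedly.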

Class Imbalance and Calibration

In multiclass problems, poor probability calibration can hurt stacking quality. If one base model is consistently overconfident, its probability columns may dominate the meta-model for the wrong reasons.

If probabilities are unstable, consider:

  • calibrating base estimators
  • comparing predict_proba versus decision_function
  • using stratified folds

Multiclass stacking often succeeds or fails on these details rather than on the high-level algorithm choice.
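As a hedged illustration of the first and third options, the sketch below wraps a base estimator in CalibratedClassifierCV and passes an explicit StratifiedKFold to the stack. The choice of LinearSVC, sigmoid calibration, and the fold counts are assumptions for the example, not recommendations from the text.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# LinearSVC has no predict_proba; the calibration wrapper adds calibrated probabilities.
calibrated_svc = make_pipeline(
    StandardScaler(),
    CalibratedClassifierCV(LinearSVC(max_iter=5000), method="sigmoid", cv=3),
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", calibrated_svc),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    # Stratified folds keep the class proportions similar in every fold.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
stack.fit(X, y)

proba = stack.predict_proba(X[:5])
print(proba.shape)  # (5, 3)
```

Calibration runs its own inner cross-validation on each training fold, so this setup costs more to fit than an uncalibrated stack.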

Evaluation Still Matters

Do not compare your multiclass stack only by accuracy if the classes are uneven or the costs differ. Check metrics that match the problem.

Examples:

  • macro F1
  • weighted F1
  • confusion matrix
  • per-class recall

A stack can improve one dominant class while hurting minority classes, so inspect more than one summary metric.
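The metrics listed above are all available in sklearn.metrics. The following sketch computes them for a small stack on a held-out split; the base models and split parameters are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)
y_pred = stack.fit(X_train, y_train).predict(X_test)

# Macro F1 weights every class equally; weighted F1 weights classes by support.
macro_f1 = f1_score(y_test, y_pred, average="macro")
weighted_f1 = f1_score(y_test, y_pred, average="weighted")

# average=None returns one recall value per class instead of a single summary.
per_class_recall = recall_score(y_test, y_pred, average=None)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(f"macro F1: {macro_f1:.3f}, weighted F1: {weighted_f1:.3f}")
print("per-class recall:", per_class_recall.round(3))
print(cm)
```

Iris is balanced, so macro and weighted F1 will be close here. On an imbalanced dataset the gap between them is often the first sign that minority classes are being neglected.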

Common Pitfalls

The biggest mistake is feeding hard class labels from base models into the meta-model, which discards probability information. Another is training the meta-model on predictions from models that already saw those same rows during fitting, which creates leakage. Teams also forget that the number of meta-features grows as m * k, so estimator choice and regularization matter more, as does feature scaling when unbounded decision_function scores are mixed with probabilities. Finally, reporting only accuracy can hide the fact that the stack did not improve the classes where it matters.

Summary

  • Multiclass stacking extends binary stacking by using one prediction feature per class per base model.
  • Use probabilities rather than hard labels when possible.
  • Prevent leakage by training the meta-model on out-of-fold base predictions.
  • Choose a multiclass-capable meta-model such as logistic regression.
  • Evaluate with metrics that reflect multiclass performance, not just overall accuracy.
