Invalid classes inferred from unique values of y. Expected 0 1 2 3 4 5, got 1 2 3 4 5 6

Introduction

This error usually means your classifier wrapper expects class labels to be contiguous and zero-based, but your target vector starts at 1 instead of 0. In other words, the model sees six classes, but it expects them to be encoded as 0 through 5, while your data contains 1 through 6. The fix is usually to normalize the labels before training and to keep that same mapping for prediction output.

Why the Error Appears

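Many machine learning APIs, including most scikit-learn estimators, accept arbitrary label values because they encode classes internally. Some classifier implementations, however, expect the labels to already be integer-encoded as 0 through K-1 for K classes. Recent versions of XGBoost's XGBClassifier, for example, raise this exact message.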

For example, this target array has six classes but starts at 1:

python
y = [1, 2, 3, 4, 5, 6]

A classifier that expects zero-based contiguous labels wants:

python
y = [0, 1, 2, 3, 4, 5]

That is why the message says it expected 0 1 2 3 4 5 but got 1 2 3 4 5 6.
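A quick way to see whether a target vector meets this expectation is to compare its unique values with 0 through K-1. The helper below is a minimal sketch of that check; the name `is_zero_based_contiguous` is hypothetical and not part of any library.

```python
import numpy as np

def is_zero_based_contiguous(y):
    """Return True if the unique values of y are exactly 0, 1, ..., K-1."""
    classes = np.unique(y)  # sorted unique labels
    expected = np.arange(len(classes))
    return bool(np.array_equal(classes, expected))

print(is_zero_based_contiguous([1, 2, 3, 4, 5, 6]))  # False
print(is_zero_based_contiguous([0, 1, 2, 3, 4, 5]))  # True
```

Running a check like this just before fitting makes the problem visible earlier than the library's own error does.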

The Simplest Fix: Re-encode Labels

If your labels are already numeric but just offset by one, you can subtract one safely.

python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6])
y_fixed = y - 1  # shift one-based labels to zero-based
print(y_fixed)   # [0 1 2 3 4 5]

Then fit the model with y_fixed instead of the original labels.

This works only when you know the labels are supposed to be consecutive integers and the only problem is the starting index.
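Because the subtraction is only valid under that assumption, it may be worth asserting the assumption in code. The following sketch refuses to shift labels unless they are exactly 1 through K; the function name `shift_one_based_labels` is hypothetical.

```python
import numpy as np

def shift_one_based_labels(y):
    """Subtract 1 only if the labels are exactly 1, 2, ..., K."""
    y = np.asarray(y)
    classes = np.unique(y)
    expected = np.arange(1, len(classes) + 1)
    if not np.array_equal(classes, expected):
        raise ValueError(f"labels are not one-based and consecutive: {classes}")
    return y - 1

print(shift_one_based_labels([1, 2, 3, 4, 5, 6]))  # [0 1 2 3 4 5]
```

Labels such as 10, 20, 30 would raise a ValueError here instead of being shifted silently.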

The Safer General Fix: LabelEncoder

If your labels are strings, arbitrary numbers, or inconsistent across datasets, use LabelEncoder.

python
import numpy as np
from sklearn.preprocessing import LabelEncoder

raw_y = np.array([1, 2, 3, 4, 5, 6])
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(raw_y)  # learn the mapping and apply it

print(y_encoded)         # [0 1 2 3 4 5]
print(encoder.classes_)  # [1 2 3 4 5 6], the original labels in sorted order

This guarantees a contiguous 0-based encoding regardless of the original label values.
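To illustrate the "regardless of the original label values" part, here is a small sketch with made-up labels: one set of non-consecutive numbers and one set of strings. The variable names and sample values are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Non-consecutive numeric labels: sorted unique values map to 0, 1, 2.
num_encoder = LabelEncoder()
num_encoded = num_encoder.fit_transform(np.array([30, 10, 20, 10]))
print(num_encoded)           # [2 0 1 0]
print(num_encoder.classes_)  # [10 20 30]

# String labels are sorted before being numbered.
str_encoder = LabelEncoder()
str_encoded = str_encoder.fit_transform(["dog", "cat", "bird", "cat"])
print(str_encoded)           # [2 1 0 1]
print(str_encoder.classes_)  # ['bird' 'cat' 'dog']
```

In both cases the position of a label in `classes_` is its encoded class ID.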

After prediction, convert back to the original labels if needed:

python
predicted_encoded = np.array([0, 2, 5])
predicted_original = encoder.inverse_transform(predicted_encoded)
print(predicted_original)

That keeps your training representation compatible with the model while preserving the original label meaning for reports and downstream code.

Train and Test Must Use the Same Mapping

A common mistake is fitting the encoder separately on training and test data. The mapping must be learned from the training labels and reused consistently.

python
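import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Two samples per class so that every class can appear in both splits.
X = np.arange(36).reshape(12, 3)
y = np.tile([1, 2, 3, 4, 5, 6], 2)

# stratify=y keeps all six classes in train and test; without it, a label
# missing from y_train would make encoder.transform(y_test) raise ValueError.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)  # learn the mapping once
y_test_encoded = encoder.transform(y_test)        # reuse it, never refit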

Using one fitted encoder ensures the class IDs mean the same thing everywhere.
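The sketch below shows what goes wrong otherwise, using small made-up label arrays in which one class happens to be missing from the test split. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

train_labels = np.array([1, 2, 3, 1, 2, 3])
test_labels = np.array([2, 3, 3])  # class 1 happens to be absent here

# Wrong: a second encoder fitted on the test labels builds its own mapping.
wrong = LabelEncoder().fit_transform(test_labels)
print(wrong)  # [0 1 1]  -> label 2 became class 0

# Right: fit once on the training labels, then reuse the fitted encoder.
shared_encoder = LabelEncoder().fit(train_labels)
right = shared_encoder.transform(test_labels)
print(right)  # [1 2 2]  -> label 2 is class 1, as in training
```

Note that `transform` raises a ValueError if the test labels contain a value the encoder never saw during fitting, which is a useful signal rather than something to work around by refitting.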

Check the Real Root Cause

Sometimes the wrong labels are a symptom of a preprocessing bug rather than a harmless encoding issue. Before you patch the labels, inspect how they were created.

Useful checks include:

  • printing np.unique(y) before training
  • verifying label generation after merges or joins
  • checking whether one-based indexing came from a legacy dataset or manual coding convention
  • confirming that train and validation splits contain the expected label space

If labels should never have started at 1, fix the source pipeline rather than adding repeated corrections downstream.
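The checks listed above can be sketched roughly as follows. The helper `summarize_labels` and the sample arrays are hypothetical; the validation array contains a deliberately planted bad label to show what the check reports.

```python
import numpy as np

def summarize_labels(name, y):
    """Print the unique labels of y with their counts and return them as a set."""
    classes, counts = np.unique(y, return_counts=True)
    print(f"{name}: {dict(zip(classes.tolist(), counts.tolist()))}")
    return set(classes.tolist())

y_train = np.array([1, 2, 3, 4, 5, 6, 1, 2])
y_val = np.array([2, 3, 7])  # 7 is a deliberately planted bad label

train_classes = summarize_labels("train", y_train)
val_classes = summarize_labels("val", y_val)

# Labels that appear in validation but never in training deserve a closer look.
unexpected = val_classes - train_classes
print("unseen in training:", sorted(unexpected))  # unseen in training: [7]
```

A non-empty difference points to a problem in how the labels were produced or split, not to something an encoder should paper over.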

Example With a Classifier

python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

X = np.array([
    [0.1, 1.0],
    [0.2, 1.1],
    [0.3, 1.2],
    [0.4, 1.3],
    [0.5, 1.4],
    [0.6, 1.5],
])
y = np.array([1, 2, 3, 4, 5, 6])

# Map the one-based labels to 0..5 before they reach the classifier.
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

model = XGBClassifier(objective="multi:softprob", num_class=len(encoder.classes_))
model.fit(X, y_encoded)

The important part is not the specific library. It is the label normalization before fitting.

Common Pitfalls

The most common mistake is subtracting 1 without first checking whether the labels are truly consecutive integers. If the labels are 10, 20, 30, subtracting 1 yields 9, 19, 29, which is still not a zero-based consecutive range.

Another mistake is fitting the label encoder separately on different splits. That can produce inconsistent class IDs.

Developers also sometimes fix the training labels but forget to inverse-transform predictions before presenting them to users or downstream systems.
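One way to make the inverse transform hard to forget is to keep the encoder and the model together. The class below is a minimal sketch of that idea, not an existing library API: `EncodedLabelClassifier` is a hypothetical name, and `DecisionTreeClassifier` is only a stand-in for a model that requires zero-based labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

class EncodedLabelClassifier:
    """Hypothetical helper: encodes labels on fit, decodes them on predict."""

    def __init__(self, model):
        self.model = model
        self.encoder = LabelEncoder()

    def fit(self, X, y):
        # The wrapped model only ever sees labels 0..K-1.
        self.model.fit(X, self.encoder.fit_transform(y))
        return self

    def predict(self, X):
        # Convert the model's class IDs back to the original labels.
        encoded = self.model.predict(X)
        return self.encoder.inverse_transform(encoded)

X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]])
y = np.array([1, 2, 3, 4, 5, 6])

clf = EncodedLabelClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)
predictions = clf.predict(X)
print(predictions)  # values come from the original label space 1..6
```

Callers of `predict` then never see the internal 0-based IDs, so reports and downstream systems keep working with the original labels.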

Finally, do not ignore the possibility of upstream data corruption. A class shift may reveal a real preprocessing bug rather than just an encoding convention mismatch.

Summary

  • The error usually means the classifier expects zero-based consecutive class IDs.
  • If labels are simply one-based, subtracting 1 can fix the issue.
  • LabelEncoder is the safer general solution for arbitrary label values.
  • Reuse the same label mapping across training, validation, test, and prediction.
  • Check whether the label mismatch came from a real preprocessing problem.
