Convert array of indices to one-hot encoded array in NumPy

Introduction

Converting an array of class indices into a one-hot encoded matrix is a common NumPy task in machine learning and preprocessing pipelines. Each input index becomes a row whose selected class position is 1 and whose other positions are 0.

The key question is usually not whether this is possible. It is how to do it efficiently and clearly without writing slow Python loops.

The Simplest NumPy Pattern

For a one-dimensional array of indices, the cleanest approach is often to index into an identity matrix created by np.eye.

python
import numpy as np

indices = np.array([2, 0, 1, 2])
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(3, dtype=int)[indices]

print(one_hot)

Output:

text
[[0 0 1]
 [1 0 0]
 [0 1 0]
 [0 0 1]]

This works because np.eye(3) creates a 3 by 3 identity matrix, and selecting rows by the class indices gives the one-hot representation directly.
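The same row selection also works when the indices are not one-dimensional. As a small sketch (the `grid` array is an illustrative assumption, not part of the example above), indexing the identity matrix with a 2-D array of indices appends a trailing class axis:

```python
import numpy as np

# Hypothetical 2-D array of class indices, e.g. a 2x2 grid of labels.
grid = np.array([[0, 2],
                 [1, 1]])

# Indexing the identity matrix appends a class axis of length 3.
one_hot = np.eye(3, dtype=int)[grid]

print(one_hot.shape)  # (2, 2, 3)
print(one_hot[0, 1])  # [0 0 1], the encoding of class 2
```

Every position along the last axis still holds exactly one 1, so the result can be flattened or reshaped as needed.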

Control the Number of Classes Explicitly

In production code, it is often better to pass the class count explicitly instead of inferring it from the largest value. That makes shapes stable across batches.

python
import numpy as np

indices = np.array([1, 3, 0])
num_classes = 5  # fixed by the model, not by this batch
one_hot = np.eye(num_classes, dtype=np.float32)[indices]

print(one_hot.shape)  # (3, 5), even though classes 2 and 4 are absent
print(one_hot)

This is useful when some classes do not appear in a given batch but still belong to the model’s output space.
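To see why this matters, the sketch below compares the two approaches on a pair of made-up batches (`batch_a` and `batch_b` are illustrative names). Inferring the class count from each batch's maximum gives matrices of different widths, while an explicit `num_classes` keeps the width fixed:

```python
import numpy as np

num_classes = 5  # assumed size of the model's output space

batch_a = np.array([1, 3, 0])
batch_b = np.array([4, 2])

# Inferring the class count from each batch gives inconsistent widths.
inferred_a = np.eye(batch_a.max() + 1)[batch_a]
inferred_b = np.eye(batch_b.max() + 1)[batch_b]
print(inferred_a.shape, inferred_b.shape)  # (3, 4) (2, 5)

# Passing num_classes explicitly keeps the second dimension fixed.
fixed_a = np.eye(num_classes)[batch_a]
fixed_b = np.eye(num_classes)[batch_b]
print(fixed_a.shape, fixed_b.shape)  # (3, 5) (2, 5)
```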

Scatter into a Preallocated Array

If you want to avoid building the full identity matrix first, you can scatter values into a zeros array.

python
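import numpy as np

indices = np.array([2, 0, 1, 2])
num_classes = 3
# Start from all zeros: one row per index, one column per class.
one_hot = np.zeros((indices.size, num_classes), dtype=int)
# Pair row i with column indices[i] and set that single entry to 1.
one_hot[np.arange(indices.size), indices] = 1

print(one_hot)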

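This pattern is especially useful when you want to fill an existing output buffer in place, or when the number of classes is large: np.eye(num_classes) builds a num_classes by num_classes matrix before any rows are selected, while the scatter allocates only the indices.size by num_classes output.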
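A rough sketch of the size difference, using an assumed class count of 10,000 and four made-up indices. The identity matrix is never built here; its element count is computed only by arithmetic:

```python
import numpy as np

num_classes = 10_000  # assumed, illustrative class count
indices = np.array([7, 42, 9_999, 0])

# Scatter: allocate only the (n, num_classes) output, then set one entry per row.
one_hot = np.zeros((indices.size, num_classes), dtype=np.uint8)
one_hot[np.arange(indices.size), indices] = 1

# Element counts: the scatter output versus the identity matrix that
# np.eye(num_classes) would build before any rows are selected.
scatter_elements = indices.size * num_classes  # 40,000
identity_elements = num_classes * num_classes  # 100,000,000

print(one_hot.shape, scatter_elements, identity_elements)
```

The gap grows with the class count, since the identity matrix scales with its square while the scatter output scales with the batch size.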

Validate the Indices First

One-hot encoding assumes that every value is a valid class index in the range from 0 to num_classes - 1. An index that is too large raises an IndexError, which is at least visible. A small negative index is more dangerous: NumPy counts it from the end, so -1 is silently encoded as the last class.

A quick validation step can make debugging much easier:

python
import numpy as np

indices = np.array([0, 2, 1])
num_classes = 3

if np.any(indices < 0) or np.any(indices >= num_classes):
    raise ValueError("indices out of range")

That is worth doing when indices come from user input, external files, or a model pipeline that may drift.
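One way to package the check with the encoding is a small helper. The function below is a hypothetical sketch (the name `to_one_hot` and its `dtype` default are assumptions, not an established API) that rejects non-integer and out-of-range input before scattering:

```python
import numpy as np

def to_one_hot(indices, num_classes, dtype=np.float32):
    """Hypothetical helper: validate class indices, then one-hot encode them."""
    indices = np.asarray(indices)
    if not np.issubdtype(indices.dtype, np.integer):
        raise TypeError("indices must have an integer dtype")
    if np.any(indices < 0) or np.any(indices >= num_classes):
        raise ValueError("indices out of range")
    # Scatter a 1 into each row, then restore the input's shape plus a class axis.
    one_hot = np.zeros((indices.size, num_classes), dtype=dtype)
    one_hot[np.arange(indices.size), indices.ravel()] = 1
    return one_hot.reshape(indices.shape + (num_classes,))

print(to_one_hot([0, 2, 1], 3))

# Without the check, a negative index wraps around silently:
print(np.eye(3, dtype=int)[np.array([-1])])  # [[0 0 1]], same as class 2
```

Raising early keeps a bad label from turning into a plausible-looking but wrong row further down the pipeline.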

One-Hot Encoding Is Not Always the Right Tool

One-hot encoding is easy to understand, but it can be memory-heavy when the number of classes is large, because a dense encoding stores num_classes values per sample and all but one of them are zero. In deep learning, embeddings are often better for very high-cardinality categorical features.

Still, for small or moderate class counts, one-hot matrices are simple, explicit, and integrate well with NumPy-based workflows.
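The link between one-hot vectors and embedding lookups can be shown in plain NumPy. In this sketch the `weights` table is a random stand-in for learned parameters and the sizes are arbitrary. Multiplying a one-hot matrix by the table selects the same rows as indexing the table directly, which is why a lookup can replace the dense matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 6, 4  # illustrative sizes
weights = rng.normal(size=(num_classes, dim))  # stand-in for a learned table
indices = np.array([5, 0, 3])

# Dense route: build the one-hot matrix, then multiply.
via_one_hot = np.eye(num_classes)[indices] @ weights

# Lookup route: select the same rows directly, no one-hot matrix needed.
via_lookup = weights[indices]

print(np.allclose(via_one_hot, via_lookup))  # True
```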

Common Pitfalls

  • Inferring the number of classes from one batch when later batches may contain other classes.
  • Forgetting to validate that all indices are in range.
  • Using Python loops instead of NumPy vectorized indexing.
  • Building a huge dense one-hot matrix when the class space is very large.
  • Mixing up label arrays and already one-hot encoded arrays later in the pipeline.
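For the last pitfall, a quick shape check usually tells the two representations apart, and argmax converts back when needed. A minimal sketch, assuming 1-D integer labels and a 2-D one-hot matrix:

```python
import numpy as np

labels = np.array([2, 0, 1, 2])
one_hot = np.eye(3, dtype=int)[labels]

# Shape tells the two apart: labels are (n,), one-hot is (n, num_classes).
print(labels.shape, one_hot.shape)  # (4,) (4, 3)

# argmax along the class axis recovers the original labels.
recovered = one_hot.argmax(axis=1)
print(recovered)  # [2 0 1 2]
```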

Summary

  • The simplest NumPy solution is often np.eye(num_classes)[indices].
  • A scatter approach with np.zeros and advanced indexing is another efficient option.
  • Pass num_classes explicitly when stable output shape matters.
  • Validate index ranges before encoding.
  • One-hot encoding is useful for moderate class counts, but not always ideal for very large ones.
