Better way to shuffle two numpy arrays in unison

Introduction

When two NumPy arrays represent aligned data such as features and labels, they must be shuffled with the same permutation. Shuffling them independently breaks that relationship immediately. The clean solution is to generate one permutation of row indices and apply it to every aligned array.

The Basic Pattern

Suppose X contains samples and y contains labels in matching order. The safe shuffle is:

python
import numpy as np

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
y = np.array([0, 1, 0, 1])

perm = np.random.permutation(len(X))
X_shuffled = X[perm]
y_shuffled = y[perm]

print(X_shuffled)
print(y_shuffled)

The same permutation array is applied to both objects, so row i in X_shuffled still corresponds to element i in y_shuffled.

This is the key idea. Once you understand that, the rest is just API style.
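One way to convince yourself that the pairing survives is to build labels that encode their own row and check the relationship after shuffling. The following is a minimal sketch; the toy data (each label equal to the first column of its row) is an assumption made purely for the check.

```python
import numpy as np

# Toy data where each label is derived from its row: y[i] == X[i, 0].
X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
y = X[:, 0].copy()

# One permutation, applied to both arrays.
perm = np.random.permutation(len(X))
X_shuffled = X[perm]
y_shuffled = y[perm]

# The relationship y[i] == X[i, 0] must still hold row by row.
pairing_ok = np.array_equal(X_shuffled[:, 0], y_shuffled)
print(pairing_ok)  # True, because both arrays were indexed with the same perm
```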

Why Separate Shuffles Are Wrong

A common mistake is this:

python
np.random.shuffle(X)
np.random.shuffle(y)

These are two independent random operations. Each call draws a fresh permutation from the random state, so even when they run one after the other in the same script, the two arrays will almost certainly end up in different orders.

That breaks the dataset pairing. In a machine learning pipeline, that can silently corrupt training because features no longer match their labels.
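To see the damage concretely, the sketch below shuffles two aligned arrays independently and counts how many rows still sit next to their original label. The data is made up for illustration.

```python
import numpy as np

# Aligned toy data: label i belongs to row i.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# The mistake: two independent in-place shuffles.
np.random.shuffle(X)
np.random.shuffle(y)

# Count the rows whose feature value still matches its label.
still_paired = int(np.sum(X[:, 0] == y))
print(f"{still_paired} of {len(y)} rows still match their label")
```

In expectation only about one row keeps its original partner, regardless of how large the arrays are, so the exact count varies from run to run but stays tiny.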

Prefer default_rng in Newer Code

For modern NumPy code, np.random.default_rng() is usually a cleaner interface than the older global random functions.

python
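import numpy as np

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
y = np.array([0, 1, 0, 1])

# A seeded generator gives the same permutation on every run.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(X))

X_shuffled = X[perm]
y_shuffled = y[perm]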

This has two advantages:

  • reproducibility is explicit through the generator instance
  • random state is easier to isolate in larger programs and tests

If you are writing reusable code, this is often the best style.
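A small sketch of what those two points mean in practice: generators built from the same seed produce the same permutation, and drawing from an unrelated generator does not disturb them. The seed values and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created from the same seed start in the same state.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

perm_a = rng_a.permutation(8)
perm_b = rng_b.permutation(8)
same_seed_same_perm = np.array_equal(perm_a, perm_b)

# A separate generator elsewhere in the program has its own state.
other = np.random.default_rng(123)
other.random(1000)

# Consuming `other` did not advance rng_a or rng_b, so they still agree.
still_in_sync = np.array_equal(rng_a.permutation(8), rng_b.permutation(8))

print(same_seed_same_perm, still_in_sync)  # True True
```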

A Reusable Helper Function

If you need to shuffle aligned arrays frequently, wrap the pattern in a helper.

python
import numpy as np


def shuffle_in_unison(*arrays, seed=None):
    if not arrays:
        return ()

    length = len(arrays[0])
    if any(len(arr) != length for arr in arrays):
        raise ValueError("All arrays must have the same length")

    rng = np.random.default_rng(seed)
    perm = rng.permutation(length)
    return tuple(arr[perm] for arr in arrays)


X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
y = np.array([0, 1, 0, 1])
weights = np.array([0.1, 0.2, 0.3, 0.4])

X2, y2, w2 = shuffle_in_unison(X, y, weights, seed=7)
print(X2)
print(y2)
print(w2)

This is a good pattern when you need the same shuffle across features, labels, sample weights, or metadata arrays.
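One thing to keep in mind is that a fixed integer seed yields the same order on every call, which is usually not what you want when reshuffling once per epoch. Because np.random.default_rng returns an existing Generator unchanged when it is given one, the same helper can also take a shared generator through its seed argument. The sketch below repeats the helper so it runs on its own; the epoch loop and toy data are illustrative assumptions.

```python
import numpy as np


def shuffle_in_unison(*arrays, seed=None):
    if not arrays:
        return ()

    length = len(arrays[0])
    if any(len(arr) != length for arr in arrays):
        raise ValueError("All arrays must have the same length")

    # default_rng passes an existing Generator through unchanged.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(length)
    return tuple(arr[perm] for arr in arrays)


# Toy data where y[i] == X[i, 0], so alignment is easy to verify.
X = np.arange(12).reshape(6, 2)
y = X[:, 0].copy()

# A shared generator advances between calls, so each epoch draws a new permutation.
rng = np.random.default_rng(7)
pairing_ok = []
for epoch in range(3):
    X_ep, y_ep = shuffle_in_unison(X, y, seed=rng)
    pairing_ok.append(np.array_equal(X_ep[:, 0], y_ep))

# A fixed integer seed rebuilds the same generator, hence the same order.
fixed_a = shuffle_in_unison(y, seed=7)[0]
fixed_b = shuffle_in_unison(y, seed=7)[0]

print(pairing_ok)
print(np.array_equal(fixed_a, fixed_b))  # True
```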

In-Place vs Copying Behavior

Indexing with a permutation such as X[perm] creates a reordered copy. That is usually what you want because it leaves the original arrays unchanged.

If you want the original variable names to refer to the shuffled data, reassign them:

python
X = X[perm]
y = y[perm]

Be explicit about that choice. Hidden mutation is rarely helpful when debugging data pipelines.
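Reassignment only rebinds the names X and y; any other reference to the original array objects still sees the old order. If other code holds such a reference and needs to observe the shuffle, one option is slice assignment, sketched below with made-up data. It still builds a temporary copy on the right-hand side, so it does not reduce peak memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data where y[i] == X[i, 0].
X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
y = np.array([1, 2, 3, 4])
X_alias = X  # a second reference to the same array object

perm = rng.permutation(len(X))

# Slice assignment writes the reordered rows into the existing buffers.
X[:] = X[perm]
y[:] = y[perm]

# The alias is still the same object and sees the shuffled, aligned data.
alias_sees_shuffle = X_alias is X and np.array_equal(X_alias[:, 0], y)
print(alias_sees_shuffle)  # True
```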

Pandas and scikit-learn Alternatives

If you are already using higher-level libraries, there are alternatives. For example, scikit-learn has a helper that shuffles arrays consistently:

python
from sklearn.utils import shuffle
import numpy as np

X = np.array([[1, 10], [2, 20], [3, 30]])
y = np.array([0, 1, 0])

X_shuffled, y_shuffled = shuffle(X, y, random_state=42)
print(X_shuffled)
print(y_shuffled)

That is convenient, but the NumPy permutation approach is still the core idea and works without extra dependencies.
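With pandas, a comparable approach is to keep features and labels as columns of one DataFrame and shuffle its rows with DataFrame.sample(frac=1). This is only a sketch: the column names and data are invented for illustration, and it assumes your data fits comfortably in a single frame.

```python
import pandas as pd

# Features and labels live in the same rows, so they cannot drift apart.
df = pd.DataFrame({"feature": [10, 20, 30, 40], "label": [1, 2, 3, 4]})

# frac=1 samples every row without replacement, i.e. a full shuffle.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# In this toy data, feature == label * 10 on every row.
pairing_ok = bool((shuffled["feature"] == shuffled["label"] * 10).all())
print(shuffled)
print(pairing_ok)  # True
```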

Shape and Length Checks Matter

All aligned arrays must have the same length along the dimension being shuffled. Usually that means the first axis represents samples.

For example, if X.shape[0] != y.shape[0], then the data is not aligned properly to begin with. A shuffle helper should validate that rather than fail later with confusing index errors.

This matters even more when mixing one-dimensional labels with higher-dimensional feature arrays or when shuffling batches of tensors.
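A brief sketch of why an explicit check is worth having, using toy data: with a permutation built from len(X), a label array that is too long is silently truncated by the indexing step, while one that is too short raises IndexError. The helper name check_aligned is hypothetical.

```python
import numpy as np


def check_aligned(*arrays):
    """Raise ValueError unless all arrays share the same first-axis length."""
    lengths = [np.shape(a)[0] for a in arrays]
    if len(set(lengths)) > 1:
        raise ValueError(f"Mismatched sample counts: {lengths}")


X = np.zeros((4, 2))
y_long = np.arange(5)   # one label too many
y_short = np.arange(3)  # one label too few

perm = np.random.default_rng(0).permutation(len(X))

# Too long: indexing succeeds and quietly drops the extra label.
silently_truncated = y_long[perm]

# Too short: the permutation contains index 3, which is out of bounds.
try:
    y_short[perm]
    short_failed = False
except IndexError:
    short_failed = True

# The explicit check reports the mismatch up front in both cases.
try:
    check_aligned(X, y_long)
    caught = False
except ValueError:
    caught = True

print(silently_truncated.shape, short_failed, caught)  # (4,) True True
```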

Common Pitfalls

The most common mistake is calling np.random.shuffle on each array separately and assuming they will stay aligned. Another is generating a permutation from the wrong length, such as the number of columns instead of the number of samples. Developers also sometimes forget that advanced indexing creates a copy, which can matter for memory usage on large arrays. A final issue is neglecting reproducibility when experiments need a stable shuffle order across runs.

Summary

  • To shuffle aligned NumPy arrays together, generate one permutation and apply it to all arrays.
  • Never shuffle the arrays independently if the pairing must be preserved.
  • default_rng().permutation(...) is a clean modern pattern.
  • Wrap the logic in a helper if you need to shuffle several aligned arrays repeatedly.
  • Validate that all arrays have the same sample length before shuffling.
