best way to pick a random subset from a collection?

random subset

collection sampling

subset selection

sampling techniques

randomization methods

best way to pick a random subset from a collection?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

The best way to pick a random subset depends on the size of the collection, whether duplicates are allowed, and whether the full collection fits in memory. For a normal in-memory collection where you want k unique elements without replacement, the standard answer is to use a built-in sampling function instead of writing a custom shuffle loop by hand.

Use Built-In Sampling for Ordinary Cases

In Python, random.sample is the clearest choice for selecting k unique items without replacement.

python

1import random
2
3items = ["a", "b", "c", "d", "e", "f"]
4subset = random.sample(items, k=3)
5
6print(subset)

This returns a random subset of size 3 with no duplicates. It is usually preferable to shuffling the whole collection when you only need a small sample.

Know the Difference Between Sampling and Shuffling

People often solve this problem by shuffling the collection and taking the first k elements.

python

1import random
2
3items = ["a", "b", "c", "d", "e", "f"]
4random.shuffle(items)
5subset = items[:3]

This works, but it mutates the original list and does more work than necessary if k is much smaller than the collection size. That is why direct sampling APIs are a better default when they are available.

If you need to preserve the original list, copy first:

python

shuffled = items[:]
random.shuffle(shuffled)
subset = shuffled[:3]

Handle Large or Streaming Collections Differently

If the collection is too large to keep in memory or arrives as a stream, use reservoir sampling. This lets you choose k items uniformly without storing the entire input.

python

1import random
2
3
4def reservoir_sample(iterable, k):
5    result = []
6
7    for i, item in enumerate(iterable):
8        if i < k:
9            result.append(item)
10        else:
11            j = random.randint(0, i)
12            if j < k:
13                result[j] = item
14
15    return result
16
17
18numbers = range(1, 1_000_001)
19print(reservoir_sample(numbers, 5))

This is the right tool for log streams, large files, or database cursors when reading everything into a list is wasteful.

Clarify Whether Replacement Is Allowed

Some tasks want a subset without duplicates. Others want repeated selection where the same element can appear more than once.

Without replacement:

python

random.sample(items, 3)

With replacement:

python

random.choices(items, k=3)

That distinction matters. If the question says "subset," it usually implies no replacement, but many business tasks really want repeated random picks instead.

Think About Reproducibility

For testing or scientific work, seed the random generator so the sample can be reproduced.

python

1import random
2
3rng = random.Random(42)
4subset = rng.sample(["a", "b", "c", "d", "e"], 2)
5print(subset)

Using a local random generator is often cleaner than setting global seed state, especially in larger programs or test suites.

Consider Weighted Sampling Separately

Some problems do not want every element to be equally likely. If certain items should be favored, that is a different requirement from uniform subset selection and should be handled explicitly.

For example, with replacement in Python:

python

1import random
2
3items = ["free", "standard", "premium"]
4weights = [0.6, 0.3, 0.1]
5
6print(random.choices(items, weights=weights, k=5))

Weighted sampling without replacement is more specialized, and the right approach depends on the language or library you use. The important point is not to quietly substitute a weighted process when the task really asked for a uniform random subset.

Common Pitfalls

Shuffling the entire collection when a direct sampling API would be simpler and cheaper.
Forgetting that random.shuffle mutates the original list.
Using sampling without replacement when the problem actually allows repeated picks.
Reading an entire large stream into memory instead of using reservoir sampling.
Ignoring reproducibility requirements in tests, simulations, or experiments.

Summary

For ordinary in-memory collections, use a built-in sampling function such as random.sample.
Shuffling and slicing works, but it mutates the list and is often less direct.
For streaming or very large inputs, use reservoir sampling.
Decide explicitly whether duplicates are allowed.
Seed a local random generator when reproducibility matters.