best way to pick a random subset from a collection?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
The best way to pick a random subset depends on the size of the collection, whether duplicates are allowed, and whether the full collection fits in memory. For a normal in-memory collection where you want k unique elements without replacement, the standard answer is to use a built-in sampling function instead of writing a custom shuffle loop by hand.
Use Built-In Sampling for Ordinary Cases
In Python, random.sample is the clearest choice for selecting k unique items without replacement.
This returns a random subset of size 3 with no duplicates. It is usually preferable to shuffling the whole collection when you only need a small sample.
Know the Difference Between Sampling and Shuffling
People often solve this problem by shuffling the collection and taking the first k elements.
This works, but it mutates the original list and does more work than necessary if k is much smaller than the collection size. That is why direct sampling APIs are a better default when they are available.
If you need to preserve the original list, copy first:
Handle Large or Streaming Collections Differently
If the collection is too large to keep in memory or arrives as a stream, use reservoir sampling. This lets you choose k items uniformly without storing the entire input.
This is the right tool for log streams, large files, or database cursors when reading everything into a list is wasteful.
Clarify Whether Replacement Is Allowed
Some tasks want a subset without duplicates. Others want repeated selection where the same element can appear more than once.
Without replacement:
With replacement:
That distinction matters. If the question says "subset," it usually implies no replacement, but many business tasks really want repeated random picks instead.
Think About Reproducibility
For testing or scientific work, seed the random generator so the sample can be reproduced.
Using a local random generator is often cleaner than setting global seed state, especially in larger programs or test suites.
Consider Weighted Sampling Separately
Some problems do not want every element to be equally likely. If certain items should be favored, that is a different requirement from uniform subset selection and should be handled explicitly.
For example, with replacement in Python:
Weighted sampling without replacement is more specialized, and the right approach depends on the language or library you use. The important point is not to quietly substitute a weighted process when the task really asked for a uniform random subset.
Common Pitfalls
- Shuffling the entire collection when a direct sampling API would be simpler and cheaper.
- Forgetting that
random.shufflemutates the original list. - Using sampling without replacement when the problem actually allows repeated picks.
- Reading an entire large stream into memory instead of using reservoir sampling.
- Ignoring reproducibility requirements in tests, simulations, or experiments.
Summary
- For ordinary in-memory collections, use a built-in sampling function such as
random.sample. - Shuffling and slicing works, but it mutates the list and is often less direct.
- For streaming or very large inputs, use reservoir sampling.
- Decide explicitly whether duplicates are allowed.
- Seed a local random generator when reproducibility matters.

