Algorithm for sampling without replacement?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Sampling without replacement means selecting unique items so no item appears twice in one sample. This is essential in statistics, randomized testing, and machine learning data splits. The right algorithm depends on data size, whether data is stored in memory, and whether input arrives as a stream.
Problem Setup
Given population size N and desired sample size k, choose k distinct elements uniformly at random.
Important constraints:
kcannot exceedN- each valid subset of size
kshould have equal probability
A correct algorithm must satisfy uniqueness and uniformity simultaneously.
Method 1: Shuffle and Slice
If full dataset fits memory, shuffle once and take first k elements.
This is simple and reliable, with O of N time and O of N memory for copied array.
Method 2: Partial Fisher-Yates
When k is much smaller than N, perform only first k swaps of Fisher-Yates.
This still gives uniform samples and can reduce work when k is small relative to N.
Method 3: Reservoir Sampling for Streams
If data arrives as stream or cannot fit memory, use reservoir sampling.
This uses O of k memory and O of N single-pass time.
Method 4: Index Sampling Then Lookup
For large immutable arrays where copying values is expensive, sample unique indices first, then fetch values.
This pattern is practical when data lives in external storage and random index access is available.
Uniformity and Validation
A correct sampler should not bias certain elements or subsets.
Practical validation checks:
- run many trials and compare frequency counts
- confirm no duplicates in each sample
- test edge cases where
kis zero, one, andN
For high-stakes simulation workloads, include statistical tests in quality pipelines.
Performance Tradeoffs
Choose algorithm by context:
- full in-memory list and moderate size: shuffle or partial Fisher-Yates
- huge stream: reservoir
- expensive element objects: sample indices
Do not choose algorithm only by popularity. Match it to memory and access constraints.
Reproducibility Requirements
For experiments and tests, controlled randomness is essential. Pass explicit seeds and isolate random generator state.
Avoid mixing global random state across unrelated modules when reproducibility matters.
Common Pitfalls
A common pitfall is repeatedly drawing random elements and retrying duplicates, which can become inefficient and biased if implemented incorrectly. Another is forgetting that stream sampling needs reservoir logic, not list shuffle. Teams also often ignore reproducibility and then fail to reproduce model-training results. Finally, not validating k boundaries can cause silent logic errors in downstream workflows.
Summary
- Sampling without replacement requires uniqueness and uniform randomness.
- Shuffle-based approaches are simplest for in-memory datasets.
- Reservoir sampling is the standard for streaming or huge inputs.
- Seeded randomness improves reproducibility and debugging.
- Pick algorithm based on data size, access pattern, and memory budget.

