Reservoir sampling
Reservoir sampling is an algorithm for selecting a sample of `k` items uniformly at random, without replacement, from a population of unknown size `n` (where `n >= k`) in a single pass through the data. This is particularly useful when the population size is not known in advance or when it is impractical to store the entire population in memory.
How Reservoir Sampling Works
Algorithm
The reservoir sampling algorithm is typically implemented as follows for selecting `k` samples from a stream:
- Initialize the Reservoir:
  - Create an array or list `R` of size `k` (indexed from `0` to `k - 1`) and fill it with the first `k` items from the input stream.
- Iterate Over Remaining Elements:
  - For each subsequent element at 1-based position `i` (from `k + 1` to `n`):
    - Generate a uniformly random integer `j` between `0` and `i - 1` (inclusive).
    - If `j` is less than `k`, replace `R[j]` with the current item from the stream; otherwise discard the item.
- Result: After the last element has been processed, the reservoir `R` contains a uniform random sample of `k` items.
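The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example. It treats `i` as the 1-based position of the current item and draws `j` from `0` to `i - 1`, so the item enters the reservoir with probability `k/i`.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items chosen uniformly at random from stream in one pass.

    If the stream holds fewer than k items, all of them are returned.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if rng is None:
        rng = random.Random()

    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            # Fill phase: the first k items always enter the reservoir.
            reservoir.append(item)
        else:
            # Draw j uniformly from 0..i-1; j < k happens with probability k/i.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1, 1001), k=5)
print(sample)  # five distinct values from 1..1000; varies between runs
```

Because the function only iterates over `stream`, it also works with generators and file handles whose length is not known up front.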
Example
Suppose we want to sample `k = 2` elements from the sequence `[1, 2, 3, 4, 5]`:
- Initialization:
  - Initialize `R` = `[1, 2]`.
- Process Remaining Elements:
  - Element at index `3` (value `3`):
    - Generate random index `j` = 0.
    - Since `0 < 2`, replace `R[0]` with `3`. Now `R` = `[3, 2]`.
  - Element at index `4` (value `4`):
    - Generate random index `j` = 3.
    - No replacement as `3 >= 2`.
  - Element at index `5` (value `5`):
    - Generate random index `j` = 1.
    - Since `1 < 2`, replace `R[1]` with `5`. Now `R` = `[3, 5]`.
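With these particular draws the final reservoir is `[3, 5]`. Other draws produce other results: each of the 10 possible pairs of elements from the list is equally likely to end up in `R`, which is what makes the sample uniformly random.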
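To check the trace by hand, the random draws can be replaced with the fixed values used in the example. The helper below is a hypothetical sketch (the name `replay_reservoir` and the `scripted_js` parameter are invented for this illustration); it applies the same replacement rule but reads each `j` from a list instead of a random number generator.

```python
from typing import List, Sequence


def replay_reservoir(
    items: Sequence[int], k: int, scripted_js: Sequence[int]
) -> List[int]:
    """Run the replacement rule with pre-chosen values of j (no randomness)."""
    reservoir = list(items[:k])
    draws = iter(scripted_js)
    for i, item in enumerate(items[k:], start=k + 1):  # i is the 1-based index
        j = next(draws)
        if not 0 <= j < i:
            raise ValueError(f"j={j} is outside 0..{i - 1} for index {i}")
        if j < k:
            reservoir[j] = item
    return reservoir


# The draws from the example: j = 0 for index 3, j = 3 for index 4, j = 1 for index 5.
print(replay_reservoir([1, 2, 3, 4, 5], k=2, scripted_js=[0, 3, 1]))  # [3, 5]
```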
Properties of Reservoir Sampling
- Uniformity: Each element of the input stream ends up in the reservoir with the same probability, `k/n`.
- Single Pass: Each element is examined exactly once, which makes the method efficient for large datasets or streams.
- Space Efficient: Only `k` elements need to be stored in memory, so the space used is `O(k)` regardless of the input size.
- Dynamic Population: Ideal for situations where the total number of items `n` is unknown or continually increasing.
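The uniformity property can be verified with a short calculation. The sketch below assumes the usual form of the algorithm, in which the element at 1-based position `i > k` enters the reservoir with probability `k/i` and, when it does, overwrites a slot chosen uniformly among the `k` slots. An element already in the reservoir is therefore evicted at step `m` with probability `(k/m) * (1/k) = 1/m`. For an element at position `i > k`:

$$
P(\text{element } i \text{ is in } R \text{ after step } n)
= \frac{k}{i} \prod_{m=i+1}^{n} \left(1 - \frac{1}{m}\right)
= \frac{k}{i} \prod_{m=i+1}^{n} \frac{m-1}{m}
= \frac{k}{i} \cdot \frac{i}{n}
= \frac{k}{n}
$$

For one of the first `k` elements, which starts in the reservoir with probability 1, the same telescoping product gives:

$$
\prod_{m=k+1}^{n} \frac{m-1}{m} = \frac{k}{n}
$$

Both cases give `k/n`, so every element is equally likely to be in the final sample.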
Variants and Considerations
- Weighted Reservoir Sampling: A variant in which each element carries a weight, so that elements with larger weights are more likely to be selected.
- Stream Updates: If new elements arrive after a sample has been taken, the reservoir can be updated with the same replacement rule, provided the count of elements seen so far has been kept.
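Both points can be illustrated with one small class. The sketch below uses one well-known approach to weighted sampling, the key-based scheme of Efraimidis and Spirakis (often called A-Res), which is not necessarily the variant the text has in mind: each item gets a key `u ** (1 / weight)` with `u` drawn uniformly from `[0, 1)`, and the `k` items with the largest keys are kept. The class name `WeightedReservoir` and its methods are hypothetical names chosen for the example. Because items are added one at a time, the reservoir can keep absorbing new elements whenever they arrive.

```python
import heapq
import itertools
import random
from typing import Any, List, Optional, Tuple


class WeightedReservoir:
    """Keep a weighted sample of up to k items from a stream (A-Res scheme)."""

    def __init__(self, k: int, rng: Optional[random.Random] = None) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self._rng = rng if rng is not None else random.Random()
        # Min-heap of (key, tiebreak, item); the smallest key sits at index 0.
        self._heap: List[Tuple[float, int, Any]] = []
        self._tiebreak = itertools.count()  # avoids comparing items on key ties

    def add(self, item: Any, weight: float) -> None:
        """Offer one item to the reservoir; larger weights favor selection."""
        if weight <= 0:
            raise ValueError("weight must be positive")
        key = self._rng.random() ** (1.0 / weight)
        entry = (key, next(self._tiebreak), item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif key > self._heap[0][0]:
            # The new key beats the smallest kept key, so replace that entry.
            heapq.heapreplace(self._heap, entry)

    def sample(self) -> List[Any]:
        """Return the items currently held (in no particular order)."""
        return [item for _, _, item in self._heap]


reservoir = WeightedReservoir(k=2)
for name, weight in [("a", 1.0), ("b", 5.0), ("c", 0.5), ("d", 10.0)]:
    reservoir.add(name, weight)
print(reservoir.sample())  # two names; "b" and "d" are the most likely picks
```

The heap keeps each update at `O(log k)` time, and the memory used stays at `O(k)` no matter how many items are offered.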
Key Applications of Reservoir Sampling
- Online Algorithm: It is well-suited for online environments where the data is continuously arriving.
- Database Management: Used to maintain row samples that support query optimization, and to draw random subsets of large tables efficiently.
- Machine Learning: Preprocessing step to generate training and testing datasets from massive data streams.
Conclusion
Reservoir sampling is a powerful technique for obtaining a uniform random sample from a stream of unknown or unbounded length while storing only `k` items. Its single-pass operation, coupled with unbiased selection, makes it a go-to method in various computer science and data processing applications.
Summary Table
Below is a summary of the key points of reservoir sampling:
| Aspect | Description |
| --- | --- |
| Purpose | Randomly sample `k` items from a stream of unknown size. |
| Initialization | Fill a reservoir `R` with the first `k` items. |
| Selection Process | For the `i`-th element (`i > k`), replace a uniformly chosen item in `R` with probability `k/i`. |
| Properties | Uniformity, single pass, space-efficient. |
| Space Complexity | `O(k)` |
| Applications | Online algorithms, database management, machine learning. |
Reservoir sampling remains a vital tool for data scientists and engineers handling large datasets, providing a robust solution for random sampling challenges.

