Reservoir sampling

Reservoir sampling is an algorithm used to randomly select a sample of `k` items from a population of unknown size `n` (where `n >= k`) in a single pass through the data. This is particularly advantageous in scenarios where the population size is dynamic or when it is impractical to store the entire population in memory.

How Reservoir Sampling Works

Algorithm

The reservoir sampling algorithm is typically implemented as follows for selecting `k` samples from a stream:

Initialize the Reservoir:
- Create an array or list `R` of size `k` and fill it with the first `k` items from the input stream.
Iterate Over Remaining Elements:
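- For each subsequent element at position `i` (counting from 1, so `i` runs from `k+1` to `n`):
  - Generate a random integer `j` uniformly between `0` and `i-1` (inclusive), so that there are exactly `i` possible values.
  - If `j` is less than `k`, replace `R[j]` with the current item from the stream. This happens with probability `k/i`.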
Result: The reservoir `R` will contain a random sample of `k` items.
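
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example, and positions are counted from 1 so that the draw `j` covers exactly `i` values.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items chosen uniformly at random from stream.

    If the stream holds fewer than k items, all of them are returned.
    """
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Draw j uniformly from 0 .. i-1; j < k happens with probability k/i.
            j = rng.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 5)
print(sample)  # five values from the range; they differ from run to run
```

Because the function only iterates over `stream` once and never asks for its length, it works equally well with a generator or a file object.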

Example

Suppose we want to sample `k = 2` elements from the sequence `[1, 2, 3, 4, 5]`:

Initialization:
- Initialize `R` = `[1, 2]`.
Process Remaining Elements:
- Element at position `3` (value `3`):
  - Generate random index `j` = 0 (drawn from `0..2`).
  - Since `0 < 2`, replace `R[0]` with `3`. Now `R` = `[3, 2]`.
- Element at position `4` (value `4`):
  - Generate random index `j` = 3 (drawn from `0..3`).
  - No replacement as `3 >= 2`. `R` stays `[3, 2]`.
- Element at position `5` (value `5`):
  - Generate random index `j` = 1 (drawn from `0..4`).
  - Since `1 < 2`, replace `R[1]` with `5`. Now `R` = `[3, 5]`.

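With different random draws, the final `R` could be any of the 10 possible two-element subsets of the list, and each subset is equally likely (probability 1/10). That is what it means for the sample to be uniformly random.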
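
The trace above can be replayed deterministically by feeding the same draws (`0`, `3`, `1`) into the replacement rule. The helper `replay_reservoir` and its `draws` parameter are hypothetical names introduced only for this sketch; it assumes the draw for the element at 1-based position `i` lies in `0..i-1`.

```python
from typing import Iterable, Iterator, List


def replay_reservoir(stream: Iterable[int], k: int, draws: Iterator[int]) -> List[int]:
    """Run the reservoir update using pre-recorded draws instead of a random generator."""
    reservoir: List[int] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            reservoir.append(item)
            continue
        j = next(draws)
        assert 0 <= j < i, "a draw for position i must lie in 0 .. i-1"
        if j < k:
            reservoir[j] = item
    return reservoir


result = replay_reservoir([1, 2, 3, 4, 5], k=2, draws=iter([0, 3, 1]))
print(result)  # [3, 5]
```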

Properties of Reservoir Sampling

Uniformity: Each element of the input stream ends up in the reservoir with the same probability, `k/n`.
Single Pass: Each element is examined exactly once, which makes the method practical for large datasets and streams.
Space Efficient: Only `k` elements need to be stored in memory, regardless of the input size.
Dynamic Population: Ideal for situations where the total number of items `n` is unknown or continually increasing.
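
The uniformity property can be checked with a short calculation. The following is a sketch, assuming the replacement rule described earlier: the item at 1-based position $i > k$ enters the reservoir with probability $k/i$ and, when it enters, evicts a uniformly chosen slot. The symbol $m$ (the position of the item being tracked) is introduced here only for illustration.

An item already in the reservoir survives the arrival of item $i$ with probability

$$1 - \frac{k}{i}\cdot\frac{1}{k} = \frac{i-1}{i}.$$

For one of the first $k$ items ($m \le k$), the probability of still being in the reservoir after all $n$ items is a telescoping product:

$$\prod_{i=k+1}^{n}\frac{i-1}{i} = \frac{k}{n}.$$

For a later item ($m > k$), it must first enter and then survive every subsequent arrival:

$$\frac{k}{m}\prod_{i=m+1}^{n}\frac{i-1}{i} = \frac{k}{m}\cdot\frac{m}{n} = \frac{k}{n}.$$

Both cases give $k/n$, independent of the item's position in the stream.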

Variants and Considerations

Weighted Reservoir Sampling: A variant in which each element carries a weight, so that more important elements are more likely to be selected.
Stream Updates: If new elements arrive after a sample has been taken, the reservoir can be updated by continuing the same replacement step, provided the count of items seen so far has been retained.
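
As an illustration of the weighted variant, the sketch below uses one well-known scheme, the A-Res algorithm of Efraimidis and Spirakis: each item receives the key `u ** (1 / weight)` with `u` uniform in `[0, 1)`, and the `k` items with the largest keys are kept. The text above does not say which weighted scheme it has in mind, so this choice, the function name `weighted_reservoir_sample`, and the handling of non-positive weights are all assumptions.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_reservoir_sample(
    stream: Iterable[Tuple[T, float]], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Sample up to k items without replacement, favouring larger weights (A-Res)."""
    if k <= 0:
        return []
    if rng is None:
        rng = random.Random()
    # Min-heap of (key, arrival_index, item); the smallest key sits at heap[0].
    # arrival_index breaks ties so that items themselves are never compared.
    heap: List[Tuple[float, int, T]] = []
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # assumption: items with non-positive weight are never selected
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # The new key beats the current smallest key, so evict that entry.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


picked = weighted_reservoir_sample([("a", 1.0), ("b", 2.0), ("c", 3.0)], k=2)
print(picked)  # two of the three labels; "c" is the most likely to appear
```

Like the unweighted version, this keeps only `k` entries in memory and makes a single pass over the stream.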

Key Applications of Reservoir Sampling

Online Algorithm: It is well-suited for online environments where the data is continuously arriving.
Database Management: Used for database query optimization and randomizing datasets efficiently.
Machine Learning: Preprocessing step to generate training and testing datasets from massive data streams.

Conclusion

Reservoir sampling obtains a uniform random sample from a stream of unknown or unbounded length while storing only `k` items. Because it needs just one pass and selects without bias, it is a common choice wherever data arrives continuously or is too large to hold in memory.

Summary Table

Below is a summary of the key points of reservoir sampling:

Aspect	Description
Purpose	Randomly sample `k` items from a stream of unknown size.
Initialization	Fill a reservoir `R` with the first `k` items.
Selection Process	For the `i`-th element (`i > k`), replace a uniformly chosen item in `R` with probability $k/i$.
Properties	Uniformity, single pass, space-efficient.
Space Complexity	$O(k)$
Applications	Online algorithms, database management, machine learning.

Reservoir sampling remains a vital tool for data scientists and engineers handling large datasets, providing a robust solution for random sampling challenges.