Select an element from a stream with uniform distributed probability
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Problem Statement
When dealing with data streams, a common challenge is selecting an element with a uniformly distributed probability. Imagine a scenario where data is being fed to you in real-time, and you need to select one element such that every element seen so far has the same chance of being chosen. This is an example of a streaming algorithm task and is colloquially known as "Reservoir Sampling".
Reservoir Sampling
Reservoir Sampling offers a solution to selecting an element from a stream with uniform probability without knowing the total number of elements in advance. The simplest form of the algorithm, known as Reservoir Sampling with a reservoir of size 1, allows for selecting one item among `n` items with each having a probability.
Algorithm Explanation
- Initialize an empty reservoir. For simplicity, assume the reservoir can hold only one item.
- As each element comes in from the stream, do the following:
- If the reservoir is empty, place the current element in the reservoir.
- If the reservoir already is occupied, replace the set element with the probability of , where `n` is the index position of the current element plus one.
Pseudocode
Here is a simple pseudocode representation for selecting an element from a stream with uniform probability:
- Step 1: `n=1`, Element = 3. Since `n=1`, place 3 in the reservoir. \
- Step 2: `n=2`, Element = 7. Draw a random number between 1 and 2. If it equals 2, replace 3 with 7. Otherwise, keep 3. \
- Step 3: `n=3`, Element = 5. Draw a random number between 1 and 3. Replace the reservoir if the number equals 3. \
- Step 4 & 5: Continue with the same logic.
- Space Complexity: since only a constant amount of space is required.
- Time Complexity: where `n` is the number of elements processed because each element is considered once.
- Random Sampling: Useful in data analytics for estimating statistics with a subset.
- Load Balancing: Random selection might be preferred to uniformly distribute load.
- Online Video Platforms: For randomized selection of videos or clips to display to users.

