Data Streaming
Random Selection
Algorithm
Probability
Uniform Distribution

Select an element from a stream with uniform distributed probability

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Problem Statement

When dealing with data streams, a common challenge is selecting an element with a uniformly distributed probability. Imagine a scenario where data is being fed to you in real-time, and you need to select one element such that every element seen so far has the same chance of being chosen. This is an example of a streaming algorithm task and is colloquially known as "Reservoir Sampling".

Reservoir Sampling

Reservoir Sampling offers a solution to selecting an element from a stream with uniform probability without knowing the total number of elements in advance. The simplest form of the algorithm, known as Reservoir Sampling with a reservoir of size 1, allows for selecting one item among `n` items with each having a 1/n1/n probability.

Algorithm Explanation

  1. Initialize an empty reservoir. For simplicity, assume the reservoir can hold only one item.
  2. As each element comes in from the stream, do the following:
    • If the reservoir is empty, place the current element in the reservoir.
    • If the reservoir already is occupied, replace the set element with the probability of 1/n1/n, where `n` is the index position of the current element plus one.

Pseudocode

Here is a simple pseudocode representation for selecting an element from a stream with uniform probability:

  • Step 1: `n=1`, Element = 3. Since `n=1`, place 3 in the reservoir. \
  • Step 2: `n=2`, Element = 7. Draw a random number between 1 and 2. If it equals 2, replace 3 with 7. Otherwise, keep 3. \
  • Step 3: `n=3`, Element = 5. Draw a random number between 1 and 3. Replace the reservoir if the number equals 3. \
  • Step 4 & 5: Continue with the same logic.
  • Space Complexity: O(1)O(1) since only a constant amount of space is required.
  • Time Complexity: O(n)O(n) where `n` is the number of elements processed because each element is considered once.
  • Random Sampling: Useful in data analytics for estimating statistics with a subset.
  • Load Balancing: Random selection might be preferred to uniformly distribute load.
  • Online Video Platforms: For randomized selection of videos or clips to display to users.

Course illustration
Course illustration

All Rights Reserved.