Get a random element in single direction linked list by one time traverse
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
In the realm of computer science, particularly data structures, a linked list offers a simplistic and dynamic alternative to arrays. In a singly-linked list, each node points to the next node and the list terminates at a node pointing to `null`. An intriguing problem involving singly-linked lists is selecting a random element while traversing the list only once. This article explores a method called Reservoir Sampling to achieve this.
Reservoir Sampling
Reservoir Sampling is an algorithm primarily used when dealing with a large or unknown data stream size. This method is particularly beneficial for selecting a random element from a singly-linked list without prior knowledge of its length, making it perfect for single-pass operations.
Algorithm Explanation
- Initialization: Begin by initializing a counter `n` to 1 and choosing the first node as your initial random element (let’s call this node `chosen`).
- Traversal: Iterate over the rest of the list. For each node at index `i` (where `i` starts from 2):
- Generate a random integer `r` such that `1 <= r <= i`.
- If `r` equals `i`, update `chosen` to the current node. Essentially, this step allows each node to have a fair probability of being selected as the random node.
- Termination: By the time the traversal has reached the end of the list, the node in `chosen` will be the desired random node. The probability that any node gets selected is `1/n`, where `n` is the length of the linked list.
Detailed Example
Consider a singly-linked list containing the nodes [A, B, C, D]. Implementing the Reservoir Sampling algorithm looks like this:
- Step 1: Start with `chosen = A` and `n = 1`.
- Step 2: Move to node B, generate `r` between `[1, 2]`. If `r` = 2, update `chosen = B`.
- Step 3: Move to node C, generate `r` between `[1, 3]`. If `r` = 3, update `chosen = C`.
- Step 4: Move to node D, generate `r` between `[1, 4]`. If `r` = 4, update `chosen = D`.
At the end of the traversal, `chosen` is a randomly selected node from the list.
Mathematical Justification
To understand why this method works, let's delve a bit into its probability model:
- For node `k` to be selected as the final element, it must replace the current element at round `k` and never be replaced afterwards.
- When `k` is encountered, it gets selected with probability `1/k`.
- As the list proceeds, it should not be replaced by each subsequent node `i` from `k+1` to `n`, giving it a cumulative probability: Which simplifies to:
This demonstrates that each node has an equal probability of being chosen, thereby justifying the algorithm's correctness.
Complexity Analysis
- Time Complexity: O(n), where `n` is the number of nodes in the list. This is due to the single traversal required over the linked list.
- Space Complexity: O(1), given that constant space is used, disregarding the input size.
Use Case Scenarios
- Stream Processing: Ideal for processing elements in streaming data as it requires minimal memory.
- Large Dataset Sampling: Efficient for data sampling from large datasets where the size is unknown or it is impractical to load all elements into memory.
- Distributed Systems: Useful in distributed computing environments where elements are distributed across multiple nodes.
Summary
The following table encapsulates the core aspects of using Reservoir Sampling for picking a random element from a singly-linked list:
| Aspect | Description |
| Purpose | Select a random element with a single traversal. |
| Algorithm Used | Reservoir Sampling |
| Space Complexity | O(1) |
| Time Complexity | O(n) |
| Ideal Scenarios | Stream processing, large datasets, distributed systems |
| Probability | Each node has a 1/n probability of being chosen,
where n is the list length. |
Reservoir Sampling provides an elegant and sophisticated solution to the problem of random selection from a singly linked list, ensuring efficiency both in time and computational resources. It has distinct advantages in scenarios involving large or streaming datasets.

