Algorithm for sampling without replacement?

algorithm

sampling

probability

statistics

data science

Algorithm for sampling without replacement?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Sampling without replacement means selecting unique items so no item appears twice in one sample. This is essential in statistics, randomized testing, and machine learning data splits. The right algorithm depends on data size, whether data is stored in memory, and whether input arrives as a stream.

Problem Setup

Given population size N and desired sample size k, choose k distinct elements uniformly at random.

Important constraints:

k cannot exceed N
each valid subset of size k should have equal probability

A correct algorithm must satisfy uniqueness and uniformity simultaneously.

Method 1: Shuffle and Slice

If full dataset fits memory, shuffle once and take first k elements.

python

1import random
2from typing import List
3
4
5def sample_shuffle(data: List[int], k: int, seed: int = 0) -> List[int]:
6    if k < 0 or k > len(data):
7        raise ValueError("k must be between 0 and len(data)")
8
9    rng = random.Random(seed)
10    arr = data.copy()
11    rng.shuffle(arr)
12    return arr[:k]
13
14
15print(sample_shuffle(list(range(10)), 4, seed=42))

This is simple and reliable, with O of N time and O of N memory for copied array.

Method 2: Partial Fisher-Yates

When k is much smaller than N, perform only first k swaps of Fisher-Yates.

python

1import random
2from typing import List
3
4
5def sample_partial_fisher_yates(data: List[int], k: int, seed: int = 0) -> List[int]:
6    if k < 0 or k > len(data):
7        raise ValueError("invalid k")
8
9    rng = random.Random(seed)
10    arr = data.copy()
11    n = len(arr)
12
13    for i in range(k):
14        j = rng.randrange(i, n)
15        arr[i], arr[j] = arr[j], arr[i]
16
17    return arr[:k]
18
19
20print(sample_partial_fisher_yates(list(range(20)), 5, seed=7))

This still gives uniform samples and can reduce work when k is small relative to N.

Method 3: Reservoir Sampling for Streams

If data arrives as stream or cannot fit memory, use reservoir sampling.

python

1import random
2from typing import Iterable, List
3
4
5def reservoir_sample(stream: Iterable[int], k: int, seed: int = 0) -> List[int]:
6    if k < 0:
7        raise ValueError("k must be non-negative")
8
9    rng = random.Random(seed)
10    reservoir: List[int] = []
11
12    for i, item in enumerate(stream):
13        if i < k:
14            reservoir.append(item)
15        else:
16            j = rng.randint(0, i)
17            if j < k:
18                reservoir[j] = item
19
20    if len(reservoir) < k:
21        raise ValueError("stream shorter than k")
22
23    return reservoir
24
25
26print(reservoir_sample(range(1000), 10, seed=21))

This uses O of k memory and O of N single-pass time.

Method 4: Index Sampling Then Lookup

For large immutable arrays where copying values is expensive, sample unique indices first, then fetch values.

python

1import random
2
3
4def sample_indices(n: int, k: int, seed: int = 0):
5    if k < 0 or k > n:
6        raise ValueError("invalid k")
7    rng = random.Random(seed)
8    return rng.sample(range(n), k)
9
10
11idx = sample_indices(1_000_000, 5, seed=1)
12print(idx)

This pattern is practical when data lives in external storage and random index access is available.

Uniformity and Validation

A correct sampler should not bias certain elements or subsets.

Practical validation checks:

run many trials and compare frequency counts
confirm no duplicates in each sample
test edge cases where k is zero, one, and N

For high-stakes simulation workloads, include statistical tests in quality pipelines.

Performance Tradeoffs

Choose algorithm by context:

full in-memory list and moderate size: shuffle or partial Fisher-Yates
huge stream: reservoir
expensive element objects: sample indices

Do not choose algorithm only by popularity. Match it to memory and access constraints.

Reproducibility Requirements

For experiments and tests, controlled randomness is essential. Pass explicit seeds and isolate random generator state.

python

rng = random.Random(123)
print(rng.sample(range(50), 6))

Avoid mixing global random state across unrelated modules when reproducibility matters.

Common Pitfalls

A common pitfall is repeatedly drawing random elements and retrying duplicates, which can become inefficient and biased if implemented incorrectly. Another is forgetting that stream sampling needs reservoir logic, not list shuffle. Teams also often ignore reproducibility and then fail to reproduce model-training results. Finally, not validating k boundaries can cause silent logic errors in downstream workflows.

Summary

Sampling without replacement requires uniqueness and uniform randomness.
Shuffle-based approaches are simplest for in-memory datasets.
Reservoir sampling is the standard for streaming or huge inputs.
Seeded randomness improves reproducibility and debugging.
Pick algorithm based on data size, access pattern, and memory budget.