C++
random numbers
sorting algorithms
uniform distribution
programming efficiency

How can I generate sorted uniformly distributed random numbers efficiently in C?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Generating a large sequence of sorted uniform random numbers is a common task in simulations, Monte Carlo methods, and statistical sampling. The naive approach of generating all values and then sorting them works, but it costs O(n log n) time for the sort step. When n is in the millions, that sorting overhead becomes a bottleneck. There is a mathematically elegant O(n) technique based on exponential spacings that produces sorted uniform random numbers directly, skipping the sort entirely.

The Naive Approach: Generate Then Sort

The straightforward method generates n uniform random numbers and sorts them afterward:

c
1#include <stdio.h>
2#include <stdlib.h>
3#include <time.h>
4
5int compare_double(const void *a, const void *b) {
6    double da = *(const double *)a;
7    double db = *(const double *)b;
8    if (da < db) return -1;
9    if (da > db) return 1;
10    return 0;
11}
12
13void generate_sorted_naive(double *out, int n) {
14    for (int i = 0; i < n; i++) {
15        out[i] = (double)rand() / RAND_MAX;
16    }
17    qsort(out, n, sizeof(double), compare_double);
18}

This runs in O(n log n) due to the sort. For small n (under a few thousand), this is perfectly fine. But for large n, we can do better.

The Exponential Spacings Method: O(n) Generation

This technique exploits a property of order statistics. If you take n uniform random variables on [0, 1] and sort them, the gaps between consecutive values follow a specific distribution. Specifically, the gaps are proportional to exponential random variables.

The algorithm works as follows. Generate n + 1 exponential random variables, compute their cumulative sum, then divide each partial sum by the total. The result is n sorted uniform values on [0, 1]:

c
1#include <stdio.h>
2#include <stdlib.h>
3#include <math.h>
4
5void generate_sorted_exponential(double *out, int n) {
6    // Generate n+1 exponential random variables (spacings)
7    double *spacings = malloc((n + 1) * sizeof(double));
8    double total = 0.0;
9
10    for (int i = 0; i <= n; i++) {
11        // Exponential variate via inverse transform: -ln(U)
12        double u = ((double)rand() + 1.0) / ((double)RAND_MAX + 1.0);
13        spacings[i] = -log(u);
14        total += spacings[i];
15    }
16
17    // Cumulative sum normalized by total gives sorted uniforms
18    double cumsum = 0.0;
19    for (int i = 0; i < n; i++) {
20        cumsum += spacings[i];
21        out[i] = cumsum / total;
22    }
23
24    free(spacings);
25}

This runs in O(n) time with a single pass to generate spacings and a single pass to compute the cumulative sums. No sorting is needed because the cumulative sum is monotonically increasing by construction.

Why This Works: The Mathematical Intuition

Consider n + 1 points placed uniformly on a circle of circumference 1. The arc lengths between consecutive points are symmetric Dirichlet-distributed, which is equivalent to normalized exponential spacings. When you unwrap the circle into a line segment [0, 1], the cumulative spacings give you the sorted order statistics of a uniform distribution.

The key insight is that generating exponential variates and normalizing them is mathematically identical to generating uniform variates and sorting them, but computationally cheaper.

C++ Version with the Standard Random Library

For C++ projects, use the standard library's random facilities for better quality random numbers:

cpp
1#include <vector>
2#include <random>
3#include <numeric>
4#include <cmath>
5
6std::vector<double> generate_sorted_uniform(int n, unsigned seed = 42) {
7    std::mt19937 rng(seed);
8    std::exponential_distribution<double> exp_dist(1.0);
9
10    std::vector<double> spacings(n + 1);
11    for (auto &s : spacings) {
12        s = exp_dist(rng);
13    }
14
15    double total = std::accumulate(spacings.begin(), spacings.end(), 0.0);
16
17    std::vector<double> result(n);
18    double cumsum = 0.0;
19    for (int i = 0; i < n; i++) {
20        cumsum += spacings[i];
21        result[i] = cumsum / total;
22    }
23
24    return result;
25}

Using std::exponential_distribution directly is cleaner and more numerically robust than manually computing -log(U), since it handles edge cases around zero properly.

Benchmark Comparison

Here is a simple benchmark to compare the two approaches:

cpp
1#include <chrono>
2#include <algorithm>
3#include <iostream>
4
5void benchmark(int n) {
6    std::vector<double> data(n);
7    std::mt19937 rng(42);
8    std::uniform_real_distribution<double> uniform(0.0, 1.0);
9    std::exponential_distribution<double> exp_dist(1.0);
10
11    // Naive: generate + sort
12    auto t1 = std::chrono::high_resolution_clock::now();
13    for (auto &x : data) x = uniform(rng);
14    std::sort(data.begin(), data.end());
15    auto t2 = std::chrono::high_resolution_clock::now();
16
17    // Exponential spacings
18    auto t3 = std::chrono::high_resolution_clock::now();
19    std::vector<double> spacings(n + 1);
20    double total = 0.0;
21    for (auto &s : spacings) { s = exp_dist(rng); total += s; }
22    double cumsum = 0.0;
23    for (int i = 0; i < n; i++) {
24        cumsum += spacings[i];
25        data[i] = cumsum / total;
26    }
27    auto t4 = std::chrono::high_resolution_clock::now();
28
29    auto naive_ms = std::chrono::duration<double, std::milli>(t2 - t1).count();
30    auto fast_ms = std::chrono::duration<double, std::milli>(t4 - t3).count();
31
32    std::cout << "n=" << n
33              << "  naive=" << naive_ms << "ms"
34              << "  exponential=" << fast_ms << "ms"
35              << "  speedup=" << naive_ms / fast_ms << "x\n";
36}

Typical results on modern hardware show that for n = 10 million, the exponential method is roughly 3 to 5 times faster than generate-then-sort, and the gap widens as n increases because O(n) versus O(n log n) diverges.

Memory Optimization

The basic exponential method requires O(n) extra memory for the spacings array. You can reduce this to O(1) extra memory with a two-pass approach. First compute the total, then regenerate the spacings with the same seed:

cpp
1std::vector<double> generate_sorted_low_memory(int n, unsigned seed = 42) {
2    std::exponential_distribution<double> exp_dist(1.0);
3
4    // First pass: compute total
5    std::mt19937 rng1(seed);
6    double total = 0.0;
7    for (int i = 0; i <= n; i++) total += exp_dist(rng1);
8
9    // Second pass: generate cumulative values using same seed
10    std::mt19937 rng2(seed);
11    std::vector<double> result(n);
12    double cumsum = 0.0;
13    for (int i = 0; i < n; i++) {
14        cumsum += exp_dist(rng2);
15        result[i] = cumsum / total;
16    }
17
18    return result;
19}

This trades one extra pass over the RNG for eliminating the spacings array, which matters when n is so large that memory is the constraint.

Common Pitfalls

  • Using rand() for serious work: The C rand() function has poor statistical properties and limited range (often only 15 bits). Use std::mt19937 or another quality generator for simulations.
  • Forgetting the +1 in spacings count: You need n + 1 exponential variates to produce n sorted uniforms. Using exactly n spacings produces values that do not cover [0, 1] correctly.
  • Division by zero when U is 0: Computing -log(U) where U is rand() / RAND_MAX can produce infinity when rand() returns 0. Always shift the range slightly or use a library distribution that handles this.
  • Assuming uniform quality for the sorted output: The exponential method produces mathematically exact uniform order statistics, but the quality depends on your underlying RNG. A poor generator produces poor sorted uniforms regardless of the method.
  • Ignoring numerical precision for very large n: When n exceeds roughly 10 to 100 million, accumulated floating-point rounding in the cumulative sum can degrade the uniformity of the last few values. Using compensated (Kahan) summation mitigates this.

Summary

  • The naive generate-then-sort approach runs in O(n log n) and is fine for small n.
  • The exponential spacings method generates sorted uniform random numbers in O(n) by exploiting the relationship between exponential variates and uniform order statistics.
  • Use std::exponential_distribution in C++ for numerically robust exponential variate generation.
  • A two-pass approach can reduce extra memory usage from O(n) to O(1) at the cost of regenerating the random sequence.
  • For large n (millions or more), the exponential method is 3 to 5 times faster and the advantage grows with n.
  • Always use a quality random number generator like Mersenne Twister rather than rand() for any serious numerical work.

Course illustration
Course illustration

All Rights Reserved.