Introduction
MATLAB provides a rich set of functions for generating synthetic data for testing algorithms, simulating systems, and prototyping models before real data is available. The core functions are rand (uniform), randn (standard normal), and randi (uniformly distributed integers). Structured synthetic data, such as linear trends, sinusoidal signals, noisy datasets, clustered data, and time series, is typically built by combining these generators with deterministic functions like linspace and sin. This article covers the most common data generation patterns used in engineering, machine learning, and statistical analysis.
Random Number Generation
% Scalar random number between 0 and 1
x = rand;

% 5x3 matrix of uniform random numbers in (0, 1)
A = rand(5, 3);

% Random numbers in a custom range (a, b)
a = 10; b = 50;
data = a + (b - a) * rand(100, 1); % 100 values between 10 and 50

% Reproducible results with seed
rng(42); % Set seed
data1 = rand(5, 1);
rng(42); % Reset seed
data2 = rand(5, 1);
% isequal(data1, data2) returns true
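Reseeding is not the only way to replay a sequence. As a minimal sketch, the generator state returned by rng can be captured and restored later, which is useful when the seed was set somewhere else in the script. The variable names below are illustrative.

```matlab
% Capture the generator state, draw, then restore the state and redraw.
rng(42, 'twister');          % start from a known state
s = rng;                     % struct holding the current generator settings
first_draw = rand(5, 1);
rng(s);                      % restore the saved state
second_draw = rand(5, 1);
same = isequal(first_draw, second_draw);   % both draws replay the same sequence
```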
Normal (Gaussian) Distribution (randn)
% Standard normal: mean=0, std=1
data = randn(1000, 1);

% Custom mean and standard deviation
mu = 100;
sigma = 15;
iq_scores = mu + sigma * randn(1000, 1);

% Verify
fprintf('Mean: %.2f, Std: %.2f\n', mean(iq_scores), std(iq_scores));
% Mean: ~100.00, Std: ~15.00
Random Integers (randi)
% Random integers from 1 to 6
dice_rolls = randi(6, 1, 100); % 100 dice rolls

% Random integers in range [low, high]
ages = randi([18, 65], 50, 1); % 50 ages between 18 and 65

% Random binary data
bits = randi([0, 1], 1, 256); % 256 random bits
Structured Data Generation
Linear Data with Noise
% y = mx + b + noise
n = 100;
x = linspace(0, 10, n)';
slope = 2.5;
intercept = 3;
noise = randn(n, 1) * 1.5; % Gaussian noise, std=1.5

y = slope * x + intercept + noise;

plot(x, y, 'b.', x, slope*x + intercept, 'r-', 'LineWidth', 2);
legend('Noisy data', 'True line');
xlabel('x'); ylabel('y');
title('Linear Data with Gaussian Noise');
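One reason to generate data from a known line is that a fitting routine can be checked against the true parameters. The sketch below regenerates the same kind of data and fits a first-degree polynomial with polyfit. The seed value is arbitrary, and the fitted values will only approximate the true slope and intercept because of the noise.

```matlab
% Fit a line to synthetic data and compare with the known parameters.
rng(1, 'twister');                   % arbitrary seed for repeatability
n = 100;
x = linspace(0, 10, n)';
slope = 2.5;
intercept = 3;
y = slope * x + intercept + 1.5 * randn(n, 1);

p = polyfit(x, y, 1);                % p(1) = fitted slope, p(2) = fitted intercept
fprintf('Fitted slope: %.3f (true %.1f)\n', p(1), slope);
fprintf('Fitted intercept: %.3f (true %.1f)\n', p(2), intercept);
```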
Sinusoidal Signal with Noise
t = linspace(0, 2*pi, 500)';
frequency = 3; % Angular frequency: 3 full cycles over [0, 2*pi]
amplitude = 5;
clean_signal = amplitude * sin(frequency * t);
noisy_signal = clean_signal + randn(size(t)) * 0.8; % Gaussian noise, std=0.8

plot(t, noisy_signal, 'b', t, clean_signal, 'r', 'LineWidth', 1.5);
legend('Noisy', 'Clean');
title('Sinusoidal Signal with Noise');
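The noise level above is set by a fixed standard deviation. When a test calls for a particular signal-to-noise ratio instead, the noise standard deviation can be derived from the signal power. The following is a sketch under that assumption; target_snr_db is a hypothetical parameter, not something defined elsewhere in this article.

```matlab
% Scale Gaussian noise so the noisy signal has approximately a target SNR.
rng(7, 'twister');                                    % arbitrary seed
t = linspace(0, 2*pi, 500)';
clean_signal = 5 * sin(3 * t);

target_snr_db = 10;                                   % hypothetical target SNR in dB
signal_power = mean(clean_signal .^ 2);               % average power of the clean signal
noise_power = signal_power / 10^(target_snr_db / 10); % required noise variance
noise = sqrt(noise_power) * randn(size(t));
noisy_signal = clean_signal + noise;

% The achieved SNR differs slightly from the target because the noise is a finite sample.
achieved_snr_db = 10 * log10(signal_power / mean(noise .^ 2));
```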
Polynomial Data
n = 200;
x = linspace(-3, 3, n)';
% y = 0.5x^3 - 2x^2 + x + 3 + noise
y = 0.5*x.^3 - 2*x.^2 + x + 3 + randn(n, 1) * 2;

scatter(x, y, 10, 'filled');
title('Polynomial Data with Noise');
Classification Data (Clusters)
Two-Class Gaussian Clusters
n_per_class = 200;

% Class 1: centered at (2, 3), std=0.8
class1 = [2 + randn(n_per_class, 1) * 0.8, ...
          3 + randn(n_per_class, 1) * 0.8];

% Class 2: centered at (5, 6), std=1.0
class2 = [5 + randn(n_per_class, 1) * 1.0, ...
          6 + randn(n_per_class, 1) * 1.0];

X = [class1; class2];
labels = [ones(n_per_class, 1); 2*ones(n_per_class, 1)];

% gscatter requires the Statistics and Machine Learning Toolbox
gscatter(X(:,1), X(:,2), labels);
title('Two-Class Gaussian Data');
xlabel('Feature 1'); ylabel('Feature 2');
Multi-Class Clusters
rng(42);
n = 150; % Points per class
k = 3; % Number of classes

centers = [0 0; 4 4; 8 0]; % Cluster centers
spread = 1.2;

X = []; labels = [];
for i = 1:k
    % Adding a 1x2 row to an nx2 matrix relies on implicit expansion (R2016b or later)
    cluster = centers(i,:) + randn(n, 2) * spread;
    X = [X; cluster];
    labels = [labels; i * ones(n, 1)];
end

% gscatter requires the Statistics and Machine Learning Toolbox
gscatter(X(:,1), X(:,2), labels);
title(sprintf('%d-Class Clustered Data', k));
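The loop above grows X and labels on every iteration, which is fine for three small clusters but reallocates memory each time. One possible alternative, shown here as a sketch, builds the label vector first with repelem and uses it to index the matching center for every point. It produces data with the same structure but not the same random values as the loop.

```matlab
% Vectorized cluster generation: build labels first, then index the centers.
rng(42, 'twister');
n = 150;                                   % points per class
k = 3;                                     % number of classes
centers = [0 0; 4 4; 8 0];                 % one row per cluster center
spread = 1.2;

labels = repelem((1:k)', n);               % [1;1;...;2;2;...;3;3;...]
X = centers(labels, :) + spread * randn(k * n, 2);

class3_mean = mean(X(labels == 3, :), 1);  % should lie close to centers(3, :)
```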
Time Series Data
% Trend + seasonality + noise
n = 365;
t = (1:n)';

trend = 0.05 * t; % Linear trend
seasonality = 10 * sin(2*pi*t/365); % Annual cycle
weekly = 3 * sin(2*pi*t/7); % Weekly cycle
noise = randn(n, 1) * 2; % Random noise

time_series = 50 + trend + seasonality + weekly + noise;

plot(t, time_series);
xlabel('Day'); ylabel('Value');
title('Synthetic Time Series (Trend + Seasonality + Noise)');
Autoregressive (AR) Process
n = 500;
ar_coeff = 0.7;
data = zeros(n, 1);
data(1) = randn;

for i = 2:n
    data(i) = ar_coeff * data(i-1) + randn;
end

plot(data);
title(sprintf('AR(1) Process, coefficient = %.1f', ar_coeff));
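The recursion in the loop is a first-order IIR filter applied to white noise, so the same process can be written with the built-in filter function. The sketch below draws the noise once into a vector e (an illustrative name) and checks that the filter output matches an explicit loop over the same noise.

```matlab
% AR(1) via filter: y(i) = ar_coeff * y(i-1) + e(i), with y(1) = e(1).
rng(3, 'twister');
n = 500;
ar_coeff = 0.7;
e = randn(n, 1);                               % white-noise innovations

data_filter = filter(1, [1, -ar_coeff], e);    % denominator [1, -ar_coeff] encodes the recursion

% Explicit loop over the same innovations, for comparison
data_loop = zeros(n, 1);
data_loop(1) = e(1);
for i = 2:n
    data_loop(i) = ar_coeff * data_loop(i-1) + e(i);
end

max_diff = max(abs(data_filter - data_loop));  % agreement up to floating-point rounding
```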
Specific Distributions
% Exponential distribution via inverse transform sampling (mean = 1/lambda)
lambda = 0.5;
exp_data = -log(rand(1000, 1)) / lambda;

% poissrnd, chi2rnd, and mvnrnd require the Statistics and Machine Learning Toolbox

% Poisson distribution
poisson_data = poissrnd(5, 1000, 1); % Mean = 5

% Chi-squared distribution
chi2_data = chi2rnd(3, 1000, 1); % 3 degrees of freedom

% Multivariate normal
mu = [1, 2];
sigma = [1 0.5; 0.5 2]; % Covariance matrix
mv_data = mvnrnd(mu, sigma, 500);

scatter(mv_data(:,1), mv_data(:,2), 10, 'filled');
title('Multivariate Normal Data');
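The exponential sampler above uses only rand and log: if U is uniform on (0, 1), then -log(U)/lambda follows an exponential distribution with rate lambda. A quick sanity check, sketched here with a larger sample than in the original listing, compares the sample mean and variance with the theoretical values 1/lambda and 1/lambda^2.

```matlab
% Compare sample moments of the inverse-transform sampler with theory.
rng(5, 'twister');                     % arbitrary seed
lambda = 0.5;
exp_data = -log(rand(100000, 1)) / lambda;

sample_mean = mean(exp_data);          % theory: 1/lambda = 2
sample_var = var(exp_data);            % theory: 1/lambda^2 = 4
fprintf('Mean: %.3f (theory %.3f)\n', sample_mean, 1/lambda);
fprintf('Variance: %.3f (theory %.3f)\n', sample_var, 1/lambda^2);
```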
Common Pitfalls
Not setting the random seed for reproducibility: Without rng(seed), repeated runs in the same session generate different data, which makes bugs hard to reproduce. Set a seed at the start of your script whenever you need reproducible results.
Confusing rand and randn: rand generates uniform numbers in the open interval (0, 1). randn generates standard normal numbers (mean 0, std 1). Using rand when you need Gaussian data (or vice versa) produces the wrong distribution.
Wrong matrix dimensions with rand(n): rand(n) creates an n-by-n matrix, not an n-by-1 vector. For a column vector, use rand(n, 1). For a row vector, use rand(1, n).
Expecting correlation between independently generated features: [rand(n,1), rand(n,1)] creates independent features, so any correlation in the sample is due to chance. If you need correlated features, use mvnrnd with a specified covariance matrix, or transform independent normal draws with the Cholesky factor of the target covariance.
Relying on rng(seed) alone to fix the generator: rng(42) seeds whichever generator is currently active. That is the Mersenne Twister by default, but earlier code in the session or a different MATLAB configuration may have selected another generator. Specify the generator explicitly with rng(42, 'twister') for reproducibility across sessions and versions.
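The Cholesky approach mentioned in the pitfalls needs only base MATLAB. As a minimal sketch, independent standard normal columns are multiplied by the upper triangular factor R of the target covariance, where R' * R equals that covariance. The names Sigma, R, and Z are illustrative, and the sample covariance will only approximate the target.

```matlab
% Correlated Gaussian features from independent draws via Cholesky factorization.
rng(42, 'twister');
n = 100000;
mu = [1, 2];
Sigma = [1 0.5; 0.5 2];            % target covariance matrix (must be positive definite)

R = chol(Sigma);                   % upper triangular factor, R' * R == Sigma
Z = randn(n, 2);                   % independent standard normal columns
X = repmat(mu, n, 1) + Z * R;      % each row is distributed as N(mu, Sigma)

sample_cov = cov(X);               % should be close to Sigma for large n
```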
Summary
Use rand, randn, and randi for uniform, Gaussian, and integer random data
Scale and shift with a + (b-a)*rand(n,1) for custom ranges or mu + sigma*randn(n,1) for custom normal distributions
Generate structured data by combining linear/polynomial/sinusoidal functions with noise
Use mvnrnd for multivariate normal data with a specified covariance matrix
Always set rng(seed) for reproducible synthetic data generation