How to preprocess audio data for input into a Neural Network

Introduction

Neural networks rarely learn useful patterns straight from raw audio files; the signal first needs to be converted into a stable numeric representation. Good preprocessing makes samples comparable, controls noise, and gives the model fixed-size inputs that can be batched efficiently.

The exact pipeline depends on the task, but most projects follow the same pattern: standardize the waveform, extract time-frequency features, and reshape them into a consistent tensor. That is the part worth getting right before tuning the model itself.

Start with a Consistent Waveform

Audio datasets often mix sample rates, channel counts, and recording levels. If you feed that directly into a network, the model spends capacity learning recording quirks instead of the target task.

A strong baseline is:

  • resample every clip to one sample rate, often 16000 Hz for speech,
  • convert stereo to mono unless spatial information matters,
  • trim long silent regions,
  • normalize amplitude so clips have comparable scale.

Here is a minimal preprocessing function using librosa:

python
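import numpy as np
import librosa


TARGET_SR = 16000


def load_waveform(path):
    # Resample to TARGET_SR and downmix to mono while loading.
    signal, _ = librosa.load(path, sr=TARGET_SR, mono=True)

    # Peak-normalize to [-1, 1]; skip all-zero clips to avoid dividing by zero.
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak

    # Trim leading and trailing audio more than 25 dB below the peak.
    signal, _ = librosa.effects.trim(signal, top_db=25)
    return signal.astype(np.float32)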

Peak normalization is easy and often sufficient for a first model. For some tasks, dataset-level normalization is better, but only if you compute statistics from the training split alone.
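As a rough sketch of dataset-level normalization, the statistics can be fitted on the training arrays and then reused unchanged for validation and test data. The helper names `fit_normalizer` and `apply_normalizer` are hypothetical, and the random arrays below only stand in for real waveforms or features.

```python
import numpy as np


def fit_normalizer(train_arrays):
    """Compute one global mean and std from training arrays only."""
    values = np.concatenate([a.ravel() for a in train_arrays])
    mean = float(values.mean())
    std = float(values.std())
    # Guard against a zero std (for example, an all-constant training set).
    return mean, max(std, 1e-8)


def apply_normalizer(array, mean, std):
    """Standardize any split with the statistics fitted on training data."""
    return ((array - mean) / std).astype(np.float32)


# Synthetic stand-ins for real data (an assumption for illustration).
rng = np.random.default_rng(0)
train_arrays = [rng.normal(3.0, 2.0, size=(64, 100)) for _ in range(4)]
val_array = rng.normal(3.0, 2.0, size=(64, 100))

mean, std = fit_normalizer(train_arrays)
val_normalized = apply_normalizer(val_array, mean, std)
```

The validation array never contributes to `mean` or `std`, which is what keeps the evaluation free of leakage.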

Convert Audio into Features

Many neural audio models do not consume the raw waveform directly. A log-mel spectrogram is a common choice because it preserves how frequency content changes over time while reducing dimensionality.

python
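import numpy as np
import librosa


N_FFT = 1024       # analysis window: 64 ms at 16 kHz
HOP_LENGTH = 256   # step between frames: 16 ms at 16 kHz
N_MELS = 64        # number of mel frequency bands


def waveform_to_log_mel(signal, sr=16000):
    # Power mel spectrogram with shape (N_MELS, frames).
    mel = librosa.feature.melspectrogram(
        y=signal,
        sr=sr,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
        power=2.0,
    )
    # Convert to decibels relative to the clip's own maximum, so 0 dB is the peak.
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.astype(np.float32)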

For speech commands, speaker identification, and environmental sound classification, log-mel features are often a practical default. If you are solving a waveform-native task with a one-dimensional convolutional model, you might skip this step, but feature extraction remains the simpler and more data-efficient starting point.

Make Every Sample the Same Shape

Batches require a fixed tensor shape. Spectrograms vary along the time dimension because recordings have different durations, so pad short clips and crop long clips to the same number of frames.

python
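import numpy as np


TARGET_FRAMES = 256


def fix_length(feature_matrix):
    # feature_matrix has shape (n_mels, frames).
    frames = feature_matrix.shape[1]

    if frames < TARGET_FRAMES:
        pad_amount = TARGET_FRAMES - frames
        # Pad with the clip's quietest value. The features are in dB with 0 as
        # the maximum, so zero-padding would look like full-volume sound.
        feature_matrix = np.pad(
            feature_matrix,
            ((0, 0), (0, pad_amount)),
            mode="constant",
            constant_values=feature_matrix.min(),
        )
    else:
        feature_matrix = feature_matrix[:, :TARGET_FRAMES]

    # Append a trailing channel axis: (n_mels, TARGET_FRAMES, 1).
    return feature_matrix[..., np.newaxis]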

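That final channel dimension is convenient for convolutional models because it turns the array into a shape such as (64, 256, 1), that is, (mel bands, frames, channels). Frameworks that expect channels first, such as PyTorch, need that axis moved to the front instead.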
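When choosing a target frame count, it helps to relate frames to clip duration. The sketch below assumes librosa's default centered framing, where a clip of n samples yields 1 + n // hop_length frames; the helper names are hypothetical.

```python
def frames_for_duration(seconds, sr=16000, hop_length=256):
    """Frames produced for a clip of the given length (centered framing assumed)."""
    n_samples = int(round(seconds * sr))
    return 1 + n_samples // hop_length


def duration_for_frames(frames, sr=16000, hop_length=256):
    """Approximate clip length in seconds covered by a given frame count."""
    return (frames - 1) * hop_length / sr


print(frames_for_duration(1.0))  # 63
print(duration_for_frames(256))  # 4.08
```

With a hop of 256 samples at 16000 Hz, a 256-frame target therefore covers roughly four seconds of audio; clips shorter than that are padded and longer ones are cropped.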

A complete preprocessing function then becomes:

python
def preprocess_audio(path):
    signal = load_waveform(path)
    features = waveform_to_log_mel(signal, sr=TARGET_SR)
    tensor = fix_length(features)
    return tensor.astype(np.float32)
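Because every preprocessed sample now has the same shape, a batch is just a stack along a new leading axis. The following is a minimal sketch: `stack_batch` is a hypothetical helper, and zero-filled arrays stand in for the output of the preprocessing function.

```python
import numpy as np


def stack_batch(tensors):
    """Stack equally shaped feature tensors into one batch array."""
    shapes = {t.shape for t in tensors}
    if len(shapes) != 1:
        raise ValueError(f"Cannot batch mixed shapes: {sorted(shapes)}")
    return np.stack(tensors, axis=0)


# Stand-in tensors with the (mel bands, frames, channels) layout.
fake_clips = [np.zeros((64, 256, 1), dtype=np.float32) for _ in range(8)]
batch = stack_batch(fake_clips)
print(batch.shape)  # (8, 64, 256, 1)
```

The explicit shape check turns a silent preprocessing mistake into an immediate error rather than a confusing failure inside the data loader.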

Apply Augmentation Only During Training

Once the baseline pipeline is stable, augmentation can improve robustness. Common options include additive noise, random gain, small time shifts, and time-frequency masking. These techniques help the model generalize, but they should be applied only to training data.

If you augment validation or test data, you distort the evaluation and make model comparisons unreliable. Keep the evaluation path deterministic and reserve randomness for the training pipeline.
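One way to keep the two paths separate is a single entry point with a training flag. This is a sketch under assumptions: the function names and mask sizes are illustrative, and the input is a 2-D log-mel array of shape (mel bands, frames) before any channel axis is added.

```python
import numpy as np


def augment_log_mel(features, rng, max_shift=8, max_mask=8):
    """Random time shift, then one frequency mask and one time mask."""
    n_mels, n_frames = features.shape
    fill = features.min()  # treat the quietest value as "silence"

    # Shift along time and overwrite the frames that wrapped around.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(features, shift, axis=1)
    if shift > 0:
        out[:, :shift] = fill
    elif shift < 0:
        out[:, shift:] = fill

    # Mask a random band of mel bins.
    f_width = int(rng.integers(0, max_mask + 1))
    f_start = int(rng.integers(0, n_mels - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Mask a random span of frames.
    t_width = int(rng.integers(0, max_mask + 1))
    t_start = int(rng.integers(0, n_frames - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out


def prepare_features(features, training, rng=None):
    """Augment only on the training path; evaluation stays deterministic."""
    if not training:
        return features
    if rng is None:
        rng = np.random.default_rng()
    return augment_log_mel(features, rng)
```

Passing a seeded generator makes training runs reproducible, while validation and test code simply call `prepare_features(features, training=False)` and get the input back unchanged.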

Common Pitfalls

  • Mixing sample rates within the same dataset. Resample early so every clip uses the same temporal scale.
  • Keeping stereo when the task does not need it. Extra channels increase complexity without adding useful signal.
  • Normalizing with statistics from the full dataset. That leaks information from validation or test splits into training.
  • Forgetting fixed shapes before batching. Variable-length spectrograms break ordinary mini-batch loaders.
  • Applying augmentation during evaluation. That makes accuracy unstable and hard to interpret.

Summary

  • Standardize raw audio first by resampling, converting channels, trimming silence, and normalizing.
  • Log-mel spectrograms are a strong default feature representation for many audio tasks.
  • Pad or crop features to a fixed frame count before batching.
  • Keep augmentation in the training path only.
  • A simple, consistent preprocessing pipeline usually helps more than a more complicated model trained on inconsistent inputs.
