Algorithms for determining the key of an audio sample

Introduction

Determining the musical key of an audio sample is a pattern-recognition problem built on pitch distribution, harmony, and time. The simplest workable systems compare the pitch content of the audio to major and minor key templates, while more advanced systems use harmonic pitch class profiles, temporal smoothing, and machine-learning models.

Start with Pitch-Class Features

Most key-detection algorithms do not work directly on raw waveform samples. They first transform the audio into a representation that says how strongly each of the twelve pitch classes is present over time.

Common features include:

chroma vectors
harmonic pitch class profiles, often called HPCP
spectrogram-derived harmonic summaries

A chroma representation collapses pitches across octaves, so C2, C3, and C4 all contribute to the pitch class C. That is useful because musical key depends more on pitch class relationships than exact octave placement.
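
To make the octave folding concrete, here is a minimal sketch that maps a frequency to its pitch class. The helper name `pitch_class_of` and the 440 Hz tuning reference are assumptions for illustration; a real chroma extractor folds whole spectra rather than single frequencies.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_of(freq_hz, a4_hz=440.0):
    """Return the pitch class (0 = C, ..., 11 = B) nearest to a frequency."""
    # MIDI note number: A4 is note 69, and each semitone adds 1.
    midi = 69 + 12 * np.log2(freq_hz / a4_hz)
    # Taking the note number modulo 12 discards the octave.
    return int(np.round(midi)) % 12

# C2, C3, and C4 are octaves apart but share one pitch class.
for name, freq in [("C2", 65.41), ("C3", 130.81), ("C4", 261.63), ("A4", 440.0)]:
    pc = pitch_class_of(freq)
    print(f"{name}: {freq:7.2f} Hz -> pitch class {pc} ({NOTE_NAMES[pc]})")
```

All three C notes land in pitch class 0, which is the folding a chroma vector applies to every frequency bin.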

With librosa, a minimal chroma extraction looks like this:

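```python
import librosa
import numpy as np

# Load the audio as mono; librosa resamples to 22,050 Hz by default.
y, sr = librosa.load("example.wav", mono=True)

# Constant-Q chroma: a (12, n_frames) array of pitch-class energy per frame.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)

# Average over time to get one 12-element profile for the whole sample.
profile = np.mean(chroma, axis=1)
print(profile.shape)  # (12,)
```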

The resulting profile is a 12-element summary of pitch-class energy across the sample.

Use Template Matching as the Baseline Algorithm

A classic key-detection method compares the pitch-class profile with predefined major and minor key templates. The best-known version is the Krumhansl-Schmuckler key-finding algorithm, which correlates the profile with empirically derived major and minor key profiles.

The idea is simple:

build a 12-bin pitch profile from the audio
rotate a major template through all 12 tonics
rotate a minor template through all 12 tonics
score each candidate with correlation or cosine similarity
choose the best match

A simplified example:

```python
import numpy as np

# Key profiles indexed from the tonic (index 0 = tonic).
major_template = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                           2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
minor_template = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                           2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

# Example 12-bin profile, indexed by pitch class (0 = C, 1 = C#, ..., 11 = B).
pitch_profile = np.array([5.9, 2.1, 3.3, 2.2, 4.2, 4.0,
                          2.4, 5.0, 2.3, 3.5, 2.1, 2.7])

note_names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

best_score = -np.inf  # lower than any possible correlation
best_key = None

for shift in range(12):
    # Rolling by `shift` puts the tonic weight at pitch class `shift`,
    # so the rotated template represents the key with that tonic.
    major_score = np.corrcoef(pitch_profile, np.roll(major_template, shift))[0, 1]
    minor_score = np.corrcoef(pitch_profile, np.roll(minor_template, shift))[0, 1]

    if major_score > best_score:
        best_score = major_score
        best_key = (note_names[shift], "major")
    if minor_score > best_score:
        best_score = minor_score
        best_key = (note_names[shift], "minor")

print(best_key, best_score)
```

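This baseline works well for music with a stable tonal center. When it fails, it usually picks a closely related key, such as the relative major or minor or the key a fifth away.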
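
Since the winning key is only the top of a ranking, it can help to look at the runners-up as well. The sketch below scores all 24 keys and sorts them; the function name `rank_keys` and the synthetic input (the major template rotated to G with mild noise) are assumptions for illustration, not part of any library.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def rank_keys(profile):
    """Return (key name, correlation) pairs sorted from best to worst."""
    scores = []
    for tonic in range(12):
        for mode, template in (("major", MAJOR), ("minor", MINOR)):
            r = np.corrcoef(profile, np.roll(template, tonic))[0, 1]
            scores.append((f"{NOTE_NAMES[tonic]} {mode}", float(r)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Synthetic input (assumption): G major template plus a little noise.
rng = np.random.default_rng(0)
profile = np.roll(MAJOR, 7) + rng.normal(0.0, 0.2, size=12)

ranking = rank_keys(profile)
for name, score in ranking[:3]:
    print(f"{name}: {score:.3f}")

# A small gap between first and second place signals an ambiguous estimate.
margin = ranking[0][1] - ranking[1][1]
print(f"margin over runner-up: {margin:.3f}")
```

The margin is a rough confidence signal: a wide gap suggests a clear tonal center, while a narrow one suggests the top candidates are hard to tell apart.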

Improve Robustness with Harmonic Emphasis

Raw chroma can be noisy. Percussion, transient attacks, and dense arrangements can blur the tonal center. That is why many practical systems use harmonic pitch class profiles or harmonic isolation before building the pitch summary.

For example, separating harmonic content from percussive content can help:

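```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", mono=True)

# Harmonic-percussive source separation; only the harmonic part is used below.
y_harmonic, y_percussive = librosa.effects.hpss(y)

# Chroma computed from the harmonic signal, then averaged over time.
chroma = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)
profile = np.mean(chroma, axis=1)
print(np.round(profile, 3))
```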

This tends to produce a cleaner key profile for music where drums or transient-heavy material would otherwise dominate the spectrum.

Handle Key Changes Over Time

A single global key label is not always enough. Real music modulates. A piece may begin in one key and end in another, or a short sample may emphasize accidentals that confuse a global summary.

A more advanced approach is to compute chroma or HPCP in windows, score each window, and then smooth the key sequence over time. Hidden Markov models and similar temporal models are often used here because they prefer stable key sequences while still allowing occasional transitions.

This is one reason production key-detection systems tend to be more reliable than a simple whole-track average.
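
The windowed approach can be sketched with a Viterbi-style decoder that charges a flat penalty for every key change. This is a simplification of a full hidden Markov model, not a reference implementation. The function names, the penalty value, and the synthetic chroma (C major with one window leaning toward G, followed by A major) are all assumptions for illustration.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def key_templates():
    """Stack all 24 rotated templates: rows 0-11 major, rows 12-23 minor."""
    rows = [np.roll(MAJOR, t) for t in range(12)]
    rows += [np.roll(MINOR, t) for t in range(12)]
    return np.array(rows)

def window_scores(chroma, win):
    """Correlate the mean chroma of each non-overlapping window with every key."""
    templates = key_templates()
    n_windows = chroma.shape[1] // win
    scores = np.empty((n_windows, 24))
    for w in range(n_windows):
        profile = chroma[:, w * win:(w + 1) * win].mean(axis=1)
        for k in range(24):
            scores[w, k] = np.corrcoef(profile, templates[k])[0, 1]
    return scores

def smooth_keys(scores, switch_penalty=0.5):
    """Find the key path maximizing summed scores minus a penalty per change."""
    n_windows, n_keys = scores.shape
    # trans[i, j] is the cost of moving from key i to key j (zero if i == j).
    trans = switch_penalty * (1.0 - np.eye(n_keys))
    best = scores[0].copy()
    back = np.zeros((n_windows, n_keys), dtype=int)
    for t in range(1, n_windows):
        cand = best[:, None] - trans        # cand[i, j]: arrive at j from i
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + scores[t]
    # Backtrack from the best final key.
    path = [int(best.argmax())]
    for t in range(n_windows - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def key_name(index):
    return f"{NOTE_NAMES[index % 12]} {'major' if index < 12 else 'minor'}"

# Synthetic chroma (assumption): noiseless frames built from the templates.
c_major = np.roll(MAJOR, 0)
g_major = np.roll(MAJOR, 7)
a_major = np.roll(MAJOR, 9)
blend = 0.45 * c_major + 0.55 * g_major     # a passage leaning on the dominant

segments = [(c_major, 20), (blend, 10), (c_major, 10), (a_major, 40)]
chroma = np.concatenate([np.tile(p[:, None], (1, n)) for p, n in segments], axis=1)

scores = window_scores(chroma, win=10)
raw = [int(k) for k in scores.argmax(axis=1)]
smoothed = smooth_keys(scores, switch_penalty=0.5)
print("raw:     ", [key_name(k) for k in raw])
print("smoothed:", [key_name(k) for k in smoothed])
```

In this constructed example the third window leans toward the dominant, so its raw label drifts away from C major, while the switching penalty keeps the smoothed sequence in C major until the sustained move to A major. The penalty value trades stability against responsiveness and would need tuning on real material.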

Machine Learning Can Help, but Features Still Matter

Modern systems may train classifiers on chroma-like features, spectrogram slices, or end-to-end audio embeddings. That can improve robustness on complex audio, but the classic problems remain:

ambiguous tonality
modulation
short clips with weak harmonic evidence
genre-specific harmonic patterns

A neural network does not remove the need for good feature design and evaluation. It just shifts more of the decision-making into the learned model.
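
As a toy illustration of training a classifier on chroma-like features, the sketch below fits a nearest-centroid model using NumPy only. Everything here is an assumption for demonstration: the data is synthetic (rotated templates plus Gaussian noise) rather than labeled audio, and real systems typically use far more expressive models.

```python
import numpy as np

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def make_dataset(n_per_key, noise, rng):
    """Synthetic stand-in for labeled chroma profiles (labels 0-11 major, 12-23 minor)."""
    X, y = [], []
    for label in range(24):
        template = MAJOR if label < 12 else MINOR
        base = np.roll(template, label % 12)
        X.append(base + rng.normal(0.0, noise, size=(n_per_key, 12)))
        y.append(np.full(n_per_key, label))
    return np.vstack(X), np.concatenate(y)

rng = np.random.default_rng(0)
X_train, y_train = make_dataset(50, 0.5, rng)
X_test, y_test = make_dataset(20, 0.5, rng)

# "Training": compute one centroid per key label.
centroids = np.array([X_train[y_train == k].mean(axis=0) for k in range(24)])

# Prediction: assign each test profile to the nearest centroid.
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = float((pred == y_test).mean())
print(f"accuracy on synthetic test set: {accuracy:.2%}")
```

High accuracy on this synthetic set says little about real audio, where the listed problems (ambiguity, modulation, weak harmonic evidence) dominate. The point is only that the learned model still depends on the quality of the 12-bin features it is given.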

Common Pitfalls

The biggest mistake is trying to infer key directly from raw waveform values without extracting pitch-related features first. Another common issue is averaging chroma over an entire track even when the music modulates, which can blur the result badly. Developers also underestimate how much percussion and noise can distort pitch-class profiles unless harmonic emphasis is used. Finally, a key estimate is often a best guess, not an absolute truth, especially for short, modal, or harmonically ambiguous excerpts.

Summary

Key detection usually starts by converting audio into pitch-class features such as chroma or HPCP.
Template matching against rotated major and minor profiles is the classic baseline approach.
Harmonic isolation often improves results by reducing percussive noise.
Windowed analysis helps when the music changes key over time.
Machine-learning models can improve robustness, but the core challenge is still extracting reliable tonal evidence from the audio.