Andrew Ng's Coursera Assignment - Training full Trigger Word detection model

Introduction

The trigger-word assignment in Andrew Ng’s sequence models material is a compact example of real audio event detection. The goal is not just to classify a whole clip as positive or negative, but to detect when a target word occurs in time so the model can fire only after the spoken phrase appears.

What the Model Is Learning

A trigger-word detector usually consumes a time-frequency representation of audio, such as a spectrogram. Instead of predicting one label for the entire clip, it predicts a sequence of probabilities across time steps.

That distinction matters. The training target is aligned to the timeline, not just the clip. If the trigger word appears at one point in the recording, only the output frames immediately after that region should be marked positive.

The assignment commonly follows this workflow:

generate synthetic 10-second clips
overlay positive and negative word recordings on background noise
convert the mixed audio to a spectrogram-like representation
train a sequence model to emit high probability shortly after the trigger word ends
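
The mixing step in the workflow above can be sketched with plain NumPy arrays. This is a minimal illustration, not the assignment's own code: the sample rate, the random stand-in waveforms, and the helper names `overlaps` and `insert_clip` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

SAMPLE_RATE = 16_000             # assumed sample rate, chosen for illustration
CLIP_SAMPLES = 10 * SAMPLE_RATE  # one 10-second clip


def overlaps(start, end, used):
    """Return True if [start, end) intersects any previously used segment."""
    return any(start < u_end and u_start < end for u_start, u_end in used)


def insert_clip(background, clip, used, max_tries=50):
    """Mix `clip` into `background` at a random position that is still free.

    Returns the (start, end) sample range, or None if no free slot was found.
    """
    for _ in range(max_tries):
        start = int(rng.integers(0, len(background) - len(clip)))
        end = start + len(clip)
        if not overlaps(start, end, used):
            background[start:end] += clip  # modifies the background in place
            used.append((start, end))
            return start, end
    return None


# Random noise stands in for real recordings (hypothetical data).
background = 0.01 * rng.standard_normal(CLIP_SAMPLES).astype("float32")
trigger = 0.5 * rng.standard_normal(SAMPLE_RATE).astype("float32")        # 1 s "positive" word
negative = 0.5 * rng.standard_normal(SAMPLE_RATE // 2).astype("float32")  # 0.5 s "negative" word

used = []
trigger_span = insert_clip(background, trigger, used)
negative_span = insert_clip(background, negative, used)
print("trigger samples:", trigger_span, "negative samples:", negative_span)
```

Because the script records where each overlay landed, the end of `trigger_span` is exactly the information needed later to build frame-level labels.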

Building Frame-Level Labels

The hardest part is usually not the network. It is producing labels that match the output timeline.

A simple strategy is to mark the next few frames after a trigger event as 1 and everything else as 0. The short positive window teaches the model to respond after the word is heard, while still leaving most frames negative.

```python
import numpy as np

output_steps = 50
labels = np.zeros(output_steps, dtype="float32")
trigger_end_step = 18

# Mark a short activation region after the trigger word ends.
labels[trigger_end_step + 1 : trigger_end_step + 6] = 1.0
print(labels.astype(int))
```

If the positive region is too wide, the target becomes blurry and the model cannot localize where the word ended. If it is too narrow, training becomes fragile because the positive class is already sparse.
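
In practice the trigger position is usually known in milliseconds or samples rather than in output steps, so it has to be converted first. The sketch below shows one way to do that, assuming a 10-second clip and a window of 5 frames; the helper name `label_after_event` and its default parameters are hypothetical.

```python
import numpy as np


def label_after_event(labels, end_ms, clip_ms=10_000, width=5):
    """Set up to `width` frames to 1 right after the frame containing `end_ms`.

    The window is clipped at the end of the label array, so a trigger that
    ends very late in the clip gets a shorter (possibly empty) window.
    """
    steps = len(labels)
    end_step = int(end_ms * steps / clip_ms)  # map time to an output step
    start = min(end_step + 1, steps)
    stop = min(start + width, steps)
    labels[start:stop] = 1.0
    return labels


labels = label_after_event(np.zeros(50, dtype="float32"), end_ms=3_700)
print(labels.astype(int))
print("positive fraction:", float(labels.mean()))
```

Printing the positive fraction is a quick way to see how sparse the target is for a given window width.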

A Minimal Sequence Model

The Coursera assignment uses a sequence architecture because the input is ordered over time. The exact layer mix can vary, but a practical design uses a small convolution front end followed by recurrent layers.

```python
import numpy as np
import tensorflow as tf

# Synthetic spectrogram-like inputs: batch, time, features.
X = np.random.randn(32, 100, 32).astype("float32")
# Placeholder labels: the same positive window in every clip, for shape checking only.
y = np.zeros((32, 100, 1), dtype="float32")
y[:, 40:45, 0] = 1.0

inputs = tf.keras.Input(shape=(100, 32))
# padding="same" with strides=1 keeps the output at 100 time steps.
x = tf.keras.layers.Conv1D(16, kernel_size=5, strides=1, padding="same", activation="relu")(inputs)
# return_sequences=True emits one hidden state per time step.
x = tf.keras.layers.GRU(32, return_sequences=True)(x)
x = tf.keras.layers.Dropout(0.2)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2, batch_size=8, verbose=0)

print(model.predict(X[:1], verbose=0).shape)
```

This code does not reproduce the full audio pipeline, and the random inputs carry no real signal, but it shows the core supervised learning shape: a three-dimensional input of shape (batch, time, features) and a time-aligned output probability at each step.

Why Synthetic Mixing Is Useful

Real trigger-word datasets are expensive to label frame by frame. The assignment gets around that by synthesizing training clips from smaller recordings. That gives you exact placement information because the training script knows where each overlay was inserted.

Synthetic generation also lets you control the difficulty:

add more background noise to improve robustness
vary speaker loudness and timing to reduce overfitting
insert negative speech clips so the model does not fire on any voice activity
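
The loudness and noise controls listed above can be sketched as two small helpers. This is an illustrative example with random stand-in waveforms; the function names `apply_gain_db` and `scale_noise_to_snr`, the gain range, and the 10 dB target are assumptions, not values from the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)


def apply_gain_db(signal, gain_db):
    """Scale a waveform by a gain expressed in decibels."""
    return signal * 10 ** (gain_db / 20)


def scale_noise_to_snr(speech, noise, snr_db):
    """Rescale `noise` so that speech power / noise power equals `snr_db`."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return noise * np.sqrt(target_noise_power / noise_power)


# Random noise stands in for a spoken word and a background recording.
speech = rng.standard_normal(16_000)
noise = rng.standard_normal(16_000)

speech = apply_gain_db(speech, rng.uniform(-6.0, 6.0))  # vary speaker loudness
noise = scale_noise_to_snr(speech, noise, snr_db=10.0)  # set the difficulty
mixed = speech + noise

achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))
print("achieved SNR (dB):", round(float(achieved), 2))
```

Lowering `snr_db` makes the clip harder, which is one way to sweep difficulty during data generation.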

This is one of the best lessons in the assignment. Good labels and data generation often matter as much as model size.

Training Considerations

Frame-wise detection creates strong class imbalance because most time steps are negative. If the model predicts zero everywhere, it may still achieve deceptively good accuracy. That is why looking only at raw accuracy is a mistake.
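
A small numeric check makes the point concrete. The label layout here is made up for illustration: 1000 frames with a single 50-frame positive region, scored against a predictor that never fires.

```python
import numpy as np

steps = 1000
y_true = np.zeros(steps, dtype=int)
y_true[400:450] = 1                  # 5% of frames are positive

y_pred = np.zeros(steps, dtype=int)  # a "detector" that never fires

accuracy = float((y_pred == y_true).mean())
recall = float((y_pred[y_true == 1] == 1).mean())

print("accuracy:", accuracy)  # 0.95
print("recall:", recall)      # 0.0
```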

Better checks include:

inspecting predicted probability curves over time
listening to clips where the model fires incorrectly
measuring false triggers on negative-only audio
measuring missed detections on clips that contain the trigger word
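
The last two checks can be approximated by counting how often a probability curve crosses a threshold. The sketch below is one simple way to do it, with a hypothetical helper `count_triggers`; the threshold of 0.5 and the 10-step minimum gap between firings are arbitrary assumptions that would need tuning.

```python
import numpy as np


def count_triggers(probs, threshold=0.5, min_gap=10):
    """Count upward threshold crossings, ignoring re-fires within `min_gap` steps."""
    above = probs >= threshold
    count = 0
    last_fire = -min_gap
    for t in range(len(above)):
        rising = above[t] and (t == 0 or not above[t - 1])
        if rising and t - last_fire >= min_gap:
            count += 1
            last_fire = t
    return count


# Negative-only clip with one spurious bump: counts as a false trigger.
negative_probs = np.zeros(100)
negative_probs[30:33] = 0.9

# Clip that contains the trigger word, but the output stays flat: a miss.
positive_probs = np.full(100, 0.1)

false_triggers = count_triggers(negative_probs)
missed = count_triggers(positive_probs) == 0
print("false triggers:", false_triggers, "missed detection:", missed)
```

Counting events instead of frames keeps the metric close to what a user would experience: how often the device wakes up when it should not, and how often it stays silent when it should respond.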

During development, plot a few label sequences and model outputs side by side. Many errors come from wrong alignment, not from lack of capacity.

Common Pitfalls

The biggest pitfall is label misalignment. If the target frames do not correspond to the model output timeline, the network cannot learn a stable mapping.
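
One frequent source of misalignment is a strided or unpadded convolution that shortens the time axis, so the label array must be built at the reduced length. The helper below (a hypothetical name, `conv1d_output_steps`) applies the standard output-length rules for a 1D convolution with "valid" and "same" padding; the input sizes are illustrative values.

```python
import math


def conv1d_output_steps(input_steps, kernel_size, stride, padding="valid"):
    """Number of time steps a 1D convolution produces for a given input length."""
    if padding == "same":
        return math.ceil(input_steps / stride)
    if padding == "valid":
        return (input_steps - kernel_size) // stride + 1
    raise ValueError(f"unsupported padding: {padding!r}")


# With "same" padding and stride 1, the timeline is unchanged.
print(conv1d_output_steps(100, kernel_size=5, stride=1, padding="same"))  # 100

# With "valid" padding and stride 4, the labels must be much shorter than the input.
print(conv1d_output_steps(5511, kernel_size=15, stride=4))  # 1375
```

Comparing this number with the length of the label array before training catches many alignment bugs early.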

Another common problem is training on unrealistic synthetic data. If the positive samples are always clean and centered, the detector performs poorly on real microphone audio.

A third issue is trusting accuracy as the main metric. With sparse positives, accuracy can look high while the detector is useless in practice.

Summary

Trigger-word detection is a sequence labeling problem, not just a clip classification problem.
The critical step is aligning output labels to the audio timeline.
A small Conv1D plus recurrent model is a reasonable architecture for the assignment.
Synthetic data generation is useful because it gives exact placement labels.
Evaluate false triggers and missed activations, not just overall accuracy.