Bucketing variable-length sequence inputs for an `RNN`

Introduction

Recurrent models and sequence models often receive examples with very different lengths. Sentences, click streams, and sensor traces rarely arrive in neat fixed-size arrays, so batching them efficiently becomes part of the model design.

Bucketing is a practical compromise. Instead of padding every sequence to the global maximum length, you group examples with similar lengths into the same batch and pad only within that smaller range. That reduces wasted computation and usually improves training throughput.

Why Plain Padding Becomes Expensive

Suppose one batch contains sequences of lengths 8, 10, 12, and 80. If you pad to the longest example, the batch holds 4 × 80 = 320 timesteps, of which only 110 are real, so roughly two thirds of the tensor is padding. The model still computes over those padded timesteps unless you use masking or packed sequences, so memory and time are wasted.

Bucketing improves that by putting the 80-step example with other long sequences and the shorter examples together in their own batch. You still pad, but far less aggressively.
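To make the saving concrete, here is a minimal sketch that measures the padded fraction of a set of batches. The helper name `padding_fraction` and the second batch of lengths are hypothetical and chosen only for illustration; the first batch is the 8, 10, 12, 80 example from above.

```python
def padding_fraction(batches):
    """Fraction of padded cells when each batch is padded to its own maximum length."""
    total = sum(len(b) * max(b) for b in batches)
    real = sum(sum(b) for b in batches)
    return (total - real) / total

# The single batch from the text: 320 cells, 110 of them real.
single = padding_fraction([[8, 10, 12, 80]])

# Two batches in arrival order (the second batch is a made-up example).
mixed = [[8, 10, 12, 80], [9, 11, 75, 78]]

# The same eight lengths, sorted and re-sliced into two batches of four.
all_lengths = sorted(n for b in mixed for n in b)
bucketed = [all_lengths[i:i + 4] for i in range(0, len(all_lengths), 4)]

print(f"single batch: {single:.1%}")
print(f"arrival order: {padding_fraction(mixed):.1%}")
print(f"bucketed: {padding_fraction(bucketed):.1%}")
```

On this toy data the padded share drops from a bit over half to under a quarter. The exact numbers depend entirely on the length distribution.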

A Simple Bucketing Strategy

The core idea is straightforward:

1. Measure sequence lengths.
2. Sort or group examples by length.
3. Build batches from nearby lengths.
4. Pad only to the maximum length inside each batch.

Here is a small PyTorch example that creates bucketed batches and pads them in a custom collate function:

python

import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5]),
    torch.tensor([6, 7, 8, 9, 10]),
    torch.tensor([11]),
    torch.tensor([12, 13, 14, 15]),
]

def make_buckets(items, batch_size):
    # Sort by length so neighbouring items need similar amounts of padding.
    items = sorted(items, key=len)
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def collate(bucket):
    # Pad only to the longest sequence in this bucket and keep the true lengths.
    lengths = torch.tensor([len(seq) for seq in bucket])
    padded = pad_sequence(bucket, batch_first=True, padding_value=0)
    return padded, lengths

for bucket in make_buckets(sequences, batch_size=2):
    padded, lengths = collate(bucket)
    print("lengths:", lengths.tolist())
    print(padded)

This is the simplest form of bucketing: sort by length, then slice into batches. Many production pipelines add randomization within buckets so that batches do not arrive in the same length-sorted order every epoch.

Feeding Bucketed Data Into an RNN

Padding alone is only part of the solution. You usually also want the RNN to skip the padded timesteps. In PyTorch, `pack_padded_sequence` is the usual tool:

python

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

batch = [
    torch.tensor([1.0, 2.0, 3.0]),
    torch.tensor([4.0, 5.0]),
    torch.tensor([6.0]),
]

# Record the true lengths before padding; packing needs them.
lengths = torch.tensor([len(x) for x in batch])
padded = pad_sequence(batch, batch_first=True)  # shape: (batch, max_len)

rnn = torch.nn.GRU(input_size=1, hidden_size=4, batch_first=True)
inputs = padded.unsqueeze(-1)  # add a feature dimension: (batch, max_len, 1)

# enforce_sorted=False lets the batch stay in its original order.
packed = pack_padded_sequence(
    inputs,
    lengths=lengths,
    batch_first=True,
    enforce_sorted=False,
)

packed_output, hidden = rnn(packed)
print(hidden.shape)  # torch.Size([1, 3, 4]): (num_layers, batch, hidden_size)

With this approach, the RNN processes only the real timesteps. Bucketing reduces padding, and packing prevents the model from spending work on the remaining padded positions.
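One way to check that packing really skips the padding is to unpack the output and compare it with running a short sequence on its own. The sketch below assumes the same toy batch and a single-layer, unidirectional GRU; the variable names `same_as_alone` and `padded_steps_are_zero` are illustrative only.

```python
import torch
from torch.nn.utils.rnn import (
    pad_sequence,
    pack_padded_sequence,
    pad_packed_sequence,
)

torch.manual_seed(0)

batch = [
    torch.tensor([1.0, 2.0, 3.0]),
    torch.tensor([4.0, 5.0]),
    torch.tensor([6.0]),
]
lengths = torch.tensor([len(x) for x in batch])
inputs = pad_sequence(batch, batch_first=True).unsqueeze(-1)  # (batch, time, 1)

rnn = torch.nn.GRU(input_size=1, hidden_size=4, batch_first=True)

with torch.no_grad():
    packed = pack_padded_sequence(
        inputs, lengths=lengths, batch_first=True, enforce_sorted=False
    )
    packed_output, hidden = rnn(packed)

    # Convert the packed output back to a padded (batch, time, hidden) tensor.
    output, out_lengths = pad_packed_sequence(packed_output, batch_first=True)

    # Run the shortest sequence by itself, with no padding at all.
    _, alone_hidden = rnn(batch[2].view(1, -1, 1))

# The final hidden state of the packed run matches the unpadded run.
same_as_alone = torch.allclose(hidden[0, 2], alone_hidden[0, 0], atol=1e-6)

# Positions past the true length are filled with zeros, not GRU outputs.
padded_steps_are_zero = bool((output[2, 1:] == 0).all())

print(same_as_alone, padded_steps_are_zero, out_lengths.tolist())
```

If the padded input were fed to the GRU directly, the final hidden state for the short sequences would instead reflect the trailing zeros.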

Choosing Bucket Boundaries

There is no perfect universal bucket scheme. A few workable options are:

- Sort everything by length and build contiguous batches.
- Define ranges such as 1-10, 11-20, 21-40, and so on.
- Build quantile-based buckets so each bucket contains roughly the same number of examples.

If your data has extreme outliers, cap the maximum allowed sequence length or isolate the outliers in their own bucket. Otherwise one unusually long example can still distort a whole batch.
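The range-based and quantile-based options can be sketched in plain Python. The helpers `assign_buckets` and `quantile_boundaries` and the list of lengths are hypothetical, and the quantile rule shown is just one simple choice that assumes there are at least as many examples as buckets. Lengths above the last boundary land in an extra overflow bucket, which is one way to isolate outliers.

```python
from bisect import bisect_left
from collections import defaultdict

def assign_buckets(lengths, boundaries):
    """Group example indices by the first upper bound that is >= their length.

    Lengths larger than every boundary go to an extra overflow bucket.
    """
    buckets = defaultdict(list)
    for idx, n in enumerate(lengths):
        buckets[bisect_left(boundaries, n)].append(idx)
    return dict(buckets)

def quantile_boundaries(lengths, num_buckets):
    """Pick upper bounds so each bucket holds roughly the same number of examples."""
    ordered = sorted(lengths)
    return [
        ordered[(len(ordered) * k) // num_buckets - 1]
        for k in range(1, num_buckets + 1)
    ]

lengths = [3, 7, 10, 11, 15, 20, 22, 38, 40, 95]  # made-up lengths

# Fixed ranges 1-10, 11-20, 21-40; the 95-step outlier gets its own bucket.
fixed = assign_buckets(lengths, boundaries=[10, 20, 40])
print(fixed)

# Quantile-based upper bounds for three buckets.
bounds = quantile_boundaries(lengths, num_buckets=3)
print(bounds, assign_buckets(lengths, bounds))
```

With fixed ranges the bucket sizes follow the data, so some buckets may be nearly empty. With quantile bounds the sizes are even, but the widest bucket can still span a large range of lengths.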

Bucketing Versus Truncation

Bucketing is not the same as truncation. Bucketing preserves the full sequence and only changes how samples are grouped for batching. Truncation throws away steps beyond a chosen limit.

That means bucketing is usually a better first optimization when the long tail is still important to model quality. Truncation is a stronger compromise and should be applied intentionally.
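The difference is easy to see by counting timesteps. In this small sketch, the sequences and the helper names `truncate` and `bucket` are made up for illustration. Truncation shortens the long examples, while bucketing only regroups them.

```python
sequences = [
    [1, 2, 3],
    [4, 5, 6, 7, 8, 9, 10, 11],
    [12, 13],
    [14, 15, 16, 17, 18, 19, 20],
]

def truncate(seqs, max_len):
    # Drop every step beyond max_len.
    return [s[:max_len] for s in seqs]

def bucket(seqs, batch_size):
    # Regroup by length; no sequence is modified.
    ordered = sorted(seqs, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

total_steps = sum(len(s) for s in sequences)
kept_after_truncation = sum(len(s) for s in truncate(sequences, max_len=4))
kept_after_bucketing = sum(len(s) for b in bucket(sequences, batch_size=2) for s in b)

print(total_steps, kept_after_truncation, kept_after_bucketing)
```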

Common Pitfalls

The most common mistake is sorting the entire dataset by length and never reshuffling. That can produce highly correlated batches and hurt generalization. A common fix is to shuffle examples first, then sort within a moving window or shuffle inside each bucket.
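A minimal sketch of the shuffle-then-sort-within-a-window idea follows. The function name `windowed_bucket_batches` and its parameters are hypothetical, and the lengths are invented. It returns batches of example indices, which is the shape a batch sampler would typically need.

```python
import random

def windowed_bucket_batches(lengths, batch_size, window_batches=10, seed=None):
    """Shuffle all indices, sort by length inside each window, then shuffle the batches."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)  # global shuffle breaks any fixed ordering

    window = batch_size * window_batches
    batches = []
    for start in range(0, len(indices), window):
        # Sort only within this window so similar lengths end up adjacent.
        chunk = sorted(indices[start:start + window], key=lambda i: lengths[i])
        batches.extend(
            chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)
        )

    rng.shuffle(batches)  # avoid always presenting short batches first
    return batches

lengths = [5, 42, 7, 39, 6, 40, 8, 41, 5, 38, 9, 37]  # made-up lengths
batches = windowed_bucket_batches(lengths, batch_size=2, window_batches=3, seed=0)
for b in batches:
    print([lengths[i] for i in b])
```

A larger window gives tighter length grouping and less randomness. A window of one batch degenerates to plain random batching. Calling the function with a different seed each epoch changes both the batch contents and their order.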

Another pitfall is forgetting masking or packed sequences. Bucketing reduces padding, but it does not eliminate it. If the model still treats padded zeros as real data, the training signal is polluted.
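When packing is not an option, a boolean mask built from the true lengths is a common alternative. The sketch below is illustrative rather than a canonical recipe: it compares a naive mean over padded values with a masked mean, and the variable names are assumptions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

batch = [
    torch.tensor([1.0, 2.0, 3.0]),
    torch.tensor([4.0, 5.0]),
    torch.tensor([6.0]),
]
lengths = torch.tensor([len(x) for x in batch])
padded = pad_sequence(batch, batch_first=True)  # (batch, max_len), zeros as padding

# mask[i, t] is True only where timestep t is real data for example i.
mask = torch.arange(padded.size(1)).unsqueeze(0) < lengths.unsqueeze(1)

naive_mean = padded.mean(dim=1)                     # counts padded zeros as data
masked_mean = (padded * mask).sum(dim=1) / lengths  # averages real steps only

print(naive_mean.tolist())
print(masked_mean.tolist())
```

The same mask can be applied to a per-timestep loss so that padded positions contribute nothing to the gradient.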

Very narrow buckets can also backfire. They reduce padding but increase data loader complexity and may create many tiny batches. Very wide buckets are easier to implement but recover less efficiency. The right balance depends on your length distribution.

Finally, remember that some modern architectures such as Transformers often use masking with dynamic padding and may not need classic RNN-style packing. Bucketing still helps, but the exact implementation differs.

Summary

- Bucketing groups sequences of similar lengths so each batch needs less padding.
- It reduces wasted computation and often improves RNN training throughput.
- In PyTorch, combine bucketing with `pad_sequence` and `pack_padded_sequence`.
- Choose bucket ranges based on the actual length distribution, not guesswork.
- Keep some randomness in batching so the model does not see data in a rigid sorted order.