Generating MNIST numbers using LSTM-CGAN in TensorFlow

LSTM

CGAN

MNIST

TensorFlow

Deep Learning

Generating MNIST numbers using LSTM-CGAN in TensorFlow

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Using a conditional GAN to generate MNIST digits is common, but combining it with an LSTM changes the shape of the problem. Instead of treating the image only as a spatial grid, an LSTM-CGAN can treat the digit as a sequence, usually row by row or as a flattened pixel stream, while still conditioning generation on the target digit label.

Why Use an LSTM in a CGAN

A normal CGAN for MNIST often uses convolutional layers because digits are images. An LSTM-based variant is less common, but it can make sense if you want the generator to emit the image as a sequence of steps. For a 28 x 28 MNIST image, one practical interpretation is a sequence of 28 rows, each row containing 28 pixel values.

The conditional part matters because it tells the generator which digit to create. Without the label, the generator only learns "make something digit-like." With the label, it learns "make a three" or "make a seven."

That means the full input to the generator usually includes:

a random noise vector
a class label from 0 to 9
a mechanism to combine the two

Build a Conditional Generator in TensorFlow

The example below uses TensorFlow Keras to build a small generator. The label is embedded, concatenated with noise, projected into a sequence, and then decoded through an LSTM.

python

1import tensorflow as tf
2
3latent_dim = 100
4num_classes = 10
5sequence_length = 28
6row_size = 28
7
8noise_input = tf.keras.Input(shape=(latent_dim,))
9label_input = tf.keras.Input(shape=(1,), dtype="int32")
10
11label_embedding = tf.keras.layers.Embedding(num_classes, 16)(label_input)
12label_embedding = tf.keras.layers.Flatten()(label_embedding)
13
14x = tf.keras.layers.Concatenate()([noise_input, label_embedding])
15x = tf.keras.layers.Dense(sequence_length * 64, activation="relu")(x)
16x = tf.keras.layers.Reshape((sequence_length, 64))(x)
17x = tf.keras.layers.LSTM(128, return_sequences=True)(x)
18x = tf.keras.layers.TimeDistributed(
19    tf.keras.layers.Dense(row_size, activation="sigmoid")
20)(x)
21
22generator = tf.keras.Model([noise_input, label_input], x)
23generator.summary()

This model outputs a tensor shaped like 28 x 28, which can be interpreted as a grayscale MNIST image with values in the range 0 to 1.

Build a Conditional Discriminator

The discriminator also needs the label. It receives an image sequence and the digit class, then predicts whether the sample is real or fake.

python

1image_input = tf.keras.Input(shape=(sequence_length, row_size))
2label_input = tf.keras.Input(shape=(1,), dtype="int32")
3
4label_embedding = tf.keras.layers.Embedding(num_classes, sequence_length)(label_input)
5label_embedding = tf.keras.layers.Flatten()(label_embedding)
6label_embedding = tf.keras.layers.RepeatVector(row_size)(label_embedding)
7label_embedding = tf.keras.layers.Permute((2, 1))(label_embedding)
8
9x = tf.keras.layers.Concatenate(axis=-1)([image_input, label_embedding])
10x = tf.keras.layers.LSTM(128)(x)
11x = tf.keras.layers.Dense(64, activation="relu")(x)
12out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
13
14discriminator = tf.keras.Model([image_input, label_input], out)
15discriminator.summary()

The exact label-conditioning scheme can vary. The important part is that both networks see the digit label so the adversarial game is conditioned on the requested class.

Train the Model on MNIST

MNIST is easy to load from TensorFlow. Normalize images to 0 through 1, keep the labels, and reshape each image into a sequence of 28 rows.

python

1(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
2x_train = x_train.astype("float32") / 255.0
3
4print(x_train.shape)  # (60000, 28, 28)
5print(y_train.shape)  # (60000,)

Training then follows the usual GAN pattern:

sample real images and labels
sample noise and matching labels
generate fake images
train the discriminator on real and fake pairs
train the generator through the frozen discriminator

The high-level loop looks like this:

python

1batch_size = 64
2noise = tf.random.normal((batch_size, latent_dim))
3labels = tf.random.uniform((batch_size, 1), minval=0, maxval=10, dtype=tf.int32)
4fake_images = generator([noise, labels])

The rest of the training loop depends on whether you use a manual GradientTape setup or a combined Keras training graph. For GANs, a manual training step is usually clearer.

What Makes This Different From a CNN GAN

An LSTM-CGAN is not the standard best-performing architecture for MNIST image generation. Convolutional GANs usually fit the task more naturally because digits are spatial patterns. The LSTM version is mainly useful when you want sequence modeling behavior or when the real target problem is sequential and MNIST is just the prototype dataset.

That distinction matters because a weaker result on MNIST does not necessarily mean the design is wrong. It may simply mean the architecture is optimized for a different structure than pure image generation.

Common Pitfalls

Treating MNIST as a sequence without reshaping consistently leads to mismatched tensor dimensions between the generator, discriminator, and dataset.
Conditioning only the generator and not the discriminator weakens the CGAN setup because the discriminator cannot judge whether the image matches the requested label.
Expecting an LSTM-CGAN to outperform a convolutional GAN on MNIST is usually unrealistic. The architecture choice trades image-specific inductive bias for sequence modeling behavior.
Forgetting to normalize MNIST pixels makes GAN training less stable because the generator and discriminator see poorly scaled inputs.
Debugging the full adversarial loop before checking standalone generator and discriminator shapes wastes time. Verify each model separately first.

Summary

An LSTM-CGAN can generate MNIST digits by treating each image as a sequence, commonly row by row.
The generator and discriminator both need the digit label for proper conditional training.
TensorFlow Keras can build this setup with label embeddings, LSTM layers, and manual GAN training steps.
MNIST must be reshaped and normalized consistently before training.
For raw image quality, CNN-based GANs are often a better fit, but LSTM-CGANs are useful when sequence generation is the real goal.