Digit Recognition on CNN

Digit Recognition

Convolutional Neural Networks

Machine Learning

Image Processing

Deep Learning

Digit Recognition on CNN

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Digit recognition with a CNN is a classic image-classification problem because the task is simple enough to learn from and rich enough to demonstrate how convolutional layers work. The usual benchmark is MNIST, where the model learns to map small grayscale digit images to the labels 0 through 9.

Why CNNs Work Well for Digits

Handwritten digits are spatial patterns. A 3 and an 8 differ by local shapes, loops, edges, and strokes. CNNs are good at learning exactly those local visual features because convolution filters slide across the image and detect repeated patterns regardless of exact position.

For digit recognition, that means the network can learn:

edge detectors in early layers
stroke combinations in deeper layers
class-specific digit structure near the output

This is a much better fit than flattening the whole image immediately and asking a dense network to learn spatial structure from scratch.

A Small Keras Example

python

1import tensorflow as tf
2
3(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
4
5x_train = x_train.astype("float32") / 255.0
6x_test = x_test.astype("float32") / 255.0
7
8x_train = x_train[..., tf.newaxis]
9x_test = x_test[..., tf.newaxis]
10
11model = tf.keras.Sequential(
12    [
13        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
14        tf.keras.layers.MaxPooling2D(),
15        tf.keras.layers.Conv2D(64, 3, activation="relu"),
16        tf.keras.layers.MaxPooling2D(),
17        tf.keras.layers.Flatten(),
18        tf.keras.layers.Dense(128, activation="relu"),
19        tf.keras.layers.Dropout(0.3),
20        tf.keras.layers.Dense(10, activation="softmax"),
21    ]
22)
23
24model.compile(
25    optimizer="adam",
26    loss="sparse_categorical_crossentropy",
27    metrics=["accuracy"],
28)
29
30model.fit(x_train, y_train, epochs=3, batch_size=64, validation_split=0.1)
31print(model.evaluate(x_test, y_test))

This is already enough to achieve strong performance on MNIST with a compact model.

What the Layers Are Doing

The first convolution layers detect simple image features such as edges and short strokes. Pooling reduces spatial size while keeping the strongest signals. After that, dense layers combine the learned features into a final classification.

The output layer has 10 units because there are 10 digit classes. softmax converts the raw scores into class probabilities.

For labels encoded as integer class IDs, sparse_categorical_crossentropy is the natural loss choice. If labels were one-hot encoded instead, categorical_crossentropy would be more appropriate.

Preprocessing Matters

Even on MNIST, preprocessing is part of the pipeline:

scale pixels into a small range such as 0 to 1
add a channel dimension so the CNN sees height, width, channels
keep train and test preprocessing consistent

For more difficult handwritten data, you may also center digits, crop empty margins, or apply augmentation such as slight rotations and shifts. Those small input improvements can matter as much as an extra convolution layer.

Evaluate Beyond One Accuracy Number

Accuracy is the main headline metric, but a confusion matrix is useful for seeing which digits are confused most often. Models commonly confuse classes such as 4 and 9 or 3 and 5 when handwriting is messy.

That tells you whether the model is weak overall or only struggles on a few visually similar digits. In practical OCR tasks, those error patterns matter because they affect downstream correction strategies.

Common Pitfalls

Forgetting to normalize pixel values before training.
Feeding images with shape (28, 28) when the CNN expects (28, 28, 1).
Using the wrong loss for the label format.
Overcomplicating the architecture for a simple dataset like MNIST.
Judging the model only by one accuracy value without checking the types of mistakes it makes.

Summary

CNNs are a natural fit for digit recognition because they learn spatial features from images.
A small Conv2D plus pooling stack is enough to solve MNIST effectively.
Normalize the images and keep the input shape compatible with convolution layers.
Use softmax with a 10-class output for digit classification.
Inspect confusion patterns, not just accuracy, when evaluating model quality.