Image Retraining in TensorFlow: Changing the Simple Softmax Layer to a Multilayer CNN

Introduction

In transfer learning, "replace the softmax layer" usually means "keep the pretrained feature extractor and train a new classifier head." If you want something richer than a single softmax layer, the answer is not to bolt convolutions onto final class probabilities. The answer is to cut the network at a feature-map stage, then attach a new trainable head that still operates on spatial features. Once you see that distinction, the architecture change becomes much cleaner.

Why a Simple Softmax Head May Not Be Enough

A common retraining pipeline uses a pretrained CNN backbone, freezes it, and adds a small classification head such as global pooling plus a dense softmax layer. That works well when the pretrained features already separate the target classes cleanly.

A minimal version looks like this:

python
import tensorflow as tf

# Pretrained backbone without its original ImageNet classifier.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3),
    include_top=False,
    weights="imagenet",
)
base.trainable = False  # freeze the feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),      # feature maps -> one vector per image
    tf.keras.layers.Dense(3, activation="softmax"),  # 3 target classes
])

This is efficient, but it gives the new task only a shallow classifier head. If the new dataset needs more task-specific feature refinement, a deeper head can help.

Add New Convolutional Layers at the Feature-Map Level

If you want a multilayer CNN head, attach it before the representation has been collapsed into a pooled feature vector or class logits. That means taking the feature maps from the pretrained base and feeding them into new convolutional blocks.

python
import tensorflow as tf

inputs = tf.keras.Input(shape=(160, 160, 3))

base = tf.keras.applications.MobileNetV2(
    input_tensor=inputs,
    include_top=False,
    weights="imagenet",
)
base.trainable = False  # freeze the pretrained feature extractor

# New trainable head: it still operates on spatial feature maps.
x = base.output
x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.BatchNormalization()(x)

# Collapse the spatial dimensions only after the new convolutions.
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)

This design still ends with softmax, but the classifier now sits on top of additional learned convolutional layers rather than on raw pretrained features alone.
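To make "feature-map level" concrete, the sketch below prints the shape of the tensor that the new head receives and shows one way to branch from an earlier, higher-resolution layer instead. It is a minimal illustration, not the only way to do this. It assumes `weights=None` so that it runs without downloading anything, and the layer name `block_13_expand_relu` comes from the Keras MobileNetV2 implementation, so confirm it with `base.summary()` in your own environment.

```python
import tensorflow as tf

# weights=None keeps this sketch offline; use weights="imagenet" for real transfer learning.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3),
    include_top=False,
    weights=None,
)

# Final feature map: the tensor a head attached to base.output receives.
final_features = base.output

# An earlier feature map with a finer spatial grid (assumed layer name, see lead-in).
mid_features = base.get_layer("block_13_expand_relu").output

final_shape = tuple(final_features.shape)
mid_shape = tuple(mid_features.shape)
print("final feature map:", final_shape)
print("earlier feature map:", mid_shape)

# A truncated backbone ending at the earlier layer; a new head would attach to its output.
truncated = tf.keras.Model(base.input, mid_features)
```

Branching earlier gives the new convolutions a larger spatial grid to work with, at the cost of discarding the pretrained layers that come after the cut.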

Do Not Add Convolutions After Class Probabilities

The architectural mistake to avoid is trying to replace the final softmax output itself with more convolution layers. Once the network has already reduced the image to class logits or probabilities, the spatial structure that convolutions need is gone.

Convolutions expect feature maps shaped like height by width by channels. Softmax logits are class scores, not images or spatial feature tensors. So the right question is not "how do I change softmax into a CNN," but rather "at what feature-map stage should I branch off and attach a deeper head."
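The shape argument can be checked directly. This small sketch uses a zero tensor as a stand-in for backbone output (the 5 by 5 by 1280 size is an assumption matching MobileNetV2 at 160 by 160 input) and follows it through pooling and softmax to show where the spatial axes disappear.

```python
import tensorflow as tf

# Stand-in for backbone output: batch of 1, 5x5 spatial grid, 1280 channels.
feature_maps = tf.zeros((1, 5, 5, 1280))

# The usual shallow head: pool away height and width, then classify.
pooled = tf.keras.layers.GlobalAveragePooling2D()(feature_maps)
probs = tf.keras.layers.Dense(3, activation="softmax")(pooled)

feature_rank = len(feature_maps.shape)  # batch, height, width, channels
probs_rank = len(probs.shape)           # batch, classes

print("feature maps:", feature_maps.shape, "rank", feature_rank)
print("pooled:", pooled.shape)
print("probabilities:", probs.shape, "rank", probs_rank)
```

A `Conv2D` layer needs the rank-4 tensor. The probabilities have only a batch axis and a class axis, so there is no height or width left for a kernel to slide over.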

Train in Two Phases

When you add a deeper custom head, a two-stage training schedule is often safer:

  1. Freeze the backbone and train only the new head.
  2. Unfreeze part of the backbone and fine-tune with a low learning rate.

Example:

python
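import tensorflow as tf

# Assumes `model` and `base` from the previous example, plus tf.data datasets
# `train_ds` and `val_ds` that yield (image, integer_label) batches.

# Phase 1: backbone frozen, train only the new head.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(train_ds, validation_data=val_ds, epochs=5)

# Phase 2: unfreeze only the last 20 backbone layers.
base.trainable = True
for layer in base.layers[:-20]:
    layer.trainable = False

# Recompile so the trainable change takes effect, with a much lower learning rate.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(train_ds, validation_data=val_ds, epochs=5)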

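A freshly initialized head produces large, noisy gradients. Training it first, with the backbone frozen, reduces the risk that those gradients disrupt the pretrained feature extractor as soon as it is unfrozen.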
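One refinement that is often recommended for the fine-tuning phase is to leave the backbone's `BatchNormalization` layers frozen, because in Keras a frozen batch-norm layer runs in inference mode and keeps its pretrained moving statistics. The helper below is a hypothetical sketch of that idea, not part of the original pipeline: `unfreeze_top_layers` is a name introduced here, and `weights=None` is used only so the example runs offline.

```python
import tensorflow as tf


def unfreeze_top_layers(backbone, num_layers):
    """Unfreeze the last `num_layers` layers, but keep BatchNormalization frozen."""
    backbone.trainable = True
    # Everything before the cut stays frozen.
    for layer in backbone.layers[:-num_layers]:
        layer.trainable = False
    # Inside the unfrozen tail, re-freeze batch-norm layers only.
    for layer in backbone.layers[-num_layers:]:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            layer.trainable = False


# weights=None keeps the sketch offline; use weights="imagenet" in practice.
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3),
    include_top=False,
    weights=None,
)
unfreeze_top_layers(base, 20)

# Layers that will actually receive gradient updates.
trainable_names = [l.name for l in base.layers if l.trainable and l.weights]
bn_all_frozen = all(
    not l.trainable
    for l in base.layers
    if isinstance(l, tf.keras.layers.BatchNormalization)
)
print("trainable backbone layers:", trainable_names)
print("all BatchNormalization layers frozen:", bn_all_frozen)
```

As with any change to `trainable`, the model has to be compiled again afterwards for the new setting to take effect.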

Match Head Complexity to Data Size

A deeper CNN head introduces more trainable parameters. That can help on hard tasks, but it can also overfit quickly if the new dataset is small. Before adding multiple convolutional layers, ask:

  1. Is the dataset large enough to justify more capacity?
  2. Are the target classes visually different from the original pretrained domain?
  3. Would better augmentation solve the problem more simply?

Sometimes the softmax head is not the real bottleneck. The problem may be data quality, label noise, or preprocessing mismatch.
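To put a number on the extra capacity, the arithmetic below counts trainable parameters for the two heads shown earlier, assuming a 1280-channel backbone output (as in MobileNetV2) and 3 classes. The helper functions are illustrative names, not library calls.

```python
def conv2d_params(kernel, c_in, c_out):
    # One kernel x kernel x c_in filter per output channel, plus one bias each.
    return kernel * kernel * c_in * c_out + c_out


def batchnorm_trainable_params(channels):
    # gamma and beta per channel; the moving mean and variance are not trained.
    return 2 * channels


def dense_params(n_in, n_out):
    return n_in * n_out + n_out


backbone_channels = 1280  # assumed MobileNetV2 output channels
num_classes = 3

# Shallow head: global pooling (no parameters) + Dense softmax.
shallow_head = dense_params(backbone_channels, num_classes)

# Deeper head: Conv(128) + BN + Conv(64) + BN + pooling + Dense softmax.
conv_head = (
    conv2d_params(3, backbone_channels, 128)
    + batchnorm_trainable_params(128)
    + conv2d_params(3, 128, 64)
    + batchnorm_trainable_params(64)
    + dense_params(64, num_classes)
)

print(shallow_head)  # 3843
print(conv_head)     # 1549059
```

Under these assumptions the convolutional head has roughly 400 times as many trainable parameters as the shallow one, almost all of them in the first 3 by 3 convolution over 1280 input channels. Narrowing that first layer, or starting with a 1 by 1 convolution, is one way to keep the head small on limited data.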

Keep the Input Pipeline Consistent

Transfer learning results are highly sensitive to preprocessing. If you use a pretrained application model, apply its expected input preprocessing function.

python
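import tensorflow as tf

# MobileNetV2's preprocess_input expects pixel values in [0, 255]
# and rescales them to [-1, 1].
preprocess = tf.keras.applications.mobilenet_v2.preprocess_input

# Assumes train_ds and val_ds yield (image, label) pairs.
train_ds = train_ds.map(lambda x, y: (preprocess(x), y))
val_ds = val_ds.map(lambda x, y: (preprocess(x), y))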

Changing the head while forgetting input normalization often leads to disappointing results that look like an architecture problem.
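A quick range check can catch this kind of mismatch before training. The sketch below feeds a tiny hand-made "image" through MobileNetV2's `preprocess_input`, first with raw 0 to 255 values and then with values that were already divided by 255, which is a common accidental double scaling. The input arrays are made up for illustration.

```python
import numpy as np
import tensorflow as tf

preprocess = tf.keras.applications.mobilenet_v2.preprocess_input

# One 1x1 "image" whose three channels hold the darkest, middle, and brightest values.
raw = np.array([0.0, 127.5, 255.0], dtype="float32").reshape(1, 1, 1, 3)

# Correct usage: raw pixel values in [0, 255]. Pass a copy, since float NumPy
# input may be modified in place.
scaled = np.asarray(preprocess(raw.copy()))
print(scaled.ravel())  # [-1.  0.  1.]

# Mistake: the pixels were already scaled to [0, 1] before preprocessing.
double_scaled = np.asarray(preprocess(raw / 255.0))
print(double_scaled.min(), double_scaled.max())
```

In the second case every value ends up squeezed close to -1, so the backbone sees almost uniformly dark input even though the code runs without error.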

Common Pitfalls

  • Trying to add convolutional layers after class logits or softmax probabilities, where spatial structure no longer exists.
  • Assuming a deeper head is automatically better even when the dataset is too small to support it.
  • Fine-tuning the whole pretrained backbone immediately with a high learning rate.
  • Ignoring the preprocessing requirements of the pretrained base model.
  • Blaming the simple softmax head when the real issue is weak data or poor augmentation.

Summary

  • Replacing a simple softmax head with a deeper CNN means branching from feature maps, not from class probabilities.
  • Keep the pretrained backbone and add new convolutional layers before pooling and final classification.
  • Train the new head first, then fine-tune part of the backbone carefully if needed.
  • More head capacity can help, but it also increases overfitting risk.
  • Check data quality and preprocessing before assuming the architecture alone is the problem.
