CNN Image Recognition with Regression Output on TensorFlow

Introduction

Convolutional neural networks are usually introduced as classifiers, but the same architecture also works when the target is a continuous value. If an image should produce a number such as age, steering angle, or price estimate, you keep the visual feature extractor and replace the classification head with a regression head.

When an Image Task Becomes Regression

A CNN learns spatial patterns through convolution, pooling, and dense layers. None of that depends on the output being a class label. The main change is at the end of the network.

In a classification model, the last layer often uses softmax to produce class probabilities. In a regression model, the last layer usually has a single unit with no activation, or an activation chosen to match a known numeric range, such as a sigmoid for targets confined to [0, 1]. Training also changes from cross-entropy to a regression loss such as mse, mae, or Huber loss.
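As a minimal sketch of that change, the snippet below attaches two different heads to the same kind of convolutional backbone. The layer sizes, the 10-class count, and the names make_backbone, classifier, and regressor are illustrative assumptions, not part of any particular model.

```python
import numpy as np
import tensorflow as tf


def make_backbone():
    """Small convolutional feature extractor (sizes are illustrative)."""
    return tf.keras.Sequential(
        [
            tf.keras.layers.Input(shape=(64, 64, 3)),
            tf.keras.layers.Conv2D(16, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
        ]
    )


# Classification head: one unit per class, softmax, cross-entropy loss.
classifier = tf.keras.Sequential(
    [make_backbone(), tf.keras.layers.Dense(10, activation="softmax")]
)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Regression head: a single linear unit, trained with a regression loss.
regressor = tf.keras.Sequential([make_backbone(), tf.keras.layers.Dense(1)])
regressor.compile(optimizer="adam", loss=tf.keras.losses.Huber())

batch = np.zeros((2, 64, 64, 3), dtype=np.float32)
class_probs = classifier(batch).numpy()  # one probability row per image
values = regressor(batch).numpy()        # one unbounded real number per image
print(class_probs.shape, values.shape)
```

Only the last layer and the loss differ between the two models; the backbone definition is shared.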

This pattern is useful when local image structure explains a numeric target. Common examples include:

predicting the age of a person from a face image
estimating the steering angle for a self-driving system
measuring an object's width or position in pixels
forecasting a product quality score from a manufacturing image

The early layers still learn edges, textures, and shapes. Later layers compress those features into a representation that the final dense layer maps to a real number.

Building a TensorFlow Model

The example below is small enough to run anywhere TensorFlow is installed. It creates synthetic RGB images and trains a CNN to predict each image's average brightness. The task is simple, but the model structure is the same one you would use for a real image regression problem.

python

import numpy as np
import tensorflow as tf

# Synthetic data: 128 random RGB images with pixel values in [0, 1).
rng = np.random.default_rng(42)
x = rng.random((128, 64, 64, 3), dtype=np.float32)
# Target: the mean brightness of each image, one scalar per sample.
y = x.mean(axis=(1, 2, 3)).astype(np.float32)

train_ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(128).batch(16)

model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(64, 64, 3)),
        # The inputs are already in [0, 1), so the scale factor is 1.0.
        # For uint8 images in [0, 255], use 1.0 / 255 instead.
        tf.keras.layers.Rescaling(1.0),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        # Regression head: a single unit with no activation.
        tf.keras.layers.Dense(1),
    ]
)

model.compile(
    optimizer="adam",
    loss="mse",
    metrics=[tf.keras.metrics.MeanAbsoluteError()],
)

model.fit(train_ds, epochs=3, verbose=2)
# predict returns shape (3, 1); flatten it to compare against y[:3].
predictions = model.predict(x[:3], verbose=0).flatten()

print("Targets:", y[:3])
print("Predictions:", predictions)

Three details matter here. First, the final layer is linear because the target is continuous. Second, the labels are numeric scalars rather than one-hot vectors. Third, the metric is MeanAbsoluteError, which is reported in the same units as the target and is therefore easier to interpret than the squared-error loss during training.

For real data, the input pipeline usually comes from tf.data or tf.keras.utils.image_dataset_from_directory. The labels come from a CSV file, database, or filename mapping rather than the synthetic mean calculation used here.
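One way such a pipeline could look is sketched below. To stay self-contained it first writes four random PNG files and a labels.csv into a temporary directory; the file names, the filename and target column names, and the label values are all hypothetical stand-ins for a real dataset.

```python
import csv
import pathlib
import tempfile

import numpy as np
import tensorflow as tf

# Create a tiny stand-in dataset: four PNG files plus a labels.csv.
data_dir = pathlib.Path(tempfile.mkdtemp())
rng = np.random.default_rng(0)
rows = []
for i in range(4):
    pixels = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
    path = data_dir / f"img_{i}.png"
    tf.io.write_file(str(path), tf.io.encode_png(pixels))
    rows.append({"filename": path.name, "target": float(i) * 1.5})

with open(data_dir / "labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["filename", "target"])
    writer.writeheader()
    writer.writerows(rows)

# Read the CSV back, as you would with a real label file.
with open(data_dir / "labels.csv", newline="") as f:
    records = list(csv.DictReader(f))
paths = [str(data_dir / r["filename"]) for r in records]
targets = np.array([float(r["target"]) for r in records], dtype=np.float32)


def load_example(path, target):
    """Decode one image, resize it, and scale pixels to [0, 1]."""
    image = tf.io.decode_png(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (64, 64)) / 255.0
    return image, target


real_ds = (
    tf.data.Dataset.from_tensor_slices((paths, targets))
    .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(2)
)

images, labels = next(iter(real_ds))
print(images.shape, labels.numpy())
```

The resulting dataset yields (image, scalar) pairs, so it can be passed to model.fit in place of the synthetic train_ds. A real pipeline would normally add shuffling and a train/validation split.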

Preparing Labels and Evaluating Results

Input preprocessing needs to stay consistent between training and inference. If the training pipeline rescales pixels, resizes images, or converts color channels, prediction code must do the same. Regression models are especially sensitive to mismatched preprocessing because even small shifts in numeric scale can move the output noticeably.
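A simple way to keep the two paths aligned is to route both through one shared function. The sketch below assumes uint8 input images and uses a hypothetical toy_model (mean brightness) as a stand-in for a trained network, purely to make the effect of a scale mismatch visible.

```python
import numpy as np


def preprocess(image_uint8):
    """Single source of truth for pixel scaling: uint8 [0, 255] -> float32 [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0


def toy_model(image):
    """Hypothetical stand-in for a trained regressor: returns mean brightness."""
    return float(image.mean())


raw = np.full((64, 64, 3), 128, dtype=np.uint8)

consistent = toy_model(preprocess(raw))         # same scaling as training
mismatched = toy_model(raw.astype(np.float32))  # forgot to divide by 255

print("consistent:", consistent)
print("mismatched:", mismatched)
```

A real CNN will not respond linearly like this toy function, but the lesson is the same: inputs on the wrong scale fall outside what the model saw during training.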

The target can also benefit from scaling. A model predicting values around 0.5 trains differently from one predicting values around 500000. Standardizing labels can make optimization more stable, as long as you reverse the transform after prediction.

python

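import numpy as np

# Training labels in their original units.
prices = np.array([120000.0, 180000.0, 250000.0], dtype=np.float32)
mean = prices.mean()
std = prices.std()

# Standardized labels: this is what the model is trained on.
scaled_prices = (prices - mean) / std
# Example model outputs in standardized units (illustrative values).
predicted_scaled = np.array([0.15, -0.30], dtype=np.float32)
# Reverse the transform to get predictions back in original units.
predicted_prices = predicted_scaled * std + mean

print(predicted_prices)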

Evaluation should also match the business problem. MAE is good when average error is easy to explain. RMSE is better when large misses are especially costly. Huber loss is often a solid middle ground because it behaves like squared error for small mistakes and like absolute error for larger ones.
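To make the difference concrete, here is a small NumPy sketch that scores two invented sets of predictions, one with four small misses and one with a single large miss. The numbers are made up for illustration, and delta=1.0 is an assumed Huber threshold.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic below delta, linear above it."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))


y_true = np.array([10.0, 12.0, 14.0, 16.0])
close = np.array([10.5, 11.5, 14.5, 15.5])    # four misses of 0.5
outlier = np.array([10.0, 12.0, 14.0, 24.0])  # one miss of 8.0

for name, pred in [("close", close), ("outlier", outlier)]:
    print(name, mae(y_true, pred), rmse(y_true, pred), huber(y_true, pred))
```

With the single large miss, RMSE grows to twice the MAE, while the Huber value stays below the MAE because the large error is penalized linearly rather than quadratically.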

Common Pitfalls

Using softmax or sigmoid in the last layer for an unrestricted numeric target. Unless the output has a known range, a linear output layer is usually the right default.
Forgetting to normalize inference images exactly as training images were normalized. Different pixel scales can make predictions drift badly.
Ignoring label scale. Very large target values can make training slower or less stable if you never standardize the output.
Reporting classification accuracy for a regression task. A prediction of 41.8 for a true value of 42.0 is useful even if it is not an exact match.
Building a very large dense head on top of a small dataset. That often overfits faster than the convolutional stack itself.
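The accuracy pitfall in the list above can be shown in a few lines. The values below are invented for illustration, and the 0.5 tolerance is an arbitrary assumption.

```python
import numpy as np

y_true = np.array([42.0, 30.0, 55.0])
y_pred = np.array([41.8, 30.3, 54.9])

# Exact-match "accuracy" treats every near miss as a failure.
exact_match_accuracy = float(np.mean(y_pred == y_true))

# Regression metrics measure how far off the predictions are.
abs_errors = np.abs(y_pred - y_true)
mean_abs_error = float(abs_errors.mean())
within_tolerance = float(np.mean(abs_errors <= 0.5))

print("exact-match accuracy:", exact_match_accuracy)
print("MAE:", round(mean_abs_error, 3))
print("fraction within 0.5:", within_tolerance)
```

Every prediction here is within 0.5 of its target, yet exact-match accuracy reports a complete failure.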

Summary

A CNN can perform regression by keeping the image feature extractor and replacing the classifier with a linear output layer.
Use losses such as mse, mae, or Huber depending on how you want to penalize error.
Keep preprocessing consistent between training and prediction.
Scale labels when the target range is large enough to make optimization unstable.
Evaluate with regression metrics such as MAE or RMSE, not classification accuracy.