Caffe Multiple Input Images

Introduction

Caffe can handle multiple input images, but you have to model them as multiple blobs or multiple branches in the network definition. The exact layout depends on the task: stereo matching, siamese similarity, image-plus-mask input, or multi-view classification all use the same basic idea but combine the streams differently.

Model Multiple Images as Separate Inputs

In Caffe, every input arrives through a named blob. If the network needs two images for each sample, define two input blobs and then either process them separately or concatenate them before later layers.

A minimal .prototxt example looks like this:

protobuf
name: "pair_net"

# Two input blobs; the first input_shape belongs to "left", the second to "right".
input: "left"
input_shape { dim: 1 dim: 3 dim: 224 dim: 224 }
input: "right"
input_shape { dim: 1 dim: 3 dim: 224 dim: 224 }

# Stack the two images along the channel axis (axis 1).
layer {
  name: "merge"
  type: "Concat"
  bottom: "left"
  bottom: "right"
  top: "merged"
  concat_param { axis: 1 }
}

# Fully connected layer on top of the merged blob.
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "merged"
  top: "fc1"
  inner_product_param { num_output: 128 }
}

This is not the only architecture, but it shows the essential point: Caffe does not think in “one image” or “many images.” It thinks in blobs.
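To see what a channel-axis Concat does to the blob shapes, the same operation can be mirrored in NumPy. This is only a shape sketch and does not call Caffe; the array names are illustrative.

```python
import numpy as np

# Stand-ins for the "left" and "right" input blobs, in (N, C, H, W) layout.
left = np.zeros((1, 3, 224, 224), dtype=np.float32)
right = np.ones((1, 3, 224, 224), dtype=np.float32)

# Concat with axis: 1 stacks the two blobs along the channel axis.
merged = np.concatenate([left, right], axis=1)
print(merged.shape)  # (1, 6, 224, 224)

# InnerProduct flattens everything after the batch axis by default,
# so each sample contributes C * H * W inputs to the layer.
fc_inputs = int(np.prod(merged.shape[1:]))
print(fc_inputs)  # 301056
```

With 128 outputs, that fully connected layer would hold roughly 38.5 million weights, which is one reason practical networks usually place convolution and pooling layers between the Concat and the first InnerProduct layer.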

Siamese Networks Usually Use Two Branches

If both images should pass through the same feature extractor, a siamese design is often better than early concatenation. Each image goes through its own branch, but the branches share weights so the extracted features live in the same representation space.

In Caffe, weight sharing is typically done by giving the corresponding layers in each branch identical parameter names, set through the name field of each layer's param block. After that, the branch outputs can be compared with contrastive loss, concatenated, or passed into another classifier.
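The naming convention can be sketched with a small Python helper that emits the prototxt text for one convolution layer per branch. The helper `conv_layer`, the layer names, and the parameter names `conv1_w` and `conv1_b` are all hypothetical choices for illustration; fillers and other settings a real network needs are omitted.

```python
def conv_layer(name, bottom, top, weight_name, bias_name,
               num_output=20, kernel_size=5):
    """Return prototxt text for one Convolution layer with named parameters.

    Hypothetical helper: it only formats text and does not call Caffe.
    """
    return f'''layer {{
  name: "{name}"
  type: "Convolution"
  bottom: "{bottom}"
  top: "{top}"
  param {{ name: "{weight_name}" lr_mult: 1 }}
  param {{ name: "{bias_name}" lr_mult: 2 }}
  convolution_param {{ num_output: {num_output} kernel_size: {kernel_size} }}
}}
'''

# The two layers have different layer names and different bottoms,
# but identical param names, so they refer to the same weights and biases.
branch_a = conv_layer("conv1", "left", "conv1", "conv1_w", "conv1_b")
branch_b = conv_layer("conv1_p", "right", "conv1_p", "conv1_w", "conv1_b")

print(branch_a + branch_b)
```

Only the param names need to match. Layer names and top names stay distinct so that each branch keeps its own activations.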

That architecture is common when the two images play the same role, such as in face verification or patch similarity. If the two images are different kinds of input, such as an RGB image and a segmentation prior, separate non-shared branches may be a better fit.

Feed Multiple Images from Python

At inference time, using the Python interface is often the easiest way to fill several input blobs directly.

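python
import caffe
import numpy as np

# Load the two-input network in test (inference) mode.
net = caffe.Net("pair_net.prototxt", "weights.caffemodel", caffe.TEST)

# Random placeholder data shaped (batch, channels, height, width).
left = np.random.rand(1, 3, 224, 224).astype(np.float32)
right = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Copy each array into its named input blob, then run a forward pass.
net.blobs["left"].data[...] = left
net.blobs["right"].data[...] = right
output = net.forward()  # dict mapping output blob names to arrays

print(net.blobs["fc1"].data.shape)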

The important detail is that both blobs need a batch dimension. Even if you are sending one pair of images, the array shape is still (1, channels, height, width).
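Images loaded from disk usually arrive as (height, width, channels) arrays, so they need to be reordered and given a batch axis before they match the blob layout. A minimal sketch, assuming a hypothetical helper named `to_blob`; any channel reordering or mean subtraction your model expects is left out.

```python
import numpy as np

def to_blob(image_hwc):
    """Convert an (H, W, C) image into a (1, C, H, W) float32 array."""
    chw = np.transpose(image_hwc, (2, 0, 1))   # move channels first
    batched = chw[np.newaxis, ...]             # add the batch axis
    return np.ascontiguousarray(batched, dtype=np.float32)

# Placeholder image standing in for one decoded input.
image = np.random.rand(224, 224, 3)
blob = to_blob(image)
print(blob.shape)  # (1, 3, 224, 224)
```

The resulting array can be assigned to an input blob of the same shape, one call per image in the pair.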

Keep the Data Pipeline Consistent

If you train with paired inputs, the data loader must preserve pairing all the way through preprocessing. Random crops, flips, and normalization often need to be synchronized across the two images. Otherwise the network sees mismatched pairs that do not correspond to the intended task.

This matters especially for stereo or before-and-after inputs. If one branch gets a random crop and the other does not, the network may learn artifacts from preprocessing instead of the signal you actually care about.
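One way to keep a pair aligned is to draw the random augmentation parameters once and apply them to both images. The sketch below assumes (C, H, W) NumPy arrays of equal size and a hypothetical helper named `paired_crop_flip`; it is not part of Caffe's own data layers.

```python
import numpy as np

def paired_crop_flip(left, right, crop, rng):
    """Apply one random crop and one random horizontal flip to both images."""
    _, height, width = left.shape
    # Draw the parameters a single time so both images share them.
    top = int(rng.integers(0, height - crop + 1))
    x0 = int(rng.integers(0, width - crop + 1))
    flip = bool(rng.integers(0, 2))

    out = []
    for img in (left, right):
        patch = img[:, top:top + crop, x0:x0 + crop]
        if flip:
            patch = patch[:, :, ::-1]  # mirror along the width axis
        out.append(np.ascontiguousarray(patch))
    return out[0], out[1]

rng = np.random.default_rng(0)
left = np.random.rand(3, 256, 256).astype(np.float32)
right = left.copy()  # identical images make the alignment easy to check

a, b = paired_crop_flip(left, right, 224, rng)
print(a.shape, np.array_equal(a, b))  # (3, 224, 224) True
```

For stereo pairs specifically, a horizontal flip reverses the direction of disparity, so some pipelines skip the flip or swap the two views when flipping.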

Common Pitfalls

  • Treating multiple input images as one blob without defining how they should be combined.
  • Forgetting the batch dimension when writing data into input blobs.
  • Building a siamese network without sharing weights across the parallel branches.
  • Applying inconsistent preprocessing to image pairs that should stay aligned.
  • Concatenating inputs too early when task-specific separate branches would preserve more structure.

Summary

  • In Caffe, multiple input images are represented as multiple named blobs.
  • You can merge the inputs early with Concat or process them in separate branches.
  • Siamese setups usually share weights across the two branches.
  • Python inference is straightforward once each blob is filled with the correct shape.
  • The data pipeline must keep paired images aligned, or the model will learn the wrong thing.
