Building a reverse image search engine with convolutional neural networks

Grid Dynamics
Apr 17, 2018

Mr and Mrs. Smith are a regular middle-class couple in the process of remodeling and re-decorating their home. After a long and tireless search on Pinterest and other lifestyle websites they finally found a concept for their new interior decoration plans. The issue was that actually finding and buying these products online was extremely frustrating and challenging due to the limitations of e-commerce product discovery.

Using the text search engine the Smiths tried in vain to describe the lifestyle products they fell in love with but could never find the right items in the search results. Filtering the catalog by product category and manually searching was extremely time consuming and also not successful. After several similar experiences on a few other e-commerce sites the Smiths gave up and had to consider either browsing a bunch of brick and mortar stores or not pursuing their concept at all.

Retailers are becoming more aware of how these negative customer experiences hurt their sales revenue and turn people away from their sites. In particular, retailers are zeroing in on the limitations of text based search. Even if you have a very clear idea of what you’re looking for, it can be difficult to describe some types of products accurately. This is especially true for decor and lifestyle products, where certain features or elements of a product that are unique can be hard to meaningfully describe in a search bar.

One of our customers that offers decor and lifestyle products was facing this exact problem and was reporting that many of their stocked items were not being found via text based search and that they were getting very low conversion rates in their online store. Wanting to boost their sales, our customer was interested in enabling their customers to search for products using images rather than just text.

The idea involves people who are looking to interior decorate being able to find design concepts online from sites such as Pinterest, then using the photos they found to automatically match items from a catalog. Our customer also wanted to have the ability to recommend products to them that shared similar features such as color, shape, or pattern to their sample images. This would provide a good method of upselling and additionally allow the search engine to be more strategic by actually being able to find products that customers were looking for.

Powering the solution with Convolutional Neural Networks

Our solution was to build a reverse search engine that would be powered by a convolutional neural network. In recent years, deep learning has made incredible improvement in the ability to recognize different types of objects by using convolutional neural networks. This extends to being able to recognize objects independently of their position, viewpoint, or background.

A standard approach would involve taking a pretrained model that could find similar objects using metrics like Euclidean Distance. However, that method would not be effective for this type of project because the same object rotated to a different angle or against a different background would produce different results. We needed to develop a solution that would be able to recognize similar objects as well as be able to provide visual recommendations.

Figure 1 below demonstrates how our Reverse Search Engine works. The left panel shows the photo input from which the objects need to be extracted for identification. This is achieved by using a feature vector from ImSeNet and then a search can be conducted for similar products from an appropriate category.

high level overview of reverse image search solution

The visual search recommendation system needs to be able to retrieve a similar image from a catalog (usually an object against a white background) based on images of objects in varying environments (such as lifestyle photos) uploaded by the user. This task is highly complex because the query image object can be of a different size, orientation or shot against different types of backgrounds or in varied lighting conditions, as can be seen with the three different images of chairs in Figure 2 below.

examples of query images

The entire solution is comprised of the following related parts:

  1. Learning similarity metric
  2. Network architecture
  3. Training dataset
  4. Evaluation metrics
  5. Results

Establishing the learning similarity metric

There are numerous approaches that can be used to perform a visual similarity search. Some build the feature representation based on different types of descriptors of the images and use Compressed Histogram of Gradients (CHoG) or Vector of Locally Aggregated Descriptors (VLAD) methods or complex combinations of patterns and colors. Other types of solutions rely on the ability to automatically learn to produce the features directly end-to-end from the input image.

The learning similarity metric is needed to learn to produce embedding such that the distance metric between the query image and its original object image is smaller than between the same query image and any other original object image in a feature space.

For neural network training, using a triplet of three images <q, p, n> - query, positive and negative images, this is achieved by:
L = Σ Lp (q, p) + ΣLn(q,n)

Triplet loss function consists of two penalties -Lp penalizes a positive pair if the distance metric is too big andLn penalizes a negative pair if the distance metric is less than the margin.

There are a few variants of the final triplet loss function, which are described in detail in several recent papers:

In our work, we use the variant:
L = 1/n Σ max(0, m + Dp - Dn)
Where Dp - Euclidean Distance between q and p, Dn - Euclidean Distance between q and n.

This formula works where Dp is too small and Dn is higher than m,in which case Dp - Dn <0 is negative and the final loss will be 0. This results in a beneficial learning process for the network. In other cases, if the loss is not 0 we need to minimize it. This involves minimizing Dp and maximizing Dn, which makes our network learn the similarity between images. m- margin parameter for fine tuning training can take different values but takes into consideration vector normalization, which affects that hyperparameter.

Making the margin can produce problems of instability or divergence. If the margin is too small, it can make learning too slow or collapse for all products. We chose m = 1 because it makes the feature space more stable. One option is to use this loss function, also called a contrastive or hinge loss, as it has a different role as detailed here:

Minimizing loss in feature space

Neural network architecture

The most efficient way to compute and minimize contrastive loss is to use a Siamese Neural Network. Siamese Neural Network architecture contains two or more identical networks with shared parameters and weights. Subnetworks are used to process multiple inputs, then their output is combined using a different module. This network is used for direct training of the problem we are trying to solve but it cannot be used to resolve all problems as we can only train it to determine the similarity of the three images.

Under the hood, our Siamese Neural Network uses a Convolution Neural Network (CNN). If you are not familiar with CNNs, you can refer to this link. CNNs are used to process images and consist of three layers: convolution, pooling, and rectifications. CNNs in recent years have become very good at tasks such as object classification and detection.

In our work, we experimented with a lot of different architecture types. These architecture types can be separated in two ways. One is the original existing CNN architecture type (let’s call it a pre-trained CNN), which uses pre-trained weights on Imagenet. The second is an assembly of small pre-trained CNNs.

There are currently many different CNN architectures. We had some restrictions on which ones were viable as we wanted to use a pre-trained model. We ultimately decided on the use of Keras as it has some pre-trained CNN architectures that are compatible with Tensorflow.

Model size Top-1 accuracy Top-5 accuracy parameters depth
Xeception 88 MB 0.79 0.945 22910480 126
VGG16 528 MB 0.715 0.901 138357544 23
VGG19 549 MB 0.727 0.91 143667240 26
ResNet50 99 MB 0.759 0.929 25636712 168
InceptionV3 92 MB 0.788 0.944 23851784 159
InceptionResNetV2 215 MB 0.804 0.953 55873736 572

We chose Xception, InceptionV3 and InceptionResNetV2 for comparison purposes and used them without a top layer, i.e. the last layer would be a dense layer. We then introduced a shallow layer to be able to extract low-level details from the image.

A CNN can be thought of as a features identifier, with each new layer being trained to determine more high-level features. Using an image of a ship as an example, the first layer is only able to detect curves or some lines, while the next layer is able to detect a combination of curves. As you go deeper, CNN is able to recognize the mast and the ship’s sails and finally the whole ship.

It was our assumption that this parallel combination of low and high-level features would produce superior results. We also experimented with a large number of different parameters such as regularization, activation functions, batch normalization etc. Later we will describe how it is possible to evaluate the accuracy of the network.

Network high-level architecture overview. Siamese Network + ImSeNet ensemble version.

The final important question is: do we need to freeze the pre-trained CNN and retrain only the last layer, the last N layers, or retrain the whole network? To answer this, we needed to choose the correct dimension space. We tried multiple dimensions - {256, 512, 1536, 2048}. We also needed to avoid the classical problem, The Curse of Dimensionality, in order to not lose any unique features of the product that would be necessary to distinguish it.

Creating a training dataset

Training the network requires query, positive and negative images. A query image is an image of a product within a random scene that we refer to as a lifestyle image. A product image is an image of that product against a white background with no other objects in the scene. The negative image is a product image of any other product within the same product category.

There are several different types of negative images. This includes in-class negative images, which teach the network to pay attention to low-level nuanced differences, like material or texture. Another type is out-of-class negative images, which train the network to pay attention to shapes and colors.

When working with an existing retail customer, you already have access to the structured data that can help you collect the training dataset. Otherwise, it can be difficult to collect positive pairs and hard negative examples.

An example triplet used for training

In some instances we have access to several images of a single product and if we are lucky, at least one of the images will be a lifestyle image with a non-white background. In that case, it is possible to use it as a positive pair and collect the triplet in that way: <q - product #123 lifestyle image, p - product #123 white background image, n - product # != 123 white background image >.

Another problem involves how unsuitable images can be filtered out, such as where only a small part of the product is visible in the image, i.e. a single chair leg. To address this, we used object detection. This is because when you are processing a product’s images, you know its category or label and if object detection finds an object with an appropriate label it means the image contains the desired product. Object detection helps the search to only recognize the target object from an image that may contain several different objects as shown in Figure 1.

Images without a white background are marked as lifestyle images and used as a query in a triplet. For that, we build a histogram of the pixel’s color for each image. For each positive pair, we use 10 to 15 in-class negative images and 1 to 3 out-of-class negative images. Some negative images are not truly hard negative examples as we can choose random products from the same category.

The process can then be further improved after the network has been initially trained by re-using images to build hard negative examples. For each positive pair, 100 to 1000 similar products can be selected using the trained network. Random augmentation of the query can then be used to extend the training dataset by performing actions such as image flipping, rotation, blurring, cropping, adding noise, or randomizing pixels.

Evaluation metrics

We also developed a series of metrics to help evaluate the results and accuracy of the network training process. The first we named absolute_accuracy. This metric shows how many Dp < Dn that we are trying to minimize. However, this metric can only calculate the accuracy of each triplet whereas we want to be able to find exactly the right product among thousands of others.

The next metrics developed were distance_accuracy and distance_accuracy_top_3 (Figure 7). Here, we take N products of the same category and build a distance matrix q × p. A diagonal element is a q and p image of the product. The model processes each image and vectorizes it. From that vectorization, two values are created, P and Q, so images that have similar P and Q values will also look similar. Metrics that indicate good results show that images that have been determined to be similar will have numerically small differences between the P and Q values.

This type of metric gives a more realistic result for the actual accuracy of our model because the real challenge is to find an exact or very similar product to the query image among many other products from the same category. Each row of the distance matrix consists of the distances between the query image and other validation images from the same category where the diagonal element is the exact product for that query image. We then calculate it for n objects to determine the level of accuracy.

Distance matrix

Finally, we selected two original candidates: InceptionResNetV2 and ImSeNet (Figure 5). We couldn’t rely only on evaluation metrics, which is why we also used images not from the customer catalog and checked the results manually to further assess the quality of the models.


Our final decision was to use just the triplet network with pure InceptionResNetV2. We discovered that the neural network effectively learned to overcome the challenges posed by different backgrounds and lighting levels. And even for rotated objects or partially obscured objects, the network performed well. We achieved a great result using state-of-the-art algorithms.

One of the unresolved problems was the ability to say definitively that if the distance between the P and Q values was larger than a set value, then the two products were not considered to be similar. This is a difficult problem to solve because sometimes we need to use an image containing objects that we don’t have any similarities for but still want to try and find the closest match.

query image and search results

Future improvements

The following investigations could be undertaken to further improve the accuracy of the search results:

  1. Use 3D object models to produce a much bigger training dataset.
  2. The neural network could be trained to ignore different backgrounds for the same product. The way to achieve this is to use object segmentation, where a mask is applied and the object extracted from the background. The risk however with using object segmentation is that there is a high risk when using it of corrupting the product image itself, such as cutting off some parts of the object.


Just like our customer, many other retailers are also finding that their text based search engines are not sufficient for accurate product searches and are looking to image search and more capable recommendation engines for improved product discoverability. Additionally, the advance of mobile technologies has led to new use-cases like customers being able to take pictures of products that they want to buy with their smartphones and being able to search with the image. This is a development that retailers are keen to capitalize on.

Companies face the challenge of finding a suitable image search model, training it, then applying it to other mobile applications or integrating it with third-party services. These services include the ability to search for products within a category, search across different categories, search for further uses of a product, and search for similar products from other retailers.

Our solution of using catalog data, a special training pipeline, and training Siamese CNN to produce embedding of product images for different categories could also be applied effectively for other companies. This is especially true for use cases where it would be beneficial to have a model that can be used for categories or products that were not a part of the training of the model.

In this project, we experimented with a large number of different architectures and parameters. We don’t claim to have found the absolutely most optimized system because this task would require a huge amount of computational resources. The system could be further improved by using a larger training set, which would allow for continued advances in optimization. Overall, as image search becomes more and more common, companies will need to adopt reverse search engines similar to the one we developed or risk being left behind.

Leave us a comment, we would love to know what you think

Love Tech? Keep in touch with new posts by Grid Dynamics Subscribe

Subscribe for updates

Subscribe today and get bleeding edge technology updates on machine learning, big data, AI, e-commerce, and QA hot off the press.