Expanding product discovery with ML-powered image similarity search

Grid Dynamics
May 30, 2018

With an online catalog of over 1M SKUs and a multi-billion dollar e-commerce store, things were going well for the online retailer. But then again, they could always be better. Their search engine’s relevancy was not ideal, hurting their shoppers’ ability to discover the right products and lowering conversion rates.

The core problem was imperfect product attribution of catalog data. The attributes were coming from the e-commerce system in a form of text fields that contained a sizable number of errors or omissions. This time, the product management team decided to try something new - use product images as a data source to extract the attributes from the visual product features, such as color, style, height, shape and more.

At the same exact time, a different idea took shape - perhaps it was also possible to use visual similarity between product images to offer customers an entirely new and exciting way to discover products similar to those they “like”. Upon further examination, it turned out that the underlying technology is nearly identical in both cases. The product management team brought in Grid Dynamics as they wanted our ML experience to help them build an entirely new type of search engine, based on visual product similarity.

In the blog we’ll share how such an engine can be designed and trained to address visual similarity search and automated image-based catalog attribution using computer vision and machine learning.

Similarity use cases

Not all shoppers know exactly what they want when they shop. Many of them have a rough idea of what they are looking for but are also interested in comparing items. In many domains, such as apparel or decoration, product look is a major factor in making a purchasing decision. Therefore, an accurate way of determining visual similarity is of great value to online retailers.

For product discovery, the main feature involves the ability to search for a similar-looking item in a catalog given a sample or an “anchor” item:

example of anchor product

Here you can see how the model can pick up the main characteristics of the sample dress and find similar looking short striped dresses from the catalog. This feature can also be termed “visual recommendation” or “show me more like this”.

Once you have determined the visual similarity of your products, you can also unlock several other powerful features, such as “sort by similar”. This allows the customer to sort the category or the search result not only by common sorting criteria such as price or rating, but also by similarity to a particular “anchor” product, which can be explicitly selected by the customer with a “like” button on the UI:

example of sort by similar

As can be seen above, the category can then be resorted by visual similarity to the anchor product, with respect to applied keywords and filters. In a more sophisticated case, the “anchor” product can be implicitly derived from the customer behaviour, such as shopping cart content, session view history, add-to-bag or order history.

Another interesting feature is that similarity can be sorted against multiple anchor products at once, essentially finding matches that have features of both items:

example of mutiple anchors

You can play with a sandbox demo here that features a catalog of dresses, toys, laptops and TVs.

Now that we’re clear on the main use cases, let's take a look under the hood and see how it all works.

Under the hood

As much as we like to think that our computers can “think” and “understand”, in reality that’s not how they work. They are just exceptionally good at remembering things and crunching numbers very, very quickly. For a computer, a picture of a dress is just a set of neatly arranged bytes. Needless to say, searching for other images with similar bytes is very unlikely to yield similar looking images.

Instead, we need to somehow convert those images into a different numerical representation that captures the essence of the image’s content and style. This conversion, known as vectorization, should work in such a way that similar looking images will be mapped to similar vectors. Once images are converted into vectors, we have a wealth of algorithms to search for other similar vectors, thus finding similar images.

image-similarity searcher diagram

This approach to similarity search is known as a vector space model and is traditionally applied to text-based similarity. However, nothing prevents us from applying the same principles to other objects, as long as we can find a proper vectorization model.

The vector space model approach is simple and powerful but can be slow if applied one at a time to many single images. Applied directly to millions of products in a catalog it can take a long time to search through them all to find similarities to an anchor image. The main reason for this is the Curse of Dimensionality. All nearest neighbor search algorithms eventually need to calculate distances between the vectors, which becomes slower as vector dimensionality grows. As any meaningful representation of an image requires hundreds or even thousands of dimensions, the vector space model approach needs to be combined with other external processes to perform well in most practical use cases.

One common technique to speedup nearest neighbor search is dimensionality reduction. Algorithms like Principal Component Analysis can reduce dimensionality of data while minimizing loss of information from dropped dimensions. This makes it possible to reduce vector sizes from thousands to hundreds without a noticeable loss of search quality.

Another approach to speed up the process is to use a traditional search engine to perform results pre-filtering. Using an inverted index, a search engine can very quickly find good candidate matches for similar products, e.g. dresses if the anchor image is a dress, which prevents the visual similarity search needing to be performed on each of the millions of products in the catalog. Instead, it can be run only on the products highlighted by the search engine results. This approach of combining high quality but slower similarity ranking with low quality faster similarity search is called search results re-ranking.

At a high level, the solution architecture is as follows:

visual search reference architecture

  • Catalog Vectorizer - this component is responsible for processing all images in the catalog and converting them to vectors using an Image Vectorizer, with the resulting vectors stored in the Image Vector Store.
  • Dimensionality Reduction - this applies a PCA or other similar algorithm to squash high dimensional vectors into a lower number of dimensions more convenient for similarity search.
  • Catalog Image Indexer - this creates operational indexes (data structures to support fast search of nearest neighbor vectors) for the Visual Search service to use.
  • Navigation service - this uses a Keyword Search Service to retrieve keyword search results or category navigation results. Those results are then re-ranked using a Visual Search Service to sort products according to the visual similarity to the anchor products.

Vectorization model

By far the most important part of the visual similarity feature is a vectorization model. Since being introduced to wide community in 2012, Convolutional Neural Networks (CNN) have taken the visual recognition industry by storm, continually pushing the envelope on tasks of image classification and object detection. There are a lot of different CNN architectures available, created by a host of different research groups. Description of CNN internals is beyond the scope of this post but there are a lot of excellent explanations available. This 3D demo gives a good introduction to the internals of CNN.


The arXiv:1605.07678 shows a nice comparison of popular CNN architectures.

For our visual similarity project we wanted to use Convolutional Neural Networks to extract latent features of images, e.g. features that are very important for capturing image meaning, can be used to compare images, yet are impossible to map to well-known “human” categories like length, color or shape.

CNNs internally do exactly that - in the training process they “learn” which features are important for the task at hand, encoding this information into trained network weights. All we need to do is to forward propagate an image through the trained network and take neuron activations on some hidden layer as a vector encoding the image. The exact choice of the hidden layer or hidden layers to extract features from depends on the application.

The deeper we go in the hierarchy of layers from the input image, the more information about the content of the image we obtain, e.g. our vectors encode shapes and their arrangement. Shallow layers can be used to obtain encoding of style of the image, e.g. texture, palette and other spatially distributed features.

The simplest way to get started with extracting image features is Transfer Learning, e.g. to use pre-trained network weights that are publically available. Those weights are typically obtained by training CNN towards image classification goals on a large public training dataset such as imagenet. If your images are very different from images available on imagenet, or you are a lucky owner of a large and well-attributed image catalog, you will be able to retrain the model from scratch to obtain network weights finely tuned for your kind of images. Still, it is always a good idea to start with pre-trained weights to have a solid performance baseline to compare to.

In order to make an informed choice of vector embedding approach, we used a simple search quality assessment method based on gathering a “golden set” of anchor images and quality assessments about how similar some other images are in the catalog to the anchor image. Once such a dataset is available, it is easy to score every vectorization implementation using a standard search quality scoring technique like Discounted Cumulative Gain.

For the purposes of apparel search, the shape and arrangement of the image is more important, thus we chose the last fully connected “bottleneck” layer as a source of image vectors. For other applications, such as a fine art search, more style-biased embeddings are generally used.

We evaluated a number of possible embeddings and found that Inception V3 performed the best for our case. Even though Inception V4 is a more recent model, it didn’t show better performance, which was surprising. One plausible explanation is that V3 produces larger embeddings and is able to capture more latent features.

Similarity Search

Once we optimized the vectorization process and obtained vectors for every image in the catalog, the next step was to select a similarity search algorithm. In other words, given a vector, we should be able to find K “closest” vectors in a multidimensional vector space. To define exactly what “closest vectors” means we have to decide on a formula to calculate a distance between the vectors. There are many different ways to define vector distance, called metrics. One of the most well-known metrics is Euclidean distance, with other popular metrics being Cosine and Manhattan metrics, which correspondingly measure angle and per-coordinate distances between vectors. We used Euclidean distance as a baseline.

Once the vector distance is defined, we can use a brute force algorithm to calculate vector distance between the anchor vector and each of the other vectors in the catalog. We can quickly obtain the top K closest neighbors in an almost linear time from the size of the catalog but the complexity of calculation of the distance itself, growing proportionally to the dimensionality of the data, dominates the complexity of the search.

There are a number of algorithms (kd-tree and ball-tree) that try to address this problem by cleverly partitioning multi-dimensional space and presenting it as a tree structure at indexing time, thus avoiding the necessity to calculate distances to all of the vectors in a catalog at query time. The advantage of those algorithms is that they still produce exact results, e.g. functionally identical to brute force approach. However, efficiency of such data structures quickly diminishes with the number of dimensions in the image vectors, so for very large embeddings they become little more than a fancy wrappers over brute force vector space search.

We ended up using NMSLIB, which makes a tradeoff by obtaining an approximate nearest neighbors result, e.g. it can occasionally miss some actual nearest neighbors, sacrificing recall. Still, the performance gain is tremendous and it was an acceptable trade-off for our application.
You can always evaluate the level of mistakes created by approximate nearest neighbor search by comparing it with brute force results to keep algorithm recall under control.

Expanding product similarity

As we can see, CNN-based vectorisation empowers us to tackle visual similarity of the products. But why stop there? We can still look back at the product attributes and textual product information and see if we can vectorize this data as well. Attribute data, being mostly categorical, can be vectorized using one-hot encoding, e.g. all possible values of an attribute are listed as a vector dimension and marked as 1 if a given product has such an attribute value.


For textual fields, we can use semantic encoding available from word2vec or fastText models. Again, transfer learning can be used to get word embeddings obtained from training those networks on really large corpora, such as Wikipedia. Attribute vectors may play an important role in product comparisons and in situations when visual similarity does not mean much, for example in electronics.

example of feature similarity search

Resulting vectors can be used separately or concatenated to obtain a compound vector embedding of the product based on all data available about a product. This should allow similarity search to tap into all features of the image - hidden features from product images, textual description and explicit product attributes.


Product images and product data encoding powered by deep learning models unlocks a trove of new customer facing features: starting from the “more like this” search or sort by visually similar feature. Product vectors encoding visual features can boost product recommendations and be used to automatically improve catalog attribution. All this makes image recognition capability an indispensable part of the e-commerce enterprise ecosystem.

If you are one of the many e-commerce companies with a large product catalog and are engaged in a never-ending fight for better product attribution, or if you are looking to give your customers new and exciting ways to discover your products, give us a call, we would be happy to help!

Leave us a comment, we would love to know what you think

Love Tech? Keep in touch with new posts by Grid Dynamics Subscribe

Subscribe for updates

Subscribe today and get bleeding edge technology updates on machine learning, big data, AI, e-commerce, and QA hot off the press.