Vector space retrieval model for e-commerce
Sep 04, 2020 • 11 min read
In the middle of your home improvement project, you realize that you've lost a very important screw. You have a matching screw in your hands, and simply need to make another trip to the hardware store. However, a trip to the local hardware store is a hassle: many of the screws look alike, and the staff there isn't always helpful. Traditional approaches to matching these kinds of parts are frustrating and slow, and usually involve visually comparing your item to a huge catalog of choices. Wouldn’t it be nice if you could just point your smartphone at the screw that you need, and have it show up online?
Grid Dynamics partnered with one of our customers, a large retailer, who focuses on selling hardware directly to consumers and contractors. Seeing an opportunity to improve this part of the buying process, our customer asked us to find a way to use computer vision to do this laborious, visual matching process more quickly and easily.
In this article, we will walk you through the steps of developing a visual search application using deep learning techniques, and discuss both the approaches that worked and those that didn't.
Our earlier blog post, “Building a reverse image search engine with convolutional neural networks”, described a typical visual search solution similar to this example. The neural network was trained to map input images into “feature vectors”, which allowed us to search for similar images in the so-called “feature space”. This approach produced great results for visual recommendations, yet did not guarantee an exact match. Adjusting hyperparameters such as thresholds and distance metrics improved the results, but the perfect match remained elusive. For this application, we decided to take a classification approach instead: we identify all of the key visual features of the image before searching the catalog for matching products.
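To make the feature-space search concrete, here is a minimal sketch of a nearest-neighbor lookup using cosine similarity. The vectors, the function name `cosine_top_k` and the three-dimensional feature space are all made up for illustration; a real system would use embeddings produced by a trained network.

```python
import numpy as np

def cosine_top_k(query, catalog, k=3):
    """Return indices and similarities of the k catalog vectors closest to the query."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity to every catalog item
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]

# Hypothetical feature vectors: items 0 and 1 are near-duplicates.
catalog_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_vector = np.array([0.95, 0.05, 0.0])

top_idx, top_sims = cosine_top_k(query_vector, catalog_vectors, k=2)
print(top_idx, top_sims)
```

In this toy example the two best candidates have almost identical similarity scores, which illustrates why a similarity search alone can surface look-alike items without guaranteeing the exact match.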
Screws are visually very similar to each other, and therefore hard to distinguish. We needed to understand what characteristics made the screw types unique, and define the minimum feature set that makes each screw unique. After investigating the screw’s taxonomy, we decided to choose seven key visual attributes that fully identify a screw:
The difficulty with identifying screws is that there are about 1,500 unique types of screws, and a certain screw can often be distinguished only through a single feature. To ensure high-quality results, input images should pass through multiple processing stages before the catalog can be searched. We had to localize and segment the image, determine object dimensions and extract visual attributes.
First, we had to locate the object of interest in the photo, in this case, the screw. Because the user supplies the photo, we have no control over the background or the object's location in the photo. Ideally, we wanted to select the minimum area in the photo containing the object of interest, and ignore the rest of the image. To solve this problem we tried the Object Detection and Semantic Segmentation approaches.
Object Detection models help to localize an object of a particular class and find the minimum rectangular area around it. Such models work in multiple steps: they start by proposing candidate regions, then extract features by running a convolutional neural network (ConvNet) on each region, and finally classify those regions and tighten the bounding box around the object. Let’s review some of the pros and cons of using Object Detection models:
Considering these limitations, we chose to abandon Object Detection nets, and instead considered the Semantic Segmentation approach (see U-Net). This approach worked much better for our application. U-Net models classify each pixel as belonging to the object of interest or not. The main advantage of the U-Net architecture is that, combined with heavy data augmentation, it can be trained on a small dataset and still generalizes well. This approach, however, has one drawback: it cannot differentiate between instances of the same object class in the image.
A U-Net architecture supports multiple object classes at once by adding output channels. The U-Net model produced a very accurate mask that excluded shadows and various background distortions. However, the inability to separate instances was a limitation we needed to overcome, so we added custom post-processing of the masks to separate instances of the same class.
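The article does not describe the post-processing in detail. One common way to split a semantic mask into instances is connected-component labeling, sketched below with `scipy.ndimage.label`. The function name `split_instances` and the `min_pixels` noise threshold are assumptions for illustration, and this simple approach only separates objects that do not touch each other.

```python
import numpy as np
from scipy import ndimage

def split_instances(mask, min_pixels=3):
    """Split a binary mask into one boolean mask per connected blob."""
    labeled, count = ndimage.label(mask)   # 4-connectivity by default in 2D
    instances = []
    for label_id in range(1, count + 1):
        instance = labeled == label_id
        if instance.sum() >= min_pixels:   # drop tiny specks (assumed noise)
            instances.append(instance)
    return instances

# Toy mask: two separate objects plus a single noisy pixel.
mask = np.zeros((6, 8), dtype=bool)
mask[1:3, 1:4] = True
mask[4:6, 5:8] = True
mask[0, 7] = True

instances = split_instances(mask)
print(len(instances), [int(m.sum()) for m in instances])
```

Touching or overlapping objects would need something stronger, such as a distance-transform-based watershed, which is outside the scope of this sketch.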
Since 2014, ConvNets have shown great results in image classification across thousands of classes, coping with all kinds of viewpoints and distortions. Our task in this case was much simpler, as we did not have thousands of classes, and we had already segmented the object so that background and shadows did not interfere. We used catalog images and their attributes to train relatively simple ConvNet models to predict each of the key attributes.
We chose the MobileNetV2 Keras model for all of the classification features except color. We used transfer learning, starting from models pre-trained on ImageNet and fine-tuning only the last few layers of the ConvNet for each of the extracted features. The average accuracy across all models was about 94%. To learn more about Convolutional Neural Networks, refer to references [1, 2, 3, 4].
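A transfer-learning setup of this kind might look like the sketch below. It is not the project's actual training code: the function name `build_attribute_classifier`, the 224x224 input size, the number of unfrozen layers, the dropout rate and the optimizer settings are all assumptions.

```python
import tensorflow as tf

def build_attribute_classifier(num_classes, weights="imagenet",
                               input_shape=(224, 224, 3), trainable_layers=20):
    """MobileNetV2 backbone with a new softmax head for one screw attribute."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights)

    # Freeze everything except the last few backbone layers (assumed count).
    base.trainable = True
    frozen = base.layers if trainable_layers == 0 else base.layers[:-trainable_layers]
    for layer in frozen:
        layer.trainable = False

    inputs = tf.keras.Input(shape=input_shape)
    # MobileNetV2 expects pixel values scaled to [-1, 1].
    x = tf.keras.layers.Rescaling(scale=1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)            # keep batch-norm statistics fixed
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such model would be built per attribute, for example `build_attribute_classifier(num_classes=8)` for a hypothetical head type attribute with eight classes, and then trained on the segmented catalog images with `model.fit`.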
A screw's finish can be classified by color. Before applying deep learning classification, we decided to try a few simple algorithmic approaches as a baseline.
We ignored shape-related features, and focused only on color. We experimented with several algorithms to establish a color baseline. Initially, we defined several ground truth colors. We extracted colors from the image using a color histogram, and searched for the closest ground truth color by calculating the intersection of histograms. Another approach we tried was to map object colors to HUE/LAB color spaces, and calculate the distance between each ground truth point and the object color. The problem is that two colors that look identical to a human and differ only slightly in tone may have radically different values in any representation, which makes this a difficult problem for simple distance-based matching.
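The histogram-intersection baseline can be sketched as follows. The reference RGB values, the 4-bins-per-channel setting and all function names are made up for illustration and are not taken from the project.

```python
import numpy as np

def color_histogram(pixels, bins=4):
    """Normalized 3D RGB histogram of an (N, 3) array of pixel values."""
    hist, _ = np.histogramdd(pixels.astype(float), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms: 1.0 = identical, 0.0 = disjoint."""
    return float(np.minimum(h1, h2).sum())

def closest_reference(pixels, references):
    query = color_histogram(pixels)
    scores = {name: histogram_intersection(query, color_histogram(ref))
              for name, ref in references.items()}
    return max(scores, key=scores.get), scores

# Hypothetical ground-truth swatches (uniform patches of one RGB color each).
references = {
    "brass": np.full((100, 3), (181, 166, 66)),
    "zinc": np.full((100, 3), (190, 190, 195)),
    "black": np.full((100, 3), (30, 30, 30)),
}

best, scores = closest_reference(np.full((50, 3), (175, 160, 70)), references)
# A slightly brighter tone of the same color falls into a different bin.
_, shifted_scores = closest_reference(np.full((50, 3), (195, 166, 66)), references)
print(best, scores["brass"], shifted_scores["brass"])
```

The second query shows the weakness described above: a small shift in tone crosses a bin boundary, and the overlap with the matching reference color drops from full to zero.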
Turning to a deep learning approach, we decided to use a CNN model with only a few convolutional layers, so that the network would avoid learning shape-related features, focus its attention on color and ignore lighting conditions. We drew our inspiration from “Vehicle Color Recognition using Convolutional Neural Network”, an in-depth research article on the subject. We implemented a similar architecture with some simplifications, and achieved 93% accuracy for 9 color classes. We also used the imgaug library for data augmentation. It integrated seamlessly and gave us a very good performance boost.
Aside from classification of visual features, there are metrical features such as length, major diameter and thread pitch. For each of these features, we used a different image processing technique. Before extracting metrical features, we had to do some preprocessing, which included these steps:
Once these steps were complete, we could calculate the major diameter, total length and screw length.
The thread pitch was difficult to extract. The best approach was to calculate a Fourier transform of the threads to get a spectrum from which to derive the pitch. In order to improve spectrum quality, a Hann window was used, and a discrete Fourier transform for every row was averaged by column. The last step was to find the distance between the central peak and the closest non-central peak on the spectrum.
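The spectral approach can be sketched roughly as below, assuming the threaded part has already been cropped into a strip whose columns run along the screw axis. The sketch uses a one-sided FFT, where the central (zero-frequency) peak sits at bin 0, so the distance to the nearest other peak is simply its bin index. The function name, the `min_bin` and `rel_height` parameters and the synthetic test strip are assumptions.

```python
import numpy as np

def estimate_thread_period(strip, min_bin=2, rel_height=0.5):
    """Estimate the thread period in pixels from a 2D grayscale strip."""
    rows = strip.astype(float)
    rows = rows - rows.mean(axis=1, keepdims=True)   # reduce the zero-frequency term
    n = rows.shape[1]
    window = np.hanning(n)                           # Hann window against leakage
    # DFT of every row, magnitude averaged over rows (column-wise mean).
    spectrum = np.abs(np.fft.rfft(rows * window, axis=1)).mean(axis=0)
    spectrum[:min_bin] = 0.0                         # ignore what is left of the central peak
    threshold = rel_height * spectrum.max()
    # Lowest-frequency local maximum that is strong enough to be a real peak.
    for k in range(1, len(spectrum) - 1):
        if (spectrum[k] >= threshold and spectrum[k] > spectrum[k - 1]
                and spectrum[k] >= spectrum[k + 1]):
            return n / k                             # period in pixels
    raise ValueError("no thread peak found in the spectrum")

# Synthetic strip: 16 rows, 256 columns, one thread every 16 pixels, mild noise.
rng = np.random.default_rng(0)
x = np.arange(256)
phases = rng.uniform(0, 2 * np.pi, size=(16, 1))
strip = 0.5 + 0.5 * np.sin(2 * np.pi * x / 16 + phases) + rng.normal(0, 0.05, (16, 256))

period_px = estimate_thread_period(strip)
print(period_px)
```

Converting the period to a physical pitch then requires the image scale, for example dividing by a known pixels-per-millimeter value obtained when measuring the object's dimensions.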
After all these steps were completed, we had predicted values for all of the screw's features.
The final step was to find the products that matched the predicted features. In an ideal case, if all features were predicted correctly, we could find the proper unique product by filtering the catalog. However, feature predictions are not ideal, and this approach may easily result in incorrect or no results.
| Model number | Head type | Material | Length, width, pitch | Thread coverage | Tip type |
|---|---|---|---|---|---|
| #146258 | oval | zinc | #10 x 3/4 in | fully_threaded | cone |
| #496728 | flat | zinc | #8 - 32 x 2 in | fully_threaded | die |
| #529714 | pan | brass | #6 - 32 x 1/2 in | fully_threaded | die |
| #128032 | one way | zinc plated | #12 x 1 - 1/2 in | fully_threaded | cone |
| #103972 | pan | zinc plated | 1/4 in - 20 x 3/4 in | fully_threaded | die |
| #104071 | pan | zinc plated | #10 - 24 x 3/4 in | fully_threaded | die |
| #916468 | pan | stainless | #12 x 5/8 in | fully_threaded | cone |
| #194109 | type p | zinc plated | 1/4 in - 20 x 1/2 in | fully_threaded | die |
| #129586 | one way | zinc plated | #8 x 1 in | fully_threaded | cone |
| #194383 | type p | zinc plated | #10 - 24 x 1/2 in | fully_threaded | die |
| #386687 | round | zinc | #10 x 1 - 1/2 in | threaded_on_one_end | die |
| #841215 | oval | zinc | #6 x 1 in | fully_threaded | cone |
| #302180 | dowel | zinc | 5/32 in x 1 - 1/4 in | threaded_on_both_ends | cone |
| #40921 | flat | zinc | #8 x 1 in | threaded_on_one_end | cone |
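To illustrate why strict filtering is fragile, the sketch below compares it with a softer ranking that sums classifier confidences over the attributes that match. The ranking scheme, the confidence values and the function names are assumptions made for illustration and are not the matching logic used in the project; the catalog rows are a subset of the table above.

```python
CATALOG = [
    {"model": "#146258", "head": "oval", "material": "zinc",
     "thread": "fully_threaded", "tip": "cone"},
    {"model": "#496728", "head": "flat", "material": "zinc",
     "thread": "fully_threaded", "tip": "die"},
    {"model": "#529714", "head": "pan", "material": "brass",
     "thread": "fully_threaded", "tip": "die"},
    {"model": "#40921", "head": "flat", "material": "zinc",
     "thread": "threaded_on_one_end", "tip": "cone"},
]

# Hypothetical predictions as (value, confidence); the tip type is uncertain.
predicted = {
    "head": ("flat", 0.95),
    "material": ("zinc", 0.90),
    "thread": ("fully_threaded", 0.90),
    "tip": ("cone", 0.55),
}

def strict_filter(catalog, predicted):
    """Keep only products that match every predicted attribute exactly."""
    return [p for p in catalog
            if all(p[attr] == value for attr, (value, _) in predicted.items())]

def rank_by_confidence(catalog, predicted, top_k=3):
    """Rank products by the summed confidence of the attributes they match."""
    def score(product):
        return sum(conf for attr, (value, conf) in predicted.items()
                   if product[attr] == value)
    ranked = sorted(catalog, key=score, reverse=True)
    return [(p["model"], round(score(p), 2)) for p in ranked[:top_k]]

exact = strict_filter(CATALOG, predicted)
ranked = rank_by_confidence(CATALOG, predicted)
print(exact)
print(ranked)
```

With one low-confidence attribute predicted incorrectly, the strict filter returns nothing, while the ranking still places the flat-head, fully threaded zinc screw first.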
A great application is one that assists users in a process that is unwieldy for humans. Screw identification certainly falls into this category. A user can easily get frustrated identifying an item like a screw, but it is essential to find the right one. These visual search techniques utilizing deep learning can be used for many applications where there is a large variety of similar-looking items. We hope this example of a visual search application illustrates the potential of an efficient and fun way to assist users in their e-commerce experience.
Grid Dynamics specializes in visual search and deep learning solutions. We can make your product catalog manageable for your online customers. Contact us at (650) 523-5000 to speak to a Grid Dynamics Representative.