In our previous post, we discussed the impact of product misattribution in e-commerce and how image recognition with Machine Learning can be an important tool to resolve this issue. In this post, we will get into the details of how to detect and correct misattribution using Machine Learning, Google TensorFlow and image vectorization.
We have chosen TensorFlow, Google’s open-source Machine Learning framework, as the basis for our solution. TensorFlow is a ready-to-use framework that provides powerful capabilities for designing Machine Learning models from scratch. It also provides libraries that take care of the underlying infrastructure for you.
As the name suggests, the main concept behind TensorFlow is the tensor: a multidimensional array of data (values or attributes). Tensors are passed from one mathematical computation to the next; these computations are called nodes and are connected according to a set of hierarchical rules. The data is transformed as it moves from node to node. These transformations (dimensional changes) capture the relationships between the values in the system. This allows the system to predict new values as it learns to recognize patterns within the data. The framework enables engineers to process data end-to-end from different sources and output it to different Business Intelligence (BI) tools.
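To make the idea of tensors moving between nodes more concrete, here is a minimal sketch in plain Python rather than TensorFlow itself. The helper names (shape, matmul, relu) and all the numbers are assumptions made up for illustration; they only mimic what real framework operations do.

```python
def shape(tensor):
    """Return the dimensions of a nested-list tensor, e.g. (2, 3)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

def matmul(a, b):
    """Node 1: matrix multiplication, changes the last dimension of the data."""
    return [[sum(a_val * b_row[j] for a_val, b_row in zip(row, b))
             for j in range(len(b[0]))] for row in a]

def relu(t):
    """Node 2: element-wise non-linearity, keeps the shape unchanged."""
    return [[max(0.0, v) for v in row] for row in t]

# A batch of 2 items with 3 values each: a tensor of shape (2, 3).
batch = [[1.0, -2.0, 3.0],
         [0.5, 0.5, 0.5]]
# Hypothetical weights of shape (3, 2).
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [-1.0, 1.0]]

hidden = matmul(batch, weights)   # shape (2, 3) -> (2, 2)
activated = relu(hidden)          # shape stays (2, 2)
print(shape(batch), shape(hidden), activated)
```

In a real TensorFlow model the same flow is expressed with the framework's own tensor operations, and the weights are learned rather than written by hand.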
TensorFlow has a general, flexible, and portable architecture and has been used for deploying Machine Learning systems for information retrieval, simulations, speech recognition, computer vision, robotics, natural language processing, geographic information extraction, and computational drug discovery. TensorFlow libraries make it easier to incorporate self-learning elements and AI features like speech recognition, computer vision and natural language processing.
Google has recently released a new version of TensorFlow. Its features include:
Vectorization is a method for converting the object of learning into a vector, and it is central to any Machine Learning system. Machine Learning algorithms can only process vectors, so everything, from human speech to texts and images, has to be converted into vector form before training can start.
Modern image recognition systems rely on Convolutional Neural Networks (CNNs) to produce a vector representation of an image. CNNs are built to mimic how the human brain processes images and recognizes their content.
Our brain contains a set of neurons capable of interpreting data in terms of basic visual properties like lines, circles, etc. It also contains multiple sets of other neurons (layers) that are able to combine the outputs from previous layers to recognize more complex patterns. For example, let's say the first layer recognizes four lines. Then, a following layer recognizes that these lines intersect at 90° angles. The next layer, therefore, concludes that the image is a rectangle. This process goes on until the information has passed through all the layers of neurons needed to build a full picture of the image that’s being observed. In the brain, all of this happens in a matter of milliseconds.
In the same way, CNNs contain layers of nodes. These nodes, as mentioned above, are mathematical computations that process vectors and interpret them according to certain rules. CNNs also have many identical copies of the same node, which allows them to express large computations while keeping the number of processing rules fairly small. This large number of nodes can be leveraged to run many computations quickly and reliably. This is why frameworks like TensorFlow can be used to build Machine Learning systems that perform quickly and at scale, and that provide accurate predictions in business-critical use cases.
What is important for our task is that the architecture of a CNN consists of convolutional and pooling layers, which are used to build a vector representation of an image (the feature vector), and a fully connected layer, which is used for classification.
Convolution is the core building block of a CNN. It preserves the spatial relationship between pixels by learning image features using small squares of input data, known as filters. In practice, a CNN learns the values of these filters on its own during the training process. The more filters we have, the more image features get extracted, and the better our network becomes at recognizing patterns in new images.
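As a rough illustration of how a small square filter slides over an image, the sketch below applies a hand-written 3x3 vertical-edge filter to a tiny 5x5 image in plain Python. The function name convolve2d, the image, and the filter values are assumptions chosen for the example; in a real CNN the filter values are learned during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    sum the element-wise products at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# 5x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3], [0, 3, 3]]
```

The resulting 3x3 feature map is zero where the image is uniform and positive where the filter overlaps the edge, which is the kind of local pattern a convolutional layer picks up.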
The pooling layer reduces the dimensionality (width and height) of each feature map, a step also called subsampling or downsampling, while retaining the most important information.
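The sketch below shows one common form of pooling, max pooling with a 2x2 window and a stride of 2. The post does not say which pooling variant is used, so treat the choice of max pooling, the function name max_pool2d, and the sample values as assumptions.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the largest value in each size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# A made-up 4x4 feature map.
feature_map = [[1, 3, 2, 0],
               [4, 6, 1, 1],
               [0, 2, 9, 5],
               [1, 1, 3, 7]]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4x4 map shrinks to 2x2, but the strongest response in each region survives.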
Fully connected layer
The fully connected layer looks at which high-level features most strongly correlate with a particular class and assigns scores accordingly. Multiplying the output of the previous layer by the classification weights gives a score for each class, which can then be normalized (typically with a softmax function) into probabilities for the different classes.
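A minimal sketch of that final step might look like the following. The feature vector, weights, and class names are invented for illustration, and softmax is assumed as the normalization, which is the usual choice for this kind of classifier.

```python
import math

def fully_connected(features, weights, biases):
    """One score per class: dot product of the features with that class's weights."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["short", "knee length", "long"]   # hypothetical attribute values
features = [1.0, 0.0, 2.0]                   # stand-in for a CNN feature vector
weights = [[2.0, 0.0, 0.5],                  # one row of weights per class
           [0.0, 1.0, 0.5],
           [-1.0, 0.0, 0.0]]
biases = [0.0, 0.0, 0.0]

scores = fully_connected(features, weights, biases)
probs = softmax(scores)
predicted = classes[probs.index(max(probs))]
print(scores, predicted)
```

Here the raw scores are 3.0, 1.0 and -1.0, so the first class ends up with the highest probability.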
One of the benefits of using a framework like TensorFlow is that it ships with the Inception-v3 model, which is already trained and is able to recognize thousands of features. Inception-v3 is trained for the ImageNet Large Visual Recognition Challenge using data from 2012. This is a standard task in computer vision, where models try to classify entire images into 1000 classes, like "Zebra", "Dalmatian", and "Dishwasher".
However, Inception-v3 is not trained for our e-commerce catalog use case and, as a result, doesn’t have attributes of interest, like “mini skirt” for example. That said, we can still take advantage of other aspects of the model, like the already trained vectorization layers. All we need to do is to retrain the fully connected layer that classifies images to recognize our attributes.
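The idea of keeping the trained vectorization layers and retraining only the classifier can be sketched as follows. This is not the TensorFlow retraining code: the two-dimensional "feature vectors" stand in for the much larger vectors the frozen Inception-v3 layers would produce, and the class names, learning rate, and function names are all assumptions for the example.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(x, weights, biases):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def train_last_layer(features, labels, num_classes, epochs=200, lr=0.5):
    """Train only a softmax classifier on top of fixed feature vectors."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(x, weights, biases))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for k in range(dim):
                    weights[c][k] -= lr * grad * x[k]
                biases[c] -= lr * grad
    return weights, biases

def predict(x, weights, biases):
    scores = class_scores(x, weights, biases)
    return scores.index(max(scores))

classes = ["mini skirt", "maxi skirt"]
# Pretend outputs of the frozen, already trained layers (made-up numbers).
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]

weights, biases = train_last_layer(features, labels, num_classes=len(classes))
print(classes[predict([0.95, 0.15], weights, biases)])
```

Because the feature extraction is left untouched, only a small set of weights has to be learned, which is what makes retraining on a comparatively small catalog dataset practical.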
In order to correct misattribution we must decide:
As mentioned above, TensorFlow does not contain a dataset that meets our criteria. So, instead, we need to start from a given dataset that includes misattribution and find the discrepancies within the data. Old product catalog images and descriptions are good candidates for this kind of dataset.
We start by training the model to catch inconsistencies between the assigned product attributes and the attributes that can be extracted from the product image. By doing this, the model starts capturing the labels we’re interested in recognizing.
We have built a system in which we can easily train our models on existing images and use them for attribute validation. We leveraged the Spark/Hadoop ecosystem to speed up training set preparation and attribute validation tasks, and Docker to deploy containers for model retraining.
There are three main parts of our process flow:
If the predicted label and assigned value are different, we’ve successfully identified misattribution.
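The comparison itself is simple and could be sketched like this. The record fields (id, image, assigned_value) and the lookup table that stands in for the trained model are hypothetical and exist only for the example.

```python
def find_misattributed(catalog, predict_label):
    """Return (product id, assigned value, predicted label) for every mismatch."""
    mismatches = []
    for product in catalog:
        predicted = predict_label(product["image"])
        if predicted != product["assigned_value"]:
            mismatches.append((product["id"], product["assigned_value"], predicted))
    return mismatches

# Stand-in for the trained model: a lookup table keyed by image name.
fake_predictions = {
    "dress_1.jpg": "long",
    "dress_2.jpg": "short",
    "dress_3.jpg": "short",
}

catalog = [
    {"id": "sku-1", "image": "dress_1.jpg", "assigned_value": "long"},
    {"id": "sku-2", "image": "dress_2.jpg", "assigned_value": "long"},
    {"id": "sku-3", "image": "dress_3.jpg", "assigned_value": "short"},
]

mismatches = find_misattributed(catalog, fake_predictions.get)
print(mismatches)  # [('sku-2', 'long', 'short')]
```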
It is of utmost importance that the trained model is as accurate as possible. As such, we’ve introduced the following techniques to increase the quality of our models:
We have also built a UI reporting application that visualizes the results of the image-based attribute verification process. Via the UI, the model provides a second opinion about the value of a given attribute. You can see it in action below.
Dress Length attribute:
Heel height attribute:
Women’s shoe style attribute:
A benefit of using the UI is that you are able to unify attribute value assignment across the entire catalog. Take color: there are thousands of different colors, and in some cases it is a fairly subjective judgement whether a dress is in fact yellow or gold, or whether a top is truly pink or purple. During training, the model learns “what is purple” and thereby provides color unification across the catalog:
After training models for different attributes, we found that some attribution errors are more significant than others. For example, if a white dress is misattributed as black, it’s definitely a mistake. However, the difference between orange and red can be much more subtle. Similarly, the difference between “long dress” and “short dress” is more pronounced than the difference between “knee length dress” and “below the knee dress”.
To embed this into the system we added an errorScore.
As previously mentioned, the fully connected layer is a multi-class classifier. Its output is therefore not a single class but a vector that can be treated as the probabilities of all possible classes. Based on this, you can treat the difference between the probability of the most probable label and the probability of the originally assigned label as an error score. As a result, the errorScore for Short versus Long is typically higher than that for Knee length versus Below the knee.
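One way to read this definition is as a simple difference between two probabilities. The sketch below takes that interpretation; the function name error_score and the probability values are made up for illustration and are not outputs of our models.

```python
def error_score(probabilities, assigned_label):
    """Gap between the model's top probability and the probability
    it gives to the label currently assigned in the catalog."""
    top_label = max(probabilities, key=probabilities.get)
    return probabilities[top_label] - probabilities[assigned_label]

# A dress assigned "long" that the model is almost sure is "short".
clear_case = {"short": 0.90, "knee length": 0.08, "long": 0.02}
# A dress assigned "below the knee" where the model slightly prefers "knee length".
subtle_case = {"knee length": 0.55, "below the knee": 0.40, "long": 0.05}

clear_score = error_score(clear_case, "long")
subtle_score = error_score(subtle_case, "below the knee")
print(round(clear_score, 2), round(subtle_score, 2))  # 0.88 0.15
```

When the assigned label is already the most probable one, the score is zero, so only genuine disagreements get a positive errorScore.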
It is subsequently possible to find the appropriate level of errorScore and use it as a tiebreaker, should a tie arise. For example:
There are a lot of product properties that can’t be extracted directly from an image. Additionally, sometimes there’s not enough data to train the CNN model to recognize tiny differences between, for example, a sweater and a sweatshirt.
To deal with these kinds of cases, we need to cross-check different data sources against each other: for example, the image, the attributes, and the textual description of the product.
In this case, we can use a copywriter’s product descriptions as an alternative source of truth. As with images, we needed a vectorization approach to apply ML to the product descriptions. There are many ways to build vectors from text. For our use case we chose the Continuous Bag-of-Words (CBOW) model built into TensorFlow. CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'). Because it treats an entire context as one observation, CBOW smooths over a lot of the distributional information, which is useful for smaller datasets like ours.
We built the model to convert text into vectors and trained the same kind of classifier to predict the same labels as our retrained CNN model. As a result, we have two models that are able to predict a particular value for the same product, one based on the product’s image and the other on the product’s description.
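To show what "predicting a target word from its context" means in terms of training data, here is a small sketch that builds (context, target) pairs from a sentence. The function name cbow_pairs and the window size of 2 are assumptions for the example; the actual embedding training is left to the framework.

```python
def cbow_pairs(tokens, window=2):
    """For each word, collect up to `window` words on each side as its context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "the cat sits on the mat".split()
pairs = cbow_pairs(tokens, window=2)

for context, target in pairs:
    print(context, "->", target)
# The pair for 'sits' is (['the', 'cat', 'on', 'the'], 'sits')
```

A CBOW model is then trained to predict each target from its context words, and the word vectors it learns along the way are what we use to vectorize the descriptions.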
With a deeper understanding of the business needs, it’s also possible to build a heuristic model that chooses one model’s prediction over the other’s based on the predicted values.
It’s important to note that the trained system is able to pick up important business details and rules of a given catalog without anyone having to work out and implement these rules explicitly. This means that the output of a given system is tailored to the data that was used for its training.
In this post, we discussed how to set up and use Machine Learning and Image Recognition for detecting misattribution in product catalogs. This post is the second part of our series on using TensorFlow for e-commerce business cases linked to catalog attribution. In our third and final article, we will show how easy it is to expand from attribute verification to filling attribution gaps and gaining new attribute values from vectorized inputs.
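The exact heuristic depends on the business rules, so the following is only one possible sketch: if both models agree, accept the label; otherwise trust whichever model is more confident. The function name choose_prediction, the tie-breaking rule, and the probabilities are all assumptions for illustration.

```python
def choose_prediction(image_probs, text_probs):
    """Pick a label from two models' class probabilities.
    Returns (label, source), where source says which model decided."""
    image_label = max(image_probs, key=image_probs.get)
    text_label = max(text_probs, key=text_probs.get)
    if image_label == text_label:
        return image_label, "both"
    # Assumed rule: on disagreement, prefer the more confident model.
    if image_probs[image_label] >= text_probs[text_label]:
        return image_label, "image"
    return text_label, "text"

# The image model can barely tell the two apart; the description is clear.
image_probs = {"sweater": 0.52, "sweatshirt": 0.48}
text_probs = {"sweater": 0.10, "sweatshirt": 0.90}

decision = choose_prediction(image_probs, text_probs)
print(decision)  # ('sweatshirt', 'text')
```

A production rule set would likely also look at the attribute in question, for example preferring the image model for color and the text model for material.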