How Machine Learning can address attribution issues in e-commerce catalogs

Machine Learning and Artificial Intelligence Jun 20, 2017 Grid Dynamics

Nina Pakhomova

A richly attributed and well-curated product catalog is the key asset of online retailers. However, products are frequently misattributed, which makes it a pain for customers to find the products they’re looking for.

Catalog product attribution issues are a major pain point in e-commerce. They lead to poor user experience, lost revenue, and high customer turnover. Luckily, recent Machine Learning advances in image recognition and text classification can help resolve these issues, improve catalog quality, and tailor the user experience to individual customers.

The problem of misattribution & attribution gaps in product discovery

There are two main approaches to facilitating product discovery (the process of a consumer finding a product) in e-commerce:
  • Browsing category hierarchy
  • Keyword search

The success of these methods strongly depends on the quality and consistency of attribution within a catalog. For example, let’s say a customer searches for a Hawaiian shirt and none of the shirts in the catalog have been attributed as “Hawaiian”.

In this case, even the best keyword search engine will not be able to discover a relevant product, because the attribute itself doesn’t exist in the index. Likewise, standard solutions like faceted search, which use multiple filters to comprehensively describe products, are not able to help customers refine by “Hawaiian” in the “Shirts” category for the same reason.
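The effect can be reproduced with a toy inverted index. This is only a sketch: the two-product catalog, the field names, and the `build_index` and `search` helpers are hypothetical and stand in for a real search engine.

```python
from collections import defaultdict

# Hypothetical mini-catalog: no shirt carries a "style" attribute.
CATALOG = [
    {"id": 1, "title": "Floral print short-sleeve shirt",
     "attributes": {"color": "blue"}},
    {"id": 2, "title": "Oxford button-down shirt",
     "attributes": {"color": "white"}},
]


def build_index(products):
    """Map each lowercase token from titles and attribute values to product ids."""
    index = defaultdict(set)
    for product in products:
        tokens = product["title"].lower().split()
        tokens += [str(value).lower() for value in product["attributes"].values()]
        for token in tokens:
            index[token].add(product["id"])
    return index


def search(index, query):
    """Return the ids of products that match every query token."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


index = build_index(CATALOG)
no_hits = search(index, "hawaiian shirt")   # "hawaiian" is not in the index
blue_hits = search(index, "blue shirt")     # "blue" was attributed, so it matches

# Once the missing attribute is added, the same query finds the product.
CATALOG[0]["attributes"]["style"] = "hawaiian"
fixed_hits = search(build_index(CATALOG), "hawaiian shirt")

print(no_hits, blue_hits, fixed_hits)
```

The first query comes back empty even though product 1 is arguably a Hawaiian shirt, because the search code can only match tokens that were indexed.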

This problem is a consequence of product misattribution and/or product attribution gaps.

Machine Learning as a method for resolving attribution issues

Solutions based on image and text classification are one way to resolve attribution issues. Not only do they help with misattribution, they also improve catalog quality systematically and searchability dynamically. The result is a catalog and search experience tailored to individual customers, based on their real actions and insights.

How does it work?

Machine Learning (ML) uses algorithms to weigh data by relative importance. ML models learn by training on examples, inferring rules that map inputs to the expected outputs. The trained models can then provide insights, recognize previously unknown patterns, and make high-performing predictions without those rules being explicitly programmed. In other words, the model learns to assess input by itself.
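To make "training on examples" concrete, here is a deliberately tiny sketch of a nearest-centroid classifier in plain Python. It is not the approach used in this series, which relies on image recognition models; the training data (an average RGB value per product image, paired with a color label) and the `train` and `predict` helpers are hypothetical.

```python
import math

# Hypothetical labelled examples: (mean RGB of a product image, assigned color).
TRAINING = [
    ((200, 30, 40), "red"),
    ((220, 50, 60), "red"),
    ((30, 40, 210), "blue"),
    ((50, 60, 190), "blue"),
]


def train(examples):
    """Learn one centroid (average feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(total / counts[label] for total in acc)
        for label, acc in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = train(TRAINING)
print(predict(centroids, (210, 45, 55)))
print(predict(centroids, (40, 45, 200)))
```

No rule such as "a high red channel means red" is written anywhere in the code; the decision boundary comes entirely from the labelled examples.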

Here are a few applications of Machine Learning models for catalog attribution:

  • Learn from existing catalog images and identify misattribution
  • Leverage text and image-based models to fill attribution gaps
  • Identify new attributes
  • Build image-based search engines

Automated product attribution based on product images

As mentioned above, product attribution problems can be broadly divided into two groups:

  • misattribution
  • attribution gaps

Misattribution refers to the situation when a long dress is attributed as short, or a red tie is attributed as blue. Search engines are happy to trust indexed attribute data and retrieve completely irrelevant products as a result. Obviously, these kinds of situations lead to very frustrating customer experiences, where you just can’t find what you’re looking for.

Attribution gaps

Attribution gaps are more subtle and harder to notice, yet far more common. Dresses that are not attributed by length, style, or material are effectively invisible to the corresponding queries, like “long dress” or “wedding dress”. In this case, retailers miss the opportunity to showcase these products to their customers, which leads directly to lost revenue.

We can use Machine Learning classification built on a framework like Google’s TensorFlow to address these challenges. ML frameworks include libraries that make it easier for engineers to incorporate self-learning elements and AI features like speech recognition, computer vision, and natural language processing into systems.

In this case, by using a framework as the basis for an ML image recognition engine, we are able to exploit product images as an additional set of data to improve attribution. The framework provides much of the necessary infrastructure to get started: rather than having to build the entire system from scratch, we're able to introduce a new data set and retrain the model accordingly. As a result, we can build image classification models that recognize particular attribute values or image features with a certain level of confidence, based on our own criteria. This allows us to:

  • Solve misattribution by comparing the assigned value in the catalog to the model’s prediction. If the predicted value and the assigned value don’t match, the product is likely misattributed, and the model’s prediction can be offered as the corrected value.
  • Fill attribution gaps by using the model’s prediction in place of the missing attribute value. Wherever a value is missing, the value from the model is provided in its place.
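The two rules above can be sketched as a single review function. This is a minimal illustration under stated assumptions rather than the implementation discussed in this series: the `Prediction` type, the `review_attribute` helper, the action names, and the 0.9 confidence threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Prediction:
    value: str          # attribute value predicted from the product image
    confidence: float   # model confidence in [0, 1]


# Assumed cut-off; in practice it would be tuned per attribute.
CONFIDENCE_THRESHOLD = 0.9


def review_attribute(
    assigned: Optional[str],
    prediction: Prediction,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[str, Optional[str]]:
    """Return (action, value) for one attribute of one product."""
    if prediction.confidence < threshold:
        # The model is not sure enough: leave the catalog untouched.
        return ("keep", assigned)
    if assigned is None:
        # Attribution gap: use the prediction in place of the missing value.
        return ("fill_gap", prediction.value)
    if assigned != prediction.value:
        # Likely misattribution: suggest the predicted value as a correction.
        return ("flag_mismatch", prediction.value)
    return ("keep", assigned)


print(review_attribute("short", Prediction("long", 0.97)))
print(review_attribute(None, Prediction("long", 0.95)))
print(review_attribute("short", Prediction("long", 0.55)))
```

Gating both actions on confidence keeps low-certainty predictions from overwriting curated catalog data; a mismatch could equally be routed to a human reviewer rather than applied automatically.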


In this series of posts, we will discuss how to tap into product images as an important additional data source for e-commerce catalogs. So far, we have covered product attribution issues and how recent advancements in the field of image recognition enable many powerful use cases, from filling product attribution gaps to product discovery based on image similarity. These use cases allow for more accurate search results and a faster, friendlier user experience.

In our next post, we will dive into the details of how to use image recognition, Machine Learning, and Google’s TensorFlow framework to solve misattribution. Using Machine Learning to resolve misattribution and attribution gaps is only the tip of the iceberg of its potential for improving e-commerce as a whole.

So stay tuned, and subscribe to our blog to be the first to read about these up-and-coming technology solutions and to get the latest Open Source blueprints for Search, QA, Real-time Analytics, In-Stream Processing, and more. If you liked this post, comment below.