Vector space retrieval model for e-commerce
Sep 04, 2020 • 11 min read
Sep 04, 2020 • 11 min read
Since the dawn of ecommerce in the end of the nineties, online retailers have coined a famous “If they can’t find ‘em, they can’t buy ‘em” motto, which stands today as fresh as ever. Online shoppers have a lot of choices, and their patience is getting ever shorter: they expect instant, precise, relevant and personalized search results closely matching their buying intent.
Over the last two decades retail web search engines have made quite a journey to meet this expectation. In this blog post, we will track the history of retail web search from its roots in general-purpose full text search engines to current, state-of-the-art, deep learning-based semantic vector search.
Initially retail search engines were based on general-purpose keyword search algorithms. Those engines were originally designed to handle keyword search within web pages and electronic document libraries. The famous Apache Lucene library which powers both Solr and Elasticsearch systems is a good example of a leading full text search engine.
When applied to product search, those systems often treat all the product information such as a title, attributes, and a description as parts of one large text document. Customer queries are split into single words (known as terms) and each term is searched within the product’s description text using boolean logic.
All products matching the boolean expression need to be ranked, e.g. every product is given a score of how relevant it is to the customer query. This ranking is based on word statistics, e.g. it relies on counting how often particular words from the query are mentioned in the product data and how common this word is across all the products in the catalog. Ranking formulas can become pretty sophisticated, and take into account a variety of statistical factors and additional data, such as the newness of the product, but at their core they are still about counting words.
There are a few additional tricks which full text search engines employ to achieve fuzzy matches: they normalize words, reducing them to their stems, so that “jog” will match “jogging”. They also employ thesauri (or thesauruses) with numerous synonyms so that a “dress” will match a “gown”.
The full text approach to retail search suffers from multiple problems, however.
First of all, the method of matching individual words often backfires when a word is part of a multi-word concept. Compare the usage of the word “board” in such different contexts as “board game” and “cheese board”. Matching those words outside of their context will lead to displaying many irrelevant, accidental results which will drown customers in useless data.
Additionally, word counting is not the best proxy for defining customer intent. If we are searching for a particular combination of attributes, such as “turquoise skirt”, we may have a situation where the fact that “turquoise” is a very rare word in the catalog will dominate the scoring formulae and the result will show all turquoise products on top, even if they are not skirts.
Surely we should be able to do better than that!
Enter concept search.
To overcome some of the limitations of the full text search method we should focus on matching things, not words. In this approach the internal structure of product data and product attributes is preserved and leveraged to drive more precise and relevant results.
For that, the search engine should learn quite a bit about the domain. This data is consolidated in a “knowledge graph”, which enables more advanced processing of queries, such as:
Utilizing this data helps us achieve much better precision by eliminating accidental matches of individual words. Concept search also improves recall through the semantic expansion of queries with relevant synonyms, abbreviations and slang.
In the concept search approach ranking formulas depart from pure word counting mechanisms of full text search. They now mostly rely on the combination of several factors, such as what set of attributes is matched by a user query or what kind of normalization had to be applied to achieve the match. They also process a variety of business signals about the product itself, such as inventory levels, popularity and margins.
In addition to purely relevance-based matching and ranking, commerce search systems often include merchandising engines, which help to add merchandising bias to search results, boosting or burying particular products based on current business needs, irrespective of their relevance.
All those improvements and features help achieve the next level of search quality and, if implemented and tuned correctly, can drive business results to pretty impressive levels.
However, the concept search approach requires a significant amount of data discipline and manual curation when it comes to the quality and correctness of product attributes and the knowledge base data. If product data is sparsely or unevenly attributed, some of the relevant products could be ranked poorly or totally missed by the concept search engine. It also fully relies on skill and experience of search relevance engineers and curated data sources to properly tune and balance ranking formulas.
At the same time, when customers interact with the online retail system they produce a wealth of data related to their search and shopping behavior. Can this data be leveraged to further improve the quality of product discovery experience?
Machine Learning to the rescue!
There are many things in retail search which are extremely hard to get right in the absence of customer interactions. For example, when a customer is searching for “nike”, does she mean “nike shoes” or “nike tops”? A “skirt” probably means a different thing if we know that a customer is searching for things in the bedding category, doesn’t it? And what does “madonna dress” even mean?
Machine learning derives this knowledge from aggregating and generalizing shopping behavior data of all customers on the site. Results of intent classification can be used by the search engine to filter and boost relevant products, thus helping to resolve such ambiguities.
ML models can also discover unknown search terms and relate them to a particular product based on shopping behavior. Using this approach, product attributes can be enriched with the terms which customers frequently used when they interacted with a given product.
Machine learning can also be applied directly to the central problem of ranking search results. In this approach, called learn-to-rank, customer clickstream data is collected and aggregated into a so-called click model, which produces training data by collecting and ranking products that seem to be relevant for a given query based on customer behavior.
The ML model is then trained on this data trying to predict how relevant a given product will be for a given query based on dozens of carefully selected features, such as for what attributes the match happens, what the price and popularity of the product are, etc. Essentially, this approach allows the search engine to “learn” an optimal ranking formula instead of relying on the expertise of search relevance engineers.
Learn-to-rank has a lot of advantages. When implemented well, ranking tunes itself based on the customers shopping behaviour and takes into account such factors as trends, seasonality, etc. It is not a surprise that a combination of concept search and learn-to-rank approaches represents the current state of the art in search engine solutions for many online retailers.
ML models are quite hard to evaluate in real-time, especially if they have to score every product in the catalog that matches the query. This is why ML-based ranking is very often applied only to the top few dozen of products, not to the whole result set. Sometimes this means the products which are highly relevant from the ML model’s perspective may not be boosted to the first page of results, harming results relevance.
The learn-to-rank approach still piggybacks on the same classical boolean retrieval and deals with re-ranking of top results already arranged by conventional ranking formulae.
But is there a different approach which combines the advantages of learn-to-rank with a promise to overcome challenges of boolean retrieval? Read on.
As it often happens, true innovation stems from a critical review of the field foundations and an introduction of new ideas from adjacent fields of knowledge. In case of search, a new wave of innovation began by reviewing the classical vector space retrieval model in the light of deep learning.
The vector space model (VSM) is a theoretical foundation of word statistics ranking formulas described above. The main idea of the VSM is to put documents (such as products and queries) in the relationship with some concepts, which are represented as dimensions in multi-dimensional vector space:
Here, in a simplistic example of the two-dimensional space with only two concepts of “fruits” and “veggies” we can say that a cabbage is mostly a veggie, an apple is very much a fruit and a watermelon is something in between. The “pear” concept as a fruit will be mapped in the neighbourhood of “apple”.
Such mapping allows us to perform searches in this vector space. We can use some distance metric, such as angular distance, and measure the similarity between objects. For example, looking at the angles between the vectors representing our produce, we can say that search results for the “pear” query should be ranked as [apple, watermelon, cabbage].
With classical VSM, documents are represented as vectors of very high dimensions, where each component represents a particular word in the vocabulary and the value of this component is based on the frequency of the word in a given product or query. This provides a decent baseline for ranking products and with many special corrections used in real life.
Naturally, this approach produces very high-dimensional, sparse vectors, e.g. vectors where the majority of values are zero. Metric-based ranking becomes hard in very high-dimensional space because of the curse of dimensionality. Also, VSM ignores the word order, so “shirt dress” is the same as “dress shirt”. But most importantly, it ignores the word meaning, e.g. “dress” and “gown” dimensions are independent and do not help each other. Words are just tokens which are present or not present in the document, and can be substituted by any unique identifier.
The semantic vector model puts vector search model on its head: instead of trying to map the human language to a vector scape, getting thousands of dimensions, it attempts to fix a reasonable number of dimensions and adopt a “language” which would describe data in this fixed-size vector space.
You have probably heard that the Innuit people have dozens of words to describe snow in a variety of conditions: falling snow, frozen snow, melting snow, snow useful for igloo building, etc. As snow is an integral part of Innuit lives, they have developed a vocabulary to efficiently deal with all its variety. If we are tasked to describe kinds of snow with extreme precision, we would probably create whole paragraphs of text, whereas an Innuit would use just one or a few words.
The main idea behind semantic vector search is to let a computer algorithm “invent” its own language to describe both the queries and the products in a particular domain. “Words” in this language are called “latent features“. Latent means hidden, e.g. the real meaning of those “features” is unknown to us, yet those features/words allow us to represent our data in a very brief and concise way, capturing the “meaning” or “semantics” of our products.
Semantic vectors, representing our data in this semantic vector space, are relatively small and dense, with just hundreds of dimensions and having non-zero values in most components. This representation allows us to reduce the impact of the dimensionality curse and to cluster our products with our queries together by meaning, not but word co-occurrence, thus greatly improving the effectiveness of the vector space model.
Semantic vectors directly address the central problems of polysemy and synonymy, by mapping concepts having a similar meaning to be close in terms of vector distance. Thus, “dress”, “gown”, “frock” will all have similar vector representation, and will be picked up by the nearest neighbor search in semantic vector space.
So, how exactly can we make an algorithm “invent is own language”? Enter deep learning. Deep learning models can “learn” the semantic vector representation of raw data (expressed in the form of free text and attribute values) which would be most useful towards the particular goal at hand.
For search purposes, we train the deep learning model towards the goal of predicting similarity between the query and the product, thus ensuring that vector representations of queries and relevant products will be clustered together.
Training data for the similarity model can be obtained much like in the learn-to-rank approach - by mining user engagement signals from clickstream data. It is worth mentioning that similarity training is greatly helped by transfer learning from modern natural language processing which provides the initial semantic representation. This significantly reduces the size of the training data necessary. See more details in the “Semantic vector search: the new frontier in product discovery” blog post.
As we can see, semantic vector search redefines the classic vector space model to include the knowledge obtained from text semantics, sourced from pre-trained natural language models, and gained from customer engagement data (similarity between queries and products). This is how a very powerful retrieval model is created which can process queries previously out of reach for conventional boolean-based search engines.
Additionally, semantic vector search allows us to include a wealth of additional data about the products, customers and queries within the same conceptual framework. Besides text-based features, we can also use image-based features, such as artistic style vectors and product features coming from collaborative filtering. We will explore those techniques in upcoming blog posts.
In this blog post, we discussed the long journey of retail search engines in their quest towards accurate, relevant, timely and personalized search results.
We tracked how approaches to ranking and matching have evolved from a purely statistical approach of classic full text search through a more sophisticated concept-based search to modern learn-to-rank models with manual feature engineering.
Now, we see the next generation of search engines emerging, based on the semantic vector space model and fueled by impressive advancements in deep learning-based natural language processing.
This new generation of search engines will power new retail search systems which will make customers’ shopping journeys even more intuitive, smooth and delightful than ever.
If you'd like to get more information or to contact the author, please click here.