In the ever-growing world of e-commerce, providing customers with an efficient and relevant product ranking experience is crucial for driving sales and maintaining customer satisfaction. Online retailers invest significant resources into optimizing their search and recommendation systems to ensure that users find what they are looking for quickly and effortlessly. However, with the constant expansion of product catalogs and the increasing complexity of user queries, traditional algorithms may fall short of meeting these needs.
This blog post explores an application of SPLADE (SParse Lexical AnD Expansion Model) for addressing the product retrieval challenge in e-commerce. Developed as a sparse model for first-stage ranking in information retrieval, SPLADE leverages sparse lexical representations and query expansion techniques to better understand user intent and deliver more relevant search results.
Let's take a step back and dissect the nature of existing product retrieval approaches. Product retrieval is the process of finding products that are relevant to a user's query. There are several ways to achieve this, each with its own strengths and weaknesses.
Boolean retrieval, or the “classical” bag of words (BOW) approach, is a simple and straightforward method. For example, a user might search for "black straight short dress". When employing the Inverted Index algorithm, each search term yields a list of products containing that specific term. This list is then intersected with others, ultimately returning products that encompass all the query terms or a relevant subset thereof:
A result is an exact match when every query term is present in the product, and a partial match when only some of the terms are. The approach is called Boolean retrieval because it is based on Boolean logic and set intersection.
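As a rough illustration of the intersection step, here is a minimal sketch of an inverted index in Python. The catalog, product IDs, and function names are hypothetical and exist only for this example; a production engine such as Lucene stores postings far more compactly.

```python
from collections import defaultdict

# Hypothetical toy catalog; IDs and titles are made up for illustration.
PRODUCTS = {
    1: "black straight short dress",
    2: "black long dress",
    3: "red short skirt",
    4: "straight black jeans",
}

def build_inverted_index(products):
    """Map each term to the set of product IDs whose text contains it."""
    index = defaultdict(set)
    for product_id, text in products.items():
        for term in text.lower().split():
            index[term].add(product_id)
    return index

def boolean_search(index, query):
    """Return (exact, partial): IDs matching all terms, and a dict of
    other IDs mapped to the number of query terms they match."""
    terms = list(dict.fromkeys(query.lower().split()))  # de-duplicate, keep order
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set(), {}
    exact = set.intersection(*postings)
    matched_terms = defaultdict(int)
    for posting in postings:
        for product_id in posting:
            matched_terms[product_id] += 1
    partial = {pid: n for pid, n in matched_terms.items() if pid not in exact}
    return exact, partial

index = build_inverted_index(PRODUCTS)
exact, partial = boolean_search(index, "black straight short dress")
```

With this toy catalog only product 1 contains all four terms, while the other products are partial matches that can be ordered by how many terms they share with the query.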
However, instead of representing a query as a set of arrays with documents, a more efficient approach involves creating a vector for each search query and document. Each vector is designed with a length corresponding to the size of the word dictionary. Taking our example search query into account, the representation will look like this:
In this scenario, each position in the vector corresponds to the index of a word in the dictionary, with a value of 1 denoting the presence of the term in the query and 0 indicating its absence. Because word dictionaries are usually large, often exceeding 30k terms, such vectors will be very sparse. In mathematics and computer science, a dense vector is a vector where most of the elements are non-zero.
On the other hand, a sparse vector is a vector where most of the elements are zero. In other words, a sparse vector contains very few non-zero elements relative to its total number of elements. That is why such a retrieval approach is also called Sparse retrieval.
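To make the vector representation concrete, here is a small sketch that encodes the example query as a binary vector. The eight-word vocabulary is a made-up stand-in for a real dictionary with tens of thousands of terms, and the function names are illustrative.

```python
# Hypothetical miniature vocabulary; a real dictionary is far larger.
VOCABULARY = ["black", "blue", "dress", "jeans", "long", "short", "skirt", "straight"]
TERM_TO_INDEX = {term: i for i, term in enumerate(VOCABULARY)}

def to_binary_vector(text):
    """Return a vector of 0s and 1s, one position per vocabulary term."""
    vector = [0] * len(VOCABULARY)
    for term in text.lower().split():
        position = TERM_TO_INDEX.get(term)
        if position is not None:  # out-of-vocabulary terms are silently dropped
            vector[position] = 1
    return vector

def to_sparse(vector):
    """Keep only the non-zero positions, which is how sparse vectors are stored."""
    return {i: value for i, value in enumerate(vector) if value != 0}

query_vector = to_binary_vector("black straight short dress")
query_sparse = to_sparse(query_vector)
```

Storing only the non-zero positions is what keeps this representation practical: with a 30k-term dictionary, a four-word query still needs just four entries.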
Some people think of TF/IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25) as forms of sparse vectors. It is important to note, however, that both of these algorithms are relevance scoring approaches: they determine how matched terms are weighted, not how the vector itself is composed. Here is an example:
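The scoring role of BM25 can be sketched in a few lines. This is a simplified illustration using the common formulation with the Lucene-style IDF and the frequently used defaults `k1=1.2` and `b=0.75`; the documents are invented, and whitespace tokenization is an assumption made for brevity.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # the term does not match, so it contributes nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical product titles used only for this example.
docs = ["black straight short dress", "black long dress", "red short skirt"]
scores = bm25_scores("black dress", docs)
```

Note that the matching itself is unchanged: the same terms hit the same documents as in Boolean retrieval, and BM25 only decides how much each hit counts. Here the shorter title scores slightly higher because of length normalization.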
On the other hand, when we talk about Dense retrieval, the situation is very different. Dense retrieval is a type of vector search retrieval where features from both the query and products are represented as compressed dense vectors. Due to the compression, the conventional retrieval approach employed with sparse features is not applicable. Instead, a nearest neighbor search algorithm is used. In this algorithm, the distance metric between the query vector and the product vectors is pivotal: the closer a product vector is to the query vector under this metric, the more relevant the product is considered to be.
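A brute-force version of nearest neighbor search can be sketched as follows. The three-dimensional vectors are invented for illustration (real embeddings typically have hundreds of dimensions), cosine similarity is just one possible metric, and production systems usually rely on approximate indexes rather than scanning every product.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vector, product_vectors, k=2):
    """Return the IDs of the k products most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), product_id)
        for product_id, vector in product_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [product_id for _, product_id in scored[:k]]

# Hypothetical embeddings; a real system would obtain them from a trained model.
PRODUCT_VECTORS = {
    "black_dress": [0.9, 0.1, 0.3],
    "red_skirt": [0.1, 0.8, 0.4],
    "black_jeans": [0.7, 0.2, -0.5],
}
top = nearest_neighbors([0.8, 0.2, 0.2], PRODUCT_VECTORS, k=2)
```

Nothing in the individual dimensions says why `black_dress` ranks first, which is the explainability gap discussed in this post.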
Both retrieval systems have their pros and cons. Modern dense retrieval systems show better accuracy than classic TF/IDF and BM25 retrieval methods in general domain retrieval tasks. Sparse systems, such as Boolean retrieval, fail to deliver relevant results in the case of out-of-vocabulary queries. Moreover, in scenarios where multiple matches occur in different sections of documents, Boolean retrieval may experience confusion with ranking. However, in the context of exact matches, vector search behavior in dense retrieval systems can be unstable due to the inherent complexities of the multi-dimensional space retrieval process. Additionally, dense retrieval lacks the provision of explanations for why a particular product is ranked at a specific position. On the other hand, classical approaches provide more predictable behavior and results can be explained.
Sparse neural search represents a departure from the static nature of classic sparse retrieval approaches, which necessitate constant resources for maintenance and tuning. Recognizing the challenges posed by this static nature, researchers have actively sought solutions, resulting in numerous publications and models integrating machine learning (ML) components into Boolean retrieval processes. Among these, the SPLADE model stands out as a significant advancement.
SPLADE, or SParse Lexical AnD Expansion Model, was introduced by Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant in their paper presented at the SIGIR 2021 conference.
This paper proposes an approach that attempts to harmonize dense and sparse retrieval methods, leveraging the strengths of both. In this approach, a neural network based on the transformer architecture plays a pivotal role. This network is adept at receiving input in the form of queries or product descriptions and is trained using a methodology akin to dense embedding models, incorporating insights gleaned from clickstream data. The model inference flow unfolds as follows:
The biggest difference is that the model does not produce a dense vector for text embedding, but a sparse vector with the same length and order as the Transformer's vocabulary. In other words, the model predicts two things: which terms to search for and how important each term is. Using this term importance, SPLADE can achieve the following:
After that, the output vector can be used to compose a search query and run it against an inverted index.
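The two steps described above, producing a sparse term-weight vector and running it against an inverted index, can be sketched as follows. The pooling function follows the formulation in the SIGIR 2021 paper, where each vocabulary term's weight is the sum over input tokens of log(1 + ReLU(logit)); later SPLADE versions replace the sum with a max. Everything else is an assumption for illustration: the six-term vocabulary, the logits (which a real model would produce from its masked language model head), and the weighted index contents.

```python
import math

# Hypothetical vocabulary; a real model uses its tokenizer's vocabulary.
VOCABULARY = ["black", "dress", "gown", "short", "mini", "red"]

def pool_term_weights(token_logits):
    """Collapse per-token vocabulary logits into one weight per term.

    token_logits holds one row per input token and one column per
    vocabulary term. Terms whose pooled weight is zero are dropped,
    which is what makes the output sparse.
    """
    weights = [0.0] * len(VOCABULARY)
    for logits in token_logits:
        for j, logit in enumerate(logits):
            weights[j] += math.log1p(max(logit, 0.0))  # log(1 + ReLU(logit))
    return {VOCABULARY[j]: w for j, w in enumerate(weights) if w > 0.0}

def score_products(query_weights, weighted_index):
    """Rank products by the dot product of query and product term weights."""
    scores = {}
    for term, query_weight in query_weights.items():
        for product_id, doc_weight in weighted_index.get(term, {}).items():
            scores[product_id] = scores.get(product_id, 0.0) + query_weight * doc_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up logits for the two tokens of the query "short dress".
QUERY_LOGITS = [
    [-1.0, 0.0, -2.0, 3.0, 1.0, -3.0],   # token "short"
    [-0.5, 4.0, 1.5, 0.0, -1.0, -2.0],   # token "dress"
]
# Made-up inverted index: term -> {product_id: document-side weight}.
WEIGHTED_INDEX = {
    "black": {1: 0.7},
    "dress": {1: 1.2, 2: 0.9},
    "gown": {2: 1.1},
    "short": {1: 1.0, 3: 0.8},
    "mini": {3: 1.3},
}

query_weights = pool_term_weights(QUERY_LOGITS)
ranking = score_products(query_weights, WEIGHTED_INDEX)
```

In this toy example the query gains the expansion terms "gown" and "mini", so products 2 and 3 are retrieved through terms the user never typed, and each product's score can be traced back to the individual terms that contributed to it.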
Such an approach gives us benefits from both classic Boolean retrieval and vector search, but it actually inherits some cons as well.
In this context, the advantages are twofold, encompassing the utilization of existing infrastructure and enhanced explainability, courtesy of the index approach. Moreover, drawing inspiration from dense retrieval methods, the system reaps benefits such as automatic query expansion through a self-learning mechanism. Additionally, the fine-tuning of rankings, achieved through meticulous weight adjustments, serves to amplify the overall effectiveness of the retrieval process.
The SPLADE model automatically expands the user's query by adding synonyms, related words, and other lexical elements. This process helps the model better grasp user intent and identify more relevant documents (or products). The expansion is learned during training rather than looked up: the model assigns weights to vocabulary terms that do not appear in the input text, so it does not depend on a hand-maintained synonym list or an external semantic network.
A notable distinction from Dense retrieval is the transparency SPLADE offers regarding the terms employed for expansion. This unique feature provides invaluable insights, serving as a wellspring for linguistic enrichment. Importantly, these insights permeate across all system components, spanning search functionality and autosuggest features.
A prominent challenge observed in much of the publicly available research is the reliance on general datasets, which means results often fail to hold up when the models are applied to structured data. The ubiquitous MS MARCO dataset is a frequent culprit in this regard.
To address this limitation, our research delved into the efficacy of utilizing pre-trained and fine-tuned SPLADE models on domain-specific data, specifically in the context of product catalog search use cases. A crucial aspect of our investigation was to assess the feasibility of automatic training for the SPLADE model without delving into extensive hyperparameter tuning.
A significant hurdle encountered during our exploration was the misalignment between the SPLADE model, designed with a BERT input, and the structural nature of e-commerce data. To streamline the process, we opted for a simplified approach, consolidating product title, category, and key attributes into a singular product string.
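A minimal sketch of this consolidation step might look like the following. The field names, the chosen attributes, and the `" | "` separator are all assumptions for illustration; the exact format used in our experiments is not specified here.

```python
def build_product_string(product, attribute_keys=("color", "fit", "length")):
    """Flatten a structured product record into one text string for the model.

    Only the attributes listed in attribute_keys are included, in that order,
    so the string stays short and consistently structured across products.
    """
    parts = [product.get("title", ""), product.get("category", "")]
    attributes = product.get("attributes", {})
    parts.extend(
        f"{key}: {attributes[key]}" for key in attribute_keys if key in attributes
    )
    return " | ".join(part.strip() for part in parts if part and part.strip())

# Hypothetical product record.
product = {
    "title": "Short Dress",
    "category": "Women > Dresses",
    "attributes": {"color": "black", "fit": "straight", "material": "cotton"},
}
product_string = build_product_string(product)
```

Keeping the attribute list explicit is a deliberate choice in this sketch: it avoids feeding the model long tails of rarely useful attributes that would crowd out the title within the input length limit.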
The evaluation results of these models on our test dataset yielded compelling insights:
The clear trend in our observations is that SPLADE, whether pre-trained on MS MARCO or fine-tuned on our proprietary dataset, consistently outperforms BM25.
The SPLADE model stands out as a superior alternative to traditional ranking algorithms like BM25, boasting faster performance, enhanced scalability, and heightened accuracy. These advantages position the model as an enticing choice for diverse information retrieval applications, particularly in the domain of product ranking within e-commerce platforms.
What makes SPLADE particularly appealing is its seamless integration into existing retrieval system workflows based on inverted indexes. This integration comes without the need for significant infrastructure modifications, allowing for swift implementation with immediate impact. The model aligns with Lucene-based or legacy retrieval systems, eliminating the necessity for a separate indexing pipeline and vector database. Consequently, SPLADE proves versatile for both creating new search solutions and enhancing existing ones.
However, it's crucial to recognize the inherent trade-offs associated with SPLADE. While it offers efficiency gains, the model may yield less precise results due to query expansion and relaxation. Therefore, we recommend caution when considering it as a primary stage in the search pipeline or in scenarios where precision is paramount.
Despite these considerations, the ongoing advancements in term-based search approaches underscore the relevance and value that SPLADE brings to search systems. The inclusion of this model in our Semantic Vector Search Starter Kit further exemplifies its practical application and efficacy.