Semantic query parsing blueprint
Sep 16, 2020 • 10 min read
Sep 16, 2020 • 10 min read
A customer of ours, one of the largest omni-channel retailers in the US, was having issues with product discovery. Their e-commerce site had a massive catalog with hundreds of thousands of SKUs, a modern e-commerce backend and a powerful search engine, yet the conversion was less than stellar. Frustrated customers often couldn’t find what they were looking for, even when the retailer had the goods.
The issue was broadly related to poor autocomplete features, which made product discoverability unnecessarily complicated. For example, these were common issues that plagued the user experience:
The search engine itself didn’t have sufficiently sophisticated autocomplete functionality built-in, failing to address such problems as:
The customer was faced a dilemma between two unappealing options: to commission an expensive and technologically complex heart surgery on the existing search engine, or replace that search engine altogether with another one that had autocomplete features.
Luckily, there was a third choice: to provide autocomplete functionality as an add-on service, that would integrate with their existing search engine and e-commerce back-end. This is one of our company’s specialities, developed over the years as a result of many generations of large-scale implementations of e-commerce search engines for big retailers. Here is how we did this. You can also see a demo of this solution below.
Our approach is to create an autocomplete service that extends a retailer’s search bar with new features that are search engine-agnostic and do not require any customizations with the backend or search engine index. Furthermore, the quality of these autocomplete features matches that of those found in current state-of-the-art search catalogs offered by Google and Amazon.
To build this sophisticated suggestion catalog service, we combined two powerful ideas:
The first technique, catalog-based suggestion generation, functions to transform unnatural and often “synthetic-like” suggestions generated from a catalog, into more relevant and applicable suggestions. Catalog indexes can grow to be very large with large numbers of SKUs and additional product attributes. Therefore, to tune relevancy, we simplify and edit the product attributes to remove non-business related attributes that likely would not have been searched for in the first place. While this process may limit the number of product attributes, it is advantageous because it can be applied to any product catalog independent of the presence of user query statistics.
The second technique, most frequent user query-based suggestion generation, utilizes a more data-driven approach. This technique bypasses the manipulation of large catalog indexes and maintains a high number of product attributes, but is restricted by its requirement of the collection and possession of most frequent user query statistics. Additionally, this technique employs the search engine directly during suggestion index generation, whereas the catalog technique does not.
Both techniques outlined above present their own advantages and disadvantages in terms of the quality of the generated suggestion sets. We have opted to implement both of these techniques to maximize the solution's user experience.
In order to address the relevancy of catalog-based suggestions independent of user query statistics, this technique processes data using a multi-step algorithm and produces index data on generated suggestions.
The solution is based on the following steps:
Finally, all of these suggestions are indexed into Solr using a structure such as the one presented below:
<field name="input" type="search"/> <field name="importance" type="pint"/> <field name="length" type="pint"/> <field name="rank" type="pint"/>
The “input” field stores the original text of a generated suggestion, “length” stores a number of words inside the suggestion, and the others match the description above.
For the input field, we don’t have to use EdgeNGramFilterFactory or other filters that split the phrase into a list of shingles anymore. A simple field config like the one posted below is sufficient:
<fieldType name="search" class="solr.TextField" stored="true"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" /> </analyzer> </fieldType>
In order to provide suggestions with only 1-2 words added to the original user input, we perform grouping on the “length field” and sort groups by length increase. Inside groups can be sorted into suggestions by their importance and ranked according to specific needs. It is important to note that for inside groups, the prefix input needs to be matched manually by building a special query with a PrefixQuery match on the last entered word.
Grouping the resulting suggestions allows the elimination of duplicates. Business-rule filtering eliminates indistinguishable and nonsensical suggestions. And the fact that we build our index on the actual catalog ensures that there will be no dead ends.
As a result, we eliminate all scenarios of naive catalog-based suggestions, while still supporting all the necessary business rules. This requires a lot of effort on large catalogs so the fact that our solution scales to very large catalog sizes is vital.
Our solution also includes a second technique that uses data to improve the quality of suggestions. This technique may be restricted by the requirement that statistics for user query frequencies have been separated into linguistic groups, but is still worth adopting. This technique is implemented by grouping all semantically equal suggestions (e.g. “prom dresses” should be given equal weight as “dress for prom”), with a normalized number of user queries for the text (rank value). If the collection of raw statistics isn’t trivial, then groupings like this may require additional complex data analyses, like tagging, however in most cases native search engine features are sufficient.
This technique is based on the following steps:
Finally, all of these suggestions are indexed into Solr using a structure like that presented below:
<field name="input" type="search"/> <field name="length" type="pint"/> <field name="rank" type="pint"/> <field name="group_rank" type="plong"/>
As you can see, the resulting suggestion structure is nearly the same as for the catalog-based suggestion technique. However, due to the nature and sheer volume of data used, this technique provides a superior ranking system. Once the suggestions are indexed into a search engine, they can be queried using the same algorithm described for in this article for the catalog-based technique.
In combining these two techniques, we were able to successfully create an autocomplete feature for the search engine that used both user statistics and a cleaned index to generate more relevant results. Even though we did need to work with an extremely large index and a large volume of user query data, we were able to build the solution without altering the search engine or messing with the customers backend, simplifying the project and lowering its cost.
Going the extra mile and using two techniques to create highly relevant search suggestions paid off enormously. After our solution went into production, the customer conversion rate went up by 5% overall and 9% on mobile devices. This was especially promising as it showed that a Solr based search engine using only the basic index based approach, would not solve the problem as it wouldn’t provide the same quality of search suggestions. Overall, our extra features saved the customer money on the implementation and improved their user experience.
The challenges we outlined in this article that our customer experienced in terms of, struggling to provide an adequate search experience and experiencing low online conversion rates, are not exclusive to just this one retailer. Competition in the digital retail space has become increasingly more intense, requiring every digital retailer to continually innovate to keep pace with the ever-rising bar of customer expectations. Furthermore, increasing catalog sizes requires more advanced product discovery to continue to allow retailers to provide a complete omni-channel experience. Like our customer, many large retailers are facing this rise in standards and are expected to have the whole breath of their inventory available in their online catalog. This means that they will need to be improving their search features, namely begin implementing autocomplete.
The struggles that large retailers face with attempting to implement autocomplete search functions can be solved with the same solution that worked for our customer. Our solution is an agnostic search engine add-on and therefore, is compatible with any existing search engine such as Solr, Endeca, Elasticsearch and many more. Furthermore, our solution doesn’t conflict with the backend and is equipped to handle even the largest volume e-commerce catalogs. This solution can achieve Google and Amazon quality autocomplete with just two key steps, catalog index customization and integration of the digital store’s search bar. Grid Dynamics is a specialist at these kinds of autocomplete implementations and our team is happy to assist any retailers struggling with this technology, just drop us a line here.
Ivan Mamontov, Vladislav Trofimov