Semantic query parsing blueprint
Sep 16, 2020 • 10 min read
Sep 16, 2020 • 10 min read
According to Hedges & Co. report, US online sales of auto parts and accessories amounted to $12B and, while new car sales projection are looking bleak, the future of the automotive aftermarket is quite positive. Both specialized auto parts retailers and online commerce powerhouses like Amazon and eBay are competing for the share of this market.
When you are looking for auto parts online, it is crucially important to ensure that part you are ordering will fit your vehicle. That is why all online retailers selling auto-parts first try to establish what vehicle you have in mind when searching for auto-part.
Vehicle information is used to filter parts catalog and surface only the parts which will fit the target vehicle. This kind of filtering should work seamlessly with both keyword search and category browse.
Turns out, it is pretty hard to achieve high quality search with legacy search platforms like Oracle Endeca. Many specialized retailers struggle with performance and scalability issues which are preventing them from rolling out new high-value features to successfully compete with tech giants like Amazon on auto parts market.
In this blog post, we will describe how to overcome those issues with modern open source search platforms.
Auto parts catalogs are unique because they significantly depend on part fitment information. Fitment, also referred to as compatibility or application, refers to the vehicle year, make, model, and engine that your parts and accessories fit. Often this information expands to trim/submodel, motor, and so on.
Products in auto parts catalog may have any number of fitments, in the range from zero, which means that product fits any vehicle, to hundreds of thousands for some widely applicable products. Typically, products have about a hundred of fitments.
Essentially, we are dealing with parent-child relationship here, where product (parent) has multiple children (fitness) entities. Our core use case is to filter products by attributes of their fitnesses, e.g. by a full or partial specification of the vehicle.
Parent-child relationship requires special modeling in search engines due to the nature of the very data structure which gives them their speed - inverted index. Core abstraction for search engine is a document which is a collection of fields, where each field can have multiple values. Each document is completely independent and doesn't "know" about any other document in the index.
So, in order to support parent-child relationship with inverted index, we have to apply special techniques. Performing join and filtering in query time is technically possible, yet it is a known performance killer. So, most search implementations go for variety of tricks to move join work to the index time.
Endeca goes for all-in denormalization, which leads to the structure of index which looks like this:
With denormalization, or attribute roll-down approach, parent attributes are copied to child documents and only child documents are indexed in the search engine. At the query time, child documents are searched, retrieved and grouped by parentId to return only parent documents. Here is how it works in auto parts search use case:
This is a straightforward approach and is widely used in many search applications. However, it has important drawbacks which significantly affect search performance:
Those issues may lead to response times in several hundreds of milliseconds or even seconds and strongly impact customer journey, creating perception of sluggish search experience. Customer patience is short, and options are plenty, so we should strive for immediate response from the system.
Is there a better way to model a parent-child relationships? Sure there is! One of the neat techniques available in Apache Lucene is index-time join (a.k.a. block-join) of parents and children, which essentially encodes those relationships as the order of documents in the index. This functionality is nicely exposed by both Solr and Elasticsearch search engines. Here is how it works:
Parents and children are arranged in the blocks where child documents are followed by their immediate parents. This means that in order to find a parent document by attribute of the child, all you need to do is to find a matching child and jump ahead in the index to the closest parent, which is a very natural and fast operation for inverted index. Similar approach works for finding children by their parents.
This approach has multiple advantages over all-in denormalization:
Here is how the same search for "filter for ford mustang" will work in Lucene-based block-join index
Of course, as it usually happens, there is no free lunch, and this data modeling approach has some drawbacks as well:
In majority of applications, this trade-off is perfectly acceptable.
One additional important consideration is the scalability of search solution. With large catalog, it is increasingly important to be able to parallelize query processing over multiple nodes to be able to keep response time under control. This is achieved with index sharding and map-reduce query processing. This is another key scalability feature not available in Endeca.
Original Endeca implementation had typical response time of 600 ms, with frequent multi-second "spikes" for high-recall queries. Let's see if we can do better than that with a modern search engine.
We implemented a sample auto parts search application based on Elasticsearch. Please note that a very similar system can be built with SolrCloud, as both systems rely on core functionality available in the Apache Lucene search library.
We used a catalog of auto-parts with about a 1M parts. We used just a single cloud node and deployed a 4-shard Elasticsearch cluster.
We emulated the Endeca's denormalization indexing approach and compared it with block-join (nested) index structure with respect to both index size, indexing time and query response time.
As you can see, nested index is 4x more compact compared to flat index due to absence of data duplication. Indexing time is also 3x better for nested index.
Lets look at the query time statistics:
As of query time, we can see that nested query shows consistent high performance, while flat index queries are 3-5x time slower. even though they can be brute-forced with additional shard parallelism, at the expense of overall search cluster capacity.
If we look at the performance of some selected queries, we will see a typical picture of search performance highly depending on number of matching documents.
We can see that typical response time stays below 50 ms. It becomes slower for queries which match big portion of the catalog, like "bulb" or "engine", but still stays under 100 ms. Also, there is no direct correlation between latency and number of fitments in retrieved products.
In this blog post, we discussed a practical approach to solving the main auto parts search use cases with modern open source search engines. Turns out, even on the single node we can improve Endeca response time ~10x. On top of that, migration to Elasticsearch or SolrCloud unlocks horizontal scalability and allows to support catalogs of arbitrary size without response time impact.
If you are dealing with auto parts search and experiencing performance or search quality issues, we would be happy to help.