High-performance auto parts search: moving from Endeca to Elasticsearch

High-performance auto parts search: moving from Endeca to Elasticsearch

Jun 23, 2020 • 7 min read
Grid Dynamics

According to Hedges & Co. report, US online sales of auto parts and accessories amounted to $12B and, while new car sales projection are looking bleak, the future of the automotive aftermarket is quite positive. Both specialized auto parts retailers and online commerce powerhouses like Amazon and eBay are competing for the share of this market.

Amazon is a vivid retailer of auto parts 

When you are looking for auto parts online, it is crucially important to ensure that part you are ordering will fit your vehicle. That is why all online retailers selling auto-parts first try to establish what vehicle you have in mind when searching for auto-part.

Vehicle selection on online auto parts retailer website

Vehicle information is used to filter parts catalog and surface only the parts which will fit the target vehicle. This kind of filtering should work seamlessly with both keyword search and category browse.

Turns out, it is pretty hard to achieve high quality search with legacy search platforms like Oracle Endeca. Many specialized retailers struggle with performance and scalability issues which are preventing them from rolling out new high-value features to successfully compete with tech giants like Amazon on auto parts market.

In this blog post, we will describe how to overcome those issues with modern open source search platforms.

Modeling auto parts catalog

Auto parts catalogs are unique because they significantly depend on part fitment information.  Fitment, also referred to as compatibility or application, refers to the vehicle year, make, model, and engine that your parts and accessories fit. Often this information expands to trim/submodel, motor, and so on.

Products in auto parts catalog may have any number of fitments, in the range from zero,  which means that product fits any vehicle, to hundreds of thousands for some widely applicable products. Typically, products have about a hundred of fitments.

About 50% of products have no fitments, while some products have over 10,000 fitments

Essentially, we are dealing with parent-child relationship here, where product (parent) has multiple children (fitness) entities. Our core use case is to filter products by attributes of their fitnesses, e.g. by a full or partial specification of the vehicle.

Parent-child relationship requires special modeling in search engines due to the nature of the very data structure which gives them their speed - inverted index. Core abstraction for search engine is a document which is a collection of fields, where each field can have multiple values. Each document is completely independent and doesn't "know" about any other document in the index.

So, in order to support parent-child relationship with inverted index, we have to apply special techniques. Performing join and filtering in query time is technically possible, yet it is a known performance killer. So, most search implementations go for variety of tricks to move join work to the index time.

Endeca goes for all-in denormalization, which leads to the structure of index which looks like this:

Denormalization approach to indexing parent-child relationships

With denormalization, or attribute roll-down approach, parent attributes are copied to child documents and only child documents are indexed in the search engine. At the query time, child documents are searched, retrieved and grouped by parentId to return only parent documents. Here is how it works in auto parts search use case:

Endeca-based parent-child search

This is a straightforward approach and is widely used in many search applications. However, it has important drawbacks which significantly affect search performance:

  • Deep parent-child relationships, where majority of attributes are located on the parent level, cause massive data duplication. Problem only proliferates with additional levels of document nesting.
  • Many duplicate values on the children document defeats the key performance advantage of inverted index - ability to skip a whole blocks of documents, jumping to the next document containing the unique value you are looking for. Search engine index sweep has to stop frequently on every matching repeating values with limited opportunities to jump far ahead in the index.
  • Group by id operation at query time is quite expensive and can often become a bottleneck

Those issues may lead to response times in several hundreds of milliseconds or even seconds and strongly impact customer journey, creating perception of sluggish search experience. Customer patience is short, and options are plenty, so we should strive for immediate response from the system.

Is there a better way to model a parent-child relationships? Sure there is! One of the neat techniques available in Apache Lucene is index-time join (a.k.a. block-join) of parents and children, which essentially encodes those relationships as the order of documents in the index. This functionality is nicely exposed by both Solr and Elasticsearch search engines. Here is how it works:

Structure of the block-join index

Parents and children are arranged in the blocks where child documents are followed by their immediate parents. This means that in order to find a parent document by attribute of the child, all you need to do is to find a matching child and jump ahead in the index to the closest parent, which is a very natural and fast operation for inverted index. Similar approach works for finding children by their parents.

Searching in block-join index

This approach has multiple advantages over all-in denormalization:

  • High performance search for parents by children and children by parents
  • No data duplication whatsoever
  • Nested document structure can support arbitrary parent-child relationship depth with modest impact on performance

Here is how the same search for "filter for ford mustang" will work in Lucene-based block-join index

Lucene-based parent-child search

Of course, as it usually happens, there is no free lunch, and this data modeling approach has some drawbacks as well:

  • Update requires reindexing of the whole document’s block
  • Whole document block should be collocated within the index shard

In majority of applications, this trade-off is perfectly acceptable.

One additional important consideration is the scalability of search solution. With large catalog, it is increasingly important to be able to parallelize query processing over multiple nodes to be able to keep response time under control. This is achieved with index sharding and map-reduce query processing. This is another key scalability feature not available in Endeca.

Implementing auto-parts search with Elasticsearch

Original Endeca implementation had typical response time of 600 ms, with frequent multi-second "spikes" for high-recall queries. Let's see if we can do better than that with a modern search engine.

We implemented a sample auto parts search application based on Elasticsearch. Please note that a very similar system can be built with SolrCloud, as both systems rely on core functionality available in the Apache Lucene search library.

We used a catalog of auto-parts with about a 1M parts. We used just a single cloud node and deployed a 4-shard  Elasticsearch cluster.

Deployment architecture

We emulated the Endeca's denormalization indexing approach and compared it with block-join (nested) index structure with respect to both index size, indexing time and query response time.

Flat vs nested index structure

As you can see, nested index is 4x more compact compared to flat index due to absence of data duplication. Indexing time is also 3x better for nested index.

Lets look at the query time statistics:

Query time performance and scalability vs number of shards

As of query time, we can see that nested query shows consistent high performance, while flat index queries are 3-5x time slower. even though they can be brute-forced with additional shard parallelism, at the expense of overall search cluster capacity.

If we look at the performance of some selected queries, we will see a typical picture of search performance highly depending on number of matching documents.

Latency of some selected queries with different recall

We can see that typical response time stays below 50 ms. It becomes slower for queries which match big portion of the catalog, like "bulb" or "engine", but still stays under 100 ms. Also, there is no direct correlation between latency and number of fitments in retrieved products.

Conclusion

In this blog post, we discussed a practical approach to solving the main auto parts search use cases with modern open source search engines.  Turns out, even on the single node we can improve Endeca response time ~10x. On top of that, migration to Elasticsearch or SolrCloud unlocks horizontal scalability and allows to support catalogs of arbitrary size without response time impact.

If you are dealing with auto parts search and experiencing performance or search quality issues, we would be happy to help.

Happy searching!

Subscribe to our latest Insights

Subscribe to our latest Insights