High-performance auto parts search: moving from Endeca to Elasticsearch
Jun 23, 2020 • 7 min read
Jun 23, 2020 • 7 min read
In online commerce, catalog navigation functionality is one of the key aspects of the overall user experience. Customers spend the majority of their time on the site discovering and selecting products and making purchase decisions. The mission of quality search and browse functionality is to provide the customer with the smoothest and shortest path to the most relevant products. Removing frustration from the product discovery and selection experience is a big part of this mission.
In this post, we will talk about out-of-stock items in search results. Just imagine the frustration of an online shopper who, after hours of shopping, found a perfect dress only to discover on the product data page, or even worse, on the checkout page, that the desired item is out of stock. This problem is especially painful in fashion, luxury, jewelry, and apparel retail, as well in online specialty marketplaces, where many products are unique, inventory is low, and customers spend significant time discovering products and making purchase decisions.
Some systems try to avoid the problem of making search engine inventory-aware. After all, out-of-stock products can be hidden from search results as a post-processing step. This works, and is definitely an improvement over showing out-of-stock inventory as if it was available to buy. However, faceted navigation becomes clunky; as a result of post-filtration, facet counts calculated by the search engine do not match actual product counts seen by the shopper. This may become very visible when the customer uses facet filters and gets "zero result" pages instead of the expected products, which is pretty frustrating.
An alternative approach is to grey-out product thumbnails for unavailable products on the UI. This solves the results vs facets inconsistency problem, but wastes precious real estate on the first page of results with unsellable items.
Neither of these approaches deals with a situation where some products become available at some point in time after being out-of-stock. Depending on the catalog reindexing cycle, this can postpone reappearance of the product on search and browse result pages for hours or even days.
To avoid these issues, e-commerce search system designers need to embed up-to-date information about inventory availability right into the search system. Not only must it accurately and consistently filter out-of-stock items, and reveal in-stock items in a timely fashion, but having this information available also unlocks a range of powerful features. Omnichannel customers can find products available in nearby brick-and-mortar stores. Products can be boosted (or buried) in search results based on underlying inventory levels. In-store customers can filter enterprise catalog by what is available in the store where they are shopping right now.
Making a search engine support inventory availability is easy. After all, a search engine such as Solr excels in filtering. What is hard, though, is keeping inventory data up-to-date. From a technical perspective, displaying up-to-date inventory availability means supporting high volume, high-frequency updates to the search index. This is bad news for most search engines that are designed for read-mostly workloads and are optimized assuming immutability of the index.
In this post, we focus on e-commerce search applications based on Solr, yet the same ideas should be applicable to other search engines and verticals concerned with embedding near real-time signals of various types in their search indexes.
Let’s consider some of the approaches implemented in the industry.
The brute force approach would be to delete and reindex all documents affected by each availability change. This is typically easy to implement and is perfectly fine for small-scale, low-traffic web shops. For large-scale, high-traffic sites this approach becomes impractical. Brute-force indexing adds quite a bit of overhead, as product and SKU documents have to be repeatedly retrieved from a data source, processed by business logic, and transmitted to the search service, where they are re-indexed again as they pass through all analysis chains. This is obviously the heavies and lowest-performing solution possible. Can we do better than that? We certainly hope so!
The brute force approach can be improved upon by not constantly reloading and retransmitting product information. All we need to transmit is information about what actually changed inside the existing document — in this case, the SKU availability field. This is called partial document update, and it is supported in both Solr and Elastic. It seems like we are done now, doesn’t it? Unfortunately, this is not so. Both Solr and Elastic support partial document update through full reindexing under the hood. This means that they store the original document data in the index and, once a partial update request is received, restore the document from the original, patch it according to the partial update data, and reindex it again as a new document, deleting the old version.
In general, this approach takes a high toll on search performance:
This makes partial document update suitable only for infrequent, low volume updates, e.g. 10-15 minutes apart, thousands of documents in update.
The problem with volume of updates can be addressed by moving inventory information into separate documents and performing query time join in Solr or Elastic (based on Lucene JoinUtil). This way, we will have a product documents and inventory documents which can be separately updated.
This approach has the advantage that separate lightweight inventory-only documents will save a lot of reindexing costs, especially in the case of block index structure.
Few things to keep in mind: in the case of distributed search custom sharding is required to correctly join products and inventory documents. Also, Lucene query time join is not a true join as only parent side of relationship is available in the result, e.g. it is similar to SQL inner select.
Still, this functionality is enough to support filtering by inventory availability. We can create a join query to select all available SKUs. How will it work under the hood? JoinUtil will first filter all inventory records by availability, then retrieve their SKU IDs, then filter SKUs by those IDs and intersect with the main query applied to the SKUs.
If this doesn’t feel like a high-performance solution, your intuition is correct. Indeed, query time join doesn’t perform all that well out of the box. It can only traverse a relationship in one direction — from child to parent, which can be very wasteful if the main query for the parent is highly selective. It also requires access to the child’s foreign key field value, which can be relatively slow even with docValues. The search by primary key on the parent side is faster, but still requires expensive seek on the terms enumerator.
Also, this solution still suffers from the necessity to commit to the Solr core and reopen searches, losing accumulated caches in the process.
Query time join can be avoided with the following trick. Since Solr 4.0 we have been able to perform query time join across separate collections. Assuming we have separate products and inventory collections, we can have a post-commit listener on inventory collection which will access and patch an availability filter within a product core. This approach eliminates the performance hit of query time join and retains caches in the product collection. Still, there is no such thing as a free lunch. This approach requires a custom Solr plugin and pretty involved code. Out-of-the-box Solr uses filter cache in order to speed up cross-core-join performance. Using the plugin, this cache needs special treatment and must be regenerated asynchronously.
It is well-known that the secret sauce of search engine performance is an inverted index. However, for many tasks which are traditionally performed by search engines, such as faceting and sorting, an inverted index is not efficient. To facet or sort documents you need mapping from the document ID to a value of the field. Historically, in Lucene, this mapping was delivered by fieldCache which was created by “uninverting” the indexed field. Such an approach is not optimal because:
To address these issues, Lucene developers introduced the concept of docValues, which by now has completely replaced fieldCache. In a few words, “docValues” is a column-oriented database embedded directly into a Lucene index. docValues allow flexible control over memory consumption and are optimized for random access to the field values.
But there is one feature which makes them stand aside: docValues are updatable, and, more importantly, can be updated without touching the rest of the document data. This unlocks true partial updates in the search engine and can support very high throughput of value updates (as in tens to hundreds of thousands per second). Replication of the updates across the cluster also comes out of the box and are pretty efficient, as docValues changes are stored separately from the rest of the segment data and thus do not trigger merge policies or require segments to be replicated.
This makes updatable docValues great candidates to model inventory availability. Unfortunately, updatable docValues are not yet exposed through Solr in the general case (track SOLR-5944). Still, Lucene support is there, so it is pretty easy to plug in your own update handler.
However, there is still a caveat: making docValues updates visible requires commit and reopening of the segment readers and wiping out valuable filter caches, which may take many seconds or even minutes to regenerate. Fortunately, in the case of pure availability updates, no actual segment data is changed so all existing filter caches are still valid, and Solr CacheRegenerator can be used to recover cached filters from previous Searchers into the new ones.
Updatable docValues work well for inventory availability tracking for simple use cases when there is a predictable number of updatable fields to support, and it is acceptable to pay the price of a Solr/Lucene commit periodically, e.g. when updates can be micro-batched, say with a 1-5 minute period.
However, even updatable docValues are not a silver bullet. There are limitations and tradeoffs to be aware of.
In case true near real-time availability updates are needed, or the number of available fields to track is very large, off-the-shelf features in Lucene and Solr won’t be much help. Only custom solutions tailored for particular use cases can work.
The main idea of such solutions is to embed a highly optimized in-memory store (NRT store) into each Solr instance. Lucene searches can be bridged to this store by a ValueSource interface providing FunctionValues.
Implementing this approach, we are almost completely on our own. We can not rely on Solr/Lucene to keep our NRT data durable or replicate it across the cluster. So, in general, following concerns have to be addressed:
With such a design, it is possible to build a search system which is completely decoupled from the normal Solr document update/commit cycle that can, indeed, support a near real-time view of inventory availability.
Up-to-date inventory availability is a must-have feature in a modern e-commerce search system. There is a whole spectrum of implementation approaches, varying in complexity and cost, which can be applied depending on the actual use case. In our practice, we have found that the updatable docValues approach is sufficient and works well in a majority of e-commerce cases. However, for some high-end implementations a more sophisticated custom solution will be necessary. The best thing to do is to experiment with your data and use cases to see what method of handling inventory availability is the best for you!
Ivan Mamontov, Eugene Steinberg