Post 4: How to Use Block Join to Improve Search Efficiency with Nested Documents in Solr

Search Aug 10, 2016 Grid Dynamics

Mikhail Khludnev

Faster responses make customers happy. Lower hardware requirements make budget people happy. Block Join can help accomplish both these goals, which is why we strongly suggest using it for nested document searches in Solr. But that’s enough about why we advocate using Block Join for nested and faceted searches in Solr. Now we’ll talk about how to do it.

Indexing

SolrInputDocument has two methods, getChildDocuments() and addChildDocument(), for nesting child documents into a parent document. The XML and Javabin formats can now transfer them; JSON support is in progress.

Start by indexing a few t-shirts as a sample product-SKU hierarchy, using post.jar.
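As a rough illustration, the sketch below builds a Solr XML update message in which each SKU <doc> is nested inside its product <doc>. The field names and document IDs come from the queries in this post; the helper names, and the specific colors, sizes, and brands assigned to each SKU, are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


def make_doc(fields, children=()):
    """Build a Solr <doc> element; child docs are nested inside the parent."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        doc.append(child)
    return doc


def sku(doc_id, color, size):
    # Hypothetical helper: a child (SKU) document.
    return make_doc({"id": doc_id, "COLOR_s": color, "SIZE_s": size})


def product(doc_id, brand, skus):
    # Hypothetical helper: a parent (product) document carrying its SKUs.
    return make_doc({"id": doc_id, "type_s": "parent", "BRAND_s": brand}, skus)


# Assumed sample data: only the IDs and field names appear in the post.
add = ET.Element("add")
add.append(product(10, "Nike", [sku(11, "Red", "XL"), sku(12, "Blue", "XL")]))
add.append(product(20, "Nike", [sku(21, "Red", "M"), sku(22, "Blue", "XL")]))
add.append(product(30, "Puma", [sku(31, "Red", "XL"), sku(32, "Blue", "M")]))

xml_message = ET.tostring(add, encoding="unicode")
print(xml_message)
```

Saving the printed message to a file and passing that file to post.jar should index the three blocks, assuming a schema with the usual *_s dynamic string fields.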

To check how the blocks are laid out, run a match-all query with CSV output. You will see that each parent document is placed right after its children.

Be aware of the implicit _root_ field, which works as a block identifier: all child documents take their _root_ value from the parent’s uniqueKey field. Solr uses this field to overwrite whole blocks on update.
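The layout can be pictured with a small in-memory model. This is only a sketch of the idea, not Solr’s code: flatten_block is a hypothetical helper, the sample values are made up, and it assumes the parent carries the same _root_ value as its children.

```python
def flatten_block(parent, children):
    """Return one block in index order: children first, parent last.

    Every document in the block gets _root_ set to the parent's id.
    Assumption: the parent itself carries _root_ too, which is what lets
    a delete on _root_ cover the whole block.
    """
    root = parent["id"]
    block = [dict(child, _root_=root) for child in children]
    block.append(dict(parent, _root_=root))
    return block


# Made-up sample values; only the ids and field names come from the post.
parent = {"id": 10, "type_s": "parent", "BRAND_s": "Nike"}
children = [
    {"id": 11, "COLOR_s": "Red", "SIZE_s": "XL"},
    {"id": 12, "COLOR_s": "Blue", "SIZE_s": "XL"},
]

block = flatten_block(parent, children)
for doc in block:
    # CSV-like rows: id, _root_, type_s
    print(doc["id"], doc["_root_"], doc.get("type_s", ""), sep=",")
```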

Searching

Let’s assume we have a query matching our Red-XL child documents (SKUs AKA UPCs):

q=+COLOR_s:Red +SIZE_s:XL. It returns children with IDs 11 and 31.

Now let’s join from children to parent by calling a special “parent” query parser:

q={!parent which='type_s:parent'}+COLOR_s:Red +SIZE_s:XL. It returns parents 10 and 30, as expected.

The local parameter “which” provides a filter that distinguishes parent documents from child documents. Keep in mind two important things about it:

  • It should not match any child documents
  • It should always match all parent documents

Check that Block Join avoids the cross-match problem: it should not return parent 20, which is a candidate for a false positive match because it has Red SKUs and XL SKUs, but no single SKU that is both Red and XL.
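To make the cross-match problem concrete, here is a toy model in plain Python. It is a sketch of the idea only, not how Lucene implements the join, and the SKU values are assumed sample data. The block join evaluates the whole child query against each SKU separately, while a naive “flattened” index pools all SKU values on the product.

```python
# Index order: each block is its children followed by the parent.
# The colors, sizes, and brands are assumed sample data.
INDEX = [
    {"id": 11, "COLOR_s": "Red", "SIZE_s": "XL"},
    {"id": 12, "COLOR_s": "Blue", "SIZE_s": "XL"},
    {"id": 10, "type_s": "parent", "BRAND_s": "Nike"},
    {"id": 21, "COLOR_s": "Red", "SIZE_s": "M"},
    {"id": 22, "COLOR_s": "Blue", "SIZE_s": "XL"},
    {"id": 20, "type_s": "parent", "BRAND_s": "Nike"},
    {"id": 31, "COLOR_s": "Red", "SIZE_s": "XL"},
    {"id": 32, "COLOR_s": "Blue", "SIZE_s": "M"},
    {"id": 30, "type_s": "parent", "BRAND_s": "Puma"},
]


def is_parent(doc):
    return doc.get("type_s") == "parent"


def to_parent_join(index, child_matches):
    """Map every matching child to the first parent that follows it."""
    hits = []
    for pos, doc in enumerate(index):
        if is_parent(doc) or not child_matches(doc):
            continue
        parent = next(d for d in index[pos + 1:] if is_parent(d))
        if parent["id"] not in hits:
            hits.append(parent["id"])
    return hits


def flattened_match(index, color, size):
    """Naive alternative: pool all SKU values per product, then match."""
    hits, colors, sizes = [], set(), set()
    for doc in index:
        if is_parent(doc):
            if color in colors and size in sizes:
                hits.append(doc["id"])
            colors, sizes = set(), set()
        else:
            colors.add(doc["COLOR_s"])
            sizes.add(doc["SIZE_s"])
    return hits


def red_xl(doc):
    return doc["COLOR_s"] == "Red" and doc["SIZE_s"] == "XL"


print(to_parent_join(INDEX, red_xl))        # block join: per-SKU match
print(flattened_match(INDEX, "Red", "XL"))  # pooled values: cross-match
```

In this model the join yields parents 10 and 30, while the pooled version also picks up parent 20.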

This {!parent} query can be combined with any other query and filter. For example, we can constrain results by brand:

q=+BRAND_s:Nike +_query_:"{!parent which=type_s:parent}+COLOR_s:Red +SIZE_s:XL"

The same can be achieved by employing a filter query:

q={!parent which=type_s:parent}+COLOR_s:Red +SIZE_s:XL&fq=BRAND_s:Puma

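Don’t try to constrain children with filter queries. It doesn’t work, because a filter query is applied to the result of the {!parent} query, which contains only parent documents. To constrain children, add the clause to the child query inside {!parent} instead.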
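When these requests are sent over HTTP, the braces, the “!” and especially the “+” signs have to be URL-encoded; a raw “+” in a URL is decoded as a space. Below is a minimal sketch that assembles such a request with Python’s standard library. The host, port, and core name in SELECT_URL are assumptions, and parent_query is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode

# Assumed location of the Solr select handler; adjust to your setup.
SELECT_URL = "http://localhost:8983/solr/collection1/select"


def parent_query(child_clauses, parent_filter="type_s:parent"):
    """Wrap a child query in the {!parent} parser (hypothetical helper)."""
    return "{!parent which=%s}%s" % (parent_filter, child_clauses)


params = {
    # Child constraints go inside the {!parent} query...
    "q": parent_query("+COLOR_s:Red +SIZE_s:XL"),
    # ...while fq constrains the resulting parent documents.
    "fq": "BRAND_s:Puma",
    "wt": "json",
}

url = SELECT_URL + "?" + urlencode(params)
print(url)

# Round trip: decoding the query string gives back the original q.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["q"][0])
```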

There is a “reverse” query parser for searching child documents by parent filter:

{!child of=type_s:parent}BRAND_s:Puma returns SKUs that belong to the single Puma product.

Note that even as the local parameter name changes, it keeps the same meaning by supplying a parent filter.
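The “reverse” direction can be pictured with the same kind of toy model: for every parent that matches, emit the children stored in front of it in its block. This is a sketch of the idea rather than Solr’s implementation, and the brands and IDs per block are assumed sample data.

```python
# Index order: each block is its children followed by the parent.
# Brand assignments are assumed sample data.
INDEX = [
    {"id": 11}, {"id": 12},
    {"id": 10, "type_s": "parent", "BRAND_s": "Nike"},
    {"id": 21}, {"id": 22},
    {"id": 20, "type_s": "parent", "BRAND_s": "Nike"},
    {"id": 31}, {"id": 32},
    {"id": 30, "type_s": "parent", "BRAND_s": "Puma"},
]


def is_parent(doc):
    # Plays the role of the "of" parent filter.
    return doc.get("type_s") == "parent"


def to_child_join(index, parent_matches):
    """Return the ids of children whose parent satisfies parent_matches."""
    hits, block_children = [], []
    for doc in index:
        if is_parent(doc):
            if parent_matches(doc):
                hits.extend(block_children)
            block_children = []  # the next block starts after a parent
        else:
            block_children.append(doc["id"])
    return hits


puma_skus = to_child_join(INDEX, lambda p: p["BRAND_s"] == "Puma")
print(puma_skus)
```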

If you are not familiar with nested queries and local parameters, check this short intro.

Last but not least: it works for distributed search, too.

Caveat

You always need to be careful when updating blocks: they must always be updated as a whole. To show an unlucky example, let’s remove a parent and leave its children in the index:

<update><delete><query>id:10</query></delete><commit/></update>

At first, it seems like everything still works. Children 11 and 12 are left in the index, but ToParentBlockJoinQuery somehow detects this, and q={!parent which='type_s:parent'}+COLOR_s:Red +SIZE_s:XL correctly returns only parent 30. However, after <optimize/> is executed, the deleted parent document is purged from the index, and all of a sudden children 11 and 12 look like they belong to parent 20. The same query now returns 20 and 30, which is wrong! I’m afraid there are a few other similar cases of wrong behavior, too. As a reliable workaround, I suggest sending explicit deletes by query on the implicit _root_ field.
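The toy model below reproduces the post-optimize state and the workaround. It is a simplified sketch, not Lucene’s code: a child is attributed to the first parent that follows it in index order, the SKU values are assumed sample data, and delete_block_xml is a hypothetical helper that mirrors the delete message shown above.

```python
# Index order with the implicit _root_ field; SKU values are assumed.
INDEX = [
    {"id": 11, "_root_": 10, "COLOR_s": "Red", "SIZE_s": "XL"},
    {"id": 12, "_root_": 10, "COLOR_s": "Blue", "SIZE_s": "XL"},
    {"id": 10, "_root_": 10, "type_s": "parent"},
    {"id": 21, "_root_": 20, "COLOR_s": "Red", "SIZE_s": "M"},
    {"id": 22, "_root_": 20, "COLOR_s": "Blue", "SIZE_s": "XL"},
    {"id": 20, "_root_": 20, "type_s": "parent"},
    {"id": 31, "_root_": 30, "COLOR_s": "Red", "SIZE_s": "XL"},
    {"id": 32, "_root_": 30, "COLOR_s": "Blue", "SIZE_s": "M"},
    {"id": 30, "_root_": 30, "type_s": "parent"},
]


def is_parent(doc):
    return doc.get("type_s") == "parent"


def red_xl_parents(index):
    """Attribute each Red-XL child to the first parent that follows it."""
    hits = []
    for pos, doc in enumerate(index):
        if is_parent(doc):
            continue
        if doc["COLOR_s"] == "Red" and doc["SIZE_s"] == "XL":
            parent = next(d for d in index[pos + 1:] if is_parent(d))
            if parent["id"] not in hits:
                hits.append(parent["id"])
    return hits


def delete_block_xml(root_id):
    """Delete-by-query message that removes a whole block via _root_."""
    return ("<update><delete><query>_root_:%s</query></delete>"
            "<commit/></update>" % root_id)


# Only the parent is deleted and then purged: 11 and 12 are orphaned
# and now sit directly in front of parent 20's block.
orphaned = [d for d in INDEX if d["id"] != 10]
print(red_xl_parents(orphaned))

# Workaround: remove every document whose _root_ is 10.
clean = [d for d in INDEX if d["_root_"] != 10]
print(red_xl_parents(clean))
print(delete_block_xml(10))
```

In this model the orphaned index attributes child 11 to parent 20, while deleting by _root_ leaves only parent 30 as a match.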

Further Directions

Here are a few further desirable features, in no particular order:

Faceting:

The Facet component for block indexes is quite useful in e-commerce. The trickiest thing it does is count SKU field values and aggregate them into product counts, as we described in an earlier post. Solr has had this capability since patch SOLR-5743.

Schema:

An application should be aware of relationships between documents while it indexes and searches. However, it might be more convenient if our search engine provides a “flat” navigation model to the front end, so the front end just refines search results by color, and the search engine figures out on its own which documents to filter and which ones to join.

Scoring Mode:

ToParentBlockJoinQuery supports several score calculation modes, but the {!parent} parser has the None mode hardcoded.

Group Collecting:

Use the [child] doc transformer, a feature added in the SOLR-5285 patch.
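As a hedged sketch, the transformer is requested through the fl parameter. The parentFilter and childFilter arguments below follow the Solr documentation for [child] as far as I know; the field list and the chosen child filter are just examples.

```python
from urllib.parse import parse_qs, urlencode

params = {
    "q": "{!parent which=type_s:parent}+COLOR_s:Red +SIZE_s:XL",
    # Return each matching parent together with its Red children.
    # The field list and childFilter value are illustrative assumptions.
    "fl": "id,BRAND_s,[child parentFilter=type_s:parent childFilter=COLOR_s:Red]",
}

query_string = urlencode(params)
print(query_string)

# Decoding shows the fl value Solr would receive.
print(parse_qs(query_string)["fl"][0])
```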


Many things to think about

Implementing Block Join in Solr takes more than a little work and thought, and possibly a bit of research along the way. Is it worth the effort? We think so, because more efficient search improves the customer experience, which leads to more sales in the long run. And that’s what it’s all about, isn’t it?

Notes: BlockJoin support has been available in Solr since 4.5 (SOLR-3076), which was when Solr caught up with ElasticSearch in handling Nested Documents.

We also recommend reading these two articles on the subject: 2010's Proposal for nested document support in Lucene and Searching relational content with Lucene's BlockJoinQuery from 2012. You might also want to check this benchmark test we wrote about in our earlier blog post, High-Performance Join in Solr with BlockJoinQuery.

If you have a question about Block Join in Solr, please post a comment below or contact us via email for a prompt response.