Post 5: The Segmented Filter Cache and Block Join Query Parser in Solr

Search Aug 16, 2016 Grid Dynamics

Mikhail Khludnev

The “law of unintended consequences” applies to using the block join query parser in Solr, just as it does to many other things in life (and software). Leave out certain query strings in Solr, and it seems to make no difference. But this action can actually have positive effects, especially when working with Solr in a Near Real Time (NRT) environment. There are a number of other steps you can take to make Solr more NRT-capable, too.

Let’s look at an accidental finding about the block join query parser in Solr. What query do you think this parser yields if you omit the query string, as in q={!parent which='type_s:parent'}? It might not seem obvious, but it yields the parent filter itself (type_s:parent), taken from the perSegFilter cache, which is the same cached filter the parser uses when a query string is present. The initial intention for this code branch was to expose the parent bitset to users who wanted to reuse it as a Solr filter query. It turns out that it can also solve a filter cache regeneration issue and, therefore, make Solr more NRT-friendly.

We can start with the caching basics. If you specify fq=SIZE:XL in the request params, Solr creates an on-heap bitset spanning all segments and uses it as a filter very efficiently. However, when you perform a commit (hard or soft), this bitset is discarded, and you face a slowdown either at commit time, when filter bitsets are regenerated, or at query time, when unlucky ‘cold’ requests have to regenerate them. Such pauses make Solr not really NRT-friendly. If you are dealing with such commit pauses and/or have to commit frequently, read on. Otherwise, you can consider this an unusual Solr use case.
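To make the cost of a commit concrete, here is a minimal toy model in Python. It is not Solr code: both cache classes, the docs_scanned counter and the segment layout are invented for illustration, and it ignores deletions and segment merges. It only contrasts a cache that holds one bitset per filter over the whole index with a cache that holds one bitset per filter per segment.

```python
class TopLevelFilterCache:
    """Toy model: one bitset per filter over the whole index, dropped on commit."""

    def __init__(self):
        self.entries = {}      # filter name -> bitset over all segments
        self.docs_scanned = 0  # total work spent on cache misses

    def get(self, name, matches, segments):
        if name not in self.entries:
            bits = []
            for docs in segments.values():
                bits.extend(matches(doc) for doc in docs)
                self.docs_scanned += len(docs)
            self.entries[name] = bits
        return self.entries[name]

    def on_commit(self):
        self.entries.clear()  # the whole-index bitset is no longer valid


class PerSegmentFilterCache:
    """Toy model: one bitset per (filter, segment); old entries survive a commit."""

    def __init__(self):
        self.entries = {}      # (filter name, segment id) -> bitset for that segment
        self.docs_scanned = 0

    def get(self, name, matches, segments):
        bits = []
        for seg_id, docs in segments.items():
            key = (name, seg_id)
            if key not in self.entries:
                self.entries[key] = [matches(doc) for doc in docs]
                self.docs_scanned += len(docs)
            bits.extend(self.entries[key])
        return bits

    def on_commit(self):
        pass  # entries for unchanged segments stay valid in this toy model


def is_xl(doc):
    return doc == "XL"


# Each document is reduced to its SIZE value; segment names are made up.
segments = {"seg0": ["XL", "M", "XL", "L"], "seg1": ["S", "XL"]}
top, per_seg = TopLevelFilterCache(), PerSegmentFilterCache()
top.get("SIZE:XL", is_xl, segments)
per_seg.get("SIZE:XL", is_xl, segments)

# A commit adds one small segment, then the same filter is requested again.
segments["seg2"] = ["XL"]
top.on_commit()
per_seg.on_commit()
top_bits = top.get("SIZE:XL", is_xl, segments)
seg_bits = per_seg.get("SIZE:XL", is_xl, segments)
```

Both caches return the same bits, but after the commit the top-level cache rescans every document, while the per-segment one only scans the new segment.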

NRT-filters

To get rid of these pauses, try rewriting fq=SIZE:XL as fq={!parent which='SIZE:XL'}. Also, make sure that the perSegFilter cache has a suitable size and has NoOpRegenerator specified. Now filters should slow down neither the searches that follow a commit nor the commits themselves. To make sure this works as expected, look at the cache entries by enabling cache introspection. This is what you should see in the perSegFilter dropdown in SolrAdmin:

item_SIZE:XL: FixedBitSetCachingWrapperFilter(QueryWrapperFilter(SIZE:XL))
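If you build requests programmatically, the rewrite described above is a one-line string transformation. The sketch below uses only the Python standard library. The helper name as_per_segment_fq and the q=*:* main query are assumptions made for illustration, and the helper simply refuses filters containing single quotes rather than attempting to escape them.

```python
from urllib.parse import urlencode


def as_per_segment_fq(filter_query):
    """Wrap a plain filter in the block join parent parser (hypothetical helper)."""
    if "'" in filter_query:
        raise ValueError("this sketch does not escape single quotes")
    return "{!parent which='" + filter_query + "'}"


params = [
    ("q", "*:*"),                          # illustrative main query
    ("fq", as_per_segment_fq("SIZE:XL")),  # fq={!parent which='SIZE:XL'}
]
query_string = urlencode(params)  # URL-encoded, ready to append to a select URL
```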

Make sure that there is no hit in filterCache while you experiment with these filters.

It’s worth mentioning that intersecting such filters (when you specify several fqs) is not as efficient as with plain Solr fq, which intersects bitsets using bitwise AND on eight-byte words.
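For readers who haven't looked inside a bitset, this is roughly what a word-wise intersection looks like. It is a plain Python sketch of the general technique, not Solr's or Lucene's implementation; the doc ids and the second filter (COLOR:red) are made up.

```python
WORD_BITS = 64  # eight-byte words


def to_words(doc_ids, max_doc):
    """Pack doc ids into a list of 64-bit words, one bit per document."""
    words = [0] * ((max_doc + WORD_BITS - 1) // WORD_BITS)
    for doc_id in doc_ids:
        words[doc_id // WORD_BITS] |= 1 << (doc_id % WORD_BITS)
    return words


def intersect(a, b):
    """One AND per word handles 64 documents at a time."""
    return [x & y for x, y in zip(a, b)]


def to_doc_ids(words):
    """Unpack the set bits back into a sorted list of doc ids."""
    return [
        i * WORD_BITS + bit
        for i, word in enumerate(words)
        for bit in range(WORD_BITS)
        if (word >> bit) & 1
    ]


size_xl = to_words([1, 5, 64, 130], max_doc=200)
color_red = to_words([5, 64, 131], max_doc=200)  # hypothetical second filter
both = to_doc_ids(intersect(size_xl, color_red))
```

With 200 documents the whole intersection takes four AND operations, which is why top-level bitsets combine so cheaply.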

Another drawback of this hack is that it uses memory-wasteful plain bitsets (like Solr fq), rather than a more compact representation.
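A back-of-the-envelope calculation shows what "memory-wasteful" means here. The numbers are illustrative assumptions (an index of 100 million documents, a filter matching 1,000 of them), and the sorted array of 4-byte doc ids is just one example of a more compact representation, not a description of what Solr does.

```python
def plain_bitset_bytes(max_doc):
    """One bit per document, rounded up to whole eight-byte words."""
    words = (max_doc + 63) // 64
    return words * 8


def sorted_ids_bytes(matching_docs):
    """A sorted array of 4-byte doc ids, one entry per matching document."""
    return matching_docs * 4


bitset_size = plain_bitset_bytes(100_000_000)  # 12,500,000 bytes, however few docs match
sparse_size = sorted_ids_bytes(1_000)          # 4,000 bytes for 1,000 matches
```

A plain bitset costs the same whether the filter matches one document or all of them, so sparse filters pay the most.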

OR Filters

One of the questions that regularly hits the mailing lists is about the disjunction of cached filters: if fq=SIZE:L and fq=SIZE:M are cached as two separate cache entries, can’t we reuse these bitsets in the disjunction filter fq=SIZE:L OR SIZE:M and avoid caching the disjunction as yet another entry? Yes, we can: fq={!cache=false}{!parent which='SIZE:L'} OR {!parent which='SIZE:M'}.

In addition to the cache introspection mentioned in the previous section, you can check that you are doing it right by placing this string in the q= param and requesting debugQuery=true. You should see something like this:

{!cache=false}{!cache=false}ConstantScore(FixedBitSetCachingWrapperFilter(QueryWrapperFilter(SIZE:L))) {!cache=false}{!cache=false}ConstantScore(FixedBitSetCachingWrapperFilter(QueryWrapperFilter(SIZE:M)))

Here you can see the non-cached disjunction of two filters cached in perSegFilter. The last two notes from the previous section (about inefficient combining and storing) apply here as well.
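The disjunction string is easy to get wrong by hand, so here is a small sketch that assembles it from the individual filters. The helper names parent_clause and uncached_or are hypothetical, and quote escaping is deliberately left out.

```python
def parent_clause(filter_query):
    """Wrap one filter so it is served from perSegFilter (hypothetical helper)."""
    if "'" in filter_query:
        raise ValueError("this sketch does not escape single quotes")
    return "{!parent which='" + filter_query + "'}"


def uncached_or(*filter_queries):
    """OR the per-segment filters together without caching the combination."""
    return "{!cache=false}" + " OR ".join(parent_clause(f) for f in filter_queries)


fq = uncached_or("SIZE:L", "SIZE:M")

# To inspect the parsed query, send the same string as q= with debugQuery=true.
debug_params = {"q": fq, "debugQuery": "true"}
```

Only the two inner filters end up in perSegFilter; the disjunction itself is recomputed on each request.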

Filters 2.0

Note that all this dancing around filters is about using the heap to cache postings lists. Given that the postings list file is mmapped most of the time, according to this great advice, how much sense does that make? The reason for caching is the on-disk postings format, which is CPU-intensive to decode when read. This format also stores data needed only for scoring, such as tf, which is not needed for filtering. In addition, Solr’s filters use bitwise operations for intersection, which usually gives some gain. Thus, we can think about a specialized bitset codec as a feature of filters. There is a modest patch that should help this approach.

NRT-Facets

What else makes Solr unfriendly to NRT? UnInvertedFields! What can you do about them? If you count facets on single-valued fields, you can use Lucene’s FieldCache by setting facet.method=fcs. If you deal with multivalued fields, you can specify docValues for them, which triggers an alternative faceting engine. DocValues facets use an on-heap data structure (OrdinalMap) that leads to pauses similar to those caused by UnInvertedField. However, they should be much shorter.
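For the single-valued case, the request side is just a matter of adding one parameter. A minimal sketch follows. The function name is hypothetical, the q=*:* and rows=0 values are illustrative assumptions, and SIZE is assumed here to be a single-valued field.

```python
from urllib.parse import urlencode


def per_segment_facet_params(field):
    """Request params for faceting on one single-valued field (hypothetical helper)."""
    return [
        ("q", "*:*"),             # illustrative main query
        ("rows", "0"),            # only the facet counts are of interest here
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", "fcs"),  # the facet.method value mentioned above
    ]


facet_params = per_segment_facet_params("SIZE")
facet_query_string = urlencode(facet_params)
```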

One last note: NRT doesn’t mean better throughput in general; it just means more predictable latency, which doesn’t necessarily mean lower average latency.

If you have a question about Block Join in Solr, please post a comment below or contact us via email for a prompt response.