Share
Follow

A case study: birthday search queries using Solr

Apr 16, 2019

“Find all customers who have a birthday in the next 10 days”
“Which employees have achieved 10 years of service this month?”

These types of questions are simple for a human to answer with the aid of a calendar, but it is much more difficult for a computer to answer them on its own. Calendars have many rules built into them that are not immediately obvious to users that must be accounted for by a computer.

One of our customers was looking to easily query customer milestones like birthdays, anniversaries, graduation dates and years of service. To help them execute queries for their customer marketing campaigns, we built a data layer to support high performance range queries with Solr. This blog post will discuss the various issues we encountered, and how we solved them with a Solr search engine that leveraged range and function queries.

How to fix date range search

Let’s focus on one particular problem - birthday searches. Other date search-related use cases are conceptually similar. We need to find all customers with a birthday between today and some time in the near future, and we have available the following data:

• A static date-field stored in UTC time zone, e.g.“1984-02-15T00:00:00Z”.
• A query date range such as the “next 10 days” or the “next month”. For this blog post, let’s consider the “next 10 days” as a default query.

So we want to find all of the customers who satisfy our search criteria, and sort them by the number of days until their birthday, starting from the closest birthday. This means that we will need to have a synthetic field in our search results, “days_to_birthday”, which is sorted in ascending order.

We will also have to consider the challenges presented by calendar issues such as:

• The problem of ‘leap years’. Some years have 365 days, whereas others have 366. So you need to consider more unique situations where some customers are born on the 29th of February vs. the 1st of March.
• Scenarios where days left in the current year are less then the query date range, which requires being able to join the current and subsequent years.
• Time zone differences. Imagine a customer who’s timezone is Pacific daylight time (PDT), while the search request might be issued from the Central European summer time (CEST) timezone. We don’t want the customer to miss a notification, so we need to account for both the potential time zone and possible calendar issues.

We can obviously bruteforce this problem with a table scan, inspecting each customer record separately. However, this option works poorly for large customer databases, so we want to focus on more sophisticated index-based solutions.

The naive solution

There is a straightforward, but naive approach that could be taken - to index all birthdays within the next 20 years in separate fields. For example to have: “person_date_of_birth.2018”, “person_date_of_birth.2019”, … But this is not a good long term solution as you have to deal with ‘border’ cases, and join together years when requests cross into the following year. Moreover, it pollutes the database with many unnecessary fields, which is not ideal when using Solr/Lucene.

The optimal solution

As long as date-related information is static, we can enrich our documents with additional “synthetic” fields. Given the “person_date_of_birth” initial field value, we can extract a useful piece of information from it; the ordinal of the day within the year, like in Java’s Calendar#DAY_OF_YEAR (e.g. “person_date_of_birth.yday”). It can be done either during the data ingestion phase (e.g. from DIH), or it can be a custom UpdateRequestProcessorFactory. In our GitHub sample, it is a separate Indexing component, which is responsible for such types of decoration: DocumentIndexer.

Example Input:

{ “person_first_name”: “John”, “person_last_name”: “Smith”, “person_date_of_birth”: “1984-02-15T00:00:00Z” }


Solr output:

{ “person_first_name”: “John”, “person_last_name”: “Smith”, “person_date_of_birth”: “1984-02-15T00:00:00Z”, “person_date_of_birth.yday”: 46 }


This approach, step by step

The following provides an overview of the solution:

• A query is made to the search engine
• We know the current timestamp so we can identify:
• The current year (whether it's a leap year or not). Example: 2018
• The current day of the year (a number ranging from 1 to 366). Example: 39 (8th of Feb)
• We already have all necessary information indexed for our customers:
• “person_date_of_birth.year” (and we can identify whether it’s a leap year or not)
• “person_date_of_birth.yday” (also a number ranging from 1 to 366)
• We issue a range query to filter out all the necessary candidates (+/- one day in case of a leap day on 29th Feb. For example:
    q=person_date_of_birth.yday:[39 TO 49] (“next 10 days”)

• To calculate a synthetic field, “days_to_birthday” we will employ function query. For example:
    fl=*,days_to_birthday:(sub(person_date_of_birth.yday,39))

• To sort by the “days_to_birthday” in ascending order:
    sort=sub(person_date_of_birth.yday,39) asc


So far so good. This looks like a viable approach, and it forms the basis of our solution. But the devil is in the details. There are several scenarios that need to be handled very carefully to ensure that every query works 100% correctly. Several of these scenarios are presented below and allow us to outline the final solution step-by-step.

Scenario #1 - Dealing with a leap year

This approach will not work correctly when:

• The query is issued from a non-leap year; or
• The customer wasn born in a none-leap year

To refresh your knowledge of leap years - here is a Wiki link. To use a non-linear scan approach, we should be using Solr’s range queries to crop the matching docs. We will represent each birthday point as an ordinal ranging from 1 to 365 (or 366).

There is a problem dealing with leap vs. non-leap years when there are 365 vs. 366 days in the year. The calendar#DAY_OF_YEAR scale would work differently. Take a look:

Example: 10th of March from 2014 (69th ordinal) is not the same as the 10th of March from 2016 (70th ordinal). Therefore, we need to make sure that the day ordinals are comparable between leap and non-leap years. One of the possible solutions is to introduce a “fake” leap day which streamlines all years:

This unifies the search logic, and allows for simpler calendar arithmetics. Thus, during ingestion we will see something like this:

So how can we resolve the pesky leap year issue? If the current year is a non-leap year, we will need to join the ordinals of 60 (Feb 29) and 61 (Mar 1). People who were born on Feb 29th, will celebrate their birthdays on Mar 1st in non-leap years.

Let’s consider the following scenario: it is Feb 27, 2016 (a leap year) and we are looking for customers who have birthdays within the next five days:

It’s important to note that even though our query includes a leap day, we have properly compensated for the potential additional day at index time so are comparing “apples to apples”. Let’s call such a leap day, a “real” leap day (because it exists in the real world).

In this case, the function query to calculate “days_to_birthday” would look like (string returned from BirthdaySearchComponent#daysToBirthdayFunctionScore(...)):

sub(
person_date_of_birth.yday,
58
)


In the case of a non-leap year, when there will be a “fake” leap day, we will need to join together ordinals 60 and 61. So it would look like this:

We have to adjust the “days_to_birthday” field calculation, and we will have to expand the selecting query to be “yday:[58 TO 64]” rather than “yday:[58 TO 63]”.

The function-query in such a case would look like:

sub(
if(
ifLessThan(person_date_of_birth.yday, 60),
person_date_of_birth.yday,
sub(person_date_of_birth.yday, 1)
),
58
)


Here, we defined a new function-query ifLessThan(leftArg, rightArg). It is fairly easy to implement it (if in doubt, follow BirthdaySeachComponent#ifLessThan(...)). “60” here is the constant, which defines the ordinal of “fake” leap day (60 == 31 days in Jan + 29 days in Feb).

Scenario #2 - Before the new year

We need to be cautious around Christmas time because the ordinals are reset after New Years Day, so we need to handle this time period with our Solr query carefully. The overall picture might look like this:

There is nothing challenging about the Solr query itself, but we need to issue a disjunction query that will cover the end of the previous year, and the beginning of the new upcoming year (in Solr syntax):

person_date_of_birth.yday:[364 TO 366] person_date_of_birth.yday:[1 TO 2]

The interesting part is the function query that calculates ‘days_to_birthday’. Now let’s try and understand what that function query would look like. We need to cross the border of “366” -> “1” because sub(person_date_of_birth.yday, 364) works only until #person_date_of_birth.yday <= 366:

Human Date person_date_of_birth.yday sub(person_date_of_birth.yday, 364) Expected
Days-to-Birthday
December 29 364 0 0
December 30 365 1 1
December 31 366 2 2
January 1 1 -363 3
January 2 2 -362 4

We could make use of modulo arithmetics (by modulo of 366), and turn all calculations into dealing with positive numbers only:
mod(sub(sum(person_date_of_birth.yday, 366), 364), 366):

Human Date person_date_of_birth.yday x :=
sub(sum(person_date_of_birth.yday, 366), 364)
mod(x, 366) Expected
Days-to-Birthday
December 29 364 366 0 0
December 30 365 367 1 1
December 31 366 368 2 2
January 1 1 369 3 3
January 2 2 370 4 4

Finalizing the formula

There is nothing overly challenging composing the Solr query that will crop the ‘documents that satisfy the birthday criteria’. It is going to be a simple range-query against the person_date_of_birth.yday field.

Now let’s try to generalize a formula that calculates ‘days_to_birthday’, so that it addresses both scenario #1 and scenario #2. We will use the example that today is ‘29 Dec’ (or yday=364) and use a random searching criteria (let’s say “all people who have birthdays within the next 100 days”). This example provides us a time period that overlaps both the New Year (1 January) and a potential leap day (29 February).

It is important in the calculation to establish whether we are covering a “real” leap day or a “fake” leap day in the future. In the case of a fake leap day, the formula would look like this:

mod(
sub(
if(
ifLessThan(person_date_of_birth.yday, 60),
sum(person_date_of_birth.yday, 366),
sum(person_date_of_birth.yday, 365)
),
364
),
366
)


And in the case of a real leap day, it would look like this:

mod(
sub(
sum(person_date_of_birth.yday, 366),
364
),
366
)


You’ll always be able to know in advance whether you will or will not overlap fake or real leap days, so only a single version of the function query from above will be applied at the query-time.

Time zone differences

Time zone differences do not create a major problem. This is because they simply require that you calculate your ordinal day depending on your desired time zone. For the sake of simplicity, we assume that all birthday dates are indexed, as dates in the UTC time zone. Therefore, in order to make proper calculations, you just need to calculate your ordinal day in a custom time zone. Typically, it is represented as a separate parameter in your query (i.e. &timeZone=...)

Example (birthday within the next 10 days):

• System time: 9:20PM EST 20/01/2019; time zone = America/New_York;
• UTC day-ordinal = 21 (because it’s already January 21 in UTC).
• Time zone day-ordinal = 20 (because it’s still January 20 in EST).
• Solr’s q=yday:[20 TO 30] (and not q:yday[21 TO 31])

Start day offset

This is to address the scenario where you want to adjust the “current day”, and execute your search from a particular day. You can think of it as overriding the system time, and issuing your ‘birthday search’ tomorrow instead of today.

In order to implement a start day offset, an additional query parameter is used (&birthdayOffset=...) to control how the day-ordinal is calculated. This allows you to exclude those entries in your result set that are having a birthday today.

Example (birthday within the next 10 days)

• System time: 5:45PM PST 13/02/2019; birthday offset starting tomorrow (offset=1);
• Day-ordinal = 44 (it’s the February 13);
• Offset = 1; therefore the query would look like q=yday:[45 TO 55].

Conclusion

With this approach, we were able to design a viable solution using the Solr search engine. This solution leverages two key features of Solr, range queries and function queries. We can now solve this problem without needing to conduct a full scan, and the solution can be easily scaled within the search index.

While we were able to successfully solve this problem for our customer, their issue was certainly not unique. Many companies run complex, personalized marketing campaigns, and need to easily query their customer database in order to gather lists of customers with relevant milestones. Therefore, our approach has a wide range of useful applications across the retail sector.