#### Multi-agent deep reinforcement learning for multi-echelon supply chain optimization

**Ilya Katsov**

Jun 10, 2020 •

**11 min read**

Mar 08, 2019
• **17 min read**

Speed of delivery is one of the key success factors in online retail. Retailers fiercely compete with each other in this space, implementing new order fulfillment options such as Same-Day Shipping, Buy Online Pick Up in Store (BOPUS) and Ship from Store (SFS). These capabilities help to both improve customer experience, and utilize the in-store inventory more efficiently by making it available for online customers.

Working with a number of retail clients on implementation of BOPUS and SFS capabilities, we figured out that these use cases bring in a number of interesting optimization challenges related to inventory reservation. Moreover, the quality of these reservation decisions directly impacts both customer experience and inventory utilization, so that the business benefits offered by new fulfillment options can be severely damaged by a lack of analysis and optimization.

In this post, we describe the inventory optimization problem for BOPUS and SFS use cases, and provide a case study that shows how this problem has been solved in real-life settings using machine learning methods.

The main idea of BOPUS and SFS is to make in-store inventory available for purchase through digital channels, so that online customers can either reserve items in a local brick-and-mortar store for pick up, or get them shipped and delivered promptly. Both scenarios assume that orders are served directly from store shelves, and thus order fulfillment personnel (order pickers) compete with customers for available units:

Order processing has some latency, so the units sold online can become sold out by the time the order is being fulfilled, even if the inventory data is perfectly synchronized between the store and online systems (e.g. in-store points of sale update the availability database used by the e-commerce system in real time). This leads to the inventory reservation problem: given that the store has a certain number of units of some product on-hand, how many units can be made available for online purchase, and how many units need to be reserved for store customers? Let us use the terms Available to Promise (ATP) for the first threshold, and Safety Stock (SS) for the second one. The relationship between these thresholds is shown below:

We assume that the online order placement process checks ATP for products in the shopping cart, and accepts the order only if the ordered quantity is less than the product’s ATP. If the ordered quantity is greater than the ATP, the order is rejected to avoid a fulfillment exception.

We also assume that the system works in discrete time: the inventory is replenished by some external process at the beginning of each time interval, and the on-hand inventory in the beginning of the interval is known to the order placement process. Thus, the problem boils down to the optimization of safety stock levels for individual products, stores and time intervals:

$$

\text{ATP(product, store, time)} = \text{onhand(product, store, time)} - \text{SS(product, store, time)}

$$

For example, we worked with a retailer that was able to replenish their inventory overnight, so the goal was to set a safety stock level for each product, in each store, each day. The order placement process uses the ATP calculated through the above formula as a daily quota, and compares it with the running total of units ordered online.

If our algorithm produces safety stock estimates that exactly match the in-store demand, it is a perfect solution - we do not have order fulfillment exceptions caused by stockouts, and all the merchandise that is unsold in the store is available for online customers. If our safety stock estimate is above the true in-store demand, we still do not have fulfillment exceptions, but our online sales can be worse than potentially possible, as some fulfillable orders will be rejected. If our estimate is below the true demand, we will have fulfilment exceptions, because some units will be double-sold. In other words, the quality of our safety stock estimates can be measured using the following two conflicting metrics:

**Pick rate:**The ratio between the number of successfully fulfilled items and the total number of ordered items over a certain time interval.**Exposure rate:**The ratio between the number of items potentially available for online ordering (the difference between on-hand inventory and true in-store demand) and the actual ATP.

A retailer can balance these two metrics differently depending on their business goals, the cost of fulfillment exceptions and the cost of underexposed inventory. For example, some retailers can redirect SFS orders from one store to another in case of exceptions at a low cost, and thus can accept a low pick rate to achieve high exposure rates. On the other hand, some retailers might be concerned about customer dissatisfaction caused by order fulfillment exceptions, and choose to maintain high pick rates at the expense of exposure rates. Consequently, it is important to develop a safety stock optimization model that enables retailers to seamlessly balance the two metics.

The most basic approach to the optimization problem defined above is to set some fixed safety stock level for all products (such as 1 unit, 2 units, and so on). While this clearly provides some flexibility in terms of balancing between the pick and exposure rates (the level of zero maximizes the exposure, the level of infinity maximizes the pick rate, etc.), differences between products and demand fluctuations are not taken into account.

We can expect to build a better solution for safety stock settings using predictive models that forecast the demand, taking into account product properties and sales data, as well as external signals such as weather. In the remainder of this article, we will discuss a case study on how such a model has been built, and then explain how it was wrapped into another econometric model to estimate and optimize pick and exposure rates.

The first step towards building a safety stock optimization model was to explore the available data sets to understand the gaps and challenges in the data.

The first challenge apparent from the preliminary data analysis was high sparsity of sales data at a store level. We had the sales history for several years for a catalog with tens of thousands products, so that the total number of product-date pairs was close to a hundred million. However, the sales rate for the absolute majority of the products was far less than one unit a day, as shown in the table below, and thus only about 0.7% of product-date pairs had non-zero sales numbers.

The second major challenge was that although we had store sales and catalog data at our disposal, the historical inventory data was not available. Consequently, it was not possible to distinguish between zero demand and out of stock situations, which made it even more challenging to deal with zeros in the sales data.

The third challenge was that we had data for only one store for the pilot project, so it was not possible to improve the quality of demand forecasting by learning common patterns across different stores.

The sparsity of the sales data makes it difficult to predict the demand using standard machine learning methods, which is the key to a good safety stock model. One possible approach to deal with this problem is to calculate special statistics that quantify the variability of the demand timing and magnitude, and then use these statistics to switch between different demand prediction models. ^{[1]}^{[2]} Consider the following two metrics that can be used for this purpose:

**Average demand interval**(ADI): The average interval between two non-zero demand observations. This metric quantifies the time regularity of the demand. It is measured in the demand aggregation intervals (days in our case, because we worked with daily demand and safety stock values).**Squared coefficient of variation**(CV^{2}): The standard deviation of the demand divided by the average demand. This metric quantifies the variability of the demand by magnitude.

These metrics can be used to classify the demand histories into the following four categories or patterns that can often be observed in practice:

**Smooth demand**(low ADI and low CV^{2}): The demand history has low variability both in time and magnitude. This pattern in often easy to forecast with good accuracy.**Intermittent demand**(high ADI and low CV^{2}): The demand magnitude is relatively constant, but there are large and irregular gaps between non-zero demand values that make forecasting more challenging. Specialized forecasting methods can be applied in this case.**Erratic demand**(low ADI and high CV^{2}): The demand does not have a lot of gaps, but the magnitudes vary significantly and irregularly, which makes forecasting more challenging. Specialized forecasting methods can be applied in this case.**Lumpy demand**(high ADI and high CV^{2}): The demand is both intermittent and erratic, with irregular gaps and sharp changes in the magnitude. This pattern is very challenging or impossible to predict.

The cut-off values that are commonly used to differentiate between high and low ADI and CV2 values are 1.32 and 0.49, respectively. In our case, the distribution of demand histories according to the above classification was as follows:

Only 0.94% of demand histories were smooth, 0.16% were erratic, 3.56% were lumpy, and the absolute majority of histories, or 95.33%, were intermittent.

The demand classification model above can be used to switch between different demand forecasting models and techniques. However, we eventually chose to use the notion of intermittent demand in a different way, and create features that help to combine individual zeros in the demand history into groups, then use these features as inputs to demand prediction models to combat data sparsity. We will discuss this approach in the next section in further detail.

We built several preliminary demand models using XGBoost and LightGBM, and found that it is quite challenging to achieve good forecast accuracy using just basic techniques. However, this preliminary analysis helped to identify several areas of improvement:

- The large number of intermittent demand histories significantly impacts the accuracy of the forecast, and needs to be addressed.
- The absence of inventory data needs to be addressed and, ideally, stockouts need to be identified.
- The model should be able to balance pick and exposure rates.

Weather has a significant impact on the model accuracy, so weather signals (historical data or forecasts) have to be incorporated into the model as well.Each of these four issues deserves a detailed discussion, and we will go through them one by one in the following sections.

As was mentioned earlier, we concluded that the problem with intermittent demand can be sufficiently mitigated by calculating the following three statistics for each date in the demand history, and using these series of statistics as inputs to the demand forecasting model:

- Number of zeros before the target date
- Number of non-zero demand samples before the target date (length of the series)
- Interval between the two previous non-zero demand series

These metrics help the forecasting model to group zeros together and find some regularities, which decreases the noise in the original data created by the sparsity of non-zero demand observations. This approach is conceptually similar to latent variables in Bayesian statistics.

Intermittent demand statistics improve the accuracy of the forecast by finding regularities in zero demand samples, so we expected to get even better results by building a model that discovers even more regularities caused by stockout (that were not observed explicitly because we lacked inventory data).

We started with a heuristic assumption that a long series of zeros in the demand history indicates that the item is out of stock. We used this assumption to attribute each date in the demand histories from the training set with a binary label that indicates whether or not the product is considered out of stock. This label was then used as a training label to build a classification model that uses demand history, product attributes and intermittent demand statistics as input features to predict stockouts for a given pair of a product and date.

We implemented this model using LightGBM and a Bayesian optimization from the GPyOpt package for hyperparameter tuning. This model in isolation achieved quite good accuracy on the test set, as shown in the confusion matrix below:

There are two elements to the problem of balancing pick and exposure rates:

- First, we need to provide some mechanism to trade the pick rate for the exposure rate and vice versa using some hyperparameter.
- Second, we need to estimate the actual values of pick and exposure rates for a selected tradeoff. Note that this problem is different from the first one, as the tradeoff hyperparameter can be just a number that is not explicitly connected with the absolute values of the rates.

The first problem can be solved by introducing a controllable bias into the demand prediction model. A model that systematically underestimates the demand will be biased towards higher exposure rate, and a model that systematically overestimates will be biased toward higher pick rate.

This idea can be implemented using an asymmetric loss function where the asymmetry is controlled by a hyperparameter. We decided to use the following loss function, which can be readily implemented in LightGBM:

$$

\begin{aligned}

L(x) = \begin{cases}

\beta \cdot x^2, \quad &x\le 0 \\

x^2, \quad &x > 0

\end{cases}

\end{aligned}

$$

where $\beta$ is the asymmetry parameter that controls the model bias. This asymmetric loss function enabled us to build a family of models for different values of the penalty parameter, and switch between them to achieve arbitrary tradeoffs.

The second problem with estimating the actual business metrics is much more challenging, and it required us to develop a special mathematical model which we will discuss later in this article. The model for business metrics is clearly important for safety stock optimization, but it does not directly impact the design of the demand prediction model.

The in-store demand is obviously influenced by weather conditions, so we used the data set provided by the National Oceanic and Atmospheric Administration (NOAA) to pull features like the following:

- Average temperature
- Average daily wind speed
- Precipitation
- Direction of fastest 2-minute and 5-minute wind
- Snowfall, fog, thunder, hail and rime flags

The final architecture includes two predictive models - the stockout classification model described above, and the second one for the final demand prediction. This layout is shown in the following diagram:

Both models are implemented using LightGBM and the GPyOpt library for hyperparameter tuning. This architecture was used to produce 20 different demand predictions for 20 different values of the asymmetry parameter.

The following chart shows how the accuracy of prediction gradually improved as more components and features were added (for the non-biased variant of the model):

The demand prediction model produces the outputs that can be used for setting safety stock. However, safety stock values alone are not enough to estimate the exposure and pick rates. In this section, we show how we developed a model that helps to estimate these business metrics. The estimation of exposure and pick rates is important because the inventory optimization system has to guarantee certain service level agreements (SLAs) that are defined or constrained in terms of such metrics, and cannot just blindly pick some value of the asymmetry parameter.

Let us denote the observed in-store demand (quantity sold) for product $i$ at time interval $t$ as $d_{it}$. The demand prediction model estimates this value as $\widehat{d}_{it}$, and we set safety stock based on this estimate

$$

\text{safety stock}_{it} = T(\widehat{d}_{it})

$$

where $T(x)$ is a rounding function, as we can stock and sell only integer numbers of product units. The estimation error is given by the following expression:

$$

\varepsilon_{it} = d_{it} - \widehat{d}_{it}

$$

A positive error means that safety stock is underestimated (too few units allocated for in-store customers) and generally leads to double selling in-store and online. A negative error means an overestimate (too many units allocated for in-store customers), leading to low exposure rates. Let us also denote the on-hand inventory as $q_{it}$. Thus, the actual in-store residuals available for online sales (which is the ideal ATP) will be as follows:

$$

x_{it} = q_{it} - d_{it}

$$

In our case, the on-hand inventory was not known, so we made an assumption that this quantity is proportional to several past demand observations:

$$

\widehat{q}_{it} = \alpha \cdot \frac{1}{n} \sum_{\tau=1}^n d_{i, t-\tau}

$$

where $\alpha$ is a model parameter that controls or reflects the product replenishment policy. The smaller values of this parameter correspond to lower average stock levels (compared to the average demand), and the greater values correspond to higher stocks levels. From a business perspective, this parameter is linked to the inventory turnover rate, and reflects how conservative the inventory management strategy is. We can set the inventory level parameter based on our estimate of what this level actually is for the given replenishment policy, or we can evaluate the model for different levels, see how the stock level influences the pick and exposure rates, and adjust the replenishment policy based on the results.

The above assumption enables us to estimate the residuals for online sales as follows:

$$

\widehat{x}_{it} = \max \left[ T(\widehat{q}_{it} - d_{it}), \ 0 \right]

$$

Using our models, we can now easily estimate the online exposure rate as the expected ratio between the predicted ATP (which includes the prediction error) and the ideal ATP (residuals) based on the historical data:

$$

\text{ER}(\alpha) = \mathbb{E}_{it} \left[ \frac{\widehat{x}_{it} + T(\varepsilon_{it})}{\widehat{x}_{it}} \right]

$$

Next, we need to estimate the pick rate, which is slightly more complicated, as it generally depends on the online demand. Let us start with the observation that an order fulfillment exception occurs when the online demand (let’s denote it as $d_{it}^o$) exceeds the in-store residual:

$$

\text{fulfilment exception:}\quad d_{it}^o > x_{it}

$$

The pick rate is essentially a probability of the fulfilment without any exceptions that can be expressed through probabilities, and is conditioned on positive and negative prediction errors:

$$

\begin{aligned}

\text{PR}(\alpha) &= p(d_{it}^o \le x_{it}) \\

&= p(\varepsilon_{it} \le 0) \cdot p(d_{it}^o \le x_{it} \ |\ \varepsilon \le 0) -

p(\varepsilon_{it} > 0) \cdot p(d_{it}^o \le x_{it} \ |\ \varepsilon > 0)

\end{aligned}

$$

In the first term, the probability of fulfillment without an exception given the non-positive prediction error equals 1 (the exception cannot occur because we overestimate the demand and thus set safety stock too conservatively).

The probability in the second term generally depends on the online demand distribution. If this distribution is known, then the term can be estimated straightforwardly. In our case, the distribution of online demand was unknown because the the safety stock model was developed in parallel with transactional systems for the ship from store functionality. We worked around this issue by making certain assumptions about the demand distribution. First, let us note that the online demand is bounded by the estimated ATP (the online system will simply start to reject orders once ATP is exhausted):

$$

d_{it}^o \le x_{it} + T(\varepsilon_{it})

$$

Second, we can choose some parametric demand distribution over the interval from zero to the ATP based on what we know about the online demand. The simplest choice will be a uniform distribution:

$$

d_{it}^o \sim \text{uniform}(0,\ x_{it} + T(\varepsilon_{it}))

$$

This assumption can be illustrated as follows:

Consequently, the probability that a demand sample falls into the availability zone is given by:

$$

p\left(d_{it}^o \le x_{it}\ |\ \varepsilon > 0\right) = \frac{x_{it}}{x_{it} + T(\varepsilon_{it})}

$$

Collecting everything together, we get the following expression for the pick rate that can be evaluated as empirical probabilities based on the historical data:

$$

\text{PR}(\alpha) = \mathbb{E}_{it}\left[ p(\varepsilon_{it} \le 0) - p(\varepsilon_{it} > 0) \cdot \frac{x_{it}}{x_{it} + T(\varepsilon_{it})} \right]

$$

The above model is a convenient and flexible framework for safety stock evaluation. This model makes a number of assumptions regarding the online demand and inventory distributions to work around the data gaps we had, but it is quite easy to adjust for other scenarios depending on the available information and data about the environment.

The predictive and econometric model defined above can be used to estimate the pick and exposure rates for different values of model parameters, and then choose the optimal tradeoff. Recall that these models have two major parameters:

- The replenishment policy parameter $\alpha$ that reflects or defines the ratio between the in-store demand and the number of product units stocked after each replenishment.
- The loss function asymmetry parameter $\beta$ that controls the bias of the demand prediction model, either towards the pick rate or exposure rate.

In this section, we show how the full model was evaluated for different values of these two parameters to build safety stock optimization profiles, find optimal values, and estimate the uplift delivered by the model.

First, we plot the dependency between the pick rate and the replenishment parameter $\alpha$ averaged by all products. The following plot shows this dependency for different values of the model bias parameter $\beta$, and the curves that correspond to the baseline (naive) policy of setting safety stock to 1, 2 or 3 units:

We can achieve arbitrarily high pick rates by choosing large values of the replenishment parameter - the safety stock policy becomes immaterial when the stock level significantly exceeds the in-store demand. Consequently, all curves in the chart are monotonically increasing, and are approaching the ideal pick rate as the replenishment parameter increases. The steepness of the curves depends on how aggressive the reservation policy is - the higher the value of the safety stock, the steeper the curve. The curves produced by our predictive model are in between the curves for the baseline policies, as the predictive model is essentially trying to differentiate safety stock values by products that result in lower average safety stock levels.

Next, we can make a similar plot for the dependency between the mean exposure or the exposure rate and the replenishment parameter $\alpha$. Again, the exposure metrics can be made arbitrarily good by increasing the average stock level controlled by the parameter $\alpha$. For example, the mean ATP (measured in the number of product units) can be visualized as follows:

The above charts can help to make operational decisions, such as selecting the best parameter $\beta$ for a given replenishment SLA parameter $\alpha$. The drawback of this representation is that the pick and exposure rates are directly related, and the separate charts do not visualize the tradeoff between the two metrics. From that perspective, it makes sense to plot the pick and exposure rates on one plot, so that each curve corresponds to a certain value of $\alpha$, and each point on a curve corresponds to a certain pair of parameters $\alpha$ and $\beta$:

This chart clearly shows the overall performance of the solution - the family of curves that correspond to our optimization model are well above the curves that correspond to the baseline policies (fixed safety stock values). This means that the optimization model consistently provides a better tradeoff between the pick rate and exposure objectives than the baseline policy. For example, the pick rate uplift in the above chart is about 10% for a wide range of exposure rates. We used this representation to find the optimal tradeoff based on the pick and exposure rate preferences, as well as stock level constraints reflected by the curves for different values of $\alpha$. We were then able to choose the optimal value for the model bias parameter $\beta$.

Innovations in the area of order delivery are critically important for online retail. These innovations often come with new operational and optimization challenges that can be addressed using advanced analytics and machine learning. In this article, we presented a case study on how such an optimization apparatus was developed for Ship from Store and Buy Online Pick Up in Store use cases.

We showed that demand prediction at the level of individual stores has major challenges caused by sporadic demand patterns, and these challenges can be partly addressed by specialized demand modeling techniques and choice of input signals. We also showed that predictive modeling does not fully solve the problem, and one needs to build an econometric model to properly interpret the outputs of the predictive model and the benefits from it. The econometric model we developed also illustrates some techniques that can be used to work around data gaps, which can be caused by various reasons, including parallel development of optimization and transactional systems, and organizational challenges.

A. Ghobbar and C. Friend, Evaluation of forecasting methods for intermittent parts

demand in the field of aviation: a predictive model, 2003 ↩︎