Jun 26, 2019

Marketers usually use multiple channels, such as sponsored search, display ads, and emails, to reach their customers, and each channel usually includes multiple activities or has multiple parameters that are associated with various costs. For example, a marketer can run several email campaigns, each of which corresponds to a certain price discount, or run sponsored search for multiple keywords, each of which is associated with a certain bid amount. Customers, in turn, usually interact with multiple touchpoints along the way to conversion, so the effects of different touchpoints intertwine and accumulate.

This leads to the problem of marketing spend optimization, which requires estimating the true contribution of individual channels and activities to the final outcome and optimally allocating budgets across these channels, or even setting individual activity parameters such as bids in sponsored keyword search.

In this article, we explore how deep learning methods can be used to analyze sequences of customer interactions, and how the insights gained from such analyses can be used for spend optimization. We will gradually build a solution that can be applied to several common scenarios, including the following:

- Budget optimization across channels (display ads, email campaigns, etc.)
- Budget optimization across campaigns (different types of content, discounts, etc.)
- Optimization of channel or campaign parameters, such as sponsored search keywords.

The problem of spend optimization can be approached from several different perspectives, depending on data availability and the properties of specific channels and activities. One traditional approach, known as marketing mix modeling (MMM), takes an aggregated view of the problem and tries to estimate correlations between total spending on individual channels and overall performance metrics, such as the number of conversions, using some sort of regression analysis. A basic model of this kind may look like this:

$$
\text{sales} = \alpha_1 \times \text{budget}_1 + \alpha_2 \times \text{budget}_2 + \ldots
$$

in which each term corresponds to one channel, and channel efficiency is estimated based on the regression coefficients $\alpha_i$. More advanced marketing mix models, such as adstock, can incorporate more complex effects, such as the time decay of the advertising impact. This approach helps separate the contributions of overlapping marketing activities in some applications, but it is generally a crude approximation that ignores individual interactions with customers and behavioral patterns.
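To make the time-decay idea more concrete, here is a minimal sketch of one common geometric form of the adstock transformation, in which a fraction of each period's advertising effect carries over to the next period. The function name `adstock`, the decay value, and the toy spend series are our illustrative assumptions rather than part of any particular MMM library.

```python
def adstock(spend, decay=0.5):
    """Geometric adstock: each period inherits `decay` of the previous effect.

    `spend` is a list of per-period budgets; `decay` is an assumed
    carry-over rate between 0 and 1 (hypothetical value).
    """
    carried = 0.0
    result = []
    for x in spend:
        # Current effect = current spend + decayed effect of earlier spend
        carried = x + decay * carried
        result.append(carried)
    return result


# Toy weekly budget for one channel (made-up numbers)
weekly_spend = [100.0, 0.0, 0.0, 50.0]
effective_spend = adstock(weekly_spend, decay=0.5)
print(effective_spend)
```

The transformed series, rather than the raw budgets, would then be used as a regressor in the model above.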

Another approach is to analyze individual customer journeys and interactions, so that some credit is assigned to channels or activities for each conversion. This approach, known as attribution modeling, is widely used in digital advertising, where the marketer often distributes a payment for an individual conversion across multiple providers (agencies, publishers, etc.) based on attribution scores. Budgeting decisions can be made based on the attribution scores averaged across multiple customer journeys. One of the main advantages of attribution modeling is the ability to incorporate detailed data about customers, touchpoints, and individual events, and to provide deep insights into the dependencies between marketing activities and outcomes. Although one can build a basic attribution model using fairly simple methods, it is generally a complex problem that requires accounting for ordering and dependencies between events, as well as various customer and event attributes.

In the next sections, we build several attribution models, show how deep learning methods for sequential data can improve the quality of such models, and develop a link between attribution and spend optimization. More specifically, our plan is as follows:

- We will use a real dataset to train and evaluate the models, so we will start with initial data analysis and preparation.
- Next, we will build several attribution models using Keras, starting from the most basic ones and gradually increasing in complexity. These models produce channel attribution weights that can be interpreted as recommended budget-allocation ratios, but weights alone may not be sufficient to make accurate budgeting decisions.
- To close this gap between attribution and optimization, we develop a spend optimization routine at the end of the article.

We will train and evaluate our models using an online advertising dataset published by Criteo.^{[1]} This dataset contains about 16 million impressions (events), each of which has multiple attributes, including the following:

- **Timestamp**: timestamp of the impression
- **UID**: unique user identifier
- **Campaign**: unique campaign identifier
- **Conversion**: 1 if there was a conversion in the 30 days after the impression; 0 otherwise
- **Conversion ID**: a unique identifier for each conversion
- **Click**: 1 if the impression was clicked; 0 otherwise
- **Cost**: the price paid for this ad
- **Cat1-Cat9**: categorical features associated with the ad. The semantic meaning of these features is not disclosed.

We do not really have channels in this dataset, so we choose to optimize budget allocation across the campaigns. This is a more challenging task because the dataset contains about 700 advertising campaigns, so we have many more budgeting parameters to learn than in a typical cross-channel optimization, in which the number of channels is relatively small.

We start with the following initial transformation of the input data:

- We aim to analyze entire customer journeys, i.e., sequences of events, so we introduce a journey ID (JID), which is a concatenation of the user ID and conversion ID.
- We reduce the dataset size by randomly sampling 400 campaigns and filtering out journeys with just one event to focus on sequence analysis.
- The original dataset is, of course, imbalanced because conversion events are very rare. We balance the dataset by downsampling non-converted journeys.
- Finally, we also standardize some timestamp fields and do one-hot encoding for categorical fields (categories and campaigns). The total number of features after one-hot encoding is about 1,500.

Implementation of these initial steps is shown in the code snippet below.
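The journey-building and balancing steps can be sketched in plain Python as follows. This is a simplified illustration, not the original implementation: the field names mirror the dataset attributes, the helper names (`prepare_journeys`, `make_event`) are hypothetical, and we assume that non-converted events carry a placeholder conversion ID (here `-1`). Campaign sampling, timestamp standardization, and one-hot encoding are omitted.

```python
import random
from collections import defaultdict


def prepare_journeys(events, seed=0):
    """Group events into journeys, drop one-event journeys, balance classes."""
    journeys = defaultdict(list)
    for event in events:
        # Journey ID (JID): concatenation of the user ID and conversion ID
        jid = f"{event['uid']}_{event['conversion_id']}"
        journeys[jid].append(event)

    # Keep journeys with at least two events, ordered by timestamp
    journeys = {
        jid: sorted(evts, key=lambda e: e["timestamp"])
        for jid, evts in journeys.items()
        if len(evts) >= 2
    }

    converted = {
        jid for jid, evts in journeys.items()
        if any(e["conversion"] == 1 for e in evts)
    }
    non_converted = [jid for jid in journeys if jid not in converted]

    # Downsample non-converted journeys to the number of converted ones
    rng = random.Random(seed)
    sampled = rng.sample(non_converted, min(len(converted), len(non_converted)))
    kept = converted | set(sampled)
    return {jid: evts for jid, evts in journeys.items() if jid in kept}


def make_event(uid, conversion_id, timestamp, conversion):
    # Hypothetical minimal event record for the toy example
    return {"uid": uid, "conversion_id": conversion_id,
            "timestamp": timestamp, "conversion": conversion}


toy_events = [
    make_event("A", 7, 2, 1), make_event("A", 7, 1, 1),
    make_event("B", -1, 1, 0), make_event("B", -1, 2, 0), make_event("B", -1, 3, 0),
    make_event("C", -1, 1, 0), make_event("C", -1, 2, 0),
    make_event("D", -1, 1, 0),  # single-event journey, filtered out
]
balanced = prepare_journeys(toy_events)
print(sorted(balanced))
```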

Let us examine the distribution of journey lengths to confirm that it makes sense to use modeling methods for sequential data. The dataset contains journeys with 100 events or more, but the number of journeys falls exponentially with the length.

The number of journeys with several events is considerable; thus, it makes sense to try methods for sequential data.
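A length histogram of this kind can be computed in a few lines. The sketch below assumes that journeys are stored as a dictionary that maps a journey ID to its list of events; the function name and the toy data are hypothetical.

```python
from collections import Counter


def journey_length_histogram(journeys):
    """Return {journey length: number of journeys}, sorted by length."""
    counts = Counter(len(events) for events in journeys.values())
    return dict(sorted(counts.items()))


# Toy journeys: the event payloads are irrelevant here, only their count
toy_journeys = {
    "u1_c1": ["e1", "e2"],
    "u2_c2": ["e1", "e2"],
    "u3_c3": ["e1", "e2", "e3"],
    "u4_c4": ["e1", "e2", "e3", "e4", "e5"],
}
histogram = journey_length_histogram(toy_journeys)
print(histogram)
```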

The most basic and commonly used approach to attribution is position-based models. These models do not use any statistical analysis, but straightforwardly assign the credit to touchpoints based on their position in the journey. The most commonly used options include:

- Last-touch attribution: gives all credit to the last touchpoint in the journey; other touchpoints get zero credit.
- Time-decay attribution: gives more credit to the touchpoints that are closer in time to the conversion.
- Linear attribution: gives equal credit to all touchpoints in the journey.
- U-shaped attribution: gives most of the credit to the first and last touchpoints, and some credit to intermediate touchpoints.
- First-touch attribution: gives all credit to the first touchpoint in the journey.

These models do not really aim to assess the true contribution of touchpoints, but rather encode several different marketing strategies and reallocate budgets according to them. For example, a marketer can choose last-touch attribution to focus resources on customers who already are close to conversion, or first-touch attribution to focus on growth and acquisition. Other models correspond to more balanced strategies.

Last-touch attribution (LTA) is one of the most commonly used options, so we choose to implement it as a baseline. The implementation is quite straightforward:
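As a minimal sketch, LTA weights can be computed with two counters. We assume here that each journey is given as a time-ordered list of campaign IDs plus a conversion flag, and that a campaign is credited only when it is the last event of a converted journey; the credit is then divided by the total number of events (impressions) of that campaign. The function name and the toy data are hypothetical.

```python
from collections import Counter


def last_touch_attribution(journeys):
    """journeys: list of (campaign IDs in time order, converted flag)."""
    impressions = Counter()
    last_touch_conversions = Counter()
    for campaigns, converted in journeys:
        impressions.update(campaigns)
        if converted:
            # All credit goes to the last touchpoint of a converted journey
            last_touch_conversions[campaigns[-1]] += 1
    return {c: last_touch_conversions[c] / impressions[c] for c in impressions}


toy_journeys = [
    (["a", "b"], True),
    (["b", "a"], False),
    (["a", "a", "b"], True),
]
lta_weights = last_touch_attribution(toy_journeys)
print(lta_weights)
```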

The ratio between the number of journeys in which a given campaign is the last event and the total number of events for the same campaign gives the attribution weight (which can be interpreted as the return per impression). The following chart shows LTA-based weights for a sample of 50 campaigns:

The second baseline model that we build is a simple logistic regression model.^{[2]} Unlike position-based models, regression analysis aims to reveal touchpoints’ true contributions.

The idea of a regression-based approach is straightforward: each journey is represented as a vector in which each campaign corresponds to a binary feature (other event features can be added as well), a regression model is fit to predict conversions, and the resulting regression coefficients are interpreted as attribution weights.

The input data we prepared can be thought of as a 3D tensor, in which each event is represented by a vector of features, events are stacked into journeys, and journeys are stacked into the full dataset. We choose simply to aggregate all events in a journey, then fit a model using these aggregates as inputs:

The aggregation strategy depends on the feature: one-hot encoded event features (campaigns and categories) are aggregated into many-hot vectors, whereas the numbers of clicks and the costs are summed up. This feature engineering step and the train-test split are implemented in the code snippet below.
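The aggregation of a single journey can be sketched as follows. For brevity, we assume that each event carries only an integer campaign index, a click flag, and a cost; the categorical features would be handled the same way as campaigns. The function name and the event layout are hypothetical.

```python
def aggregate_journey(events, n_campaigns):
    """Collapse a journey into one vector: many-hot campaigns + summed clicks and cost."""
    campaigns = [0] * n_campaigns
    clicks = 0
    cost = 0.0
    for event in events:
        campaigns[event["campaign"]] = 1   # many-hot: campaign appeared at least once
        clicks += event["click"]           # clicks are summed over the journey
        cost += event["cost"]              # costs are summed over the journey
    return campaigns + [clicks, cost]


toy_journey = [
    {"campaign": 0, "click": 1, "cost": 0.25},
    {"campaign": 2, "click": 0, "cost": 0.5},
    {"campaign": 0, "click": 1, "cost": 0.25},
]
features = aggregate_journey(toy_journey, n_campaigns=4)
print(features)
```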

Next, we implement a logistic regression model. We use Keras for consistency with the more complex models developed in the next sections:

We are getting reasonably good accuracy for such a basic approach. It is common to assume that attribution weights are non-negative, so let us remap the regression coefficients to non-negative weights using softmax:
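The softmax remapping itself is a small computation. The sketch below uses made-up coefficients rather than values from the fitted model and subtracts the maximum before exponentiation for numerical stability.

```python
import math


def softmax(values):
    """Map arbitrary real coefficients to positive weights that sum to one."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical regression coefficients for three campaigns
coefficients = [-1.2, 0.0, 0.8]
lr_weights = softmax(coefficients)
print(lr_weights)
```

Note that softmax preserves the ordering of the coefficients, so campaigns with larger coefficients still receive larger weights.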

We can compare these weights with the LTA weights computed in the previous section:

The attribution weights produced by the two models are highly correlated, although significant differences exist for some campaigns.

We used a relatively simple model design that can be improved through more elaborate feature engineering and aggregation (e.g., one can try to add one-hot encoded event dates). However, advanced feature engineering is generally time consuming and fragile. On the other hand, we can expect to improve the model’s quality and even reduce the feature engineering effort by using methods that can consume sequences of events directly, i.e., without any aggregation. We explore this idea in the next two sections.

Our next step is to build a more advanced model that explicitly accounts for dependencies between the events in a journey. This problem can be framed as a conversion prediction based on the ordered sequence of events, and recurrent neural networks (RNNs) are a common solution for it. We choose to use a basic long short-term memory (LSTM) architecture with 64 hidden units, as illustrated in the figure below (hereafter, blue arrows denote fully connected layers):

The LSTM-based approach does not require the feature aggregation that we used for the logistic regression model, but we need to pack the events into a 3D tensor, as shown in the figure above. The implementation of this data repackaging is shown in the following code snippet.
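One way to pack variable-length journeys into a fixed-size tensor is to truncate long journeys and pad short ones with zero vectors. The sketch below uses nested lists and makes two assumptions that are ours: padding is added at the beginning of the journey, and only the most recent `max_len` events are kept. The function name is hypothetical.

```python
def pack_journeys(journeys, max_len, n_features):
    """Return a journeys x max_len x n_features nested list (zero-padded)."""
    tensor = []
    for events in journeys:
        events = events[-max_len:]  # keep the most recent events (assumption)
        n_pad = max_len - len(events)
        padding = [[0.0] * n_features for _ in range(n_pad)]
        tensor.append(padding + [list(e) for e in events])
    return tensor


# Two toy journeys with two features per event
toy_journeys = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.5]],
]
packed = pack_journeys(toy_journeys, max_len=3, n_features=2)
print(packed)
```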

The model can be implemented, fitted, and evaluated straightforwardly using Keras as follows:

We can see that the LSTM approach provided significantly better accuracy compared with the logistic regression baseline. However, LSTM does not provide a simple way to extract the attribution weights from the fitted model. Fortunately, we can build a much better model on top of LSTM that provides both better accuracy and explicit estimates of attribution weights.

The LSTM model described in the previous section starts with an initial hidden state vector $h$, sequentially updates it after each input event, and estimates the conversion probability based on the final state of the hidden vector. This approach is known to be limited in the sense that the final state is not always the best representation of the sequence, and better results can be obtained by using a weighted average of the intermediate states of the hidden vector.

This extension of a basic RNN is commonly known as an attention mechanism. It was originally developed for natural language processing (NLP) applications, in which the intuition was that the weights associated with the intermediate states essentially model the attention that a human reader pays to different words in a sentence.^{[3]} The attention mechanism is known to be very effective for sequence modeling in general, and several attention-based models were recently proposed specifically for the attribution problem, so we implement a variant of the attention-based model in this section.^{[4]}^{[5]}

First, let us briefly review the design of the attention mechanism. As we already mentioned, the main idea is to learn the attention weights that can be used to combine the intermediate hidden vectors $h_t$ together. Thus, attention weights can be interpreted as amplifiers that control the contribution of individual hidden vectors in the final vector $s$, which is used to make the prediction. A typical implementation of the attention mechanism includes the following operations:

- First, a fully connected layer with $\text{tanh}$ activation is used to squash each hidden vector $h_t$ into an attention vector $u_t$:

$$
u_t = \tanh(W \ h_t + b)
$$

- Second, the importance of each event (attention weight) is estimated as a normalized similarity between $u_t$ and a so-called context vector $c$ that is learned jointly during the fitting process:

$$
a_t = \text{softmax}(u_t^T \ c)
$$

- Finally, the journey vector $s$ is obtained as an attention-weighted sum of the hidden vectors:

$$
s = \sum_t a_t h_t
$$
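The three operations above can be traced on a toy example. The following sketch is a plain-Python forward pass only; in practice $W$, $b$, and $c$ are learned jointly with the rest of the network. The function name `attention_pool` and all numeric values are illustrative assumptions.

```python
import math


def attention_pool(hidden_states, W, b, c):
    """hidden_states: T vectors of size d; W: k x d matrix; b and c: size k."""
    # Step 1: u_t = tanh(W h_t + b)
    u = [
        [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
         for row, b_i in zip(W, b)]
        for h in hidden_states
    ]
    # Step 2: a_t = softmax over time steps of the similarities u_t^T c
    scores = [sum(u_i * c_i for u_i, c_i in zip(u_t, c)) for u_t in u]
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    a = [e / total for e in exps]
    # Step 3: s = sum_t a_t h_t
    d = len(hidden_states[0])
    s = [sum(a_t * h[j] for a_t, h in zip(a, hidden_states)) for j in range(d)]
    return a, s


# Toy hidden states for a journey with three events (d = 2, k = 2)
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, for readability only
b = [0.0, 0.0]
c = [1.0, 1.0]
attention, journey_vector = attention_pool(hidden, W, b, c)
print(attention, journey_vector)
```

With these toy values, the third event receives the largest attention weight because its hidden vector is the most similar to the context vector.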

We choose to use a slightly simplified variant of the above design, in which $u_t$ is a scalar; thus, the context vector $c$ is redundant. Finally, we add a linear embedding layer in front of the LSTM layer to map sparse one-hot encoded event vectors to denser 128-dimensional event embeddings, as suggested in ^{[4:1]}. This overall model design is shown in the diagram below.

Note that the feature vector produced by the top layer of the model can be augmented with additional journey-level features such as customer demographics. The implementation of this model in Keras is quite straightforward:

The attention mechanism clearly improves the accuracy of the model, but it also provides a convenient way to estimate attribution weights: We can just take the values of the attention vectors for each journey and average them across all training samples. This piece is highlighted in the above model design diagram in red.

Keras allows for cutting a trained model’s head and making predictions with a truncated model, so that the output is not the final conversion probability, but rather the output of a hidden layer, which is an attention layer in our case. We do this in the implementation below, i.e., build a truncated model, run predictions for all journeys in a training set, then compute average attention weights for each campaign:
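The final averaging step can be sketched independently of Keras. We assume that the truncated model has already produced one attention weight per event and that each journey is available as a pair of aligned lists (campaign IDs and attention weights); the function name and the toy values are hypothetical.

```python
from collections import defaultdict


def average_attention_by_campaign(journeys):
    """journeys: list of (campaign IDs, attention weights) with equal lengths."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for campaigns, attention in journeys:
        for campaign, weight in zip(campaigns, attention):
            totals[campaign] += weight
            counts[campaign] += 1
    # Average attention received by each campaign across all its events
    return {campaign: totals[campaign] / counts[campaign] for campaign in totals}


# Toy output of a truncated model: attention weights sum to one per journey
toy_attention = [
    (["a", "b"], [0.25, 0.75]),
    (["b", "a", "a"], [0.5, 0.25, 0.25]),
]
attention_weights = average_attention_by_campaign(toy_attention)
print(attention_weights)
```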

This produces attribution weights similar to the previous models:

These weights generally are correlated with the baseline LTA weights:

It is important to note that the attention weights in each journey are not independent: each weight quantifies the contribution of a touchpoint given all other touchpoints in the journey, so subsequences (pairs, triplets, etc.) of campaigns can also be analyzed.

We have produced three different sets of attribution weights using three different models (LTA, logistic regression, and LSTM with attention). Each of these sets can be used directly to reallocate the budget, but how do we tell which one promises the best return on investment (ROI)? We can simply assume that the higher-accuracy models produce better weights, but it would be a good idea to validate this assumption. One possible way to facilitate this validation is to simulate campaign execution under the new budgeting constraints by replaying historical events.^{[5:1]}

The campaign simulation idea can be outlined as follows:

- At the beginning of the process, we distribute a limited budget across the campaigns according to the attribution weights.
- We replay the available historical events (ordered by their timestamps) and decrement the budgets accordingly.
- Once a campaign runs out of money, we stop replaying the remaining events associated with it and somehow estimate the probabilities of conversion for all journeys affected by this campaign’s suppression.
- Finally, we count the total number of conversions and estimate ROI. If none of the campaigns in a converted journey runs out of money before the journey ends, this conversion will be counted explicitly. Otherwise, the estimate of the conversion probability will be used.

We implement this approach using two simplifying assumptions. First, we define the budget just as the number of events (impressions) that we can pay for, ignoring actual dollar costs. Second, we assume that once a campaign runs out of money, all journeys that have more events associated with this campaign will never convert. These assumptions lead to the following simulation algorithm:

**Inputs**: Total budget $B$, attribution weights vector $w$, events $x_t$ ordered by time.

**Outputs**: The number of conversions.

- Initialization:
  - Initialize the budgets (maximum number of events) for all campaigns: $$ b = \left\lceil w \ \frac{B}{\sum w} \right\rceil $$
  - Initialize a set of converted journeys $C = \{\}$, and blacklisted journeys $S = \{\}$
- Iterate over events $x_t$:
  - Let $j$ and $c$ be the journey ID and campaign ID associated with $x_t$, respectively.
  - If $j \notin S$ (journey is not blacklisted):
    - If $b_c \ge 1$ (campaign $c$ has more budget):
      - $b_c = b_c - 1$
      - If journey $j$ ended with conversion, add $j$ to $C$
    - Or else:
      - Add $j$ to $S$ (blacklist the journey)
- Return the number of non-blacklisted conversions $|C - S|$

This algorithm’s implementation is shown in the code snippet below.
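A minimal sketch of this simulation in plain Python is given below. It follows the algorithm step by step under the two simplifying assumptions stated above; the function name `simulate_budget`, the data layout (a time-ordered list of journey and campaign ID pairs plus a dictionary of journey outcomes), and the toy data are our assumptions.

```python
import math


def simulate_budget(events, journey_converted, weights, total_budget):
    """Replay events under per-campaign budgets and count surviving conversions.

    events: list of (journey ID, campaign ID) ordered by time
    journey_converted: dict mapping journey ID to True/False
    weights: dict mapping campaign ID to attribution weight
    """
    weight_sum = sum(weights.values())
    # b = ceil(w * B / sum(w)): budget expressed as a number of impressions
    budgets = {c: math.ceil(w * total_budget / weight_sum) for c, w in weights.items()}
    converted, blacklist = set(), set()

    for jid, campaign in events:
        if jid in blacklist:
            continue
        if budgets.get(campaign, 0) >= 1:
            budgets[campaign] -= 1
            if journey_converted[jid]:
                converted.add(jid)
        else:
            # Campaign is out of budget: the journey is assumed never to convert
            blacklist.add(jid)

    return len(converted - blacklist)


toy_events = [("j1", "a"), ("j2", "a"), ("j1", "b"), ("j2", "a"), ("j3", "b")]
toy_outcomes = {"j1": True, "j2": True, "j3": False}
toy_weights = {"a": 1.0, "b": 1.0}

tight = simulate_budget(toy_events, toy_outcomes, toy_weights, total_budget=4)
loose = simulate_budget(toy_events, toy_outcomes, toy_weights, total_budget=6)
print(tight, loose)
```

In the toy example, the tighter budget exhausts campaign `a` before journey `j2` ends, so only one of the two conversions survives.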

This simulation algorithm can be used to evaluate the performance of the original attribution weights produced by the models, as well as various transformations of these weights. For example, we can evaluate not only the original weights $w$, but also weights $w^p$ for different values of parameter $p$. This parameter essentially controls the "pitch" of the budget distribution: The values of $p$ between zero and one lead to a more even distribution of the budget across the campaigns; the values higher than one lead to a more uneven distribution:
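The effect of the pitch parameter is easy to check numerically. The sketch below raises non-negative weights to the power $p$ and renormalizes them into budget shares; the function name and the weights are made up for illustration.

```python
def apply_pitch(weights, p):
    """Raise non-negative weights to the power p and renormalize to shares."""
    powered = [w ** p for w in weights]
    total = sum(powered)
    return [w / total for w in powered]


# Hypothetical attribution weights for three campaigns
raw_weights = [0.1, 0.3, 0.6]
flatter = apply_pitch(raw_weights, p=0.5)  # more even budget distribution
sharper = apply_pitch(raw_weights, p=2.0)  # more uneven budget distribution
print(flatter, sharper)
```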

The results of the simulation indicate that the raw attribution weights ($p=1.0$) are not necessarily optimal for all models:

These results also confirm that LSTM with attention provides the best budget allocation, and that logistic regression performs reasonably well with a properly selected pitch value. LTA has very poor performance because it focuses exclusively on the campaigns at the end of the journey, so the campaigns at the beginning of the journey quickly run out of money, sending these journeys to the blacklist.

The simulation algorithm can be extended straightforwardly to incorporate the events’ costs, conversion profits, or more sophisticated logic for handling out-of-budget journeys. These adjustments can be designed and fine-tuned based on the actual performance of the optimized budgets in production.

We have discussed, implemented, and evaluated several attribution models that provide a solid foundation for measuring the efficiency of marketing activities and the optimization of budgeting parameters. We have seen that the state-of-the-art models that consume sequences of events provide superior accuracy and greatly simplify feature engineering. In fact, this is a typical example of how traditional enterprise data science can benefit from deep and reinforcement learning: Many marketing, merchandising, and supply-chain use cases deal with sequential data or multi-step optimization, and deep and reinforcement learning provide powerful toolkits for these types of problems. Other examples of that kind include next best action modeling, demand prediction, and inventory-constrained price optimization, to name a few.

We generally assumed the availability of historical data for modeling and optimization. However, the same techniques can be combined with reinforcement learning to evaluate and adjust budgeting parameters dynamically. This approach can be particularly useful for sponsored search bid optimization and other use cases in which a large number of budgeting parameters need to be tuned dynamically.^{[6]}

A complete notebook with the data preparation code and models is available on GitHub.

1. http://ailab.criteo.com/criteo-attribution-modeling-bidding-dataset/
2. Shao X. and Li L., Data-Driven Multi-Touch Attribution Models, 2011.
3. Bahdanau D., Cho K., and Bengio Y., Neural Machine Translation by Jointly Learning to Align and Translate, 2014.
4. Li N. et al., Deep Neural Net with Attention for Multi-Channel Multi-Touch Attribution, 2018.
5. Ren K., Fang Y., Zhang W., Liu S., Li J., Zhang Y., Yu Y., and Wang J., Learning Multi-Touch Conversion Attribution with Dual-Attention Mechanisms for Online Advertising, 2018.
6. Zhao J., Qiu G., Guan Z., Zhao W., and He X., Deep Reinforcement Learning for Sponsored Search Real-Time Bidding, 2018.
