library(tidyverse)
library(lubridate)
library(easyGgplot2)
library(scales)
library(knitr)
library(kableExtra)
library(pander)
library(dummies)
library(lpSolve)
data <-
read.csv("data1.csv", sep = ",")
The following is an exposition of how to use a simple Marketing Mix Model to measure the effectiveness of TV commercials to generate sales, and then use the results to decide how to allocate commercial in such a way that maximizes the prospect of sales using Constrained Linear Optimization.
The dataset used in this exercise is composed of 2112 rows, each of which is a commercial, and 5 variables: the date and time when the commercial aired, the length in seconds, the channel where it aired, the amount of money spent on it, the number of people who saw it and the number of sales attributed to it.
In the following section I explain the theoretical reasoning behind the models used, which serve as methodology in the application section where the expected number of sales for each slot – a combination of day of the week, hour of the day, length of the commercial and channel – is estimated using Least Square Multiple Linear Regression. This is afterwards fed to a Linear Programming Algorithm to solve the constrained optimization problem of the number of commercial in each slot that maximizes the expected number of Sales.
The total number of units sold in a given period of time \(S_t\) can be described as the product of the number of commercials scheduled in a given period \(N_t\), the average reach of each commercial \(r_t\) – the number of people who see it – and the average conversion rate of the commercial \(c_t\) – the percentage of people who end up purchasing a product out of everyone who saw the commercial.
\[ S_t = N_tr_tc_t \] Here, \(N_t\) is an exogenous variable – it is strictly in the control of the marketing executive – and represents the volume of advertising. Furthermore, from her perspective, \(r_t\) and \(c_t\) can also be taken as exogenous because they cannot be changed in the short term. This is the premise behind KPI based analysis.
However, it intuitive clear that the reach of a commercial will depend on variables such as the channel where it is shown, at what time and on what day of the week. Similarly, the conversion rate will depend on variables such as the length and creative content of the commercial.
It is important to emphasize that \(r_t\) and \(c_t\) are average values. Therefore, it is also possible to define the conversion rate as a function of demographic variables – like the share of the audience that is young, male, or of a given income level.
A marketing executive would be interested in properly attributing the performance of the commercial to these variables because it would allow her to make trade-offs and optimize the advertising budget. In order to do so, it is first necessary to extend the previous model to include them.
Since the number of commercial is in the control of the marketing executive it can be removed from the equation by dividing both sides by \(N_t\).
\[ s_t = r_tc_t \] Where the lower case \(s_t\) represents the average number of units sold per commercial. By taking \(\ln(s_t)\) the multiplicative model can be reduced to an additive model.
\[ \ln(s_t) = \ln r_t + \ln c_t \rightarrow s^*_t = r^*_t + c^*_t \] For their part, \(r^*_t\) and \(s^*_t\) can be defined as functions of other variables.
\[ r^*_t = f(C_t, D_t, H_t) \equiv \alpha_r +\beta C_t + \phi D_T + \psi H_t \] \[ c^*_t = f(L_t) \equiv \alpha_c + \theta L_t \]
where:
The average number of sales per commercial can then be expressed as:
\[ s_t = \exp ( s^*_t ) = \exp ((\alpha_r + \alpha_c)+\beta C_t + \phi D_T + \psi H_t + \theta L_t) \]
The parameters of this model can be estimated, before taking \(\exp\), by Least Squares. The hypothesized model would be:
\[ \hat s_t = \hat \alpha +\hat \beta C_t + \hat \phi D_T + \hat \psi H_t + \hat \theta L_t + \varepsilon_t \] Once these parameters are estimated, the contribution of each channel, day, hour or length is represented by their respective coefficients.
The budget optimization problem is to select the number of commercial in each slot – the combination of channel, length, day of the week and hour of the day – that maximizes the number of units sold constrained to a monetary and time budget constraints.
The objective function to maximize is:
\[ \max_{(x_1, x_2, ..., x_n)} S = \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n \] \[ Subject\:to: \] \[p_1 x_1 + p_2 x_2 + ... + p_n x_n = B \] \[ L_i x_i \geq T \: \forall \, i \in (1, 2, ... , n) \]
\[ x_i \geq 0 \: \forall \, i \in (1, 2, ... , n) \]
Where:
The values for \(\beta_1, \beta_2, ... \beta_n\) are obtained by forecasting the expected number of sales per commercial using the model described in the previous section. Doing this will correct the amount of sales received in the sample period in any give slot by removing the noise factor \(\varepsilon_t\).
Aside from the obvious budget constraint, the time constraint is the next most important one since without it the algorithm would simply decide to spend all the budget on the single best slot. This is impossible since that would imply doing more than one hour of advertising in a single hour slot.
The solution to this problem will result in a list of commercials to be scheduled at every single slot considered.
Before applying the previously described methodology, it would be useful to take a look at the data. The following table shows 10 randomly selected rows of the dataset.
data %>%
select(Date, Channel, Length, Cost, Impressions, Sales) %>%
mutate(Wday = wday(Date), Hour = hour(Date)) %>%
select(-Date) %>%
group_by(Wday, Hour, Channel, Length) %>%
summarise(N = n(), p = sum(Cost), r = sum(Impressions), c = sum(Sales)/sum(Impressions), s = sum(Sales)/n() ) %>%
ungroup() %>%
sample_n(10) %>%
kable(format = "latex", booktabs = T) %>%
kable_styling(latex_options = "striped", position = "center")
On the leftmost, the variable Wday represents the day of the week (starting at Sunday = 1), then follows the hour of the day (starting at midnight = 00), the Channel, the Length (whether 15 or 30 seconds long), the number of commercials (N), and the average price, reach, conversion rate and sales of a commercial in that slot.
From this table, the dependent variable for estimation of sales will be “s”, while the independent variables will be Wday, Hour, Channel and Length All of them in the form of dummies.
The following table reports the coefficients estimated using LS Multiple Linear Regression.
dataset <-
data %>%
select(Channel, Length, Date, Sales, Impressions, Cost) %>%
filter(Sales != 0) %>%
mutate(Hour = hour(Date), Wday = wday(Date), s = Sales, ln_s = log(Sales), r = Impressions, p = Cost)%>%
select(-Sales)
dataset.dummy <- dataset
dataset.dummy <- dummy.data.frame(data = dataset.dummy, names = c("Channel","Length","Wday","Hour"))
regouput <- lm(data = dataset.dummy, ln_s ~
Channel1 +
Channel2 +
Channel3 +
Channel4 +
Channel5 +
Channel6 +
Channel7 +
Channel8 +
Channel9 +
Channel10 +
Channel11 +
Channel12 +
Channel13 +
Channel14 +
Channel15 +
Length15 +
Hour1 +
Hour2 +
Hour3 +
Hour7 +
Hour8 +
Hour9 +
Hour10 +
Hour11 +
Hour12 +
Hour13 +
Hour14 +
Hour15 +
Hour16 +
Hour17 +
Hour18 +
Hour19 +
Hour20 +
Hour21 +
Hour22 +
Hour23 +
Wday1 +
Wday2 +
Wday3 +
Wday4 +
Wday5 +
Wday6 )
pander(regouput)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 4.447 | 0.2884 | 15.42 | 9.177e-51 |
Channel1 | -1.513 | 0.2888 | -5.239 | 1.784e-07 |
Channel2 | -1.297 | 0.3149 | -4.118 | 3.985e-05 |
Channel3 | -1.316 | 0.2817 | -4.671 | 3.192e-06 |
Channel4 | -1.041 | 0.281 | -3.706 | 0.0002164 |
Channel5 | -1.187 | 0.3669 | -3.234 | 0.001239 |
Channel6 | -1.095 | 0.3339 | -3.28 | 0.001055 |
Channel7 | -1.391 | 0.3102 | -4.483 | 7.787e-06 |
Channel8 | 0.05563 | 0.2912 | 0.1911 | 0.8485 |
Channel9 | -0.718 | 0.2934 | -2.447 | 0.01448 |
Channel10 | -0.5326 | 0.2925 | -1.821 | 0.06881 |
Channel11 | -0.1029 | 0.2851 | -0.3609 | 0.7182 |
Channel12 | -0.803 | 0.3103 | -2.588 | 0.009732 |
Channel13 | -0.3943 | 0.2923 | -1.349 | 0.1775 |
Channel14 | -1.38 | 0.3764 | -3.667 | 0.0002522 |
Channel15 | -1.471 | 0.2888 | -5.093 | 3.864e-07 |
Length15 | 0.1557 | 0.04643 | 3.354 | 0.0008111 |
Hour1 | -0.2839 | 0.1571 | -1.807 | 0.07084 |
Hour2 | -0.303 | 0.167 | -1.815 | 0.06972 |
Hour3 | -0.5283 | 0.2434 | -2.171 | 0.03006 |
Hour7 | -1.664 | 0.5971 | -2.787 | 0.005365 |
Hour8 | -1.443 | 0.3249 | -4.441 | 9.424e-06 |
Hour9 | -0.9384 | 0.3027 | -3.1 | 0.001962 |
Hour10 | -0.8502 | 0.1783 | -4.768 | 1.995e-06 |
Hour11 | -0.8863 | 0.1552 | -5.71 | 1.303e-08 |
Hour12 | -0.7226 | 0.146 | -4.949 | 8.074e-07 |
Hour13 | -0.7302 | 0.1422 | -5.135 | 3.095e-07 |
Hour14 | -0.5705 | 0.1431 | -3.987 | 6.922e-05 |
Hour15 | -0.7097 | 0.1413 | -5.022 | 5.561e-07 |
Hour16 | -1.019 | 0.1393 | -7.317 | 3.66e-13 |
Hour17 | -0.8418 | 0.1321 | -6.374 | 2.288e-10 |
Hour18 | -0.6265 | 0.1305 | -4.801 | 1.695e-06 |
Hour19 | -0.3089 | 0.1284 | -2.406 | 0.01621 |
Hour20 | 0.03739 | 0.1422 | 0.263 | 0.7926 |
Hour21 | 0.0789 | 0.1405 | 0.5616 | 0.5744 |
Hour22 | 0.3042 | 0.1342 | 2.266 | 0.02355 |
Hour23 | 0.3181 | 0.1347 | 2.362 | 0.01827 |
Wday1 | -0.458 | 0.07337 | -6.242 | 5.276e-10 |
Wday2 | -0.6287 | 0.07317 | -8.592 | 1.696e-17 |
Wday3 | -0.6051 | 0.08097 | -7.473 | 1.168e-13 |
Wday4 | -0.752 | 0.07826 | -9.61 | 2.104e-21 |
Wday5 | -0.8167 | 0.2413 | -3.385 | 0.0007246 |
Wday6 | -0.5183 | 0.08058 | -6.432 | 1.57e-10 |
Using the estimated coefficient of the model, the fitted sales can be computed. The following table shows the original (s) and fitted (fs) average sales per commercial. Notice how for some cases the fitted sales deviate significantly from the original sales. If not for using the Marketing Mix Model this noise would be incorporated into budget optimization step.
dataset$sf <- exp(as.numeric(regouput$fitted.values))
dataset %>%
select(Wday, Hour, Channel, Length, s, sf) %>%
ungroup() %>%
sample_n(10) %>%
kable(format = "latex", booktabs = T) %>%
kable_styling(latex_options = "striped", position = "center")
In order to solve the problem it is necessary to input values for \(B\),\(T\) \(p_i\), and \(\beta_i\). This information will be fed to the optimization problem and solved using the R Package lpSolve.
The first one is simple, as it is just the maximum amount of money to spend scheduling TV commercials – the advertising budget. Here it is set to \(B = 2.000.000\).
The time constraint is what should stop the algorithm from allocating all the budget on a single one hour slot. Assuming that a 30 minute TV show advertises for 5 minutes, this gives a total of 10 minutes per hour – or 600 seconds. Assuming one is able to get about 20% of the available time, this makes \(T = 120\)
The prices \(p_i\) refer to the price of one commercial spot as seen in the dataset.
The expected contribution to Sales \(\beta_i\) are the fitted (expected) sales per commercial estimated in the previous section.
dfoptim <-
dataset %>%
select( Wday, Hour, Channel, Length, p, sf) %>%
group_by(Channel, Length, Wday, Hour) %>%
summarise(sf = sum(sf)/n(), n = n(), p = sum(p)/n)
objective <- dfoptim$sf
constraints <- rbind(dfoptim$p,as.numeric(dfoptim$Length)*diag(length(dfoptim$Length)), diag(length(objective)))
direction <- c(rep("<=",length(dfoptim$Length)+1), rep(">=", length(objective)))
restrictions <- c(500000*4, rep(120,length(dfoptim$Length)), rep(0,length(objective)))
solution <- lp("max", objective, constraints, direction, restrictions)
dfoptim$solution <- solution$solution
The total number of Sales obtained by the algorithm is 33582 and it is a direct result of the constraints and the inputs.
The following table shows 25 random observations from the final dataset. The variable “solution” contains the number of commercials that should be scheduled in this slot. For comparison, notice that the variable “n” contains the number of commercials scheduled in the same slot in the previous, sub-optimal campaign.
This is essentially a scheduling order that can be sent to the media agency to enforce.
dfoptim %>%
ungroup() %>%
sample_n(25)
## # A tibble: 25 x 8
## Channel Length Wday Hour sf n p solution
## <int> <int> <dbl> <int> <dbl> <int> <dbl> <dbl>
## 1 11 30 1. 19 35.8 2 12962. 0.
## 2 13 15 7. 17 29.0 2 3515. 0.
## 3 15 15 1. 1 10.9 1 1352. 0.
## 4 4 30 6. 19 13.2 1 7199. 0.
## 5 15 15 6. 17 5.88 3 2028. 0.
## 6 9 30 4. 17 8.46 1 2028. 0.
## 7 15 30 2. 0 10.5 3 5397. 0.
## 8 15 15 6. 1 10.3 1 2028. 0.
## 9 11 15 6. 13 25.8 3 1926. 8.
## 10 11 30 3. 14 23.8 3 5126. 0.
## # ... with 15 more rows