17  Beyond Normality

In the preceding chapters, we basked in the warm glow of normality, a simplifying assumption that allowed us to model outcomes as if they arose from the familiar bell curve. Yet, the real world often serves up data generated by far more diverse processes. Outcomes might be binary choices (to adopt or reject), rankings on an ordinal scale (like satisfaction surveys), or simple counts (such as the number of subscribers). In this chapter, we embark on an expedition beyond the comfortable confines of normality, charting the terrain of these alternative data types and the specialized tools we need to analyze them.

17.1 Binary Outcomes: The Coin Flips of Data

Binary outcomes are the coin flips of the data world – two sides, two possibilities. Think success/failure, yes/no, or the ever-important adopt/reject decision. To model these, we turn to the trusty logistic regression. While linear probability models are also used, logistic regression has the distinct advantage of keeping our predictions bounded between the sensible limits of 0 and 1.

This workhorse of a model is a type of generalized linear model, employing a logit link function:

\[ \text{BernoulliLogit}(y|\theta) = \text{Bernoulli}(y|\text{logit}^{-1}(\theta)) \]

where

\[ \text{logit}^{-1}(\theta) = \frac{1}{1 + \exp(-\theta)} \]

library(ggplot2)
library(tibble)
library(dplyr)

# Inverse logit: maps the real line onto the (0, 1) probability scale
inv_logit <- function(theta) {
  return(1 / (1 + exp(-theta)))
}

ggplot2::ggplot() +
  ggplot2::stat_function(fun = inv_logit) +
  ggplot2::theme_minimal() +
  ggplot2::xlim(-5, 5) +
  ggplot2::xlab(expression(theta))

To illustrate, let’s cook up some fake data:

# Fake data
N <- 2000
K <- 2
set.seed(1982)
fake_data <- tibble::tibble(
  x1 = rnorm(N, mean = 0, sd = 1),
  x2 = rnorm(N, mean = 0, sd = 1),
  treat = sample(
    x = c(TRUE, FALSE), size = N, replace = TRUE,
    prob = c(0.5, 0.5)
  ),
  r = runif(n = N, min = 0, max = 1)
) %>%
  dplyr::mutate(
    # Potential adoption probabilities: treatment adds 0.2 on the logit scale
    p0 = inv_logit(theta = -3 + 0.1 * x1 + 0.25 * x2),
    p1 = inv_logit(theta = -3 + 0.1 * x1 + 0.25 * x2 + 0.2),
    # Potential outcomes realized from the shared uniform draw r
    y0 = dplyr::case_when(p0 > r ~ 1, TRUE ~ 0),
    y1 = dplyr::case_when(p1 > r ~ 1, TRUE ~ 0),
    # Observed outcome: y1 for treated units, y0 otherwise
    y = dplyr::case_when(
      treat ~ as.logical(y1),
      TRUE ~ as.logical(y0)
    )
  )
dplyr::glimpse(fake_data)
Rows: 2,000
Columns: 9
$ x1    <dbl> 0.685092067, -0.005550195, -0.777641329, 1.875702830, -0.3771291…
$ x2    <dbl> -0.665502269, -1.256150229, -0.230714338, 0.743955915, -0.630752…
$ treat <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,…
$ r     <dbl> 0.45401970, 0.20720548, 0.15142911, 0.31274073, 0.88143116, 0.97…
$ p0    <dbl> 0.04319535, 0.03507396, 0.04166872, 0.06745600, 0.03933915, 0.03…
$ p1    <dbl> 0.05225914, 0.04250932, 0.05042906, 0.08117855, 0.04763407, 0.04…
$ y0    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ y1    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ y     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
mean_y0 <- mean(fake_data$y0)
mean_y1 <- mean(fake_data$y1)
impact <- round((mean(fake_data$y1) - mean(fake_data$y0)) * 100, 2)

In this fabricated dataset, we’ve engineered a scenario where, without the intervention, only 4% of units would have adopted the feature. With the intervention applied universally, adoption would have risen to 5%. The true impact of the intervention is therefore 1 percentage point, a hefty lift relative to the 4% baseline.

Prior Predictive Checking

As Bayesians, we’re not just number crunchers; we’re storytellers. We weave narratives about data, and our priors are the opening chapters. So, before we unleash our model, let’s ponder what our priors imply.

Our model takes this form:

\[ \begin{aligned} \theta &= \alpha + X \beta + \tau T \\ \alpha &\sim N(\mu_{\alpha}, sd_{\alpha}) \\ \beta_j &\sim N(\mu_{\beta_j}, sd_{\beta_j}) \\ \tau &\sim N(\mu_{\tau}, sd_{\tau}) \end{aligned} \]
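Before handing things to the package, it helps to see what a prior predictive check does mechanically. The sketch below is our own illustration (not the internals of {im}): it draws \(\alpha\), \(\beta\), and \(\tau\) from the priors we are about to specify and records the implied average adoption rate:

# Hand-rolled prior predictive simulation (illustrative only)
set.seed(7)
n_sims <- 1000
X <- as.matrix(fake_data[, c("x1", "x2")])

prior_adoption <- replicate(n_sims, {
  alpha <- rnorm(1, mean = -3, sd = 2)
  beta <- rnorm(2, mean = 0, sd = 1)
  tau <- rnorm(1, mean = 0.05, sd = 0.5)
  theta <- alpha + X %*% beta + tau * fake_data$treat
  mean(inv_logit(theta)) # implied average adoption probability
})

quantile(prior_adoption, probs = c(0.05, 0.5, 0.95))

If the bulk of these implied adoption rates looks plausible for our setting, the priors are doing their job. The {im} constructor below, with fit = FALSE, lets us run the analogous check via plotPrior():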

logit <- im::logit$new(
  data = fake_data,
  y = "y", # this will not be used
  treatment = "treat",
  x = c("x1", "x2"),
  mean_alpha = -3,
  sd_alpha = 2,
  mean_beta = c(0, 0),
  sd_beta = c(1, 1),
  tau_mean = 0.05,
  tau_sd = 0.5,
  fit = FALSE # we will not be fitting the model
)

logit$plotPrior()

Fitting the Model: Where Theory Meets Data

Satisfied with our priors, we’re ready to fit the model to the data:

logit <- im::logit$new(
  data = fake_data,
  y = "y",
  treatment = "treat",
  x = c("x1", "x2"),
  mean_alpha = -3,
  sd_alpha = 2,
  mean_beta = c(0, 0),
  sd_beta = c(1, 1),
  tau_mean = 0.05,
  tau_sd = 0.5,
  fit = TRUE
)

Let’s glance at the trace plot of tau to ensure our chains mixed well and converged:

logit$tracePlot()

To sum up our findings, we have a few handy methods at our disposal:

logit$pointEstimate()
[1] 1.35
logit$credibleInterval(width = 0.95)
Given the data, we estimate that there is a 95% probability that the effect is between -1 and 4 percentage points.
logit$calcProb(a = 0)
Given the data, we estimate  that the probability that the effect is more than 0 percentage points is 89%.
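Each of these summaries is just a function of the posterior draws of \(\tau\). As a generic sketch, using a stand-in vector tau_draws (simulated here to loosely mimic the summaries above; it is not an actual {im} accessor):

# Stand-in posterior draws of tau, in percentage points (illustrative only)
tau_draws <- rnorm(4000, mean = 1.35, sd = 1.3)

median(tau_draws) # point estimate
quantile(tau_draws, probs = c(0.025, 0.975)) # 95% credible interval
mean(tau_draws > 0) # probability the effect exceeds 0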

We can also use the prediction function to predict new data and compare differences between groups. The predict function takes a new_data argument and a name argument that labels the group. For example, here we predict outcomes as if all units were treated, then again as if all units were untreated, and summarize the two groups.

fake_treated_data <- fake_data %>% mutate(treat = TRUE)
fake_control_data <- fake_data %>% mutate(treat = FALSE)
logit$predict(
  new_data = fake_treated_data,
  name = "y1"
)
logit$predict(
  new_data = fake_control_data,
  name = "y0"
)
logit$predSummary(name = "y1", width = 0.95, a = 0)
[1] "Given the data, we estimate that for group: y1, the point estimate of the group average is 5%. With 95% probability, the point estimate is between 4 and 7 percentage points. Given the data, we estimate  that the probability that the group average is more than 0 percentage points is 100%."
logit$predSummary(name = "y0", width = 0.95, a = 0)
[1] "Given the data, we estimate that for group: y0, the point estimate of the group average is 4%. With 95% probability, the point estimate is between 3 and 6 percentage points. Given the data, we estimate  that the probability that the group average is more than 0 percentage points is 100%."

We can also compare the differences between two groups of predictions.

logit$predCompare(name1 = "y1", name2 = "y0", width = 0.95, a = 0)
[1] "Given the data, we estimate that the point estimate of the group difference is 1%. With 95% probability, the point estimate is between -1 and 3 percentage points. Given the data, we estimate  that the probability that the group difference is more than 0 percentage points is 89%."

We can also summarize and compare the predictions conditioning on subgroups.

logit$predSummary(
  name = "y1",
  subgroup = (fake_data$x1 > 0),
  width = 0.95, a = 0
)
[1] "Given the data, we estimate that for group: y1, the point estimate of the group average is 5%. With 95% probability, the point estimate is between 3 and 8 percentage points. Given the data, we estimate  that the probability that the group average is more than 0 percentage points is 100%."
logit$predCompare(
  name1 = "y1",
  name2 = "y0",
  subgroup1 = (fake_treated_data$x1 > 0),
  subgroup2 = (fake_control_data$x1 > 0),
  width = 0.95, a = 0
)
[1] "Given the data, we estimate that the point estimate of the group difference is 1%. With 95% probability, the point estimate is between -1 and 4 percentage points. Given the data, we estimate  that the probability that the group difference is more than 0 percentage points is 85%."

Finally, we can get the posterior predictive draws for advanced analysis.

pred <- logit$getPred(name = "y1")
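For example, assuming the returned object is a matrix of posterior predictive draws with one row per draw and one column per unit (check the package documentation for the exact shape), we could compute the posterior distribution of the difference in group means ourselves:

# Illustrative downstream analysis; the draws-by-units shape is an assumption
pred_y0 <- logit$getPred(name = "y0")
diff_draws <- (rowMeans(pred) - rowMeans(pred_y0)) * 100 # percentage points
quantile(diff_draws, probs = c(0.025, 0.5, 0.975))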

17.2 Ordinal Outcomes

Ordinal outcomes are a rather curious beast in the realm of causal inference. They have a natural order - think of a survey respondent rating their satisfaction on a scale from “very dissatisfied” to “very satisfied” - but the intervals between the values don’t necessarily hold equal weight. The distance between “dissatisfied” and “neutral” may not be the same as that between “satisfied” and “very satisfied.” This lack of equal spacing is a challenge we must address head-on when analyzing the impact of an intervention or treatment.

This nuance of ordinal outcomes is also present in other domains, such as user experience ratings (e.g., poor, fair, good, excellent), or educational attainment levels (e.g., less than high school, high school diploma, some college, college degree). In each case, there’s a clear order, but the spacing between levels is not uniform.

This lack of uniform spacing can complicate our analysis, particularly when applying traditional regression models designed for continuous outcomes. We need a tailored approach that respects the ordinal nature of the data, while still allowing us to draw meaningful causal inferences.
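A full treatment is still to come, but as a placeholder illustration, here is a minimal sketch of the standard tool for ordered outcomes, a proportional-odds (cumulative logit) model, fit with MASS::polr() on simulated ratings. This is a standalone frequentist example, not part of the {im} workflow:

# Simulate ordered satisfaction ratings from a latent logistic variable,
# then fit a proportional-odds (cumulative logit) model with MASS::polr()
library(MASS)

set.seed(123)
n <- 500
sim <- data.frame(
  x = rnorm(n),
  t = sample(c(TRUE, FALSE), n, replace = TRUE)
)
latent <- 0.8 * sim$x + 0.3 * sim$t + rlogis(n) # latent satisfaction
sim$y <- cut(
  latent,
  breaks = c(-Inf, -1, 0, 1, Inf),
  labels = c("very dissatisfied", "dissatisfied", "satisfied", "very satisfied"),
  ordered_result = TRUE
)

fit <- MASS::polr(y ~ x + t, data = sim, Hess = TRUE)
summary(fit) # coefficients and cutpoints on the latent (logit) scale

The estimated cutpoints absorb the unequal spacing between categories, while the coefficients act on a single latent scale.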

TODO

17.3 Count Outcomes

In this section, we’re talking about events we can tally—website visits, customer complaints, product sales, you name it. These outcomes are inherently non-negative integers, reflecting the discrete nature of the events we’re counting.

The go-to tool for analyzing count outcomes is often Poisson regression. However, real-world data frequently throws us a curveball in the form of overdispersion, where the variance of the count data outstrips its mean. This is where negative binomial regression swoops in to save the day. It’s a souped-up version of Poisson regression that accounts for overdispersion by adding an extra parameter to the model.

Count data pops up in all sorts of scenarios. We might be interested in the effect of a new marketing campaign on app downloads or the impact of a software update on user logins. With count outcomes, the possibilities are as endless as the events we can count.

An Example with Fake Data: Video Views Galore

Let’s say we’re interested in the number of views a video receives in a given period. This is a perfect opportunity to use the negative binomial distribution to model the underlying data-generating process. Here’s how we can express this:

\[ \begin{aligned} y_i & \sim \text{NB}(\mu_i, \phi) \\ \log(\mu_i) & = \alpha + X_i\beta \end{aligned} \]

In this model, \(\mu\) represents the mean (which must be positive, hence the log link function), and the inverse of \(\phi\) controls the overdispersion. A small \(\phi\) means the negative binomial distribution significantly deviates from a Poisson distribution, while a large \(\phi\) brings it closer to a Poisson. This becomes clear when we look at the variance:

\[ \text{Var}(y_i) = \mu_i + \frac{\mu_i^2}{\phi} \]
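A quick simulation makes this concrete. In rnbinom(), the size argument plays the role of \(\phi\): small values inflate the variance well beyond the mean, while large values recover roughly Poisson behavior:

# Empirical check: variance of NB draws versus mu + mu^2 / phi
set.seed(42)
mu <- 10
for (phi in c(0.5, 2, 100)) {
  draws <- rnbinom(1e5, size = phi, mu = mu)
  cat(sprintf(
    "phi = %5.1f  mean = %5.2f  var = %8.2f  (mu + mu^2/phi = %8.2f)\n",
    phi, mean(draws), var(draws), mu + mu^2 / phi
  ))
}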

Let’s whip up some fake data to illustrate:

set.seed(9782)
library(dplyr)
library(ggplot2)
N <- 1000

fake_data <-
  tibble::tibble(x1 = runif(N, 2, 9), x2 = rnorm(N, 0, 1)) %>%
  dplyr::mutate(
    mu = exp(0.5 + 1.7 * x1 + 4.2 * x2),
    y0 = rnbinom(N, size = 2, mu = mu), # rnbinom()'s size argument is phi
    y1 = y0 * 1.05, # a 5% multiplicative lift under treatment
    t = sample(c(TRUE, FALSE), size = n(), replace = TRUE, prob = c(0.5, 0.5)),
    y = case_when(t ~ as.integer(y1),
      .default = as.integer(y0)
    )
  ) %>%
  filter(y > 0) # drop zero counts so log(y) is defined in the OLS below

# Plotting the histogram using ggplot2
ggplot(fake_data, aes(x = y)) +
  geom_histogram(
    bins = 100,
    color = "black",
    alpha = 0.7
  ) +
  facet_wrap(~t) +
  labs(title = "Histogram of y", x = "y", y = "Frequency") +
  xlim(0, 1e4) +
  ylim(0, 20)

OLS: A Common, But Flawed, Approach

A typical way to estimate the lift of an intervention in this scenario is to run ordinary least squares (OLS) on the natural logarithm of the outcome. Then, folks often look at the point estimate and 95% confidence interval for the treatment effect. Doing this with our fake data might lead you to conclude that the intervention is ineffective, as the 95% confidence interval includes zero.

lm(data = fake_data, log(y) ~ x1 + x2 + t) %>% broom::tidy(conf.int = TRUE)
# A tibble: 4 × 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)   0.422     0.0861     4.89  0.00000116   0.253      0.591
2 x1            1.68      0.0137   122.    0            1.65       1.70 
3 x2            4.15      0.0293   142.    0            4.09       4.20 
4 tTRUE         0.0254    0.0547     0.464 0.643       -0.0819     0.133

Taming Overdispersion: The Bayesian Negative Binomial Advantage

Enter the Bayesian negative binomial model, easily implemented using the {im} package in R. This approach has two key advantages. First, it better captures the true data-generating process. Second, being Bayesian, it lets us incorporate prior information and express our findings in a way that’s more directly relevant to business decisions.

library(im)
nb <- negativeBinomial$new(
  data = fake_data, y = "y", x = c("x1", "x2"),
  treatment = "t", tau_mean = 0.0, tau_sd = 0.025
)

A quick check of the trace plot is always a good idea:

nb$tracePlot()

Now for the payoff. We can readily obtain a point estimate for the impact using nb$pointEstimate() (4%) and a 95% credible interval using nb$credibleInterval(width = 0.95, round = 2) (Given the data, we estimate that there is a 95% probability that the effect is between 0 and 0.08.). But what if we want to know the probability that the impact is at least 1% (or any other threshold)? Easy peasy! We use nb$posteriorProb(threshold = 0.01) (Given the data, we estimate that the probability that the effect is more than 0.01 is 93%.). Finally, to visualize the evolution of our understanding from prior to posterior, we employ nb$vizdraws().

nb$vizdraws(display_mode_name = TRUE, breaks = 0.01,
            break_names = c("< 0.01", "> 0.01"))