Binary response variables are ubiquitous in social science research. Many phenomena of interest are naturally dichotomous or can be meaningfully dichotomized:

  • Political behavior: Voting for a Republican vs. Democrat
  • International relations: Going to war or maintaining peace
  • Comparative politics: Democracy vs. autocracy classification
  • Criminal justice: Committing a crime or not
  • Public policy: Policy adoption vs. non-adoption

This document explores various approaches to modeling binary outcomes. We’ll start with logit and probit models, then discuss alternative link functions, and close with the linear probability model (LPM).

1 Logit and Probit Models

Before we get into the mathematics, let’s build some intuition. Imagine you’re trying to predict whether someone will vote based on their education level. You might think that more educated people are more likely to vote, but this relationship probably isn’t perfectly linear. Going from 0 to 4 years of education might have a different effect than going from 12 to 16 years. Moreover, once someone is already very likely to vote (say 95% probability), additional education probably won’t increase their probability much more – there’s just not much room to grow. This kind of “diminishing returns” at the extremes is what logit and probit models capture naturally.

1.1 The Latent Variable Approach

The latent variable approach provides a conceptual framework for thinking about binary outcomes. The word “latent” means hidden or unobserved, and that’s exactly what we’re dealing with here. Think about the decision to vote. We observe whether someone voted (1) or didn’t vote (0), but underlying this binary outcome is likely a continuous measure of “motivation to vote” or “propensity to vote” that we can’t directly see. Someone with very high motivation votes, someone with very low motivation doesn’t, and people near the middle might go either way depending on small factors.

This latent variable framework isn’t just a mathematical convenience—it often corresponds to how we actually think about binary decisions. Consider a few examples:

  • A judge deciding guilty vs. not guilty has an underlying sense of the strength of evidence
  • A bank approving or denying a loan has an underlying assessment of creditworthiness
  • A country going to war or not has an underlying level of grievance or threat perception

In each case, the binary outcome emerges when this underlying continuous assessment crosses some threshold.

Both logit and probit models can be motivated through a latent variable framework. We assume there exists an unobserved continuous variable \(Y^*\) that determines the observed binary outcome:

\[Y^*_i = X_i\beta + u_i\]

We observe only:

\[Y_i = \begin{cases} 1 & \text{if } Y^*_i \geq \tau \\ 0 & \text{if } Y^*_i < \tau \end{cases}\]

where \(\tau\) is a threshold. Without loss of generality, we typically normalize \(\tau = 0\) by absorbing it into the intercept.

Let me explain this normalization because it’s important but often a source of confusion. We could set the threshold at any value – maybe people vote when their latent propensity exceeds 100, or 42, or any other number. But since we never observe the latent variable directly, we can’t identify both the threshold and the intercept separately. It’s like trying to measure temperature without agreeing on where zero should be – Celsius and Fahrenheit both work, they just put zero in different places. By convention, we set the threshold to zero and let the intercept absorb any threshold effects. This doesn’t change our substantive results; it just makes the math cleaner.

The probability of observing \(Y_i = 1\) is:

\[\begin{align} \Pr(Y_i = 1) &= \Pr(Y^*_i \geq 0) \\ &= \Pr(X_i\beta + u_i \geq 0) \\ &= \Pr(u_i \geq -X_i\beta) \\ &= \Pr(u_i \leq X_i\beta) \quad \text{(by symmetry)}\\ &= F(X_i\beta) \end{align}\]

Let’s walk through this derivation step by step. We start by asking: what’s the probability that we observe Y = 1? Well, that happens when the latent variable exceeds our threshold (which we’ve set to zero). The latent variable equals \(X_i\beta\) plus an error term \(u_i\), so we need the probability that their sum is positive. Rearranging, we need the probability that the error is greater than \(-X_i\beta\).

The last step—going from \(\Pr(u_i \geq -X_i\beta)\) to \(\Pr(u_i \leq X_i\beta)\) – uses the fact that we assume the error distribution is symmetric around zero. This is a standard assumption that says positive and negative errors are equally likely. Finally, we recognize that \(\Pr(u_i \leq X_i\beta)\) is just the cumulative distribution function (CDF) of the error evaluated at \(X_i\beta\).

Here, \(F(\cdot)\) is the CDF of \(u_i\). The choice between logit and probit depends on the assumed distribution of \(u_i\) – a topic we’ll return to after we introduce both models more formally below.

1.2 The Logit Model

1.2.1 Distribution and Properties

The logit model assumes \(u_i\) follows a standard logistic distribution with:

  • PDF: \(\lambda(u) = \frac{\exp(u)}{[1 + \exp(u)]^2}\)
  • CDF: \(\Lambda(u) = \frac{\exp(u)}{1 + \exp(u)}\)
  • Variance: \(\sigma^2 = \frac{\pi^2}{3} \approx 3.29\)
  • Symmetry: \(\Lambda(u) = 1 - \Lambda(-u)\)

Before we dive deeper, let’s understand what these mathematical objects represent and why they matter. The PDF (probability density function) tells us how likely different values of the error are—think of it as the “shape” of randomness in our model. The CDF (cumulative distribution function) is what we actually use to transform our linear predictions into probabilities. The variance tells us how spread out the errors are, and the symmetry property ensures that positive and negative deviations from our predictions are equally likely.

These formulas define the “shape” of randomness in the logit model. The PDF (left plot below) shows how likely different error values are—notice it’s bell-shaped like the normal distribution but with heavier tails (more extreme values are slightly more likely). The CDF (right plot) is what we actually use in the model—it transforms any number from \(-\infty\) to \(+\infty\) into a probability between 0 and 1. This S-shaped curve is the famous “logistic curve” that gives logit its name.

The key insight: when \(X_i\beta = 0\), you get probability 0.5. As \(X_i\beta\) increases, probability smoothly rises toward 1. As it decreases, probability falls toward 0 and the curve naturally flattens out at the extremes, capturing the idea that it gets increasingly hard to push probabilities from 0.9 to 0.99 (or from 0.1 to 0.01).

Think about this intuitively: if someone already has a 90% chance of voting, it’s hard to increase that to 99%—most of the factors that would make them vote are already in play. Similarly, if someone has only a 10% chance of voting, it’s hard to decrease that to 1%—they’re already very unlikely to vote. This “resistance” at the extremes is what the S-shaped curve captures.
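To put rough numbers on this flattening, here is a quick check using plogis(), R’s logistic CDF (the same function used to draw the curves below):

# a one-unit increase in Xβ near the middle of the curve moves the probability a lot...
plogis(1) - plogis(0)   # ≈ 0.23
# ...but the same one-unit increase far out in the tail barely moves it
plogis(4) - plogis(3)   # ≈ 0.03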

# create visualizations of the logistic distribution
u_seq = seq(-4, 4, 0.01)

# probability density function
logistic_pdf = dlogis(u_seq)

# cumulative distribution function
logistic_cdf = plogis(u_seq)

# create the plots
pdf_plot = ggplot(data.frame(u = u_seq, pdf = logistic_pdf), aes(x = u, y = pdf)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Standard Logistic PDF",
       subtitle = "The 'bell shape' of random errors",
       x = "u", y = "Probability Density") +
  theme_minimal()

cdf_plot = ggplot(data.frame(u = u_seq, cdf = logistic_cdf), aes(x = u, y = cdf)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Standard Logistic CDF (Logit Link)",
       subtitle = "The S-curve that maps Xβ to probabilities",
       x = "Xβ", y = "Pr(Y = 1)") +
  theme_minimal() +
  geom_hline(yintercept = 0.5, linetype = "dashed", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5)

grid.arrange(pdf_plot, cdf_plot, ncol = 2)

1.2.2 Model Specification

The logit model specifies:

\[\Pr(Y_i = 1) = \Lambda(X_i\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\]

This formula might look intimidating, but let’s break it down piece by piece. We’re trying to convert \(X_i\beta\)—which can be any number—into a probability that must be between 0 and 1. The exponential function and the ratio accomplish this transformation.

  1. The Linear Predictor: \(X_i\beta\) is just like OLS—a weighted sum of your variables. If you have two predictors, this is \(\beta_0 + \beta_1X_{i1} + \beta_2X_{i2}\). This can range from \(-\infty\) to \(+\infty\).

    To make this concrete, imagine we’re predicting voting based on education and income. The linear predictor might be: \(-2 + 0.3 \times \text{Education} + 0.0001 \times \text{Income}\). For someone with 16 years of education and $50,000 income, this gives us: \(-2 + 0.3(16) + 0.0001(50000) = -2 + 4.8 + 5 = 7.8\).

  2. The Exponential: \(\exp(X_i\beta)\) transforms the linear predictor to be strictly positive. When \(X_i\beta = 0\), \(\exp(0) = 1\). When \(X_i\beta\) is large and positive, \(\exp(X_i\beta)\) becomes huge. When negative, it approaches 0.

    Continuing our example, \(\exp(7.8) \approx 2440\). This is a big positive number, suggesting this person is very likely to vote.

  3. The Ratio: \(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\) creates the “squashing” that ensures probabilities stay in [0,1]:

    • When \(X_i\beta = 0\): \(\frac{1}{1+1} = 0.5\)
    • When \(X_i\beta \to \infty\): \(\frac{\text{huge}}{\text{huge}+1} \approx 1\)
    • When \(X_i\beta \to -\infty\): \(\frac{\text{tiny}}{\text{tiny}+1} \approx 0\)

    For our example: \(\frac{2440}{1 + 2440} = \frac{2440}{2441} \approx 0.9996\). So this person has a 99.96% probability of voting according to our model. (The short code sketch below reproduces this calculation.)
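As a quick sanity check, here is the same calculation in R, using the hypothetical coefficients from the example above:

# hypothetical coefficients from the worked example above
xb = -2 + 0.3 * 16 + 0.0001 * 50000   # linear predictor = 7.8
exp(xb) / (1 + exp(xb))               # the ratio form: ≈ 0.9996
plogis(xb)                            # same answer via the logistic CDF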

Alternative Forms You’ll Encounter:

The same probability can be written as:

  • \(\Pr(Y_i = 1) = \frac{1}{1 + \exp(-X_i\beta)}\) (inverse logit form)
  • \(\Pr(Y_i = 1) = \Lambda(X_i\beta)\) (using CDF notation)

These are all the same thing, just written differently. You’ll see different forms in different textbooks and papers, so it’s good to recognize them all.

Connection to Odds:

The logit model has a popular interpretation through odds. If we rearrange:

\[\frac{\Pr(Y_i = 1)}{1 - \Pr(Y_i = 1)} = \exp(X_i\beta)\]

Taking logs of both sides:

\[\ln\left(\frac{\Pr(Y_i = 1)}{1 - \Pr(Y_i = 1)}\right) = X_i\beta\]

This means the log-odds are linear in \(X\). A one-unit change in \(X_k\) changes the log-odds by \(\beta_k\), or multiplies the odds by \(\exp(\beta_k)\). We’ll talk more about this interpretation later.
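A tiny numerical illustration of both statements, using made-up coefficient values:

# made-up coefficients for illustration
b0 = -1; b1 = 0.5
x  = 0:2
p  = plogis(b0 + b1 * x)          # probabilities at x = 0, 1, 2
diff(log(p / (1 - p)))            # each one-unit step in x adds b1 = 0.5 to the log-odds
exp(b1)                           # ...and multiplies the odds by exp(0.5) ≈ 1.65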

1.2.3 Likelihood Function

Now we need to estimate the \(\beta\) parameters. In OLS regression, we used the principle of least squares—finding the line that minimizes the sum of squared residuals. For binary outcomes, this doesn’t work well (we’ll see why when we discuss the Linear Probability Model later). Instead, we use maximum likelihood estimation—finding the parameters that make our observed data most likely.

Think about it this way: suppose we observed that 8 out of 10 highly educated people voted, and only 2 out of 10 less educated people voted. What parameter values would make this pattern most likely? Maximum likelihood finds those values.

Building the Likelihood:

For a single observation \(i\), we need the probability of observing what we actually saw:

  • If \(Y_i = 1\): We need \(\Pr(Y_i = 1) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\)
  • If \(Y_i = 0\): We need \(\Pr(Y_i = 0) = 1 - \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)} = \frac{1}{1 + \exp(X_i\beta)}\)

The clever trick is to write both cases in one formula:

\[L_i = \left[\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right]^{Y_i} \left[\frac{1}{1 + \exp(X_i\beta)}\right]^{1-Y_i}\]

Why This Works:

This formula might seem a bit spooky at first, but the key is that \(Y_i\) is either 0 or 1, which acts like a switch:

  • When \(Y_i = 1\): The first term is raised to power 1 (so it stays), the second term is raised to power 0 (so it becomes 1)
  • When \(Y_i = 0\): The first term is raised to power 0 (becomes 1), the second term is raised to power 1 (so it stays)

This is a common trick in statistics for writing conditional expressions in a single formula.

From Individual to Joint Likelihood:

Assuming independence across observations, the likelihood for the entire sample is the product:

\[L = \prod_{i=1}^N L_i\]

As we’ve discussed, products are computationally difficult (tiny numbers multiplied together), so we take logs.

The Log-Likelihood:

\[\ln L = \sum_{i=1}^N \left[Y_i \ln\left(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right) + (1-Y_i)\ln\left(\frac{1}{1 + \exp(X_i\beta)}\right)\right]\]

Simplifying the Log-Likelihood:

We can rewrite this in a more computationally stable form. Note that:

  • \(\ln\left(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right) = X_i\beta - \ln(1 + \exp(X_i\beta))\)
  • \(\ln\left(\frac{1}{1 + \exp(X_i\beta)}\right) = -\ln(1 + \exp(X_i\beta))\)

Therefore:

\[\ln L = \sum_{i=1}^N \left[Y_i X_i\beta - \ln(1 + \exp(X_i\beta))\right]\]
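To make this concrete, here is a minimal sketch that codes this simplified log-likelihood directly and maximizes it numerically with optim() on a small simulated dataset (the variable names and true coefficient values are arbitrary). The estimates should match glm():

# simulate a small dataset (arbitrary true coefficients)
set.seed(1)
n = 200
x = rnorm(n)
y = rbinom(n, 1, plogis(-0.5 + 1 * x))
X = cbind(1, x)                          # design matrix with an intercept column

# log-likelihood in the simplified form: sum[ Yi * Xi*beta - log(1 + exp(Xi*beta)) ]
logit_ll = function(beta) {
  xb = X %*% beta
  sum(y * xb - log(1 + exp(xb)))
}

# optim() minimizes by default, so we maximize by flipping the sign
fit = optim(par = c(0, 0), fn = function(b) -logit_ll(b), method = "BFGS")
fit$par                                  # ML estimates
coef(glm(y ~ x, family = binomial))      # essentially identical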

Why Can’t We Just Use Calculus to Solve for \(\beta\)?

In OLS, we take derivatives, set them to zero, and solve algebraically for \(\beta\). Let’s try that here and see where it breaks down.

Step 1: Take the First Derivative

To maximize the log-likelihood, we differentiate with respect to \(\beta\):

\[\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^N \left[Y_i X_i - X_i \cdot \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right]\]

Notice that \(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)} = \Pr(Y_i = 1 | X_i)\), so we can write:

\[\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^N X_i \left[Y_i - \Pr(Y_i = 1 | X_i)\right]\]

Note that this has a nice interpretation: we’re summing up the prediction errors (actual outcome minus predicted probability), weighted by the X values. When our predictions are perfect, this equals zero.

Step 2: Set Equal to Zero (First-Order Conditions)

For a maximum, we need:

\[\sum_{i=1}^N X_i \left[Y_i - \Pr(Y_i = 1 | X_i)\right] = 0\]

Or equivalently:

\[\sum_{i=1}^N X_i Y_i = \sum_{i=1}^N X_i \cdot \Pr(Y_i = 1 | X_i)\]

Step 3: Where We Get Stuck … Ruh-Roh

Here’s the problem: \(\Pr(Y_i = 1 | X_i) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\) is a nonlinear function of \(\beta\).

Let’s see this with a simple example. Suppose we have just an intercept and one predictor, so \(X_i\beta = \beta_0 + \beta_1 x_i\). Our first-order conditions become:

For \(\beta_0\):

\[\sum_{i=1}^N Y_i = \sum_{i=1}^N \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}\]

For \(\beta_1\):

\[\sum_{i=1}^N x_i Y_i = \sum_{i=1}^N x_i \cdot \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}\]

There’s no way to algebraically isolate \(\beta_0\) and \(\beta_1\) from these equations! The exponential function inside a ratio makes it impossible to “solve for \(\beta\)” the way we can with OLS. To understand why this is so problematic, imagine trying to solve the equation \(x = \frac{e^x}{1 + e^x}\). There’s no algebraic manipulation that will give you \(x\) in terms of elementary functions.

In OLS, the first-order conditions are:

\[\sum_{i=1}^N X_i (Y_i - X_i\beta) = 0\]

Which simplifies to:

\[\sum_{i=1}^N X_i Y_i = \sum_{i=1}^N X_i X_i'\beta\]

This can be solved directly:

\[\beta = (X'X)^{-1}X'Y\]

The key difference: In OLS, the predicted value \(X_i\beta\) is linear in \(\beta\), so we can factor it out and solve. In logit, the predicted probability is nonlinear in \(\beta\), trapping us in an equation we can’t solve analytically, thus we use the numerical estimation approaches that we discussed in the Likelihood Theory materials.

So what do we do? More specifically, if we can’t solve the first-order conditions, how do computers find \(\beta\)? The answer reveals a useful distinction between computing derivatives and actually solving equations.

We CAN compute the gradient (the derivative formula):

\[g(\beta) = \sum_{i=1}^N X_i \left[Y_i - \Lambda(X_i\beta)\right]\]

where \(\Lambda(X_i\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\) is the predicted probability.

This tells us the “slope” of the log-likelihood at any given \(\beta\) value – essentially, which direction is “uphill” toward the maximum.

We CAN’T solve \(g(\beta) = 0\) analytically because the exponential function inside the ratio makes it impossible to isolate \(\beta\) algebraically.

In practice, computers use the Newton-Raphson algorithm to navigate around this problem. Instead of solving the equation directly, it uses an iterative approach – like climbing a hill in dense fog where you can’t see the peak but can feel the slope beneath your feet. Specifically, the process works as follows:

  1. Initial Guess: Start with an initial estimate of \(\beta\), say \(\beta^{(0)}\) (often just zeros or OLS estimates).

  2. Compute Gradient and Hessian: Calculate the gradient (first derivative) and Hessian (second derivative) of the log-likelihood at the current estimate.

    1. The gradient tells us the direction and steepness: \[g(\beta^{(t)}) = \sum_{i=1}^N X_i \left[Y_i - \Lambda(X_i\beta^{(t)})\right]\] If positive, we should increase \(\beta\); if negative, decrease it.

    2. The Hessian tells us how the gradient is changing (the curvature): \[H(\beta^{(t)}) = -\sum_{i=1}^N X_i X_i' \cdot \Lambda(X_i\beta^{(t)})[1-\Lambda(X_i\beta^{(t)})]\]

  3. Update Rule: Update the estimate using the formula:

    \[\beta^{(t+1)} = \beta^{(t)} - H^{-1} g\] where \(H\) is the Hessian matrix and \(g\) is the gradient vector evaluated at \(\beta^{(t)}\).

    This formula essentially asks: “Given the current slope and curvature, if the function were quadratic, where would its maximum be?” We move to that point and reassess.

  4. Convergence Check: Repeat steps 2 and 3 until the change in \(\beta\) is smaller than a predefined threshold (typically when \(|\beta^{(t+1)} - \beta^{(t)}| < 0.00001\) or when the gradient is essentially zero).

Why This Works: At each iteration, Newton-Raphson approximates the complex nonlinear likelihood with a simpler quadratic function that matches the current point’s value, slope, and curvature. We can solve for the maximum of a quadratic easily, move there, and repeat. Within 4-6 iterations, we typically reach the maximum likelihood estimates.

The Key Insight: We never actually “solve” the first-order conditions. Instead, we use the gradient to guide us iteratively toward the solution—like following the steepest uphill path until we reach the peak where the gradient equals zero.
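Here is a minimal hand-rolled version of this algorithm for the logit model, reusing the simulated y and X from the log-likelihood sketch above. It is a teaching sketch, not a replacement for glm():

# newton-raphson for logit: iterate beta <- beta - H^{-1} g until convergence
newton_logit = function(y, X, tol = 1e-8, max_iter = 25) {
  beta = rep(0, ncol(X))                          # step 1: initial guess
  for (iter in 1:max_iter) {
    p = plogis(X %*% beta)                        # predicted probabilities Lambda(X beta)
    g = t(X) %*% (y - p)                          # gradient
    H = -t(X) %*% (X * as.vector(p * (1 - p)))    # Hessian
    beta_new = beta - solve(H, g)                 # step 3: update rule
    if (max(abs(beta_new - beta)) < tol) break    # step 4: convergence check
    beta = beta_new
  }
  as.vector(beta_new)
}

newton_logit(y, X)   # matches the optim() and glm() results above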

A Fundamental Trade-off: Functional Form Assumptions

Before moving forward we should also note that this nonlinearity raises an important critique: while OLS imposes a restrictive linear functional form that can violate probability bounds, models like logit impose their own restrictive assumption – that the relationship between \(X_i\beta\) and \(\Pr(Y_i = 1)\) follows a specific S-shaped curve (logistic). There’s no theoretical reason why the true relationship must follow exactly these curves.

Let me elaborate on this philosophical point because it’s rarely discussed in introductory treatments but is important for developing good judgment about model choice. Every model makes assumptions. OLS assumes the relationship is linear but makes no assumptions about the distribution of errors (for consistency of estimates). Logit assumes a very specific nonlinear relationship but this guarantees sensible probabilities. There’s no free lunch in statistics – we’re always trading one set of assumptions for another.

As Angrist and Pischke (2009) argue, we’re essentially trading one functional form assumption (linearity) for another (a specific nonlinear shape). The logistic and normal CDFs are convenient mathematical choices that guarantee sensible probabilities, but nature doesn’t necessarily follow our mathematical convenience. In some sense, the transparency of OLS applied to binary data – the linear probability model (LPM) – about its limitations (you can see exactly where it fails) might be preferable to logit/probit’s hidden assumption that we’ve correctly specified the entire shape of the probability curve.

Thus when you choose logit, you’re not just saying “probabilities should be between 0 and 1” – you’re imposing a very specific mathematical relationship at every point along the curve. Consider what the logistic function actually assumes:

  1. Symmetry Around 0.5: The logit model assumes the probability curve is perfectly symmetric around \(\Pr(Y=1) = 0.5\). The rate at which probabilities increase from 0.1 to 0.2 is exactly the same as the rate they increase from 0.8 to 0.9 (after adjusting for the log-odds scale). But why should this be true??? Maybe in your application, it’s much easier to move from low probabilities to moderate ones than from moderate to high.

Think about voter turnout: moving from 10% to 30% probability of voting (getting politically disengaged people slightly engaged) might be much easier than moving from 70% to 90% (getting already-engaged people to be super-engaged). Or it could be the opposite! The logit model forces symmetry regardless of which is true in reality.

  2. Specific Tail Behavior: The logistic CDF approaches 0 and 1 at very specific rates. When \(X_i\beta = -4\), the probability is about 0.018. When \(X_i\beta = -5\), it’s about 0.007. But what if the true process has thicker tails (more “surprises”) or thinner tails (more deterministic)? You’ve assumed away these possibilities by choosing logit.

For example, in modeling rare events like wars or financial crises, the true probability of extreme events might be higher than what the logistic distribution implies. These “black swan” events might require a distribution with fatter tails.

  3. Fixed Inflection Point: The steepest part of the S-curve (where marginal effects are largest) always occurs at \(\Pr(Y=1) = 0.5\) in logit. But perhaps in reality, the steepest change happens at 0.3 or 0.7. The complementary log-log model, for instance, has its steepest slope at a probability of about 0.63 (see the quick check after this list).

  4. Monotonic Effects: Logit assumes that as \(X\) increases, the probability either always increases (if \(\beta > 0\)) or always decreases (if \(\beta < 0\)), with no plateaus, reversals, or more complex patterns.
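A quick numerical check of the inflection-point difference: trace both curves on a fine grid and see where each is steepest.

# trace the logit and complementary log-log curves and find where each is steepest
eta     = seq(-5, 5, 0.001)
logit_p = plogis(eta)
clog_p  = 1 - exp(-exp(eta))           # complementary log-log CDF
logit_p[which.max(diff(logit_p))]      # steepest at Pr(Y = 1) ≈ 0.50
clog_p[which.max(diff(clog_p))]        # steepest at Pr(Y = 1) ≈ 0.63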

A Concrete Example of What Could Go Wrong

Imagine you’re modeling voter turnout as a function of education. The logit model forces a specific relationship: the effect of education on turnout probability must follow the logistic curve exactly. But what if reality is different?

  • Reality: Maybe there’s a “threshold effect” where education has little effect until high school completion, then a large jump, then diminishing returns.
  • What logit assumes: A smooth S-curve with the biggest effect at medium education levels.
  • Consequence: Your model might underestimate effects at the threshold and overestimate them elsewhere.

Or consider modeling loan default as a function of credit score:

  • Reality: Perhaps there are “cliffs” at certain scores (like 620, 700) due to institutional lending rules.
  • What logit assumes: Smooth transitions everywhere.
  • Consequence: You miss the discontinuities that actually drive lending decisions.

Why This Assumption is “Hidden”

The assumption is hidden because:

  1. It looks scientific: Using a proper CDF seems more rigorous than LPM’s obvious problems.

  2. We can’t test it directly: Unlike linearity (which we can assess with residual plots), we can’t easily test whether the true probability curve is logistic versus some other S-shape.

  3. Software makes it automatic: You type glm(..., family=binomial) and get results without thinking about what specific curve you’ve imposed.

  4. Small differences in the middle: Logit, probit, and other S-curves are very similar for probabilities between 0.2 and 0.8, where most data lie. The assumption mainly matters at the extremes.

The choice of MLE models like logit versus LPM isn’t really about which is “correct” because none of them are! It’s about which set of assumptions you’re most comfortable making:

  • OLS/LPM: “I’ll accept obviously wrong predictions at extremes for transparency and simplicity”
  • Logit/Probit/etc: “I’ll assume a specific mathematical curve to guarantee sensible probabilities”
  • Reality: “Pshh, I don’t follow any of your convenient mathematical functions”

As Beck (2020) notes, for many research questions—especially those focused on average effects in experimental or quasi-experimental settings – these functional form debates matter much less than we once thought. But researchers should understand what they’re assuming when they choose a model, rather than treating logit/probit as automatically superior to LPM just because they respect probability bounds.

This is why many applied researchers now emphasize robustness checks across multiple specifications and focus on average marginal effects, which tend to be similar across reasonable models, rather than treating any single functional form as “correct.”

The practical takeaway: always estimate multiple models (LPM, logit, probit) and check whether your substantive conclusions change. If they don’t, you can be more confident in your findings. If they do, you need to think carefully about which assumptions are most appropriate for your specific application.

1.2.4 Implementation

Below is an example of fitting a logit model in R using simulated data. We’ll generate some data, fit a logit model, and visualize the predicted probabilities.

# generate example data for all models
set.seed(6886)
n = 500
x = rnorm(n, 0, 2)
# true probability follows a logistic curve
true_prob = plogis(0.5 + 0.8*x)
y = rbinom(n, 1, true_prob)

# create data frame
data = data.frame(x = x, y = y)

# create prediction data
pred_data = data.frame(x = seq(-4, 4, 0.1))

# fit lpm model for later comparisons
lpm_model = lm(y ~ x, data = data)
pred_data$lpm_pred = predict(lpm_model, newdata = pred_data)
# fit the logit model
logit_model = glm(y ~ x, data = data, family = binomial(link = "logit"))

# get predictions
pred_data$logit_pred = predict(logit_model, newdata = pred_data, type = "response")

# show model summary
summary(logit_model)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"), data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.40111    0.11490   3.491 0.000481 ***
## x            0.88113    0.08217  10.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 688.91  on 499  degrees of freedom
## Residual deviance: 486.85  on 498  degrees of freedom
## AIC: 490.85
## 
## Number of Fisher Scoring iterations: 5
# visualize the logit predictions
logit_plot = ggplot(pred_data, aes(x = x)) +
  geom_line(aes(y = logit_pred), color = "grey20", size = 1.2) +
  geom_point(data = data, aes(x = x, y = y), alpha = 0.1) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", alpha = 0.5) +
  labs(title = "Logit Model Predictions",
       x = "Independent Variable (X)",
       y = "Predicted Probability") +
  theme_minimal()

print(logit_plot)

1.3 The Odds Ratio Approach

1.3.1 Understanding Odds

Earlier we said that there is an odds interpretation of the logit model. Let’s unpack what that means.

Before we dive into the math, let’s build intuition with a concrete example. Suppose 75 out of 100 people vote. The probability of voting is 0.75. The odds of voting are 75:25 or 3:1—for every three people who vote, one doesn’t. This is different from saying “3 times more likely” (which would be about probability). Odds tell us the ratio of success to failure.

The odds of an event are defined as:

\[\text{Odds} = \frac{\Pr(\text{Event})}{1 - \Pr(\text{Event})}\]

Odds represent how much more likely something is to happen than not happen. Unlike probabilities (which range from 0 to 1), odds range from 0 to infinity:

  • Probability 0.5 → Odds = 1 (equally likely to happen or not, “even odds”)
  • Probability 0.25 → Odds = 1/3 (three times more likely to NOT happen, “3 to 1 against”)
  • Probability 0.75 → Odds = 3 (three times more likely TO happen, “3 to 1 in favor”)
  • Probability 0.9 → Odds = 9 (nine times more likely to happen)
  • Probability 0.99 → Odds = 99 (ninety-nine times more likely to happen)

Some researchers prefer odds over probabilities for a few reasons:

  1. Multiplicative symmetry: If the odds of success are 3:1, the odds of failure are 1:3. With probabilities (0.75 and 0.25), the relationship isn’t as clean.

  2. Unbounded above: While probabilities hit a ceiling at 1, odds can grow infinitely, making them better for modeling strong effects.

  3. Natural for comparisons: Odds ratios (comparing odds across groups) are the standard in modern medical research.

1.3.3 Interpreting Logit Coefficients Through Odds

The Key Insight: In logit models, coefficients represent the change in log-odds for a one-unit change in X.

For a single predictor model: \(\ln(\text{odds}) = \beta_0 + \beta_1 X\)

  • \(\beta_0\): Log-odds when X = 0 (baseline log-odds)
  • \(\beta_1\): Change in log-odds for one-unit increase in X
  • \(e^{\beta_1}\): Odds ratio - how the odds multiply for one-unit increase in X

Concrete Example:

Suppose we’re modeling admission to graduate school based on GRE scores, and \(\beta_{GRE} = 0.003\).

  • Each additional GRE point increases log-odds by 0.003
  • Each additional GRE point multiplies odds by \(e^{0.003} \approx 1.003\)
  • A 100-point increase multiplies odds by \(e^{0.3} \approx 1.35\)
  • Someone with a 100-point higher GRE has 35% higher odds of admission

Let’s walk through this calculation in detail. If someone with a 600 GRE has a 20% admission probability (odds = 0.25), then someone with a 700 GRE has odds of 0.25 × 1.35 = 0.3375, which corresponds to a probability of 0.3375/(1+0.3375) ≈ 0.252 or 25.2%. Notice that the odds increased by 35% but the probability only increased by about 26% (from 20% to 25.2%).

For Binary Predictors, It’s Even Cleaner:

If we have a treatment variable with \(\beta_{treat} = 1.386\):

  • Treatment increases log-odds by 1.386
  • Treatment multiplies odds by \(e^{1.386} = 4\)
  • The treated group has 4 times the odds of success as the control group

1.3.4 Limitations of the Odds Ratio Interpretation

The Misconception Problem:

People often misinterpret odds ratios as probability ratios. If the odds ratio is 4:

  • Wrong: “Four times more likely” (that would be probability ratio)
  • Right: “Four times the odds”

For common outcomes (baseline probability > 0.2), odds ratios look much larger than the corresponding changes in probability:

  • If baseline probability = 0.4 and treatment doubles the odds
  • New probability = 0.57 (not 0.8!)
  • The probability increased by 43%, not 100%

This misconception is so common that many researchers avoid reporting odds ratios to lay audiences. In my opinion, it’s usually better to convert to probability differences or risk ratios.

# let's look at odds and log-odds
prob_seq = seq(0.01, 0.99, 0.01)
odds_seq = prob_seq / (1 - prob_seq)
log_odds_seq = log(odds_seq)

odds_df = data.frame(
  Probability = prob_seq,
  Odds = odds_seq,
  LogOdds = log_odds_seq
)

# create visualizations
odds_plot = ggplot(odds_df, aes(x = Odds, y = Probability)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Probability as a Function of Odds",
       subtitle = "Note the asymmetric, bounded relationship") +
  coord_cartesian(xlim = c(0, 10)) +
  geom_vline(xintercept = 1, linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 1, y = 0.2, label = "Even odds\n(P = 0.5)", hjust = -0.1) +
  theme_minimal()

log_odds_plot = ggplot(odds_df, aes(x = LogOdds, y = Probability)) +
  geom_line(color = "red", size = 1.2) +
  labs(title = "Probability as a Function of Log-Odds",
       subtitle = "The logistic function - symmetric around 0") +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = 0.5, linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 0, y = 0.3, label = "Log-odds = 0\n(P = 0.5)", hjust = -0.1) +
  theme_minimal()

grid.arrange(odds_plot, log_odds_plot, ncol = 2)

# show odds ratio interpretation
baseline_prob = 0.2
odds_ratio = 3
baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

cat("Example: Baseline probability = 0.2, Odds Ratio = 3\n")
## Example: Baseline probability = 0.2, Odds Ratio = 3
cat(sprintf("Baseline odds: %.3f\n", baseline_odds))
## Baseline odds: 0.250
cat(sprintf("New odds: %.3f\n", new_odds))
## New odds: 0.750
cat(sprintf("New probability: %.3f\n", new_prob))
## New probability: 0.429
cat(sprintf("Probability ratio: %.3f (NOT equal to odds ratio!)\n", new_prob/baseline_prob))
## Probability ratio: 2.143 (NOT equal to odds ratio!)

1.4 The Probit Model

Now let’s turn to the probit model, logit’s main competitor. The fundamental idea is the same—we’re transforming a linear predictor into a probability using an S-shaped curve. The difference is that probit uses the normal distribution (the familiar bell curve) instead of the logistic distribution.

1.4.1 Distribution and Properties

To understand probit, let’s explore the theoretical connection between the normal distribution assumption and what we observe. When we assume the errors in our latent variable model follow a normal distribution, we’re making a specific claim about the underlying process generating our binary outcomes.

The normal distribution often arises in nature due to the Central Limit Theorem - when many small, independent factors add up, their sum tends to be normally distributed. In the context of binary choices, think about:

  • Educational Attainment: The decision to pursue higher education might depend on numerous small factors (family support, academic preparation, financial resources, peer influences) that aggregate into a normally distributed propensity
  • Conflict Outcomes: Whether a pair of actors conflict might result from many political and environmental factors combining additively
  • Political Participation: The decision to vote could arise from accumulating civic experiences, social pressures, and personal motivations

This theoretical foundation suggests probit might be especially appropriate when your latent variable represents an accumulation of many influences.

Formally, the probit model assumes \(u_i\) follows a standard normal distribution:

  • PDF: \(\phi(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)\)
  • CDF: \(\Phi(u) = \int_{-\infty}^u \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt\)
  • Variance: \(\sigma^2 = 1\)

Unlike the logistic CDF which has the neat formula \(\frac{e^x}{1+e^x}\), the normal CDF is defined only as an integral – there’s no way to write it without that integral sign. This means:

  1. Computational complexity: Every probability calculation requires numerical integration or approximation
  2. Historical disadvantage: Before fast computers, logit was much easier to compute; with modern software the difference is negligible (see the quick check below)
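Both points are easy to see in R: pnorm() evaluates the integral numerically and near-instantly, and explicit numerical integration gives the same answer.

# the normal CDF has no closed form; software evaluates the integral numerically
pnorm(1.5)                                          # built-in numerical evaluation
integrate(dnorm, lower = -Inf, upper = 1.5)$value   # explicit numerical integration, same answer
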
# compare normal and logistic distributions
comparison_df = data.frame(
  u = u_seq,
  Normal_PDF = dnorm(u_seq),
  Logistic_PDF = dlogis(u_seq),
  Normal_CDF = pnorm(u_seq),
  Logistic_CDF = plogis(u_seq)
)

# compare probability density functions
pdf_comp = ggplot(comparison_df, aes(x = u)) +
  geom_line(aes(y = Normal_PDF, color = "Normal"), size = 1.2) +
  geom_line(aes(y = Logistic_PDF, color = "Logistic"), size = 1.2, linetype = "dashed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "PDF Comparison: Normal vs Logistic",
       subtitle = "Logistic has fatter tails",
       x = "u", y = "Density") +
  theme_minimal()

# compare cumulative distribution functions
cdf_comp = ggplot(comparison_df, aes(x = u)) +
  geom_line(aes(y = Normal_CDF, color = "Normal"), size = 1.2) +
  geom_line(aes(y = Logistic_CDF, color = "Logistic"), size = 1.2, linetype = "dashed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "CDF Comparison: Normal vs Logistic",
       subtitle = "Very similar S-shaped curves",
       x = "u", y = "Pr(U ≤ u)") +
  theme_minimal()

grid.arrange(pdf_comp, cdf_comp, ncol = 2)

1.4.2 Model Specification

The probit model specifies:

\[\Pr(Y_i = 1) = \Phi(X_i\beta) = \int_{-\infty}^{X_i\beta} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt\]

Unlike logit, where \(e^{\beta_k}\) represents the multiplicative change in odds, probit coefficients have no such clean interpretation. You cannot transform a probit coefficient into a meaningful quantity without knowing where you are on the curve. This is why:

  • Medical researchers prefer logit (odds ratios are standard in epidemiology)
  • We don’t mind (we’re going to focus on marginal effects anyway)

1.4.3 Likelihood Function and Estimation

The probit likelihood has the same structure as logit:

\[L_i = [\Phi(X_i\beta)]^{Y_i} [1-\Phi(X_i\beta)]^{1-Y_i}\]

But here’s where it gets interesting: The gradient (first derivative) of the probit log-likelihood has the following form:

\[\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^N X_i \cdot \lambda_i\]

where \(\lambda_i\) is the inverse Mills ratio:

  • When \(Y_i = 1\): \(\lambda_i = \frac{\phi(X_i\beta)}{\Phi(X_i\beta)}\)
  • When \(Y_i = 0\): \(\lambda_i = \frac{-\phi(X_i\beta)}{1-\Phi(X_i\beta)}\)

This ratio appears everywhere in the social sciences:

  1. Selection models: It’s the correction term in Heckman two-step
  2. Truncated regression: Adjusts for missing tails of distributions
  3. Interpretation: Represents the “hazard” of crossing the threshold

The fact that probit naturally produces these ratios is one reason it integrates so well with other methods.
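To get a feel for these terms, here is a small sketch that evaluates the inverse Mills ratio at a few (arbitrary) values of \(X_i\beta\):

# inverse Mills ratio terms that enter the probit score
xb     = c(-2, -1, 0, 1, 2)
imr_y1 = dnorm(xb) / pnorm(xb)          # lambda_i when Y_i = 1
imr_y0 = -dnorm(xb) / (1 - pnorm(xb))   # lambda_i when Y_i = 0
round(cbind(xb, imr_y1, imr_y0), 3)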

Computational Quirks of Probit

Because \(\Phi(\cdot)\) requires numerical integration:

  1. Approximation errors: Different software might give slightly different answers (usually negligible)
  2. Speed: Historically slower than logit (now barely noticeable)
  3. Numerical instability: For extreme values of \(X_i\beta\) (like |\(X_i\beta\)| > 8), \(\Phi\) is so close to 0 or 1 that computers struggle
  4. Starting values matter more: Poor initial guesses can cause convergence problems more often than with logit

1.4.4 Tail Behavior: When Probit and Logit Diverge

The practical difference between probit and logit emerges at the extremes:

Thinner Tails in Probit:

  • At \(X_i\beta = -4\): Probit gives \(\Pr(Y=1) \approx 0.00003\), Logit gives \(\approx 0.018\)
  • At \(X_i\beta = 4\): Probit gives \(\Pr(Y=1) \approx 0.99997\), Logit gives \(\approx 0.982\)

What This Means for Your Research:

  1. Rare events: Logit predicts more “surprises” (unexpected successes when probability is low)
  2. Near-certain events: Probit reaches “practical certainty” faster
  3. Classification: With same threshold (0.5), probit will classify more observations as certain 0s or 1s
  4. Outliers: Logit is more forgiving of observations that “shouldn’t” have happened

When These Differences Actually Matter:

# look at extreme predicted probabilities
X_beta_extreme <- c(-5, -4, -3, 3, 4, 5)
probit_p <- pnorm(X_beta_extreme)
logit_p <- plogis(X_beta_extreme)
ratio <- logit_p/probit_p

# at extremes, logit probabilities can be much larger than probit
print(ratio)
## [1] 2.334838e+04 5.679035e+02 3.513293e+01 9.538617e-01 9.820449e-01
## [6] 9.933074e-01

For most data – where fitted probabilities fall roughly between 0.1 and 0.9 – fitted logit and probit models produce predicted probabilities that differ by only a percentage point or two. But in applications with:

  • Interstate conflict (very low baseline rates)
  • Graduation rates (very high baseline rates)
  • Any extrapolation beyond the data

…the choice between probit and logit may become substantive, not just aesthetic.

1.4.5 The Dirty Secret: Why We Really Choose Probit vs Logit

Despite all the theory, the real reasons are often:

  1. Disciplinary convention:

  2. Software defaults:

  3. Extensions you plan:

    • Heading toward selection models? → Use probit
    • Planning multinomial extension? → Use logit (computational tractability)
    • Need to explain to non-technical audience? → Use logit (odds ratios)
  4. Reviewer preferences: Sometimes you choose the model that won’t trigger “why didn’t you use X instead?” from reviewers

The bottom line: For 95% of applications, the choice between probit and logit is inconsequential for your substantive conclusions. The remaining 5% involves extreme probabilities, specific theoretical models, or particular extensions where the choice truly matters.

1.4.6 Implementation

Below is an example of fitting a probit model in R using the same simulated data as before. We’ll fit the probit model and compare its predictions to the logit and LPM models.

# fit the probit model
probit_model = glm(y ~ x, data = data, family = binomial(link = "probit"))

# get predictions
pred_data$probit_pred = predict(probit_model, newdata = pred_data, type = "response")

# show model summary
summary(probit_model)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "probit"), data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.23591    0.06687   3.528 0.000419 ***
## x            0.52161    0.04438  11.752  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 688.91  on 499  degrees of freedom
## Residual deviance: 485.75  on 498  degrees of freedom
## AIC: 489.75
## 
## Number of Fisher Scoring iterations: 5
# compare all three models
comparison = ggplot(pred_data, aes(x = x)) +
  geom_line(aes(y = lpm_pred, color = "LPM"), size = 1) +
  geom_line(aes(y = logit_pred, color = "Logit"), size = 1) +
  geom_line(aes(y = probit_pred, color = "Probit"), size = 1, linetype = "dashed") +
  geom_point(data = data, aes(x = x, y = y), alpha = 0.05) +
  geom_hline(yintercept = c(0, 1), linetype = "dotted", alpha = 0.5) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Comparison: LPM vs Logit vs Probit",
       subtitle = "Logit and Probit produce similar predictions; LPM can violate bounds",
       x = "Independent Variable (X)",
       y = "Predicted Probability",
       color = "Model") +
  theme_minimal()

print(comparison)

1.5 The Random Utility Model

You might wonder: why are we suddenly talking about utility when we want to model binary outcomes? The answer reveals something profound about statistical modeling—our models often work better when they’re grounded in behavioral theory.

Think of it this way: when someone makes a choice (vote vs. don’t vote, migrate vs. stay, buy vs. don’t buy), they’re implicitly comparing the benefits of each option. The random utility model formalizes this intuition, giving us both theoretical coherence and practical tools.

1.5.1 Theoretical Foundation

Binary choice models can also be motivated through utility maximization, providing a behavioral foundation for the statistical models we’ve been discussing. This isn’t just mathematical convenience—it connects our statistical models to decades of research in economics, psychology, and decision science.

The Setup: Let’s make this concrete with an example before we dive into notation.

Imagine a citizen deciding whether to vote in an election:

  • Option 0 (Don’t vote): Save time and effort, avoid bad weather, skip researching candidates
  • Option 1 (Vote): Fulfill civic duty, influence the outcome, express preferences, but incur costs

The citizen will vote if the utility (net benefit) of voting exceeds the utility of not voting. This seems obvious, but here’s the key insight: we can’t observe the citizen’s utility directly. We only see their choice.

This is the classic “paradox of voting”—why do people vote when the probability of being pivotal is essentially zero? The random utility framework helps us model this.

More formally, consider an individual \(i\) choosing between two options:

  • Option 0: Status quo (e.g., not buying, voting “no”, staying home)
  • Option 1: Alternative (e.g., buying, voting “yes”, migrating)

The individual chooses the option with higher utility:

\[Y_i = \begin{cases} 1 & \text{if } U_{i1} > U_{i0} \\ 0 & \text{if } U_{i0} \geq U_{i1} \end{cases}\]

Why “Random” Utility?

The word “random” might seem odd here – aren’t people’s choices deliberate? Yes, but from our perspective as researchers, there’s randomness because we can’t observe everything that influences decisions.

We call it “random” because from the researcher’s perspective, utility has:

  1. Observable components: Things we can measure (demographics, party registration, past behavior)
    • Example: We know the citizen’s age, education, party ID, distance to polling place
  2. Unobservable components: Things we can’t measure (civic duty, mood, private information)
    • Example: We don’t know their sense of civic obligation, whether they had a bad day at work, conversations with friends about candidates

This distinction between what we can and cannot observe is crucial—it’s why we need probabilistic models rather than deterministic ones.

1.5.2 Decomposing Utility

Each option’s utility consists of a systematic (observable) and random (unobservable) component:

\[U_{i0} = V_{i0} + \epsilon_{i0} = X_{i0}\beta + \epsilon_{i0}\] \[U_{i1} = V_{i1} + \epsilon_{i1} = X_{i1}\beta + \epsilon_{i1}\]

where:

  • \(V_{ij} = X_{ij}\beta\) is the systematic utility (what we can explain)
  • \(\epsilon_{ij}\) is the random component (what we can’t explain)

The Choice Probability:

Individual \(i\) chooses option 1 when \(U_{i1} > U_{i0}\):

\[\begin{align} \Pr(Y_i = 1) &= \Pr(U_{i1} > U_{i0}) \\ &= \Pr(V_{i1} + \epsilon_{i1} > V_{i0} + \epsilon_{i0}) \\ &= \Pr(\epsilon_{i0} - \epsilon_{i1} < V_{i1} - V_{i0}) \\ &= \Pr(\epsilon_{i0} - \epsilon_{i1} < X_{i1}\beta - X_{i0}\beta) \end{align}\]

1.5.3 The Distributional Assumption Determines the Model

Key Insight: The distribution of \((\epsilon_{i0} - \epsilon_{i1})\) determines which model we get.

Let’s pause here because this is a crucial point that often confuses students. We’re about to make an assumption about something we can’t even observe (the error terms). Why does this matter?

Think of it like this: The errors represent all the factors we don’t measure. Different assumptions about these unmeasured factors lead to different models:

  • If we think unmeasured factors follow a bell curve (normal distribution) → Probit model
  • If we think they follow a slightly different pattern with heavier tails → Logit model
  • Other assumptions → Other models

The beautiful part? For most applications, these different assumptions lead to very similar predictions. It’s like choosing between different brands of tires—they all get you where you’re going, just with slightly different characteristics.

Case 1: Type I Extreme Value (Gumbel) Errors → Logit

If \(\epsilon_{i0}\) and \(\epsilon_{i1}\) are independent Type I extreme value distributed, then their difference follows a logistic distribution:

\[\Pr(Y_i = 1) = \frac{\exp(V_{i1} - V_{i0})}{1 + \exp(V_{i1} - V_{i0})} = \frac{1}{1 + \exp(-(V_{i1} - V_{i0}))}\]

Why Extreme Value? These distributions arise naturally when modeling the maximum of many random shocks—fitting for choices where people pick the “best” option.

Case 2: Normal Errors → Probit

If \(\epsilon_{i0}\) and \(\epsilon_{i1}\) are jointly normal, then their difference is also normal:

\[\epsilon_{i0} - \epsilon_{i1} \sim N(0, \sigma^2_0 + \sigma^2_1 - 2\sigma_{01})\]

Usually we assume independence (\(\sigma_{01} = 0\)) and normalize the variance, giving:

\[\Pr(Y_i = 1) = \Phi\left(\frac{V_{i1} - V_{i0}}{\sigma}\right)\]
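A short simulation makes the link between error assumptions and models concrete: draw utilities under each error assumption, apply the “choose the option with higher utility” rule, and compare the simulated choice frequencies to the logit and probit formulas. The systematic utility difference of 0.5 below is an arbitrary illustration value.

# simulate the random utility model under two error assumptions
set.seed(2)
n_sim  = 100000
v_diff = 0.5                               # V1 - V0, chosen arbitrarily

# case 1: type I extreme value (Gumbel) errors -> logit
g0 = -log(-log(runif(n_sim)))              # Gumbel draws via inverse CDF
g1 = -log(-log(runif(n_sim)))
mean(v_diff + g1 > g0)                     # simulated choice probability
plogis(v_diff)                             # logit prediction

# case 2: normal errors -> probit
e0 = rnorm(n_sim); e1 = rnorm(n_sim)
mean(v_diff + e1 > e0)                     # simulated choice probability
pnorm(v_diff / sqrt(2))                    # probit prediction; Var(e0 - e1) = 2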

1.5.4 Normalizations and Identification

The Fundamental Problem: We can never separately identify:

  1. The scale of utility (is a utility of 10 twice as good as 5?)
  2. The level of utility (where is zero?)
  3. The variance of errors

An Analogy to Make This Clear:

Imagine you’re trying to measure “democratic quality.” Different indices use different scales:

  • Freedom House uses a 1-7 scale
  • Polity uses -10 to +10
  • V-Dem uses continuous 0-1

Are these measuring the same thing? How do we compare? We need to fix a scale. Similarly, with utility, we need to pin down some reference points to make our model identifiable.

Standard Normalizations:

  1. Location: Set \(V_{i0} = 0\) (one option becomes the baseline)
  2. Scale: Fix error variance (1 for probit, \(\pi^2/3\) for logit)

After normalization, our familiar models emerge:

  • Logit: \(\Pr(Y_i = 1) = \Lambda(X_i\beta)\)
  • Probit: \(\Pr(Y_i = 1) = \Phi(X_i\beta)\)

where \(X_i\) now represents the difference in characteristics between options.

1.5.5 What Random Utility Models Give Us

1. Behavioral Foundation:

Instead of just saying “we model probability with an S-curve,” we can say “individuals maximize utility with random components.”

2. Welfare Analysis:

We can compute consumer surplus and welfare effects because we have a utility interpretation.

3. Extensions to Multiple Choices:

The framework naturally extends to:

  • Multinomial logit: Choosing among J > 2 options
  • Nested logit: Choices with hierarchical structure
  • Mixed logit: Allowing for preference heterogeneity

4. Structural Interpretation:

Coefficients represent preference parameters, not just statistical associations.

1.5.6 Practical Implications for Researchers

When the Random Utility Framework Matters:

  1. Policy evaluation: Need to compute welfare effects
  2. Demand estimation: Modeling consumer choices
  3. Discrete choice experiments: Designed to elicit preferences
  4. Structural models: Building economic models with micro-foundations

When It’s Just a Story:

Often, the random utility model is just a motivation – we’re really just fitting a nonlinear probability model. That’s fine! But be aware that:

  • Welfare calculations require the utility interpretation to be real
  • Preference parameters only make sense if utility is meaningful
  • Out-of-sample predictions assume stable preference structure

The McFadden Revolution:

Daniel McFadden won the Nobel Prize (2000) for developing these models. His key insight: by assuming specific error distributions, we can turn utility maximization into tractable statistical models.

1.5.7 A Simple Example: Interstate Conflict Decision

Consider modeling whether State A initiates conflict with State B:

Utility from peace: \(U_{Peace} = \beta_0 + \beta_1 \cdot \text{TradeDependence} + \epsilon_P\)

Utility from conflict: \(U_{Conflict} = \beta_2 \cdot \text{RelativePower} + \beta_3 \cdot \text{TerritorialDispute} + \epsilon_C\)

Conflict occurs when: \(U_{Conflict} > U_{Peace}\)

After differencing and normalizing:

\[\Pr(\text{Conflict}) = \Lambda(\beta_2 \cdot \text{RelativePower} + \beta_3 \cdot \text{TerritorialDispute} - \beta_1 \cdot \text{TradeDependence} - \beta_0)\]

The coefficients now have structural interpretation:

  • \(\beta_1\): How trade dependence affects the utility of maintaining peace (opportunity costs)
  • \(\beta_2\): How relative military power affects the utility of conflict (probability of winning)
  • \(\beta_3\): How territorial disputes affect conflict utility (value of stakes)

This is the same logit model we’d fit statistically, but now with behavioral meaning. Cool, right?!

1.5.8 What the Error Distribution Means for Applied Researchers

Now we come to a section that might seem highly technical but actually has very practical implications for your research. The assumption about the distribution of \(u_i\) is not merely mathematical minutiae – it affects how your model behaves in real applications, and researchers should understand why.

Think of it this way: The error distribution is like choosing the lens through which you view your data. Different lenses (distributions) will:

  • Emphasize different aspects of the data
  • Handle extreme cases differently
  • Lead to different predictions in the tails

Let’s explore what this means in practice:

1.5.8.1 Tail Behavior and Extreme Events

First, what do we mean by “tails”?

In statistics, the “tails” of a distribution refer to the extreme values—the very high or very low outcomes that occur rarely. For binary models, this translates to:

  • Lower tail: Cases where the probability is very close to 0 (almost certainly won’t happen)
  • Upper tail: Cases where the probability is very close to 1 (almost certainly will happen)

The error distribution determines how the model handles extreme probabilities. This might seem abstract, so let’s use a concrete example:

Imagine modeling democratic breakdowns or coups:

  • Logistic Distribution (Logit): Has heavier tails (excess kurtosis = 1.2), meaning:

    • More probability mass in extreme values
    • Slower approach to 0 and 1 probabilities
    • More “surprises” (unexpected 0s when probability is high, unexpected 1s when probability is low)
    • Better for modeling processes with occasional extreme deviations
  • Normal Distribution (Probit): Has lighter tails (excess kurtosis = 0), meaning:

    • Faster approach to certainty (0 or 1)
    • Fewer extreme observations contrary to prediction
    • Better for modeling processes following central limit theorem logic

Practical Implication: If your phenomenon involves rare but impactful shocks (coups, revolutions, war onset), logit’s heavier tails may be more appropriate. If the process aggregates many small influences (voting behavior from multiple social pressures, gradual democratization), probit’s normal assumption may be preferable.

1.5.8.2 Identification and Scale

The variance of \(u_i\) is not identified separately from \(\beta\):

  • Logit: Assumes \(\text{Var}(u_i) = \pi^2/3 \approx 3.29\)
  • Probit: Assumes \(\text{Var}(u_i) = 1\)

This means:

  • Coefficients are only identified up to scale
  • Logit coefficients are typically about 1.6-1.8 times larger than probit coefficients
  • Cannot compare coefficient magnitudes across models with different link functions
  • Must focus on marginal effects or predicted probabilities for cross-model comparisons

Practical Implication: Never compare raw coefficients between logit and probit models. Always transform to a common scale (predicted probabilities, marginal effects, or standardized coefficients).
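With the logit and probit fits from the implementation sections above, you can see both the scale gap and the practical equivalence directly:

# raw coefficients differ by roughly the usual logit/probit scale factor...
coef(logit_model)["x"] / coef(probit_model)["x"]        # ≈ 1.7
# ...but the predicted probabilities are nearly identical
max(abs(pred_data$logit_pred - pred_data$probit_pred))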

1.5.8.3 Theoretical Foundations

The error distribution connects to different theoretical frameworks:

Logistic (Logit):

  • Arises from random utility models with Type I extreme value errors
  • Natural for modeling discrete choices among alternatives
  • Connects to multinomial and conditional logit for multiple choices
  • Computational advantages in certain theoretical models

Normal (Probit):

  • Arises from threshold models with normally distributed latent traits
  • Natural when latent variable has normal distribution (by CLT or assumption)
  • Connects cleanly to factor analysis and structural equation models
  • Facilitates certain types of model extensions (selection models, SUR)

Practical Implication: Choose based on your theoretical model. If you’re extending to multinomial choice, logit provides consistency. If you’re embedding in a larger structural model with normal assumptions, probit maintains coherence.

1.5.8.4 What Researchers Should Look For

When choosing between error distributions, examine:

A. Empirical Fit at the Tails

# check how accurate predictions are at extreme values (using the logit fit from earlier)
data$predicted_prob <- predict(logit_model, type = "response")
extreme_low  <- data[data$predicted_prob < 0.1, ]
extreme_high <- data[data$predicted_prob > 0.9, ]
# compare what actually happened vs. what we predicted in each extreme group
mean(extreme_low$y)
mean(extreme_high$y)

B. Theoretical Consistency

  • Does your theory imply a specific aggregation process?
  • Are you modeling individual heterogeneity or population aggregates?
  • Will you extend to multinomial or ordered outcomes?

C. Sensitivity Analysis

  • Fit both models and compare (a quick check using the earlier fits appears after this list):

    • Predicted probabilities (should be very similar)
    • Marginal effects (should be nearly identical)
    • Out-of-sample predictions
    • Information criteria (AIC/BIC)
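Here is what such a comparison looks like with the logit and probit fits from earlier:

# quick sensitivity comparison of the earlier logit and probit fits
AIC(logit_model, probit_model)                   # information criteria
BIC(logit_model, probit_model)
cor(fitted(logit_model), fitted(probit_model))   # in-sample predicted probabilities are nearly identical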

D. Warning Signs of Misspecification

  • Large differences between logit and probit predictions (suggests neither is right)
  • Systematic prediction errors at extremes
  • Residual patterns suggesting different tail behavior
  • Theoretical reasons for non-standard error distributions

1.5.8.5 When the Standard Assumptions Break Down

Sometimes neither logistic nor normal errors are appropriate:

Asymmetric Processes:

  • Use complementary log-log or log-log for asymmetric approach to boundaries
  • Consider scobit for skewed response curves

Contaminated Processes:

  • Mixture models for populations with different error distributions
  • Robust link functions for outlier-prone processes

Bounded Rationality:

  • Trembling hand models with explicit error probabilities
  • Quantal response models with tunable precision parameters

Practical Implication: The logit/probit choice is often less important than recognizing when neither is appropriate. Large discrepancies between models or poor fit at extremes suggest exploring alternative link functions.

1.5.8.6 The Bottom Line on Error Distributions

For most applications, the choice between logistic and normal errors is inconsequential – both will yield similar substantive conclusions. However, researchers should:

  1. Default to disciplinary conventions unless there’s a reason to deviate
  2. Let theory guide choice when extending to more complex models
  3. Check robustness by estimating both specifications
  4. Focus on quantities that don’t depend on the distribution (predicted probabilities, marginal effects)
  5. Be alert to signs that standard distributions are inadequate

As Amemiya (1981) noted, “the choice between logit and probit models is largely one of convenience and convention.”

2 Alternative binary response models

So far, we’ve focused on the symmetric S-curves of logit and probit. But what if your phenomenon doesn’t follow that symmetric pattern? What if the approach to 0 is fundamentally different from the approach to 1?

This is where alternative link functions come in. These aren’t just mathematical curiosities—they can better capture the underlying process generating your data.

2.1 Complementary Log-Log Model

Before diving into the math, let’s understand when you might need an asymmetric model:

Example: Democratic Transitions

  • For a long time, the probability of democratization in an authoritarian regime remains very low
  • But once certain conditions are met (economic crisis, elite splits, mass protests reach critical mass)
  • The probability of regime change shoots up rapidly
  • This is NOT symmetric around 0.5!

Another Example: Revolution or Civil War Onset

  • Years of grievances with low probability of conflict
  • Then a triggering event (assassination, election fraud, economic collapse)
  • Suddenly the probability of violence escalates quickly

2.1.1 Specification

The complementary log-log (cloglog) model uses an asymmetric link function:

\[\Pr(Y_i = 1) = 1 - \exp[-\exp(X_i\beta)]\]

Equivalently:

\[\ln\{-\ln[1 - \Pr(Y_i = 1)]\} = X_i\beta\]

Understanding the Double Exponential

This looks complicated, but there’s intuition here. Let’s build it up step by step with a concrete example.

Example: Modeling Government Collapse

Imagine we’re modeling the probability a government falls within a year:

  1. \(\exp(X_i\beta)\) creates a positive hazard rate

    • Think of this as the “instability level” - always positive
    • If \(X_i\beta = 0\), hazard = 1 (baseline instability)
    • If \(X_i\beta = 2\), hazard ≈ 7.4 (high instability, e.g., coalition breakdown)
  2. \(-\exp(X_i\beta)\) makes it negative

    • We need this negative for the next step
  3. \(\exp[-\exp(X_i\beta)]\) is the survival probability (probability of government surviving)

    • This gives us the chance of making it through the period
    • When instability is high, survival probability is low
  4. \(1 - \exp[-\exp(X_i\beta)]\) is the probability of government collapse

    • Finally, we flip it to get probability of the event happening

Working from the inside out like this makes clear where the structure comes from: survival analysis, where we model the hazard (the instantaneous risk) of an event occurring within the observation period.
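To make the four steps concrete, here is a quick numerical walk-through (a sketch using the two hypothetical instability levels mentioned above; it does not depend on any of the fitted models in this document):

# walk through the cloglog pieces for two hypothetical instability levels
xb = c(0, 2)                      # linear predictor: baseline vs. high instability
hazard = exp(xb)                  # step 1: always-positive hazard (1 and about 7.39)
survival = exp(-hazard)           # steps 2-3: probability the government survives
collapse = 1 - survival           # step 4: probability of collapse
round(data.frame(xb, hazard, survival, collapse), 3)

Note that at \(X_i\beta = 0\) the collapse probability is \(1 - e^{-1} \approx 0.63\), which is exactly where the cloglog curve is steepest (see the properties listed below).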

2.1.2 Key Properties

  • Asymmetric: Unlike logit and probit, cloglog is not symmetric around 0.5
  • Approaches 0 slowly, 1 quickly: The curve rises steeply once it gets going
  • Maximum slope occurs around Pr = 0.63: Specifically at \(1 - e^{-1} \approx 0.632\), not at 0.5 like symmetric models

2.1.3 When to Use Cloglog

1. Discrete-Time Survival Analysis

Cloglog naturally arises when you observe a continuous-time process at discrete intervals. If the underlying hazard is constant within intervals:

\[\Pr(\text{Event in interval } t) = 1 - \exp[-\exp(\alpha_t + X_i\beta)]\]

Example: Modeling democratic transitions observed annually when the underlying democratization pressure is continuous. Or modeling war onset observed yearly when tensions build continuously.
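The sketch below shows what this setup looks like in code with a small simulated country-year panel (the panel, variable names, and coefficient values are all invented for illustration; a real discrete-time analysis would also drop country-years after a country's first transition so that only units still at risk remain):

# minimal simulated country-year example (purely illustrative)
set.seed(1)
cy = expand.grid(country = 1:50, year = 1:20)
cy$gdp_growth = rnorm(nrow(cy))
cy$hazard_xb = -4 + 0.15 * cy$year - 0.5 * cy$gdp_growth    # latent log-hazard
cy$transition = rbinom(nrow(cy), 1, 1 - exp(-exp(cy$hazard_xb)))

# factor(year) would give the interval-specific intercepts alpha_t;
# a linear trend keeps the sketch small
dt_model = glm(transition ~ year + gdp_growth, data = cy,
               family = binomial(link = "cloglog"))
round(summary(dt_model)$coefficients, 3)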

2. Rare Events That Can Escalate Quickly

The asymmetry matches processes where:

  • Events are unlikely for a long time
  • Once conditions worsen, probability rises rapidly
  • There’s a “tipping point” dynamic

Examples:

  • Protest movements escalating to revolution
  • Regional conflicts spreading to neighboring states
  • Democratic waves spreading across regions (Arab Spring, Latin American democratization)
  • Electoral violence triggering broader civil conflict

3. Theoretical Justification from Extreme Value Theory

If \(Y = 1\) when the maximum of many random shocks exceeds a threshold, extreme value theory says that maximum is approximately Gumbel (Type I extreme value) distributed, and a Gumbel error in the latent-variable setup yields exactly the cloglog link.

2.1.4 Interpreting Cloglog Coefficients

Unlike logit (where coefficients are log-odds ratios), cloglog coefficients have a different interpretation. This is where many researchers get confused, so let’s break it down carefully.

\[\ln\{-\ln[1 - \Pr(Y_i = 1)]\} = X_i\beta\]

What does this mean in plain English?

  • Positive \(\beta_k\): Increases the log-hazard
    • Example: If \(\beta_{protests} = 0.5\), more protests increase the “hazard” of regime change
    • But unlike logit, we can’t talk about odds ratios; instead, \(\exp(\beta_k)\) acts as a hazard ratio in the survival-style interpretation
  • The effect is not symmetric: A one-unit increase has a different magnitude effect than a one-unit decrease
    • Why this matters: If you’re studying both protective and risk factors, their effects won’t be mirror images
  • Marginal effects depend more strongly on baseline probability than in logit/probit
    • Practical implication: The same variable might have very different effects for high-risk vs. low-risk groups

Practical Tip: Focus on predicted probabilities and marginal effects rather than raw coefficients.

# define the link functions
x_range = seq(-3, 3, 0.01)
logit_probs = plogis(x_range)
probit_probs = pnorm(x_range)
cloglog_probs = 1 - exp(-exp(x_range))
loglog_probs = exp(-exp(-x_range))  # Note: reversed interpretation

link_comparison = data.frame(
  x = x_range,
  Logit = logit_probs,
  Probit = probit_probs,
  Cloglog = cloglog_probs,
  Loglog = loglog_probs
)

link_plot = link_comparison %>%
  pivot_longer(cols = -x, names_to = "Model", values_to = "Probability") %>%
  ggplot(aes(x = x, y = Probability, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Comparison of Link Functions",
       subtitle = "Note the asymmetry in cloglog and loglog",
       x = "Linear Predictor (Xβ)",
       y = "Pr(Y = 1)") +
  theme_minimal() +
  geom_hline(yintercept = 0.5, linetype = "dotted", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dotted", alpha = 0.5)

print(link_plot)

# show where each model reaches key probabilities
key_probs = c(0.1, 0.5, 0.9)
for(p in key_probs) {
  cat(sprintf("\nXβ value needed for Pr(Y=1) = %.1f:\n", p))
  cat(sprintf("  Logit:   %6.3f\n", qlogis(p)))
  cat(sprintf("  Probit:  %6.3f\n", qnorm(p)))
  cat(sprintf("  Cloglog: %6.3f\n", log(-log(1-p))))
  cat(sprintf("  Loglog:  %6.3f\n", -log(-log(p))))
}
## 
## Xβ value needed for Pr(Y=1) = 0.1:
##   Logit:   -2.197
##   Probit:  -1.282
##   Cloglog: -2.250
##   Loglog:  -0.834
## 
## Xβ value needed for Pr(Y=1) = 0.5:
##   Logit:    0.000
##   Probit:   0.000
##   Cloglog: -0.367
##   Loglog:   0.367
## 
## Xβ value needed for Pr(Y=1) = 0.9:
##   Logit:    2.197
##   Probit:   1.282
##   Cloglog:  0.834
##   Loglog:   2.250
# fit the cloglog model
cloglog_model = glm(y ~ x, data = data, 
                      family = binomial(link = "cloglog"))
summary(cloglog_model)
## 
## Call:
## glm(formula = y ~ x, family = binomial(link = "cloglog"), data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.21156    0.07242  -2.921  0.00349 ** 
## x            0.56631    0.05024  11.272  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 688.91  on 499  degrees of freedom
## Residual deviance: 488.21  on 498  degrees of freedom
## AIC: 492.21
## 
## Number of Fisher Scoring iterations: 5
# compare predictions at extreme values
extreme_x = c(-2, -1, 0, 1, 2)
pred_comparison = data.frame(
  x = extreme_x,
  Logit = predict(logit_model, newdata = data.frame(x = extreme_x), type = "response"),
  Probit = predict(probit_model, newdata = data.frame(x = extreme_x), type = "response"),
  Cloglog = predict(cloglog_model, newdata = data.frame(x = extreme_x), type = "response")
)
print("\nPredicted probabilities at different X values:")
## [1] "\nPredicted probabilities at different X values:"
print(pred_comparison)
##    x     Logit    Probit   Cloglog
## 1 -2 0.2040555 0.2097422 0.2295298
## 2 -1 0.3822494 0.3875535 0.3683281
## 3  0 0.5989553 0.5932502 0.5548412
## 4  1 0.7828308 0.7756334 0.7596912
## 5  2 0.8969111 0.8995768 0.9188912

2.2 Log-Log Model

The log-log model is the “reverse” of cloglog:

\[\Pr(Y_i = 1) = \exp[-\exp(-X_i\beta)]\]

Key Differences from Cloglog:

  • Approaches 1 slowly, 0 quickly (opposite of cloglog)
  • Maximum slope occurs around Pr = 0.37: Specifically at \(e^{-1} \approx 0.368\), not at 0.5
  • Coefficient signs depend on the parameterization: written as \(\ln[-\ln \Pr(Y_i = 1)] = X_i\beta\), i.e. \(\Pr(Y_i = 1) = \exp[-\exp(X_i\beta)]\), positive coefficients DECREASE probability (counterintuitive!); the sign-reversed version above is increasing in \(X_i\beta\), so always check which convention your software or source uses

2.2.1 When to Use Log-Log

Rare Non-Events: When you’re modeling the probability of something NOT happening, and non-occurrence can become certain quickly:

  • Probability of equipment NOT failing
  • Probability of NOT defaulting on a loan
  • Probability of a species NOT going extinct

The Coefficient Interpretation Trap

When the model is written with the log-log transform applied directly to the success probability, \(\ln[-\ln \Pr(Y_i = 1)] = X_i\beta\), interpretation is tricky:

  • Positive \(\beta\) → Lower probability of Y = 1
  • Negative \(\beta\) → Higher probability of Y = 1

This is so counterintuitive that many researchers avoid log-log entirely, preferring to:

  1. Flip their outcome coding and use cloglog
  2. Use cloglog and focus on 1 - Pr(Y = 1)
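As a concrete sketch of option 1 (using the simulated data and single predictor x from the running example), flipping the outcome and fitting cloglog gives the log-log model for the original outcome; just remember that the reported coefficients then describe \(\Pr(Y = 0)\):

# option 1: model the flipped outcome with cloglog;
# this is equivalent to a log-log model for the original outcome
flipped_cloglog = glm(I(1 - y) ~ x, data = data,
                      family = binomial(link = "cloglog"))
# implied Pr(Y = 1) under the log-log model:
loglog_pred = 1 - predict(flipped_cloglog, type = "response")
head(round(loglog_pred, 3))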

2.3 Scobit Model

2.3.1 Motivation

The scobit (skewed-logit) model, introduced by Nagler (1994), generalizes the logit model to allow for asymmetric response curves while maintaining the odds ratio framework.

Why would you want this?

Imagine you’re modeling legislative voting behavior and you suspect:

  • It’s hard to get strong partisans to cross party lines (probability increases slowly at first)
  • But once a legislator becomes somewhat likely to defect, additional pressures have bigger effects
  • Yet you still want to interpret results using odds ratios (unlike cloglog)

Or consider modeling alliance formation:

  • Countries with historical animosity are very unlikely to ally (slow initial increase)
  • But once they overcome initial barriers, additional incentives matter more

Scobit lets you test whether this asymmetry exists while keeping the interpretational framework of logit.

2.3.2 Specification

The scobit is built on the Burr-10 distribution:

\[\Pr(Y_i = 1) = \frac{1}{[1 + \exp(-X_i\beta)]^\alpha}\]

where \(\alpha > 0\) is a shape parameter.

What α Does:

  • \(\alpha = 1\): Standard logit (symmetric)
  • \(\alpha < 1\): Probability rises slowly at first, then quickly (like cloglog)
  • \(\alpha > 1\): Probability rises quickly at first, then slowly (like log-log)

2.3.3 The Appeal and Problems of Scobit

Theoretical Appeal:

  1. Nests logit: Can test \(H_0: \alpha = 1\) to see if asymmetry matters
  2. Preserves odds ratios: Maintains some interpretability of logit
  3. Flexible: Can approximate various asymmetric patterns

Practical Problems:

1. Identification Issues

The shape parameter \(\alpha\) and slope coefficients \(\beta\) can be hard to separately identify:

  • Both affect how steeply probability changes
  • Need lots of data, especially in tails
  • Convergence problems are common

2. Interpretation Complexity

With varying \(\alpha\):

  • Odds ratios are no longer constant
  • Marginal effects depend on both \(\alpha\) and \(\beta\)
  • Can’t compare coefficients across models with different \(\alpha\)

2.3.4 When Scobit Might Be Worth It

Strong Theoretical Reason for Flexible Asymmetry: You believe the response curve could be asymmetric in either direction and want to let the data decide.

Large Sample with Good Coverage: You have thousands of observations with outcomes across the probability range.

Model Selection Exercise: You’re comparing multiple link functions and want the most flexible option.
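There is no scobit link in base R’s glm(), so here is a minimal maximum-likelihood sketch written against the simulated data and single predictor x used throughout (the shape parameter is estimated on the log scale to keep it positive). Treat it as an illustration of the likelihood rather than production code; dedicated implementations with analytic gradients and careful starting values are more reliable.

# negative log-likelihood for the scobit: Pr(Y = 1) = [1 + exp(-Xb)]^(-alpha)
scobit_nll = function(par, y, x) {
  b0 = par[1]; b1 = par[2]; alpha = exp(par[3])   # log-alpha keeps alpha > 0
  p = (1 + exp(-(b0 + b1 * x)))^(-alpha)
  p = pmin(pmax(p, 1e-10), 1 - 1e-10)             # guard against log(0)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# start from the logit fit (alpha = 1, i.e., log-alpha = 0)
start = c(coef(logit_model), 0)
scobit_fit = optim(start, scobit_nll, y = data$y, x = data$x,
                   method = "BFGS", hessian = TRUE)
c(scobit_fit$par[1:2], alpha = exp(scobit_fit$par[3]))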

# demonstrate scobit with different alpha values
scobit_cdf = function(x, alpha) {
  1 / (1 + exp(-x))^alpha
}

alpha_values = c(0.2, 0.5, 1, 2, 3)
scobit_df = expand.grid(
  x = x_range,
  alpha = alpha_values
) %>%
  mutate(
    Probability = mapply(scobit_cdf, x, alpha),
    Alpha = factor(alpha, labels = paste("α =", alpha_values))
  )

scobit_plot = ggplot(scobit_df, aes(x = x, y = Probability, color = Alpha)) +
  geom_line(size = 1.2) +
  labs(title = "Scobit CDFs with Varying Shape Parameters",
       subtitle = "α = 1 corresponds to standard logit",
       x = "Linear Predictor (Xβ)",
       y = "Pr(Y = 1)") +
  theme_minimal() +
  geom_hline(yintercept = c(0.37, 0.5, 0.63), linetype = "dotted", alpha = 0.3) +
  annotate("text", x = -2.5, y = 0.39, label = "Cloglog max slope", size = 3) +
  annotate("text", x = -2.5, y = 0.52, label = "Logit max slope", size = 3) +
  annotate("text", x = -2.5, y = 0.65, label = "Log-log max slope", size = 3)

print(scobit_plot)

# demonstrate the identification challenge
# show how different parameter combinations can give similar curves
demo_x = seq(-3, 3, 0.1)

# case 1: α = 0.5, intercept = -1.10, slope = 1
case1_prob = scobit_cdf(-1.10 + demo_x * 1, 0.5)

# case 2: α = 2, intercept = 0.88, slope = 0.64
# (intercepts and slopes chosen so both curves pass through 0.5 at x = 0 with similar slopes)
case2_prob = scobit_cdf(0.88 + demo_x * 0.64, 2)

ident_df = data.frame(
  x = demo_x,
  Case1 = case1_prob,
  Case2 = case2_prob
)

ident_plot = ggplot(ident_df, aes(x = x)) +
  geom_line(aes(y = Case1, color = "α=0.5, β=(-1.10, 1)"), size = 1.2) +
  geom_line(aes(y = Case2, color = "α=2, β=(0.88, 0.64)"), size = 1.2, linetype = "dashed") +
  labs(title = "Identification Challenge in Scobit",
       subtitle = "Different parameter combinations can yield similar curves",
       x = "X (unstandardized)",
       y = "Pr(Y = 1)",
       color = "Parameters") +
  theme_minimal()

print(ident_plot)

2.5 Choosing Among Alternative Models

2.5.1 Decision Tree for Model Selection

  1. Is symmetry reasonable?

    • Yes → Logit or Probit
    • No → Continue to 2
  2. Which direction is the asymmetry?

    • Slow approach to 0, fast to 1 → Cloglog
    • Fast approach to 0, slow to 1 → Log-log
    • Uncertain → Scobit (if you have lots of data)
  3. Do you have a survival/duration interpretation?

    • Yes → Cloglog often natural
    • No → Consider theoretical justification
  4. Sample size and coverage?

    • Large with good tail coverage → Can estimate flexible models
    • Small or limited range → Stick with standard models

2.5.2 Empirical Testing Strategy

# fit multiple models for comparison
models = list(
  logit = glm(y ~ x, family = binomial(link = "logit"), data = data),
  probit = glm(y ~ x, family = binomial(link = "probit"), data = data),
  cloglog = glm(y ~ x, family = binomial(link = "cloglog"), data = data),
  # base R has no "loglog" link: flip the outcome and use cloglog instead
  # (the likelihood is identical to a log-log model for y, so AIC/BIC are comparable)
  loglog = glm(I(1 - y) ~ x, family = binomial(link = "cloglog"), data = data)
)

# compare using information criteria
sapply(models, AIC)
sapply(models, BIC)

# vuong test for non-nested models (if needed)
# likelihood ratio tests for nested models

2.5.3 The Practical Reality

Despite all these options, in practice:

  • 90% of papers use logit or probit
  • 8% use cloglog (mostly survival analysis)
  • 1% use log-log
  • <1% use scobit or other exotic links

Why? Because:

  1. Marginal effects are usually similar across reasonable models
  2. Software support varies dramatically
  3. Reviewer familiarity matters for publication
  4. Theoretical justification is often weak for exotic models

The bottom line: Use alternative link functions when you have a strong theoretical reason or when standard models clearly fail. Otherwise, the added complexity rarely justifies the minimal gain in fit.

3 Model interpretation and marginal effects

We’ve estimated our models and have a table full of coefficients. Now comes the crucial question: what do these numbers actually mean? This is where many analyses go wrong—researchers report coefficients without translating them into quantities that readers can understand.

3.1 Coefficient Interpretation

Binary response model coefficients don’t have a direct interpretation like OLS. Let’s see why with a comparison:

In OLS (Linear Regression):

  • Coefficient = 2.5
  • Interpretation: “A one-unit increase in democracy score increases aid by $2.5 million”
  • This is true everywhere, always

In Logit (Modeling conflict onset):

  • Coefficient = 2.5
  • Interpretation: “A one-unit increase in capability ratio increases the log-odds of war by 2.5”
  • But what are log-odds? Policymakers and the public have no intuition for this!

In Probit (Modeling voting behavior):

  • Coefficient = 2.5
  • Interpretation: “A one-unit increase in ideology increases the z-score by 2.5”
  • Again, not interpretable for most audiences

Instead:

  • Logit: Coefficients represent changes in log-odds
  • Probit: Coefficients represent changes in z-scores
  • All models: Sign indicates direction of effect on probability

The Fundamental Challenge

In OLS, a one-unit change in X always leads to a \(\beta\) change in Y. Simple. In binary response models, the effect of X on probability depends on where you start:

  • Moving from very low probability (0.01 to 0.02) is “easy”
  • Moving from medium probability (0.49 to 0.50) is moderate
  • Moving from high probability (0.98 to 0.99) is “hard”

This is why we can’t interpret coefficients directly – we need to know the baseline probability first.
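A tiny illustration with the fitted logit from the running example shows why the baseline matters (the 0.10/0.50/0.90 starting points are chosen arbitrarily):

# the same logit coefficient implies very different probability changes
b = coef(logit_model)                   # intercept and slope on the log-odds scale
exp(b["x"])                             # odds ratio for a one-unit increase in x
plogis(qlogis(0.10) + b["x"]) - 0.10    # change starting from Pr = 0.10
plogis(qlogis(0.50) + b["x"]) - 0.50    # change starting from Pr = 0.50
plogis(qlogis(0.90) + b["x"]) - 0.90    # change starting from Pr = 0.90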

3.2 Marginal Effects

The marginal effect of \(X\) on \(\Pr(Y = 1)\) depends on the current values of all variables:

\[\frac{\partial \Pr(Y_i = 1)}{\partial X_{ik}} = f(X_i\beta) \cdot \beta_k\]

where \(f(\cdot)\) is the PDF of the link function.

Breaking This Down:

  • \(\beta_k\) is the coefficient (constant across observations)
  • \(f(X_i\beta)\) is the “density weight” (varies across observations)
  • For logit: \(f(X_i\beta) = p_i(1-p_i)\) (maximized at p = 0.5)
  • For probit: \(f(X_i\beta) = \phi(X_i\beta)\) (standard normal PDF)
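As a quick check of this formula, the logit AME can be computed by hand from the fitted probabilities; up to numerical precision it should match the margins output shown later in this section:

# average marginal effect of x in the logit model, computed directly
p_hat = predict(logit_model, type = "response")
mean(coef(logit_model)["x"] * p_hat * (1 - p_hat))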

3.2.1 Three Types of Marginal Effects

1. Marginal Effects at the Mean (MEM)

  • Calculate marginal effect for a hypothetical “average” person
  • Set all X variables to their means, then calculate ME
  • Easy to report but represents no actual person

When to use: When you need a single summary number and your audience expects it

Problem: The “average state” might not exist. Example: If your sample is 50% democracies and 50% autocracies, the “average” country has regime type = 0.5, which is meaningless! Likewise, the “average” legislator would be half Republican and half Democrat.

2. Average Marginal Effects (AME) - The Observed Value Approach

  • Calculate ME for each actual observation at their observed values
  • Take the average across all observations
  • Most commonly reported in modern work
  • This is the observed value approach advocated by Hanmer & Kalkan (2013)

Why AME/Observed Values? As Hanmer & Kalkan (2013) argue, this approach:

  • Respects the actual distribution of your data
  • Avoids creating impossible “average” units
  • Better represents population heterogeneity
  • Accounts for the nonlinearity of the model across the full range of data
  • Gives you the average effect for real units, not the effect for an imaginary average unit

Key insight: In nonlinear models, the effect at the mean ≠ mean of the effects!

3. Marginal Effects at Representative Values (MER)

  • Calculate ME at specific, meaningful values
  • E.g., “effect for high-risk individuals” or “effect at median income”
  • Most useful for understanding heterogeneity
# calculate Average Marginal Effects (AME) - the observed value approach
margins_logit_ame = margins(logit_model)
summary(margins_logit_ame)
##  factor    AME     SE       z      p  lower  upper
##       x 0.1418 0.0060 23.7169 0.0000 0.1301 0.1535
# compare with Marginal Effects at Means (MEM) - the traditional approach
# First create a dataset with all variables at their means
mean_data = data.frame(x = mean(data$x))
margins_logit_mem = margins(logit_model, data = mean_data)
summary(margins_logit_mem)
##  factor    AME     SE       z      p  lower  upper
##       x 0.2149 0.0197 10.8903 0.0000 0.1762 0.2535
# Show the difference
cat("\nComparing approaches:\n")
## 
## Comparing approaches:
cat("AME (Observed Values):", summary(margins_logit_ame)$AME[1], "\n")
## AME (Observed Values): 0.1417677
cat("MEM (At Means):", summary(margins_logit_mem)$AME[1], "\n")
## MEM (At Means): 0.2148566
cat("Difference can be substantial when data is not centered around 0.5 probability!\n")
## Difference can be substantial when data is not centered around 0.5 probability!
# visualize how marginal effects vary with X
x_vals = seq(-3, 3, 0.1)
me_data = data.frame(x = x_vals)

# For logit: ME = β * p(1-p)
logit_coef = coef(logit_model)[2]
me_data$logit_prob = predict(logit_model, newdata = me_data, type = "response")
me_data$logit_me = logit_coef * me_data$logit_prob * (1 - me_data$logit_prob)

# For probit: ME = β * φ(Xβ)
probit_coef = coef(probit_model)[2]
probit_xb = predict(probit_model, newdata = me_data, type = "link")
me_data$probit_me = probit_coef * dnorm(probit_xb)

# For LPM: ME is constant
lpm_coef = coef(lpm_model)[2]
me_data$lpm_me = lpm_coef

me_plot = me_data %>%
  pivot_longer(cols = c(logit_me, probit_me, lpm_me), 
               names_to = "Model", 
               values_to = "MarginalEffect") %>%
  mutate(Model = str_remove(Model, "_me") %>% str_to_title()) %>%
  ggplot(aes(x = x, y = MarginalEffect, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Marginal Effects Across Different Models",
       subtitle = "Note how logit/probit effects vary with X, while LPM is constant",
       x = "Value of X",
       y = "Marginal Effect (∂Pr/∂X)") +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "dotted", alpha = 0.5)

print(me_plot)

# show the distribution of marginal effects in the actual data
actual_me_logit = margins(logit_model)
hist(actual_me_logit$dydx_x, 
     main = "Distribution of Marginal Effects in Sample",
     xlab = "Individual Marginal Effect",
     col = "lightblue")
abline(v = mean(actual_me_logit$dydx_x), col = "red", lwd = 2)
text(mean(actual_me_logit$dydx_x), par("usr")[4]*0.9, 
     "AME", pos = 4, col = "red")

3.2.2 Interpreting Interaction Effects (The Hidden Complexity)

In linear models, the interaction effect is just the coefficient on the interaction term. In nonlinear models, it’s much more complex.

The Ai-Norton Critique: The marginal effect of an interaction involves:

  1. The direct effect of the interaction term
  2. How the interaction changes the marginal effect of X1
  3. How the interaction changes the marginal effect of X2
  4. All evaluated at specific values of the variables

For logit with interaction \(Y = \Lambda(\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2)\):

\[\frac{\partial^2 \Pr(Y=1)}{\partial X_1 \partial X_2} = \beta_3\,\lambda(\cdot) + (\beta_1 + \beta_3X_2)(\beta_2 + \beta_3X_1)\,\lambda'(\cdot)\]

where \(\lambda(\cdot)\) is the logistic density and \(\lambda'(\cdot)\) its derivative (this is the Ai-Norton correction).

This can even have a different sign than \(\beta_3\)!

Practical Advice:

  • Don’t interpret interaction coefficients directly
  • Calculate marginal effects at different values
  • Use plots to show how effects vary
  • Consider using LPM for transparent interactions
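The sketch below illustrates this advice with simulated data (x1, x2, and the interaction model are invented for this example, not part of the running example): it computes the second difference in predicted probabilities, the discrete analogue of the cross-partial, at low and high values of x2.

# simulated example: logit with an interaction term
set.seed(99)
n_int = 2000
x1 = rnorm(n_int); x2 = rnorm(n_int)
p_int = plogis(-0.5 + 0.8 * x1 + 0.4 * x2 - 0.6 * x1 * x2)
y_int = rbinom(n_int, 1, p_int)
int_model = glm(y_int ~ x1 * x2, family = binomial())

# second difference: how the effect of moving x1 from 0 to 1 changes with x2
pp = function(a, b) predict(int_model, newdata = data.frame(x1 = a, x2 = b),
                            type = "response")
effect_x1_at = function(b) pp(1, b) - pp(0, b)
c(effect_at_low_x2  = effect_x1_at(-1),
  effect_at_high_x2 = effect_x1_at(1),
  second_difference = effect_x1_at(1) - effect_x1_at(-1))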

4 Model identification and assumptions

4.1 The Identification Problem

What is “identification” and why should you care?

Identification means being able to uniquely determine the parameters of your model from the data. Without it, multiple different parameter sets could explain your data equally well—obviously a problem!

For latent variable models, we cannot separately identify:

  • The location of \(Y^*\) (where is zero?)

    • Example: Is the threshold for war at “zero grievance” or “moderate tension”?
  • The scale of \(Y^*\) (what’s the unit?)

    • Example: Is the difference between allies and rivals one unit or ten units of “hostility”?
  • The variance of errors

    • Example: How much unexplained variation is there in voting behavior?

The Analogy: It’s like measuring temperature. We can’t identify:

  • Absolute zero separately from scale (Celsius vs Kelvin)
  • Degrees separately from unit size (Celsius vs Fahrenheit)
  • We can only identify ratios and differences

4.2 Standard Normalizations

To achieve identification, we fix:

  1. Threshold: \(Y = 1\) when \(Y^* > 0\) (absorbed into intercept)
  2. Error mean: \(E(u_i|X) = 0\) (absorbed into intercept)
  3. Error variance:
    • Logit: \(\text{Var}(u_i) = \pi^2/3 \approx 3.29\)
    • Probit: \(\text{Var}(u_i) = 1\)

Why These Specific Values?

  • Probit’s 1: Mathematical convenience (standard normal)
  • Logit’s π²/3: Not arbitrary! It’s the variance of the standard logistic distribution
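A quick simulation check (purely illustrative) confirms that value:

# variance of the standard logistic distribution is pi^2 / 3 ≈ 3.29
set.seed(123)
var(rlogis(1e6))
pi^2 / 3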

4.3 Scale Invariance: Why It (Mostly) Doesn’t Matter

# demonstrate scale invariance of probabilities
# original probit model
probit1 = glm(y ~ x, data = data, family = binomial(link = "probit"))

# create scaled version (multiply coefficients by 2)
original_coef = coef(probit1)
scaled_coef = original_coef * 2

# predictions at original scale
x_test = c(-2, -1, 0, 1, 2)
original_xb = original_coef[1] + original_coef[2] * x_test
original_prob = pnorm(original_xb)

# predictions at scaled version
scaled_xb = scaled_coef[1] + scaled_coef[2] * x_test
# to get same probabilities, need variance = 4, so sd = 2
scaled_prob = pnorm(scaled_xb / 2)

comparison_df = data.frame(
  X = x_test,
  Original = original_prob,
  Scaled_Doubled = scaled_prob,
  Difference = abs(original_prob - scaled_prob)
)

print("Demonstrating Scale Invariance of Probabilities:")
## [1] "Demonstrating Scale Invariance of Probabilities:"
print(comparison_df)
##    X  Original Scaled_Doubled Difference
## 1 -2 0.2097422      0.2097422          0
## 2 -1 0.3875535      0.3875535          0
## 3  0 0.5932502      0.5932502          0
## 4  1 0.7756334      0.7756334          0
## 5  2 0.8995768      0.8995768          0
# show what this means for comparing models
cat("\nWhat this means for practice:\n")
## 
## What this means for practice:
cat("1. Can't compare raw coefficients across:\n")
## 1. Can't compare raw coefficients across:
cat("   - Different models (logit vs probit)\n")
##    - Different models (logit vs probit)
cat("   - Different samples (unless variance is constant)\n")
##    - Different samples (unless variance is constant)
cat("   - Models with different covariates\n")
##    - Models with different covariates
cat("2. CAN compare:\n")
## 2. CAN compare:
cat("   - Predicted probabilities\n")
##    - Predicted probabilities
cat("   - Marginal effects\n")
##    - Marginal effects
cat("   - Ratios of coefficients within same model\n")
##    - Ratios of coefficients within same model

4.4 The Heteroscedasticity Problem We Don’t Talk About

In regular regression, heteroscedasticity means the variance of errors changes across observations. It’s annoying but not catastrophic—you can use robust standard errors and move on.

In binary response models, it’s much more serious.

Unlike linear models where heteroscedasticity “just” affects standard errors, in binary models it affects everything:

If \(\text{Var}(u_i|X_i) = \sigma_i^2\) varies across observations:

  • Coefficients are biased (not just inefficient)
  • Maximum likelihood is inconsistent
  • Standard link functions are wrong

Solutions:

  1. Heteroscedastic probit: Model \(\sigma_i = \exp(Z_i\gamma)\)
  2. Robust methods: Klein & Spady semiparametric estimator
  3. Hope it’s not too bad: Most common approach
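A minimal maximum-likelihood sketch of solution 1 on the running example follows; here the same x is used, purely for illustration, as both the mean and the variance covariate, and the variance equation has no constant (that normalization is what identifies the scale). Dedicated implementations (e.g. hetglm() in the glmx package) are a better choice in practice.

# heteroscedastic probit: Pr(Y = 1) = Phi(Xb / exp(Zg)), with Z = x here
hetprobit_nll = function(par, y, x) {
  b0 = par[1]; b1 = par[2]; g1 = par[3]
  p = pnorm((b0 + b1 * x) / exp(g1 * x))
  p = pmin(pmax(p, 1e-10), 1 - 1e-10)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

het_fit = optim(c(coef(probit_model), 0), hetprobit_nll,
                y = data$y, x = data$x, method = "BFGS", hessian = TRUE)
het_fit$par   # mean-equation coefficients and the variance-equation coefficient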

5 Model selection and goodness of fit

You’ve estimated several models — logit, probit, maybe cloglog. How do you choose? More importantly, how do you know if any of them fit well?

This section will give you practical tools for model selection and evaluation. But first, a warning: don’t get too obsessed with finding the “best” model by statistical criteria. Often, the choice should be driven by:

  • Theoretical considerations
  • Interpretational needs
  • Disciplinary conventions
  • Robustness across specifications

5.1 Comparing Models: Beyond Logit vs Probit

# compare predictions
pred_comparison = data.frame(
  logit = predict(logit_model, type = "response"),
  probit = predict(probit_model, type = "response")
)

cor_pred = cor(pred_comparison$logit, pred_comparison$probit)

ggplot(pred_comparison, aes(x = logit, y = probit)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(title = "Logit vs Probit Predictions",
       subtitle = paste("Correlation:", round(cor_pred, 4)),
       x = "Logit Predicted Probability",
       y = "Probit Predicted Probability") +
  theme_minimal()

# compare coefficients (need to scale for comparison)
# ratio of error SDs is pi/sqrt(3) ≈ 1.81; the common 1.6 rule of thumb instead
# matches the two curves near the center rather than matching the error variances
scale_factor = sqrt(pi^2/3)
scaled_comparison = data.frame(
  Variable = names(coef(logit_model)),
  Logit = coef(logit_model),
  Probit = coef(probit_model),
  Ratio = coef(logit_model) / coef(probit_model),
  Expected_Ratio = scale_factor
)

print("Coefficient Comparison:")
## [1] "Coefficient Comparison:"
print(scaled_comparison)
##                Variable     Logit    Probit    Ratio Expected_Ratio
## (Intercept) (Intercept) 0.4011140 0.2359139 1.700256       1.813799
## x                     x 0.8811257 0.5216150 1.689226       1.813799
cat("\nNote: Logit/Probit ratio should be approximately", round(scale_factor, 2), "\n")
## 
## Note: Logit/Probit ratio should be approximately 1.81

5.2 Model Evaluation Metrics

5.2.1 The Unique Challenges of Binary Outcomes

Unlike continuous outcomes where R² tells a clear story, binary models have multiple, non-equivalent measures of fit:

1. Pseudo-R² Measures (None are true R²!)

  • McFadden’s: \(1 - \ln L / \ln L_0\) (most common)
  • Efron’s: Squared correlation between y and ŷ
  • McKelvey-Zavoina: R² of latent variable (unobserved!)

2. Classification Measures

  • Accuracy: (TP + TN) / N
  • Sensitivity: TP / (TP + FN)
  • Specificity: TN / (TN + FP)
  • Precision: TP / (TP + FP)

3. Probability Measures

  • AUC: Area under ROC curve
  • Brier Score: Mean squared difference between probability and outcome
  • Log Score: Negative log-likelihood
# function to calculate various metrics
evaluate_model = function(model, data, model_name) {
  # Handle LPM differently (it's lm, not glm)
  if(class(model)[1] == "lm") {
    pred_prob = fitted(model)
    # Truncate LPM predictions to [0,1] for fair comparison
    pred_prob = pmax(0, pmin(1, pred_prob))
  } else {
    pred_prob = predict(model, type = "response")
  }
  
  pred_class = ifelse(pred_prob > 0.5, 1, 0)
  actual = data$y
  
  # Confusion matrix metrics
  conf_matrix = table(Predicted = pred_class, Actual = actual)
  accuracy = sum(diag(conf_matrix)) / sum(conf_matrix)
  
  # Sensitivity and Specificity
  sensitivity = conf_matrix[2,2] / sum(conf_matrix[,2])
  specificity = conf_matrix[1,1] / sum(conf_matrix[,1])
  
  # Log-likelihood and information criteria
  if(class(model)[1] == "lm") {
    # For LPM, calculate likelihood assuming normal errors
    n = length(actual)
    rss = sum((actual - pred_prob)^2)
    loglik = -n/2 * (log(2*pi) + log(rss/n) + 1)
    aic = -2*loglik + 2*length(coef(model))
    bic = -2*loglik + log(n)*length(coef(model))
  } else {
    loglik = as.numeric(logLik(model))
    aic = AIC(model)
    bic = BIC(model)
  }
  
  # McFadden's R-squared (not meaningful for LPM)
  if(class(model)[1] != "lm") {
    null_model = glm(y ~ 1, data = data, family = binomial())
    mcfadden_r2 = 1 - (logLik(model) / logLik(null_model))
  } else {
    mcfadden_r2 = NA
  }
  
  # ROC and AUC
  roc_obj = roc(actual, pred_prob, quiet = TRUE)
  auc_val = auc(roc_obj)
  
  # Brier Score
  brier = mean((pred_prob - actual)^2)
  
  return(data.frame(
    Model = model_name,
    Accuracy = round(accuracy, 3),
    Sensitivity = round(sensitivity, 3),
    Specificity = round(specificity, 3),
    AUC = round(as.numeric(auc_val), 3),
    Brier = round(brier, 4),
    McFadden_R2 = ifelse(is.na(mcfadden_r2), NA, round(as.numeric(mcfadden_r2), 3))
  ))
}

# evaluate all models
evaluation_results = rbind(
  evaluate_model(lpm_model, data, "LPM"),
  evaluate_model(logit_model, data, "Logit"),
  evaluate_model(probit_model, data, "Probit"),
  evaluate_model(cloglog_model, data, "Cloglog")
)

print("Model Comparison:")
## [1] "Model Comparison:"
print(evaluation_results)
##     Model Accuracy Sensitivity Specificity  AUC  Brier McFadden_R2
## 1     LPM    0.758       0.769       0.744 0.84 0.1642          NA
## 2   Logit    0.758       0.777       0.736 0.84 0.1624       0.293
## 3  Probit    0.758       0.773       0.740 0.84 0.1623       0.295
## 4 Cloglog    0.756       0.755       0.758 0.84 0.1634       0.291
# roc curves
par(mfrow = c(1, 2))

# roc plot
roc_logit = roc(data$y, predict(logit_model, type = "response"), quiet = TRUE)
roc_probit = roc(data$y, predict(probit_model, type = "response"), quiet = TRUE)
roc_cloglog = roc(data$y, predict(cloglog_model, type = "response"), quiet = TRUE)
roc_lpm = roc(data$y, fitted(lpm_model), quiet = TRUE)

plot(roc_logit, main = "ROC Curves", col = "red")
plot(roc_probit, add = TRUE, col = "blue")
plot(roc_cloglog, add = TRUE, col = "green")
plot(roc_lpm, add = TRUE, col = "purple", lty = 2)
legend("bottomright", 
       legend = c("Logit", "Probit", "Cloglog", "LPM"),
       col = c("red", "blue", "green", "purple"),
       lty = c(1, 1, 1, 2),
       cex = 0.8)

# calibration plot
logit_probs = predict(logit_model, type = "response")
calib_df = data.frame(
  pred = logit_probs,
  actual = data$y
)
calib_df$bin = cut(calib_df$pred, breaks = seq(0, 1, 0.1))
calib_summary = calib_df %>%
  group_by(bin) %>%
  summarise(
    mean_pred = mean(pred),
    mean_actual = mean(actual),
    n = n()
  ) %>%
  filter(!is.na(bin))

plot(calib_summary$mean_pred, calib_summary$mean_actual,
     main = "Calibration Plot (Logit)",
     xlab = "Mean Predicted Probability",
     ylab = "Actual Proportion",
     xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, col = "red", lty = 2)

5.2.2 Which Metric Should You Use?

With all these metrics, you might feel overwhelmed. Here’s a practical guide:

For Model Selection:

  • AIC/BIC for models fit to the same outcome (nested or not)
    • Use when: Comparing models with different numbers of parameters
    • Lower is better: But differences < 2 are negligible
  • Cross-validated log-likelihood for prediction
    • Use when: You care about out-of-sample prediction
    • Best practice: Always do this for prediction tasks
  • Vuong test for non-nested models
    • Use when: Comparing logit vs. cloglog, for example
    • Interpretation: Tests which model is closer to the truth

For Reporting Fit:

  • AUC for discrimination ability
  • Calibration plots for probability accuracy

For Practical Applications:

  • Depends on costs of false positives vs false negatives
  • Consider decision thresholds other than 0.5
  • Use context-specific loss functions
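For instance, if a false negative were judged four times as costly as a false positive (a cost ratio made up purely for illustration), a quick scan over candidate cutoffs with the logit fit from earlier shows which threshold minimizes expected cost:

# expected-cost scan over classification thresholds (illustrative 4:1 cost ratio)
pred = predict(logit_model, type = "response")
thresholds = seq(0.1, 0.9, by = 0.05)
cost = sapply(thresholds, function(t) {
  fp = sum(pred >  t & data$y == 0)
  fn = sum(pred <= t & data$y == 1)
  1 * fp + 4 * fn
})
thresholds[which.min(cost)]   # cost-minimizing cutoff under these assumed costs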

6 Practical considerations and recommendations

6.1 Common Pitfalls and Solutions

6.1.1 1. Perfect Separation

What it is: One or more variables perfectly predict your outcome. Sounds good, right? Wrong!

Political Science Example:

  • All NATO members voted “yes” on a resolution, all non-NATO voted “no”
  • All democracies with GDP > $40,000 have welfare states
  • No country with nuclear weapons has been invaded

Perfect separation occurs when a predictor (or combination) perfectly predicts the outcome.

# create data with perfect separation
set.seed(456)
x_sep = c(rnorm(50, -2, 0.5), rnorm(50, 2, 0.5))
y_sep = c(rep(0, 50), rep(1, 50))
data_sep = data.frame(x = x_sep, y = y_sep)

# try to fit logit (will show warning)
logit_sep = glm(y ~ x, data = data_sep, family = binomial())

ggplot(data_sep, aes(x = x, y = y)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Perfect Separation Example",
       subtitle = "All y=0 have x<0, all y=1 have x>0",
       x = "X", y = "Y") +
  theme_minimal()

print("Model with perfect separation (note inflated standard errors):")
## [1] "Model with perfect separation (note inflated standard errors):"
summary(logit_sep)
## 
## Call:
## glm(formula = y ~ x, family = binomial(), data = data_sep)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -0.2234 18512.2327   0.000    1.000
## x              24.6958 21192.3695   0.001    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3863e+02  on 99  degrees of freedom
## Residual deviance: 2.1510e-09  on 98  degrees of freedom
## AIC: 4
## 
## Number of Fisher Scoring iterations: 25
cat("\nSolutions for perfect separation:\n")
## 
## Solutions for perfect separation:
cat("1. Penalized likelihood (Firth's method)\n")
## 1. Penalized likelihood (Firth's method)
cat("2. Bayesian methods with informative priors\n")
## 2. Bayesian methods with informative priors
cat("3. Exact logistic regression\n")
## 3. Exact logistic regression
cat("4. Remove problematic predictors\n")
## 4. Remove problematic predictors
cat("5. Combine categories\n")
## 5. Combine categories
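As a sketch of solution 1, assuming the logistf package is installed (it implements Firth's penalized likelihood), the separated data above can be refit with finite coefficient estimates and usable confidence intervals:

# Firth's penalized-likelihood logit handles the separated data gracefully
library(logistf)
firth_sep = logistf(y ~ x, data = data_sep)
summary(firth_sep)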

6.1.2 2. Rare Events Bias

When outcomes are rare (< 5% or > 95%), standard methods can be biased.

Common in Political Science:

  • Interstate wars (< 1% of dyad-years)
  • Coups in established democracies (very rare)
  • Third-party victories in US elections (extremely rare)
  • International terrorist attacks (rare for most countries)
# create imbalanced data
set.seed(789)
n_rare = 500
x_rare = rnorm(n_rare)
p_rare = plogis(-3 + 0.5 * x_rare)  # Very low baseline probability
y_rare = rbinom(n_rare, 1, p_rare)

cat("Proportion of 1s:", mean(y_rare), "\n\n")
## Proportion of 1s: 0.044
# standard logit
logit_rare = glm(y ~ x, data = data.frame(x = x_rare, y = y_rare), 
                   family = binomial())

# cloglog might be better for rare events
cloglog_rare = glm(y ~ x, data = data.frame(x = x_rare, y = y_rare), 
                     family = binomial(link = "cloglog"))

# compare the models
rare_comparison = data.frame(
  Model = c("Logit", "Cloglog"),
  Intercept = c(coef(logit_rare)[1], coef(cloglog_rare)[1]),
  Slope = c(coef(logit_rare)[2], coef(cloglog_rare)[2]),
  AIC = c(AIC(logit_rare), AIC(cloglog_rare))
)

print("Rare Events - Model Comparison:")
## [1] "Rare Events - Model Comparison:"
print(rare_comparison)
##     Model Intercept     Slope      AIC
## 1   Logit -3.167135 0.4481748 180.5006
## 2 Cloglog -3.190186 0.4402903 180.4725
cat("\nSolutions for rare events:\n")
## 
## Solutions for rare events:
cat("1. King & Zeng bias correction\n")
## 1. King & Zeng bias correction
cat("2. Penalized likelihood\n")
## 2. Penalized likelihood
cat("3. Cloglog or log-log links\n")
## 3. Cloglog or log-log links
cat("4. Choice-based sampling with weights\n")
## 4. Choice-based sampling with weights

6.2 Best Practices Checklist

Before Modeling:

  • Plot Y against each X
  • Check for separation
  • Examine outcome distribution (rare events?)
  • Consider theoretical link function

During Modeling:

  • Try multiple link functions
  • Check convergence warnings
  • Examine residuals (deviance, Pearson)
  • Test for influential observations

After Modeling:

  • Report marginal effects, not just coefficients
  • Use AME (average marginal effects) as default (the observed value approach)
  • Only use MEM (marginal effects at means) if specifically justified
  • Show predicted probabilities for meaningful scenarios
  • Validate out-of-sample
  • Conduct sensitivity analyses

6.3 Reporting Results

# create a publication-ready table
library(stargazer)

# For demonstration (would normally output to LaTeX or HTML)
stargazer(lpm_model, logit_model, probit_model,
          type = "text",
          title = "Comparison of Binary Response Models",
          dep.var.labels = "Binary Outcome",
          covariate.labels = c("X Variable", "Constant"),
          keep.stat = c("n", "rsq", "adj.rsq", "ll", "aic"),
          star.cutoffs = c(0.05, 0.01, 0.001))
## 
## Comparison of Binary Response Models
## =================================================
##                         Dependent variable:      
##                   -------------------------------
##                           Binary Outcome         
##                      OLS      logistic   probit  
##                      (1)        (2)        (3)   
## -------------------------------------------------
## X Variable         0.140***   0.881***  0.522*** 
##                    (0.009)    (0.082)    (0.044) 
##                                                  
## Constant           0.559***   0.401***  0.236*** 
##                    (0.018)    (0.115)    (0.067) 
##                                                  
## -------------------------------------------------
## Observations         500        500        500   
## R2                  0.327                        
## Adjusted R2         0.326                        
## Log Likelihood                -243.424  -242.875 
## Akaike Inf. Crit.             490.849    489.749 
## =================================================
## Note:               *p<0.05; **p<0.01; ***p<0.001
# Substantive effects plot
x_range_sub = seq(min(data$x), max(data$x), length.out = 100)
pred_data_sub = data.frame(x = x_range_sub)

# get predictions with confidence intervals
pred_logit = predict(logit_model, newdata = pred_data_sub, 
                      type = "response", se.fit = TRUE)
pred_data_sub$fit = pred_logit$fit
pred_data_sub$lower = pred_logit$fit - 1.96 * pred_logit$se.fit
pred_data_sub$upper = pred_logit$fit + 1.96 * pred_logit$se.fit

substantive_plot = ggplot(pred_data_sub, aes(x = x)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "blue") +
  geom_line(aes(y = fit), color = "blue", size = 1.2) +
  geom_rug(data = filter(data, y == 0), aes(x = x), sides = "b", alpha = 0.3) +
  geom_rug(data = filter(data, y == 1), aes(x = x), sides = "t", alpha = 0.3) +
  labs(title = "Predicted Probabilities with 95% Confidence Interval",
       subtitle = "Rug plots show distribution of observations",
       x = "Independent Variable",
       y = "Pr(Y = 1)") +
  theme_minimal()

print(substantive_plot)

# Create first differences plot (effect of moving X from low to high)
x_low = quantile(data$x, 0.25)
x_high = quantile(data$x, 0.75)

pred_low = predict(logit_model, newdata = data.frame(x = x_low), 
                    type = "response", se.fit = TRUE)
pred_high = predict(logit_model, newdata = data.frame(x = x_high), 
                     type = "response", se.fit = TRUE)

first_diff = pred_high$fit - pred_low$fit
first_diff_se = sqrt(pred_high$se.fit^2 + pred_low$se.fit^2)

cat("\n\nFirst Difference (moving X from 25th to 75th percentile):\n")
## 
## 
## First Difference (moving X from 25th to 75th percentile):
cat(sprintf("Change in probability: %.3f (SE = %.3f)\n", first_diff, first_diff_se))
## Change in probability: 0.533 (SE = 0.039)
cat(sprintf("95%% CI: [%.3f, %.3f]\n", 
            first_diff - 1.96*first_diff_se, 
            first_diff + 1.96*first_diff_se))
## 95% CI: [0.456, 0.610]

6.4 Writing About Your Results

Bad: “The logit coefficient on education is 0.234 (p < 0.01).”

Better: “Each additional year of education increases the log-odds of voting by 0.234.”

Best: “Each additional year of education increases the probability of voting by 4.2 percentage points for an individual with average characteristics (95% CI: 2.1 to 6.3 percentage points). The effect is larger for those with low baseline probability (6.5 pp) than high baseline probability (2.1 pp).”

Key Points for Writing:

  1. Always translate to the probability scale (odds ratios may be familiar in epidemiology, but most social science audiences think in probabilities)
  2. Report uncertainty (standard errors or confidence intervals)
  3. Specify at what values effects are calculated
  4. Show heterogeneity when relevant
  5. Use graphics to communicate complex patterns

7 The Linear Probability Model (LPM)

7.1 Theoretical Foundation

The Linear Probability Model represents the simplest approach to modeling binary outcomes. We begin with the general linear regression framework where we model the expected value of \(Y\) as a linear function of independent variables \(X\):

\[E(Y) = X\beta\]

For a binary response variable, the expected value has a special interpretation:

\[E(Y) = 1 \cdot \Pr(Y = 1) + 0 \cdot \Pr(Y = 0) = \Pr(Y = 1)\]

Therefore, in the LPM, we model:

\[\Pr(Y_i = 1) = X_i\beta\]

This amounts to fitting a standard OLS regression to a binary response variable.

7.2 Implementation and Visualization

# show the lpm model that was fit earlier
summary(lpm_model)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86546 -0.33755  0.01616  0.34236  0.88700 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.559460   0.018316   30.54   <2e-16 ***
## x           0.140122   0.008998   15.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4091 on 498 degrees of freedom
## Multiple R-squared:  0.3275, Adjusted R-squared:  0.3261 
## F-statistic: 242.5 on 1 and 498 DF,  p-value: < 2.2e-16
# Visualization
ggplot(data, aes(x = x, y = y)) +
  geom_point(alpha = 0.3, position = position_jitter(height = 0.02)) +
  geom_line(data = pred_data, aes(x = x, y = lpm_pred), 
            color = "blue", size = 1.2) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", alpha = 0.5) +
  labs(title = "Linear Probability Model",
       subtitle = "Notice predictions can exceed [0,1] bounds",
       x = "Independent Variable (X)",
       y = "Probability of Y = 1") +
  theme_minimal() +
  annotate("text", x = -3, y = -0.15, label = "Problematic\npredictions < 0", 
           color = "red", size = 3) +
  annotate("text", x = 3, y = 1.15, label = "Problematic\npredictions > 1", 
           color = "red", size = 3)


7.3 Problems with the Linear Probability Model

While the LPM is computationally simple and provides easily interpretable coefficients (as direct effects on probability), there are some issues worth discussing:

7.3.1 Heteroscedasticity

The heteroscedasticity problem in the Linear Probability Model is both mechanical and substantive, arising directly from the binary nature of the dependent variable. It is not catastrophic, however: the useful distinction is between the cases where it poses a genuine threat to inference and the cases where it can be safely managed.

7.3.1.1 The Mathematical Structure of LPM Heteroscedasticity

For a binary variable \(Y_i \in \{0,1\}\) following a Bernoulli distribution with probability \(\pi_i = \Pr(Y_i = 1)\), the variance has a specific functional form:

\[\text{Var}(Y_i | X_i) = \pi_i(1 - \pi_i) = E(Y_i|X_i)[1 - E(Y_i|X_i)]\]

In the LPM, we model \(E(Y_i|X_i) = X_i\beta\), which means:

\[\text{Var}(Y_i | X_i) = X_i\beta(1 - X_i\beta)\]

This creates a precise, known form of heteroscedasticity where:

  • Variance is maximized when \(X_i\beta = 0.5\) (where \(\text{Var}(Y_i) = 0.25\))
  • Variance approaches zero as \(X_i\beta\) approaches 0 or 1
  • The variance function is quadratic in the linear predictor

This violates the classical OLS assumption that \(\text{Var}(\epsilon_i | X_i) = \sigma^2\) for all \(i\), where the error variance is constant across observations.

7.3.1.2 Why This Heteroscedasticity Matters (and When It Doesn’t)

The consequences of this heteroscedasticity have been debated extensively in recent literature:

Statistical Consequences:

  1. Coefficient Estimates Remain Consistent: Despite heteroscedasticity, OLS coefficients remain unbiased and consistent. This is because \(E(\epsilon_i | X_i) = 0\) still holds.

  2. Standard Errors Are Incorrect: Classical OLS standard errors are inconsistent, typically underestimating the true variance when predicted probabilities are near 0.5 and overestimating when near the boundaries.

  3. Efficiency Loss: OLS is no longer the Best Linear Unbiased Estimator (BLUE); Weighted Least Squares would be more efficient if we knew the true variance function.

Beck (2020) argues that for many practical applications, especially when predicted probabilities stay within [0.2, 0.8], the heteroscedasticity in LPM is manageable with robust standard errors. Gomila (2021) demonstrates through extensive simulations that LPM with robust standard errors often performs comparably to logistic regression for causal inference, particularly in experimental settings.

The modern consensus, reflected in current practice, suggests:

  • Always use heteroscedasticity-robust standard errors (HC1, HC2, or HC3)
  • Consider clustering if appropriate for the research design
  • Be cautious when many predictions fall near 0 or 1
# show heteroscedasticity structure in lpm
lpm_residuals = residuals(lpm_model)
lpm_fitted = fitted(lpm_model)

# calculate theoretical variance
theoretical_var = lpm_fitted * (1 - lpm_fitted)

# Create comprehensive heteroscedasticity visualization
het_data = data.frame(
  fitted = lpm_fitted,
  residuals = lpm_residuals,
  squared_residuals = lpm_residuals^2,
  theoretical_var = theoretical_var
)

# Main residual plot
p1 = ggplot(het_data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = TRUE, color = "blue") +
  labs(title = "Residual Plot for LPM",
       subtitle = "Variance changes systematically with fitted values",
       x = "Fitted Probability",
       y = "Residuals") +
  theme_minimal()

# Squared residuals vs fitted to show variance pattern
p2 = ggplot(het_data, aes(x = fitted)) +
  geom_point(aes(y = squared_residuals), alpha = 0.3, color = "gray40") +
  geom_line(aes(y = theoretical_var), color = "red", size = 1.2) +
  labs(title = "Variance Structure in LPM",
       subtitle = "Gray points: squared residuals; Red line: theoretical variance",
       x = "Fitted Probability",
       y = "Variance") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 1)

# Demonstrate impact on inference
# Standard OLS standard errors
ols_se = summary(lpm_model)$coefficients[, "Std. Error"]

# Robust standard errors (HC1)
library(sandwich)
robust_se = sqrt(diag(vcovHC(lpm_model, type = "HC1")))

# compare the models standard errors
se_comparison = data.frame(
  Variable = names(coef(lpm_model)),
  OLS_SE = ols_se,
  Robust_SE = robust_se,
  Ratio = robust_se / ols_se
)

print("Comparison of Standard Errors:")
## [1] "Comparison of Standard Errors:"
print(se_comparison)
##                Variable      OLS_SE   Robust_SE     Ratio
## (Intercept) (Intercept) 0.018316401 0.018232059 0.9953953
## x                     x 0.008997758 0.006500339 0.7224399
print(paste("Average ratio of robust to OLS standard errors:", 
            round(mean(se_comparison$Ratio), 3)))
## [1] "Average ratio of robust to OLS standard errors: 0.859"

7.3.1.3 Practical Implications for Applied Research

The heteroscedasticity in LPM has several practical implications:

  1. Hypothesis Testing: Use robust standard errors by default. In R: lmtest::coeftest(model, vcov = sandwich::vcovHC(model, type = "HC1"))

  2. Confidence Intervals: Construct using robust standard errors to ensure proper coverage

  3. Model Selection: When comparing models, be aware that information criteria (AIC, BIC) assume homoscedasticity

  4. Predictive Intervals: Standard prediction intervals will be incorrect; the variance should vary with the predicted probability

  5. Weighted Least Squares Alternative: If pursuing efficiency, use WLS with weights \(w_i = 1/[\hat{\pi}_i(1-\hat{\pi}_i)]\), though this requires iterative estimation and can be unstable near boundaries (a minimal two-step sketch appears at the end of this subsection)

The bottom line: heteroscedasticity in LPM is a known, manageable issue that rarely invalidates the substantive conclusions when properly addressed through robust inference procedures.
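To make the WLS alternative in point 5 concrete, here is a minimal two-step sketch on the running example (the 0.01/0.99 trimming of the fitted probabilities is an ad hoc guard against boundary instability):

# two-step weighted least squares for the LPM
pi_hat = fitted(lpm_model)
pi_hat = pmin(pmax(pi_hat, 0.01), 0.99)   # keep weights finite near 0 and 1
w = 1 / (pi_hat * (1 - pi_hat))
wls_model = lm(y ~ x, data = data, weights = w)
cbind(OLS = coef(lpm_model), WLS = coef(wls_model))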

7.3.2 Non-Normal Errors

The non-normality of errors in the Linear Probability Model is a fundamental consequence of the binary nature of the dependent variable. This issue has important implications for finite-sample inference, though its practical impact depends heavily on sample size and the specific inferential goals of the analysis.

7.3.2.1 The Discrete Error Structure

In the LPM, we model \(Y_i = X_i\beta + \epsilon_i\) where \(Y_i \in \{0,1\}\). The error term \(\epsilon_i\) inherits a peculiar structure from this setup. Given that the model predicts \(\hat{Y}_i = X_i\hat{\beta}\), the residuals can only take two possible values for any given observation:

  • When \(Y_i = 1\): \(\epsilon_i = Y_i - X_i\hat{\beta} = 1 - X_i\hat{\beta}\)
  • When \(Y_i = 0\): \(\epsilon_i = Y_i - X_i\hat{\beta} = 0 - X_i\hat{\beta} = -X_i\hat{\beta}\)

This creates a discrete, two-point distribution for each observation’s error term. The probability mass function of \(\epsilon_i\) conditional on \(X_i\) is:

\[P(\epsilon_i = 1 - X_i\beta | X_i) = \pi_i = X_i\beta\] \[P(\epsilon_i = -X_i\beta | X_i) = 1 - \pi_i = 1 - X_i\beta\]

We can verify that \(E(\epsilon_i | X_i) = 0\):

\[E(\epsilon_i | X_i) = (1 - X_i\beta) \cdot X_i\beta + (-X_i\beta) \cdot (1 - X_i\beta) = X_i\beta - (X_i\beta)^2 - X_i\beta + (X_i\beta)^2 = 0\]

7.3.2.2 The Distribution of LPM Errors

The error distribution has a few characteristics:

  1. Bimodality: For any given \(X_i\), the error distribution is bimodal with two point masses. This is fundamentally different from the continuous, unimodal normal distribution.

  2. Skewness: Unless \(X_i\beta = 0.5\), the error distribution is skewed. When \(X_i\beta < 0.5\), the distribution is right-skewed; when \(X_i\beta > 0.5\), it’s left-skewed.

  3. Bounded Support: Errors are bounded: \(-X_i\beta \leq \epsilon_i \leq 1 - X_i\beta\). This boundedness becomes more severe as predictions approach 0 or 1.

  4. Kurtosis: The distribution exhibits excess kurtosis (leptokurtic), with the degree depending on \(X_i\beta\).

7.3.2.3 Implications for Statistical Inference

The violation of the normality assumption has several consequences for statistical inference:

Finite-Sample Issues:

  1. Exact Tests Invalid: Classical t-tests and F-tests rely on the normality of errors for their exact finite-sample distributions. With non-normal errors, the test statistics don’t follow t or F distributions exactly.

  2. Confidence Interval Coverage: Finite-sample confidence intervals based on normal critical values may have incorrect coverage probabilities, particularly in small samples.

  3. Prediction Intervals: Standard prediction intervals, which assume normality, are inappropriate for LPM.

Asymptotic Theory to the Rescue:

Despite these finite-sample issues, several factors mitigate the problem:

  1. Central Limit Theorem: Under standard regularity conditions, the OLS estimator \(\hat{\beta}\) is asymptotically normal regardless of the error distribution:

\[\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N(0, V)\]

where \(V\) is the asymptotic variance-covariance matrix.

  2. Consistency: The consistency of OLS estimates doesn’t require normality—only that \(E(\epsilon_i | X_i) = 0\).

  3. Robust Inference: Heteroscedasticity-robust standard errors remain valid asymptotically without normality assumptions.

The literature also offers guidance on how much non-normal errors matter in practice for the LPM:

Horrace and Oaxaca (2006) demonstrate that normality violations in LPM are most problematic when:

  • Sample sizes are small (n < 100)
  • Many predicted probabilities fall outside [0.2, 0.8]
  • The research focus is on prediction rather than marginal effects

Gomila (2021) shows through simulations that for hypothesis testing about treatment effects in randomized experiments, the non-normality of LPM errors has minimal impact on Type I error rates when n > 50, even without robust standard errors.

Beck (2020) argues that concerns about non-normality are often overstated in political science applications where:

  • Sample sizes typically exceed 100
  • Interest centers on average marginal effects
  • Robust standard errors are used routinely
# Demonstrate the bimodal nature of LPM errors
set.seed(42)
n_demo = 1000

# Generate data with varying predicted probabilities
x_demo = rnorm(n_demo, 0, 1.5)
true_prob_demo = plogis(0.3 + 0.6*x_demo)
y_demo = rbinom(n_demo, 1, true_prob_demo)

# Fit LPM
lpm_demo = lm(y_demo ~ x_demo)
fitted_demo = fitted(lpm_demo)
resid_demo = residuals(lpm_demo)

# Create groups based on fitted values for visualization
fitted_groups = cut(fitted_demo, 
                    breaks = c(-Inf, 0.3, 0.5, 0.7, Inf),
                    labels = c("Low (< 0.3)", "Mid-Low (0.3-0.5)", 
                              "Mid-High (0.5-0.7)", "High (> 0.7)"))

# Visualize error distributions by fitted-value group
error_df = data.frame(
  residuals = resid_demo,
  fitted = fitted_demo,
  group = fitted_groups
)

# Histogram of residuals by group
p1 = ggplot(error_df, aes(x = residuals)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "black", alpha = 0.7) +
  facet_wrap(~group, scales = "free_y", ncol = 2) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Distribution of LPM Residuals by Fitted Probability Range",
       subtitle = "Note the bimodal structure and changing skewness",
       x = "Residuals",
       y = "Frequency") +
  theme_minimal()

# Q-Q plot to assess normality
p2 = ggplot(error_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_wrap(~group, scales = "free") +
  labs(title = "Q-Q Plots by Fitted Probability Range",
       subtitle = "Systematic departures from normality (red line)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 1)

# Formal tests of normality
library(nortest)
shapiro_overall = shapiro.test(resid_demo[1:min(5000, length(resid_demo))])
ad_test = ad.test(resid_demo)

print("Formal Tests of Normality for LPM Residuals:")
## [1] "Formal Tests of Normality for LPM Residuals:"
print(paste("Shapiro-Wilk test p-value:", round(shapiro_overall$p.value, 6)))
## [1] "Shapiro-Wilk test p-value: 0"
print(paste("Anderson-Darling test p-value:", round(ad_test$p.value, 6)))
## [1] "Anderson-Darling test p-value: 0"
# Demonstrate CLT in action: sampling distribution of coefficients
n_bootstrap = 1000
boot_coefs = matrix(NA, n_bootstrap, 2)

for(i in 1:n_bootstrap) {
  boot_idx = sample(1:n_demo, replace = TRUE)
  boot_model = lm(y_demo[boot_idx] ~ x_demo[boot_idx])
  boot_coefs[i,] = coef(boot_model)
}

# Plot sampling distribution
boot_df = data.frame(
  intercept = boot_coefs[,1],
  slope = boot_coefs[,2]
)

p3 = ggplot(boot_df, aes(x = slope)) +
  geom_histogram(aes(y = ..density..), bins = 30, 
                 fill = "lightblue", color = "black", alpha = 0.7) +
  stat_function(fun = dnorm, 
                args = list(mean = mean(boot_df$slope), 
                           sd = sd(boot_df$slope)),
                color = "red", size = 1.2) +
  labs(title = "Bootstrap Sampling Distribution of Slope Coefficient",
       subtitle = "Despite non-normal errors, sampling distribution approximates normality (CLT)",
       x = "Slope Coefficient",
       y = "Density") +
  theme_minimal()

print(p3)

# Show sample size effects
sample_sizes = c(30, 50, 100, 500)
ks_results = numeric(length(sample_sizes))

for(j in 1:length(sample_sizes)) {
  n_temp = sample_sizes[j]
  boot_slopes_temp = numeric(500)
  
  for(i in 1:500) {
    idx_temp = sample(1:n_demo, n_temp, replace = TRUE)
    temp_model = lm(y_demo[idx_temp] ~ x_demo[idx_temp])
    boot_slopes_temp[i] = coef(temp_model)[2]
  }
  
  # Kolmogorov-Smirnov test for normality
  ks_results[j] = ks.test(scale(boot_slopes_temp), "pnorm")$p.value
}

normality_df = data.frame(
  SampleSize = sample_sizes,
  KS_pvalue = ks_results,
  Conclusion = ifelse(ks_results > 0.05, "Approximately Normal", "Non-Normal")
)

print("\nNormality of Sampling Distribution by Sample Size:")
## [1] "\nNormality of Sampling Distribution by Sample Size:"
print(normality_df)
##   SampleSize KS_pvalue           Conclusion
## 1         30 0.2311874 Approximately Normal
## 2         50 0.1119774 Approximately Normal
## 3        100 0.6953876 Approximately Normal
## 4        500 0.7290082 Approximately Normal

7.3.2.4 Practical Recommendations

Given the current understanding of non-normal errors in LPM:

  1. Sample Size Considerations:

    • For n < 50: Consider logit/probit or use bootstrap inference
    • For n > 100: Normality concerns are typically negligible for hypothesis testing
    • For n > 500: Can essentially ignore non-normality for most purposes
  2. Inference Strategy:

    • Always use heteroscedasticity-robust standard errors; they address the heteroscedasticity problem, and the asymptotic normality of \(\hat{\beta}\) handles the non-normality concern (a short sketch follows this list)
    • Consider bootstrap confidence intervals for small samples or when many predictions are near boundaries
    • For critical applications with small samples, use permutation tests
  3. Diagnostic Checks:

    • Examine residual plots but don’t expect normality
    • Focus on whether predictions fall within [0,1]
    • Check for influential observations that might drive non-normality
  4. Reporting:

    • Acknowledge the violation of normality but emphasize asymptotic validity
    • Report sample size to allow readers to assess the likely impact
    • Consider sensitivity analyses using logit/probit if sample is small
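
To make the robust-inference recommendation concrete, here is a minimal sketch using the sandwich package and the lpm_demo fit from the chunk above; the comparison table is assembled by hand rather than through a dedicated testing package:

# Heteroscedasticity-robust (HC2) standard errors for the LPM fit above
library(sandwich)

robust_vcov  = vcovHC(lpm_demo, type = "HC2")
robust_se    = sqrt(diag(robust_vcov))
classical_se = sqrt(diag(vcov(lpm_demo)))

se_comparison = data.frame(
  estimate     = coef(lpm_demo),
  classical_se = classical_se,
  robust_se    = robust_se,
  robust_t     = coef(lpm_demo) / robust_se
)
print(round(se_comparison, 4))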

The modern view, supported by simulation evidence and theoretical results, is that non-normal errors in LPM are primarily a small-sample concern that becomes negligible in the sample sizes typical of most social science applications.

7.3.3 Nonsensical Predictions

The Linear Probability Model’s ability to produce predicted probabilities outside the [0,1] interval represents perhaps its most obvious theoretical deficiency, yet the practical implications of this issue are more nuanced.

7.3.3.1 The Mathematical Source of the Problem

The LPM specifies \(\Pr(Y_i = 1 | X_i) = X_i\beta\) without any constraints ensuring this linear combination falls within [0,1]. For any finite dataset, we can always find values of the covariates that produce:

\[\hat{\Pr}(Y_i = 1 | X_i) = X_i\hat{\beta} < 0 \quad \text{or} \quad \hat{\Pr}(Y_i = 1 | X_i) = X_i\hat{\beta} > 1\]

This occurs because the linear functional form extends infinitely in both directions, while probabilities are bounded. The problem is particularly acute when:

  • The range of covariates is large
  • Coefficients are substantial in magnitude
  • Making out-of-sample predictions
  • Extrapolating beyond the support of the training data

7.3.3.2 When and How Often Boundary Violations Occur

The frequency and severity of boundary violations depend on several factors:

Data Characteristics:

  1. Covariate Range: Wider ranges of continuous predictors increase violation likelihood
  2. Baseline Probability: Violations are more common when the unconditional probability is near 0 or 1
  3. Effect Sizes: Larger coefficients increase the probability of violations
  4. Dimensionality: More predictors generally increase violation risk through cumulative effects

Empirical Frequency:

Horrace and Oaxaca (2006) analyzed 166 published LPM studies and found:

  • 35% had no predictions outside [0,1]
  • 44% had fewer than 5% of predictions outside bounds
  • Only 21% had more than 5% of predictions outside bounds

More recently, Gomila (2021) examined experimental studies and found that with binary treatments and typical covariate adjustment, fewer than 2% of predictions typically fall outside [0,1].

7.3.3.3 Types of Boundary Violations and Their Implications

Boundary violations can be categorized by their nature and consequences:

1. In-Sample Violations

These occur for observations used to estimate the model:

  • Magnitude: Typically small (within 0.1 of boundaries)
  • Location: Usually occur for observations with extreme covariate values
  • Impact: Affect model interpretation but not coefficient consistency

2. Out-of-Sample Violations

These occur when making predictions for new observations:

  • Magnitude: Can be substantial, especially when extrapolating
  • Location: Common when predicting for covariate combinations not well-represented in training data
  • Impact: Can invalidate predictions and probability-based decision rules

3. Counterfactual Violations

These occur when computing treatment effects or marginal effects:

  • Magnitude: Varies with treatment effect size
  • Location: Most problematic for subgroups near probability boundaries
  • Impact: Can produce nonsensical average treatment effects
# Comprehensive demonstration of boundary violations in LPM
set.seed(123)
n = 1000

# Generate data with varying baseline probabilities
x1 = rnorm(n, 0, 2)
x2 = rbinom(n, 1, 0.4)
x3 = runif(n, -2, 2)

# True model with strong effects to induce some violations
true_prob = plogis(-0.5 + 0.8*x1 + 1.2*x2 + 0.6*x3)
y = rbinom(n, 1, true_prob)

# Fit LPM
lpm_full = lm(y ~ x1 + x2 + x3)

# get predictions
in_sample_pred = fitted(lpm_full)

# Create out-of-sample test data with more extreme values
n_test = 500
x1_test = rnorm(n_test, 0, 3)  # Wider range
x2_test = rbinom(n_test, 1, 0.4)
x3_test = runif(n_test, -3, 3)  # Wider range
test_data = data.frame(x1 = x1_test, x2 = x2_test, x3 = x3_test)
out_sample_pred = predict(lpm_full, newdata = test_data)

# Analyze violations
violations_summary = function(predictions, label) {
  n_total = length(predictions)
  n_below = sum(predictions < 0)
  n_above = sum(predictions > 1)
  n_violations = n_below + n_above
  pct_violations = 100 * n_violations / n_total
  
  # Magnitude of violations
  below_magnitude = if(n_below > 0) predictions[predictions < 0] else NA
  above_magnitude = if(n_above > 0) predictions[predictions > 1] - 1 else NA
  
  data.frame(
    Sample = label,
    N = n_total,
    Below_0 = n_below,
    Above_1 = n_above,
    Total_Violations = n_violations,
    Pct_Violations = round(pct_violations, 2),
    Min_Pred = round(min(predictions), 3),
    Max_Pred = round(max(predictions), 3),
    Worst_Below = if(n_below > 0) round(min(below_magnitude), 3) else NA,
    Worst_Above = if(n_above > 0) round(max(above_magnitude), 3) else NA
  )
}

violation_results = rbind(
  violations_summary(in_sample_pred, "In-Sample"),
  violations_summary(out_sample_pred, "Out-of-Sample")
)

print("Boundary Violations Summary:")
## [1] "Boundary Violations Summary:"
print(violation_results)
##          Sample    N Below_0 Above_1 Total_Violations Pct_Violations Min_Pred
## 1     In-Sample 1000      52      68              120           12.0   -0.368
## 2 Out-of-Sample  500      64      59              123           24.6   -0.760
##   Max_Pred Worst_Below Worst_Above
## 1    1.378      -0.368       0.378
## 2    2.035      -0.760       1.035
# Visualize the distribution of predictions
pred_df = data.frame(
  Prediction = c(in_sample_pred, out_sample_pred),
  Sample = c(rep("In-Sample", length(in_sample_pred)),
             rep("Out-of-Sample", length(out_sample_pred)))
)

p1 = ggplot(pred_df, aes(x = Prediction, fill = Sample)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 50) +
  geom_vline(xintercept = c(0, 1), color = "red", linetype = "dashed", size = 1) +
  facet_wrap(~Sample, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("In-Sample" = "blue", "Out-of-Sample" = "orange")) +
  labs(title = "Distribution of LPM Predictions",
       subtitle = "Red lines mark [0,1] boundaries; note violations outside",
       x = "Predicted Probability",
       y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "none")

# Examine which observations have violations
violation_analysis = data.frame(
  x1 = x1,
  x2 = x2, 
  x3 = x3,
  prediction = in_sample_pred,
  violation = ifelse(in_sample_pred < 0, "Below 0",
                    ifelse(in_sample_pred > 1, "Above 1", "Valid"))
)

p2 = ggplot(violation_analysis, aes(x = x1, y = prediction, color = violation)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Boundary Violations by Covariate Values",
       subtitle = "Violations occur primarily at extreme covariate values",
       x = "X1 (continuous predictor)",
       y = "LPM Prediction",
       color = "Status") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 1)

# Demonstrate impact on marginal effects
# Calculate marginal effect of x2 (binary treatment)
me_calculations = violation_analysis %>%
  group_by(x2) %>%
  summarise(
    n = n(),
    mean_pred = mean(prediction),
    n_violations = sum(violation != "Valid"),
    .groups = 'drop'
  )

ate_naive = diff(me_calculations$mean_pred)

# Show problematic individual treatment effects
ite_df = data.frame(
  x1 = x1,
  x3 = x3,
  baseline_pred = predict(lpm_full, newdata = data.frame(x1 = x1, x2 = 0, x3 = x3)),
  treated_pred = predict(lpm_full, newdata = data.frame(x1 = x1, x2 = 1, x3 = x3))
)
ite_df$ite = ite_df$treated_pred - ite_df$baseline_pred

# Identify problematic ITEs
ite_df$problem = ifelse(ite_df$baseline_pred < 0 | ite_df$baseline_pred > 1 |
                        ite_df$treated_pred < 0 | ite_df$treated_pred > 1,
                        "Problematic", "Valid")

p3 = ggplot(ite_df, aes(x = baseline_pred, y = treated_pred, color = problem)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = c(0, 1), linetype = "dotted", color = "red") +
  geom_hline(yintercept = c(0, 1), linetype = "dotted", color = "red") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Individual Treatment Effects in LPM",
       subtitle = "Red box shows valid [0,1] × [0,1] region",
       x = "Predicted Probability (Control)",
       y = "Predicted Probability (Treated)",
       color = "ITE Status") +
  theme_minimal()

print(p3)

print(paste("\nAverage Treatment Effect (naive):", round(ate_naive, 3)))
## [1] "\nAverage Treatment Effect (naive): 0.196"
print(paste("Number of problematic ITEs:", sum(ite_df$problem == "Problematic"), 
            "out of", nrow(ite_df)))
## [1] "Number of problematic ITEs: 183 out of 1000"

7.3.3.4 Perspectives on Boundary Violations

The recent literature offers several useful perspectives on boundary violations:

Beck (2020) argues that boundary violations are often a “red herring” when:

  • The focus is on average marginal effects rather than individual predictions
  • Most predictions fall within [0.1, 0.9]
  • The research question concerns population-level relationships

Gomila (2021) demonstrates through extensive simulations that:

  • For experimental treatment effects, boundary violations rarely affect ATE estimates
  • Bias from boundary violations is typically smaller than bias from model misspecification in logit/probit
  • With covariate adjustment, LPM often outperforms logit in finite samples

Kitagawa and Tetenov (2018) show that for policy learning and optimal treatment assignment:

  • LPM can be preferable to logit/probit despite boundary violations
  • The linearity of LPM simplifies optimal policy computation
  • Boundary violations can be handled through constrained optimization

7.3.3.5 Practical Solutions and Workarounds

Several approaches can address boundary violations while retaining LPM’s advantages:

1. Trimming/Truncation:

# Simple truncation
predictions_trimmed = pmax(0, pmin(1, predictions))
  • Pros: Ensures valid probabilities
  • Cons: Introduces bias, affects inference

2. Feasible Generalized Least Squares (FGLS):

  • Estimate LPM, truncate predictions to [0.01, 0.99]
  • Use predicted values to construct weights
  • Re-estimate with weighted least squares
  • Iterate until convergence

3. Bounded Regression:

  • Use beta regression for outcomes in (0,1)
  • Apply transformation: \(\tilde{Y} = \frac{Y(N-1) + 0.5}{N}\)

4. Sample Restriction:

  • Drop observations with violations
  • Re-estimate on restricted sample
  • Note: Can introduce selection bias

5. Honest Reporting:

  • Report the percentage of violations
  • Show robustness to alternative specifications
  • Focus interpretation on regions without violations

7.3.3.6 When Boundary Violations Matter Most

Boundary violations are most problematic when:

  1. Prediction is the Goal: If the model’s purpose is to generate individual-level predictions or risk scores

  2. Decision Rules Depend on Probabilities: When thresholds determine binary decisions (e.g., treatment assignment)

  3. Extrapolation is Required: Making predictions for covariate values outside the training data range

  4. Small Sample Sizes: With limited data, violations can substantially affect average predictions

  5. Extreme Treatment Effects: When treatments push many units across probability boundaries

  6. Simulation Studies: When using the model to generate synthetic data

7.3.3.7 The Bottom Line

Modern methodological consensus suggests that boundary violations in LPM, while theoretically inelegant, are often practically inconsequential for:

  • Estimating average marginal effects
  • Testing hypotheses about relationships
  • Causal inference in experiments
  • Description of population patterns

However, researchers should:

  • Always check for and report boundary violations
  • Consider the specific inferential goal
  • Use alternative models when individual predictions matter
  • Be cautious about extrapolation
  • Recognize that theoretical purity sometimes conflicts with practical utility

As Angrist and Pischke (2009) memorably note: “While a nonlinear model may fit the CEF [conditional expectation function] for LDVs [limited dependent variables] more closely than a linear model, when it comes to marginal effects, this probably matters little.”

7.3.4 Constant Marginal Effects

The Linear Probability Model’s assumption of constant marginal effects represents a departure from how we typically conceptualize probability relationships. This mechanical feature of linearity has profound implications for interpretation and can lead to substantively different conclusions than nonlinear models, though recent methodological work suggests these differences are often smaller than traditionally believed.

7.3.4.1 The Mathematical Structure of Constant Effects

In the LPM, the marginal effect of any continuous variable \(X_k\) on the probability of success is:

\[\frac{\partial \Pr(Y_i = 1 | X_i)}{\partial X_{ik}} = \frac{\partial (X_i\beta)}{\partial X_{ik}} = \beta_k\]

This constancy implies that:

  • A one-unit change in \(X_k\) has the same effect whether \(\Pr(Y = 1) = 0.1\) or \(\Pr(Y = 1) = 0.9\)
  • The effect is identical for all individuals regardless of their other characteristics
  • There are no diminishing returns as probabilities approach boundaries
  • Interaction effects must be explicitly modeled rather than arising naturally

This contrasts sharply with nonlinear models where marginal effects vary with the level of all covariates.

7.3.4.2 Why Constant Effects Are Theoretically Problematic

The assumption of constant marginal effects conflicts with several theoretical and empirical regularities:

1. Ceiling and Floor Effects:

Near probability boundaries, there is less “room” for effects:

  • When \(\Pr(Y = 1) \approx 0.95\), a positive effect should be attenuated
  • When \(\Pr(Y = 1) \approx 0.05\), a negative effect should be attenuated
  • LPM ignores these natural bounds, potentially predicting impossible changes

2. Substantive Theory Often Implies Nonlinearity:

Many theories predict varying effects:

  • In voting: mobilization efforts have larger effects on moderate-probability voters
  • In conflict: deterrence effects vary with baseline conflict risk
  • In economics: policy interventions have diminishing marginal returns

3. Heterogeneous Treatment Effects:

The constant effect assumption implies no effect heterogeneity based on baseline risk, contradicting evidence from many fields that treatment effects vary by baseline characteristics.

7.3.4.3 Comparing Marginal Effects Across Models

In nonlinear models, marginal effects depend on location:

Logit Model:

\[\frac{\partial \Pr(Y = 1)}{\partial X_k} = \beta_k \cdot \pi_i(1 - \pi_i)\] where \(\pi_i = \Pr(Y_i = 1 | X_i)\)

Probit Model:

\[\frac{\partial \Pr(Y = 1)}{\partial X_k} = \beta_k \cdot \phi(X_i\beta)\] where \(\phi(\cdot)\) is the standard normal PDF

These formulations show that marginal effects:

  • Are largest when \(\pi_i = 0.5\) (for logit)
  • Approach zero as \(\pi_i\) approaches 0 or 1
  • Vary across individuals with different covariate profiles
# Comprehensive demonstration of marginal effects across models
set.seed(456)
n = 1000

# Generate data with a range of baseline probabilities
x1 = runif(n, -3, 3)
x2 = rnorm(n, 0, 1)
true_prob = plogis(-0.5 + 0.8*x1 + 0.5*x2)
y = rbinom(n, 1, true_prob)

data_me = data.frame(y = y, x1 = x1, x2 = x2)

# Fit three models
lpm_me = lm(y ~ x1 + x2, data = data_me)
logit_me = glm(y ~ x1 + x2, data = data_me, family = binomial(link = "logit"))
probit_me = glm(y ~ x1 + x2, data = data_me, family = binomial(link = "probit"))

# Calculate marginal effects across the range of predicted probabilities

# For LPM: constant marginal effect
lpm_me_x1 = coef(lpm_me)["x1"]

# For logit and probit: varying marginal effects
# Create evaluation points
eval_points = seq(-3, 3, 0.1)
eval_data = expand.grid(x1 = eval_points, x2 = 0)  # Hold x2 at mean

# Logit marginal effects
logit_pred = predict(logit_me, newdata = eval_data, type = "response")
logit_me_x1 = coef(logit_me)["x1"] * logit_pred * (1 - logit_pred)

# Probit marginal effects
probit_link = predict(probit_me, newdata = eval_data, type = "link")
probit_me_x1 = coef(probit_me)["x1"] * dnorm(probit_link)

# Create comprehensive comparison plot
me_comparison = data.frame(
  x1 = eval_points,
  baseline_prob_logit = logit_pred,
  LPM = lpm_me_x1,
  Logit = logit_me_x1,
  Probit = probit_me_x1
)

me_long = me_comparison %>%
  pivot_longer(cols = c(LPM, Logit, Probit),
               names_to = "Model",
               values_to = "Marginal_Effect")

p1 = ggplot(me_long, aes(x = x1, y = Marginal_Effect, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Marginal Effects of X1 Across Its Range",
       subtitle = "LPM shows constant effect; Logit/Probit show varying effects",
       x = "Value of X1",
       y = "Marginal Effect (∂Pr/∂X1)") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  geom_hline(yintercept = lpm_me_x1, linetype = "dashed", color = "blue", alpha = 0.3)

# Show marginal effects as function of baseline probability
p2 = ggplot(me_long %>% 
             mutate(baseline_prob = rep(me_comparison$baseline_prob_logit, each = 3)),
           aes(x = baseline_prob, y = Marginal_Effect, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Marginal Effects by Baseline Probability",
       subtitle = "Nonlinear models show maximum effect at Pr = 0.5",
       x = "Baseline Probability",
       y = "Marginal Effect") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 1)

# Distribution of marginal effects across sample
# Calculate individual marginal effects for each observation
library(margins)

# Marginal effects at observed values
lpm_ame = mean(rep(coef(lpm_me)["x1"], n))
logit_margins = margins(logit_me, variables = "x1")
logit_mes = logit_margins$dydx_x1     # observation-level marginal effects, not fitted values
probit_margins = margins(probit_me, variables = "x1")
probit_mes = probit_margins$dydx_x1

# Create distribution plot
me_dist = data.frame(
  LPM = rep(lpm_ame, n),
  Logit = logit_mes,
  Probit = probit_mes
)

me_dist_long = me_dist %>%
  pivot_longer(cols = everything(),
               names_to = "Model",
               values_to = "ME")

p3 = ggplot(me_dist_long, aes(x = ME, fill = Model)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  facet_wrap(~Model, scales = "free_y", ncol = 1) +
  labs(title = "Distribution of Marginal Effects Across Sample",
       subtitle = "LPM: single value; Nonlinear models: heterogeneous effects",
       x = "Marginal Effect",
       y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "none")

print(p3)

# Calculate and compare average marginal effects
ame_comparison = data.frame(
  Model = c("LPM", "Logit", "Probit"),
  AME = c(lpm_ame, mean(logit_mes), mean(probit_mes)),
  SD = c(0, sd(logit_mes), sd(probit_mes)),
  Min = c(lpm_ame, min(logit_mes), min(probit_mes)),
  Max = c(lpm_ame, max(logit_mes), max(probit_mes))
)

print("Average Marginal Effects Comparison:")
## [1] "Average Marginal Effects Comparison:"
print(ame_comparison)

7.3.4.4 When Constant Effects Matter: Empirical Evidence

The practical importance of allowing for varying marginal effects depends on several factors:

1. Distribution of Predicted Probabilities:

Angrist (2001) shows that when most predicted probabilities fall in [0.2, 0.8], the difference between LPM and nonlinear models is minimal. The average marginal effect (AME) from logit/probit often closely approximates the LPM coefficient.

2. Research Question:

  • Population Average Effects: LPM often adequate
  • Effect Heterogeneity: Nonlinear models necessary
  • Extreme Groups: Nonlinear models important
  • Counterfactual Predictions: Depends on range

3. Sample Composition:

Recent work by Breen, Karlson, and Holm (2018) demonstrates that comparing coefficients across models or samples is problematic in nonlinear models due to scale identification. LPM avoids these issues entirely.

# Demonstrate when constant vs varying marginal effects matter

# Scenario 1: Most observations in middle range
set.seed(789)
n_scenario = 1000

# Middle range scenario
x_middle = rnorm(n_scenario, 0, 0.5)
prob_middle = plogis(0 + x_middle)  # Most probs in [0.3, 0.7]
y_middle = rbinom(n_scenario, 1, prob_middle)

# Extreme range scenario
x_extreme = rnorm(n_scenario, 0, 2)
prob_extreme = plogis(0 + x_extreme)  # Probs span [0.02, 0.98]
y_extreme = rbinom(n_scenario, 1, prob_extreme)

# Fit models for both scenarios
models_middle = list(
  lpm = lm(y_middle ~ x_middle),
  logit = glm(y_middle ~ x_middle, family = binomial())
)

models_extreme = list(
  lpm = lm(y_extreme ~ x_extreme),
  logit = glm(y_extreme ~ x_extreme, family = binomial())
)

# Calculate average marginal effects
ame_middle = c(
  LPM = coef(models_middle$lpm)[2],
  Logit = mean(margins(models_middle$logit, variables = "x_middle")$dydx_x_middle)
)

ame_extreme = c(
  LPM = coef(models_extreme$lpm)[2],
  Logit = mean(margins(models_extreme$logit, variables = "x_extreme")$dydx_x_extreme)
)

# Visualize the difference between the two scenarios
scenario_comparison = data.frame(
  Scenario = rep(c("Middle Range", "Extreme Range"), each = 2),
  Model = rep(c("LPM", "Logit AME"), 2),
  AME = c(ame_middle, ame_extreme)
)

p4 = ggplot(scenario_comparison, aes(x = Model, y = AME, fill = Model)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Scenario) +
  labs(title = "Average Marginal Effects: When LPM Approximates Nonlinear Models",
       subtitle = "LPM closely approximates logit AME in middle range, diverges at extremes",
       y = "Average Marginal Effect") +
  theme_minimal() +
  theme(legend.position = "none")

print(p4)

print("AME Comparison by Scenario:")
## [1] "AME Comparison by Scenario:"
print(scenario_comparison)
print(paste("Middle range - Difference:", 
            round(abs(ame_middle[1] - ame_middle[2]), 4)))
print(paste("Extreme range - Difference:", 
            round(abs(ame_extreme[1] - ame_extreme[2]), 4)))
# Show individual vs average effects
# Generate data with clear heterogeneity
x_het = runif(n_scenario, -3, 3)
z_het = rbinom(n_scenario, 1, 0.5)  # Binary moderator
true_effect = ifelse(z_het == 1, 1.5, 0.3)  # Heterogeneous effect
prob_het = plogis(-0.5 + true_effect * x_het)
y_het = rbinom(n_scenario, 1, prob_het)

# Fit models
lpm_het = lm(y_het ~ x_het * z_het)
logit_het = glm(y_het ~ x_het * z_het, family = binomial())

# Calculate marginal effects by subgroup
me_z0 = margins(logit_het, at = list(z_het = 0), variables = "x_het")
me_z1 = margins(logit_het, at = list(z_het = 1), variables = "x_het")

het_results = data.frame(
  Group = c("Z=0", "Z=0", "Z=1", "Z=1"),
  Model = rep(c("LPM", "Logit AME"), 2),
  Effect = c(
    coef(lpm_het)["x_het"],
    summary(me_z0)$AME,
    coef(lpm_het)["x_het"] + coef(lpm_het)["x_het:z_het"],
    summary(me_z1)$AME
  )
)

print("\nHeterogeneous Effects Comparison:")
## [1] "\nHeterogeneous Effects Comparison:"
print(het_results)
##   Group     Model     Effect
## 1   Z=0       LPM 0.06748769
## 2   Z=0 Logit AME 0.06615296
## 3   Z=1       LPM 0.20862741
## 4   Z=1 Logit AME 0.16200196

7.3.4.5 What the Methodological Literature Says

Recent methodological work has refined our understanding of when constant marginal effects are problematic:

Angrist and Pischke (2009) show that LPM provides the best linear approximation to the true conditional expectation function, minimizing mean squared error.

Lewbel, Dong, and Yang (2012) demonstrate that:

  • LPM identifies a weighted average of unit-specific treatment effects
  • This parameter is often the policy-relevant quantity
  • Nonlinear models identify different weighted averages that may be less interpretable

In recent work, Cattaneo, Crump, Farrell, and Feng (2024) show:

  • For average treatment effects, LPM is often more robust to misspecification than logit/probit
  • The bias from ignoring nonlinearity is often smaller than bias from distributional misspecification
  • Regression adjustment with LPM can be more reliable than with nonlinear models

7.3.4.6 Practical Implications of Constant Effects

The constant marginal effects assumption has several practical consequences:

1. Interaction Interpretation:

  • LPM: Interaction coefficients have a straightforward interpretation (a short sketch follows this list)
  • Nonlinear: Interaction effects \(\neq\) interaction coefficients (Ai & Norton, 2003)

2. Comparison Across Groups:

  • LPM: Coefficients directly comparable across subsamples
  • Nonlinear: Coefficients not comparable due to scale differences (Mood, 2010)

3. Specification Testing:

  • LPM: Standard tests for functional form apply
  • Nonlinear: Functional form inherently restricted

4. Covariate Balance:

  • LPM: Linear balance implies comparable treatment effects
  • Nonlinear: Balance requirements depend on functional form
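
As a quick illustration of the first point, the sketch below compares the interaction coefficient in an LPM with the corresponding logit interaction effect, computed as a double difference of predicted probabilities; the simulated variables here (x, z, y_int) are illustrative and separate from the earlier examples:

# Interaction effects vs. interaction coefficients (illustrative simulation)
set.seed(2024)
n_int = 2000
x = rbinom(n_int, 1, 0.5)                        # binary "treatment"
z = rbinom(n_int, 1, 0.5)                        # binary moderator
y_int = rbinom(n_int, 1, plogis(-1 + 0.8*x + 0.6*z + 0.5*x*z))

lpm_int   = lm(y_int ~ x * z)
logit_int = glm(y_int ~ x * z, family = binomial())

# LPM: the coefficient on x:z is itself the interaction effect on the probability scale
coef(lpm_int)["x:z"]

# Logit: the interaction effect is the double difference of predicted probabilities,
# which generally differs from the raw x:z coefficient (Ai & Norton, 2003)
grid_int = expand.grid(x = c(0, 1), z = c(0, 1))
pr_int = predict(logit_int, newdata = grid_int, type = "response")
(pr_int[4] - pr_int[3]) - (pr_int[2] - pr_int[1])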

7.3.4.7 When to Worry About Constant Effects

Constant marginal effects are most problematic when:

  1. Extreme Probabilities: Many units have baseline probabilities near 0 or 1
  2. Large Effects: Treatment effects are large enough to push units across wide probability ranges
  3. Effect Heterogeneity Focus: Research specifically examines how effects vary
  4. Individual Predictions: Need accurate individual-level predictions, not just averages
  5. Simulation/Extrapolation: Generating counterfactual scenarios outside data support

Constant effects are less concerning when:

  1. Average Effects: Focus is on population average treatment effects
  2. Middle Range: Most probabilities fall in [0.2, 0.8]
  3. Small Effects: Treatment effects are modest (< 0.2 change in probability)
  4. Causal Inference: Primary goal is unbiased estimation of causal effects
  5. Transparent Interpretation: Simplicity and interpretability are prioritized

7.3.4.8 The Bottom Line

The assumption of constant marginal effects in LPM represents a clear departure from the theoretical ideal of how probabilities should behave. However, mounting empirical evidence suggests that:

  1. For Average Effects: The difference between LPM and nonlinear models is often negligible
  2. For Transparency: LPM’s simplicity aids interpretation and replication
  3. For Robustness: LPM requires fewer assumptions about functional form
  4. For Comparisons: LPM facilitates cross-group and cross-study comparisons

As Beck (2020) concludes: “The LPM is not a model of individual probabilities but of average probabilities in the population. For this purpose, it often performs remarkably well.”

The choice between constant and varying marginal effects should depend on:

  • The research question
  • The distribution of predicted probabilities
  • The importance of effect heterogeneity
  • The value placed on simplicity versus flexibility

Rather than viewing constant marginal effects as a fatal flaw, recent literature suggests viewing it as a modeling choice with well-understood tradeoffs.

8 Conclusion

Binary response models are fundamental tools in social science research. Key insights from this deep dive:

Theoretical Foundations Matter (Sometimes):

  • All models make strong assumptions about functional form
  • No model is “correct”; they are all approximations
  • Theory should guide choice when possible

Practical Differences Are Often Small:

  • Logit and probit usually give similar substantive conclusions
  • LPM often approximates nonlinear models well for AME
  • Focus on what you’re trying to learn, not model purity

Interpretation Is Everything:

  • Coefficients are not effects
  • Marginal effects vary across observations
  • Graphics often communicate better than tables

Modern Best Practices:

  1. Report average marginal effects (AME) using observed values (Hanmer & Kalkan 2013)
    • Don’t use marginal effects at means unless specifically justified
    • The margins package in R defaults to AME (a one-line example follows this list)
  2. Show robustness across specifications (LPM, logit, probit)
  3. Validate predictions out-of-sample
  4. Be transparent about limitations and modeling choices
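
As a one-line illustration of the first practice, margins() computes observation-level marginal effects at observed values and summary() reports their average; logit_me is the logit fit from the marginal-effects comparison above:

# Average marginal effects at observed values (the default behavior of margins())
summary(margins(logit_me))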

The Evolution Continues:

  • Machine learning methods (random forests, neural nets) for prediction
  • A causal inference focus shifting emphasis to identification over functional form
  • Semiparametric methods reducing reliance on distributional assumptions

The choice of binary response model is less crucial than understanding what each model assumes, when those assumptions matter, and how to interpret results correctly. In the end, the goal is not to find the “right” model but to learn something useful about the world.

8.1 References

  • Ai, C., & Norton, E. C. (2003). Interaction terms in logit and probit models. Economics Letters, 80(1), 123-129.
  • Angrist, J. D., & Pischke, J. S. (2009). Mostly harmless econometrics. Princeton University Press.
  • Beck, N. (2020). Estimating grouped data models with a binary dependent variable and fixed effects. Annual Review of Political Science, 23, 185-201.
  • Gomila, R. (2021). Logistic or linear? Estimating causal effects of experimental treatments on binary outcomes using regression analysis. Journal of Experimental Psychology: General, 150(4), 700-709.
  • Horrace, W. C., & Oaxaca, R. L. (2006). Results on the bias and inconsistency of ordinary least squares for the linear probability model. Economics Letters, 90(3), 321-327.
  • King, G., & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2), 137-163.
  • Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do. European Sociological Review, 26(1), 67-82.
  • Nagler, J. (1994). Scobit: An alternative estimator to logit and probit. American Journal of Political Science, 38(1), 230-255.

8.2 Session Information

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Pop!_OS 22.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Detroit
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] nortest_1.0-4   sandwich_3.1-1  stargazer_5.2.3 gridExtra_2.3  
##  [5] pROC_1.18.5     margins_0.3.28  broom_1.0.8     MASS_7.3-65    
##  [9] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
## [13] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.3.0   
## [17] ggplot2_3.5.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.3     lattice_0.22-5     stringi_1.8.7     
##  [5] hms_1.1.3          digest_0.6.37      magrittr_2.0.3     evaluate_1.0.3    
##  [9] grid_4.5.1         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] Matrix_1.7-4       plyr_1.8.9         jsonlite_2.0.0     backports_1.5.0   
## [17] mgcv_1.9-1         scales_1.4.0       jquerylib_0.1.4    cli_3.6.5         
## [21] rlang_1.1.6        splines_4.5.1      withr_3.0.2        cachem_1.1.0      
## [25] yaml_2.3.10        tools_4.5.1        tzdb_0.5.0         vctrs_0.6.5       
## [29] R6_2.6.1           zoo_1.8-14         lifecycle_1.0.4    pkgconfig_2.0.3   
## [33] pillar_1.11.0      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [37] data.table_1.17.0  Rcpp_1.1.0         xfun_0.52          tidyselect_1.2.1  
## [41] knitr_1.50         farver_2.1.2       nlme_3.1-168       htmltools_0.5.8.1 
## [45] rmarkdown_2.29     prediction_0.3.18  labeling_0.4.3     compiler_4.5.1