Binary response variables are ubiquitous in social science research. Many phenomena of interest are naturally dichotomous or can be meaningfully dichotomized: a citizen votes or abstains, a country goes to war or stays at peace, a borrower repays or defaults.
This document explores various approaches to modeling binary outcomes. We’ll start with logit and probit models, then discuss alternative link functions, and close with the linear probability model (LPM).
Before we get into the mathematics, let’s build some intuition. Imagine you’re trying to predict whether someone will vote based on their education level. You might think that more educated people are more likely to vote, but this relationship probably isn’t perfectly linear. Going from 0 to 4 years of education might have a different effect than going from 12 to 16 years. Moreover, once someone is already very likely to vote (say 95% probability), additional education probably won’t increase their probability much more – there’s just not much room to grow. This kind of “diminishing returns” at the extremes is what logit and probit models capture naturally.
The latent variable approach provides a conceptual framework for thinking about binary outcomes. The word “latent” means hidden or unobserved, and that’s exactly what we’re dealing with here. Think about the decision to vote. We observe whether someone voted (1) or didn’t vote (0), but underlying this binary outcome is likely a continuous measure of “motivation to vote” or “propensity to vote” that we can’t directly see. Someone with very high motivation votes, someone with very low motivation doesn’t, and people near the middle might go either way depending on small factors.
This latent variable framework isn’t just a mathematical convenience—it often corresponds to how we actually think about binary decisions. Consider a few examples: a legislator supports a bill when her underlying agreement with it is strong enough, a country initiates conflict when hostility builds past a breaking point, and a consumer buys a product when willingness to pay exceeds the price.
In each case, the binary outcome emerges when this underlying continuous assessment crosses some threshold.
Both logit and probit models can be motivated through a latent variable framework. We assume there exists an unobserved continuous variable \(Y^*\) that determines the observed binary outcome:
\[Y^*_i = X_i\beta + u_i\]
We observe only:
\[Y_i = \begin{cases} 1 & \text{if } Y^*_i \geq \tau \\ 0 & \text{if } Y^*_i < \tau \end{cases}\]
where \(\tau\) is a threshold. Without loss of generality, we typically normalize \(\tau = 0\) by absorbing it into the intercept.
Let me explain this normalization because it’s important but often a source of confusion. We could set the threshold at any value – maybe people vote when their latent propensity exceeds 100, or 42, or any other number. But since we never observe the latent variable directly, we can’t identify both the threshold and the intercept separately. It’s like trying to measure temperature without agreeing on where zero should be – Celsius and Fahrenheit both work, they just put zero in different places. By convention, we set the threshold to zero and let the intercept absorb any threshold effects. This doesn’t change our substantive results; it just makes the math cleaner.
The probability of observing \(Y_i = 1\) is:
\[\begin{align} \Pr(Y_i = 1) &= \Pr(Y^*_i \geq 0) \\ &= \Pr(X_i\beta + u_i \geq 0) \\ &= \Pr(u_i \geq -X_i\beta) \\ &= \Pr(u_i \leq X_i\beta) \quad \text{(by symmetry)}\\ &= F(X_i\beta) \end{align}\]
Let’s walk through this derivation step by step. We start by asking: what’s the probability that we observe Y = 1? Well, that happens when the latent variable exceeds our threshold (which we’ve set to zero). The latent variable equals \(X_i\beta\) plus an error term \(u_i\), so we need the probability that their sum is positive. Rearranging, we need the probability that the error is greater than \(-X_i\beta\).
The last step—going from \(\Pr(u_i \geq -X_i\beta)\) to \(\Pr(u_i \leq X_i\beta)\) – uses the fact that we assume the error distribution is symmetric around zero. This is a standard assumption that says positive and negative errors are equally likely. Finally, we recognize that \(\Pr(u_i \leq X_i\beta)\) is just the cumulative distribution function (CDF) of the error evaluated at \(X_i\beta\).
where \(F(\cdot)\) is the CDF of \(u_i\). The choice between logit and probit depends on the assumed distribution of \(u_i\) – a topic we’ll return to after we introduce logit and probit more formally below.
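To see the latent-variable logic in action, here is a small simulation sketch; the value of \(X_i\beta\), the sample size, and all variable names are arbitrary choices for illustration. With logistic errors, the share of simulated cases crossing the threshold should match the logistic CDF evaluated at \(X_i\beta\).

# a quick simulation sketch of the latent-variable setup (illustrative values)
set.seed(1)
n_sim = 100000
xb_val = 0.5                           # a single value of X_i * beta
u_sim = rlogis(n_sim)                  # logistic errors (the logit case)
y_sim = as.numeric(xb_val + u_sim >= 0)  # observed binary outcome from latent Y*
mean(y_sim)                            # empirical Pr(Y = 1)
plogis(xb_val)                         # F(X_i * beta); should agree closely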
The logit model assumes \(u_i\) follows a standard logistic distribution, with PDF \(\lambda(u) = \frac{\exp(u)}{[1 + \exp(u)]^2}\), CDF \(\Lambda(u) = \frac{\exp(u)}{1 + \exp(u)}\), mean 0, variance \(\pi^2/3 \approx 3.29\), and symmetry around zero.
Before we dive deeper, let’s understand what these mathematical objects represent and why they matter. The PDF (probability density function) tells us how likely different values of the error are—think of it as the “shape” of randomness in our model. The CDF (cumulative distribution function) is what we actually use to transform our linear predictions into probabilities. The variance tells us how spread out the errors are, and the symmetry property ensures that positive and negative deviations from our predictions are equally likely.
These formulas define the “shape” of randomness in the logit model. The PDF (left plot below) shows how likely different error values are—notice it’s bell-shaped like the normal distribution but with heavier tails (more extreme values are slightly more likely). The CDF (right plot) is what we actually use in the model—it transforms any number from \(-\infty\) to \(+\infty\) into a probability between 0 and 1. This S-shaped curve is the famous “logistic curve” that gives logit its name.
The key insight: when \(X_i\beta = 0\), you get probability 0.5. As \(X_i\beta\) increases, probability smoothly rises toward 1. As it decreases, probability falls toward 0 and the curve naturally flattens out at the extremes, capturing the idea that it gets increasingly hard to push probabilities from 0.9 to 0.99 (or from 0.1 to 0.01).
Think about this intuitively: if someone already has a 90% chance of voting, it’s hard to increase that to 99%—most of the factors that would make them vote are already in play. Similarly, if someone has only a 10% chance of voting, it’s hard to decrease that to 1%—they’re already very unlikely to vote. This “resistance” at the extremes is what the S-shaped curve captures.
# create visualizations of the logistic distribution
u_seq = seq(-4, 4, 0.01)

# probability density function
logistic_pdf = dlogis(u_seq)

# cumulative distribution function
logistic_cdf = plogis(u_seq)

# create the plots
pdf_plot = ggplot(data.frame(u = u_seq, pdf = logistic_pdf), aes(x = u, y = pdf)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Standard Logistic PDF",
       subtitle = "The 'bell shape' of random errors",
       x = "u", y = "Probability Density") +
  theme_minimal()

cdf_plot = ggplot(data.frame(u = u_seq, cdf = logistic_cdf), aes(x = u, y = cdf)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Standard Logistic CDF (Logit Link)",
       subtitle = "The S-curve that maps Xβ to probabilities",
       x = "Xβ", y = "Pr(Y = 1)") +
  theme_minimal() +
  geom_hline(yintercept = 0.5, linetype = "dashed", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5)
grid.arrange(pdf_plot, cdf_plot, ncol = 2)
The logit model specifies:
\[\Pr(Y_i = 1) = \Lambda(X_i\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\]
This formula might look intimidating, but let’s break it down piece by piece. We’re trying to convert \(X_i\beta\)—which can be any number—into a probability that must be between 0 and 1. The exponential function and the ratio accomplish this transformation.
The Linear Predictor: \(X_i\beta\) is just like OLS—a weighted sum of your variables. If you have two predictors, this is \(\beta_0 + \beta_1X_{i1} + \beta_2X_{i2}\). This can range from \(-\infty\) to \(+\infty\).
To make this concrete, imagine we’re predicting voting based on education and income. The linear predictor might be: \(-2 + 0.3 \times \text{Education} + 0.0001 \times \text{Income}\). For someone with 16 years of education and $50,000 income, this gives us: \(-2 + 0.3(16) + 0.0001(50000) = -2 + 4.8 + 5 = 7.8\).
The Exponential: \(\exp(X_i\beta)\) transforms the linear predictor to be strictly positive. When \(X_i\beta = 0\), \(\exp(0) = 1\). When \(X_i\beta\) is large and positive, \(\exp(X_i\beta)\) becomes huge. When negative, it approaches 0.
Continuing our example, \(\exp(7.8) \approx 2440\). This is a big positive number, suggesting this person is very likely to vote.
The Ratio: \(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\) creates the “squashing” that ensures probabilities stay in [0,1]: when \(\exp(X_i\beta)\) is near 0 the ratio is near 0, when \(\exp(X_i\beta) = 1\) the ratio is exactly 0.5, and when \(\exp(X_i\beta)\) is huge the ratio approaches 1.
For our example: \(\frac{2440}{1 + 2440} = \frac{2440}{2441} \approx 0.9996\). So this person has a 99.96% probability of voting according to our model.
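The same arithmetic in R, using the made-up coefficients from the example (nothing here comes from a fitted model):

# reproduce the worked voting example (hypothetical coefficients)
xb = -2 + 0.3*16 + 0.0001*50000   # linear predictor = 7.8
exp(xb)                           # roughly 2440
exp(xb) / (1 + exp(xb))           # roughly 0.9996
plogis(xb)                        # same answer via R's built-in logistic CDF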
Alternative Forms You’ll Encounter:
The same probability can be written as \(\Lambda(X_i\beta)\), as \(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\), or as \(\frac{1}{1 + \exp(-X_i\beta)}\).
These are all the same thing, just written differently. You’ll see different forms in different textbooks and papers, so it’s good to recognize them all.
Connection to Odds:
The logit model has a popular interpretation through odds. If we rearrange:
\[\frac{\Pr(Y_i = 1)}{1 - \Pr(Y_i = 1)} = \exp(X_i\beta)\]
Taking logs of both sides:
\[\ln\left(\frac{\Pr(Y_i = 1)}{1 - \Pr(Y_i = 1)}\right) = X_i\beta\]
This means the log-odds are linear in \(X\). A one-unit change in \(X_k\) changes the log-odds by \(\beta_k\), or multiplies the odds by \(\exp(\beta_k)\). We’ll talk more about this interpretation later.
Now we need to estimate the \(\beta\) parameters. In OLS regression, we used the principle of least squares—finding the line that minimizes the sum of squared residuals. For binary outcomes, this doesn’t work well (we’ll see why when we discuss the Linear Probability Model later). Instead, we use maximum likelihood estimation—finding the parameters that make our observed data most likely.
Think about it this way: suppose we observed that 8 out of 10 highly educated people voted, and only 2 out of 10 less educated people voted. What parameter values would make this pattern most likely? Maximum likelihood finds those values.
Building the Likelihood:
For a single observation \(i\), we need the probability of observing what we actually saw: if \(Y_i = 1\), that probability is \(\Pr(Y_i = 1) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\); if \(Y_i = 0\), it is \(1 - \Pr(Y_i = 1) = \frac{1}{1 + \exp(X_i\beta)}\).
The clever trick is to write both cases in one formula:
\[L_i = \left[\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right]^{Y_i} \left[\frac{1}{1 + \exp(X_i\beta)}\right]^{1-Y_i}\]
Why This Works:
This formula might seem a bit spooky at first, but the key is that \(Y_i\) is either 0 or 1, which acts like a switch: when \(Y_i = 1\), the second bracket is raised to the power 0 and drops out, leaving \(\Pr(Y_i = 1)\); when \(Y_i = 0\), the first bracket drops out, leaving \(\Pr(Y_i = 0)\).
This is a common trick in statistics for writing conditional expressions in a single formula.
From Individual to Joint Likelihood:
Assuming independence across observations, the likelihood for the entire sample is the product:
\[L = \prod_{i=1}^N L_i\]
As we’ve discussed, products are computationally difficult (tiny numbers multiplied together), so we take logs.
The Log-Likelihood:
\[\ln L = \sum_{i=1}^N \left[Y_i \ln\left(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right) + (1-Y_i)\ln\left(\frac{1}{1 + \exp(X_i\beta)}\right)\right]\]
Simplifying the Log-Likelihood:
We can rewrite this in a more computationally stable form. Note that \(\ln\left(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right) = X_i\beta - \ln(1 + \exp(X_i\beta))\) and \(\ln\left(\frac{1}{1 + \exp(X_i\beta)}\right) = -\ln(1 + \exp(X_i\beta))\).
Therefore:
\[\ln L = \sum_{i=1}^N \left[Y_i X_i\beta - \ln(1 + \exp(X_i\beta))\right]\]
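As a sketch, this simplified log-likelihood is easy to write as an R function. Here X is assumed to be a model matrix that includes an intercept column and y the 0/1 outcome; both names, and the check against a fitted object called fit, are hypothetical.

# minimal sketch of the simplified logit log-likelihood above
logit_loglik = function(beta, X, y) {
  xb = as.vector(X %*% beta)          # linear predictor for each observation
  sum(y * xb - log(1 + exp(xb)))      # sum of Y_i * X_i b - ln(1 + exp(X_i b))
}
# e.g., for a fitted logit 'fit', logit_loglik(coef(fit), model.matrix(fit), y)
# should reproduce logLik(fit)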
Why Can’t We Just Use Calculus to Solve for \(\beta\)?
In OLS, we take derivatives, set them to zero, and solve algebraically for \(\beta\). Let’s try that here and see where it breaks down.
Step 1: Take the First Derivative
To maximize the log-likelihood, we differentiate with respect to \(\beta\):
\[\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^N \left[Y_i X_i - X_i \cdot \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\right]\]
Notice that \(\frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)} = \Pr(Y_i = 1 | X_i)\), so we can write:
\[\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^N X_i \left[Y_i - \Pr(Y_i = 1 | X_i)\right]\]
Note that this has a nice interpretation: we’re summing up the prediction errors (actual outcome minus predicted probability), weighted by the X values. When our predictions are perfect, this equals zero.
Step 2: Set Equal to Zero (First-Order Conditions)
For a maximum, we need:
\[\sum_{i=1}^N X_i \left[Y_i - \Pr(Y_i = 1 | X_i)\right] = 0\]
Or equivalently:
\[\sum_{i=1}^N X_i Y_i = \sum_{i=1}^N X_i \cdot \Pr(Y_i = 1 | X_i)\]
Step 3: Where We Get Stuck … Ruh-Roh
Here’s the problem: \(\Pr(Y_i = 1 | X_i) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\) is a nonlinear function of \(\beta\).
Let’s see this with a simple example. Suppose we have just an intercept and one predictor, so \(X_i\beta = \beta_0 + \beta_1 x_i\). Our first-order conditions become:
For \(\beta_0\):
\[\sum_{i=1}^N Y_i = \sum_{i=1}^N \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}\]
For \(\beta_1\):
\[\sum_{i=1}^N x_i Y_i = \sum_{i=1}^N x_i \cdot \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}\]
There’s no way to algebraically isolate \(\beta_0\) and \(\beta_1\) from these equations! The exponential function inside a ratio makes it impossible to “solve for \(\beta\)” the way we can with OLS. To understand why this is so problematic, imagine trying to solve the equation \(x = \frac{e^x}{1 + e^x}\). There’s no algebraic manipulation that will give you \(x\) in terms of elementary functions.
In OLS, the first-order conditions are:
\[\sum_{i=1}^N X_i (Y_i - X_i\beta) = 0\]
Which simplifies to:
\[\sum_{i=1}^N X_i Y_i = \sum_{i=1}^N X_i X_i'\beta\]
This can be solved directly:
\[\beta = (X'X)^{-1}X'Y\]
The key difference: In OLS, the predicted value \(X_i\beta\) is linear in \(\beta\), so we can factor it out and solve. In logit, the predicted probability is nonlinear in \(\beta\), trapping us in an equation we can’t solve analytically, thus we use the numerical estimation approaches that we discussed in the Likelihood Theory materials.
So what do we do? More specifically, if we can’t solve the first-order conditions, how do computers find \(\beta\)? The answer reveals a fun distinction between computing derivatives versus actually solving equations.
We CAN compute the gradient (the derivative formula):
\[g(\beta) = \sum_{i=1}^N X_i \left[Y_i - \Lambda(X_i\beta)\right]\]
where \(\Lambda(X_i\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}\) is the predicted probability.
This tells us the “slope” of the log-likelihood at any given \(\beta\) value – essentially, which direction is “uphill” toward the maximum.
We CAN’T solve \(g(\beta) = 0\) analytically because the exponential function inside the ratio makes it impossible to isolate \(\beta\) algebraically.
Computers use the Newton-Raphson algorithm to navigate around this problem. Instead of solving the equation directly, the algorithm takes an iterative approach – like climbing a hill in dense fog where you can’t see the peak but can feel the slope beneath your feet. Specifically, the process works as follows:
Initial Guess: Start with an initial estimate of \(\beta\), say \(\beta^{(0)}\) (often just zeros or OLS estimates).
Compute Gradient and Hessian: Calculate the gradient (first derivative) and Hessian (second derivative) of the log-likelihood at the current estimate.
The gradient tells us the direction and steepness: \[g(\beta^{(t)}) = \sum_{i=1}^N X_i \left[Y_i - \Lambda(X_i\beta^{(t)})\right]\] If positive, we should increase \(\beta\); if negative, decrease it.
The Hessian tells us how the gradient is changing (the curvature): \[H(\beta^{(t)}) = -\sum_{i=1}^N X_i X_i' \cdot \Lambda(X_i\beta^{(t)})[1-\Lambda(X_i\beta^{(t)})]\]
Update Rule: Update the estimate using the formula:
\[\beta^{(t+1)} = \beta^{(t)} - H^{-1} g\] where \(H\) is the Hessian matrix and \(g\) is the gradient vector evaluated at \(\beta^{(t)}\).
This formula essentially asks: “Given the current slope and curvature, if the function were quadratic, where would its maximum be?” We move to that point and reassess.
Convergence Check: Repeat steps 2 and 3 until the change in \(\beta\) is smaller than a predefined threshold (typically when \(|\beta^{(t+1)} - \beta^{(t)}| < 0.00001\) or when the gradient is essentially zero).
Why This Works: At each iteration, Newton-Raphson approximates the complex nonlinear likelihood with a simpler quadratic function that matches the current point’s value, slope, and curvature. We can solve for the maximum of a quadratic easily, move there, and repeat. Within 4-6 iterations, we typically reach the maximum likelihood estimates.
The Key Insight: We never actually “solve” the first-order conditions. Instead, we use the gradient to guide us iteratively toward the solution—like following the steepest uphill path until we reach the peak where the gradient equals zero.
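To make these steps concrete, here is a hand-rolled Newton-Raphson sketch for a simple logit, checked against glm(). The data are simulated purely for illustration and all names (sim_x, sim_y, Xmat) are invented for this sketch; in practice you would just call glm().

# Newton-Raphson for logit by hand (illustrative simulation)
set.seed(123)
sim_n = 1000
sim_x = rnorm(sim_n)
sim_y = rbinom(sim_n, 1, plogis(-0.5 + 1*sim_x))
Xmat = cbind(1, sim_x)                              # model matrix with intercept

beta = c(0, 0)                                      # step 1: initial guess
for (t in 1:25) {
  p = plogis(as.vector(Xmat %*% beta))              # predicted probabilities
  g = as.vector(t(Xmat) %*% (sim_y - p))            # gradient
  H = -t(Xmat) %*% (Xmat * (p * (1 - p)))           # Hessian
  beta_new = beta - as.vector(solve(H, g))          # step 3: quadratic update
  if (max(abs(beta_new - beta)) < 1e-8) break       # convergence check
  beta = beta_new
}
beta_new                                                  # hand-rolled estimates
coef(glm(sim_y ~ sim_x, family = binomial(link = "logit")))  # should match closely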
A Fundamental Trade-off: Functional Form Assumptions
Before moving forward, we should also note that this nonlinearity raises an important critique: while OLS imposes a restrictive linear functional form that can violate probability bounds, models like logit impose their own restrictive assumption – that the relationship between \(X_i\beta\) and \(\Pr(Y_i = 1)\) follows a specific S-shaped (logistic) curve. There’s no theoretical reason why the true relationship must follow exactly these curves.
Let me elaborate on this philosophical point because it’s rarely discussed in introductory treatments but is important for developing good judgment about model choice. Every model makes assumptions. OLS assumes the relationship is linear but makes no assumptions about the distribution of errors (for consistency of estimates). Logit assumes a very specific nonlinear relationship but this guarantees sensible probabilities. There’s no free lunch in statistics – we’re always trading one set of assumptions for another.
As Angrist and Pischke (2009) argue, we’re essentially trading one functional form assumption (linearity) for another (a specific nonlinear shape). The logistic and normal CDFs are convenient mathematical choices that guarantee sensible probabilities, but nature doesn’t necessarily follow our mathematical convenience. In some sense, the transparency of OLS applied to binary data – the linear probability model (LPM) – about its limitations (you can see exactly where it fails) might be preferable to logit/probit’s hidden assumption that we’ve correctly specified the entire shape of the probability curve.
Thus when you choose logit, you’re not just saying “probabilities should be between 0 and 1” – you’re imposing a very specific mathematical relationship at every point along the curve. Consider what the logistic function actually assumes:
Symmetric Effects Around 0.5: Think about voter turnout: moving from 10% to 30% probability of voting (getting politically disengaged people slightly engaged) might be much easier than moving from 70% to 90% (getting already-engaged people to be super-engaged). Or it could be the opposite! The logit model forces symmetry regardless of which is true in reality.
Specific Tail Behavior: In modeling rare events like wars or financial crises, the true probability of extreme events might be higher than what the logistic distribution implies. These “black swan” events might require a distribution with fatter tails.
Fixed Inflection Point: The steepest part of the S-curve (where marginal effects are largest) always occurs at \(\Pr(Y=1) = 0.5\) in logit. But perhaps in reality, the steepest change happens at 0.3 or 0.7. The complementary log-log model, for instance, has its steepest slope at a probability of about 0.63, and the log-log model at about 0.37.
Monotonic Effects: Logit assumes that as \(X\) increases, the probability either always increases (if \(\beta > 0\)) or always decreases (if \(\beta < 0\)), with no plateaus, reversals, or more complex patterns.
A Concrete Example of What Could Go Wrong
Imagine you’re modeling voter turnout as a function of education. The logit model forces a specific relationship: the effect of education on turnout probability must follow the logistic curve exactly. But what if reality is different?
Or consider modeling loan default as a function of credit score:
Why This Assumption is “Hidden”
The assumption is hidden because:
It looks scientific: Using a proper CDF seems more rigorous than LPM’s obvious problems.
We can’t test it directly: Unlike linearity (which we can assess with residual plots), we can’t easily test whether the true probability curve is logistic versus some other S-shape.
Software makes it automatic: You type glm(..., family = binomial) and get results without thinking about what specific curve you’ve imposed.
Small differences in the middle: Logit, probit, and other S-curves are very similar for probabilities between 0.2 and 0.8, where most data lie. The assumption mainly matters at the extremes.
The choice of MLE models like logit versus LPM isn’t really about which is “correct” because none of them are! It’s about which set of assumptions you’re most comfortable making:
As Beck (2020) notes, for many research questions—especially those focused on average effects in experimental or quasi-experimental settings – these functional form debates matter much less than we once thought. But researchers should understand what they’re assuming when they choose a model, rather than treating logit/probit as automatically superior to LPM just because they respect probability bounds.
This is why many applied researchers now emphasize robustness checks across multiple specifications and focus on average marginal effects, which tend to be similar across reasonable models, rather than treating any single functional form as “correct.”
The practical takeaway: always estimate multiple models (LPM, logit, probit) and check whether your substantive conclusions change. If they don’t, you can be more confident in your findings. If they do, you need to think carefully about which assumptions are most appropriate for your specific application.
Below is an example of fitting a logit model in R using simulated data. We’ll generate some data, fit a logit model, and visualize the predicted probabilities.
# generate example data for all models
set.seed(6886)
n = 500
x = rnorm(n, 0, 2)

# true probability follows a logistic curve
true_prob = plogis(0.5 + 0.8*x)
y = rbinom(n, 1, true_prob)

# create data frame
data = data.frame(x = x, y = y)

# create prediction data
pred_data = data.frame(x = seq(-4, 4, 0.1))

# fit lpm model for later comparisons
lpm_model = lm(y ~ x, data = data)
pred_data$lpm_pred = predict(lpm_model, newdata = pred_data)

# fit the logit model
logit_model = glm(y ~ x, data = data, family = binomial(link = "logit"))

# get predictions
pred_data$logit_pred = predict(logit_model, newdata = pred_data, type = "response")
# show model summary
summary(logit_model)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = "logit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.40111 0.11490 3.491 0.000481 ***
## x 0.88113 0.08217 10.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 688.91 on 499 degrees of freedom
## Residual deviance: 486.85 on 498 degrees of freedom
## AIC: 490.85
##
## Number of Fisher Scoring iterations: 5
# visualize the logit predictions
logit_plot = ggplot(pred_data, aes(x = x)) +
  geom_line(aes(y = logit_pred), color = "grey20", size = 1.2) +
  geom_point(data = data, aes(x = x, y = y), alpha = 0.1) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", alpha = 0.5) +
  labs(title = "Logit Model Predictions",
       x = "Independent Variable (X)",
       y = "Predicted Probability") +
  theme_minimal()
print(logit_plot)
Earlier we said that there is an odds interpretation of the logit model. Let’s unpack what that means.
Before we dive into the math, let’s build intuition with a concrete example. Suppose 75 out of 100 people vote. The probability of voting is 0.75. The odds of voting are 75:25 or 3:1—for every three people who vote, one doesn’t. This is different from saying “3 times more likely” (which would be about probability). Odds tell us the ratio of success to failure.
The odds of an event are defined as:
\[\text{Odds} = \frac{\Pr(\text{Event})}{1 - \Pr(\text{Event})}\]
Odds represent how much more likely something is to happen than not happen. Unlike probabilities (which range from 0 to 1), odds range from 0 to infinity:
Odds are preferred by some over probabilities:
Multiplicative symmetry: If the odds of success are 3:1, the odds of failure are 1:3. With probabilities (0.75 and 0.25), the relationship isn’t as clean.
Unbounded above: While probabilities hit a ceiling at 1, odds can grow infinitely, making them better for modeling strong effects.
Natural for comparisons: Odds ratios (comparing odds across groups) are the standard in modern medical research.
The logit model can be interpreted as a linear model for the log-odds:
\[\ln\left[\frac{\Pr(Y_i = 1)}{1 - \Pr(Y_i = 1)}\right] = X_i\beta\]
Taking the logarithm of odds creates a scale that:
Ranges from \(-\infty\) to \(+\infty\) (matching the range of \(X_i\beta\))
Is symmetric around zero (log-odds of 0 = probability of 0.5)
Makes multiplicative changes additive (crucial for interpretation)
Log-odds = 0 → Probability = 0.5
Log-odds = 2 → Probability ≈ 0.88
Log-odds = -2 → Probability ≈ 0.12
As log-odds → ∞ → Probability → 1
As log-odds → -∞ → Probability → 0
The reason we use log-odds rather than odds is that logs convert multiplication into addition. If education multiplies the odds by 2 and income multiplies them by 1.5, the combined effect multiplies the odds by 3. In log terms, this is just addition: log(2) + log(1.5) = log(3).
The Key Insight: In logit models, coefficients represent the change in log-odds for a one-unit change in X.
For a single predictor model: \(\ln(\text{odds}) = \beta_0 + \beta_1 X\)
Concrete Example:
Suppose we’re modeling admission to graduate school based on GRE scores, and \(\beta_{GRE} = 0.003\).
Let’s walk through this calculation in detail. A 100-point increase in GRE changes the log-odds by \(100 \times 0.003 = 0.3\), which multiplies the odds by \(\exp(0.3) \approx 1.35\). If someone with a 600 GRE has a 20% admission probability (odds = 0.25), then someone with a 700 GRE has odds of 0.25 × 1.35 = 0.3375, which corresponds to a probability of 0.3375/(1+0.3375) ≈ 0.252 or 25.2%. Notice that the odds increased by 35% but the probability only increased by about 26% in relative terms (from 20% to 25.2%).
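The same calculation in R; the GRE coefficient and baseline probability are the hypothetical values from the example above, not estimates from real data.

# odds-ratio arithmetic for the hypothetical GRE example
beta_gre = 0.003
or_100pts = exp(100 * beta_gre)                # ~1.35
baseline_prob = 0.20
baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * or_100pts
new_odds / (1 + new_odds)                      # ~0.252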
For Binary Predictors, It’s Even Cleaner:
If we have a treatment variable with \(\beta_{treat} = 1.386\), the odds ratio is \(\exp(1.386) \approx 4\): receiving the treatment multiplies the odds of the outcome by about four.
The Misconception Problem:
People often misinterpret odds ratios as probability ratios. If the odds ratio is 4, that does not mean the outcome is “4 times as likely” – it means the odds are four times as large, which translates into a much smaller probability ratio unless the outcome is rare.
For common outcomes (probability > 0.2), odds ratios exaggerate effect sizes: starting from a baseline probability of 0.5, an odds ratio of 4 moves the probability only to 0.8 – a risk ratio of 1.6, not 4.
This misconception is so common that many researchers avoid reporting odds ratios to lay audiences. In my opinion, it’s usually better to convert to probability differences or risk ratios.
# let's look at odds and log-odds
prob_seq = seq(0.01, 0.99, 0.01)
odds_seq = prob_seq / (1 - prob_seq)
log_odds_seq = log(odds_seq)

odds_df = data.frame(
  Probability = prob_seq,
  Odds = odds_seq,
  LogOdds = log_odds_seq
)

# create visualizations
odds_plot = ggplot(odds_df, aes(x = Odds, y = Probability)) +
  geom_line(color = "blue", size = 1.2) +
  labs(title = "Probability as a Function of Odds",
       subtitle = "Note the asymmetric, bounded relationship") +
  coord_cartesian(xlim = c(0, 10)) +
  geom_vline(xintercept = 1, linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 1, y = 0.2, label = "Even odds\n(P = 0.5)", hjust = -0.1) +
  theme_minimal()

log_odds_plot = ggplot(odds_df, aes(x = LogOdds, y = Probability)) +
  geom_line(color = "red", size = 1.2) +
  labs(title = "Probability as a Function of Log-Odds",
       subtitle = "The logistic function - symmetric around 0") +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = 0.5, linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 0, y = 0.3, label = "Log-odds = 0\n(P = 0.5)", hjust = -0.1) +
  theme_minimal()
grid.arrange(odds_plot, log_odds_plot, ncol = 2)
# show odds ratio interpretation
baseline_prob = 0.2
odds_ratio = 3
baseline_odds = baseline_prob / (1 - baseline_prob)
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)
cat("Example: Baseline probability = 0.2, Odds Ratio = 3\n")
## Example: Baseline probability = 0.2, Odds Ratio = 3
cat(sprintf("Baseline odds: %.3f\n", baseline_odds))
## Baseline odds: 0.250
cat(sprintf("New odds: %.3f\n", new_odds))
## New odds: 0.750
cat(sprintf("New probability: %.3f\n", new_prob))
## New probability: 0.429
cat(sprintf("Probability ratio: %.3f (NOT equal to odds ratio!)\n", new_prob/baseline_prob))
## Probability ratio: 2.143 (NOT equal to odds ratio!)
Now let’s turn to the probit model, logit’s main competitor. The fundamental idea is the same—we’re transforming a linear predictor into a probability using an S-shaped curve. The difference is that probit uses the normal distribution (the familiar bell curve) instead of the logistic distribution.
To understand probit, let’s explore the theoretical connection between the normal distribution assumption and what we observe. When we assume the errors in our latent variable model follow a normal distribution, we’re making a specific claim about the underlying process generating our binary outcomes.
The normal distribution often arises in nature due to the Central Limit Theorem - when many small, independent factors add up, their sum tends to be normally distributed. In the context of binary choices, think about:
This theoretical foundation suggests probit might be especially appropriate when your latent variable represents an accumulation of many influences.
Formally, the probit model assumes \(u_i\) follows a standard normal distribution, with PDF \(\phi(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)\), CDF \(\Phi(u) = \int_{-\infty}^{u}\phi(t)\,dt\), mean 0, and variance 1.
Unlike the logistic CDF, which has the neat closed form \(\frac{e^x}{1+e^x}\), the normal CDF is defined only as an integral – there’s no way to write it without that integral sign. This means probit probabilities must be computed by numerical approximation rather than a simple formula.
# compare normal and logistic distributions
comparison_df = data.frame(
  u = u_seq,
  Normal_PDF = dnorm(u_seq),
  Logistic_PDF = dlogis(u_seq),
  Normal_CDF = pnorm(u_seq),
  Logistic_CDF = plogis(u_seq)
)

# compare probability density functions
pdf_comp = ggplot(comparison_df, aes(x = u)) +
  geom_line(aes(y = Normal_PDF, color = "Normal"), size = 1.2) +
  geom_line(aes(y = Logistic_PDF, color = "Logistic"), size = 1.2, linetype = "dashed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "PDF Comparison: Normal vs Logistic",
       subtitle = "Logistic has fatter tails",
       x = "u", y = "Density") +
  theme_minimal()

# compare cumulative distribution functions
cdf_comp = ggplot(comparison_df, aes(x = u)) +
  geom_line(aes(y = Normal_CDF, color = "Normal"), size = 1.2) +
  geom_line(aes(y = Logistic_CDF, color = "Logistic"), size = 1.2, linetype = "dashed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "CDF Comparison: Normal vs Logistic",
       subtitle = "Very similar S-shaped curves",
       x = "u", y = "Pr(U ≤ u)") +
  theme_minimal()
grid.arrange(pdf_comp, cdf_comp, ncol = 2)
The probit model specifies:
\[\Pr(Y_i = 1) = \Phi(X_i\beta) = \int_{-\infty}^{X_i\beta} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) dt\]
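A quick numerical check that the probit link really is just the standard normal CDF – pnorm() and direct numerical integration of the density give the same answer (the value of the linear predictor here is arbitrary):

# Phi(X_i * beta) via pnorm() and via numerical integration
xb = 0.7
pnorm(xb)
integrate(dnorm, lower = -Inf, upper = xb)$value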
Unlike logit, where \(e^{\beta_k}\) represents the multiplicative change in odds, probit coefficients have no such clean interpretation. You cannot transform a probit coefficient into a meaningful quantity without knowing where you are on the curve. This is why:
The probit likelihood has the same structure as logit:
\[L_i = [\Phi(X_i\beta)]^{Y_i} [1-\Phi(X_i\beta)]^{1-Y_i}\]
But here’s where it gets interesting: The gradient (first derivative) of the probit log-likelihood has the following form:
\[\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^N X_i \cdot \lambda_i\]
where \(\lambda_i\) is built from the inverse Mills ratio: \(\lambda_i = \frac{\phi(X_i\beta)}{\Phi(X_i\beta)}\) for observations with \(Y_i = 1\), and \(\lambda_i = -\frac{\phi(X_i\beta)}{1 - \Phi(X_i\beta)}\) for observations with \(Y_i = 0\).
This ratio appears everywhere in the social sciences – most famously as the correction term in Heckman-style sample selection models.
The fact that probit naturally produces these ratios is one reason it integrates so well with other methods.
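For reference, a tiny sketch of the inverse Mills ratio \(\phi(z)/\Phi(z)\) at a few arbitrary values of the linear predictor:

# inverse Mills ratio at illustrative values
z = c(-2, 0, 2)
dnorm(z) / pnorm(z)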
Computational Quirks of Probit
Because \(\Phi(\cdot)\) requires numerical integration:
The practical difference between probit and logit emerges at the extremes:
Thinner Tails in Probit:
What This Means for Your Research:
When These Differences Actually Matter:
# look at extreme predicted probabilities
X_beta_extreme <- c(-5, -4, -3, 3, 4, 5)
probit_p <- pnorm(X_beta_extreme)
logit_p <- plogis(X_beta_extreme)
ratio <- logit_p/probit_p
# at extremes, logit probabilities can be much larger than probit
print(ratio)
## [1] 2.334838e+04 5.679035e+02 3.513293e+01 9.538617e-01 9.820449e-01
## [6] 9.933074e-01
For most data (where \(X_i\beta \in [-2, 2]\)), the maximum difference in predicted probabilities is about 0.02. But in applications with:
…the choice between probit and logit may become substantive, not just aesthetic.
Despite all the theory, the real reasons are often:
Disciplinary convention:
Software defaults:
Extensions you plan:
Reviewer preferences: Sometimes you choose the model that won’t trigger “why didn’t you use X instead?” from reviewers
The bottom line: For 95% of applications, the choice between probit and logit is inconsequential for your substantive conclusions. The remaining 5% involves extreme probabilities, specific theoretical models, or particular extensions where the choice truly matters.
Below is an example of fitting a probit model in R using the same simulated data as before. We’ll fit the probit model and compare its predictions to the logit and LPM models.
# fit the probit model
probit_model = glm(y ~ x, data = data, family = binomial(link = "probit"))

# get predictions
pred_data$probit_pred = predict(probit_model, newdata = pred_data, type = "response")
# show model summary
summary(probit_model)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = "probit"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.23591 0.06687 3.528 0.000419 ***
## x 0.52161 0.04438 11.752 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 688.91 on 499 degrees of freedom
## Residual deviance: 485.75 on 498 degrees of freedom
## AIC: 489.75
##
## Number of Fisher Scoring iterations: 5
# compare all three models
comparison = ggplot(pred_data, aes(x = x)) +
  geom_line(aes(y = lpm_pred, color = "LPM"), size = 1) +
  geom_line(aes(y = logit_pred, color = "Logit"), size = 1) +
  geom_line(aes(y = probit_pred, color = "Probit"), size = 1, linetype = "dashed") +
  geom_point(data = data, aes(x = x, y = y), alpha = 0.05) +
  geom_hline(yintercept = c(0, 1), linetype = "dotted", alpha = 0.5) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Comparison: LPM vs Logit vs Probit",
       subtitle = "Logit and Probit produce similar predictions; LPM can violate bounds",
       x = "Independent Variable (X)",
       y = "Predicted Probability",
       color = "Model") +
  theme_minimal()
print(comparison)
You might wonder: why are we suddenly talking about utility when we want to model binary outcomes? The answer reveals something profound about statistical modeling—our models often work better when they’re grounded in behavioral theory.
Think of it this way: when someone makes a choice (vote vs. don’t vote, migrate vs. stay, buy vs. don’t buy), they’re implicitly comparing the benefits of each option. The random utility model formalizes this intuition, giving us both theoretical coherence and practical tools.
Binary choice models can also be motivated through utility maximization, providing a behavioral foundation for the statistical models we’ve been discussing. This isn’t just mathematical convenience—it connects our statistical models to decades of research in economics, psychology, and decision science.
The Setup: Let’s make this concrete with an example before we dive into notation.
Imagine a citizen deciding whether to vote in an election:
- Option 0 (Don’t vote): Save time and effort, avoid bad weather, skip researching candidates
- Option 1 (Vote): Fulfill civic duty, influence outcome, express preferences, but incur costs
The citizen will vote if the utility (net benefit) of voting exceeds the utility of not voting. This seems obvious, but here’s the key insight: we can’t observe the citizen’s utility directly. We only see their choice.
This is the classic “paradox of voting”—why do people vote when the probability of being pivotal is essentially zero? The random utility framework helps us model this.
More formally, consider an individual \(i\) choosing between two options: option 0, with utility \(U_{i0}\), and option 1, with utility \(U_{i1}\).
The individual chooses the option with higher utility:
\[Y_i = \begin{cases} 1 & \text{if } U_{i1} > U_{i0} \\ 0 & \text{if } U_{i0} \geq U_{i1} \end{cases}\]
Why “Random” Utility?
The word “random” might seem odd here – aren’t people’s choices deliberate? Yes, but from our perspective as researchers, there’s randomness because we can’t observe everything that influences decisions.
We call it “random” because from the researcher’s perspective, utility has:
This distinction between what we can and cannot observe is crucial—it’s why we need probabilistic models rather than deterministic ones.
Each option’s utility consists of a systematic (observable) and random (unobservable) component:
\[U_{i0} = V_{i0} + \epsilon_{i0} = X_{i0}\beta + \epsilon_{i0}\] \[U_{i1} = V_{i1} + \epsilon_{i1} = X_{i1}\beta + \epsilon_{i1}\]
where \(V_{i0}\) and \(V_{i1}\) are the systematic components built from observable characteristics, and \(\epsilon_{i0}\) and \(\epsilon_{i1}\) capture everything we cannot observe.
The Choice Probability:
Individual \(i\) chooses option 1 when \(U_{i1} > U_{i0}\):
\[\begin{align} \Pr(Y_i = 1) &= \Pr(U_{i1} > U_{i0}) \\ &= \Pr(V_{i1} + \epsilon_{i1} > V_{i0} + \epsilon_{i0}) \\ &= \Pr(\epsilon_{i0} - \epsilon_{i1} < V_{i1} - V_{i0}) \\ &= \Pr(\epsilon_{i0} - \epsilon_{i1} < X_{i1}\beta - X_{i0}\beta) \end{align}\]
Key Insight: The distribution of \((\epsilon_{i0} - \epsilon_{i1})\) determines which model we get.
Let’s pause here because this is a crucial point that often confuses students. We’re about to make an assumption about something we can’t even observe (the error terms). Why does this matter?
Think of it like this: The errors represent all the factors we don’t measure. Different assumptions about these unmeasured factors lead to different models:
The beautiful part? For most applications, these different assumptions lead to very similar predictions. It’s like choosing between different brands of tires—they all get you where you’re going, just with slightly different characteristics.
Case 1: Type I Extreme Value (Gumbel) Errors → Logit
If \(\epsilon_{i0}\) and \(\epsilon_{i1}\) are independent Type I extreme value distributed, then their difference follows a logistic distribution:
\[\Pr(Y_i = 1) = \frac{\exp(V_{i1} - V_{i0})}{1 + \exp(V_{i1} - V_{i0})} = \frac{1}{1 + \exp(-(V_{i1} - V_{i0}))}\]
Why Extreme Value? These distributions arise naturally when modeling the maximum of many random shocks—fitting for choices where people pick the “best” option.
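A simulation sketch of this result (the systematic utility difference and sample size are arbitrary): with independent standard Gumbel draws for the two errors, the empirical choice probability matches the logistic CDF of \(V_{i1} - V_{i0}\).

# Gumbel (Type I extreme value) errors imply logit choice probabilities
set.seed(99)
n_sim = 200000
v_diff = 0.8                                 # V_i1 - V_i0 (arbitrary value)
rgumbel = function(n) -log(-log(runif(n)))   # standard Gumbel draws via inverse CDF
e0 = rgumbel(n_sim); e1 = rgumbel(n_sim)
mean(v_diff + e1 > e0)                       # empirical Pr(choose option 1)
plogis(v_diff)                               # logistic CDF prediction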
Case 2: Normal Errors → Probit
If \(\epsilon_{i0}\) and \(\epsilon_{i1}\) are jointly normal, then their difference is also normal:
\[\epsilon_{i0} - \epsilon_{i1} \sim N(0, \sigma^2_0 + \sigma^2_1 - 2\sigma_{01})\]
Usually we assume independence (\(\sigma_{01} = 0\)) and normalize the variance, giving:
\[\Pr(Y_i = 1) = \Phi\left(\frac{V_{i1} - V_{i0}}{\sigma}\right)\]
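The same exercise with independent standard normal errors recovers the probit formula; note that the difference of the two errors has variance 2, so \(\sigma = \sqrt{2}\) in this sketch (values again arbitrary).

# normal errors imply probit choice probabilities
set.seed(100)
n_sim = 200000
v_diff = 0.8
e0 = rnorm(n_sim); e1 = rnorm(n_sim)
mean(v_diff + e1 > e0)                       # empirical choice probability
pnorm(v_diff / sqrt(2))                      # Phi((V_i1 - V_i0) / sigma)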
The Fundamental Problem: We can never separately identify:
An Analogy to Make This Clear:
Imagine you’re trying to measure “democratic quality.” Different indices might:
Are these measuring the same thing? How do we compare? We need to fix a scale. Similarly, with utility, we need to pin down some reference points to make our model identifiable.
Standard Normalizations:
After normalization, our familiar models emerge: \(\Pr(Y_i = 1) = \Lambda(X_i\beta)\) for logit or \(\Pr(Y_i = 1) = \Phi(X_i\beta)\) for probit,
where \(X_i\) now represents the difference in characteristics between options.
1. Behavioral Foundation:
Instead of just saying “we model probability with an S-curve,” we can say “individuals maximize utility with random components.”
2. Welfare Analysis:
We can compute consumer surplus and welfare effects because we have a utility interpretation.
3. Extensions to Multiple Choices:
The framework naturally extends to:
4. Structural Interpretation:
Coefficients represent preference parameters, not just statistical associations.
When the Random Utility Framework Matters:
When It’s Just a Story:
Often, the random utility model is just a motivation – we’re really just fitting a nonlinear probability model. That’s fine! But be aware that:
The McFadden Revolution:
Daniel McFadden won the Nobel Prize (2000) for developing these models. His key insight: by assuming specific error distributions, we can turn utility maximization into tractable statistical models.
Consider modeling whether State A initiates conflict with State B:
Utility from peace: \(U_{Peace} = \beta_0 + \beta_1 \cdot \text{TradeDependence} + \epsilon_P\)
Utility from conflict: \(U_{Conflict} = \beta_2 \cdot \text{RelativePower} + \beta_3 \cdot \text{TerritorialDispute} + \epsilon_C\)
Conflict occurs when: \(U_{Conflict} > U_{Peace}\)
After differencing and normalizing:
\[\Pr(\text{Conflict}) = \Lambda(\beta_2 \cdot \text{RelativePower} + \beta_3 \cdot \text{TerritorialDispute} - \beta_1 \cdot \text{TradeDependence} - \beta_0)\]
The coefficients now have structural interpretation:
This is the same logit model we’d fit statistically, but now with behavioral meaning. Cool, right?!
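To see how the structural equation above translates into a probability, here is a sketch with entirely made-up coefficient and covariate values; none of these numbers come from real data or a fitted model.

# hypothetical illustration of the structural conflict logit
b0 = 1.0; b1 = 2.0; b2 = 1.5; b3 = 0.8
relative_power = 0.6; territorial_dispute = 1; trade_dependence = 0.3
plogis(b2*relative_power + b3*territorial_dispute - b1*trade_dependence - b0)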
Now we come to a section that might seem highly technical but actually has very practical implications for your research. The assumption about the distribution of errors isn’t mere mathematical minutiae – it affects how your model behaves in real applications.
The assumption about the distribution of \(u_i\) is not merely a technical detail but has important practical implications that researchers should understand.
Think of it this way: The error distribution is like choosing the lens through which you view your data. Different lenses (distributions) will:
Let’s explore what this means in practice:
First, what do we mean by “tails”?
In statistics, the “tails” of a distribution refer to the extreme values—the very high or very low outcomes that occur rarely. For binary models, this translates to:
The error distribution determines how the model handles extreme probabilities. This might seem abstract, so let’s use a concrete example:
Imagine modeling democratic breakdowns or coups:
Logistic Distribution (Logit): Has heavier tails (excess kurtosis = 1.2), meaning:
Normal Distribution (Probit): Has lighter tails (excess kurtosis = 0), meaning:
Practical Implication: If your phenomenon involves rare but impactful shocks (coups, revolutions, war onset), logit’s heavier tails may be more appropriate. If the process aggregates many small influences (voting behavior from multiple social pressures, gradual democratization), probit’s normal assumption may be preferable.
The variance of \(u_i\) is not identified separately from \(\beta\): logit fixes it at \(\pi^2/3 \approx 3.29\) while probit fixes it at 1, which is why logit coefficients are typically about 1.6–1.8 times larger than probit coefficients fit to the same data.
This means:
Practical Implication: Never compare raw coefficients between logit and probit models. Always transform to a common scale (predicted probabilities, marginal effects, or standardized coefficients).
The error distribution connects to different theoretical frameworks:
Logistic (Logit):
Normal (Probit):
Practical Implication: Choose based on your theoretical model. If you’re extending to multinomial choice, logit provides consistency. If you’re embedding in a larger structural model with normal assumptions, probit maintains coherence.
When choosing between error distributions, examine:
A. Empirical Fit at the Tails
# check how accurate predictions are at extreme values
# (assumes predicted_prob holds the model's fitted probabilities for each row of data)
extreme_low <- data[predicted_prob < 0.1, ]
extreme_high <- data[predicted_prob > 0.9, ]
# compare what actually happened vs what we predicted
B. Theoretical Consistency
C. Sensitivity Analysis
Fit both models and compare:
D. Warning Signs of Misspecification
Sometimes neither logistic nor normal errors are appropriate:
Asymmetric Processes:
Contaminated Processes:
Bounded Rationality:
Practical Implication: The logit/probit choice is often less important than recognizing when neither is appropriate. Large discrepancies between models or poor fit at extremes suggest exploring alternative link functions.
For most applications, the choice between logistic and normal errors is inconsequential – both will yield similar substantive conclusions. However, researchers should:
As Amemiya (1981) noted, “the choice between logit and probit models is largely one of convenience and convention.”
So far, we’ve focused on the symmetric S-curves of logit and probit. But what if your phenomenon doesn’t follow that symmetric pattern? What if the approach to 0 is fundamentally different from the approach to 1?
This is where alternative link functions come in. These aren’t just mathematical curiosities—they can better capture the underlying process generating your data.
Before diving into the math, let’s understand when you might need an asymmetric model:
Example: Democratic Transitions
Another Example: Revolution or Civil War Onset
The complementary log-log (cloglog) model uses an asymmetric link function:
\[\Pr(Y_i = 1) = 1 - \exp[-\exp(X_i\beta)]\]
Equivalently:
\[\ln\{-\ln[1 - \Pr(Y_i = 1)]\} = X_i\beta\]
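A quick sanity check in R that these two statements of the model are inverses of one another (the linear predictor value is arbitrary):

# cloglog link and inverse link round-trip
xb = 0.4
p = 1 - exp(-exp(xb))   # inverse link: probability implied by xb
log(-log(1 - p))        # link applied to p recovers xb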
Understanding the Double Exponential
This looks complicated, but there’s intuition here. Let’s build it up step by step with a concrete example.
Example: Modeling Government Collapse
Imagine we’re modeling the probability a government falls within a year:
\(\exp(X_i\beta)\) creates a positive hazard rate
\(-\exp(X_i\beta)\) makes it negative
\(\exp[-\exp(X_i\beta)]\) is the survival probability (probability of government surviving)
\(1 - \exp[-\exp(X_i\beta)]\) is the probability of government collapse
This structure – built by working from the inside out – comes from survival analysis, where we model the hazard (instantaneous risk) of an event.
1. Discrete-Time Survival Analysis
Cloglog naturally arises when you observe a continuous-time process at discrete intervals. If the underlying hazard is constant within intervals:
\[\Pr(\text{Event in interval } t) = 1 - \exp[-\exp(\alpha_t + X_i\beta)]\]
Example: Modeling democratic transitions observed annually when the underlying democratization pressure is continuous. Or modeling war onset observed yearly when tensions build continuously.
2. Rare Events That Can Escalate Quickly
The asymmetry matches processes where:
Examples:
3. Theoretical Justification from Extreme Value Theory
If \(Y = 1\) when the maximum of many random shocks exceeds a threshold, and these shocks follow a specific distribution, you get cloglog.
Unlike logit (where coefficients are log-odds ratios), cloglog coefficients have a different interpretation. This is where many researchers get confused, so let’s break it down carefully.
\[\ln\{-\ln[1 - \Pr(Y_i = 1)]\} = X_i\beta\]
What does this mean in plain English?
Practical Tip: Focus on predicted probabilities and marginal effects rather than raw coefficients.
# define the link functions
x_range = seq(-3, 3, 0.01)
logit_probs = plogis(x_range)
probit_probs = pnorm(x_range)
cloglog_probs = 1 - exp(-exp(x_range))
loglog_probs = exp(-exp(-x_range)) # Note: reversed interpretation

link_comparison = data.frame(
  x = x_range,
  Logit = logit_probs,
  Probit = probit_probs,
  Cloglog = cloglog_probs,
  Loglog = loglog_probs
)

link_plot = link_comparison %>%
  pivot_longer(cols = -x, names_to = "Model", values_to = "Probability") %>%
  ggplot(aes(x = x, y = Probability, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Comparison of Link Functions",
       subtitle = "Note the asymmetry in cloglog and loglog",
       x = "Linear Predictor (Xβ)",
       y = "Pr(Y = 1)") +
  theme_minimal() +
  geom_hline(yintercept = 0.5, linetype = "dotted", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dotted", alpha = 0.5)
print(link_plot)
# show where each model reaches key probabilities
key_probs = c(0.1, 0.5, 0.9)
for(p in key_probs) {
  cat(sprintf("\nXβ value needed for Pr(Y=1) = %.1f:\n", p))
  cat(sprintf("  Logit:   %6.3f\n", qlogis(p)))
  cat(sprintf("  Probit:  %6.3f\n", qnorm(p)))
  cat(sprintf("  Cloglog: %6.3f\n", log(-log(1-p))))
  cat(sprintf("  Loglog:  %6.3f\n", -log(-log(p))))
}
##
## Xβ value needed for Pr(Y=1) = 0.1:
## Logit: -2.197
## Probit: -1.282
## Cloglog: -2.250
## Loglog: -0.834
##
## Xβ value needed for Pr(Y=1) = 0.5:
## Logit: 0.000
## Probit: 0.000
## Cloglog: -0.367
## Loglog: 0.367
##
## Xβ value needed for Pr(Y=1) = 0.9:
## Logit: 2.197
## Probit: 1.282
## Cloglog: 0.834
## Loglog: 2.250
# fit the cloglog model
cloglog_model = glm(y ~ x, data = data,
                    family = binomial(link = "cloglog"))
summary(cloglog_model)
##
## Call:
## glm(formula = y ~ x, family = binomial(link = "cloglog"), data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.21156 0.07242 -2.921 0.00349 **
## x 0.56631 0.05024 11.272 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 688.91 on 499 degrees of freedom
## Residual deviance: 488.21 on 498 degrees of freedom
## AIC: 492.21
##
## Number of Fisher Scoring iterations: 5
# compare predictions at extreme values
extreme_x = c(-2, -1, 0, 1, 2)
pred_comparison = data.frame(
  x = extreme_x,
  Logit = predict(logit_model, newdata = data.frame(x = extreme_x), type = "response"),
  Probit = predict(probit_model, newdata = data.frame(x = extreme_x), type = "response"),
  Cloglog = predict(cloglog_model, newdata = data.frame(x = extreme_x), type = "response")
)

print("\nPredicted probabilities at different X values:")
## [1] "\nPredicted probabilities at different X values:"
print(pred_comparison)
## x Logit Probit Cloglog
## 1 -2 0.2040555 0.2097422 0.2295298
## 2 -1 0.3822494 0.3875535 0.3683281
## 3 0 0.5989553 0.5932502 0.5548412
## 4 1 0.7828308 0.7756334 0.7596912
## 5 2 0.8969111 0.8995768 0.9188912
The log-log model is the “reverse” of cloglog:
\[\Pr(Y_i = 1) = \exp[-\exp(-X_i\beta)]\]
Key Differences from Cloglog:
Rare Non-Events: When you’re modeling the probability of something NOT happening, and non-occurrence can become certain quickly:
The Coefficient Interpretation Trap
Because of how log-log is parameterized, interpretation is tricky:
This is so counterintuitive that many researchers avoid log-log entirely, preferring to:
The scobit (skewed-logit) model, introduced by Nagler (1994), generalizes the logit model to allow for asymmetric response curves while maintaining the odds ratio framework.
Why would you want this?
Imagine you’re modeling legislative voting behavior and you suspect:
Or consider modeling alliance formation:
Scobit lets you test whether this asymmetry exists while keeping the interpretational framework of logit.
The scobit uses the Burr-10 (of course, right? lol) distribution:
\[\Pr(Y_i = 1) = \frac{1}{[1 + \exp(-X_i\beta)]^\alpha}\]
where \(\alpha > 0\) is a shape parameter.
What α Does:
Theoretical Appeal:
Practical Problems:
1. Identification Issues
The shape parameter \(\alpha\) and slope coefficients \(\beta\) can be hard to separately identify:
2. Interpretation Complexity
With varying \(\alpha\):
Strong Theoretical Reason for Flexible Asymmetry: You believe the response curve could be asymmetric in either direction and want to let the data decide.
Large Sample with Good Coverage: You have thousands of observations with outcomes across the probability range.
Model Selection Exercise: You’re comparing multiple link functions and want the most flexible option.
# demonstrate scobit with different alpha values
scobit_cdf = function(x, alpha) {
  1 / (1 + exp(-x))^alpha
}

alpha_values = c(0.2, 0.5, 1, 2, 3)
scobit_df = expand.grid(
  x = x_range,
  alpha = alpha_values
) %>%
  mutate(
    Probability = mapply(scobit_cdf, x, alpha),
    Alpha = factor(alpha, labels = paste("α =", alpha_values))
  )

scobit_plot = ggplot(scobit_df, aes(x = x, y = Probability, color = Alpha)) +
  geom_line(size = 1.2) +
  labs(title = "Scobit CDFs with Varying Shape Parameters",
       subtitle = "α = 1 corresponds to standard logit",
       x = "Linear Predictor (Xβ)",
       y = "Pr(Y = 1)") +
  theme_minimal() +
  geom_hline(yintercept = c(0.37, 0.5, 0.63), linetype = "dotted", alpha = 0.3) +
  # log-log is steepest near p = 0.37, logit at p = 0.5, cloglog near p = 0.63
  annotate("text", x = -2.5, y = 0.39, label = "Log-log max slope", size = 3) +
  annotate("text", x = -2.5, y = 0.52, label = "Logit max slope", size = 3) +
  annotate("text", x = -2.5, y = 0.65, label = "Cloglog max slope", size = 3)
print(scobit_plot)
# demonstrate the identification challenge
# show how different parameter combinations can give similar curves
demo_x = seq(-3, 3, 0.1)

# case 1: α = 0.5, β = 1
case1_prob = scobit_cdf(demo_x * 1, 0.5)

# case 2: α = 2, β = 0.5
case2_prob = scobit_cdf(demo_x * 0.5, 2)

ident_df = data.frame(
  x = demo_x,
  Case1 = case1_prob,
  Case2 = case2_prob
)

ident_plot = ggplot(ident_df, aes(x = x)) +
  geom_line(aes(y = Case1, color = "α=0.5, β=1"), size = 1.2) +
  geom_line(aes(y = Case2, color = "α=2, β=0.5"), size = 1.2, linetype = "dashed") +
  labs(title = "Identification Challenge in Scobit",
       subtitle = "Different parameter combinations can yield similar curves",
       x = "X (unstandardized)",
       y = "Pr(Y = 1)",
       color = "Parameters") +
  theme_minimal()
print(ident_plot)
Uses the Cauchy distribution (t-distribution with 1 df): \[\Pr(Y_i = 1) = \frac{1}{2} + \frac{1}{\pi}\arctan(X_i\beta)\]
Properties: the Cauchy’s very heavy tails mean predicted probabilities approach 0 and 1 slowly, so observations far from the threshold exert less pull on the fit.
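A quick check that the arctan formula above matches R’s built-in Cauchy CDF; the cauchit link is available directly in glm() via binomial(link = "cauchit").

# cauchit link: arctan formula vs the Cauchy CDF
xb = 1.2
0.5 + atan(xb)/pi
pcauchy(xb)
# usage sketch: glm(y ~ x, family = binomial(link = "cauchit"), data = data)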
Uses the t-distribution with ν degrees of freedom: \[\Pr(Y_i = 1) = F_t(X_i\beta; \nu)\]
Properties: heavier tails than the normal when \(\nu\) is small; as \(\nu \to \infty\) it converges to probit, and \(\nu \approx 7\) gives a close approximation to logit.
Robust version of probit using t-distribution:
Is symmetry reasonable?
Which direction is the asymmetry?
Do you have a survival/duration interpretation?
Sample size and coverage?
# fit multiple models for comparison
models = list(
  logit = glm(y ~ x, family = binomial(link = "logit"), data = data),
  probit = glm(y ~ x, family = binomial(link = "probit"), data = data),
  cloglog = glm(y ~ x, family = binomial(link = "cloglog"), data = data),
  loglog = glm(y ~ x, family = binomial(link = "loglog"), data = data)
  # note: "loglog" is not a built-in binomial link in base R, so this last
  # line typically requires a custom link object or another package
)

# compare using information criteria
sapply(models, AIC)
sapply(models, BIC)

# vuong test for non-nested models (if needed)
# likelihood ratio tests for nested models
Despite all these options, in practice:
- 90% of papers use logit or probit
- 8% use cloglog (mostly survival analysis)
- 1% use log-log
- <1% use scobit or other exotic links
Why? Because:
The bottom line: Use alternative link functions when you have a strong theoretical reason or when standard models clearly fail. Otherwise, the added complexity rarely justifies the minimal gain in fit.
We’ve estimated our models and have a table full of coefficients. Now comes the crucial question: what do these numbers actually mean? This is where many analyses go wrong—researchers report coefficients without translating them into quantities that readers can understand.
Binary response model coefficients don’t have a direct interpretation like OLS. Let’s see why with a comparison:
In OLS (Linear Regression):
In Logit (Modeling conflict onset):
In Probit (Modeling voting behavior):
Instead:
The Fundamental Challenge
In OLS, a one-unit change in X always leads to a \(\beta\) change in Y. Simple. In binary response models, the effect of X on probability depends on where you start:
This is why we can’t interpret coefficients directly – we need to know the baseline probability first.
The marginal effect of \(X\) on \(\Pr(Y = 1)\) depends on the current values of all variables:
\[\frac{\partial \Pr(Y_i = 1)}{\partial X_{ik}} = f(X_i\beta) \cdot \beta_k\]
where \(f(\cdot)\) is the PDF of the link function.
Breaking This Down:
1. Marginal Effects at the Mean (MEM)
When to use: When you need a single summary number and your audience expects it
Problem: The “average state” might not exist. Example: If your sample is 50% democracies and 50% autocracies, the “average” country has regime type = 0.5, which is meaningless! Or the “average” legislator being half Republican and half Democrat.
2. Average Marginal Effects (AME) - The Observed Value Approach
Why AME/Observed Values? As Hanmer & Kalkan (2013) argue, this approach:
- Respects the actual distribution of your data
- Avoids creating impossible “average” units
- Better represents population heterogeneity
- Accounts for the nonlinearity of the model across the full range of data
- Gives you the average effect for real units, not the effect for an imaginary average unit
Key insight: In nonlinear models, the effect at the mean ≠ mean of the effects!
3. Marginal Effects at Representative Values (MER)
# calculate Average Marginal Effects (AME) - the observed value approach
margins_logit_ame = margins(logit_model)
summary(margins_logit_ame)
## factor AME SE z p lower upper
## x 0.1418 0.0060 23.7169 0.0000 0.1301 0.1535
# compare with Marginal Effects at Means (MEM) - the traditional approach
# First create a dataset with all variables at their means
mean_data = data.frame(x = mean(data$x))
margins_logit_mem = margins(logit_model, data = mean_data)
summary(margins_logit_mem)
## factor AME SE z p lower upper
## x 0.2149 0.0197 10.8903 0.0000 0.1762 0.2535
# Show the difference
cat("\nComparing approaches:\n")
##
## Comparing approaches:
cat("AME (Observed Values):", summary(margins_logit_ame)$AME[1], "\n")
## AME (Observed Values): 0.1417677
cat("MEM (At Means):", summary(margins_logit_mem)$AME[1], "\n")
## MEM (At Means): 0.2148566
cat("Difference can be substantial when data is not centered around 0.5 probability!\n")
## Difference can be substantial when data is not centered around 0.5 probability!
# create visualization of how marginal effects vary with X
x_vals  = seq(-3, 3, 0.1)
me_data = data.frame(x = x_vals)

# For logit: ME = β * p(1-p)
logit_coef = coef(logit_model)[2]
me_data$logit_prob = predict(logit_model, newdata = me_data, type = "response")
me_data$logit_me   = logit_coef * me_data$logit_prob * (1 - me_data$logit_prob)

# For probit: ME = β * φ(Xβ)
probit_coef = coef(probit_model)[2]
probit_xb   = predict(probit_model, newdata = me_data, type = "link")
me_data$probit_me = probit_coef * dnorm(probit_xb)

# For LPM: ME is constant
lpm_coef = coef(lpm_model)[2]
me_data$lpm_me = lpm_coef

me_plot = me_data %>%
  pivot_longer(cols = c(logit_me, probit_me, lpm_me),
               names_to = "Model",
               values_to = "MarginalEffect") %>%
  mutate(Model = str_remove(Model, "_me") %>% str_to_title()) %>%
  ggplot(aes(x = x, y = MarginalEffect, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Marginal Effects Across Different Models",
       subtitle = "Note how logit/probit effects vary with X, while LPM is constant",
       x = "Value of X",
       y = "Marginal Effect (∂Pr/∂X)") +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "dotted", alpha = 0.5)
print(me_plot)
# show the distribution of marginal effects in the actual data
# (per-observation effects live in the dydx_* columns of a margins object;
# their mean is the AME)
actual_me_logit = margins(logit_model)
hist(actual_me_logit$dydx_x,
     main = "Distribution of Marginal Effects in Sample",
     xlab = "Individual Marginal Effect",
     col = "lightblue")
abline(v = mean(actual_me_logit$dydx_x), col = "red", lwd = 2)
text(mean(actual_me_logit$dydx_x), par("usr")[4]*0.9,
     "AME", pos = 4, col = "red")
What is “identification” and why should you care?
Identification means being able to uniquely determine the parameters of your model from the data. Without it, multiple different parameter sets could explain your data equally well—obviously a problem!
For latent variable models, we cannot separately identify:
The location of \(Y^*\) (where is zero?)
The scale of \(Y^*\) (what’s the unit?)
The variance of errors
The Analogy: It's like measuring temperature. We can't identify where zero sits or how large a degree is from the readings alone.
To achieve identification, we fix the threshold at zero (location) and fix the error variance (scale): \(\text{Var}(u_i) = \pi^2/3\) for logit and \(\text{Var}(u_i) = 1\) for probit.
Why these specific values? They are the variances of the standard logistic and standard normal distributions, so the familiar CDFs can be used without any extra scaling.
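A quick numerical check of those two normalizations (a sketch, base R only):
# the two variance normalizations behind logit and probit (sketch)
pi^2 / 3           # about 3.29: variance of the standard logistic distribution
var(rlogis(1e6))   # simulated check, lands near pi^2 / 3
var(rnorm(1e6))    # about 1: variance of the standard normal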
# demonstrate scale invariance of probabilities
# original probit model
probit1 = glm(y ~ x, data = data, family = binomial(link = "probit"))

# create scaled version (multiply coefficients by 2)
original_coef = coef(probit1)
scaled_coef   = original_coef * 2

# predictions at original scale
x_test        = c(-2, -1, 0, 1, 2)
original_xb   = original_coef[1] + original_coef[2] * x_test
original_prob = pnorm(original_xb)

# predictions at scaled version
scaled_xb = scaled_coef[1] + scaled_coef[2] * x_test
# to get same probabilities, need variance = 4, so sd = 2
scaled_prob = pnorm(scaled_xb / 2)

comparison_df = data.frame(
  X = x_test,
  Original = original_prob,
  Scaled_Doubled = scaled_prob,
  Difference = abs(original_prob - scaled_prob)
)
print("Demonstrating Scale Invariance of Probabilities:")
## [1] "Demonstrating Scale Invariance of Probabilities:"
print(comparison_df)
## X Original Scaled_Doubled Difference
## 1 -2 0.2097422 0.2097422 0
## 2 -1 0.3875535 0.3875535 0
## 3 0 0.5932502 0.5932502 0
## 4 1 0.7756334 0.7756334 0
## 5 2 0.8995768 0.8995768 0
# show what this means for comparing models
cat("\nWhat this means for practice:\n")
##
## What this means for practice:
cat("1. Can't compare raw coefficients across:\n")
## 1. Can't compare raw coefficients across:
cat(" - Different models (logit vs probit)\n")
## - Different models (logit vs probit)
cat(" - Different samples (unless variance is constant)\n")
## - Different samples (unless variance is constant)
cat(" - Models with different covariates\n")
## - Models with different covariates
cat("2. CAN compare:\n")
## 2. CAN compare:
cat(" - Predicted probabilities\n")
## - Predicted probabilities
cat(" - Marginal effects\n")
## - Marginal effects
cat(" - Ratios of coefficients within same model\n")
## - Ratios of coefficients within same model
In regular regression, heteroscedasticity means the variance of errors changes across observations. It’s annoying but not catastrophic—you can use robust standard errors and move on.
In binary response models, it’s much more serious.
Unlike linear models, where heteroscedasticity "just" affects standard errors, in binary models it affects everything: the identification strategy above forces us to fix a single error variance, so variance that actually differs across observations gets folded into the coefficients themselves.
If \(\text{Var}(u_i|X_i) = \sigma_i^2\) varies across observations, the coefficient estimates, predicted probabilities, and marginal effects are all distorted, not just the standard errors; the small simulation below illustrates the problem.
Solutions: model the variance directly (a heteroskedastic probit), adopt a more flexible link, or fall back on the LPM with robust standard errors, since its coefficients do not depend on a variance normalization.
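A small simulation (a sketch, base R only) makes the point: the latent slope on x is 1 throughout, but the latent error standard deviation is 1 in one group and 3 in the other, so a pooled probit that assumes a common variance recovers neither group's scaled slope:
# neglected heteroskedasticity in the latent error distorts probit coefficients (sketch)
set.seed(101)
n_h = 5000
het_df = data.frame(x_h = rnorm(n_h), grp_h = rbinom(n_h, 1, 0.5))
het_df$sd_h = ifelse(het_df$grp_h == 1, 3, 1)                        # error SD: 1 vs 3
het_df$y_h  = as.numeric(1 * het_df$x_h + rnorm(n_h, sd = het_df$sd_h) > 0)

# probit fixes Var(u) = 1 for everyone, so the pooled slope mixes the two scales
coef(glm(y_h ~ x_h, data = het_df, family = binomial(link = "probit")))["x_h"]

# fitting each group separately recovers the scaled slopes (about 1 and about 1/3)
coef(glm(y_h ~ x_h, data = subset(het_df, grp_h == 0),
         family = binomial(link = "probit")))["x_h"]
coef(glm(y_h ~ x_h, data = subset(het_df, grp_h == 1),
         family = binomial(link = "probit")))["x_h"]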
You’ve estimated several models — logit, probit, maybe cloglog. How do you choose? More importantly, how do you know if any of them fit well?
This section will give you practical tools for model selection and evaluation. But first, a warning: don’t get too obsessed with finding the “best” model by statistical criteria. Often, the choice should be driven by:
# compare predictions
pred_comparison = data.frame(
  logit = predict(logit_model, type = "response"),
  probit = predict(probit_model, type = "response")
)
cor_pred = cor(pred_comparison$logit, pred_comparison$probit)
ggplot(pred_comparison, aes(x = logit, y = probit)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(title = "Logit vs Probit Predictions",
subtitle = paste("Correlation:", round(cor_pred, 4)),
x = "Logit Predicted Probability",
y = "Probit Predicted Probability") +
theme_minimal()
# compare coefficients (need to scale for comparison)
# rule of thumb: logit ≈ 1.6 to 1.8 × probit; sqrt(pi^2/3) ≈ 1.81 is used here
scale_factor = sqrt(pi^2/3)
scaled_comparison = data.frame(
  Variable = names(coef(logit_model)),
  Logit = coef(logit_model),
  Probit = coef(probit_model),
  Ratio = coef(logit_model) / coef(probit_model),
  Expected_Ratio = scale_factor
)
print("Coefficient Comparison:")
## [1] "Coefficient Comparison:"
print(scaled_comparison)
## Variable Logit Probit Ratio Expected_Ratio
## (Intercept) (Intercept) 0.4011140 0.2359139 1.700256 1.813799
## x x 0.8811257 0.5216150 1.689226 1.813799
cat("\nNote: Logit/Probit ratio should be approximately", round(scale_factor, 2), "\n")
##
## Note: Logit/Probit ratio should be approximately 1.81
Unlike continuous outcomes where R² tells a clear story, binary models have multiple, non-equivalent measures of fit:
1. Pseudo-R² Measures (None are true R²!)
2. Classification Measures
3. Probability Measures
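Before the packaged comparison below, here is a minimal sketch that computes one measure from each family by hand for the logit model fit earlier (McFadden's pseudo-R², accuracy at a 0.5 cutoff, and the Brier score):
# one fit measure from each family, computed by hand for logit_model (sketch)
p_hat = predict(logit_model, type = "response")

null_fit    = glm(y ~ 1, data = data, family = binomial())            # intercept-only model
mcfadden_r2 = 1 - as.numeric(logLik(logit_model) / logLik(null_fit))  # pseudo-R²

accuracy = mean((p_hat > 0.5) == data$y)   # classification at a 0.5 cutoff
brier    = mean((p_hat - data$y)^2)        # mean squared error of the probabilities

c(McFadden = mcfadden_r2, Accuracy = accuracy, Brier = brier)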
# function to calculate various metrics
evaluate_model = function(model, data, model_name) {
  # Handle LPM differently (it's lm, not glm)
  if(class(model)[1] == "lm") {
    pred_prob = fitted(model)
    # Truncate LPM predictions to [0,1] for fair comparison
    pred_prob = pmax(0, pmin(1, pred_prob))
  } else {
    pred_prob = predict(model, type = "response")
  }

  pred_class = ifelse(pred_prob > 0.5, 1, 0)
  actual = data$y

  # Confusion matrix metrics
  conf_matrix = table(Predicted = pred_class, Actual = actual)
  accuracy = sum(diag(conf_matrix)) / sum(conf_matrix)

  # Sensitivity and Specificity
  sensitivity = conf_matrix[2,2] / sum(conf_matrix[,2])
  specificity = conf_matrix[1,1] / sum(conf_matrix[,1])

  # Log-likelihood and information criteria
  if(class(model)[1] == "lm") {
    # For LPM, calculate likelihood assuming normal errors
    n = length(actual)
    rss = sum((actual - pred_prob)^2)
    loglik = -n/2 * (log(2*pi) + log(rss/n) + 1)
    aic = -2*loglik + 2*length(coef(model))
    bic = -2*loglik + log(n)*length(coef(model))
  } else {
    loglik = as.numeric(logLik(model))
    aic = AIC(model)
    bic = BIC(model)
  }

  # McFadden's R-squared (not meaningful for LPM)
  if(class(model)[1] != "lm") {
    null_model = glm(y ~ 1, data = data, family = binomial())
    mcfadden_r2 = 1 - (logLik(model) / logLik(null_model))
  } else {
    mcfadden_r2 = NA
  }

  # ROC and AUC
  roc_obj = roc(actual, pred_prob, quiet = TRUE)
  auc_val = auc(roc_obj)

  # Brier Score
  brier = mean((pred_prob - actual)^2)

  return(data.frame(
    Model = model_name,
    Accuracy = round(accuracy, 3),
    Sensitivity = round(sensitivity, 3),
    Specificity = round(specificity, 3),
    AUC = round(as.numeric(auc_val), 3),
    Brier = round(brier, 4),
    McFadden_R2 = ifelse(is.na(mcfadden_r2), NA, round(as.numeric(mcfadden_r2), 3))
  ))
}
# evaluate all models
evaluation_results = rbind(
  evaluate_model(lpm_model, data, "LPM"),
  evaluate_model(logit_model, data, "Logit"),
  evaluate_model(probit_model, data, "Probit"),
  evaluate_model(cloglog_model, data, "Cloglog")
)
print("Model Comparison:")
## [1] "Model Comparison:"
print(evaluation_results)
## Model Accuracy Sensitivity Specificity AUC Brier McFadden_R2
## 1 LPM 0.758 0.769 0.744 0.84 0.1642 NA
## 2 Logit 0.758 0.777 0.736 0.84 0.1624 0.293
## 3 Probit 0.758 0.773 0.740 0.84 0.1623 0.295
## 4 Cloglog 0.756 0.755 0.758 0.84 0.1634 0.291
# roc curves
par(mfrow = c(1, 2))
# roc plot
roc_logit   = roc(data$y, predict(logit_model, type = "response"), quiet = TRUE)
roc_probit  = roc(data$y, predict(probit_model, type = "response"), quiet = TRUE)
roc_cloglog = roc(data$y, predict(cloglog_model, type = "response"), quiet = TRUE)
roc_lpm     = roc(data$y, fitted(lpm_model), quiet = TRUE)
plot(roc_logit, main = "ROC Curves", col = "red")
plot(roc_probit, add = TRUE, col = "blue")
plot(roc_cloglog, add = TRUE, col = "green")
plot(roc_lpm, add = TRUE, col = "purple", lty = 2)
legend("bottomright",
legend = c("Logit", "Probit", "Cloglog", "LPM"),
col = c("red", "blue", "green", "purple"),
lty = c(1, 1, 1, 2),
cex = 0.8)
# calibration plot
logit_probs = predict(logit_model, type = "response")
calib_df = data.frame(
  pred = logit_probs,
  actual = data$y
)
calib_df$bin = cut(calib_df$pred, breaks = seq(0, 1, 0.1))
calib_summary = calib_df %>%
  group_by(bin) %>%
  summarise(
    mean_pred = mean(pred),
    mean_actual = mean(actual),
    n = n()
  ) %>%
  filter(!is.na(bin))
plot(calib_summary$mean_pred, calib_summary$mean_actual,
main = "Calibration Plot (Logit)",
xlab = "Mean Predicted Probability",
ylab = "Actual Proportion",
xlim = c(0, 1), ylim = c(0, 1))
abline(0, 1, col = "red", lty = 2)
With all these metrics, you might feel overwhelmed. Here’s a practical guide:
For Model Selection:
For Reporting Fit:
For Practical Applications:
What it is: One or more variables perfectly predict your outcome. Sounds good, right? Wrong!
Political Science Example:
Perfect separation occurs when a predictor (or combination) perfectly predicts the outcome.
# create data with perfect separation
set.seed(456)
x_sep = c(rnorm(50, -2, 0.5), rnorm(50, 2, 0.5))
y_sep = c(rep(0, 50), rep(1, 50))
data_sep = data.frame(x = x_sep, y = y_sep)

# try to fit logit (will show warning)
logit_sep = glm(y ~ x, data = data_sep, family = binomial())
ggplot(data_sep, aes(x = x, y = y)) +
geom_point() +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Perfect Separation Example",
subtitle = "All y=0 have x<0, all y=1 have x>0",
x = "X", y = "Y") +
theme_minimal()
print("Model with perfect separation (note inflated standard errors):")
## [1] "Model with perfect separation (note inflated standard errors):"
summary(logit_sep)
##
## Call:
## glm(formula = y ~ x, family = binomial(), data = data_sep)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2234 18512.2327 0.000 1.000
## x 24.6958 21192.3695 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.3863e+02 on 99 degrees of freedom
## Residual deviance: 2.1510e-09 on 98 degrees of freedom
## AIC: 4
##
## Number of Fisher Scoring iterations: 25
cat("\nSolutions for perfect separation:\n")
##
## Solutions for perfect separation:
cat("1. Penalized likelihood (Firth's method)\n")
## 1. Penalized likelihood (Firth's method)
cat("2. Bayesian methods with informative priors\n")
## 2. Bayesian methods with informative priors
cat("3. Exact logistic regression\n")
## 3. Exact logistic regression
cat("4. Remove problematic predictors\n")
## 4. Remove problematic predictors
cat("5. Combine categories\n")
## 5. Combine categories
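A minimal sketch of the first option, assuming the logistf package (not used elsewhere in this document) is installed; it applies Firth's penalized likelihood to the separated data created above:
# Firth's penalized likelihood keeps estimates finite under perfect separation (sketch)
library(logistf)
firth_sep = logistf(y ~ x, data = data_sep)
summary(firth_sep)   # finite slope for x with a usable confidence interval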
When outcomes are rare (< 5% or > 95%), standard methods can be biased.
Common in Political Science:
# create imbalanced data
set.seed(789)
n_rare = 500
x_rare = rnorm(n_rare)
p_rare = plogis(-3 + 0.5 * x_rare)   # Very low baseline probability
y_rare = rbinom(n_rare, 1, p_rare)
cat("Proportion of 1s:", mean(y_rare), "\n\n")
## Proportion of 1s: 0.044
# standard logit
logit_rare = glm(y ~ x, data = data.frame(x = x_rare, y = y_rare),
                 family = binomial())
# cloglog might be better for rare events
cloglog_rare = glm(y ~ x, data = data.frame(x = x_rare, y = y_rare),
                   family = binomial(link = "cloglog"))
# compare the models
rare_comparison = data.frame(
  Model = c("Logit", "Cloglog"),
  Intercept = c(coef(logit_rare)[1], coef(cloglog_rare)[1]),
  Slope = c(coef(logit_rare)[2], coef(cloglog_rare)[2]),
  AIC = c(AIC(logit_rare), AIC(cloglog_rare))
)
print("Rare Events - Model Comparison:")
## [1] "Rare Events - Model Comparison:"
print(rare_comparison)
## Model Intercept Slope AIC
## 1 Logit -3.167135 0.4481748 180.5006
## 2 Cloglog -3.190186 0.4402903 180.4725
cat("\nSolutions for rare events:\n")
##
## Solutions for rare events:
cat("1. King & Zeng bias correction\n")
## 1. King & Zeng bias correction
cat("2. Penalized likelihood\n")
## 2. Penalized likelihood
cat("3. Cloglog or log-log links\n")
## 3. Cloglog or log-log links
cat("4. Choice-based sampling with weights\n")
## 4. Choice-based sampling with weights
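To see the kind of finite-sample problem the King & Zeng correction targets, here is a small simulation (a sketch, base R only) with a roughly 5% event rate and a modest sample size:
# with few events, plain-logit estimates of the baseline rate drift in small samples (sketch)
set.seed(2024)
rare_intercepts = replicate(500, {
  x_s = rnorm(200)
  y_s = rbinom(200, 1, plogis(-3 + 0.5 * x_s))   # true intercept -3, about 5% events
  coef(glm(y_s ~ x_s, family = binomial()))[1]
})
mean(rare_intercepts)   # tends to come out below -3, i.e. probabilities understated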
Before Modeling:

- [ ] Plot Y against each X
- [ ] Check for separation
- [ ] Examine outcome distribution (rare events?)
- [ ] Consider theoretical link function

During Modeling:

- [ ] Try multiple link functions
- [ ] Check convergence warnings
- [ ] Examine residuals (deviance, Pearson)
- [ ] Test for influential observations

After Modeling:

- [ ] Report marginal effects, not just coefficients
- [ ] Use AME (average marginal effects) as default - the observed value approach
- [ ] Only use MEM (marginal effects at means) if specifically justified
- [ ] Show predicted probabilities for meaningful scenarios
- [ ] Validate out-of-sample
- [ ] Conduct sensitivity analyses
# create a publication-ready table
library(stargazer)
# For demonstration (would normally output to LaTeX or HTML)
stargazer(lpm_model, logit_model, probit_model,
type = "text",
title = "Comparison of Binary Response Models",
dep.var.labels = "Binary Outcome",
covariate.labels = c("X Variable", "Constant"),
keep.stat = c("n", "rsq", "adj.rsq", "ll", "aic"),
star.cutoffs = c(0.05, 0.01, 0.001))
##
## Comparison of Binary Response Models
## =================================================
## Dependent variable:
## -------------------------------
## Binary Outcome
## OLS logistic probit
## (1) (2) (3)
## -------------------------------------------------
## X Variable 0.140*** 0.881*** 0.522***
## (0.009) (0.082) (0.044)
##
## Constant 0.559*** 0.401*** 0.236***
## (0.018) (0.115) (0.067)
##
## -------------------------------------------------
## Observations 500 500 500
## R2 0.327
## Adjusted R2 0.326
## Log Likelihood -243.424 -242.875
## Akaike Inf. Crit. 490.849 489.749
## =================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
# Substantive effects plot
x_range_sub = seq(min(data$x), max(data$x), length.out = 100)
pred_data_sub = data.frame(x = x_range_sub)

# get predictions with confidence intervals
pred_logit = predict(logit_model, newdata = pred_data_sub,
                     type = "response", se.fit = TRUE)
pred_data_sub$fit   = pred_logit$fit
pred_data_sub$lower = pred_logit$fit - 1.96 * pred_logit$se.fit
pred_data_sub$upper = pred_logit$fit + 1.96 * pred_logit$se.fit

substantive_plot = ggplot(pred_data_sub, aes(x = x)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "blue") +
  geom_line(aes(y = fit), color = "blue", size = 1.2) +
  geom_rug(data = filter(data, y == 0), aes(x = x), sides = "b", alpha = 0.3) +
  geom_rug(data = filter(data, y == 1), aes(x = x), sides = "t", alpha = 0.3) +
  labs(title = "Predicted Probabilities with 95% Confidence Interval",
       subtitle = "Rug plots show distribution of observations",
       x = "Independent Variable",
       y = "Pr(Y = 1)") +
  theme_minimal()
print(substantive_plot)
# Create first differences plot (effect of moving X from low to high)
x_low  = quantile(data$x, 0.25)
x_high = quantile(data$x, 0.75)

pred_low  = predict(logit_model, newdata = data.frame(x = x_low),
                    type = "response", se.fit = TRUE)
pred_high = predict(logit_model, newdata = data.frame(x = x_high),
                    type = "response", se.fit = TRUE)

first_diff    = pred_high$fit - pred_low$fit
first_diff_se = sqrt(pred_high$se.fit^2 + pred_low$se.fit^2)
cat("\n\nFirst Difference (moving X from 25th to 75th percentile):\n")
##
##
## First Difference (moving X from 25th to 75th percentile):
cat(sprintf("Change in probability: %.3f (SE = %.3f)\n", first_diff, first_diff_se))
## Change in probability: 0.533 (SE = 0.039)
cat(sprintf("95%% CI: [%.3f, %.3f]\n",
- 1.96*first_diff_se,
first_diff + 1.96*first_diff_se)) first_diff
## 95% CI: [0.456, 0.610]
Bad: “The logit coefficient on education is 0.234 (p < 0.01).”
Better: “Each additional year of education increases the log-odds of voting by 0.234.”
Best: “Each additional year of education increases the probability of voting by 4.2 percentage points for an individual with average characteristics (95% CI: 2.1 to 6.3 percentage points). The effect is larger for those with low baseline probability (6.5 pp) than high baseline probability (2.1 pp).”
Key points for writing: translate coefficients into probability-scale quantities, report the uncertainty around them, and anchor the numbers to substantively meaningful comparisons rather than raw coefficients or log-odds.
The Linear Probability Model represents the simplest approach to modeling binary outcomes. We begin with the general linear regression framework where we model the expected value of \(Y\) as a linear function of independent variables \(X\):
\[E(Y) = X\beta\]
For a binary response variable, the expected value has a special interpretation:
\[E(Y) = 1 \cdot \Pr(Y = 1) + 0 \cdot \Pr(Y = 0) = \Pr(Y = 1)\]
Therefore, in the LPM, we model:
\[\Pr(Y_i = 1) = X_i\beta\]
This amounts to fitting a standard OLS regression to a binary response variable.
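Concretely, the lpm_model summarized below is nothing more than an ordinary lm() call on the 0/1 outcome (the call is echoed in the output):
# the LPM is just OLS on a binary outcome
lpm_model = lm(y ~ x, data = data)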
# show the lpm model that was fit earlier
summary(lpm_model)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.86546 -0.33755 0.01616 0.34236 0.88700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.559460 0.018316 30.54 <2e-16 ***
## x 0.140122 0.008998 15.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4091 on 498 degrees of freedom
## Multiple R-squared: 0.3275, Adjusted R-squared: 0.3261
## F-statistic: 242.5 on 1 and 498 DF, p-value: < 2.2e-16
# Visualization
ggplot(data, aes(x = x, y = y)) +
geom_point(alpha = 0.3, position = position_jitter(height = 0.02)) +
geom_line(data = pred_data, aes(x = x, y = lpm_pred),
color = "blue", size = 1.2) +
geom_hline(yintercept = c(0, 1), linetype = "dashed", alpha = 0.5) +
labs(title = "Linear Probability Model",
subtitle = "Notice predictions can exceed [0,1] bounds",
x = "Independent Variable (X)",
y = "Probability of Y = 1") +
theme_minimal() +
annotate("text", x = -3, y = -0.15, label = "Problematic\npredictions < 0",
color = "red", size = 3) +
annotate("text", x = 3, y = 1.15, label = "Problematic\npredictions > 1",
color = "red", size = 3)
While the LPM is computationally simple and provides easily interpretable coefficients (as direct effects on probability), there are some issues worth discussing:
The heteroscedasticity problem in the Linear Probability Model is both mechanical and substantive, arising directly from the binary nature of the dependent variable. It is not catastrophic, though: it is worth distinguishing the cases where it poses a genuine threat to inference from those where it can be safely managed.
For a binary variable \(Y_i \in \{0,1\}\) following a Bernoulli distribution with probability \(\pi_i = \Pr(Y_i = 1)\), the variance has a specific functional form:
\[\text{Var}(Y_i | X_i) = \pi_i(1 - \pi_i) = E(Y_i|X_i)[1 - E(Y_i|X_i)]\]
In the LPM, we model \(E(Y_i|X_i) = X_i\beta\), which means:
\[\text{Var}(Y_i | X_i) = X_i\beta(1 - X_i\beta)\]
This creates a precise, known form of heteroscedasticity where:

- Variance is maximized when \(X_i\beta = 0.5\) (where \(\text{Var}(Y_i) = 0.25\))
- Variance approaches zero as \(X_i\beta\) approaches 0 or 1
- The variance function is quadratic in the linear predictor
This violates the classical OLS assumption that \(\text{Var}(\epsilon_i | X_i) = \sigma^2\) for all \(i\), where the error variance is constant across observations.
The consequences of this heteroscedasticity have been debated extensively in recent literature:
Statistical Consequences:
Coefficient Estimates Remain Consistent: Despite heteroscedasticity, OLS coefficients remain unbiased and consistent. This is because \(E(\epsilon_i | X_i) = 0\) still holds.
Standard Errors Are Incorrect: Classical OLS standard errors are inconsistent, typically underestimating the true variance when predicted probabilities are near 0.5 and overestimating when near the boundaries.
Efficiency Loss: OLS is no longer the Best Linear Unbiased Estimator (BLUE); Weighted Least Squares would be more efficient if we knew the true variance function.
Beck (2020) argues that for many practical applications, especially when predicted probabilities stay within [0.2, 0.8], the heteroscedasticity in LPM is manageable with robust standard errors. Gomila (2021) demonstrates through extensive simulations that LPM with robust standard errors often performs comparably to logistic regression for causal inference, particularly in experimental settings.
The modern consensus, reflected in current practice, suggests:
# show heteroscedasticity structure in lpm
lpm_residuals = residuals(lpm_model)
lpm_fitted    = fitted(lpm_model)

# calculate theoretical variance
theoretical_var = lpm_fitted * (1 - lpm_fitted)

# Create comprehensive heteroscedasticity visualization
het_data = data.frame(
  fitted = lpm_fitted,
  residuals = lpm_residuals,
  squared_residuals = lpm_residuals^2,
  theoretical_var = theoretical_var
)
# Main residual plot
p1 = ggplot(het_data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = TRUE, color = "blue") +
  labs(title = "Residual Plot for LPM",
       subtitle = "Variance changes systematically with fitted values",
       x = "Fitted Probability",
       y = "Residuals") +
  theme_minimal()

# Squared residuals vs fitted to show variance pattern
p2 = ggplot(het_data, aes(x = fitted)) +
  geom_point(aes(y = squared_residuals), alpha = 0.3, color = "gray40") +
  geom_line(aes(y = theoretical_var), color = "red", size = 1.2) +
  labs(title = "Variance Structure in LPM",
       subtitle = "Gray points: squared residuals; Red line: theoretical variance",
       x = "Fitted Probability",
       y = "Variance") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 1)
# Demonstrate impact on inference
# Standard OLS standard errors
ols_se = summary(lpm_model)$coefficients[, "Std. Error"]

# Robust standard errors (HC1)
library(sandwich)
robust_se = sqrt(diag(vcovHC(lpm_model, type = "HC1")))

# compare the two sets of standard errors
se_comparison = data.frame(
  Variable = names(coef(lpm_model)),
  OLS_SE = ols_se,
  Robust_SE = robust_se,
  Ratio = robust_se / ols_se
)
print("Comparison of Standard Errors:")
## [1] "Comparison of Standard Errors:"
print(se_comparison)
## Variable OLS_SE Robust_SE Ratio
## (Intercept) (Intercept) 0.018316401 0.018232059 0.9953953
## x x 0.008997758 0.006500339 0.7224399
print(paste("Average ratio of robust to OLS standard errors:",
round(mean(se_comparison$Ratio), 3)))
## [1] "Average ratio of robust to OLS standard errors: 0.859"
The heteroscedasticity in LPM has several practical implications:
Hypothesis Testing: Use robust standard errors by default. In R: lmtest::coeftest(model, vcov = sandwich::vcovHC(model, type = "HC1"))
Confidence Intervals: Construct using robust standard errors to ensure proper coverage
Model Selection: When comparing models, be aware that information criteria (AIC, BIC) assume homoscedasticity
Predictive Intervals: Standard prediction intervals will be incorrect; the variance should vary with the predicted probability
Weighted Least Squares Alternative: If pursuing efficiency, use WLS with weights \(w_i = 1/[\hat{\pi}_i(1-\hat{\pi}_i)]\), though this requires iterative estimation and can be unstable near boundaries; a short sketch of this and of the robust-inference approach follows below.
The bottom line: heteroscedasticity in LPM is a known, manageable issue that rarely invalidates the substantive conclusions when properly addressed through robust inference procedures.
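A minimal sketch of the robust-inference and WLS points above (assuming the lmtest package is installed; sandwich is already loaded):
# robust (HC1) inference for the LPM coefficients
library(lmtest)
library(sandwich)
coeftest(lpm_model, vcov = vcovHC(lpm_model, type = "HC1"))

# two-step WLS: weight by the estimated Bernoulli variance, trimming fitted values
# away from 0 and 1 so the weights stay finite (the 0.01/0.99 bounds are arbitrary)
pi_hat  = pmin(pmax(fitted(lpm_model), 0.01), 0.99)
lpm_wls = lm(y ~ x, data = data, weights = 1 / (pi_hat * (1 - pi_hat)))
summary(lpm_wls)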
The non-normality of errors in the Linear Probability Model is a fundamental consequence of the binary nature of the dependent variable. This issue has important implications for finite-sample inference, though its practical impact depends heavily on sample size and the specific inferential goals of the analysis.
In the LPM, we model \(Y_i = X_i\beta + \epsilon_i\) where \(Y_i \in \{0,1\}\). The error term \(\epsilon_i\) inherits a peculiar structure from this setup. Given that the model predicts \(\hat{Y}_i = X_i\hat{\beta}\), the residuals can only take two possible values for any given observation:
This creates a discrete, two-point distribution for each observation’s error term. The probability mass function of \(\epsilon_i\) conditional on \(X_i\) is:
\[P(\epsilon_i = 1 - X_i\beta | X_i) = \pi_i = X_i\beta\] \[P(\epsilon_i = -X_i\beta | X_i) = 1 - \pi_i = 1 - X_i\beta\]
We can verify that \(E(\epsilon_i | X_i) = 0\):
\[E(\epsilon_i | X_i) = (1 - X_i\beta) \cdot X_i\beta + (-X_i\beta) \cdot (1 - X_i\beta) = X_i\beta - (X_i\beta)^2 - X_i\beta + (X_i\beta)^2 = 0\]
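A quick numeric check of this two-point distribution (a sketch, using an illustrative value \(\pi_i = 0.3\)):
# the LPM error can only take two values; check its mean and variance for pi = 0.3
pi_i  = 0.3
eps   = c(1 - pi_i, -pi_i)    # error when y = 1, error when y = 0
probs = c(pi_i, 1 - pi_i)
sum(eps * probs)              # mean: 0
sum(eps^2 * probs)            # variance: 0.21 = pi_i * (1 - pi_i)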
The error distribution has a few characteristics:
Bimodality: For any given \(X_i\), the error distribution is bimodal with two point masses. This is fundamentally different from the continuous, unimodal normal distribution.
Skewness: Unless \(X_i\beta = 0.5\), the error distribution is skewed. When \(X_i\beta < 0.5\), the distribution is right-skewed; when \(X_i\beta > 0.5\), it’s left-skewed.
Bounded Support: Errors are bounded: \(-X_i\beta \leq \epsilon_i \leq 1 - X_i\beta\). This boundedness becomes more severe as predictions approach 0 or 1.
Kurtosis: The distribution exhibits excess kurtosis (leptokurtic), with the degree depending on \(X_i\beta\).
The violation of the normality assumption has several consequences for statistical inference:
Finite-Sample Issues:
Exact Tests Invalid: Classical t-tests and F-tests rely on the normality of errors for their exact finite-sample distributions. With non-normal errors, the test statistics don’t follow t or F distributions exactly.
Confidence Interval Coverage: Finite-sample confidence intervals based on normal critical values may have incorrect coverage probabilities, particularly in small samples.
Prediction Intervals: Standard prediction intervals, which assume normality, are inappropriate for LPM.
Asymptotic Theory to the Rescue:
Despite these finite-sample issues, several factors mitigate the problem:
\[\sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N(0, V)\]
where \(V\) is the asymptotic variance-covariance matrix.
Consistency: The consistency of OLS estimates doesn’t require normality—only that \(E(\epsilon_i | X_i) = 0\).
Robust Inference: Heteroscedasticity-robust standard errors remain valid asymptotically without normality assumptions.
Additionally, there is some discussion in the literature about the practical importance of non-normal errors in LPM:
Horrace and Oaxaca (2006) demonstrate that normality violations in LPM are most problematic when:
Gomila (2021) shows through simulations that for hypothesis testing about treatment effects in randomized experiments, the non-normality of LPM errors has minimal impact on Type I error rates when n > 50, even without robust standard errors.
Beck (2020) argues that concerns about non-normality are often overstated in political science applications where:
# Demonstrate the bimodal nature of LPM errors
set.seed(42)
n_demo = 1000

# Generate data with varying predicted probabilities
x_demo = rnorm(n_demo, 0, 1.5)
true_prob_demo = plogis(0.3 + 0.6*x_demo)
y_demo = rbinom(n_demo, 1, true_prob_demo)

# Fit LPM
lpm_demo = lm(y_demo ~ x_demo)
fitted_demo = fitted(lpm_demo)
resid_demo = residuals(lpm_demo)

# Create groups based on fitted values for visualization
fitted_groups = cut(fitted_demo,
                    breaks = c(-Inf, 0.3, 0.5, 0.7, Inf),
                    labels = c("Low (< 0.3)", "Mid-Low (0.3-0.5)",
                               "Mid-High (0.5-0.7)", "High (> 0.7)"))

# create visualization of error distributions by fitted value groups
error_df = data.frame(
  residuals = resid_demo,
  fitted = fitted_demo,
  group = fitted_groups
)
# Histogram of residuals by group
p1 = ggplot(error_df, aes(x = residuals)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "black", alpha = 0.7) +
  facet_wrap(~group, scales = "free_y", ncol = 2) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Distribution of LPM Residuals by Fitted Probability Range",
       subtitle = "Note the bimodal structure and changing skewness",
       x = "Residuals",
       y = "Frequency") +
  theme_minimal()

# Q-Q plot to assess normality
p2 = ggplot(error_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_wrap(~group, scales = "free") +
  labs(title = "Q-Q Plots by Fitted Probability Range",
       subtitle = "Systematic departures from normality (red line)",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 1)
# Formal tests of normality
library(nortest)
shapiro_overall = shapiro.test(resid_demo[1:min(5000, length(resid_demo))])
ad_test = ad.test(resid_demo)
print("Formal Tests of Normality for LPM Residuals:")
## [1] "Formal Tests of Normality for LPM Residuals:"
print(paste("Shapiro-Wilk test p-value:", round(shapiro_overall$p.value, 6)))
## [1] "Shapiro-Wilk test p-value: 0"
print(paste("Anderson-Darling test p-value:", round(ad_test$p.value, 6)))
## [1] "Anderson-Darling test p-value: 0"
# Demonstrate CLT in action: sampling distribution of coefficients
n_bootstrap = 1000
boot_coefs = matrix(NA, n_bootstrap, 2)

for(i in 1:n_bootstrap) {
  boot_idx = sample(1:n_demo, replace = TRUE)
  boot_model = lm(y_demo[boot_idx] ~ x_demo[boot_idx])
  boot_coefs[i,] = coef(boot_model)
}

# Plot sampling distribution
boot_df = data.frame(
  intercept = boot_coefs[,1],
  slope = boot_coefs[,2]
)
p3 = ggplot(boot_df, aes(x = slope)) +
  geom_histogram(aes(y = ..density..), bins = 30,
                 fill = "lightblue", color = "black", alpha = 0.7) +
  stat_function(fun = dnorm,
                args = list(mean = mean(boot_df$slope),
                            sd = sd(boot_df$slope)),
                color = "red", size = 1.2) +
  labs(title = "Bootstrap Sampling Distribution of Slope Coefficient",
       subtitle = "Despite non-normal errors, sampling distribution approximates normality (CLT)",
       x = "Slope Coefficient",
       y = "Density") +
  theme_minimal()
print(p3)
# Show sample size effects
sample_sizes = c(30, 50, 100, 500)
ks_results = numeric(length(sample_sizes))

for(j in 1:length(sample_sizes)) {
  n_temp = sample_sizes[j]
  boot_slopes_temp = numeric(500)

  for(i in 1:500) {
    idx_temp = sample(1:n_demo, n_temp, replace = TRUE)
    temp_model = lm(y_demo[idx_temp] ~ x_demo[idx_temp])
    boot_slopes_temp[i] = coef(temp_model)[2]
  }
  # Kolmogorov-Smirnov test for normality
  ks_results[j] = ks.test(scale(boot_slopes_temp), "pnorm")$p.value
}

normality_df = data.frame(
  SampleSize = sample_sizes,
  KS_pvalue = ks_results,
  Conclusion = ifelse(ks_results > 0.05, "Approximately Normal", "Non-Normal")
)
print("\nNormality of Sampling Distribution by Sample Size:")
## [1] "\nNormality of Sampling Distribution by Sample Size:"
print(normality_df)
## SampleSize KS_pvalue Conclusion
## 1 30 0.2311874 Approximately Normal
## 2 50 0.1119774 Approximately Normal
## 3 100 0.6953876 Approximately Normal
## 4 500 0.7290082 Approximately Normal
Given the current understanding of non-normal errors in LPM:
Sample Size Considerations:
Inference Strategy:
Diagnostic Checks:
Reporting:
The modern view, supported by simulation evidence and theoretical results, is that non-normal errors in LPM are primarily a small-sample concern that becomes negligible in the sample sizes typical of most social science applications.
The Linear Probability Model’s ability to produce predicted probabilities outside the [0,1] interval represents perhaps its most obvious theoretical deficiency, yet the practical implications of this issue are more nuanced.
The LPM specifies \(\Pr(Y_i = 1 | X_i) = X_i\beta\) without any constraints ensuring this linear combination falls within [0,1]. For any finite dataset, we can always find values of the covariates that produce:
\[\hat{\Pr}(Y_i = 1 | X_i) = X_i\hat{\beta} < 0 \quad \text{or} \quad \hat{\Pr}(Y_i = 1 | X_i) = X_i\hat{\beta} > 1\]
This occurs because the linear functional form extends infinitely in both directions, while probabilities are bounded. The problem is particularly acute when:
The frequency and severity of boundary violations depend on several factors:
Data Characteristics:
Empirical Frequency:
Horrace and Oaxaca (2006) analyzed 166 published LPM studies and found:
More recently, Gomila (2021) examined experimental studies and found that with binary treatments and typical covariate adjustment, fewer than 2% of predictions typically fall outside [0,1].
Boundary violations can be categorized by their nature and consequences:
1. In-Sample Violations
These occur for observations used to estimate the model:
2. Out-of-Sample Violations
These occur when making predictions for new observations:
3. Counterfactual Violations
These occur when computing treatment effects or marginal effects:
# Comprehensive demonstration of boundary violations in LPM
set.seed(123)
n = 1000

# Generate data with varying baseline probabilities
x1 = rnorm(n, 0, 2)
x2 = rbinom(n, 1, 0.4)
x3 = runif(n, -2, 2)

# True model with strong effects to induce some violations
true_prob = plogis(-0.5 + 0.8*x1 + 1.2*x2 + 0.6*x3)
y = rbinom(n, 1, true_prob)

# Fit LPM
lpm_full = lm(y ~ x1 + x2 + x3)

# get predictions
in_sample_pred = fitted(lpm_full)

# Create out-of-sample test data with more extreme values
n_test = 500
x1_test = rnorm(n_test, 0, 3)   # Wider range
x2_test = rbinom(n_test, 1, 0.4)
x3_test = runif(n_test, -3, 3)  # Wider range
test_data = data.frame(x1 = x1_test, x2 = x2_test, x3 = x3_test)
out_sample_pred = predict(lpm_full, newdata = test_data)

# Analyze violations
violations_summary = function(predictions, label) {
  n_total = length(predictions)
  n_below = sum(predictions < 0)
  n_above = sum(predictions > 1)
  n_violations = n_below + n_above
  pct_violations = 100 * n_violations / n_total

  # Magnitude of violations
  below_magnitude = if(n_below > 0) predictions[predictions < 0] else NA
  above_magnitude = if(n_above > 0) predictions[predictions > 1] - 1 else NA

  data.frame(
    Sample = label,
    N = n_total,
    Below_0 = n_below,
    Above_1 = n_above,
    Total_Violations = n_violations,
    Pct_Violations = round(pct_violations, 2),
    Min_Pred = round(min(predictions), 3),
    Max_Pred = round(max(predictions), 3),
    Worst_Below = if(n_below > 0) round(min(below_magnitude), 3) else NA,
    Worst_Above = if(n_above > 0) round(max(above_magnitude), 3) else NA
  )
}

violation_results = rbind(
  violations_summary(in_sample_pred, "In-Sample"),
  violations_summary(out_sample_pred, "Out-of-Sample")
)
print("Boundary Violations Summary:")
## [1] "Boundary Violations Summary:"
print(violation_results)
## Sample N Below_0 Above_1 Total_Violations Pct_Violations Min_Pred
## 1 In-Sample 1000 52 68 120 12.0 -0.368
## 2 Out-of-Sample 500 64 59 123 24.6 -0.760
## Max_Pred Worst_Below Worst_Above
## 1 1.378 -0.368 0.378
## 2 2.035 -0.760 1.035
# create visualization of the distribution of predictions
pred_df = data.frame(
  Prediction = c(in_sample_pred, out_sample_pred),
  Sample = c(rep("In-Sample", length(in_sample_pred)),
             rep("Out-of-Sample", length(out_sample_pred)))
)

p1 = ggplot(pred_df, aes(x = Prediction, fill = Sample)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 50) +
  geom_vline(xintercept = c(0, 1), color = "red", linetype = "dashed", size = 1) +
  facet_wrap(~Sample, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = c("In-Sample" = "blue", "Out-of-Sample" = "orange")) +
  labs(title = "Distribution of LPM Predictions",
       subtitle = "Red lines mark [0,1] boundaries; note violations outside",
       x = "Predicted Probability",
       y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "none")
# Examine which observations have violations
violation_analysis = data.frame(
  x1 = x1,
  x2 = x2,
  x3 = x3,
  prediction = in_sample_pred,
  violation = ifelse(in_sample_pred < 0, "Below 0",
                     ifelse(in_sample_pred > 1, "Above 1", "Valid"))
)

p2 = ggplot(violation_analysis, aes(x = x1, y = prediction, color = violation)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = c(0, 1), linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Boundary Violations by Covariate Values",
       subtitle = "Violations occur primarily at extreme covariate values",
       x = "X1 (continuous predictor)",
       y = "LPM Prediction",
       color = "Status") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 1)
# Demonstrate impact on marginal effects
# Calculate marginal effect of x2 (binary treatment)
me_calculations = violation_analysis %>%
  group_by(x2) %>%
  summarise(
    n = n(),
    mean_pred = mean(prediction),
    n_violations = sum(violation != "Valid"),
    .groups = 'drop'
  )
ate_naive = diff(me_calculations$mean_pred)

# Show problematic individual treatment effects
ite_df = data.frame(
  x1 = x1,
  x3 = x3,
  baseline_pred = predict(lpm_full, newdata = data.frame(x1 = x1, x2 = 0, x3 = x3)),
  treated_pred = predict(lpm_full, newdata = data.frame(x1 = x1, x2 = 1, x3 = x3))
)
ite_df$ite = ite_df$treated_pred - ite_df$baseline_pred

# Identify problematic ITEs
ite_df$problem = ifelse(ite_df$baseline_pred < 0 | ite_df$baseline_pred > 1 |
                        ite_df$treated_pred < 0 | ite_df$treated_pred > 1,
                        "Problematic", "Valid")
p3 = ggplot(ite_df, aes(x = baseline_pred, y = treated_pred, color = problem)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = c(0, 1), linetype = "dotted", color = "red") +
  geom_hline(yintercept = c(0, 1), linetype = "dotted", color = "red") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Individual Treatment Effects in LPM",
       subtitle = "Red box shows valid [0,1] × [0,1] region",
       x = "Predicted Probability (Control)",
       y = "Predicted Probability (Treated)",
       color = "ITE Status") +
  theme_minimal()
print(p3)
print(paste("\nAverage Treatment Effect (naive):", round(ate_naive, 3)))
## [1] "\nAverage Treatment Effect (naive): 0.196"
print(paste("Number of problematic ITEs:", sum(ite_df$problem == "Problematic"),
"out of", nrow(ite_df)))
## [1] "Number of problematic ITEs: 183 out of 1000"
Recent literature has some useful notes on boundary violations:
Beck (2020) argues that boundary violations are often a “red herring” when:
Gomila (2021) demonstrates through extensive simulations that:
Kitagawa and Tetenov (2018) show that for policy learning and optimal treatment assignment:
Several approaches can address boundary violations while retaining LPM’s advantages:
1. Trimming/Truncation:
# Simple truncation
predictions_trimmed = pmax(0, pmin(1, predictions))
2. Feasible Generalized Least Squares (FGLS):
3. Bounded Regression:
4. Sample Restriction:
5. Honest Reporting:
Boundary violations are most problematic when:
Prediction is the Goal: If the model’s purpose is to generate individual-level predictions or risk scores
Decision Rules Depend on Probabilities: When thresholds determine binary decisions (e.g., treatment assignment)
Extrapolation is Required: Making predictions for covariate values outside the training data range
Small Sample Sizes: With limited data, violations can substantially affect average predictions
Extreme Treatment Effects: When treatments push many units across probability boundaries
Simulation Studies: When using the model to generate synthetic data
Modern methodological consensus suggests that boundary violations in LPM, while theoretically inelegant, are often practically inconsequential for:
However, researchers should:
As Angrist and Pischke (2009) memorably note: “While a nonlinear model may fit the CEF [conditional expectation function] for LDVs [limited dependent variables] more closely than a linear model, when it comes to marginal effects, this probably matters little.”
The Linear Probability Model’s assumption of constant marginal effects represents a departure from how we typically conceptualize probability relationships. This mechanical feature of linearity has profound implications for interpretation and can lead to substantively different conclusions than nonlinear models, though recent methodological work suggests these differences are often smaller than traditionally believed.
In the LPM, the marginal effect of any continuous variable \(X_k\) on the probability of success is:
\[\frac{\partial \Pr(Y_i = 1 | X_i)}{\partial X_{ik}} = \frac{\partial (X_i\beta)}{\partial X_{ik}} = \beta_k\]
This constancy implies that:
This contrasts sharply with nonlinear models where marginal effects vary with the level of all covariates.
The assumption of constant marginal effects conflicts with several theoretical and empirical regularities:
1. Ceiling and Floor Effects:
Near probability boundaries, there is less “room” for effects:
2. Substantive Theory Often Implies Nonlinearity:
Many theories predict varying effects:
3. Heterogeneous Treatment Effects:
The constant effect assumption implies no effect heterogeneity based on baseline risk, contradicting evidence from many fields that treatment effects vary by baseline characteristics.
In nonlinear models, marginal effects depend on location:
Logit Model:
\[\frac{\partial \Pr(Y = 1)}{\partial X_k} = \beta_k \cdot \pi_i(1 - \pi_i)\] where \(\pi_i = \Pr(Y_i = 1 | X_i)\)
Probit Model:
\[\frac{\partial \Pr(Y = 1)}{\partial X_k} = \beta_k \cdot \phi(X_i\beta)\] where \(\phi(\cdot)\) is the standard normal PDF
These formulations show that marginal effects:
# Comprehensive demonstration of marginal effects across models
set.seed(456)
n = 1000

# Generate data with a range of baseline probabilities
x1 = runif(n, -3, 3)
x2 = rnorm(n, 0, 1)
true_prob = plogis(-0.5 + 0.8*x1 + 0.5*x2)
y = rbinom(n, 1, true_prob)

data_me = data.frame(y = y, x1 = x1, x2 = x2)

# Fit three models
lpm_me = lm(y ~ x1 + x2, data = data_me)
logit_me = glm(y ~ x1 + x2, data = data_me, family = binomial(link = "logit"))
probit_me = glm(y ~ x1 + x2, data = data_me, family = binomial(link = "probit"))

# Calculate marginal effects across the range of predicted probabilities
# For LPM: constant marginal effect
lpm_me_x1 = coef(lpm_me)["x1"]

# For logit and probit: varying marginal effects
# Create evaluation points
eval_points = seq(-3, 3, 0.1)
eval_data = expand.grid(x1 = eval_points, x2 = 0)   # Hold x2 at mean

# Logit marginal effects
logit_pred = predict(logit_me, newdata = eval_data, type = "response")
logit_me_x1 = coef(logit_me)["x1"] * logit_pred * (1 - logit_pred)

# Probit marginal effects
probit_link = predict(probit_me, newdata = eval_data, type = "link")
probit_me_x1 = coef(probit_me)["x1"] * dnorm(probit_link)

# Create comprehensive comparison plot
me_comparison = data.frame(
  x1 = eval_points,
  baseline_prob_logit = logit_pred,
  LPM = lpm_me_x1,
  Logit = logit_me_x1,
  Probit = probit_me_x1
)
me_long = me_comparison %>%
  pivot_longer(cols = c(LPM, Logit, Probit),
               names_to = "Model",
               values_to = "Marginal_Effect")
p1 = ggplot(me_long, aes(x = x1, y = Marginal_Effect, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Marginal Effects of X1 Across Its Range",
       subtitle = "LPM shows constant effect; Logit/Probit show varying effects",
       x = "Value of X1",
       y = "Marginal Effect (∂Pr/∂X1)") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  geom_hline(yintercept = lpm_me_x1, linetype = "dashed", color = "blue", alpha = 0.3)
# Show marginal effects as function of baseline probability
# (each original row becomes three rows after pivot_longer, hence each = 3)
p2 = ggplot(me_long %>%
              mutate(baseline_prob = rep(me_comparison$baseline_prob_logit, each = 3)),
            aes(x = baseline_prob, y = Marginal_Effect, color = Model)) +
  geom_line(size = 1.2) +
  labs(title = "Marginal Effects by Baseline Probability",
       subtitle = "Nonlinear models show maximum effect at Pr = 0.5",
       x = "Baseline Probability",
       y = "Marginal Effect") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()
grid.arrange(p1, p2, ncol = 1)
# Distribution of marginal effects across sample
# Calculate individual marginal effects for each observation
library(margins)
# Marginal effects at observed values
lpm_ame = mean(rep(coef(lpm_me)["x1"], n))
logit_margins = margins(logit_me, variables = "x1")
logit_mes = logit_margins$fitted     # caution: `fitted` holds predicted probabilities;
                                     # per-observation marginal effects are in `dydx_x1`
probit_margins = margins(probit_me, variables = "x1")
probit_mes = probit_margins$fitted   # same caveat as above
# Create distribution plot
me_dist = data.frame(
  LPM = rep(lpm_ame, n),
  Logit = logit_mes,
  Probit = probit_mes
)
me_dist_long = me_dist %>%
  pivot_longer(cols = everything(),
               names_to = "Model",
               values_to = "ME")
p3 = ggplot(me_dist_long, aes(x = ME, fill = Model)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 30) +
  facet_wrap(~Model, scales = "free_y", ncol = 1) +
  labs(title = "Distribution of Marginal Effects Across Sample",
       subtitle = "LPM: single value; Nonlinear models: heterogeneous effects",
       x = "Marginal Effect",
       y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "none")
print(p3)
# Calculate and compare average marginal effects
ame_comparison = data.frame(
  Model = c("LPM", "Logit", "Probit"),
  AME = c(lpm_ame, mean(logit_mes), mean(probit_mes)),
  SD = c(0, sd(logit_mes), sd(probit_mes)),
  Min = c(lpm_ame, min(logit_mes), min(probit_mes)),
  Max = c(lpm_ame, max(logit_mes), max(probit_mes))
)
print("Average Marginal Effects Comparison:")
## [1] "Average Marginal Effects Comparison:"
print(ame_comparison)
## Model AME SD Min Max
## 1 LPM 0.1468397 0.0000000 0.14683973 0.1468397
## 2 Logit 0.4250000 0.2743343 0.01116335 0.9547737
## 3 Probit 0.4250558 0.2736289 0.00396370 0.9650805
The practical importance of allowing for varying marginal effects depends on several factors:
1. Distribution of Predicted Probabilities:
Angrist (2001) shows that when most predicted probabilities fall in [0.2, 0.8], the difference between LPM and nonlinear models is minimal. The average marginal effect (AME) from logit/probit often closely approximates the LPM coefficient.
2. Research Question:
3. Sample Composition:
Recent work by Breen, Karlson, and Holm (2018) demonstrates that comparing coefficients across models or samples is problematic in nonlinear models due to scale identification. LPM avoids these issues entirely.
# Demonstrate when constant vs varying marginal effects matter
# Scenario 1: Most observations in middle range
set.seed(789)
n_scenario = 1000

# Middle range scenario
x_middle = rnorm(n_scenario, 0, 0.5)
prob_middle = plogis(0 + x_middle)   # Most probs in [0.3, 0.7]
y_middle = rbinom(n_scenario, 1, prob_middle)

# Extreme range scenario
x_extreme = rnorm(n_scenario, 0, 2)
prob_extreme = plogis(0 + x_extreme) # Probs span [0.02, 0.98]
y_extreme = rbinom(n_scenario, 1, prob_extreme)

# Fit models for both scenarios
models_middle = list(
  lpm = lm(y_middle ~ x_middle),
  logit = glm(y_middle ~ x_middle, family = binomial())
)
models_extreme = list(
  lpm = lm(y_extreme ~ x_extreme),
  logit = glm(y_extreme ~ x_extreme, family = binomial())
)

# Calculate average marginal effects
# (caution: `fitted` from a margins object holds predicted probabilities;
# the per-observation marginal effects are in the `dydx_*` column)
ame_middle = c(
  LPM = coef(models_middle$lpm)[2],
  Logit = mean(margins(models_middle$logit, variables = "x_middle")$fitted)
)
ame_extreme = c(
  LPM = coef(models_extreme$lpm)[2],
  Logit = mean(margins(models_extreme$logit, variables = "x_extreme")$fitted)
)
# create visualization of the difference
scenario_comparison = data.frame(
  Scenario = rep(c("Middle Range", "Extreme Range"), each = 2),
  Model = rep(c("LPM", "Logit AME"), 2),
  AME = c(ame_middle, ame_extreme)
)
p4 = ggplot(scenario_comparison, aes(x = Model, y = AME, fill = Model)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Scenario) +
  labs(title = "Average Marginal Effects: When LPM Approximates Nonlinear Models",
       subtitle = "LPM closely approximates logit AME in middle range, diverges at extremes",
       y = "Average Marginal Effect") +
  theme_minimal() +
  theme(legend.position = "none")
print(p4)
print("AME Comparison by Scenario:")
## [1] "AME Comparison by Scenario:"
print(scenario_comparison)
## Scenario Model AME
## 1 Middle Range LPM 0.2571512
## 2 Middle Range Logit AME 0.5080000
## 3 Extreme Range LPM 0.1528286
## 4 Extreme Range Logit AME 0.4710000
print(paste("Middle range - Difference:",
round(abs(ame_middle[1] - ame_middle[2]), 4)))
## [1] "Middle range - Difference: 0.2508"
print(paste("Extreme range - Difference:",
round(abs(ame_extreme[1] - ame_extreme[2]), 4)))
## [1] "Extreme range - Difference: 0.3182"
# Show individual vs average effects
# Generate data with clear heterogeneity
x_het = runif(n_scenario, -3, 3)
z_het = rbinom(n_scenario, 1, 0.5)           # Binary moderator
true_effect = ifelse(z_het == 1, 1.5, 0.3)   # Heterogeneous effect
prob_het = plogis(-0.5 + true_effect * x_het)
y_het = rbinom(n_scenario, 1, prob_het)

# Fit models
lpm_het = lm(y_het ~ x_het * z_het)
logit_het = glm(y_het ~ x_het * z_het, family = binomial())

# Calculate marginal effects by subgroup
me_z0 = margins(logit_het, at = list(z_het = 0), variables = "x_het")
me_z1 = margins(logit_het, at = list(z_het = 1), variables = "x_het")
het_results = data.frame(
  Group = c("Z=0", "Z=0", "Z=1", "Z=1"),
  Model = rep(c("LPM", "Logit AME"), 2),
  Effect = c(
    coef(lpm_het)["x_het"],
    summary(me_z0)$AME,
    coef(lpm_het)["x_het"] + coef(lpm_het)["x_het:z_het"],
    summary(me_z1)$AME
  )
)
print("\nHeterogeneous Effects Comparison:")
## [1] "\nHeterogeneous Effects Comparison:"
print(het_results)
## Group Model Effect
## 1 Z=0 LPM 0.06748769
## 2 Z=0 Logit AME 0.06615296
## 3 Z=1 LPM 0.20862741
## 4 Z=1 Logit AME 0.16200196
Recent methodological advances have nuanced our understanding of when constant marginal effects are problematic:
Angrist and Pischke (2009) show that LPM provides the best linear approximation to the true conditional expectation function, minimizing mean squared error.
Lewbel, Dong, and Yang (2012) demonstrate that:
Cattaneo, Crump, Farrell, and Feng (2024) in recent work show:
The constant marginal effects assumption has several practical consequences:
1. Interaction Interpretation:
2. Comparison Across Groups:
3. Specification Testing:
4. Covariate Balance:
Constant marginal effects are most problematic when:
Constant effects are less concerning when:
The assumption of constant marginal effects in LPM represents a clear departure from the theoretical ideal of how probabilities should behave. However, mounting empirical evidence suggests that:
As Beck (2020) concludes: “The LPM is not a model of individual probabilities but of average probabilities in the population. For this purpose, it often performs remarkably well.”
The choice between constant and varying marginal effects should depend on:
Rather than viewing constant marginal effects as a fatal flaw, recent literature suggests viewing it as a modeling choice with well-understood tradeoffs.
Binary response models are fundamental tools in social science research. Key insights from this deep dive:
Theoretical Foundations Matter (Sometimes):

- All models make strong assumptions about functional form
- No model is “correct”—they’re all approximations
- Theory should guide choice when possible

Practical Differences Are Often Small:

- Logit and probit usually give similar substantive conclusions
- LPM often approximates nonlinear models well for AME
- Focus on what you’re trying to learn, not model purity

Interpretation Is Everything:

- Coefficients are not effects
- Marginal effects vary across observations
- Graphics often communicate better than tables

Modern Best Practices:

1. Report average marginal effects (AME) using observed values (Hanmer & Kalkan 2013)
   - Don’t use marginal effects at means unless specifically justified
   - The margins package in R defaults to AME
2. Show robustness across specifications (LPM, logit, probit)
3. Validate predictions out-of-sample
4. Be transparent about limitations and modeling choices

The Evolution Continues:

- Machine learning methods (random forests, neural nets) for prediction
- Causal inference focus shifting emphasis to identification over functional form
- Semiparametric methods reducing reliance on distributional assumptions
The choice of binary response model is less crucial than understanding what each model assumes, when those assumptions matter, and how to interpret results correctly. In the end, the goal is not to find the “right” model but to learn something useful about the world.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Detroit
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] nortest_1.0-4 sandwich_3.1-1 stargazer_5.2.3 gridExtra_2.3
## [5] pROC_1.18.5 margins_0.3.28 broom_1.0.8 MASS_7.3-65
## [9] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [13] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
## [17] ggplot2_3.5.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.3 lattice_0.22-5 stringi_1.8.7
## [5] hms_1.1.3 digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
## [9] grid_4.5.1 timechange_0.3.0 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] Matrix_1.7-4 plyr_1.8.9 jsonlite_2.0.0 backports_1.5.0
## [17] mgcv_1.9-1 scales_1.4.0 jquerylib_0.1.4 cli_3.6.5
## [21] rlang_1.1.6 splines_4.5.1 withr_3.0.2 cachem_1.1.0
## [25] yaml_2.3.10 tools_4.5.1 tzdb_0.5.0 vctrs_0.6.5
## [29] R6_2.6.1 zoo_1.8-14 lifecycle_1.0.4 pkgconfig_2.0.3
## [33] pillar_1.11.0 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
## [37] data.table_1.17.0 Rcpp_1.1.0 xfun_0.52 tidyselect_1.2.1
## [41] knitr_1.50 farver_2.1.2 nlme_3.1-168 htmltools_0.5.8.1
## [45] rmarkdown_2.29 prediction_0.3.18 labeling_0.4.3 compiler_4.5.1