class: center, middle, inverse, title-slide .title[ # PLS 900/803 MLE Binary Models ] .author[ ### Shahryar Minhas [s7minhas.com] ] --- exclude: true ``` r library(stringr) library(magrittr) library(rvest) ``` --- ## Readings associated with this lecture - Discussion about other link functions: + See binary response models handout (pls900_mle_week2_binaryModels.pdf) - Interaction + [Ai and Norton (2003)](https://www.sciencedirect.com/science/article/pii/S0165176503000326) + [Berry et al. (2009)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2009.00429.x) - Simulation: + [Making the most of statistical analyses](https://web.stanford.edu/~tomz/pubs/ajps00.pdf) + [Behind the curve](https://gvpt.umd.edu/sites/gvpt.umd.edu/files/pubs/Hanmer%20and%20Kalkan%20AJPS%20Behind%20the%20Curve.pdf) - Model Assessment: + ["To explain or to predict". Shmueli. Statistical Science. 2010.](https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf) + [Perils of policy by p-value](https://journals.sagepub.com/doi/10.1177/0022343309356491) + [Cross-Validation](http://www.marcel-neunhoeffer.com/pdf/papers/pa_cross-validation.pdf) + [Separation Plot](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2011.00525.x) + [Box's Loop](https://journals.sagepub.com/doi/full/10.1177/0022343316682065) - OLS for binary data (optional): + [Beck (2020)](https://www.cambridge.org/core/journals/political-analysis/article/estimating-grouped-data-models-with-a-binarydependent-variable-and-fixed-effects-via-a-logit-versus-a-linear-probability-model-the-impact-of-dropped-units/AD6E3A3EA15BEDECA6B6FD49FE0216B3) + [Horrace & Oaxaca (2005)](https://surface.syr.edu/cgi/viewcontent.cgi?article=1138&context=ecn) + [Gomila (2020)](https://psyarxiv.com/4gmbv) + [Gelman Blog Post](https://statmodeling.stat.columbia.edu/2020/01/10/linear-or-logistic-regression-with-binary-outcomes/) --- ## Generalized linear models All of the models we're talking about here belong to a class called generalized linear models (GLM) Elements of GLM: - distribution for `\(Y\)` (stochastic component) - linear predictor `\(X\beta\)` (systematic component) - link function that relates the linear predictor to the parameters of the distribution --- ## Binary data & social science Binary data are very common in social science measurement - Did you vote? - Did an event occur (conflict, democratic transition)? - Is an institution present in a country? - Did an individual commit a crime? --- ## Binary data & the advantages of MLE The analysis of these data is also fundamental to many advanced topics - Event history models - Network models - Censoring - Causal analysis Also models of binary data are a good way to understand the process of maximum likelihood --- ## Why not OLS/Linear model? <img src="ols1.png" width="900px" style="display: block; margin: auto;" /> --- ## Why not OLS/Linear model? <img src="ols2.png" width="900px" style="display: block; margin: auto;" /> --- ## Why not OLS/Linear model? <img src="ols3.png" width="900px" style="display: block; margin: auto;" /> --- ## Why not OLS/Linear model? 
<img src="ols4.png" width="900px" style="display: block; margin: auto;" /> --- ## Model for binary dependent variable Of course, we start by specifying a distribution for Y: `\begin{eqnarray} y_{i} & \sim & Bernoulli(pi_{i}) \nonumber \\ p(y | \pi) & = & \prod_{i=1}^{n} \pi_{i}^{y_{i}} ( 1- \pi_{i})^{1- y_{i}} \nonumber \end{eqnarray}` Specify linear predictor: `\begin{eqnarray} X_{i}\beta & = & \beta_{0} + X_{i1}\beta_{1} + X_{i2}\beta_{2} + ... X_{ip}\beta_{p} \nonumber \end{eqnarray}` --- ## Model for binary dependent variable Specify a link function: Logit: `\begin{eqnarray} \pi_{i} & = & \frac{1}{1+e^{-X_{i}\beta}} \nonumber \end{eqnarray}` Could have chosen several other options as well: - Probit: `\(\pi_{i} = \Phi(X_{i} \beta)\)` - Complementary log-log: `\(pi_{i} = 1 - exp(-exp(X_{i} \beta))\)` - ... --- ## Does the link function matter? The logit function is a simple transformation mapping values in `\((0,1)\)` to `\((-\infty,\infty)\)`: `\begin{eqnarray} logit(y) & = & log \frac{y}{1-y} \nonumber \end{eqnarray}` We actually use the inverse-logit, which maps from `\((-\infty,\infty)\)` to `\((0,1)\)`: `\begin{eqnarray} logit(y) & = & \frac{1}{1+exp(-y)} \nonumber \end{eqnarray}` - Inverse-logit exagerrates differences in the midrange of `\(y\)` - And compresses differences in the neighborhood of relatively small or large `\(y\)`s --- ## Differences in link functions Inverse logit is shown in red, probit in blue, and complementary log-log in green. What are the notable differences? <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## Typical binary model formulation Standard approach in political science is: `\begin{eqnarray} y_{i} & \sim & Bernoulli(pi_{i}) \nonumber \\ \pi_{i} & = & \frac{1}{1 + exp(-X_{i} \ beta)} \nonumber \\ & = & logit^{-1}(X_{i}\beta) \nonumber \end{eqnarray}` where, `\begin{eqnarray} X_{i}\beta & = & \beta_{0} + X_{i1}\beta_{1} + X_{i2}\beta_{2} + ... X_{ip}\beta_{p} \nonumber \end{eqnarray}` --- ## Log-likelihood of logit <img src="log_like_logit.png" width="500px" style="display: block; margin: auto;" /> --- ## Log-likelihood of logit <img src="log_like_logit_2.png" width="900px" style="display: block; margin: auto;" /> --- ## Log-likelihood of logit <img src="log_like_logit_3.png" width="900px" style="display: block; margin: auto;" /> --- ## Log-likelihood of logit <img src="log_like_logit_4.png" width="900px" style="display: block; margin: auto;" /> --- ## What next? - We could do some calculus and algebra to solve, but that seems rather onerous. Lets use a computer. - Specifically, we want to maximize: `\(\sum_{i=1}^{n} y_{i} X_{i} \beta - X_{i}\beta - log(1+exp(-X_{i}\beta))\)`. - Lets write down the log likelihood function in R ``` r logitLL = function(par, y, X){ X = as.matrix(cbind(1, X)) xBeta = X %*% par ll = sum(y *xBeta - xBeta - log(1+exp(-xBeta))) return(ll) } ``` --- ## To use this lets bring in some data [Salehyan et al. 
2011: Explaining external support for insurgent groups](https://www.jstor.org/stable/23016231?seq=1#page_scan_tab_contents) Load the data from the .rda file: ``` r load('R/saleyhan_etal_2011.rda') ``` --- ## To use this lets bring in some data We'll do a small regression using their data: ``` r dv = 'supp1' ivs = c( 'log_rgdpch','cinc','pol6', 'gs', 'tk', 'riv' ) form = paste0(dv, '~', paste(ivs, collapse = '+')) dat2 = dat[,c(dv, ivs)] dat2 = na.omit(dat2) ``` --- ## Variable descriptions Unit of analysis: government - rebel dyad - `log_rgdpch`: logged real GDP per capita of the government (min-max: 5.23-10.43) - `cinc`: military capability of the government (0-0.28) - `pol6`: is the government's polity score greater than 6 (0,1) - `gs`: does the government receive foreign support (0,1) - `tk`: does the rebel group the state is fighting have a transnational constituency (0,1) - `riv`: is the state fighting the rebel group engaged in an international rivalry (0,1) --- ## Lets apply our function ... to retrieve the point estimates for `\(\beta\)` ``` r opt1 = optim( par=rep(0, length(ivs) + 1), fn=logitLL, X=dat2[,-1], y=dat2[,1], control=list(fnscale=-1), hessian=TRUE, method='BFGS' ) ``` --- ## Results? ``` r coefs = opt1$par coefs ``` ``` ## [1] 0.8038712 -0.3304774 8.0934579 -0.3181530 1.0266367 1.1762919 0.9413188 ``` --- ## Is this really it? What would the `glm` function give us? ``` r logit1 <- glm( form, data=dat2, na.action=na.omit, family=binomial(link='logit') ) cbind(coef(logit1), coefs) ``` ``` ## coefs ## (Intercept) 0.8037752 0.8038712 ## log_rgdpch -0.3304646 -0.3304774 ## cinc 8.0932181 8.0934579 ## pol6 -0.3181335 -0.3181530 ## gs 1.0266308 1.0266367 ## tk 1.1762873 1.1762919 ## riv 0.9413082 0.9413188 ``` --- ## Standard errors? The `glm` function, though, does give us other nice things like standard errors: ``` r summary(logit1)$'coefficients' ``` ``` ## Estimate Std.
Error z value Pr(>|z|) ## (Intercept) 0.8037752 1.0067487 0.7983872 4.246459e-01 ## log_rgdpch -0.3304646 0.1357397 -2.4345465 1.491046e-02 ## cinc 8.0932181 4.5335410 1.7851869 7.423100e-02 ## pol6 -0.3181335 0.2996382 -1.0617253 2.883604e-01 ## gs 1.0266308 0.2476238 4.1459299 3.384374e-05 ## tk 1.1762873 0.2605794 4.5141220 6.357961e-06 ## riv 0.9413082 0.2395058 3.9302102 8.487165e-05 ``` --- ## Retrieving standard errors - Standard errors are the square roots of the diagonal of: <img src="info_matrix.png" width="150px" style="display: block; margin: auto;" /> - where the quantity in the brackets is the Hessian matrix - The Hessian matrix captures the curvature of the log-likelihood, i.e., how quickly its slope changes as the parameters change - The negative of its expectation is referred to as the Information matrix --- ## Retrieving info from Hessians Variance-covariance matrix ``` r varcov = -solve(opt1$hessian) varcov ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1.01355149 -0.1320404404 0.12254403 0.072214226 -0.0343487771 ## [2,] -0.13204044 0.0184254802 -0.06326793 -0.010916428 -0.0008513621 ## [3,] 0.12254403 -0.0632679298 20.55316366 -0.378853320 0.3260238955 ## [4,] 0.07221423 -0.0109164280 -0.37885332 0.089783350 -0.0070570440 ## [5,] -0.03434878 -0.0008513621 0.32602390 -0.007057044 0.0613176952 ## [6,] 0.06009534 -0.0111285002 0.01620599 0.002119022 -0.0005401340 ## [7,] -0.03388172 -0.0006652542 0.05323945 -0.005674163 0.0132121369 ## [,6] [,7] ## [1,] 0.060095341 -0.0338817239 ## [2,] -0.011128500 -0.0006652542 ## [3,] 0.016205992 0.0532394475 ## [4,] 0.002119022 -0.0056741628 ## [5,] -0.000540134 0.0132121369 ## [6,] 0.067901678 0.0031449619 ## [7,] 0.003144962 0.0573631922 ``` --- ## Retrieving info from Hessians Standard errors: ``` r serrors = sqrt(diag(varcov)) serrors ``` ``` ## [1] 1.0067529 0.1357405 4.5335597 0.2996387 0.2476241 0.2605795 0.2395061 ``` --- ## Compare and contrast `glm`: ``` r summary(logit1)$'coefficients' ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.8037752 1.0067487 0.7983872 4.246459e-01 ## log_rgdpch -0.3304646 0.1357397 -2.4345465 1.491046e-02 ## cinc 8.0932181 4.5335410 1.7851869 7.423100e-02 ## pol6 -0.3181335 0.2996382 -1.0617253 2.883604e-01 ## gs 1.0266308 0.2476238 4.1459299 3.384374e-05 ## tk 1.1762873 0.2605794 4.5141220 6.357961e-06 ## riv 0.9413082 0.2395058 3.9302102 8.487165e-05 ``` --- ## Compare and contrast BFGS optim: ``` r cbind('Estimate'=coefs,'Std. Error'=serrors,'Z value'=coefs/serrors) ``` ``` ## Estimate Std. Error Z value ## [1,] 0.8038712 1.0067529 0.7984791 ## [2,] -0.3304774 0.1357405 -2.4346264 ## [3,] 8.0934579 4.5335597 1.7852324 ## [4,] -0.3181530 0.2996387 -1.0617888 ## [5,] 1.0266367 0.2476241 4.1459481 ## [6,] 1.1762919 0.2605795 4.5141384 ## [7,] 0.9413188 0.2395061 3.9302492 ``` --- ## So what do we do with this stuff? <img src="stargazing.jpeg" width="500px" style="display: block; margin: auto;" /> --- ## So what do we do with this stuff? - What does it mean for the rebel transnational constituency variable (`tk`) to be 1.18 and juiced up with stars? - One common way old people interpret these models is in terms of log odds <img src="log_odds.png" width="500px" style="display: block; margin: auto;" /> --- ## So what do we do with this stuff? - If a rebel group has a transnational constituency, there is a 1.18 unit increase in the log odds of the rebel group receiving external support, and it is true with THREE STARS! <img src="angry_baby.png" width="400px" style="display: block; margin: auto;" /> --- ## No, angry baby from previous slide ...
We gotta dig further ... this guy has the right idea <img src="baby_smurf.jpeg" width="400px" style="display: block; margin: auto;" /> --- ## How to interpret? - We could go onto relative risk and odds ratios but there are more straightforward ways - It is increasingly becoming standard practice that when you present your findings, you do so in terms of something that has substantive meaning to the reader - For binary outcomes, this means turning results into predicted probabilities - Also we need to take into account uncertainty when presenting results --- ## How do we get predictions in terms of probabilities from this model? Of course there is the predict function: ``` r yhats = predict(logit1) head(yhats) ``` ``` ## 1 2 3 8 9 10 ## 1.4167999 1.4167999 0.5212350 0.5194176 1.8878211 0.1158178 ``` Apply the link function to retrieve probabilities (could also have used `predict(model, type='response')`: ``` r yprobs = 1/(1+exp(-yhats)) head(yprobs) ``` ``` ## 1 2 3 8 9 10 ## 0.8048363 0.8048363 0.6274365 0.6270116 0.8685069 0.5289221 ``` --- ## How does the predict function work? Retrieve parameter estimates: ``` r beta = coef(logit1) ``` Set up covariate matrix: ``` r # add an intercept X = cbind(1, dat2[,ivs]) # turn into matrix X = data.matrix(X) head(X) ``` ``` ## 1 log_rgdpch cinc pol6 gs tk riv ## 1 1 7.699140 0.0016176 0 1 1 1 ## 2 1 7.699140 0.0016176 0 1 1 1 ## 3 1 7.553019 0.0013032 0 1 1 0 ## 8 1 6.846688 0.0012716 0 1 0 1 ## 9 1 6.266764 0.0013299 0 1 1 1 ## 10 1 8.620994 0.0053601 0 0 1 1 ``` --- ## How does the predict function work? Calculate in-sample predictions: ``` r # gen yhats yhat = X %*% beta # apply link fn again to get probs yprobs = 1/(1+exp(-yhat)) # compare with predict head( cbind( yprobs, predict(logit1, type='response') ) ) ``` ``` ## [,1] [,2] ## 1 0.8048363 0.8048363 ## 2 0.8048363 0.8048363 ## 3 0.6274365 0.6274365 ## 8 0.6270116 0.6270116 ## 9 0.8685069 0.8685069 ## 10 0.5289221 0.5289221 ``` --- ## How does this help us with interpretation? Well lets create a pair of scenarios to understand the effect of `tk`, we'll hold all variables at their measure of central tendency in both scenarios, but in one scenario we'll set `tk` to 1 and in the other to 0 ``` r scenarios = with(dat2, cbind( intercept=1, gdp = mean(log_rgdpch), cinc = mean(cinc), pol6 = median(pol6), gs = median(gs), tk = c(0,1), riv = median(riv) )) scenarios ``` ``` ## intercept gdp cinc pol6 gs tk riv ## [1,] 1 7.620791 0.0118693 0 0 0 1 ## [2,] 1 7.620791 0.0118693 0 0 1 1 ``` --- ## How does this help us with interpretation? Now lets repeat the process of getting predicted probabilities: ``` r # gen yhats yhat = scenarios %*% beta # apply link fn again to get probs yprobs = 1/(1+exp(-yhat)) # print yprobs ``` ``` ## [,1] ## [1,] 0.3368737 ## [2,] 0.6222314 ``` What's the interpretation here? --- ## Where Does Uncertainty Come From? Remember, we're *estimating* `\(\beta\)` from a finite sample: - **We don't know the true `\(\beta\)`** - we only have our estimate `\(\hat{\beta}\)` - **Our data is just one sample** - different samples would give slightly different estimates - **MLE gives us a point estimate** - but how confident should we be? Think of it like polling: - Poll 1000 people → get 52% support - Poll a different 1000 → might get 49% or 55% - The estimate varies, but in a predictable way! **Key insight**: The sampling distribution of our estimates follows a pattern --- ## The Magic of Maximum Likelihood Theory ### Why can we use a normal distribution for `\(\hat{\beta}\)`? 
**Three key results from MLE theory:** 1. **Consistency**: As `\(n \to \infty\)`, `\(\hat{\beta} \to \beta_{true}\)` - With more data, we get closer to the truth 2. **Asymptotic Normality**: For large `\(n\)`: `$$\sqrt{n}(\hat{\beta} - \beta_{true}) \xrightarrow{d} N(0, I^{-1})$$` - Even if errors aren't normal, `\(\hat{\beta}\)` becomes normal! - This is the Central Limit Theorem at work 3. **Efficiency**: MLE has the smallest variance among consistent estimators - We're getting the best estimates possible --- ## From Theory to Practice: The Multivariate Normal ### What we know from MLE theory: `$$\hat{\beta} \sim MVN(\beta_{true}, \Sigma)$$` where `\(\Sigma\)` is the variance-covariance matrix ### But wait, we don't know `\(\beta_{true}\)` or `\(\Sigma\)`! **Solution**: Use our estimates! - Replace `\(\beta_{true}\)` with `\(\hat{\beta}\)` (our MLE estimate) - Replace `\(\Sigma\)` with `\(\hat{\Sigma} = [-H]^{-1}\)` (inverse of negative Hessian) This gives us: `$$\hat{\beta} \sim MVN(\hat{\beta}, \hat{\Sigma})$$` Seems circular? It's not - we're describing the *sampling distribution* --- ## Understanding the Variance-Covariance Matrix The variance-covariance matrix `\(\hat{\Sigma}\)` tells us two crucial things: ``` ## (Intercept) democracy gdp ## (Intercept) 0.04 -0.02 0.01 ## democracy -0.02 0.09 -0.03 ## gdp 0.01 -0.03 0.16 ``` **Diagonal elements** = Variances of each `\(\hat{\beta}\)` - Larger values → more uncertainty - Square root gives standard errors **Off-diagonal elements** = Covariances between `\(\hat{\beta}\)`s - Shows how parameters vary together - Important for joint inference! --- ## Why This Matters: From Single to Joint Uncertainty ### Single parameter uncertainty (univariate): `$$\hat{\beta}_k \sim N(\hat{\beta}_k, SE^2_k)$$` Easy to visualize: bell curve around estimate ### Multiple parameters together (multivariate): `$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} \sim MVN\left(\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}, \hat{\Sigma}\right)$$` This captures: - Individual parameter uncertainty - **Correlations between parameters** - Allows us to make statements about combinations of `\(\beta\)`s **Key for interpretation**: When we compute predicted probabilities, we use *multiple* `\(\beta\)`s together! --- ## Now what about uncertainty? Lets start with what we know about MLEs: ``` r vcov(logit1) ``` ``` ## (Intercept) log_rgdpch cinc pol6 gs tk riv ## (Intercept) 1.01 -0.13 0.12 0.07 -0.03 0.06 -0.03 ## log_rgdpch -0.13 0.02 -0.06 -0.01 0.00 -0.01 0.00 ## cinc 0.12 -0.06 20.55 -0.38 0.33 0.02 0.05 ## pol6 0.07 -0.01 -0.38 0.09 -0.01 0.00 -0.01 ## gs -0.03 0.00 0.33 -0.01 0.06 0.00 0.01 ## tk 0.06 -0.01 0.02 0.00 0.00 0.07 0.00 ## riv -0.03 0.00 0.05 -0.01 0.01 0.00 0.06 ``` The variance covariance matrix gives us a measure of not only how precisely we think we have estimates `\(\beta_{p}\)` but also how `\(\beta\)`s vary together --- ## Now what about uncertainty? - We often summarize uncertainty with confidence intervals around parameters - These intervals around the parameter estimate are normally distributed because by the central limit theorem, we assume that `\(\hat \beta_{MLE} \sim mvnorm(\hat\beta, \hat\Sigma)\)`, where `\(\hat\Sigma\)` represents the variance covariance matrix <img src="beta_uncert.png" width="600px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? 
- Given that the `\(\beta\)`s are assumed to follow a multivariate normal, we know that if we take enough random draws from that distribution, those draws will be representative of the set of possible values of `\(\beta\)` as estimated by our model - This is really useful because it means that we can capture uncertainty in any estimated parameter using draws from its predictive distribution <img src="beta_uncert_draws.png" width="400px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? - In most models, we have more than one explanatory variable, and unless the off-diagonal entries of the variance-covariance matrix are zero, we know that the parameter estimates are correlated with one another <img src="twobetas.png" width="400px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? - In most models, we have more than one explanatory variable, and unless the off-diagonal entries of the variance-covariance matrix are zero, we know that the parameter estimates are correlated with one another <img src="twobetas_2.png" width="400px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? - Using a multivariate normal we can draw a sample of values of these `\(\beta\)`s jointly <img src="twobetas_3.png" width="400px" style="display: block; margin: auto;" /> --- ## Outline of procedure - Form model and estimate `\(\hat\beta\)` and `\(\hat\Sigma\)` - Simulate from the sampling distribution of `\(\hat\beta\)` to incorporate **estimation uncertainty** - Multiply these simulated `\(\beta\)` values by some covariates in the model to get predictions - Plug those predictions into the inverse link function, so that the resulting estimate is on the same scale as the parameters in the stochastic component of our model - Use the transformed predicted values to take thousands of draws from your stochastic function in order to incorporate **fundamental uncertainty** - Store the means of these simulations --- ## More formally 1. Choose a scenario, `\(X_{s}\)` 2. Estimate the model and obtain the parameter vector, `\(\hat\beta\)`, and its variance covariance matrix, `\(\hat\Sigma_{\hat\beta}\)` 3. Draw `\(\tilde\beta\)` from the multivariate normal `\(MVN(\hat\beta, \hat\Sigma_{\hat\beta})\)` 4. Calculate `\(\tilde\mu_{s} = f(\tilde\beta, X_{s})\)` 5. To account for fundamental uncertainty, draw `\(\tilde y_{s} \sim g(\tilde\mu_{s}, \alpha)\)` from the model's distribution (this is a predicted value) 6. Repeat steps 3 to 5 many times, averaging the results to get one expected value 7. Repeat step 6 many times to obtain a vector of expected values 8. Summarize this vector using means, sds, or confidence intervals --- ## Lets do this! Specifically, to evaluate the effect of having a transnational constituency ``` r library(mvtnorm) sims = 1000 betaDraws = rmvnorm(sims, coef(logit1), vcov(logit1)) head(betaDraws) ``` ``` ## (Intercept) log_rgdpch cinc pol6 gs tk riv ## [1,] 0.3416528 -0.2618608 7.622924 0.07450124 0.5474533 1.4920103 0.9581079 ## [2,] 0.6710217 -0.3256797 9.253792 -0.23892030 0.8484446 1.2290140 1.0472092 ## [3,] 0.2907513 -0.2280450 16.216061 -0.54956733 1.0914486 0.9312046 0.4204781 ## [4,] 0.3250921 -0.2165770 9.570286 -0.99250230 0.9540670 1.0413666 0.8618989 ## [5,] 0.2873576 -0.1825133 10.244827 -0.56880893 0.8059053 0.9015962 0.4201509 ## [6,] 1.2399410 -0.4187754 4.794175 -0.12410758 1.2376469 1.1981770 1.4965370 ``` --- ## Lets do this!
``` r ypreds = scenarios %*% t(betaDraws) yprobs = 1/(1+exp(-ypreds)) yprobs = t(yprobs) head(yprobs) ``` ``` ## [,1] [,2] ## [1,] 0.3531214 0.7082012 ## [2,] 0.3421180 0.6399494 ## [3,] 0.3027618 0.5242369 ## [4,] 0.4134056 0.6662922 ## [5,] 0.3631337 0.5841438 ## [6,] 0.4017885 0.6900084 ``` --- ## Lets do this! ``` r summary(yprobs) ``` ``` ## V1 V2 ## Min. :0.2067 Min. :0.3671 ## 1st Qu.:0.3074 1st Qu.:0.5771 ## Median :0.3365 Median :0.6204 ## Mean :0.3392 Mean :0.6184 ## 3rd Qu.:0.3694 3rd Qu.:0.6622 ## Max. :0.5538 Max. :0.7753 ``` --- ## Visualize the difference: Binary IV Say that we use the simulation based procedure to visualize the difference in probability of rebel support estimated by our model under a scenario of no transnational constituency and a scenario of transnational constituency. One way to do so is to use our simulation based procedure to generate a plot like the following: <img src="second_dens.png" width="600px" style="display: block; margin: auto;" /> --- ## Visualize the difference: Continuous IV What if we wanted to visualize predictions in probability of rebel support by a continuous independent variable like Log(GDP per capita)? We could use a plot like the below: <img src="gdp_eff.png" width="600px" style="display: block; margin: auto;" /> --- ## The "Average Case" Problem ### What we've been doing so far: - Set covariates to their **means** (or medians/modes) - Calculate effect for this "average case" - Report this as our estimate ### The problem: **The "average case" might not exist in reality!** Example: - Average age = 35 years - Average education = 14 years - Average income = $50,000 - Average gender = 0.52 (???) This "average person" is 52% female? That's not a real person! --- ## Why This Matters More for Logit/Probit ### In OLS, it doesn't matter: - Linear models have **constant** marginal effects - Effect at the mean = Mean of the effects - `\(\beta\)` is the same everywhere ### In logit/probit, it matters a lot: - Nonlinear models have **varying** marginal effects - Effect depends on where you are on the S-curve - Effect at the mean ≠ Mean of the effects! --- ## The Observed Value Approach = AME! ### Key clarification: **"Observed value approach" and "Average Marginal Effects (AME)" are the SAME THING** - **Hanmer & Kalkan (2013)** called it: "Observed value approach" - **Statisticians** call it: "Average Marginal Effects (AME)" - **Both mean**: Calculate effects using actual data values, then average --- ## The Observed Value Approach = AME! ### The insight: Instead of the effect for an **imaginary average person**, calculate the **average effect for real people** 1. **Keep all observations at their actual values** - Don't collapse to means/medians, use the real data distribution 2. **Calculate the marginal effect for each observation** - Change variable of interest for each person, compute difference 3. **Average across all observations** - This gives you AME = the observed value approach --- ## Observed Values in Practice ### Traditional approach (MEM): ``` 1. Set all X to means 2. Change treatment from 0→1 3. Calculate: Δp = p(Treatment=1) - p(Treatment=0) ``` This gives **Marginal Effects at Means (MEM)** ### Observed value approach (AME): ``` For each observation i: 1. Keep all Xi at observed values 2. Set Treatment=0, get p0i 3. Set Treatment=1, get p1i 4. Calculate: Δpi = p1i - p0i Average effect = mean(Δpi) ``` This gives **Average Marginal Effects (AME)** **Remember**: AME = Observed Value Approach (same thing, different names!) 
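 --- ## Observed values in code: a sketch

Before asking whether this matters, here is a minimal sketch of the observed value approach in R. It assumes the objects built on earlier slides (`dat2`, `ivs`, `logit1`, and the `betaDraws` matrix of simulated coefficients) are still in memory, and it computes the AME of `tk` along with a simulation-based interval:

``` r
# design matrices that keep every row at its observed values, varying only tk
X0 = data.matrix(cbind(1, dat2[,ivs])); X0[,'tk'] = 0
X1 = data.matrix(cbind(1, dat2[,ivs])); X1[,'tk'] = 1

# point estimate: average difference in predicted probabilities across rows
p0 = 1/(1+exp(-X0 %*% coef(logit1)))
p1 = 1/(1+exp(-X1 %*% coef(logit1)))
ame = mean(p1 - p0)

# estimation uncertainty: recompute the AME for each simulated beta draw
ameDraws = apply(betaDraws, 1, function(b){
  mean( 1/(1+exp(-X1 %*% b)) - 1/(1+exp(-X0 %*% b)) ) })

c(AME=ame, quantile(ameDraws, probs=c(.025, .975)))
```

The logic mirrors the scenario-based simulation from before; the only change is that the covariate profiles come from the observed data rather than from measures of central tendency.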
--- ## Is this important? ### Yes, particularly when: 1. **You have substantial heterogeneity** in your sample - Effects vary across subgroups - Nonlinearity is strong 2. **The "average case" is unrealistic** - Categorical variables (can't have average gender) - Combinations that don't exist 3. **You care about policy implications** - Real people, not statistical constructs - Distribution of effects matters **Bottom line**: Use AME (observed values) unless you have a specific reason not to --- ## Observed value approach - Operationally, this is not hard to do - Instead of creating scenarios we just use the observed data, and to test effects we just vary the variable of interest - Computationally, though, this gets annoying; we'll work through this together in R --- ## What about interactions - In the context of binomial models, every variable is curvilinear and interacted with every other variable automatically because of the **link function** - By including a link function, the model automatically specifies interactions on the natural scale of the linear predictor - Lets take a peek through the lens of some economists --- ## First: Linear world If we had something like: `\begin{eqnarray} E[ Y | X] = \beta_{0} + \beta_{1}x_{1} + (\beta_2 + \beta_{3}x_{1})x_{2} \nonumber \end{eqnarray}` What would be the marginal effect of `\(x_{1}\)`? `\(x_{2}\)`? What about how the value of `\(x_{2}\)` changes the effect of `\(x_{1}\)` (cross-partial derivative)? --- ## Interactions: Ai and Norton (2003) Say we have: `\begin{eqnarray} E(y = 1 | x_{1}, x_{2}) = F(\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{12}(x_{1} \times x_{2})) \nonumber \end{eqnarray}` where `\(F(.)\)` can be a logit, probit, or other inverse link function and `\(x_{1}\)` is a continuous variable. The marginal effect of `\(x_{1}\)` is: `\begin{eqnarray} \frac{d}{d x_1} P(y=1) = (\beta_1+\beta_{12} x_2) F'(\cdot) \nonumber \end{eqnarray}` - What you need to note is that the presence of a link function requires the use of the chain rule and therefore retains other terms on the right-hand side besides `\(\beta_{1}\)` --- ## Interactions: Ai and Norton (2003) - How does the marginal effect of `\(x_1\)` change as the value of `\(x_2\)` changes? - Rephrase: How does the value of `\(x_2\)` change the effect of `\(x_1\)` on `\(P(y=1)\)`? `\begin{eqnarray} \frac{d^2}{dx_1dx_2} P(y=1) &=& \frac{d}{dx_2}( (\beta_1+\beta_{12}x_2) F'(\cdot) ) \nonumber \\ \frac{d^2}{dx_1dx_2} P(y=1) &=& \beta_{12} F'(\cdot) +(\beta_1+\beta_{12}x_2)(\beta_2+\beta_{12}x_1) F''(\cdot) \nonumber \end{eqnarray}` --- ## Interactions: Ai and Norton (2003) `\begin{eqnarray} \frac{d^2}{dx_1dx_2} P(y=1) &=& \beta_{12} F'(\cdot) +(\beta_1+\beta_{12}x_2)(\beta_2+\beta_{12}x_1) F''(\cdot) \nonumber \end{eqnarray}` What can we learn about interactions from this? - Even if `\(\beta_{12}=0\)` the expression above for `\(\frac{d^2}{dx_1dx_2} P(y=1)\)` has a **nonzero** value. - The sign of `\(\beta_{12}\)` does **not necessarily** indicate the sign of the cross-partial effect. - The "significance" of `\(\beta_{12}\)` does **not** provide information about the "significance" of the "interaction effect". - The cross-partial effect must be evaluated for the specific **nonlinear function** in question. You have to evaluate it for the values of the variables that you are interested in. --- ## Illustrating implications of a product term This issue was further commented on by [Berry et al.
(2010)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2009.00429.x) <img src="berry_1.png" width="700px" style="display: block; margin: auto;" /> --- ## Illustrating implications of a product term - For `\(X_{1}=6\)` and `\(X_{2}=0\)`, the marginal effect of `\(X_{1}\)` on `\(Pr(Y)\)` is 0.105 in both figures <img src="berry_1.png" width="700px" style="display: block; margin: auto;" /> --- ## Illustrating implications of a product term - For `\(X_{1}=6\)` and `\(X_{2}=2\)`, the marginal effect of `\(X_{1}\)` on `\(Pr(Y)\)` decreases to 0.018 in the figure on the left, while in the figure on the right it decreases only to 0.096 <img src="berry_1.png" width="700px" style="display: block; margin: auto;" /> --- ## What to do with interactions in GLMs? - Marginal effect of `\(X_{1}\)` on `\(P(Y)\)` will vary with `\(X_{2}\)` - There will be some interaction between `\(X_{1}\)` and `\(X_{2}\)` even when the model does not include a product term + Again this is because computing marginal effects in the presence of the link function requires the chain rule, so terms beyond the product-term coefficient are retained in the expression - In general, always include a product term if you think there is a compelling theoretical or empirical reason to do so --- ## So say that we run an interaction ... Specifically, say that we add an interaction between transnational constituency and Log(GDP per capita), with the idea that transnational constituencies are more likely to support rebel groups when the country has low levels of economic development. If we run such a model, how do we evaluate whether it improved our ability to explain variation in the data? --- ## How to evaluate models? In-Sample Fit - Test the model's performance on the same data used to fit the model - Easy to do – you already have the data - Sadly, the default approach unless otherwise stated - Tends towards overconfidence and overfitting: + Treats models that predict quirks of the sample as good models of the population --- ## How to evaluate models? Out-of-Sample Forecast - Goal: Find underlying structure in your data - Process: + Partition data between training and test sets + Fit model to training set and predict test set + Compare to truth for average prediction and full distribution - Worry: If world changes, model may fail anyway - For example, see [Bowlsby et al. (2019)](https://www.cambridge.org/core/journals/british-journal-of-political-science/article/future-is-a-moving-target-predicting-political-instability/0028744BE1AFF83F879E7759D798D88A) --- ## How to evaluate models?
Cross-validation - Goal: See how well our model reproduces random partitions of our data in an out-of-sample context - Process: + Randomly select `\(k\)` observations as the "test set" + Evaluate with the rest of data as training data + Set aside another set of `\(k\)` observations + Evaluate and repeat + Report performance averaged over subsets --- ## Lets talk about evaluation metrics first Four generic ways to check MLE GOF - t-test / Wald Statistic - Likelihood Ratio Test - Bayesian Information Criterion (BIC) - Akaike Information Criterion (AIC) Some better ways to check MLE GOF for binary outcomes - Percent correctly predicted - Separation plots - Sensitivity and specificity / ROC plots - Precision and recall / PR plots --- ## T-test - Lets start with the silliest, t-tests - These are a general test of GOF applied in-sample - A simple test of whether a parameter is significantly different from zero is: `\begin{eqnarray} p &=& 1 - F_{t} ( \frac{\hat\beta}{\sqrt{Var(\hat\beta)}}, n-k ) \nonumber \end{eqnarray}` --- ## T-test Lets see this in practice: ``` r summary(logit1)$'coefficients' ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.803775 1.006749 0.798387 0.424646 ## log_rgdpch -0.330465 0.135740 -2.434547 0.014910 ## cinc 8.093218 4.533541 1.785187 0.074231 ## pol6 -0.318133 0.299638 -1.061725 0.288360 ## gs 1.026631 0.247624 4.145930 0.000034 ## tk 1.176287 0.260579 4.514122 0.000006 ## riv 0.941308 0.239506 3.930210 0.000085 ``` - What does a p value of 0.000006 mean? - A more general form of the t-test is the Wald test and can be applied to multiple parameters at once --- ## Likelihood ratios Likelihood ratio test of nested models `\begin{eqnarray} LR &=& -2 log \frac{\mathcal{L}(\mathcal{M}_{1})}{\mathcal{L}(\mathcal{M}_{2})} \nonumber \\ &=& 2( log \; \mathcal{L}(\mathcal{M}_{2}) - log \; \mathcal{L}(\mathcal{M}_{1}) ) \nonumber \\ LR &\sim& f_{\chi^{2}}(m) \nonumber \end{eqnarray}` where `\(m\)` is the number of restrictions placed on model 2 by model 1 (e.g., parameters held constant in model 1) --- ## Likelihood ratios - Lets compare these models ``` r summary(logit1)$'coefficients' ; summary(logit2)$'coefficients' ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.804 1.007 0.798 0.425 ## log_rgdpch -0.330 0.136 -2.435 0.015 ## cinc 8.093 4.534 1.785 0.074 ## pol6 -0.318 0.300 -1.062 0.288 ## gs 1.027 0.248 4.146 0.000 ## tk 1.176 0.261 4.514 0.000 ## riv 0.941 0.240 3.930 0.000 ``` ``` ## Estimate Std. 
Error z value Pr(>|z|) ## (Intercept) 0.529 1.277 0.414 0.679 ## log_rgdpch -0.294 0.170 -1.732 0.083 ## cinc 8.319 4.570 1.820 0.069 ## pol6 -0.322 0.300 -1.076 0.282 ## gs 1.033 0.249 4.157 0.000 ## tk 1.900 2.088 0.910 0.363 ## riv 0.948 0.241 3.941 0.000 ## tk_gdp -0.093 0.266 -0.350 0.727 ``` --- ## Likelihood ratios - `logit1` is nested within `logit2` because the latter has all the same parameters and the interaction between `tk` and `log_rgdpch` ``` r library(lmtest) lrtest(logit1, logit2) ``` ``` ## Likelihood ratio test ## ## Model 1: supp1 ~ log_rgdpch + cinc + pol6 + gs + tk + riv ## Model 2: supp1 ~ log_rgdpch + cinc + pol6 + gs + tk + riv + tk_gdp ## #Df LogLik Df Chisq Pr(>Chisq) ## 1 7 -217.36 ## 2 8 -217.30 1 0.1226 0.7262 ``` --- ## Likelihood ratios - You can also use likelihood ratio tests to compare against a null model ``` r lrtest(logit1) ``` ``` ## Likelihood ratio test ## ## Model 1: supp1 ~ log_rgdpch + cinc + pol6 + gs + tk + riv ## Model 2: supp1 ~ 1 ## #Df LogLik Df Chisq Pr(>Chisq) ## 1 7 -217.36 ## 2 1 -246.61 -6 58.494 9.097e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Information Criteria The Bayesian Information Criterion (BIC) is a type of penalized likelihood ratio test `\begin{eqnarray} BIC_{\mathcal{M}_{1}} &=& log \; n \times p_{1} -2 log \mathcal{L}(\mathcal{M}_{1}) \nonumber \\ BIC_{\mathcal{M}_{2}} &=& log \; n \times p_{2} -2 log \mathcal{L}(\mathcal{M}_{2}) \nonumber \\ \end{eqnarray}` - where `\(p\)` are the number of parameters and `\(n\)` the number of observations - Lower values of BIC are better - We refer to this as a penalized LRT because there is a penalty for extra observations and parameters --- ## Information Criteria To calculate in R for a glm: ``` r BIC(logit1) ``` ``` ## [1] 475.9509 ``` ``` r BIC(logit2) ``` ``` ## [1] 481.7171 ``` Keep in mind like LRT, BIC can be only used to compare models that have been run on the same underlying sample --- ## Information Criteria The Akaike Information Criterion (AIC) is also a type of penalized likelihood ratio test `\begin{eqnarray} AIC_{\mathcal{M}_{1}} &=& 2 p_{1} -2 log \mathcal{L}(\mathcal{M}_{1}) \nonumber \\ AIC_{\mathcal{M}_{2}} &=& 2 p_{2} -2 log \mathcal{L}(\mathcal{M}_{2}) \nonumber \\ \end{eqnarray}` - where `\(p\)` again is the number of parameters - Lower values of AIC are again better --- ## Information Criteria To calculate in R for glm: ``` r AIC(logit1) ``` ``` ## [1] 448.7287 ``` ``` r AIC(logit2) ``` ``` ## [1] 450.6061 ``` --- ## Percent correctly predicted <img src="pcp.png" width="700px" style="display: block; margin: auto;" /> --- ## Percent correctly predicted Calculate the percent correctly predicted (PCP) for `logit1` and `logit2`, you should get the following for the two models, respectively: ``` ## [1] 0.6842105 ``` ``` ## [1] 0.6731302 ``` - Also what would you say about a model that had a PCP of less than `\(\pi^{\ast} = 0.50\)`? --- ## Percent correctly predicted This approach of gauging performance might be intuitive but it is extremely problematic - Do we want a metric that will distinguish between cases of .51 and .49? - What about .51 and .99? - Why use a strict threshold of .5 at all? 
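 --- ## Computing PCP: a sketch

For reference, here is one way the PCP values quoted on the earlier slide can be computed. This is a minimal sketch that assumes `logit1`, `logit2`, and `dat2` from the earlier slides are in memory and that both models were fit on the same rows:

``` r
# percent correctly predicted at a given threshold (default .5)
pcp = function(mod, y, cut=.5){
  yhat = as.numeric( predict(mod, type='response') > cut )
  mean( yhat == y ) }

pcp(logit1, dat2$supp1)
pcp(logit2, dat2$supp1)
```

Changing `cut` makes the arbitrariness of the .5 threshold easy to see, which is exactly the concern raised above and the motivation for the separation plots and ROC curves that follow.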
--- ## Separation plots [Greenhill et al.](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2011.00525.x) suggest a simple graphical approach to assess GOF for binary models: ``` r library(separationplot) ``` ``` ## Loading required package: RColorBrewer ``` ``` ## Loading required package: Hmisc ``` ``` ## ## Attaching package: 'Hmisc' ``` ``` ## The following objects are masked from 'package:base': ## ## format.pval, units ``` ``` ## Loading required package: MASS ``` ``` ## Loading required package: foreign ``` ``` r separationplot( pred=predict(logit1, type='response'), actual=dat2$supp1 ) separationplot( pred=predict(logit2, type='response'), actual=dat2$supp1 ) ``` <img src="logit1sep.png" width="700px" style="display: block; margin: auto;" /> <img src="logit2sep.png" width="700px" style="display: block; margin: auto;" /> --- ## Brief aside on perfect separation Suppose the separation plot looked like this <img src="perfect_sep.png" width="700px" style="display: block; margin: auto;" /> - What kind of model would produce perfect separation? - Here we're using a green line to trace the probability of each case; what is the slope of this line near the separation between tan and red? - If there is a binary covariate that is 0 for every `\(y=0\)` and 1 for every `\(y=1\)`, what is the logit coefficient for that covariate? - Could a logit model fit too well? --- ## Back to our models <img src="logit1sep.png" width="700px" style="display: block; margin: auto;" /> <img src="logit2sep.png" width="700px" style="display: block; margin: auto;" /> - When we were working with percent correctly predicted, we noted that `\(\pi^{\ast} = .5\)` was an arbitrary threshold - Can we conceptualize different thresholds `\(\pi^{\ast}\)` that would classify the data? --- ## Sensitivity and Specificity Let `\(\pi^{\ast}\)` be a threshold at some level in [0,1], above which we suppose `\(\hat y = 1\)` Given a threshold `\(\pi^{\ast}\)`, we can compute the model's sensitivity and specificity: <img src="two_two.png" width="250px" style="display: block; margin: auto;" /> - **Sensitivity** (true positive rate): + is the fraction of true positives the model tends to identify: `\(P(\hat y \geq \pi^{\ast} | y = 1)\)` + in terms of the 2 x 2, `\(\frac{TP}{TP + FN}\)` - **Specificity** (true negative rate): + is the fraction of true negatives the model tends to identify: `\(P(\hat y < \pi^{\ast} | y = 0)\)` + in terms of the 2 x 2, `\(\frac{TN}{TN + FP}\)` --- ## Sensitivity and Specificity We can compute in-sample specificity & sensitivity for any `\(\pi^{\ast}\)` & model - `\(\pi^{\ast}=.5\)` + logit1: * Sensitivity = .55 * Specificity = .78 + logit2: * Sensitivity = .53 * Specificity = .78 - `\(\pi^{\ast}=.15\)` + logit1: * Sensitivity = .98 * Specificity = .10 + logit2: * Sensitivity = .98 * Specificity = .10 What changes when we move `\(\pi^{\ast}\)`? How do you think it would change if we set `\(\pi^{\ast}\)` to .95? --- ## Receiver Operating Characteristic (ROC) Curves - Unless we have a `\(\pi^{\ast}\)` of particular interest, we're more interested in the range of sensitivity and specificity - We'd also like to understand the tradeoff better and see the threshold that maximizes their sum - This is where ROC curves come into play!
- These were developed in WWII to evaluate radar operators - ROC curves show sensitivity & specificity under all thresholds from 0 to 1 --- ## ROC curves One change we make when using ROC curves is that we plot sensitivity against 1-specificity `\begin{eqnarray} Specificity &=& \frac{TN}{TN + FP} \nonumber \\ 1 - Specificity &=& 1 - \frac{TN}{TN + FP} \nonumber \\ 1 - Specificity &=& \frac{TN + FP - TN}{TN + FP} \nonumber \\ 1 - Specificity &=& \frac{FP}{TN + FP} \nonumber \end{eqnarray}` - While specificity gives us the true negative rate, 1-specificity gives us the false positive rate - So together sensitivity gives us the true positive rate and 1-specificity gives us the false positive rate - Thus, an ROC curve has a nice interpretation in that it gives us an indication of how well the probabilities from the positive class are separated from the negative class --- ## Producing a ROC curve is easy! ``` r roc = function(threshold, predProbs, dv){ # Set up output matrix pr=matrix(NA, ncol=3, nrow=length(threshold), dimnames=list(NULL, c('Threshold', 'FPR', 'TPR') ) ) # Loop through thresholds for(ii in 1:length(threshold)){ predRoc = as.numeric( predProbs > threshold[ii] ) FPR = sum( (predRoc==1)*(dv==0) ) / sum(dv==0) TPR = sum( (predRoc==1)*(dv==1) ) / sum(dv==1) pr[ii,1] = threshold[ii] pr[ii,2] = FPR pr[ii,3] = TPR } # Return output return( data.frame( pr ) ) } ``` --- ## Plugging in our data ``` r rocCurve = roc( seq(0, 1, by=.001), dat2$logit1Probs, dat2$supp1 ) ggplot(rocCurve, aes(x=FPR, y=TPR)) + geom_line() + geom_abline(intercept=0, slope=1) ``` --- ## Plugging in our data <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-73-1.png" style="display: block; margin: auto;" /> --- ## Lots of handy packages ``` r library(PRROC) roc<-roc.curve( scores.class0 = dat2$logit1Probs[dat2$supp1==1], scores.class1 = dat2$logit1Probs[dat2$supp1==0], curve=TRUE ) plot(roc) ``` --- ## Lots of handy packages ``` ## Loading required package: rlang ``` ``` ## ## Attaching package: 'rlang' ``` ``` ## The following object is masked from 'package:magrittr': ## ## set_names ``` <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-75-1.png" style="display: block; margin: auto;" /> --- ## Precision-recall curves - Frequently within political science, at least in conflict studies, we are studying events that are rare - In cases where we have sparse/rare dependent variables, ROC curves are a misleading indication of model performance - This occurs because in sparse data it is easy to predict negatives - Interstate conflict for example is an extremely rare event, if you just had a model that predicted nothing but zero, by the standard of ROC plots you would have a great classifier --- ## Precision-recall curves - Given that it's easy to predict zeros in sparse data, using the false positive rate is not that helpful - One option is to compare false positives to the overall number of positive predictions made by a model for a given threshold ... 
this is **precision** - Precision gives you a sense of how believable a model's predictions are, i.e., if the model says there was an event, what are the actual chances this is true? `\begin{eqnarray} Precision &=& \frac{TP}{TP + FP} \end{eqnarray}` - Also note that recall in a PR curve is the same thing as sensitivity - The process of constructing one of these curves is operationally the same as for the ROC curve; we just use precision instead of 1-specificity --- ## Precision-recall curves ``` r pr<-pr.curve( scores.class0 = dat2$logit1Probs[dat2$supp1==1], scores.class1 = dat2$logit1Probs[dat2$supp1==0], curve=TRUE ) plot(pr) ``` --- ## Precision-recall curves <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-77-1.png" style="display: block; margin: auto;" /> --- ## Paradigms to test GOF Now that we've identified some metrics to assess binary outcomes, we can go back to strategies for empirically assessing our model: - In-sample analysis - Out-of-sample analysis - Cross-validation --- ## One last note specific to categorical models ... separation - Separation refers to the presence of one or more covariates that perfectly predict the outcome of interest - For example, say we are modeling the final passage vote for the Affordable Care Act + Our dv is `\(Y_{i}\)` = {0, 1}, where a 1 indicates voting for the ACA + `\(X_{i,1}\)` is a binary indicator for Democrat + `\(X_{i,2}\)` is a measure of congressperson ideology (e.g., DW-Nominate) - Did any Republicans vote for the ACA ... no ... just ask frustrated Obama <img src="angry_obama.jpeg" width="400px" style="display: block; margin: auto;" /> --- ## Consequence of separation - Given that not a single Republican voted, we know that: `\(Pr(Y_{i}=1 | X_{i,1}=0) = 0\)` - If we try to model this: ``` r mod = glm( ACA ~ Democrat + Ideology, data=data, family=binomial(link='logit') ) ``` ``` ## Warning: glm.fit: algorithm did not converge ``` ``` r round(summary(mod)$'coefficients', 3) ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -26.566 32020.55 -0.001 0.999 ## Democrat 53.132 51908.91 0.001 0.999 ## Ideology 0.000 25959.54 0.000 1.000 ``` --- ## So what to do? Bring back Bayes (a bit)! - Can we just add more observations ... typically not, or we would have been using them already - Penalized (Prior)-Logistic Regression + Separation: causes coefficients to diverge + Penalty (prior): forces coefficients towards zero - Step 1: Standardize inputs (Gelman et al) + Binary variables: mean 0, differ by 1 * Democrats: (30\%), so new values (0.7, -0.3) + Other variables: mean 0, sd 0.5 --- ## So what to do? Bring back Bayes (a bit)! - Step 2: Penalize Likelihood + Firth's penalty (Zorn 2005): `\begin{eqnarray} L(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y} ) & = & \prod_{i=1}^{N} \pi_{i}^{Y_{i}} (1- \pi_{i})^{1-Y_{i}} |I(\boldsymbol{\beta})|^{1/2} \nonumber \end{eqnarray}` where: `\begin{eqnarray} \pi_{i} & = & \frac{1}{1 + \exp(-\boldsymbol{X}_{i}^{'}\boldsymbol{\beta})} \nonumber \\ |I(\boldsymbol{\beta})| & = & \text{ Determinant of Fisher's information at } \boldsymbol{\beta} \nonumber \\ I(\boldsymbol{\beta}) & = & \boldsymbol{X}^{'} \boldsymbol{W} \boldsymbol{X} \nonumber \end{eqnarray}` <img src="meat.png" width="500px" style="display: block; margin: auto;" /> --- ## So what to do? Bring back Bayes (a bit)!
`\begin{eqnarray} L(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y} ) & = & \prod_{i=1}^{N} \pi_{i}^{Y_{i}} (1- \pi_{i})^{1-Y_{i}}|I(\boldsymbol{\beta})|^{1/2} \nonumber \\ l(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y} ) & = & \sum_{i=1}^{N} \left[ Y_{i} \log \pi_{i} + (1-Y_{i} ) \log (1- \pi_{i}) \right] + \frac{1}{2}\log(|I(\boldsymbol{\beta})|) \nonumber \end{eqnarray}` --- ## How to estimate? ``` r jeffPrior <- function(params, X, Y){ beta <- params y.tilde <- X%*%beta y.prob <- plogis(y.tilde) temp <- matrix(0, nrow = length(Y), ncol=length(Y)) part1 <- Y%*%log(y.prob) + (1-Y)%*%log(1- y.prob) diag(temp)<- y.prob*(1-y.prob) part2 <- 0.5*log(det(t(X)%*%temp%*%X)) out <- part1 + part2 return(out) } firth <- optim( rnorm(3), # random starting vals jeffPrior, method = 'BFGS', control=list(fnscale=-1), hessian=T, X = data.matrix(cbind(1, data[,c('Democrat','Ideology')])), Y = data[,'ACA'] ) ``` --- ## Lets see if this helped ``` r round(cbind(firth$par, sqrt(diag(-solve(firth$hessian)))), 3) ``` ``` ## [,1] [,2] ## [1,] -7.199 3.194 ## [2,] 14.385 5.606 ## [3,] -1.355 1.724 ``` ``` r round(summary(mod)$'coefficients', 3) ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -26.566 32020.55 -0.001 0.999 ## Democrat 53.132 51908.91 0.001 0.999 ## Ideology 0.000 25959.54 0.000 1.000 ``` Seems like it? --- ## Any other popular solutions? - From a Bayesian perspective, Gelman notes that the application of a Jeffreys prior by Firth is strange, since it can't be interpreted as actual prior information because there is no interpretable scale and it can depend on the data in complex ways - Gelman argues for the use of a weakly informative Cauchy(0, 2.5) prior on coefficients + Like the Jeffreys prior this will bound estimates away from positive and negative infinity, but it can also be interpreted as actual prior information - However, Rainey (2016) notes that this approach can have problems as well + Basic insight: since we are entering the world of priors and Bayesian analysis with these solutions, we need to be careful with how we parameterize the prior information we are injecting into the model - A short sketch of off-the-shelf implementations of both approaches appears at the end of these slides --- ## So ... what you doing for your project? <img src="picard.jpeg" width="400px" style="display: block; margin: auto;" />
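 --- ## Off-the-shelf penalized logits (sketch)

Rather than hand-rolling the penalized likelihood with `optim`, both ideas above have package implementations. This is a sketch under the assumption that the ACA example data frame `data` from the separation slides is available and that the `logistf` and `arm` packages are installed:

``` r
# Firth's penalized likelihood logit
library(logistf)
firthMod = logistf(ACA ~ Democrat + Ideology, data=data)
summary(firthMod)

# weakly informative Cauchy(0, 2.5) prior on the coefficients (Gelman et al.)
library(arm)
cauchyMod = bayesglm(ACA ~ Democrat + Ideology, data=data,
                     family=binomial(link='logit'),
                     prior.scale=2.5, prior.df=1)
summary(cauchyMod)
```

Either route should bound the separated coefficient away from infinity, much like the hand-coded Firth penalty above.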