class: center, middle, inverse, title-slide .title[ # PLS 900/803 MLE Binary Models ] .author[ ### Shahryar Minhas [s7minhas.com] ] --- exclude: true ``` r library(stringr) library(magrittr) library(rvest) ``` --- ## Readings associated with this lecture - Discussion about other link functions: + See binary response models handout (pls900_mle_week2_binaryModels.pdf) - Interaction + [Ai and Norton (2003)](https://www.sciencedirect.com/science/article/pii/S0165176503000326) + [Berry et al. (2009)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2009.00429.x) - Simulation: + [Making the most of statistical analyses](https://web.stanford.edu/~tomz/pubs/ajps00.pdf) + [Behind the curve](https://gvpt.umd.edu/sites/gvpt.umd.edu/files/pubs/Hanmer%20and%20Kalkan%20AJPS%20Behind%20the%20Curve.pdf) - Model Assessment: + ["To explain or to predict". Shmueli. Statistical Science. 2010.](https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf) + [Perils of policy by p-value](https://journals.sagepub.com/doi/10.1177/0022343309356491) + [Cross-Validation](http://www.marcel-neunhoeffer.com/pdf/papers/pa_cross-validation.pdf) + [Separation Plot](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2011.00525.x) + [Box's Loop](https://journals.sagepub.com/doi/full/10.1177/0022343316682065) - OLS for binary data (optional): + [Beck (2020)](https://www.cambridge.org/core/journals/political-analysis/article/estimating-grouped-data-models-with-a-binarydependent-variable-and-fixed-effects-via-a-logit-versus-a-linear-probability-model-the-impact-of-dropped-units/AD6E3A3EA15BEDECA6B6FD49FE0216B3) + [Horrace & Oaxaca (2005)](https://surface.syr.edu/cgi/viewcontent.cgi?article=1138&context=ecn) + [Gomila (2020)](https://psyarxiv.com/4gmbv) + [Gelman Blog Post](https://statmodeling.stat.columbia.edu/2020/01/10/linear-or-logistic-regression-with-binary-outcomes/) --- ## Generalized linear models All of the models we're talking about here belong to a class called generalized linear models (GLM) Elements of GLM: - distribution for `\(Y\)` (stochastic component) - linear predictor `\(X\beta\)` (systematic component) - link function that relates the linear predictor to the parameters of the distribution --- ## Binary data & social science Binary data are very common in social science measurement - Did you vote? - Did an event occur (conflict, democratic transition)? - Is an institution present in a country? - Did an individual commit a crime? --- ## Binary data & the advantages of MLE The analysis of these data is also fundamental to many advanced topics - Event history models - Network models - Censoring - Causal analysis Also models of binary data are a good way to understand the process of maximum likelihood --- ## Why not OLS/Linear model? <img src="ols1.png" width="900px" style="display: block; margin: auto;" /> --- ## Why not OLS/Linear model? <img src="ols2.png" width="900px" style="display: block; margin: auto;" /> --- ## Why not OLS/Linear model? <img src="ols3.png" width="900px" style="display: block; margin: auto;" /> --- ## Why not OLS/Linear model? 
<img src="ols4.png" width="900px" style="display: block; margin: auto;" /> --- ## Model for binary dependent variable Of course, we start by specifying a distribution for Y: `\begin{eqnarray} y_{i} & \sim & Bernoulli(pi_{i}) \nonumber \\ p(y | \pi) & = & \prod_{i=1}^{n} \pi_{i}^{y_{i}} ( 1- \pi_{i})^{1- y_{i}} \nonumber \end{eqnarray}` Specify linear predictor: `\begin{eqnarray} X_{i}\beta & = & \beta_{0} + X_{i1}\beta_{1} + X_{i2}\beta_{2} + ... X_{ip}\beta_{p} \nonumber \end{eqnarray}` --- ## Model for binary dependent variable Specify a link function: Logit: `\begin{eqnarray} \pi_{i} & = & \frac{1}{1+e^{-X_{i}\beta}} \nonumber \end{eqnarray}` Could have chosen several other options as well: - Probit: `\(\pi_{i} = \Phi(X_{i} \beta)\)` - Complementary log-log: `\(pi_{i} = 1 - exp(-exp(X_{i} \beta))\)` - ... --- ## Does the link function matter? The logit function is a simple transformation mapping values in `\((0,1)\)` to `\((-\infty,\infty)\)`: `\begin{eqnarray} logit(y) & = & log \frac{y}{1-y} \nonumber \end{eqnarray}` We actually use the inverse-logit, which maps from `\((-\infty,\infty)\)` to `\((0,1)\)`: `\begin{eqnarray} logit(y) & = & \frac{1}{1+exp(-y)} \nonumber \end{eqnarray}` - Inverse-logit exagerrates differences in the midrange of `\(y\)` - And compresses differences in the neighborhood of relatively small or large `\(y\)`s --- ## Differences in link functions Inverse logit is shown in red, probit in blue, and complementary log-log in green. What are the notable differences? <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## Typical binary model formulation Standard approach in political science is: `\begin{eqnarray} y_{i} & \sim & Bernoulli(pi_{i}) \nonumber \\ \pi_{i} & = & \frac{1}{1 + exp(-X_{i} \ beta)} \nonumber \\ & = & logit^{-1}(X_{i}\beta) \nonumber \end{eqnarray}` where, `\begin{eqnarray} X_{i}\beta & = & \beta_{0} + X_{i1}\beta_{1} + X_{i2}\beta_{2} + ... X_{ip}\beta_{p} \nonumber \end{eqnarray}` --- ## Log-likelihood of logit <img src="log_like_logit.png" width="500px" style="display: block; margin: auto;" /> --- ## Log-likelihood of logit <img src="log_like_logit_2.png" width="900px" style="display: block; margin: auto;" /> --- ## Log-likelihood of logit <img src="log_like_logit_3.png" width="900px" style="display: block; margin: auto;" /> --- ## Log-likelihood of logit <img src="log_like_logit_4.png" width="900px" style="display: block; margin: auto;" /> --- ## What next? - We could do some calculus and algebra to solve, but that seems rather onerous. Lets use a computer. - Specifically, we want to maximize: `\(\sum_{i=1}^{n} y_{i} X_{i} \beta - X_{i}\beta - log(1+exp(-X_{i}\beta))\)`. - Lets write down the log likelihood function in R ``` r logitLL = function(par, y, X){ X = as.matrix(cbind(1, X)) xBeta = X %*% par ll = sum(y *xBeta - xBeta - log(1+exp(-xBeta))) return(ll) } ``` --- ## To use this lets bring in some data [Salehyan et al. 
2011: Explaining external support for insurgent groups](https://www.jstor.org/stable/23016231?seq=1#page_scan_tab_contents) Load the data from the .rda file: ``` r load('R/saleyhan_etal_2011.rda') ``` --- ## To use this lets bring in some data We'll do a small regression using their data: ``` r dv = 'supp1' ivs = c( 'log_rgdpch','cinc','pol6', 'gs', 'tk', 'riv' ) form = paste0(dv, '~', paste(ivs, collapse = '+')) dat2 = dat[,c(dv, ivs)] dat2 = na.omit(dat2) ``` --- ## Variable descriptions Unit of analysis: government - rebel dyad - `log_rgdpch`: logged real GDP per capita of the government (min-max: 5.23-10.43) - `cinc`: military capability of the government (0-0.28) - `pol6`: is the government's polity score greater than 6 (0,1) - `gs`: does the government receive foreign support (0,1) - `tk`: does the rebel group the state is fighting have a transnational constituency (0,1) - `riv`: is the state fighting the rebel group engaged in an international rivalry (0,1) --- ## Lets apply our function ... to retrieve the point estimates for `\(\beta\)` ``` r opt1 = optim( par=rep(0, length(ivs) + 1), fn=logitLL, X=dat2[,-1], y=dat2[,1], control=list(fnscale=-1), hessian=TRUE, method='BFGS' ) ``` --- ## Results? ``` r coefs = opt1$par coefs ``` ``` ## [1] 0.8038712 -0.3304774 8.0934579 -0.3181530 1.0266367 1.1762919 0.9413188 ``` --- ## Is this really it? What would the `glm` function give us? ``` r logit1 <- glm( form, data=dat2, na.action=na.omit, family=binomial(link='logit') ) cbind(coef(logit1), coefs) ``` ``` ## coefs ## (Intercept) 0.8037752 0.8038712 ## log_rgdpch -0.3304646 -0.3304774 ## cinc 8.0932181 8.0934579 ## pol6 -0.3181335 -0.3181530 ## gs 1.0266308 1.0266367 ## tk 1.1762873 1.1762919 ## riv 0.9413082 0.9413188 ``` --- ## Standard errors? The `glm` function, though, does give us other nice things like standard errors: ``` r summary(logit1)$'coefficients' ``` ``` ## Estimate Std.
Error z value Pr(>|z|) ## (Intercept) 0.8037752 1.0067487 0.7983872 4.246459e-01 ## log_rgdpch -0.3304646 0.1357397 -2.4345465 1.491046e-02 ## cinc 8.0932181 4.5335410 1.7851869 7.423100e-02 ## pol6 -0.3181335 0.2996382 -1.0617253 2.883604e-01 ## gs 1.0266308 0.2476238 4.1459299 3.384374e-05 ## tk 1.1762873 0.2605794 4.5141220 6.357961e-06 ## riv 0.9413082 0.2395058 3.9302102 8.487165e-05 ``` --- ## Retrieving standard errors - Standard errors are the square roots of the diagonal of: <img src="info_matrix.png" width="150px" style="display: block; margin: auto;" /> - where the quantity in the brackets is the Hessian matrix - The Hessian matrix captures the curvature of the log-likelihood, i.e., how quickly its slope changes as the parameters change - The negative of its expectation is referred to as the Information matrix --- ## Retrieving info from Hessians Variance-covariance matrix ``` r varcov = -solve(opt1$hessian) varcov ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1.01355149 -0.1320404404 0.12254403 0.072214226 -0.0343487771 ## [2,] -0.13204044 0.0184254802 -0.06326793 -0.010916428 -0.0008513621 ## [3,] 0.12254403 -0.0632679298 20.55316366 -0.378853320 0.3260238955 ## [4,] 0.07221423 -0.0109164280 -0.37885332 0.089783350 -0.0070570440 ## [5,] -0.03434878 -0.0008513621 0.32602390 -0.007057044 0.0613176952 ## [6,] 0.06009534 -0.0111285002 0.01620599 0.002119022 -0.0005401340 ## [7,] -0.03388172 -0.0006652542 0.05323945 -0.005674163 0.0132121369 ## [,6] [,7] ## [1,] 0.060095341 -0.0338817239 ## [2,] -0.011128500 -0.0006652542 ## [3,] 0.016205992 0.0532394475 ## [4,] 0.002119022 -0.0056741628 ## [5,] -0.000540134 0.0132121369 ## [6,] 0.067901678 0.0031449619 ## [7,] 0.003144962 0.0573631922 ``` --- ## Retrieving info from Hessians Standard errors: ``` r serrors = sqrt(diag(varcov)) serrors ``` ``` ## [1] 1.0067529 0.1357405 4.5335597 0.2996387 0.2476241 0.2605795 0.2395061 ``` --- ## Compare and contrast `glm`: ``` r summary(logit1)$'coefficients' ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.8037752 1.0067487 0.7983872 4.246459e-01 ## log_rgdpch -0.3304646 0.1357397 -2.4345465 1.491046e-02 ## cinc 8.0932181 4.5335410 1.7851869 7.423100e-02 ## pol6 -0.3181335 0.2996382 -1.0617253 2.883604e-01 ## gs 1.0266308 0.2476238 4.1459299 3.384374e-05 ## tk 1.1762873 0.2605794 4.5141220 6.357961e-06 ## riv 0.9413082 0.2395058 3.9302102 8.487165e-05 ``` --- ## Compare and contrast BFGS optim: ``` r cbind('Estimate'=coefs,'Std. Error'=serrors,'Z value'=coefs/serrors) ``` ``` ## Estimate Std. Error Z value ## [1,] 0.8038712 1.0067529 0.7984791 ## [2,] -0.3304774 0.1357405 -2.4346264 ## [3,] 8.0934579 4.5335597 1.7852324 ## [4,] -0.3181530 0.2996387 -1.0617888 ## [5,] 1.0266367 0.2476241 4.1459481 ## [6,] 1.1762919 0.2605795 4.5141384 ## [7,] 0.9413188 0.2395061 3.9302492 ``` --- ## So what do we do with this stuff? <img src="stargazing.jpeg" width="500px" style="display: block; margin: auto;" /> --- ## So what do we do with this stuff? - What does it mean for the rebel transnational constituency variable (`tk`) to be 1.18 and juiced up with stars? - One common way old people interpret these models is in terms of log odds <img src="log_odds.png" width="500px" style="display: block; margin: auto;" /> --- ## So what do we do with this stuff? - If a rebel group has a transnational constituency, there is a 1.18 unit increase in the log odds of the rebel group receiving external support, and it is true with THREE STARS! <img src="angry_baby.png" width="400px" style="display: block; margin: auto;" /> --- ## No, angry baby from previous slide ...
We gotta dig further ... this guy has the right idea <img src="baby_smurf.jpeg" width="400px" style="display: block; margin: auto;" /> --- ## How to interpret? - We could go onto relative risk and odds ratios but there are more straightforward ways - It is increasingly becoming standard practice that when you present your findings, you do so in terms of something that has substantive meaning to the reader - For binary outcomes, this means turning results into predicted probabilities - Also we need to take into account uncertainty when presenting results --- ## How do we get predictions in terms of probabilities from this model? Of course there is the predict function: ``` r yhats = predict(logit1) head(yhats) ``` ``` ## 1 2 3 8 9 10 ## 1.4167999 1.4167999 0.5212350 0.5194176 1.8878211 0.1158178 ``` Apply the link function to retrieve probabilities (could also have used `predict(model, type='response')`: ``` r yprobs = 1/(1+exp(-yhats)) head(yprobs) ``` ``` ## 1 2 3 8 9 10 ## 0.8048363 0.8048363 0.6274365 0.6270116 0.8685069 0.5289221 ``` --- ## How does the predict function work? Retrieve parameter estimates: ``` r beta = coef(logit1) ``` Set up covariate matrix: ``` r # add an intercept X = cbind(1, dat2[,ivs]) # turn into matrix X = data.matrix(X) head(X) ``` ``` ## 1 log_rgdpch cinc pol6 gs tk riv ## 1 1 7.699140 0.0016176 0 1 1 1 ## 2 1 7.699140 0.0016176 0 1 1 1 ## 3 1 7.553019 0.0013032 0 1 1 0 ## 8 1 6.846688 0.0012716 0 1 0 1 ## 9 1 6.266764 0.0013299 0 1 1 1 ## 10 1 8.620994 0.0053601 0 0 1 1 ``` --- ## How does the predict function work? Calculate in-sample predictions: ``` r # gen yhats yhat = X %*% beta # apply link fn again to get probs yprobs = 1/(1+exp(-yhat)) # compare with predict head( cbind( yprobs, predict(logit1, type='response') ) ) ``` ``` ## [,1] [,2] ## 1 0.8048363 0.8048363 ## 2 0.8048363 0.8048363 ## 3 0.6274365 0.6274365 ## 8 0.6270116 0.6270116 ## 9 0.8685069 0.8685069 ## 10 0.5289221 0.5289221 ``` --- ## How does this help us with interpretation? Well lets create a pair of scenarios to understand the effect of `tk`, we'll hold all variables at their measure of central tendency in both scenarios, but in one scenario we'll set `tk` to 1 and in the other to 0 ``` r scenarios = with(dat2, cbind( intercept=1, gdp = mean(log_rgdpch), cinc = mean(cinc), pol6 = median(pol6), gs = median(gs), tk = c(0,1), riv = median(riv) )) scenarios ``` ``` ## intercept gdp cinc pol6 gs tk riv ## [1,] 1 7.620791 0.0118693 0 0 0 1 ## [2,] 1 7.620791 0.0118693 0 0 1 1 ``` --- ## How does this help us with interpretation? Now lets repeat the process of getting predicted probabilities: ``` r # gen yhats yhat = scenarios %*% beta # apply link fn again to get probs yprobs = 1/(1+exp(-yhat)) # print yprobs ``` ``` ## [,1] ## [1,] 0.3368737 ## [2,] 0.6222314 ``` What's the interpretation here? --- ## Where Does Uncertainty Come From? Remember, we're *estimating* `\(\beta\)` from a finite sample: - **We don't know the true `\(\beta\)`** - we only have our estimate `\(\hat{\beta}\)` - **Our data is just one sample** - different samples would give slightly different estimates - **MLE gives us a point estimate** - but how confident should we be? Think of it like polling: - Poll 1000 people → get 52% support - Poll a different 1000 → might get 49% or 55% - The estimate varies, but in a predictable way! **Key insight**: The sampling distribution of our estimates follows a pattern --- ## The Magic of Maximum Likelihood Theory ### Why can we use a normal distribution for `\(\hat{\beta}\)`? 
**Three key results from MLE theory:** 1. **Consistency**: As `\(n \to \infty\)`, `\(\hat{\beta} \to \beta_{true}\)` - With more data, we get closer to the truth 2. **Asymptotic Normality**: For large `\(n\)`: `$$\sqrt{n}(\hat{\beta} - \beta_{true}) \xrightarrow{d} N(0, I^{-1})$$` - Even if errors aren't normal, `\(\hat{\beta}\)` becomes normal! - This is the Central Limit Theorem at work 3. **Efficiency**: MLE has the smallest variance among consistent estimators - We're getting the best estimates possible --- ## From Theory to Practice: The Multivariate Normal ### What we know from MLE theory: `$$\hat{\beta} \sim MVN(\beta_{true}, \Sigma)$$` where `\(\Sigma\)` is the variance-covariance matrix ### But wait, we don't know `\(\beta_{true}\)` or `\(\Sigma\)`! **Solution**: Use our estimates! - Replace `\(\beta_{true}\)` with `\(\hat{\beta}\)` (our MLE estimate) - Replace `\(\Sigma\)` with `\(\hat{\Sigma} = [-H]^{-1}\)` (inverse of negative Hessian) This gives us: `$$\hat{\beta} \sim MVN(\hat{\beta}, \hat{\Sigma})$$` Seems circular? It's not - we're describing the *sampling distribution* --- ## Understanding the Variance-Covariance Matrix The variance-covariance matrix `\(\hat{\Sigma}\)` tells us two crucial things: ``` ## (Intercept) democracy gdp ## (Intercept) 0.04 -0.02 0.01 ## democracy -0.02 0.09 -0.03 ## gdp 0.01 -0.03 0.16 ``` **Diagonal elements** = Variances of each `\(\hat{\beta}\)` - Larger values → more uncertainty - Square root gives standard errors **Off-diagonal elements** = Covariances between `\(\hat{\beta}\)`s - Shows how parameters vary together - Important for joint inference! --- ## Why This Matters: From Single to Joint Uncertainty ### Single parameter uncertainty (univariate): `$$\hat{\beta}_k \sim N(\hat{\beta}_k, SE^2_k)$$` Easy to visualize: bell curve around estimate ### Multiple parameters together (multivariate): `$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} \sim MVN\left(\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}, \hat{\Sigma}\right)$$` This captures: - Individual parameter uncertainty - **Correlations between parameters** - Allows us to make statements about combinations of `\(\beta\)`s **Key for interpretation**: When we compute predicted probabilities, we use *multiple* `\(\beta\)`s together! --- ## Now what about uncertainty? Lets start with what we know about MLEs: ``` r vcov(logit1) ``` ``` ## (Intercept) log_rgdpch cinc pol6 gs tk riv ## (Intercept) 1.01 -0.13 0.12 0.07 -0.03 0.06 -0.03 ## log_rgdpch -0.13 0.02 -0.06 -0.01 0.00 -0.01 0.00 ## cinc 0.12 -0.06 20.55 -0.38 0.33 0.02 0.05 ## pol6 0.07 -0.01 -0.38 0.09 -0.01 0.00 -0.01 ## gs -0.03 0.00 0.33 -0.01 0.06 0.00 0.01 ## tk 0.06 -0.01 0.02 0.00 0.00 0.07 0.00 ## riv -0.03 0.00 0.05 -0.01 0.01 0.00 0.06 ``` The variance covariance matrix gives us a measure of not only how precisely we think we have estimates `\(\beta_{p}\)` but also how `\(\beta\)`s vary together --- ## Now what about uncertainty? - We often summarize uncertainty with confidence intervals around parameters - These intervals around the parameter estimate are normally distributed because by the central limit theorem, we assume that `\(\hat \beta_{MLE} \sim mvnorm(\hat\beta, \hat\Sigma)\)`, where `\(\hat\Sigma\)` represents the variance covariance matrix <img src="beta_uncert.png" width="600px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? 
- Given that the `\(\beta\)`s are assumed to follow a multivariate normal, we know that if we take enough random draws from that distribution, those draws will be representative of the set of possible values of `\(\beta\)` as estimated by our model - This is really useful because it means that we can capture uncertainty in any estimated parameter using draws from its predictive distribution <img src="beta_uncert_draws.png" width="400px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? - In most models, we have more than one explanatory variable, and unless the off-diagonal entries of the variance-covariance matrix are zero, we know that the parameter estimates are correlated with one another <img src="twobetas.png" width="400px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? - In most models, we have more than one explanatory variable, and unless the off-diagonal entries of the variance-covariance matrix are zero, we know that the parameter estimates are correlated with one another <img src="twobetas_2.png" width="400px" style="display: block; margin: auto;" /> --- ## Now what about uncertainty? - Using a multivariate normal we can draw a sample of values of these `\(\beta\)`s jointly <img src="twobetas_3.png" width="400px" style="display: block; margin: auto;" /> --- ## Outline of procedure - Form model and estimate `\(\hat\beta\)` and `\(\hat\Sigma\)` - Simulate from the sampling distribution of `\(\hat\beta\)` to incorporate **estimation uncertainty** - Multiply these simulated `\(\beta\)` values by some covariates in the model to get predictions - Plug those predictions into the inverse link function, so that the resulting estimate is on the same scale as the parameters in the stochastic component of our model - Use the transformed predicted values to take thousands of draws from your stochastic function in order to incorporate **fundamental uncertainty** - Store the means of these simulations --- ## More formally 1. Choose a scenario, `\(X_{s}\)` 2. Estimate the model and obtain the parameter vector, `\(\hat\beta\)`, and its variance covariance matrix, `\(\hat\Sigma_{\hat\beta}\)` 3. Draw `\(\tilde\beta\)` from the multivariate normal `\(MVN(\hat\beta, \hat\Sigma_{\hat\beta})\)` 4. Calculate `\(\tilde\mu_{s} = f(\tilde\beta, X_{s})\)` 5. To account for fundamental uncertainty, draw `\(\tilde y_{s} \sim g(\tilde\mu_{s}, \alpha)\)` from the model's distribution (this is a predicted value) 6. Repeat steps 3 to 5 many times, averaging the results to get one expected value 7. Repeat step 6 many times to obtain a vector of expected values 8. Summarize this vector using means, sds, or confidence intervals --- ## Lets do this! Specifically, to evaluate the effect of having a transnational constituency ``` r library(mvtnorm) sims = 1000 betaDraws = rmvnorm(sims, coef(logit1), vcov(logit1)) head(betaDraws) ``` ``` ## (Intercept) log_rgdpch cinc pol6 gs tk riv ## [1,] 0.3416528 -0.2618608 7.622924 0.07450124 0.5474533 1.4920103 0.9581079 ## [2,] 0.6710217 -0.3256797 9.253792 -0.23892030 0.8484446 1.2290140 1.0472092 ## [3,] 0.2907513 -0.2280450 16.216061 -0.54956733 1.0914486 0.9312046 0.4204781 ## [4,] 0.3250921 -0.2165770 9.570286 -0.99250230 0.9540670 1.0413666 0.8618989 ## [5,] 0.2873576 -0.1825133 10.244827 -0.56880893 0.8059053 0.9015962 0.4201509 ## [6,] 1.2399410 -0.4187754 4.794175 -0.12410758 1.2376469 1.1981770 1.4965370 ``` --- ## Lets do this!
``` r ypreds = scenarios %*% t(betaDraws) yprobs = 1/(1+exp(-ypreds)) yprobs = t(yprobs) head(yprobs) ``` ``` ## [,1] [,2] ## [1,] 0.3531214 0.7082012 ## [2,] 0.3421180 0.6399494 ## [3,] 0.3027618 0.5242369 ## [4,] 0.4134056 0.6662922 ## [5,] 0.3631337 0.5841438 ## [6,] 0.4017885 0.6900084 ``` --- ## Lets do this! ``` r summary(yprobs) ``` ``` ## V1 V2 ## Min. :0.2067 Min. :0.3671 ## 1st Qu.:0.3074 1st Qu.:0.5771 ## Median :0.3365 Median :0.6204 ## Mean :0.3392 Mean :0.6184 ## 3rd Qu.:0.3694 3rd Qu.:0.6622 ## Max. :0.5538 Max. :0.7753 ``` --- ## Visualize the difference: Binary IV Say that we use the simulation based procedure to visualize the difference in probability of rebel support estimated by our model under a scenario of no transnational constituency and a scenario of transnational constituency. One way to do so is to use our simulation based procedure to generate a plot like the following: <img src="second_dens.png" width="600px" style="display: block; margin: auto;" /> --- ## Visualize the difference: Continuous IV What if we wanted to visualize predictions in probability of rebel support by a continuous independent variable like Log(GDP per capita)? We could use a plot like the below: <img src="gdp_eff.png" width="600px" style="display: block; margin: auto;" /> --- ## The "Average Case" Problem ### What we've been doing so far: - Set covariates to their **means** (or medians/modes) - Calculate effect for this "average case" - Report this as our estimate ### The problem: **The "average case" might not exist in reality!** Example: - Average age = 35 years - Average education = 14 years - Average income = $50,000 - Average gender = 0.52 (???) This "average person" is 52% female? That's not a real person! --- ## Why This Matters More for Logit/Probit ### In OLS, it doesn't matter: - Linear models have **constant** marginal effects - Effect at the mean = Mean of the effects - `\(\beta\)` is the same everywhere ### In logit/probit, it matters a lot: - Nonlinear models have **varying** marginal effects - Effect depends on where you are on the S-curve - Effect at the mean ≠ Mean of the effects! --- ## The Observed Value Approach = AME! ### Key clarification: **"Observed value approach" and "Average Marginal Effects (AME)" are the SAME THING** - **Hanmer & Kalkan (2013)** called it: "Observed value approach" - **Statisticians** call it: "Average Marginal Effects (AME)" - **Both mean**: Calculate effects using actual data values, then average --- ## The Observed Value Approach = AME! ### The insight: Instead of the effect for an **imaginary average person**, calculate the **average effect for real people** 1. **Keep all observations at their actual values** - Don't collapse to means/medians, use the real data distribution 2. **Calculate the marginal effect for each observation** - Change variable of interest for each person, compute difference 3. **Average across all observations** - This gives you AME = the observed value approach --- ## Observed Values in Practice ### Traditional approach (MEM): ``` 1. Set all X to means 2. Change treatment from 0→1 3. Calculate: Δp = p(Treatment=1) - p(Treatment=0) ``` This gives **Marginal Effects at Means (MEM)** ### Observed value approach (AME): ``` For each observation i: 1. Keep all Xi at observed values 2. Set Treatment=0, get p0i 3. Set Treatment=1, get p1i 4. Calculate: Δpi = p1i - p0i Average effect = mean(Δpi) ``` This gives **Average Marginal Effects (AME)** **Remember**: AME = Observed Value Approach (same thing, different names!) 
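 --- ## Observed values in code: a sketch

Before asking whether this matters, here is a minimal sketch of the observed value approach in R. It assumes the objects built on earlier slides (`dat2`, `ivs`, `logit1`, and the `betaDraws` matrix of simulated coefficients) are still in memory, and it computes the AME of `tk` along with a simulation-based interval:

``` r
# design matrices that keep every row at its observed values, varying only tk
X0 = data.matrix(cbind(1, dat2[,ivs])); X0[,'tk'] = 0
X1 = data.matrix(cbind(1, dat2[,ivs])); X1[,'tk'] = 1

# point estimate: average difference in predicted probabilities across rows
p0 = 1/(1+exp(-X0 %*% coef(logit1)))
p1 = 1/(1+exp(-X1 %*% coef(logit1)))
ame = mean(p1 - p0)

# estimation uncertainty: recompute the AME for each simulated beta draw
ameDraws = apply(betaDraws, 1, function(b){
  mean( 1/(1+exp(-X1 %*% b)) - 1/(1+exp(-X0 %*% b)) ) })

c(AME=ame, quantile(ameDraws, probs=c(.025, .975)))
```

The logic mirrors the scenario-based simulation from before; the only change is that the covariate profiles come from the observed data rather than from measures of central tendency.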
--- ## Is this important? ### Yes, particularly when: 1. **You have substantial heterogeneity** in your sample - Effects vary across subgroups - Nonlinearity is strong 2. **The "average case" is unrealistic** - Categorical variables (can't have average gender) - Combinations that don't exist 3. **You care about policy implications** - Real people, not statistical constructs - Distribution of effects matters **Bottom line**: Use AME (observed values) unless you have a specific reason not to --- ## Observed value approach - Operationally, this is not hard to do - Instead of creating scenarios we just use the observed data, and to test effects we just vary the variable of interest - Computationally, though, this gets annoying; we'll work through this together in R --- ## What about interactions - In the context of binomial models, every variable is curvilinear and interacted with every other variable automatically because of the **link function** - By including a link function, the model automatically specifies interactions on the natural scale of the linear predictor - Lets take a peek through the lens of some economists --- ## First: Linear world If we had something like: `\begin{eqnarray} E[ Y | X] = \beta_{0} + \beta_{1}x_{1} + (\beta_2 + \beta_{3}x_{1})x_{2} \nonumber \end{eqnarray}` What would be the marginal effect of `\(x_{1}\)`? `\(x_{2}\)`? What about how the value of `\(x_{2}\)` changes the effect of `\(x_{1}\)` (cross-partial derivative)? --- ## Interactions: Ai and Norton (2003) Say we have: `\begin{eqnarray} E(y = 1 | x_{1}, x_{2}) = F(\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{12}(x_{1} \times x_{2})) \nonumber \end{eqnarray}` where `\(F(.)\)` can be a logit, probit, or other inverse link function and `\(x_{1}\)` is a continuous variable. The marginal effect of `\(x_{1}\)` is: `\begin{eqnarray} \frac{d}{d x_1} P(y=1) = (\beta_1+\beta_{12} x_2) F'(\cdot) \nonumber \end{eqnarray}` - What you need to note is that the presence of a link function requires the use of the chain rule and therefore retains other terms on the right-hand side besides `\(\beta_{1}\)` --- ## Interactions: Ai and Norton (2003) - How does the marginal effect of `\(x_1\)` change as the value of `\(x_2\)` changes? - Rephrase: How does the value of `\(x_2\)` change the effect of `\(x_1\)` on `\(P(y=1)\)`? `\begin{eqnarray} \frac{d^2}{dx_1dx_2} P(y=1) &=& \frac{d}{dx_2}( (\beta_1+\beta_{12}x_2) F'(\cdot) ) \nonumber \\ \frac{d^2}{dx_1dx_2} P(y=1) &=& \beta_{12} F'(\cdot) +(\beta_1+\beta_{12}x_2)(\beta_2+\beta_{12}x_1) F''(\cdot) \nonumber \end{eqnarray}` --- ## Interactions: Ai and Norton (2003) `\begin{eqnarray} \frac{d^2}{dx_1dx_2} P(y=1) &=& \beta_{12} F'(\cdot) +(\beta_1+\beta_{12}x_2)(\beta_2+\beta_{12}x_1) F''(\cdot) \nonumber \end{eqnarray}` What can we learn about interactions from this? - Even if `\(\beta_{12}=0\)` the expression above for `\(\frac{d^2}{dx_1dx_2} P(y=1)\)` has a **nonzero** value. - The sign of `\(\beta_{12}\)` does **not necessarily** indicate the sign of the cross-partial effect. - The "significance" of `\(\beta_{12}\)` does **not** provide information about the "significance" of the "interaction effect". - The cross-partial effect must be evaluated for the specific **nonlinear function** in question. You have to evaluate it for the values of the variables that you are interested in. --- ## Illustrating implications of a product term This issue was further commented on by [Berry et al.
(2010)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2009.00429.x) <img src="berry_1.png" width="700px" style="display: block; margin: auto;" /> --- ## Illustrating implications of a product term - For `\(X_{1}=6\)` and `\(X_{2}=0\)`, the marginal effect of `\(X_{1}\)` on `\(Pr(Y)\)` is 0.105 in both figures <img src="berry_1.png" width="700px" style="display: block; margin: auto;" /> --- ## Illustrating implications of a product term - For `\(X_{1}=6\)` and `\(X_{2}=2\)`, the marginal effect of `\(X_{1}\)` on `\(Pr(Y)\)` decreases to 0.018 in the figure on the left, while in the figure on the right it decreases only to 0.096 <img src="berry_1.png" width="700px" style="display: block; margin: auto;" /> --- ## What to do with interactions in GLMs? - Marginal effect of `\(X_{1}\)` on `\(P(Y)\)` will vary with `\(X_{2}\)` - There will be some interaction between `\(X_{1}\)` and `\(X_{2}\)` even when the model does not include a product term + Again this is because computing marginal effects in the presence of the link function requires the chain rule, so terms beyond the product-term coefficient are retained in the expression - In general, always include a product term if you think there is a compelling theoretical or empirical reason to do so --- ## So say that we run an interaction ... Specifically, say that we add an interaction between transnational constituency and Log(GDP per capita), with the idea that transnational constituencies are more likely to support rebel groups when the country has low levels of economic development. If we run such a model, how do we evaluate whether it improved our ability to explain variation in the data? --- ## How to evaluate models? In-Sample Fit - Test the model's performance on the same data used to fit the model - Easy to do – you already have the data - Sadly, the default approach unless otherwise stated - Tends towards overconfidence and overfitting: + Treats models that predict quirks of the sample as good models of the population --- ## How to evaluate models? Out-of-Sample Forecast - Goal: Find underlying structure in your data - Process: + Partition data between training and test sets + Fit model to training set and predict test set + Compare to truth for average prediction and full distribution - Worry: If world changes, model may fail anyway - For example, see [Bowlsby et al. (2019)](https://www.cambridge.org/core/journals/british-journal-of-political-science/article/future-is-a-moving-target-predicting-political-instability/0028744BE1AFF83F879E7759D798D88A) --- ## How to evaluate models?
Cross-validation - Goal: See how well our model reproduces random partitions of our data in an out-of-sample context - Process: + Randomly select `\(k\)` observations as the "test set" + Evaluate with the rest of data as training data + Set aside another set of `\(k\)` observations + Evaluate and repeat + Report performance averaged over subsets --- ## Lets talk about evaluation metrics first Four generic ways to check MLE GOF - t-test / Wald Statistic - Likelihood Ratio Test - Bayesian Information Criterion (BIC) - Akaike Information Criterion (AIC) Some better ways to check MLE GOF for binary outcomes - Percent correctly predicted - Separation plots - Sensitivity and specificity / ROC plots - Precision and recall / PR plots --- ## T-test - Lets start with the silliest, t-tests - These are a general test of GOF applied in-sample - A simple test of whether a parameter is significantly different from zero is: `\begin{eqnarray} p &=& 1 - F_{t} ( \frac{\hat\beta}{\sqrt{Var(\hat\beta)}}, n-k ) \nonumber \end{eqnarray}` --- ## T-test Lets see this in practice: ``` r summary(logit1)$'coefficients' ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.803775 1.006749 0.798387 0.424646 ## log_rgdpch -0.330465 0.135740 -2.434547 0.014910 ## cinc 8.093218 4.533541 1.785187 0.074231 ## pol6 -0.318133 0.299638 -1.061725 0.288360 ## gs 1.026631 0.247624 4.145930 0.000034 ## tk 1.176287 0.260579 4.514122 0.000006 ## riv 0.941308 0.239506 3.930210 0.000085 ``` - What does a p value of 0.000006 mean? - A more general form of the t-test is the Wald test and can be applied to multiple parameters at once --- ## Likelihood ratios Likelihood ratio test of nested models `\begin{eqnarray} LR &=& -2 log \frac{\mathcal{L}(\mathcal{M}_{1})}{\mathcal{L}(\mathcal{M}_{2})} \nonumber \\ &=& 2( log \; \mathcal{L}(\mathcal{M}_{2}) - log \; \mathcal{L}(\mathcal{M}_{1}) ) \nonumber \\ LR &\sim& f_{\chi^{2}}(m) \nonumber \end{eqnarray}` where `\(m\)` is the number of restrictions placed on model 2 by model 1 (e.g., parameters held constant in model 1) --- ## Likelihood ratios - Lets compare these models ``` r summary(logit1)$'coefficients' ; summary(logit2)$'coefficients' ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.804 1.007 0.798 0.425 ## log_rgdpch -0.330 0.136 -2.435 0.015 ## cinc 8.093 4.534 1.785 0.074 ## pol6 -0.318 0.300 -1.062 0.288 ## gs 1.027 0.248 4.146 0.000 ## tk 1.176 0.261 4.514 0.000 ## riv 0.941 0.240 3.930 0.000 ``` ``` ## Estimate Std. 
Error z value Pr(>|z|) ## (Intercept) 0.529 1.277 0.414 0.679 ## log_rgdpch -0.294 0.170 -1.732 0.083 ## cinc 8.319 4.570 1.820 0.069 ## pol6 -0.322 0.300 -1.076 0.282 ## gs 1.033 0.249 4.157 0.000 ## tk 1.900 2.088 0.910 0.363 ## riv 0.948 0.241 3.941 0.000 ## tk_gdp -0.093 0.266 -0.350 0.727 ``` --- ## Likelihood ratios - `logit1` is nested within `logit2` because the latter has all the same parameters and the interaction between `tk` and `log_rgdpch` ``` r library(lmtest) lrtest(logit1, logit2) ``` ``` ## Likelihood ratio test ## ## Model 1: supp1 ~ log_rgdpch + cinc + pol6 + gs + tk + riv ## Model 2: supp1 ~ log_rgdpch + cinc + pol6 + gs + tk + riv + tk_gdp ## #Df LogLik Df Chisq Pr(>Chisq) ## 1 7 -217.36 ## 2 8 -217.30 1 0.1226 0.7262 ``` --- ## Likelihood ratios - You can also use likelihood ratio tests to compare against a null model ``` r lrtest(logit1) ``` ``` ## Likelihood ratio test ## ## Model 1: supp1 ~ log_rgdpch + cinc + pol6 + gs + tk + riv ## Model 2: supp1 ~ 1 ## #Df LogLik Df Chisq Pr(>Chisq) ## 1 7 -217.36 ## 2 1 -246.61 -6 58.494 9.097e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Information Criteria The Bayesian Information Criterion (BIC) is a type of penalized likelihood ratio test `\begin{eqnarray} BIC_{\mathcal{M}_{1}} &=& log \; n \times p_{1} -2 log \mathcal{L}(\mathcal{M}_{1}) \nonumber \\ BIC_{\mathcal{M}_{2}} &=& log \; n \times p_{2} -2 log \mathcal{L}(\mathcal{M}_{2}) \nonumber \\ \end{eqnarray}` - where `\(p\)` are the number of parameters and `\(n\)` the number of observations - Lower values of BIC are better - We refer to this as a penalized LRT because there is a penalty for extra observations and parameters --- ## Information Criteria To calculate in R for a glm: ``` r BIC(logit1) ``` ``` ## [1] 475.9509 ``` ``` r BIC(logit2) ``` ``` ## [1] 481.7171 ``` Keep in mind like LRT, BIC can be only used to compare models that have been run on the same underlying sample --- ## Information Criteria The Akaike Information Criterion (AIC) is also a type of penalized likelihood ratio test `\begin{eqnarray} AIC_{\mathcal{M}_{1}} &=& 2 p_{1} -2 log \mathcal{L}(\mathcal{M}_{1}) \nonumber \\ AIC_{\mathcal{M}_{2}} &=& 2 p_{2} -2 log \mathcal{L}(\mathcal{M}_{2}) \nonumber \\ \end{eqnarray}` - where `\(p\)` again is the number of parameters - Lower values of AIC are again better --- ## Information Criteria To calculate in R for glm: ``` r AIC(logit1) ``` ``` ## [1] 448.7287 ``` ``` r AIC(logit2) ``` ``` ## [1] 450.6061 ``` --- ## Percent correctly predicted <img src="pcp.png" width="700px" style="display: block; margin: auto;" /> --- ## Percent correctly predicted Calculate the percent correctly predicted (PCP) for `logit1` and `logit2`, you should get the following for the two models, respectively: ``` ## [1] 0.6842105 ``` ``` ## [1] 0.6731302 ``` - Also what would you say about a model that had a PCP of less than `\(\pi^{\ast} = 0.50\)`? --- ## Percent correctly predicted This approach of gauging performance might be intuitive but it is extremely problematic - Do we want a metric that will distinguish between cases of .51 and .49? - What about .51 and .99? - Why use a strict threshold of .5 at all? 
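 --- ## Computing PCP: a sketch

For reference, here is one way the PCP values quoted on the earlier slide can be computed. This is a minimal sketch that assumes `logit1`, `logit2`, and `dat2` from the earlier slides are in memory and that both models were fit on the same rows:

``` r
# percent correctly predicted at a given threshold (default .5)
pcp = function(mod, y, cut=.5){
  yhat = as.numeric( predict(mod, type='response') > cut )
  mean( yhat == y ) }

pcp(logit1, dat2$supp1)
pcp(logit2, dat2$supp1)
```

Changing `cut` makes the arbitrariness of the .5 threshold easy to see, which is exactly the concern raised above and the motivation for the separation plots and ROC curves that follow.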
--- ## Separation plots [Greenhill et al.](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-5907.2011.00525.x) suggest a simple graphical approach to assess GOF for binary models: ``` r library(separationplot) ``` ``` ## Loading required package: RColorBrewer ``` ``` ## Loading required package: Hmisc ``` ``` ## ## Attaching package: 'Hmisc' ``` ``` ## The following objects are masked from 'package:base': ## ## format.pval, units ``` ``` ## Loading required package: MASS ``` ``` ## Loading required package: foreign ``` ``` r separationplot( pred=predict(logit1, type='response'), actual=dat2$supp1 ) separationplot( pred=predict(logit2, type='response'), actual=dat2$supp1 ) ``` <img src="logit1sep.png" width="700px" style="display: block; margin: auto;" /> <img src="logit2sep.png" width="700px" style="display: block; margin: auto;" /> --- ## Brief aside on perfect separation Suppose the separation plot looked like this <img src="perfect_sep.png" width="700px" style="display: block; margin: auto;" /> - What kind of model would produce perfect separation? - Here we're using a green line to trace the probability of each case; what is the slope of this line near the separation between tan and red? - If there is a binary covariate that is 0 for every `\(y=0\)` and 1 for every `\(y=1\)`, what is the logit coefficient for that covariate? - Could a logit model fit too well? --- ## Back to our models <img src="logit1sep.png" width="700px" style="display: block; margin: auto;" /> <img src="logit2sep.png" width="700px" style="display: block; margin: auto;" /> - When we were working with percent correctly predicted, we noted that `\(\pi^{\ast} = .5\)` was an arbitrary threshold - Can we conceptualize different thresholds `\(\pi^{\ast}\)` that would classify the data? --- ## Sensitivity and Specificity Let `\(\pi^{\ast}\)` be a threshold at some level in [0,1], above which we suppose `\(\hat y = 1\)` Given a threshold `\(\pi^{\ast}\)`, we can compute the model's sensitivity and specificity: <img src="two_two.png" width="250px" style="display: block; margin: auto;" /> - **Sensitivity** (true positive rate): + is the fraction of true positives the model tends to identify: `\(P(\hat y \geq \pi^{\ast} | y = 1)\)` + in terms of the 2 x 2, `\(\frac{TP}{TP + FN}\)` - **Specificity** (true negative rate): + is the fraction of true negatives the model tends to identify: `\(P(\hat y < \pi^{\ast} | y = 0)\)` + in terms of the 2 x 2, `\(\frac{TN}{TN + FP}\)` --- ## Sensitivity and Specificity We can compute in-sample specificity & sensitivity for any `\(\pi^{\ast}\)` & model - `\(\pi^{\ast}=.5\)` + logit1: * Sensitivity = .55 * Specificity = .78 + logit2: * Sensitivity = .53 * Specificity = .78 - `\(\pi^{\ast}=.15\)` + logit1: * Sensitivity = .98 * Specificity = .10 + logit2: * Sensitivity = .98 * Specificity = .10 What changes when we move `\(\pi^{\ast}\)`? How do you think it would change if we set `\(\pi^{\ast}\)` to .95? --- ## Receiver Operating Characteristic (ROC) Curves - Unless we have a `\(\pi^{\ast}\)` of particular interest, we're more interested in the range of sensitivity and specificity - We'd also like to understand the tradeoff better and see the threshold that maximizes their sum - This is where ROC curves come into play!
- These were developed in WWII to evaluate radar operators - ROC curves show sensitivity & specificity under all thresholds from 0 to 1 --- ## ROC curves One change we make when using ROC curves is that we plot sensitivity against 1-specificity `\begin{eqnarray} Specificity &=& \frac{TN}{TN + FP} \nonumber \\ 1 - Specificity &=& 1 - \frac{TN}{TN + FP} \nonumber \\ 1 - Specificity &=& \frac{TN + FP - TN}{TN + FP} \nonumber \\ 1 - Specificity &=& \frac{FP}{TN + FP} \nonumber \end{eqnarray}` - While specificity gives us the true negative rate, 1-specificity gives us the false positive rate - So together sensitivity gives us the true positive rate and 1-specificity gives us the false positive rate - Thus, an ROC curve has a nice interpretation in that it gives us an indication of how well the probabilities from the positive class are separated from the negative class --- ## Producing a ROC curve is easy! ``` r roc = function(threshold, predProbs, dv){ # Set up output matrix pr=matrix(NA, ncol=3, nrow=length(threshold), dimnames=list(NULL, c('Threshold', 'FPR', 'TPR') ) ) # Loop through thresholds for(ii in 1:length(threshold)){ predRoc = as.numeric( predProbs > threshold[ii] ) FPR = sum( (predRoc==1)*(dv==0) ) / sum(dv==0) TPR = sum( (predRoc==1)*(dv==1) ) / sum(dv==1) pr[ii,1] = threshold[ii] pr[ii,2] = FPR pr[ii,3] = TPR } # Return output return( data.frame( pr ) ) } ``` --- ## Plugging in our data ``` r rocCurve = roc( seq(0, 1, by=.001), dat2$logit1Probs, dat2$supp1 ) ggplot(rocCurve, aes(x=FPR, y=TPR)) + geom_line() + geom_abline(intercept=0, slope=1) ``` --- ## Plugging in our data <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-73-1.png" style="display: block; margin: auto;" /> --- ## Lots of handy packages ``` r library(PRROC) roc<-roc.curve( scores.class0 = dat2$logit1Probs[dat2$supp1==1], scores.class1 = dat2$logit1Probs[dat2$supp1==0], curve=TRUE ) plot(roc) ``` --- ## Lots of handy packages ``` ## Loading required package: rlang ``` ``` ## ## Attaching package: 'rlang' ``` ``` ## The following object is masked from 'package:magrittr': ## ## set_names ``` <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-75-1.png" style="display: block; margin: auto;" /> --- ## Precision-recall curves - Frequently within political science, at least in conflict studies, we are studying events that are rare - In cases where we have sparse/rare dependent variables, ROC curves are a misleading indication of model performance - This occurs because in sparse data it is easy to predict negatives - Interstate conflict for example is an extremely rare event, if you just had a model that predicted nothing but zero, by the standard of ROC plots you would have a great classifier --- ## Precision-recall curves - Given that it's easy to predict zeros in sparse data, using the false positive rate is not that helpful - One option is to compare false positives to the overall number of positive predictions made by a model for a given threshold ... 
this is **precision** - Precision gives you a sense of how believable a model's predictions are, i.e., if the model says there was an event, what are the actual chances this is true? `\begin{eqnarray} Precision &=& \frac{TP}{TP + FP} \end{eqnarray}` - Also note that recall in a PR curve is the same thing as sensitivity - The process of constructing one of these curves is operationally the same as for the ROC curve; we just use precision instead of 1-specificity --- ## Precision-recall curves ``` r pr<-pr.curve( scores.class0 = dat2$logit1Probs[dat2$supp1==1], scores.class1 = dat2$logit1Probs[dat2$supp1==0], curve=TRUE ) plot(pr) ``` --- ## Precision-recall curves <img src="binaryModels_MLE_files/figure-html/unnamed-chunk-77-1.png" style="display: block; margin: auto;" /> --- ## Paradigms to test GOF Now that we've identified some metrics to assess binary outcomes, we can go back to strategies for empirically assessing our model: - In-sample analysis - Out-of-sample analysis - Cross-validation --- ## One last note specific to categorical models ... separation - Separation refers to the presence of one or more covariates that perfectly predict the outcome of interest - For example, say we are modeling the final passage vote for the Affordable Care Act + Our dv is `\(Y_{i}\)` = {0, 1}, where a 1 indicates voting for the ACA + `\(X_{i,1}\)` is a binary indicator for Democrat + `\(X_{i,2}\)` is a measure of congressperson ideology (e.g., DW-Nominate) - Did any Republicans vote for the ACA ... no ... just ask frustrated Obama <img src="angry_obama.jpeg" width="400px" style="display: block; margin: auto;" /> --- ## Consequence of separation - Given that not a single Republican voted, we know that: `\(Pr(Y_{i}=1 | X_{i,1}=0) = 0\)` - If we try to model this: ``` r mod = glm( ACA ~ Democrat + Ideology, data=data, family=binomial(link='logit') ) ``` ``` ## Warning: glm.fit: algorithm did not converge ``` ``` r round(summary(mod)$'coefficients', 3) ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -26.566 32020.55 -0.001 0.999 ## Democrat 53.132 51908.91 0.001 0.999 ## Ideology 0.000 25959.54 0.000 1.000 ``` --- ## So what to do? Bring back Bayes (a bit)! - Can we just add more observations ... typically not, or we would have been using them already - Penalized (Prior)-Logistic Regression + Separation: causes coefficients to diverge + Penalty (prior): forces coefficients towards zero - Step 1: Standardize inputs (Gelman et al) + Binary variables: mean 0, differ by 1 * Democrats: (30\%), so new values (0.7, -0.3) + Other variables: mean 0, sd 0.5 --- ## So what to do? Bring back Bayes (a bit)! - Step 2: Penalize Likelihood + Firth's penalty (Zorn 2005): `\begin{eqnarray} L(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y} ) & = & \prod_{i=1}^{N} \pi_{i}^{Y_{i}} (1- \pi_{i})^{1-Y_{i}} |I(\boldsymbol{\beta})|^{1/2} \nonumber \end{eqnarray}` where: `\begin{eqnarray} \pi_{i} & = & \frac{1}{1 + \exp(-\boldsymbol{X}_{i}^{'}\boldsymbol{\beta})} \nonumber \\ |I(\boldsymbol{\beta})| & = & \text{ Determinant of Fisher's information at } \boldsymbol{\beta} \nonumber \\ I(\boldsymbol{\beta}) & = & \boldsymbol{X}^{'} \boldsymbol{W} \boldsymbol{X} \nonumber \end{eqnarray}` <img src="meat.png" width="500px" style="display: block; margin: auto;" /> --- ## So what to do? Bring back Bayes (a bit)!
`\begin{eqnarray} L(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y} ) & = & \prod_{i=1}^{N} \pi_{i}^{Y_{i}} (1- \pi_{i})^{1-Y_{i}}|I(\boldsymbol{\beta})|^{1/2} \nonumber \\ l(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y} ) & = & \sum_{i=1}^{N} \left[ Y_{i} \log \pi_{i} + (1-Y_{i} ) \log (1- \pi_{i}) \right] + \frac{1}{2}\log(|I(\boldsymbol{\beta})|) \nonumber \end{eqnarray}` --- ## How to estimate? ``` r jeffPrior <- function(params, X, Y){ beta <- params y.tilde <- X%*%beta y.prob <- plogis(y.tilde) temp <- matrix(0, nrow = length(Y), ncol=length(Y)) part1 <- Y%*%log(y.prob) + (1-Y)%*%log(1- y.prob) diag(temp)<- y.prob*(1-y.prob) part2 <- 0.5*log(det(t(X)%*%temp%*%X)) out <- part1 + part2 return(out) } firth <- optim( rnorm(3), # random starting vals jeffPrior, method = 'BFGS', control=list(fnscale=-1), hessian=T, X = data.matrix(cbind(1, data[,c('Democrat','Ideology')])), Y = data[,'ACA'] ) ``` --- ## Lets see if this helped ``` r round(cbind(firth$par, sqrt(diag(-solve(firth$hessian)))), 3) ``` ``` ## [,1] [,2] ## [1,] -7.199 3.194 ## [2,] 14.385 5.606 ## [3,] -1.355 1.724 ``` ``` r round(summary(mod)$'coefficients', 3) ``` ``` ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -26.566 32020.55 -0.001 0.999 ## Democrat 53.132 51908.91 0.001 0.999 ## Ideology 0.000 25959.54 0.000 1.000 ``` Seems like it? --- ## Any other popular solutions? - From a Bayesian perspective, Gelman notes that the application of a Jeffreys prior by Firth is strange, since it can't be interpreted as actual prior information because there is no interpretable scale and it can depend on the data in complex ways - Gelman argues for the use of a weakly informative Cauchy(0, 2.5) prior on coefficients + Like the Jeffreys prior this will bound estimates away from positive and negative infinity, but it can also be interpreted as actual prior information - However, Rainey (2016) notes that this approach can have problems as well + Basic insight: since we are entering the world of priors and Bayesian analysis with these solutions, we need to be careful with how we parameterize the prior information we are injecting into the model - A short sketch of off-the-shelf implementations of both approaches appears at the end of these slides --- ## So ... what you doing for your project? <img src="picard.jpeg" width="400px" style="display: block; margin: auto;" />
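 --- ## Off-the-shelf penalized logits (sketch)

Rather than hand-rolling the penalized likelihood with `optim`, both ideas above have package implementations. This is a sketch under the assumption that the ACA example data frame `data` from the separation slides is available and that the `logistf` and `arm` packages are installed:

``` r
# Firth's penalized likelihood logit
library(logistf)
firthMod = logistf(ACA ~ Democrat + Ideology, data=data)
summary(firthMod)

# weakly informative Cauchy(0, 2.5) prior on the coefficients (Gelman et al.)
library(arm)
cauchyMod = bayesglm(ACA ~ Democrat + Ideology, data=data,
                     family=binomial(link='logit'),
                     prior.scale=2.5, prior.df=1)
summary(cauchyMod)
```

Either route should bound the separated coefficient away from infinity, much like the hand-coded Firth penalty above.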