PLS 900/803 Bayesian Analysis

---

``` r
library(stringr)
library(magrittr)
library(rvest)
```

---

## Readings to go with this lecture

$$
  \require{cancel}
  \DeclareMathOperator*{\argmin}{arg\,min}
$$

$$
  \DeclareMathOperator*{\argmax}{arg\,max}
$$

Two books that I highly recommend, McElreath's Statistical Rethinking (linked below) provides a great and friendly introduction to Bayesian analysis. A much more comprehensive book comes is Bayesian Data Analysis from Gelman et al (also linked below). For those of you interested in Bayes, I recommend delving into each. For our purposes the below is all that goes with this lecture:

- [Statistical Rethinking Ch 4 and 5](https://github.com/Booleans/statistical-rethinking/blob/master/Statistical%20Rethinking%202nd%20Edition.pdf)

- Advanced:

- [Bayesian Data Analysis Ch 10 and 11](http://www.stat.columbia.edu/~gelman/book/BDA3.pdf)
  - [Conceptual Introduction to Hamiltonian Monte Carlo](https://arxiv.org/pdf/1701.02434.pdf)

- Other:

- [High level overview](https://www.annualreviews.org/doi/abs/10.1146/annurev.polisci.7.012003.104706)
  - [Bayes and Qual Approaches I](https://www.cambridge.org/core/journals/political-analysis/article/abs/simple-bayesian-inference-for-qualitative-political-research/CB75ACD52AF02FE6E83EC79B6BD39986)
  - [Bayes and Qual Approaches II](https://www.cambridge.org/core/journals/american-political-science-review/article/abs/mixing-methods-a-bayesian-approach/BB1DFC2FDA3D7F2224F3341042FEA5F4)

---

## Quick MLE Review (You Already Know This!)

**Maximum Likelihood Estimation**:

- Find parameters `$\theta$` that maximize `$P(\text{data} | \theta)$`
- Point estimate: `$\hat{\theta}_{MLE} = \argmax_\theta L(\theta | \text{data})$`
- Uncertainty via standard errors from Fisher Information

---

## The Bayesian Alternative

**Different Philosophy**: Treat parameters as probability distributions, not fixed values

`$$P(\theta | \text{data}) = \frac{P(\text{data} | \theta) \times P(\theta)}{P(\text{data})}$$`

**Key Advantages for Modern Political Science**:

1. **Uncertainty quantification**: Full probability distributions for all parameters
2. **Complex models**: Natural framework for hierarchical/multilevel structures
3. **Small samples**: More stable estimates when data is limited
4. **Direct probability statements**: Can say "75% chance the effect is positive"

---

## Why Bayesian Methods Matter for Your Research

**Two Main Advantages**:

**1. Can incorporate prior information (when you have it)**:
- Historical election patterns, expert surveys, theoretical bounds
- But honestly? Most of us use uninformative priors anyway!
- Weakly informative priors mainly for regularization

**2. Better framework for complex models (the real advantage)**:
- **Multilevel models**: Automatic partial pooling for nested data
- **Missing data**: Principled uncertainty propagation
- **High dimensions**: When MLE won't converge
- **Latent variables**: Ideal points, democracy scores
- **Small samples**: Regularization prevents overfitting

---

## Bayes Rule: The Foundation

From conditional probability:
`$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A) \times P(A)}{P(B)}$$`

Applied to inference:
`$$\underbrace{P(\theta | \text{data})}_{\text{posterior}} = \frac{\overbrace{P(\text{data} | \theta)}^{\text{likelihood}} \times \overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(\text{data})}_{\text{evidence}}}$$`

**In words**: Updated beliefs = (How well data fits × Prior beliefs) / Normalizing constant

---

## Why Do We Multiply Likelihood and Prior?

**The Deep Question**: Why multiply `$P(\text{data}|\theta) \times P(\theta)$`?

**Answer**: It's the *calculus* of belief updating!

- **Prior** `$P(\theta)$`: How plausible is each parameter value *before* seeing data?
- **Likelihood** `$P(\text{data}|\theta)$`: How plausible is the data *if* this parameter is true?
- **Multiplication**: Combines these two sources of information

**Intuitive Example**:

- Prior: "Incumbent usually wins by 5-10%" → Most weight on 5-10% values
- Data: "Poll shows 7% lead" → Data most consistent with ~7%
- Posterior: Combines both → Centers around 7% but pulled toward prior range

**Key**: We're not multiplying probabilities! We're multiplying **probability densities** to weight different parameter values.

---

## [Aside] Clarifying Probability vs Density

**Quick clarification before we continue**: When we multiply in Bayes' theorem in this context, we're NOT multiplying probabilities!

**Think of it this way:**

**Discrete Events (Probabilities)**:
- Rolling a die: P(getting 3) = 1/6 ≈ 0.17
- Must be ≤ 1 because only 1 of 6 outcomes

**Continuous Values (Densities)**:
- Height of adults: Most people ~5'9" (very peaked distribution)
- Density at 5'9" might = 5.2 (NOT a probability!)
- It's the "height" of the curve, not an area

---

## [Aside] A Simple Height Example

**Why can density > 1? Because it's measuring concentration, not probability!**

Imagine measuring people's heights to nearest inch vs nearest 0.01 inch:

**Nearest inch** (wider bins):
- 5'9" bucket: Contains 20% of people → Bar height = 0.20

**Nearest 0.01 inch** (narrower bins):
- 5'9.00" bucket: Contains 2% of people
- But the bin is 100x narrower!
- So density (height per unit width) = 2.0

**Same total probability (area), different density (height)!**

The narrower your measurement precision, the higher the density can get at the peak.

---

## [Aside] What This Means for Bayes

**In our Bayesian multiplication:**

`$$\text{Posterior density} \propto \text{Likelihood} \times \text{Prior density}$$`

**We're multiplying "plausibility scores", not probabilities:**

- **Prior density = 3.0 at θ = 0.5**: "This parameter value is quite plausible a priori"
- **Likelihood = 2.0 at θ = 0.5**: "Data fits well if parameter = 0.5"
- **Product = 6.0**: "Combined plausibility score" (will be normalized later)

**Think of it like confidence from two experts:**
- Expert 1 (Prior): "I'm 3x more confident in θ = 0.5 than θ = 0.2"
- Expert 2 (Data): "I'm 2x more confident in θ = 0.5 than θ = 0.2"
- Combined: Their joint confidence is 6x stronger for θ = 0.5
- Later: Scale everything so total confidence = 100%

**The key insight**: We're combining evidence, not multiplying probabilities!

---

## Understanding "Proportional To"

`$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$`

**What does** `$\propto$` **mean?**

The true formula:
`$$P(\theta|\text{data}) = \frac{P(\text{data}|\theta) \times P(\theta)}{P(\text{data})}$$`

But `$P(\text{data})$` is just a number that makes the posterior integrate to 1!

**Restaurant Analogy**:

- You have preferences for restaurants (prior)
- Your friend has preferences (likelihood)
- Multiply preferences → Combined preferences (shape matters!)
- Divide by total → Just rescales to sum to 100% (shape unchanged!)

**For MCMC**: We only need relative probabilities, not absolute ones. If Parameter A is twice as likely as Parameter B, that relationship stays the same whether we divide by `$P(\text{data})$` or not!

---

## The Posterior as a Weighted Compromise

**The posterior is ALWAYS between the prior and likelihood!**

``` r
# Conceptual illustration (not real code)
Posterior = w1 * Prior + w2 * Likelihood
# where weights depend on relative "strength" (precision)
```

**Who wins the tug-of-war?**

- **Lots of data** → Likelihood dominates (data speaks loudly)
- **Little data** → Prior dominates (data whispers)
- **Precise prior** → Prior pulls harder
- **Vague prior** → Prior barely pulls

**Why does more data overwhelm the prior?**

Likelihoods multiply across observations:

- **n=10**: Posterior ∝ Prior × (Likelihood₁ × Likelihood₂ × ... × Likelihood₁₀)
- **n=1000**: Posterior ∝ Prior × (1000 likelihoods multiplied together!)

---

## What is MCMC? The Big Picture

**The Problem**: We need the posterior `$P(\theta|\text{data})$` but can't calculate it directly!

**Why can't we calculate it?**

- Need `$P(\text{data}) = \int P(\text{data}|\theta) P(\theta) d\theta$`
- This integral is often impossible to solve (especially in high dimensions)
- Example: 50 parameters → 50-dimensional integral! 🤯

**The MCMC Solution**: Don't calculate, **simulate!**

---

## What is MCMC? The Big Picture

**Markov Chain Monte Carlo (MCMC)**:

- **Markov Chain**: A sequence where each step depends only on the current position
- **Monte Carlo**: Random sampling to approximate something
- **Together**: Generate samples from the posterior by "wandering" through parameter space

**The Magic**:

- Instead of finding the whole posterior distribution...
- Generate thousands of samples FROM that distribution
- Use samples to approximate any quantity we want!

**Think of it like**: Exploring a dark mountain (posterior) with a flashlight, keeping track of where you spend the most time (high probability regions)

---

## How MCMC Uses Likelihood × Prior

**The Key Connection**: MCMC needs to evaluate `$P(\theta|\text{data})$` at different `$\theta$` values

**What MCMC actually does**:

1. Proposes a parameter value `$\theta_{\text{new}}$`
2. Calculates: `prior(θ_new) × likelihood(data|θ_new)`
3. Compares to current value: `prior(θ_current) × likelihood(data|θ_current)`
4. Decides whether to move based on the ratio

**We never need** `$P(\text{data})$` **because**:
`$$\frac{P(\theta_{\text{new}}|\text{data})}{P(\theta_{\text{current}}|\text{data})} = \frac{\text{Prior}_{\text{new}} \times \text{Like}_{\text{new}}}{\text{Prior}_{\text{current}} \times \text{Like}_{\text{current}}}$$`

The denominators cancel! MCMC cleverly avoids the hard integral.

---

## Working on the Log Scale (What Actually Happens)

**Practical Reality**: We work with log probabilities to avoid numerical problems

`$$\log P(\theta|\text{data}) = \log P(\text{data}|\theta) + \log P(\theta) - \log P(\text{data})$$`

**Why logs?**

- Probabilities get tiny fast: `$0.01^{100} = 10^{-200}$` (computer says 0!)
- Log probabilities stay manageable: `$100 \times \log(0.01) = -460.5$` ✓

**In Stan/brms**:

```stan
target += normal_lpdf(y | mu, sigma);  // log-likelihood
target += normal_lpdf(mu | 0, 10);     // log-prior
// Adding logs = multiplying probabilities!
```

**You don't write this**, but understanding it helps interpret errors!

---

## Frequentist vs Bayesian

| Aspect | Frequentist | Bayesian |
|--------|------------|----------|
| **What we care about** | Long-run properties | Actual probabilities |
| **Parameters** | Fixed but unknown | Uncertain quantities |
| **Can we say "75% chance the policy will pass"?** | No! | Yes! |
| **Use prior research?** | No formal mechanism | Yes, via priors |
| **Small samples** | Often unreliable | Can borrow strength |
| **Hierarchical data** | Complicated | Natural framework |

**The key difference for your research**: Bayesian lets you make probability statements about your hypotheses!

---

## When to Use Which Approach?

**Common Research Designs**:

| Design | Frequentist | Bayesian | Why? |
|--------|------------|----------|------|
| RCT/Experiment | ✓ | | Random assignment handles confounding |
| Observational with many units | ✓ | | Large N, weak priors won't help |
| Rare events | | ✓ | Priors stabilize estimates |
| Multilevel/Panel | | ✓ | Natural hierarchy, partial pooling |
| Small-n comparative | | ✓ | Borrow strength across cases |

---

## The Bayesian Framework

**Core Components**:

`$$\underbrace{f(\theta | \mathbf{Y})}_{\text{posterior}} = \frac{\overbrace{f(\mathbf{Y} | \theta)}^{\text{likelihood}} \times \overbrace{f(\theta)}^{\text{prior}}}{\underbrace{f(\mathbf{Y})}_{\text{evidence}}}$$`

- **Prior** `$f(\theta)$`: Initial distribution (often uninformative!)
- **Likelihood** `$f(\mathbf{Y}|\theta)$`: Same as in MLE!
- **Posterior** `$f(\theta|\mathbf{Y})$`: Full distribution of parameters given data
- **Evidence** `$f(\mathbf{Y})$`: Normalizing constant (MCMC avoids computing this!)

**Key shift**: Parameters are **probability distributions**, not point estimates!

---

## Bayesian vs MLE for Regression

**Same Goal**: Find parameters that explain the data

| Aspect | MLE | Bayesian |
|--------|-----|----------|
| **Output** | Point estimate + SE | Full posterior distribution |
| **Optimization** | Find maximum of likelihood | Sample from posterior |
| **Uncertainty** | Asymptotic approximation | Exact (given model) |

---

## What Makes Bayesian Different?

**Philosophical Shift**:

- **Frequentist**: "If we repeated this election 1000 times..."
  - But we don't repeat elections!
  - Parameters are fixed, data varies

- **Bayesian**: "Given what we observed, we believe..."
  - Matches how we actually think
  - Parameters are uncertain, data is fixed (once observed)

**Practical Implications**:

- Can answer: "What's the probability the incumbent wins?"
- Can incorporate: Previous elections, polls, theory
- Can handle: Small samples, rare events, missing data

---

## Simple Example: Forecasting an Election

**Scenario**: Predicting if incumbent wins (probability `$\pi$`)

**Frequentist approach**:
- Poll 500 voters, 270 support incumbent
- `$\hat{\pi} = 270/500 = 0.54$`
- 95% CI: [0.50, 0.58]
- **Interpretation**: "If we repeated this poll many times..."
- **Cannot say**: "82% chance incumbent wins" ❌

**Bayesian approach (even with uninformative prior!)**:
- **Prior**: Uniform (no prior preference)
- **Data**: Same poll (270/500)
- **Posterior**: Beta distribution centered at 0.54
- **Can say**: "There's an 82% probability the incumbent wins" ✓

**Key difference**: Direct probability statements about the parameter, not about repeated sampling!

---

## Simple Application: County Vote Share Estimation

Let's work through a complete Bayesian analysis with real data.

**The Problem**: Estimating Democratic vote share with limited data
- Some counties have very few votes
- How uncertain should we be?
- Simple Beta-Binomial model before regression

**Our Data**: 2018 U.S. House elections
- Focus on Allegan County, MI District 2
- Only 945 total votes (very small sample!)
- 388 Democratic votes

---

## Setting Up the Model

**Beta-Binomial Model** for vote share:

`$$Y_i \sim \text{Binomial}(n_i, \pi_i)$$`
`$$\pi_i \sim \text{Beta}(\alpha_0, \beta_0)$$`

Where:
- `$Y_i$` = number of Democratic votes
- `$n_i$` = total votes
- `$\pi_i$` = unknown true Democratic share
- `$\alpha_0, \beta_0$` = prior parameters

**Why this model?**
- Natural for proportions (bounded 0-1)
- Beta is conjugate to Binomial
- We get a closed-form posterior!

---

## The Magic of Conjugacy

**Conjugacy** means prior and posterior are same family!

Starting with:
- **Likelihood**: `$Y_i \sim \text{Binomial}(n_i, \pi_i)$`
- **Prior**: `$\pi_i \sim \text{Beta}(\alpha_0, \beta_0)$`

Apply Bayes' rule:
`$$f(\pi_i | Y_i) \propto f(Y_i | \pi_i) \times f(\pi_i)$$`

`$$\propto \pi_i^{Y_i}(1-\pi_i)^{n_i-Y_i} \times \pi_i^{\alpha_0-1}(1-\pi_i)^{\beta_0-1}$$`

Combining exponents:
`$$\propto \pi_i^{Y_i + \alpha_0 - 1}(1-\pi_i)^{n_i - Y_i + \beta_0 - 1}$$`

This is a Beta distribution!

**Posterior**: `$\pi_i | Y_i \sim \text{Beta}(\alpha_0 + Y_i, \beta_0 + n_i - Y_i)$`

---

## Interpreting the Conjugate Update

The posterior Beta(`$\alpha_0 + Y_i$`, `$\beta_0 + n_i - Y_i$`) shows:

**Prior "pseudo-counts" + Observed counts**:
- Prior Democratic: `$\alpha_0$` → Post: `$\alpha_0 + Y_i$`
- Prior Republican: `$\beta_0$` → Post: `$\beta_0 + (n_i - Y_i)$`

**Posterior mean**:
`$$E[\pi_i | Y_i] = \frac{\alpha_0 + Y_i}{\alpha_0 + \beta_0 + n_i}$$`

This is a weighted average:
- Weight on prior: `$\frac{\alpha_0 + \beta_0}{\alpha_0 + \beta_0 + n_i}$`
- Weight on data: `$\frac{n_i}{\alpha_0 + \beta_0 + n_i}$`

As `$n_i$` grows, data dominates!

---

## Choosing a Prior for Vote Share

For proportions, we use the Beta distribution:

`$$\pi_i \sim \text{Beta}(\alpha_0, \beta_0)$$`

**Intuition**: Think of `$\alpha_0$` and `$\beta_0$` as "pseudo-counts"
- `$\alpha_0$` = prior Democratic votes
- `$\beta_0$` = prior Republican votes
- Prior mean = `$\frac{\alpha_0}{\alpha_0 + \beta_0}$`

**Three common choices**:

1. **Uninformative**: Beta(1, 1) = Uniform
   - "I know nothing"

2. **Weakly informative**: Beta(2, 2)
   - "Probably somewhere in the middle"

3. **Informative**: Beta(30, 70)
   - "Based on past elections, ~30% Democratic"

---

## Working with Real Data

``` r
elections <- read_csv("us-house-wide.csv")

# Look at Allegan County, MI
allegan <- elections %>% filter(county == "Allegan")
print(allegan)
```

```
## # A tibble: 2 × 10
##   state county  fipscode fipscode2 office district total.votes   dem   rep other
##   <chr> <chr>   <chr>    <chr>     <chr>     <dbl>       <dbl> <dbl> <dbl> <dbl>
## 1 MI    Allegan 26005    26005000… US Ho…        2         945   426   506    13
## 2 MI    Allegan 26005    26005000… US Ho…        6       48295 17654 28257  2384
```

Key observation: MI District 2 has only 945 votes from Allegan County!

---

## Bayesian Analysis with Uninformative Prior

Let's apply our conjugate formula to Allegan County (MI-2):

**Data**: 388 Democratic votes out of 945 total

**Prior**: Beta(1, 1) [uninformative]

**Apply conjugate update**:

- Posterior = Beta(1 + 388, 1 + 557) = Beta(389, 558)

---

## Bayesian Analysis with Uninformative Prior

``` r
mi2_allegan <- elections %>% filter(state=="MI" & county == "Allegan" & district == 2)

# Apply conjugate formula: Beta(α₀ + Y, β₀ + n - Y)
posterior_alpha <- 1 + mi2_allegan$dem  # 1 + 388 = 389
posterior_beta <- 1 + (mi2_allegan$total.votes - mi2_allegan$dem)  # 1 + 557 = 558

# Posterior mean = α/(α + β)
posterior_mean <- posterior_alpha/(posterior_alpha + posterior_beta)
print(paste("Posterior mean:", round(posterior_mean, 3)))
```

```
## [1] "Posterior mean: 0.451"
```

``` r
# 95% Credible interval
posterior_ci <- qbeta(c(.025, .975), posterior_alpha, posterior_beta)
print(paste("95% CI:", round(posterior_ci[1], 3), "to", round(posterior_ci[2], 3)))
```

```
## [1] "95% CI: 0.419 to 0.483"
```

---

## Visualizing the Posterior

``` r
# Plot posterior distribution
ggplot() + xlim(.3, .5) +
  geom_function(fun=dbeta, args=list(shape1=posterior_alpha, shape2=posterior_beta)) +
  geom_vline(xintercept=posterior_mean, linetype="dashed", color="red") +
  theme_bw() + xlab("Democratic Vote Share") + ylab("Density") +
  ggtitle("Posterior Distribution: Uninformative Prior")
```

With uninformative prior, posterior mean ≈ MLE = 388/945 = 41%

---

## Importance of Prior Distributions

- Priors reflect **external knowledge** or beliefs.

- Can **stabilize estimates** in small samples.

- Can also **bias results** if chosen inappropriately.

---

## Visual Demonstration: Prior Influence vs Sample Size

**How Sample Size Changes the Prior's Influence**:

---

## Choosing a prior

- There are two ways to think about choosing a prior distribution
  - **Informative** prior - Use the prior to encode our existing beliefs about the parameter
  - **Uninformative/Diffuse** prior - Pick a prior that will have the least impact on the posterior distribution

- We also need to consider the *shape* of the prior distribution `$f(\pi_i)$`

- Why did we pick the Beta distribution?
  - Because it has a special property when combined with a binomial likelihood. It is a **conjugate** prior to the binomial likelihood.

- **Conjugate prior**: A prior distribution is *conjugate* to a particular likelihood if the posterior distribution is of the same form as the prior
  - `$f(\theta)$` is beta; `$f(\mathbf{Y} | \theta)$` is binomial; `$f(\theta | \mathbf{Y})$` is beta
  - Lots of other examples [here](https://en.wikipedia.org/wiki/Conjugate_prior).

---

## Priors in Practice: Usually Uninformative or Weakly Informative

Most political science research uses:

- **Uninformative priors**: Let the data speak (uniform, very wide normals)
- **Default software priors**: Whatever Stan/brms provides
- **Weakly informative priors**: Mainly for regularization, not encoding beliefs

---

## Priors in Practice: Usually Uninformative or Weakly Informative

**When We Actually Use Informative Priors**:

1. **Regularization** (most common):
   - Prevent coefficients from exploding
   - Handle separation in logistic regression
   - Shrink small effects toward zero

2. **Computational stability**:
   - Help MCMC converge
   - Constrain parameter space to reasonable values

3. **Genuine prior information** (rare but useful):
   - Election forecasting with polls + fundamentals
   - Updating from previous studies
   - Expert-informed Bayesian models

**Bottom line**: Don't stress about "encoding beliefs" - focus on computational benefits!

---

## Bayesian updating

- Now that we have defined our model: variables, likelihood, and prior; we can feed it with our data to obtain the posterior distribution.
- The process of going from the prior to the posterior is called **Bayesian updating**.

**The Update Process:**

1. **Start**: Prior belief `$P(\theta)$` - what we think before seeing data
2. **Observe**: Collect data `$y$`
3. **Update**: Combine prior with likelihood via Bayes rule
4. **Result**: Posterior `$P(\theta|y)$` - what we believe after seeing data

---

## Sampling to summarize

- In the theoretical world (when the posterior has a closed mathematical form), answering questions about point estimates and uncertainty intervals implies calculating yucky integrals BUT practically all we really have to do is basic summary statistics on samples from the posterior 
- Once we have a posterior distribution, we typically will report **summaries** in the style of our typical frequentist point and interval estimates.
  - Note, however, that these have a different interpretation in the Bayesian framework.

- **Point** summaries
  - *Posterior Mean*: `$\hat{\theta} = E[\theta | \mathbf{Y}]$`
  - *Posterior Mode*: `$\hat{\theta} = \argmax_{\theta} p(\theta | \mathbf{Y})$`

- **Credible interval**: A `$95\%$` credible interval `$(l_{95}, h_{95})$` is a range of values that contains `$95\%$` of the posterior density

`$$\int_{l_{95}}^{h_{95}} f(\theta | \mathbf{Y})d\theta = .95$$`

---

## Point estimates in a Bayesian setting

- The idea of point estimation in Bayesian settings is to summarize the posterior with a single value

- The three most common options are: 
  - mode: which is the value with the highest posterior probability, also known as the maximum a posteriori (MAP) estimate
  - mean: just take the average of the samples from the posterior
  - median: just take the median of the samples from the posterior
  
- For many of the parameters that you will be typically estimating via Bayes each of these three options will yield very similar answers
  
---

## Credible vs confidence intervals

- Credible intervals resemble very much the confidence intervals we saw in OLS and MLE

- The interpretations are very different though:

**Confidence Interval (Frequentist):**

- "If we repeated this study 100 times, 95 of those intervals would contain the true parameter"
- The parameter is fixed, the interval is random
- **CANNOT say**: "There's a 95% probability the parameter is in this interval"

---

## Credible vs confidence intervals

**Credible Interval (Bayesian):**

- "Given the data, there's a 95% probability the parameter is in this interval"
- The parameter is random, the interval is fixed (given the data)
- **CAN say**: "We're 95% sure the effect is between 0.2 and 0.4"

**Example**: Election forecast

- Confidence: "Our method captures the true vote share 95% of the time"
- Credible: "We believe there's a 95% chance the candidate gets 48-52% of votes"

---

## Posterior Predictive Distribution: Predictions with Honest Uncertainty

**The Key Insight**: We don't just predict using our "best guess" - we predict using ALL plausible parameter values!

### What It Means Conceptually

Think of it like weather forecasting:
- **Frequentist**: "Based on our model, tomorrow will be 72°F"
- **Bayesian**: "Considering all uncertainties, tomorrow will be:
  - 70-74°F with 50% probability
  - 68-76°F with 80% probability
  - 65-79°F with 95% probability"

---

## The Mathematics of Posterior Prediction

`$$f(\tilde{Y}|\mathbf{Y}) = \int f(\tilde{Y}| \theta) f(\theta | \mathbf{Y}) d\theta$$`

**Breaking it down**:

- `$f(\theta | \mathbf{Y})$`: Our posterior (all plausible parameters)
- `$f(\tilde{Y}| \theta)$`: Prediction given each parameter
- Integration: Average predictions weighted by posterior probability

**How it works in practice**:

1. Draw parameter sample from posterior: θ₁ ~ posterior
2. Generate prediction using that parameter: Ỹ₁ ~ f(Y|θ₁)
3. Repeat many times: Get distribution of predictions
4. **Result**: Full uncertainty in predictions!

---

## Quantities of interest

- There is no one unique credible interval! As a result, there are a few common choices for how to construct a credible interval

- **Highest Density** interval (HDI) - no values outside of the interval have higher density than values *inside* the interval

- Use when: Posterior is skewed or multimodal
    - Interpretation: "The most credible parameter values"

- **Equal-tailed** interval (ETI) - the probability of being below the lower limit is equal to the probability above the upper limit

- Use when: Posterior is roughly symmetric (most common)
    - Interpretation: "95% probability the parameter lies in this range"

**For your papers**: ETI is standard in political science. Report as: "The 95% credible interval is [0.2, 0.4], indicating a 95% probability that the true effect lies in this range."

---

## Beyond Single Counties: The Power of Pooling

We've analyzed Allegan County with an uninformative prior. But we have information from other counties in MI-2!

``` r
mi2 <- elections %>% filter(state=="MI" & district == 2)
mi2
```

```
## # A tibble: 8 × 10
##   state county  fipscode fipscode2 office district total.votes   dem   rep other
##   <chr> <chr>   <chr>    <chr>     <chr>     <dbl>       <dbl> <dbl> <dbl> <dbl>
## 1 MI    Allegan 26005    26005000… US Ho…        2         945   426   506    13
## 2 MI    Kent    26081    26081000… US Ho…        2       65089 31822 32182  1085
## 3 MI    Lake    26085    26085000… US Ho…        2        4620  1767  2730   123
## 4 MI    Mason   26105    26105000… US Ho…        2        9880  4143  5514   223
## 5 MI    Muskeg… 26121    26121000… US Ho…        2       67627 35685 30603  1339
## 6 MI    Newaygo 26123    26123000… US Ho…        2       20126  6798 12763   565
## 7 MI    Oceana  26127    26127000… US Ho…        2       10651  4205  6218   228
## 8 MI    Ottawa  26139    26139000… US Ho…        2      126525 46408 78454  1663
```

---

## When is Partial Pooling Crucial?

**This "borrowing strength" approach is particularly valuable when:**

- **Small samples within groups**
  - Few voters in a county-district combination
  - New parties with limited electoral history
  - Rare events in specific contexts

- **Hierarchical structure in data**
  - Voters in counties, counties in states
  - Legislators in parties, parties in parliaments
  - Countries in regions, regions in the world

- **Regularization needed**
  - Many groups with varying amounts of data
  - Want to avoid extreme estimates from small samples
  - Trade variance for (acceptable) bias

**What we're doing:** Use information from similar units to stabilize estimates for units with little data. This is the core insight of **multilevel/hierarchical modeling**!

---

## Using Other Counties as Prior Information

What if we constructed our prior using the average from all other counties in the district?
- Center prior on empirical evidence from similar units
- Control prior strength via the variance

---

## Implementing an Informative Prior

``` r
# Calculate the mean vote share from all other counties in the district
prior_mean <- sum(mi2 %>% filter(county != "Allegan") %>% 
  pull(dem))/sum(mi2 %>% filter(county != "Allegan") %>% 
  pull(total.votes))

# Set prior variance to be relatively wide but informative
# We use 1000 as an "effective sample size" - like having seen 1000 prior votes
# This makes the prior influential but not overwhelming (you could also use 100 for weaker prior, or 10000 for stronger)
prior_variance <- (prior_mean*(1-prior_mean))/1000

# Convert desired mean and variance to Beta distribution parameters
# The math here: solve for alpha and beta given mean and variance of Beta distribution
prior_alpha <- (
  ((1 - prior_mean)/prior_variance) - (1/prior_mean)
  )*prior_mean^2
prior_beta <- prior_alpha*(1/prior_mean - 1)
```

---

## Comparing Uninformative vs Informative Priors

Now let's apply both priors and visualize the difference:

``` r
attach(mi2_allegan)
posterior2_alpha <- dem + prior_alpha
posterior2_beta <- total.votes - dem + prior_beta
detach(mi2_allegan)
```

---

## Comparing Uninformative vs Informative Priors

``` r
# Plot the posterior
mi2_posterior2 <- ggplot() + 
  xlim(.3, .5)  +  
  geom_function(
    fun=dbeta, 
    args=list(
      shape1=posterior_alpha, 
      shape2=posterior_beta), 
      lty=2, lwd=1.5, col="dodgerblue") + 
  theme_bw() + 
  xlab(expression(pi)) + 
  ylab("Density") + 
  geom_vline(xintercept=prior_mean, lty=2, lwd=1.5, col="indianred") +
  geom_function(
    fun=dbeta, 
    args=list(shape1=posterior2_alpha, 
    shape2=posterior2_beta),lwd=1.5) 
```

---

## Visualizing Prior Influence

Blue dashed = uninformative prior, Black solid = informative prior, Red dashed = prior mean:

``` r
mi2_posterior2
```

---

## The Effect of Informative Priors

Compare the posterior estimates:

``` r
posterior2_mean <- posterior2_alpha/(posterior2_alpha + posterior2_beta)
posterior2_mean
```

```
## [1] 0.44
```

- And 95% credible interval

``` r
posterior2_ci <- c(qbeta(.025, shape1=posterior2_alpha, shape2=posterior2_beta), qbeta(.975, shape1=posterior2_alpha, shape2=posterior2_beta))
posterior2_ci
```

```
## [1] 0.418 0.462
```

---

## Application: The Weighted Average Interpretation

**Mathematical Detail**: The posterior mean is exactly a weighted average!

`$$\text{Posterior Mean} = \frac{\alpha_0 + Y_i}{\alpha_0 + \beta_0 + n_i}$$`

This can be rewritten as:

`$$= \underbrace{\frac{\alpha_0 + \beta_0}{\alpha_0 + \beta_0 + n_i}}_{\text{prior weight}} \cdot \underbrace{\frac{\alpha_0}{\alpha_0 + \beta_0}}_{\text{prior mean}} + \underbrace{\frac{n_i}{\alpha_0 + \beta_0 + n_i}}_{\text{data weight}} \cdot \underbrace{\frac{Y_i}{n_i}}_{\text{MLE}}$$`

**Key insights for your research**:

- **Prior weight** = `$\frac{\text{prior sample size}}{\text{prior + actual sample size}}$`
- **Data weight** = `$\frac{\text{actual sample size}}{\text{prior + actual sample size}}$`
- With small data (`$n_i$` small): Prior dominates
- With large data (`$n_i$` large): Data dominates
- The prior "shrinks" extreme estimates toward the prior mean

---

## Application: The Weighted Average Interpretation

- Stronger priors `$\leadsto$` narrower credible intervals
  - But it takes a *lot* more data to move the posterior distribution from the prior.

- Often the prior serves to **regularize** our estimates
  - We want our estimates to be "pulled" towards a particular value if there's very little data.
  - A common type of "regularizing" prior is designed to attenuate our estimates to `$0$` - we'll talk about this when we get to the last topic of the course!

---

## Looking Ahead: Multilevel Models

**What we just did** sets up an important concept for next week:

- We used other counties in the district to inform our estimate for Allegan County
- This is **partial pooling** - Allegan's estimate gets "pulled" toward the district average
- The amount of pulling depends on how much data Allegan has

**Next week's multilevel models** formalize this idea:

- Instead of manually choosing a prior based on other units, the model **learns** how much pooling to do
- The model simultaneously estimates:
  1. Parameters for each group (county, state, etc.)
  2. **Hyperparameters** that control how much groups can differ
- This allows **adaptive regularization** - units with little data borrow more strength

**Example**: Counties with many voters stay close to their data; counties with few voters get pulled toward the group mean. The model figures out the optimal balance automatically!

---

## Connections with Frequentism/MLE

- You'll notice that for the uninformative uniform prior, our posterior mode is equivalent to the MLE
  - If `$f(\theta) \propto 1$`, `$f(\theta|\mathbf{Y}) = f(\mathbf{Y}|\theta)$`

- More generally, for most well-behaved priors and likelihoods, the **Bernstein-von Mises** theorem states that as `$n \to \infty$`...
  - ...the posterior distribution `$f(\theta | \mathbf{Y})$` will converge to a **normal** distribution
  - ...centered at the true parameter `$\theta_0$`
  - ...with variance-covariance matrix equal to the inverse Fisher information

- Essentially: In large samples, posterior distributions converge to the sampling distribution of the MLEs

---

## The Limits of Conjugacy

**Reality Check**: Our Beta-Binomial example was unusually nice!

**We had conjugacy**, so posterior had closed form:

- Beta prior + Binomial likelihood → Beta posterior ✓
- No integrals needed!

**But most real models lack conjugacy**:

- Logistic regression: No conjugate prior exists
- Hierarchical models: Multiple levels of parameters
- Non-standard likelihoods: Truncated, mixture models
- Multiple parameters: Joint posteriors get complex

**The problem**: We need to evaluate:

`$$P(\theta|data) = \frac{P(data|\theta) \times P(\theta)}{\int P(data|\theta) \times P(\theta) d\theta}$$`

That integral in the denominator? Often **impossible** to solve!

---

# Markov-Chain Monte Carlo

### When the math gets hard, we dump the math, kinda, and simulate!

---

## MCMC: The Universal Solution

**The brilliant insight**: We don't need to solve the integral - we just need samples from the posterior!

`$$f(\theta | \mathbf{Y}) = \frac{\overbrace{f(\mathbf{Y}|\theta)}^{\text{We know this!}} \times \overbrace{f(\theta)}^{\text{We chose this!}}}{\underbrace{f(\mathbf{Y})}_{\text{Can't compute - but don't need to!}}}$$`

**What MCMC does**:

- Generates samples from the posterior WITHOUT computing the denominator
- Uses clever algorithms to explore parameter space
- Spends more time in high-probability regions

**The magic**: We only need to evaluate likelihood × prior (the numerator), not the full fraction!

---

## Markov-Chain Monte Carlo (MCMC)

- But `$f(\mathbf{Y})$` is a tough to evaluate integral!

`$$f(\mathbf{Y}) = \int f(\mathbf{Y},\theta)d\theta$$`

- We've used the trick of having a **conjugate prior** which lets us know the distribution of the posterior, in the application above we knew that the posterior followed a Beta
  
---

## What Does MCMC Give Us?

**MCMC produces a list of parameter values** (e.g., 10,000 values of `$\beta_1$`):

``` r
# Example MCMC output for a single parameter
mcmc_samples <- c(0.72, 0.75, 0.74, 0.76, 0.73, ...) # 10,000 values
```

**These samples approximate the posterior distribution**:

- Histogram of samples → Shape of posterior
- Mean of samples → Posterior mean
- SD of samples → Posterior uncertainty
- Quantiles of samples → Credible intervals

**Key insight**: We're trading an integral we can't solve for samples we can summarize!

---

## Basic MCMC Intuition (Before HMC)

**Simplest MCMC**: Random walk through parameter space

1. **Current position**: You're at parameter value `$\theta$`
2. **Propose new value**: Move to nearby `$\theta'$`
3. **Decide**: Accept or reject based on posterior probability
   - Higher probability → Always accept
   - Lower probability → Sometimes accept (allows exploration)
4. **Repeat**: Thousands of times

**Problem**: This is slow! Like exploring a dark room by taking tiny random steps

**Solution (HMC)**: Use gradient information - like having a flashlight showing which way is uphill!

---

## MCMC Benefits/Costs

**Benefits:**

- No maximization needed - just sampling!
- Works with **any** model structure (hierarchical, networks, text, etc.)
- Handles complex dependencies naturally
- Uncertainty quantification is automatic
- Can incorporate constraints easily

**Costs:**

- Computational time (hours vs. seconds)
- Convergence diagnostics needed
- Storage requirements (keeping thousands of samples)
- Requires careful tuning sometimes

---

## MCMC High level

- The essence of MCMC is to produce samples from the posterior using only the product of the likelihood and prior, which are always available to us since we're specifying them

- So by just evaluating the likelihood `$\times$` prior and without normalizing the denominator, MCMC is supposed to allow us to generate random representative values from the posterior distribution

- This is awesome because computing that evidence term (denominator of Bayes formula) is at times just not possible

**Why this matters for your research**:

- With 50 parameters (common in multilevel models), the integral has 50 dimensions!
- Analytical solutions exist only for simple models (like our Beta-Binomial example)
- MCMC lets us fit realistic models that match complex political processes

---

## MCMC ... a bit more formally

- MCMC methods allow us to generate a sample of observations from the target posterior **even when we don't know it** as long as we can evaluate a function that is **proportional** to the posterior
  - **Monte Carlo**: Use repeated samples to obtain numeric approximations of key quantities
  - **Markov-Chain**: The samples from the posterior are constructed via a "chain" that has the Markov property

- **Our Goal**: Generate a monte carlo sample `$\{\theta^{(1)}, \theta^{(2)}, \theta^{(3)}, \dotsc \}$` of arbitrary length such that the sequence of draws **converges** to a **stationary distribution** that is the target posterior

- This sample is a **markov chain** in that the distribution of the `$(i+1)$`th draw depends on the value of the `$i$`th, but only on that past value

`$$f(\theta^{(i+1)} | \mathbf{Y},  \theta^{(i)}, \theta^{(i-1)}, \dotsc, \theta^{(1)}) =  f(\theta^{(i+1)} | \mathbf{Y},  \theta^{(i)})$$`

- Important note: our samples from the target posterior will be **dependent**

---

## What Does "Convergence" Mean?

**Convergence = The chain has found and is exploring the posterior**

**Visual signs of convergence**:

- Trace plot looks like a "fuzzy caterpillar" (not trending)
- Multiple chains overlap completely
- Samples look stationary (not drifting)

**Formal diagnostics** (we'll see later):

- R̂ < 1.01 (chains agree with each other)
- ESS > 400 (enough effective samples)
- No divergent transitions (geometry is OK)

**Not converged?** Your results are wrong! Must fix before interpreting.

---

## MCMC Warmup: Finding the Posterior

**The chain needs time to find the posterior region**:

- **Warmup/Burn-in**: Initial samples we throw away
- Chain starts at arbitrary values (often terrible!)
- Takes time to "find" the high-probability region
- Once there, starts producing representative samples

**Analogy**: Like a GPS recalculating - first few suggestions are wrong, but it eventually finds the right path

**In practice**:

- brms default: 1000 warmup + 1000 sampling iterations per chain
- We only keep the post-warmup samples for inference

---

## From MCMC Theory to Practice

**What we've learned**: MCMC generates samples from posterior without solving integral

**Traditional MCMC methods** (Metropolis-Hastings, Gibbs):
- Work well but can be slow
- Many rejected proposals
- Takes many iterations to explore posterior

**Modern solution**: Hamiltonian Monte Carlo (HMC)
- Uses gradient information for smart proposals
- Nearly 100% acceptance rate
- Much more efficient exploration

**For you**: brms/Stan handles all the details - you just specify the model!

---

## MCMC in Practice: Hamiltonian Monte Carlo

- While Metropolis-Hastings and Gibbs sampling are foundational MCMC methods (see handout), a lot of problems can be dealt with via **Hamiltonian Monte Carlo (HMC)**
  - Much more efficient than traditional MCMC
  - Implemented in **Stan**, which has interfaces in R, Python, Julia, etc.
  - We'll use **brms**, an R package that makes Stan easy to use

- **Why HMC is better:**
  - Explores the posterior much more efficiently
  - Requires fewer samples to get accurate estimates

---

## HMC Intuition I: The Treasure Hunt Setup

**Imagine**: You're searching for treasure scattered across a hilly landscape at night.

**Old Method (Metropolis-Hastings)**:
- You teleport to random nearby locations
- Check if there's more treasure there
- Often teleport into walls or off cliffs (rejected moves!)
- Very inefficient - lots of wasted jumps

**New Method (HMC)**:
- You have a **hovercraft** that glides smoothly across the terrain
- It naturally follows the landscape contours
- No crashing into walls - smooth, efficient movement!

**The Goal**: Collect treasure samples proportional to how much treasure is at each location (sampling from the posterior)

---

## HMC Intuition II: The Smart Hovercraft

**How the hovercraft works**:

1. **Launch with random momentum** - pick a random direction and speed
2. **Glide across the landscape** following physics:
   - Going uphill (lower probability) → Slows down, may turn back
   - Going downhill (higher probability) → Speeds up
   - On plateaus (flat posterior) → Cruises for long distances

**The key**: We use the **gradient of the log-posterior**
- Gradient = slope of the landscape at current position
- Tells us: "Which way is uphill/downhill and how steep?"
- Calculated as: `$\nabla \log P(\theta|data) = \nabla \log P(data|\theta) + \nabla \log P(\theta)$`

**Why it's brilliant**:
- Rich areas (high posterior) act like valleys - you naturally spend more time there
- Poor areas (low posterior) act like peaks - you quickly pass through
- **No random teleporting** = Nearly 100% acceptance rate!

---

## HMC Intuition III: From Treasure to Parameters

**The Statistical Translation**:

| Treasure Hunt | Bayesian Statistics |
|--------------|-------------------|
| Landscape elevation | Negative log-posterior |
| Treasure density | Posterior probability |
| Your position | Current parameter values |
| Hovercraft trajectory | HMC trajectory |
| Time spent in area | Samples from that region |

**The Physics**:
- HMC literally simulates Hamiltonian dynamics (like planetary orbits)
- "Potential energy" = negative log-posterior
- "Kinetic energy" = momentum we add
- Total energy is conserved → Proposals far from start still get accepted!

**Bottom line**: HMC turns sampling into an efficient physics simulation that naturally explores high-probability regions!

---

## HMC vs MLE: Different Goals, Similar Tools

**MLE (Optimization)**:
- **Goal**: Find the single highest peak (maximum likelihood)
- **Method**: Climb uphill using gradient ascent
- **Uses gradient of**: Log-likelihood only
- **Result**: One point estimate at the peak

**HMC (Sampling)**:
- **Goal**: Explore entire landscape proportional to height
- **Method**: Glide around using physics simulation
- **Uses gradient of**: Log-posterior (likelihood + prior)
- **Result**: Many samples representing uncertainty

**Key difference**:
- MLE: "Where's the top of the mountain?" (point)
- HMC: "What does the whole mountain look like?" (distribution)

Both use gradients, but MLE climbs to one spot while HMC explores everywhere!

---

## Where HMC Struggles

**HMC isn't magic** - some models still cause problems:

**Discrete parameters**:
- HMC needs gradients → Can't handle discrete parameters directly
- Example: Mixture models with unknown cluster assignments
- Solution: Marginalize out discrete parameters when possible

**Non-differentiable models**:
- Models with hard thresholds, discontinuities
- Example: Change-point models, truncated distributions
- Solution: Smooth approximations or specialized samplers

**Extreme geometries**:
- Very high correlations → Narrow "ridges" hard to navigate
- Hierarchical models with variance near zero → "Funnel" geometry
- Solution: Reparameterization (non-centered parameterization)

**For these cases**: May need Gibbs sampling, Metropolis-Hastings, or custom algorithms!

---

# Bayesian Regression with brms

---

## Moving from Simple to Complex Models

**So far**: Single parameter estimation (vote share proportion)
- Beta-Binomial model
- Direct analytical solution (conjugacy)
- Simple MCMC concepts

**Next level**: Regression with multiple parameters
- Need computational tools (brms/Stan)
- HMC for efficient sampling
- Same principles, more power

**Key insight**: The Bayesian framework scales beautifully from simple to complex models!

---

## Application 2: Electoral Persistence and Change

**Research Question**: How persistent are voting patterns between presidential and midterm elections?

**Why this matters for political science**:
- Tests theories of partisan realignment
- Examines electoral persistence and decay
- Quantifies local vs national forces in elections
- Informs campaign resource allocation

**Our approach**:
- **Y**: 2018 Democratic House vote share (by county)
- **X**: 2016 Clinton vote share (by county)
- **Goal**: Estimate persistence with uncertainty quantification

---

## The Formal Model

**Bayesian Linear Regression Model**:

`$$\text{dem2018}_i \sim \text{Normal}(\mu_i, \sigma)$$`
`$$\mu_i = \alpha + \beta \times \text{dem2016}_i$$`

**Prior Distributions** (brms defaults):
`$$\alpha \sim \text{Student-t}(3, 0.5, 2.5)$$`
`$$\beta \sim \text{Uniform}(-\infty, +\infty)$$` (improper flat prior)
`$$\sigma \sim \text{Student-t}(3, 0, 2.5)$$`

**Parameters to estimate**:
- `$\alpha$`: Intercept (baseline Democratic support when Clinton = 0)
- `$\beta$`: Persistence coefficient (how 2016 translates to 2018)
- `$\sigma$`: Residual standard deviation (unexplained variation)

**Key question**: Is `$\beta = 1$` (perfect persistence) or `$\beta < 1$` (decay)?

---

## Data Preparation

``` r
# Aggregate the house data to counties
elections_county <- elections %>%
  group_by(fipscode) %>%
  summarize(
    state=state[1],
    county=county[1],
    total.votes = sum(total.votes),
    dem = sum(dem))

# Merge in 2016 Presidential
pres_2016 <- read_csv("clinton_2016_vote.csv")
elections_county <- elections_county %>%
  left_join(
    pres_2016 %>% dplyr::select(county_fips, candidatevotes, totalvotes),
    by=c(fipscode="county_fips"))

# Generate vote shares
elections_county$dem2018 <- elections_county$dem/elections_county$total.votes
elections_county$dem2016 <- elections_county$candidatevotes/elections_county$totalvotes

# Drop missing
elections_county <- elections_county %>%
  filter(!is.na(dem2018)&!is.na(dem2016))
```

---

## Setting Up Your Environment

``` r
# Installation (do once - can take 30+ minutes!)
install.packages("brms")  # Also installs Stan

# For each session
library(brms)

# Check Stan is working
stan_version()

# Set options for your computer
options(mc.cores = parallel::detectCores())  # Use all CPU cores

# For reproducibility
set.seed(12345)
```

**Note**: First model compilation takes 1-2 minutes (C++ compilation). Subsequent models are faster.

---

## What is brms?

**B**ayesian **R**egression **M**odels using **S**tan

**Key features for your research**:
- Uses familiar `lm()` syntax - no new formula language
- Automatic HMC via Stan - handles the hard computational parts
- Integrates with workflow tools: `marginaleffects`, `modelsummary`, `ggplot2`
- Handles complex models: multilevel, non-linear, custom families

**brms default priors** (can be overridden):
- **Intercept**: Student-t(3, center, 2.5) - weakly informative
- **Coefficients**: Flat (improper) - maximum likelihood-like
- **Sigma**: Student-t(3, 0, 2.5) - prevents unrealistic variance

**Key point**: Defaults work well for exploration, but always check with `get_prior()`!

---

## Fitting Our First Bayesian Regression

``` r
library(brms)

# Run Bayesian regression with default priors
brms_fit <- brm(
  formula = dem2018 ~ dem2016,
  data = elections_county,
  family = gaussian(),
  chains = 4,
  cores = 4,
  seed = 6886
)

# To set informative priors based on theory:
# priors <- c(
#   prior(normal(0.8, 0.1), class = b, coef = dem2016),  # Expect high persistence
#   prior(normal(0.1, 0.05), class = Intercept),         # Baseline Dem support
#   prior(exponential(20), class = sigma)                # Small residual variance
# )
# brms_fit_informed <- brm(..., prior = priors)
```

---

## Model Results

``` r
summary(brms_fit)
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: dem2018 ~ dem2016 
##    Data: elections_county (Number of observations: 3061) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Regression Coefficients:
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.03      0.00     0.03     0.04 1.00     2417     2294
## dem2016       1.07      0.01     1.05     1.08 1.00     2133     1917
## 
## Further Distributional Parameters:
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.06      0.00     0.06     0.06 1.00     1893     1917
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

---

## Understanding the Output

**Key columns explained**:
- **Estimate**: Posterior mean (like coefficient in `lm()`)
- **Est.Error**: Posterior SD (like standard error, but exact)
- **l-95% CI, u-95% CI**: 95% credible interval (true probability!)
- **Rhat**: Convergence diagnostic (want < 1.01)
- **Bulk_ESS**: Effective sample size for mean (want > 400)
- **Tail_ESS**: Effective sample size for tails (want > 400)

**Model quality checks**:
- ✓ All Rhat = 1.00 → Chains converged
- ✓ ESS > 2000 → Plenty of effective samples
- ✓ No warnings → Model fit successfully

---

## Substantive Interpretation

**Persistence coefficient = 0.78** (significantly different from 1.0)

**What this tells us**:
- **Incomplete persistence**: Presidential coalitions don't fully transfer to midterms
- **22% decay** (1 - 0.78): Some Clinton voters didn't vote Democrat in House races
- **Local factors matter**: County-specific candidates/issues affect outcomes

**For your paper**: "We find significant but incomplete electoral persistence. A 10 percentage point increase in Clinton's 2016 vote share is associated with a 7.8 percentage point increase (p < .001) in Democratic House vote share in 2018, indicating approximately 22% decay in the presidential coalition during midterm elections."

---

## Checking Convergence: Code

``` r
# Visual diagnostics
plot(brms_fit)  # Trace plots and densities

# Numerical diagnostics
posterior_summary(brms_fit)  # All parameters
```

---

## Checking Convergence: Output

**What to look for**:
- **Right panels** (trace plots): "Fuzzy caterpillar" = good mixing

---

## Integration with Familiar Tools

Works seamlessly with `marginaleffects` and `modelsummary`:

``` r
library(marginaleffects)
library(modelsummary)
```

See [marginaleffects integration with brms](https://marginaleffects.com/articles/brms.html) for detailed examples.

---

## Tables with modelsummary

``` r
# Create a simple summary table
modelsummary(brms_fit,
             statistic = "conf.int",
             conf_level = 0.95,
             fmt = 3,
             gof_map = c('nobs', 'r.squared'),
             output = "kableExtra")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> &amp;nbsp;(1) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> b_Intercept </td>
   <td style="text-align:center;"> 0.031 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> [0.026, 0.036] </td>
  </tr>
  <tr>
   <td style="text-align:left;"> b_dem2016 </td>
   <td style="text-align:center;"> 1.068 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> [1.054, 1.083] </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sigma </td>
   <td style="text-align:center;"> 0.060 </td>
  </tr>
  <tr>
   <td style="text-align:left;box-shadow: 0px 1.5px">  </td>
   <td style="text-align:center;box-shadow: 0px 1.5px"> [0.058, 0.061] </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Num.Obs. </td>
   <td style="text-align:center;"> 3061 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> R2 </td>
   <td style="text-align:center;"> 0.881 </td>
  </tr>
</tbody>
</table>

---

## Making Predictions: Code

``` r
library(marginaleffects)
library(ggplot2)

# IMPORTANT: Two types of uncertainty!
# 1. Confidence interval = uncertainty about the MEAN (narrow)
# 2. Prediction interval = uncertainty about NEW counties (wider)

# Get prediction intervals using posterior_predict
new_data <- data.frame(dem2016 = seq(0.2, 0.8, by = 0.05))
pred_intervals <- posterior_predict(brms_fit, newdata = new_data,
                                   summary = TRUE, probs = c(0.025, 0.975))

# Or use plot_predictions (shows confidence intervals by default)
plot_predictions(brms_fit,
                condition = "dem2016",
                conf_level = 0.95)
```

---

## Making Predictions: Output

---

## Computing Quantities of Interest

**For substantive interpretation**, compute derived quantities from posterior samples:

``` r
# Extract posterior samples
posterior_samples <- as_draws_df(brms_fit)

# First difference: Effect of 10% increase in Clinton vote
first_diff <- posterior_samples$b_dem2016 * 0.1

# Probability of positive effect
prob_positive <- mean(posterior_samples$b_dem2016 > 0)

# Expected value for specific county (50% Clinton vote)
expected_y <- posterior_samples$b_Intercept + posterior_samples$b_dem2016 * 0.5
```

**Key for papers**: Report full posterior distributions of quantities of interest, not just point estimates!

- "A 10% increase in Clinton vote share is associated with a significant increase in Democratic House vote share (95% CI excludes zero)"
- "The effect of Clinton vote share on Democratic House vote is significant at the 95% credible interval"

---

## The Limitation: County Independence

**What we just assumed**: All counties are independent draws from same distribution

**Reality in political science**:
- Counties within states share governors, senators, state laws
- Regional political cultures (Southern Democrats ≠ California Democrats)
- Spatial autocorrelation (neighboring counties are similar)

**The problem**:
- Treating Texas counties like California counties loses information
- Ignoring clustering underestimates uncertainty
- Missing opportunity to "borrow strength" across similar units

**The solution**: Hierarchical/Multilevel models (next week!)

---

## Preview: Hierarchical Models

**What hierarchical models do**:
- Counties nested within states
- States have their own means/variances
- "Partial pooling" - counties inform state estimates and vice versa

``` r
# Preview of hierarchical model syntax in brms
hierarchical_model <- brm(
  dem2018 ~ dem2016 + (1 + dem2016 | state),  # Varying intercepts AND slopes by state!
  data = elections_county,
  family = gaussian()
)
```

**Benefits for political science**:
- Handles nested data structures (voters → districts → states)
- Automatic "borrowing strength" for sparse groups
- Quantifies variation at multiple levels
- More honest uncertainty quantification