class: center, middle, inverse, title-slide

.title[
# Matching
]
.subtitle[
## EDLD 650: Week 8
]
.author[
### David D. Liebowitz
]

---

# Agenda
### 1. Roadmap and Goals (9:00-9:10)
### 2. Discussion Questions (9:10-10:20)
- Diaz & Handa
- Murnane & Willett, Ch. 12

### 3. Break (10:20-10:30)
### 4. Applied matching (10:30-11:40)
- PSM and CEM

### 5. Wrap-up (11:40-11:50)

---
# Roadmap

<img src="causal_id.jpg" width="1707" style="display: block; margin: auto;" />
---
# Goals

### 1. Describe conceptual approach to matching analysis

### 2. Assess validity of matching approach and what the selection-on-observables assumption implies

### 3. Conduct matching analysis in simplified data using both propensity-score matching and coarsened-exact matching (CEM)

---
class: middle, inverse

# So random...

---
class: middle, inverse

# Break

---
class: middle, inverse

# Matching:
## Propensity scores

---
# Recall Catholic school data

| id      | math12 | catholic | math8 | faminc8 |
|:--------|-------:|---------:|------:|--------:|
| 124902  |  49.77 |        1 | 50.27 |      10 |
| 180625  |  51.51 |        1 | 41.31 |      11 |
| 702949  |  48.28 |        0 | 45.75 |      11 |
| 710976  |  53.01 |        0 | 46.05 |       9 |
| 1425490 |  65.35 |        1 | 66.69 |      10 |

---
## Are Catholic HS higher-performing?

```r
catholic %>% group_by(catholic) %>%
  summarise(n_students = n(),
   mean_math = mean(math12), SD_math = sd(math12))
```

```
#> # A tibble: 2 x 4
#>   catholic  n_students mean_math SD_math
#>   <dbl+lbl>      <int>     <dbl>   <dbl>
#> 1 0 [no]          5079      50.6    9.53
#> 2 1 [yes]          592      54.5    8.46
```

---
## Are Catholic HS higher-performing?
<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
---
## Are Catholic HS higher-performing?

```r
ols1 <- lm(math12 ~ catholic, data=catholic)
summary(ols1)
```

```
...
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  50.6447     0.1323 382.815   <2e-16 ***
#> catholic      3.8949     0.4095   9.512   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9.428 on 5669 degrees of freedom
#> Multiple R-squared:  0.01571,	Adjusted R-squared:  0.01554 
#> F-statistic: 90.48 on 1 and 5669 DF,  p-value: < 2.2e-16
...
```

.blue[**What is wrong with all of these approaches?**]

---
## Are Catholic attendees different?

.small[

```r
table <- tableby(catholic ~ faminc8 + math8 + white + female, 
        numeric.stats=c("meansd"), cat.stats=c("N", "countpct"), 
        digits=2, data=catholic)
mylabels <- list(faminc8 = "Family income level in 8th grade", 
        math8 = "8th grade math score")
summary(table, labelTranslations = mylabels)
```

|                                     |  0 (N=5079)  |  1 (N=592)   | Total (N=5671) | p value|
|:------------------------------------|:------------:|:------------:|:--------------:|-------:|
|**Family income level in 8th grade** |              |              |                | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD)          | 9.43 (2.25)  | 10.36 (1.68) |  9.53 (2.22)   |        |
|**8th grade math score**             |              |              |                | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD)          | 51.24 (9.75) | 53.66 (8.83) |  51.49 (9.68)  |        |
|**student is white?**                |              |              |                | < 0.001|
|&nbsp;&nbsp;&nbsp;Mean (SD)          | 0.68 (0.47)  | 0.80 (0.40)  |  0.69 (0.46)   |        |
|**student is female?**               |              |              |                |   0.253|
|&nbsp;&nbsp;&nbsp;Mean (SD)          | 0.52 (0.50)  | 0.54 (0.50)  |  0.52 (0.50)   |        |
]

---
# Implementing matching
### Reminder of key assumptions/issues:

.pull-left[ .large[
1. Selection on observables
2. Treatment is as-good-as-random, conditional on known set of observables
3. Tradeoff between bias, variance and generalizability
]
]
.pull-right[
<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]
---
# Practical considerations
We can implement matching in several ways. For pedagogical purposes, we'll combine the `MatchIt` package (which is similar to the `cem` package for Coarsened Exact Matching), the `fixest` implementation of logistic regression, and some data manipulation by hand.<sup>[1]</sup>

```r
# install.packages("MatchIt")
# install.packages("gtools")
```

.footnote[[1] Most of the coarsening we'll do can be done directly within the `MatchIt` package, but it's good to get your hands into the data to truly understand what it is you're doing!]
---
## Phase 1: Generate propensities
### Step 1: Estimate selection model

```r
pscores <- feglm(catholic ~ inc8 + math8 + mathfam, 
                 family=c("logit"), data=catholic)
summary(pscores)
```

```
#> GLM estimation, family = binomial(link = "logit"), Dep. Var.: catholic
#> Observations: 5,671 
#> Standard-errors: IID 
#>              Estimate Std. Error  t value   Pr(>|t|)    
#> (Intercept) -5.208846   0.586532 -8.88075  < 2.2e-16 ***
#> inc8         0.061803   0.014058  4.39633 1.1009e-05 ***
#> math8        0.042959   0.011138  3.85707 1.1476e-04 ***
#> mathfam     -0.000734   0.000262 -2.80586 5.0183e-03 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Log-Likelihood: -1,837.6   Adj. Pseudo R2: 0.030071
#>            BIC:  3,709.8     Squared Cor.: 0.018645
```
---
## Phase 1: Generate propensities
### Step 2: Predict selection likelihood

```r
pscore_df <- data.frame(p_score = predict(pscores, type="response"),
                     catholic = catholic$catholic)
head(pscore_df)
```

```
#>      p_score catholic
#> 1 0.09094085        1
#> 2 0.09312787        1
#> 3 0.08635750        1
#> 4 0.08478468        1
#> 5 0.13309352        1
#> 6 0.07903282        1
```

*Note*: to apply Inverse-Probability Weights (IPW), you would take these propensities and assign weights of `$1/\hat{p}$` to treatment and `$1/(1-\hat{p})$` to control units.
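
---
## Aside: IPW weights by hand

The weighting rule in the note can be sketched in a few lines of base R. This is a minimal illustration on made-up propensities, not the Catholic school data; the data frame `toy` and its columns are hypothetical names.

```r
# Toy propensities (hypothetical values, not from the Catholic school data)
toy <- data.frame(
  treat   = c(1, 1, 0, 0, 0),
  p_score = c(0.50, 0.25, 0.50, 0.20, 0.10)
)

# Inverse-probability weights:
#   treated units get 1 / p, control units get 1 / (1 - p)
toy$ipw <- ifelse(toy$treat == 1,
                  1 / toy$p_score,
                  1 / (1 - toy$p_score))
toy
```

Treated units that looked unlikely to be treated, and control units that looked likely to be treated, receive the largest weights. A column like `ipw` could then be supplied to the `weights` argument of `lm()`.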

---
## Phase 1: Generate propensities
### Step 3: Common support (pre-match)
<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
---
## Phase 2: PS Matching
### Step 1: Assign nearest-neighbor match<sup>[1]</sup>

```r
matched <- matchit(catholic ~ math8 + inc8, method="nearest", 
                   replace=T, discard="both", data=catholic)
df_match <- match.data(matched)

# How many rows/columns in resulting dataframe?
dim(df_match)
```

```
#> [1] 1118   30
```

.footnote[[1] As you might anticipate, there are *lots* of different ways besides "nearest-neighbor with replacement" to create these matches.]

This is **NOT** the same number of observations as were in the original sample... .blue[what happened?]
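
---
## Aside: Nearest-neighbor matching by hand

One way to see why the matched sample shrinks is to do the matching by hand on a tiny example. The sketch below uses made-up propensity scores (`treated`, `control` and their values are hypothetical) and mimics nearest-neighbor matching with replacement; it is not how `MatchIt` is implemented internally.

```r
# Toy data (hypothetical propensity scores)
treated <- data.frame(id = c("t1", "t2", "t3"), p = c(0.30, 0.32, 0.60))
control <- data.frame(id = c("c1", "c2", "c3", "c4", "c5"),
                      p  = c(0.05, 0.10, 0.31, 0.58, 0.90))

# For each treated unit, find the control with the closest propensity score.
# With replacement, the same control can be picked more than once.
nearest <- sapply(treated$p, function(p_t) which.min(abs(control$p - p_t)))
matches <- data.frame(treated = treated$id, control = control$id[nearest])
matches

# Controls that are never picked drop out of the matched sample
used <- unique(control$id[nearest])
n_matched <- nrow(treated) + length(used)
n_matched
```

Here two treated units share one control, and three controls are never used. The same logic applies to the real data: all 592 treated students are kept, but only 526 distinct control students serve as a match, giving 1,118 rows.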

---
## Phase 2: PS Matching 
### Step 2: Common support (post-match)
<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
---
## Phase 2: PS Matching
### Step 3: Examine balance
*(doesn't really fit on screen)*

```r
summary(matched)
```

```
#> 
#> Call:
#> matchit(formula = catholic ~ math8 + inc8, data = catholic, method = "nearest", 
#>     discard = "both", replace = T)
#> 
#> Summary of Balance for All Data:
#>          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> distance        0.1216        0.1024          0.4351     1.0216    0.1343
#> math8          53.6604       51.2365          0.2746     0.8201    0.0751
#> inc8           39.5346       31.8548          0.4714     0.8886    0.0777
#>          eCDF Max
#> distance   0.2142
#> math8      0.1550
#> inc8       0.1934
#> 
#> Summary of Balance for Matched Data:
#>          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> distance        0.1216        0.1216          0.0001     1.0000    0.0002
#> math8          53.6604       53.4416          0.0248     0.9497    0.0119
#> inc8           39.5346       39.6698         -0.0083     1.0407    0.0045
#>          eCDF Max Std. Pair Dist.
#> distance   0.0068          0.0005
#> math8      0.0304          0.5183
#> inc8       0.0118          0.1738
#> 
#> Sample Sizes:
#>               Control Treated
#> All           5079.       592
#> Matched (ESS)  469.79     592
#> Matched        526.       592
#> Unmatched     4465.         0
#> Discarded       88.         0
```
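
---
## Aside: Effective sample size

The "Matched (ESS)" row is smaller than the "Matched" row because controls that are reused carry unequal weights. A common definition of effective sample size, which I understand `MatchIt` to report (treat this as an assumption), is `$(\sum w)^2 / \sum w^2$`. A minimal sketch with made-up weights:

```r
# Effective sample size for a vector of weights
ess <- function(w) sum(w)^2 / sum(w^2)

ess(c(1, 1, 1, 1))       # equal weights: ESS equals the number of units
ess(c(3, 1, 0.5, 0.5))   # unequal weights: ESS falls below the number of units
```

The more a few heavily reused controls dominate the weights, the further the ESS falls below the raw count of matched controls.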
---
## Phase 2: PS Matching
### Step 3: Examine balance

**Summary of balance for .red-pink[all] data:**

Variable |   Means Treated | Means Control |  Std. Mean Diff
-----------|-----------------| --------------|--------------
distance   |     0.1216      |   0.1024      |  0.4351  
math8      |     53.6604     |   51.2365     |  0.2746  
inc8       |    39.5346      |   31.8548     |  0.4714

**Summary of balance for .red-pink[matched] data:**

Variable |   Means Treated | Means Control |  Std. Mean Diff
-----------|-----------------| --------------|--------------
distance   |     0.1216      |   0.1216      |  0.0001  
math8      |     53.6604     |   53.4416     |  0.0248  
inc8       |    39.5346      |  39.6698      | -0.0083
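
---
## Aside: Standardized mean difference by hand

The standardized mean difference for `math8` in the full sample can be roughly reproduced from numbers already on the slides. The sketch assumes the difference in means is divided by the treated-group SD, which is my understanding of the `MatchIt` default when targeting the ATT; the SD of 8.83 comes from the descriptive table and is rounded.

```r
# Values copied from the slides (full, unmatched sample)
mean_treated <- 53.6604   # math8 mean, Catholic attendees
mean_control <- 51.2365   # math8 mean, public school attendees
sd_treated   <- 8.83      # math8 SD among Catholic attendees (rounded)

# Assumption: standardize by the treated-group SD
smd_math8 <- (mean_treated - mean_control) / sd_treated
smd_math8
```

The result is close to the 0.2746 that `summary(matched)` reports; the small gap comes from using a rounded SD.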

---
## Phase 2: PS Matching

Could get even closer with fuller model:

```r
matched2 <- matchit(catholic ~ math8 + inc8 + inc8sq + mathfam, 
       method="nearest", replace=T, discard="both", data=catholic)
```

<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
---
## Phase 2: Estimate effects

```r
psmatch2 <- lm(math12 ~ catholic + math8 + inc8 + inc8sq + mathfam, 
               weights = weights, data=df_match)
#Notice how we have matched on just math8 and inc8 but are now 
#   adjusting for more in our estimation. This is fine! 
# Very important to include weights!
summary(psmatch2)
```

```
...
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  1.3079490  2.4457425   0.535 0.592905    
#> catholic     1.5990422  0.3144335   5.085 4.30e-07 ***
#> math8        0.9065628  0.0468521  19.349  < 2e-16 ***
#> inc8         0.3701132  0.0663303   5.580 3.02e-08 ***
#> inc8sq      -0.0015921  0.0005686  -2.800 0.005194 ** 
#> mathfam     -0.0040694  0.0010783  -3.774 0.000169 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.244 on 1112 degrees of freedom
#> Multiple R-squared:  0.6323,	Adjusted R-squared:  0.6307 
#> F-statistic: 382.5 on 5 and 1112 DF,  p-value: < 2.2e-16
...
```

---
## Can you interpret these results?

<br>

> In a matched sample of students who had nearly identical 8th grade math test scores and family income levels and were equally likely to attend Catholic school based on these observable conditions, the effect of attending parochial high school was to increase 12th grade math test scores by 1.60 scale score points [95% CI: 0.98, 2.22]. To the extent that families' selection into Catholic high school is based entirely on their children's 8th grade test scores and their family income, we can interpret this as a credibly causal estimate of the effect of Catholic high school attendance, purged of observable variable bias.
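
---
## Aside: Where does the 95% CI come from?

The confidence interval in the interpretation can be recovered from the `psmatch2` output. This sketch uses the conventional `lm` standard error and residual degrees of freedom reported above; the object names are illustrative.

```r
est <- 1.5990422   # coefficient on catholic from psmatch2
se  <- 0.3144335   # its standard error
df  <- 1112        # residual degrees of freedom

# Two-sided 95% interval based on the t distribution
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 2)
```

This reproduces the interval of roughly 0.98 to 2.22 scale score points.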

---
class: middle, inverse

# Matching:

## Coarsened Exact Matching (CEM)

---
# A different approach: CEM

### Some concerns with PSM:
* Model (rather than theory) dependent
* Lacks transparency
* Can exclude large portions of data
* Potential for bias
* *We'll return to these at the end!*

`$\rightarrow$` more transparent (?) approach ... .blue[**Coarsened Exact Matching**] ... literally what the words say!

### Basic intuition: 
* Create bins of observations by covariates and require observations to match exactly within these bins. 
* Can require some bins be as fine-grained as original variables (then, it's just exact matching).
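
---
# CEM intuition by hand

The binning logic above can be sketched in base R on a tiny made-up data set. All names and values here (`toy`, `income`, `math`, the break points) are hypothetical, and this is a simplified illustration rather than the `MatchIt` implementation: it only finds the matched strata and does not compute weights.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  treat  = c(1, 1, 1, 0, 0, 0, 0, 0),
  income = c(10, 11, 3, 10, 10, 11, 12, 5),
  math   = c(52, 61, 38, 54, 47, 63, 66, 58)
)

# Coarsen math into bins; keep income at its original values (exact match)
toy$math_bin <- cut(toy$math, breaks = c(-Inf, 45, 55, 65, Inf))
toy$stratum  <- interaction(toy$income, toy$math_bin, drop = TRUE)

# Keep only strata with at least one treated AND one control unit
n_t  <- tapply(toy$treat == 1, toy$stratum, sum)
n_c  <- tapply(toy$treat == 0, toy$stratum, sum)
keep <- names(n_t)[n_t > 0 & n_c > 0]
matched <- toy[toy$stratum %in% keep, ]
matched
```

Units in strata that contain only treated or only control observations are dropped. Wider bins keep more of the sample but allow less similar units to count as matches.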

---
# Creating bins

```r
table(catholic$faminc8)
```

```
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12 
#>   18   42   84   85  144  175  447  441  655 1267 1419  894
```
--

```r
catholic <- mutate(catholic, coarse_inc=ifelse(faminc8<5,1,faminc8))
catholic$coarse_inc <- as.ordered(catholic$coarse_inc)
levels(catholic$coarse_inc)
```

```
#> [1] "1"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
```
--

```r
summary(catholic$math8)
```

```
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   34.48   43.45   50.45   51.49   58.55   77.20
```

```r
mathcuts <- c(43.45, 51.49, 58.55)
```

---
# CEM matches

```r
cem <- matchit(catholic ~ coarse_inc + math8, 
      cutpoints=list(math8=mathcuts), method="cem", data=catholic)
df_cem <- match.data(cem)
table(df_cem$catholic)
```

```
#> 
#>    0    1 
#> 5079  592
```

This is the same number of observations as were in the original sample. .blue[What does this imply?]

---
# Quality of matches

```r
summary(cem)
```

```
#> 
#> Call:
#> matchit(formula = catholic ~ coarse_inc + math8, data = catholic, 
#>     method = "cem", cutpoints = list(math8 = mathcuts))
#> 
#> Summary of Balance for All Data:
#>              Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> coarse_inc1         0.0135        0.0435         -0.2598          .    0.0300
#> coarse_inc5         0.0101        0.0272         -0.1701          .    0.0170
#> coarse_inc6         0.0101        0.0333         -0.2310          .    0.0231
#> coarse_inc7         0.0338        0.0841         -0.2783          .    0.0503
#> coarse_inc8         0.0524        0.0807         -0.1273          .    0.0284
#> coarse_inc9         0.0794        0.1197         -0.1491          .    0.0403
#> coarse_inc10        0.2196        0.2239         -0.0103          .    0.0043
#> coarse_inc11        0.3345        0.2404          0.1994          .    0.0941
#> coarse_inc12        0.2466        0.1473          0.2305          .    0.0993
#> math8              53.6604       51.2365          0.2746     0.8201    0.0751
#>              eCDF Max
#> coarse_inc1    0.0300
#> coarse_inc5    0.0170
#> coarse_inc6    0.0231
#> coarse_inc7    0.0503
#> coarse_inc8    0.0284
#> coarse_inc9    0.0403
#> coarse_inc10   0.0043
#> coarse_inc11   0.0941
#> coarse_inc12   0.0993
#> math8          0.1550
#> 
#> Summary of Balance for Matched Data:
#>              Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> coarse_inc1         0.0135        0.0135          0.0000          .    0.0000
#> coarse_inc5         0.0101        0.0101          0.0000          .    0.0000
#> coarse_inc6         0.0101        0.0101          0.0000          .    0.0000
#> coarse_inc7         0.0338        0.0338          0.0000          .    0.0000
#> coarse_inc8         0.0524        0.0524          0.0000          .    0.0000
#> coarse_inc9         0.0794        0.0794          0.0000          .    0.0000
#> coarse_inc10        0.2196        0.2196         -0.0000          .    0.0000
#> coarse_inc11        0.3345        0.3345          0.0000          .    0.0000
#> coarse_inc12        0.2466        0.2466          0.0000          .    0.0000
#> math8              53.6604       53.8447         -0.0209     0.8948    0.0106
#>              eCDF Max Std. Pair Dist.
#> coarse_inc1    0.0000          0.0000
#> coarse_inc5    0.0000          0.0000
#> coarse_inc6    0.0000          0.0000
#> coarse_inc7    0.0000          0.0000
#> coarse_inc8    0.0000          0.0000
#> coarse_inc9    0.0000          0.0000
#> coarse_inc10   0.0000          0.0000
#> coarse_inc11   0.0000          0.0000
#> coarse_inc12   0.0000          0.0000
#> math8          0.0431          0.3851
#> 
#> Sample Sizes:
#>               Control Treated
#> All           5079.       592
#> Matched (ESS) 3943.87     592
#> Matched       5079.       592
#> Unmatched        0.         0
#> Discarded        0.         0
```

---
# Quality of matches

**Summary of balance for .red-pink[all] data:**

.small[
  Variable     |  Means Treated | Means Control | Std. Mean Diff 
 ------------- | -------------- | ------------- | --------- 
coarse_inc1    |     0.0135     |   0.0435      | -0.2598  
coarse_inc5    |     0.0101     |   0.0272      | -0.1701  
coarse_inc6    |     0.0101     |   0.0333      | -0.2310 
coarse_inc7    |     0.0338     |   0.0841      | -0.2783  
coarse_inc8    |     0.0524     |   0.0807      | -0.1273  
coarse_inc9    |     0.0794     |   0.1197      | -0.1491 
coarse_inc10   |     0.2196     |   0.2239      | -0.0103  
coarse_inc11   |     0.3345     |   0.2404      |  0.1994  
coarse_inc12   |     0.2466     |   0.1473      |  0.2305 
math8          |    53.6604     |  51.2365      |  0.2746 
]

---
# Common support?

```r
df_cem1 <- df_cem %>% group_by(catholic, subclass) %>% 
            summarise(count= n())  
df_cem1 <- df_cem1 %>%  mutate(attend = count / sum(count))
```
<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---
# Different cuts?

We can also cut at different quantiles, e.g., quintiles:

```r
math8_quints <- gtools::quantcut(catholic$math8, 5)
table(math8_quints)
```

```
#> math8_quints
#> [34.5,42.1] (42.1,47.6] (47.6,53.3] (53.3,60.6] (60.6,77.2] 
#>        1136        1133        1134        1134        1134
```

You might also have a substantive reason for the cuts:

```r
mathcuts2 <- c(40, 45, 50, 55, 60, 65, 70)
```
---
# Different cuts: Balance

```
#> 
#> Call:
#> matchit(formula = catholic ~ coarse_inc + math8, data = catholic, 
#>     method = "cem", cutpoints = list(math8 = mathcuts2))
#> 
#> Summary of Balance for All Data:
#>              Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> coarse_inc1         0.0135        0.0435         -0.2598          .    0.0300
#> coarse_inc5         0.0101        0.0272         -0.1701          .    0.0170
#> coarse_inc6         0.0101        0.0333         -0.2310          .    0.0231
#> coarse_inc7         0.0338        0.0841         -0.2783          .    0.0503
#> coarse_inc8         0.0524        0.0807         -0.1273          .    0.0284
#> coarse_inc9         0.0794        0.1197         -0.1491          .    0.0403
#> coarse_inc10        0.2196        0.2239         -0.0103          .    0.0043
#> coarse_inc11        0.3345        0.2404          0.1994          .    0.0941
#> coarse_inc12        0.2466        0.1473          0.2305          .    0.0993
#> math8              53.6604       51.2365          0.2746     0.8201    0.0751
#>              eCDF Max
#> coarse_inc1    0.0300
#> coarse_inc5    0.0170
#> coarse_inc6    0.0231
#> coarse_inc7    0.0503
#> coarse_inc8    0.0284
#> coarse_inc9    0.0403
#> coarse_inc10   0.0043
#> coarse_inc11   0.0941
#> coarse_inc12   0.0993
#> math8          0.1550
#> 
#> Summary of Balance for Matched Data:
#>              Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
#> coarse_inc1         0.0119        0.0119          0.0000          .     0.000
#> coarse_inc5         0.0085        0.0085          0.0000          .     0.000
#> coarse_inc6         0.0102        0.0102          0.0000          .     0.000
#> coarse_inc7         0.0339        0.0339          0.0000          .     0.000
#> coarse_inc8         0.0525        0.0525          0.0000          .     0.000
#> coarse_inc9         0.0797        0.0797         -0.0000          .     0.000
#> coarse_inc10        0.2203        0.2203         -0.0000          .     0.000
#> coarse_inc11        0.3356        0.3356          0.0000          .     0.000
#> coarse_inc12        0.2475        0.2475          0.0000          .     0.000
#> math8              53.5927       53.4289          0.0186     0.9794     0.006
#>              eCDF Max Std. Pair Dist.
#> coarse_inc1    0.0000          0.0000
#> coarse_inc5    0.0000          0.0000
#> coarse_inc6    0.0000          0.0000
#> coarse_inc7    0.0000          0.0000
#> coarse_inc8    0.0000          0.0000
#> coarse_inc9    0.0000          0.0000
#> coarse_inc10   0.0000          0.0000
#> coarse_inc11   0.0000          0.0000
#> coarse_inc12   0.0000          0.0000
#> math8          0.0225          0.1863
#> 
#> Sample Sizes:
#>               Control Treated
#> All           5079.       592
#> Matched (ESS) 3801.07     590
#> Matched       4866.       590
#> Unmatched      213.         2
#> Discarded        0.         0
```
---
# Big improvements!

**Summary of balance for .red-pink[matched] data:**

.small[
  Variable     |  Means Treated | Means Control | Std. Mean Diff 
 ------------- | -------------- | ------------- | --------- 
coarse_inc1    |     0.0119     |   0.0119      | 0.0000  
coarse_inc5    |     0.0085     |   0.0085      | 0.0000  
coarse_inc6    |     0.0102     |   0.0102      | 0.0000 
coarse_inc7    |     0.0339     |   0.0339      | 0.0000  
coarse_inc8    |     0.0525     |   0.0525      | 0.0000  
coarse_inc9    |     0.0797     |   0.0797      | 0.0000 
coarse_inc10   |     0.2203     |   0.2203      | 0.0000  
coarse_inc11   |     0.3356     |   0.3356      | 0.0000  
coarse_inc12   |     0.2475     |   0.2475      | 0.0000  
math8          |    53.5927     |  53.4289      | 0.0186  
]

.small[We've forced T/C to be identical within income bins. The *original* **math8** variable still has some imbalance (but it's much better). Within **mathcuts2**, T/C would be identical.]

---
# Minimal sample loss

Sample sizes:

Category | Control | Treated
---------- | ------- | -------
All        |   5079  |    592
Matched    |   4866  |    590
Unmatched  |    213  |      2

--
Common support?
--
<img src="EDLD_650_8_match_2_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
---
# Estimating effects

```r
att2 <- lm(math12 ~ catholic + coarse_inc + math8, 
           data=df_cem2, weights = weights)
summary(att2)
```

```
#> 
#> Call:
#> lm(formula = math12 ~ catholic + coarse_inc + math8, data = df_cem2, 
#>     weights = weights)
#> 
#> Weighted Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -28.504  -3.144  -0.064   3.192  26.186 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  10.26504    0.44843  22.891  < 2e-16 ***
#> catholic      1.50497    0.22886   6.576 5.28e-11 ***
#> coarse_inc.L  2.85163    0.50177   5.683 1.39e-08 ***
#> coarse_inc.Q  0.02882    0.43027   0.067    0.947    
#> coarse_inc.C -0.19212    0.47399  -0.405    0.685    
#> coarse_inc^4 -0.26505    0.48358  -0.548    0.584    
#> coarse_inc^5 -0.03738    0.47728  -0.078    0.938    
#> coarse_inc^6  0.01615    0.48925   0.033    0.974    
#> coarse_inc^7 -0.30582    0.43986  -0.695    0.487    
#> coarse_inc^8  0.20155    0.35084   0.574    0.566    
#> math8         0.78009    0.00821  95.019  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5.25 on 5445 degrees of freedom
#> Multiple R-squared:  0.6443,	Adjusted R-squared:  0.6436 
#> F-statistic: 986.3 on 10 and 5445 DF,  p-value: < 2.2e-16
```
---
# Let's look across estimates

<table style="text-align:center"><tr><td colspan="7" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td>OLS</td><td colspan="3">PSM</td><td colspan="2">CEM</td></tr>
<tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td><td>(4)</td><td>(5)</td><td>(6)</td></tr>
<tr><td colspan="7" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Attend catholic school</td><td>3.895<sup>***</sup> (0.409)</td><td>1.612<sup>***</sup> (0.318)</td><td>1.599<sup>***</sup> (0.314)</td><td>1.688<sup>***</sup> (0.306)</td><td>1.561<sup>***</sup> (0.228)</td><td>1.505<sup>***</sup> (0.229)</td></tr>
<tr><td colspan="7" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>5,671</td><td>1,118</td><td>1,118</td><td>1,126</td><td>5,671</td><td>5,456</td></tr>
<tr><td style="text-align:left">R<sup>2</sup></td><td>0.016</td><td>0.623</td><td>0.632</td><td>0.651</td><td>0.656</td><td>0.644</td></tr>
<tr><td colspan="7" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="6" style="text-align:left">*p<0.05; **p<0.01; ***p<0.001. Models 2-3 and 5-6 match on income and math score. Model 3 adjusts for higher-order terms and interactions post matching; Model 4 includes them in matching algorithm. Model 6 uses narrower bins to match than Model 5. All CEM and PSM estimates are doubly-robust. Outcome mean (SD) for treated = 54.5 (8.5)</td></tr>
</table>
---
## How would you describe results?

<br>

> A na&iuml;ve estimate of 12th grade math test score outcomes suggests that students who attended Catholic high school scored nearly 4 scale score points higher than those who attended public school (almost half a standard deviation). However, the characteristics of students who attended Catholic high school were substantially different. They had higher family income, scored higher on 8th grade math tests and were more likely to be White, among other distinguishing characteristics. We theorize that the primary drivers of Catholic school attendance are students' 8th grade performance and family income. We match on these two characteristics using two separate algorithms: Propensity Score Matching and Coarsened Exact Matching. Both sets of estimates indicate that the benefits of Catholic school are overstated in the full sample, but the attenuated results are still large in magnitude (just under one-fifth of a SD) and statistically significant.
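
---
## Aside: Estimates in SD units

The effect sizes quoted in the write-up can be checked by dividing each estimate in the table by the outcome SD. The sketch below assumes the SD of `math12` among Catholic attendees (8.46, from the descriptive summary) is the scaling factor; the vector names are illustrative labels for the six table columns.

```r
sd_treated <- 8.46   # SD of math12 among Catholic attendees

# Coefficients on Catholic attendance, models (1) through (6)
est <- c(OLS_1 = 3.895, PSM_2 = 1.612, PSM_3 = 1.599, PSM_4 = 1.688,
         CEM_5 = 1.561, CEM_6 = 1.505)

round(est / sd_treated, 2)
```

The na&iuml;ve OLS estimate is a bit under half a SD, while the matched estimates cluster just under one-fifth of a SD.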

---
## Strengths/limits of approaches
.small[

| Approach             |  Strengths                     |  Limitations |
|----------------------|--------------------------------|--------------|
| Propensity-score nearest <br> neighbor matching w/ calipers and replacement | - Simulates ideal randomized experiment <br> - Limits dimensionality problem <br> - Calipers restrict poor matches <br> - Replacement takes maximal advantage of available data | - May generate poor matches <br> - Model dependent <br> - Lacks transparency; PS in arbitrary units <br> - Potential for bias [(King & Nielsen, 2019)](https://www.cambridge.org/core/journals/political-analysis/article/abs/why-propensity-scores-should-not-be-used-for-matching/94DDE7ED8E2A796B693096EB714BE68B) |
| Propensity-score stratification | - Simulates block-randomized experiment <br> - Limits dimensionality problem | - May produce worse matches than NN <br> - Lacks transparency; strata arbitrary |
| Inverse probability <br> (PS) weighting | - Retains all original sample data <br> - Corrects bias of estimate with greater precision than matching/stratification | - Non-transparent/a-theoretical |
| Coarsened Exact Matching | - Matching variables can be pre-specified (and pre-registered) <br> - Matching substantively driven <br> - Transparent matching process <br> - Eliminates same bias as propensity score if selection on observables (SOO) holds | - May generate poor matches depending on how coarsened variables are <br> - May lead to discarding large portions of sample |

]
---
class: middle, inverse
# Synthesis and wrap-up

---
# Goals

### 1. Describe conceptual approach to matching analysis

### 2. Assess validity of matching approach

### 3. Conduct matching analysis in simplified data using both propensity-score matching and CEM

---
# Can you explain this figure?

---
# To-Dos

### Week 9: Matching, presenting and...?

### Readings: 
- Umansky & Dumont (2021)

### Assignments Due
**DARE 4**
- Due 11:59pm, March 3

**Final Research Project**
- Presentation, March 11
- Paper, March 20 (submit March 13 for feedback)

---
# Feedback

## Plus/Deltas

Front side of index card

## Clear/Murky

On back