Relationships between categorical variables

class: center, middle, inverse, title-slide

.title[
# Relationships between categorical variables
]
.subtitle[
## EDUC 641: Unit 2
]
.author[
### David D. Liebowitz
]

---

# Roadmap

<img src="Roadmap_2.png" width="90%" style="display: block; margin: auto;" />
                                                          
---
# Goals of the unit

- Describe relationships between categorical variables
- Calculate an index of the strength of the relationship between two categorical variables, the chi-squared ( `$\chi^2$`) statistic
- Write R scripts to conduct these analyses
- Formulate and describe the purpose of a null hypothesis
- Conceptually describe the criteria to make a statistical inference from a sample to a population
- Interpret and report the results of a contingency-table analysis and a statistical inference from a chi-squared statistic

---
## Reminder of motivating question

.blue[**Were convicted murderers more likely to be sentenced to death in Georgia if they killed someone Black or if they killed someone white?**]

---
# Materials

.large[
1. Death penalty data (in file called deathpenalty.csv)
2. Codebook describing the contents of said data
3. R script to conduct the data analytic tasks of the unit
]

---
# What we've done

### Up until now, we've been examining each variable by itself...

---
class: middle, inverse
# Relationships between variables

---
# Two-way tables
.small[Now we seek to create a *joint display* of the values of *RVICTIM* and *DEATHPEN*]

```r
table(df$deathpen, df$rvictim)
```

```
#>      
#>       Black White
#>   No   1483   863
#>   Yes    23   106
```

--
.small[Could do this other ways...]

```r
xtabs(formula = ~ deathpen + rvictim, data = df)
```

```
#>         rvictim
#> deathpen Black White
#>      No   1483   863
#>      Yes    23   106
```

.small[Would be helpful to have these in percentage terms. Proportion of convicted murderers sentenced to death in case of Black victim:] 
`$$\frac{23}{1483 + 23}*100= 1.53\%$$`
---
# Two-way tables

#### Can ask R for this too:

```r
round(prop.table(table(df$deathpen, df$rvictim), margin=2)*100, 2)
```

```
#>      
#>       Black White
#>   No  98.47 89.06
#>   Yes  1.53 10.94
```

```r
# the margin option 2 asks for the proportion of the columns
# if you want the proportion by rows, specify margin=1
```
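If you do not have the data file loaded, the same percentages can be reproduced from the counts alone. This is a minimal sketch: the object names `obs`, `col_pct` and `row_pct` are made up for illustration, and the counts are typed in by hand from the table above.

```r
# Observed counts copied from the two-way table above
obs <- matrix(
  c(1483, 23, 863, 106),          # filled column by column
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

# Column percentages: each column sums to 100
col_pct <- round(prop.table(obs, margin = 2) * 100, 2)
col_pct

# Row percentages, for comparison: each row sums to 100
row_pct <- round(prop.table(obs, margin = 1) * 100, 2)
row_pct
```

For our question, the column percentages are the useful ones because they condition on the race of the victim.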

--
### Putting it into words

> In our sample of convicted murderers in Georgia, when a **Black person** was a victim...

> In our sample of convicted murderers in Georgia, when a **white person** was a victim...

---
# Grouped charts
We can visualize these counts:
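One possible way to draw a grouped bar chart is sketched below with base R graphics. This is only an illustration: it rebuilds the counts by hand in an object called `obs` (a name chosen here), and the axis labels are assumptions rather than the labels used in the course figures.

```r
# Observed counts copied from the two-way table
obs <- matrix(
  c(1483, 23, 863, 106),
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

# beside = TRUE places the "No" and "Yes" bars next to each other
# within each victim-race group instead of stacking them
bar_mid <- barplot(
  obs,
  beside = TRUE,
  legend.text = TRUE,
  args.legend = list(title = "Sentenced to death"),
  xlab = "Race of victim",
  ylab = "Number of convicted murderers"
)
```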

---
# What is "related"?

To answer whether *DEATHPEN* and *RVICTIM* are related in our observed sample...

it might be helpful to imagine what the proportion of defendants sentenced to death would look like **if there were NO relationship**

.pull-left[
####Frequencies that we OBSERVE

```
#>      
#>       Black White  Sum
#>   No   1483   863 2346
#>   Yes    23   106  129
#>   Sum  1506   969 2475
```
]

.pull-right[
####Frequencies that we would EXPECT if there were NO relationship

| deathpen | Black | White | Sum  |
|----------|-------|-------|------|
| No       |       |       | 2346 |
| Yes      |       |       | 129  |
| Sum      | 1506  | 969   | 2475 |
]

---
# Observed vs. Expected

.pull-left[
####Frequencies that we OBSERVE

```
#>      
#>       Black White  Sum
#>   No   1483   863 2346
#>   Yes    23   106  129
#>   Sum  1506   969 2475
```
]

.pull-right[
####Frequencies that we would EXPECT if there were NO relationship

| deathpen   | Black | White | Sum   | Proportion |
|------------|-------|-------|-------|------------|
| No         |       |       | 2346  | 0.948      |
| Yes        |       |       | 129   | 0.052      |
| Sum        | 1506  | 969   | 2475  | 1.000      |
| Proportion | 0.608 | 0.392 | 1.000 |            |
]

<br>

.small[*Note*: The proportions above are rounded, so if you use them to calculate the **EXPECTED** values, they will differ slightly from those on the next slide. If you calculate the proportions by hand] (i.e., `$1506/2475 = 0.60\overline{84}$`), .small[you will get the exact values, and then they will align with the rounded **EXPECTED** values on the next slide.]
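The logic of the **EXPECTED** frequencies can be sketched in a few lines of R: if there were no relationship, each cell would hold the total sample size times its row proportion times its column proportion. The object names below are chosen for illustration, and the counts are entered by hand.

```r
# Observed counts copied from the two-way table
obs <- matrix(
  c(1483, 23, 863, 106),
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

n        <- sum(obs)              # 2475 convicted murderers
row_prop <- rowSums(obs) / n      # share not sentenced / sentenced to death
col_prop <- colSums(obs) / n      # share with Black / White victims

# Expected count in each cell under NO relationship:
# n * (row proportion) * (column proportion)
expected <- n * outer(row_prop, col_prop)

round(row_prop, 3)
round(col_prop, 3)
round(expected)
```

Using the unrounded proportions, as here, avoids the small discrepancies described in the note.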

---
# Observed vs. Expected

.pull-left[
####Frequencies that we OBSERVE

```
#>      
#>       Black White  Sum
#>   No   1483   863 2346
#>   Yes    23   106  129
#>   Sum  1506   969 2475
```
]

.pull-right[
####Frequencies that we would EXPECT if there were NO relationship 
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> deathpen </th>
   <th style="text-align:right;"> Black </th>
   <th style="text-align:right;"> White </th>
   <th style="text-align:right;"> Sum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 1428 </td>
   <td style="text-align:right;"> 918 </td>
   <td style="text-align:right;"> 2346 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 51 </td>
   <td style="text-align:right;"> 129 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:right;"> 1506 </td>
   <td style="text-align:right;"> 969 </td>
   <td style="text-align:right;"> 2475 </td>
  </tr>
</tbody>
</table>
]

<br>
> What do you think? Is there a relationship between *DEATHPEN* and *RVICTIM*?

---
# A desired index...?

.pull-left[
####Frequencies that we OBSERVE

```
#>      
#>       Black White  Sum
#>   No   1483   863 2346
#>   Yes    23   106  129
#>   Sum  1506   969 2475
```
]
.pull-right[
####Frequencies that we would EXPECT if there were NO relationship 
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> deathpen </th>
   <th style="text-align:right;"> Black </th>
   <th style="text-align:right;"> White </th>
   <th style="text-align:right;"> Sum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 1428 </td>
   <td style="text-align:right;"> 918 </td>
   <td style="text-align:right;"> 2346 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 51 </td>
   <td style="text-align:right;"> 129 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:right;"> 1506 </td>
   <td style="text-align:right;"> 969 </td>
   <td style="text-align:right;"> 2475 </td>
  </tr>
</tbody>
</table>
]

<br>
#### It would be nice to have an index of the **NET DISCREPANCY** between the **OBSERVED** and **EXPECTED** frequencies in the sample

---
# The Chi-Squared `$\chi^2$` statistic

For a moment, assume that there is a powerful statistic that allows us to summarize the **NET DISCREPANCY** between the tables of **OBSERVED** and **EXPECTED** frequencies. Let's call this statistic the Pearson Chi-Squared `$(\chi^2)$` statistic.

.pull-left[
####Frequencies that we OBSERVE

```
#>      
#>       Black White  Sum
#>   No   1483   863 2346
#>   Yes    23   106  129
#>   Sum  1506   969 2475
```
]
.pull-right[
####Frequencies that we would EXPECT if NO relationship 
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> deathpen </th>
   <th style="text-align:right;"> Black </th>
   <th style="text-align:right;"> White </th>
   <th style="text-align:right;"> Sum </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> No </td>
   <td style="text-align:right;"> 1428 </td>
   <td style="text-align:right;"> 918 </td>
   <td style="text-align:right;"> 2346 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Yes </td>
   <td style="text-align:right;"> 78 </td>
   <td style="text-align:right;"> 51 </td>
   <td style="text-align:right;"> 129 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sum </td>
   <td style="text-align:right;"> 1506 </td>
   <td style="text-align:right;"> 969 </td>
   <td style="text-align:right;"> 2475 </td>
  </tr>
</tbody>
</table>
]

<br>
$$ \chi^2 = \frac{(1483-1428)^2}{1428} + \frac{(863-918)^2}{918} + \frac{(23-78)^2}{78} + \frac{(106-51)^2}{51} $$
$$ \chi^2 = 103.8 $$
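The arithmetic can be checked by hand in R. The sketch below is illustrative only (object names are made up, counts are typed in). It computes the sum of squared discrepancies three ways: with the rounded expected counts shown here, with the unrounded expected counts, and with the Yates continuity correction that R's `chisq.test()` applies by default to 2x2 tables. The rounded version comes out near 103.5, while the Yates-corrected version is the one that matches the value of about 103.8 reported by R.

```r
# Observed counts copied from the two-way table
obs <- matrix(
  c(1483, 23, 863, 106),
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

# Expected counts under no relationship: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# (a) Formula with the rounded expected counts (1428, 918, 78, 51)
chi_rounded <- sum((obs - round(expected))^2 / round(expected))

# (b) Same formula with the unrounded expected counts
chi_exact <- sum((obs - expected)^2 / expected)

# (c) Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
chi_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)

round(c(rounded = chi_rounded, exact = chi_exact, yates = chi_yates), 2)
```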
--

Yay! We got an answer, but what does it mean...?

---
class: middle, inverse
# Hypothesis testing and statistical inference

---
# Big or small?
We can summarize the **NET DISCREPANCY** between the tables of **OBSERVED** and **EXPECTED** frequencies, using a statistic called the Pearson Chi-Squared ( `$\chi^2$`) statistic

.pull-left[
####Frequencies that we OBSERVE
]

<br>

$$ \chi^2 = 103.8 $$
--

Decision rule: If `$\chi^2$` is big, then declare that there is a relationship between *DEATHPEN* and *RVICTIM*; if `$\chi^2$` is zero (or close), then declare there is no relationship between *DEATHPEN* and *RVICTIM*...

--
but what is **BIG**, what is **close to zero**, and is 103.8 **big** or **close to zero**?

--
For that we will use this statistic to conduct a `$\chi^2$` .red-pink[**goodness-of-fit**] test.

---
# Statistical inference

Let's take a step back to capture the nature of the problem
- We've looked at some data on some convicted murderers in the state of Georgia
- We're not interested in only *these* murderers, but we're interested in a broader *population* of murderers from which our *sample* was drawn
  + In fact, even if we could observe outcomes for all murderers in the state of Georgia, our observation of them is imperfect due to *measurement error* and so we only ever observe samples, never populations (more on this later)
- Is there something about sampling from a population that could resolve our problem?
- Is there some way to generalize our conclusions about our *sample* relationship between *DEATHPEN* and *RVICTIM* to the *underlying population*?
  + This is called .red-pink[**statistical inference**] and it is ***the*** critical contribution of quantitative methods to research

---
# Sampling idiosyncrasy

#### When you generalize from a sample back to its underlying population, you must be careful that your empirical study has not been the victim of *sampling idiosyncrasy*

#### Is the following scenario plausible?
- There really is no relationship between *DEATHPEN* and *RVICTIM* **in the population**
- By accident, we have drawn an idiosyncratic sample from the population
- This sampling idiosyncrasy ended up giving us a `$\chi^2$` statistic as large as 103.8 by pure accident

#### How can we assess the plausibility of this scenario?

---
# The Null Hypothesis `$(H_{0})$`

.small[We start by imagining a hypothetical world in which there is **no relationship** between *DEATHPEN* and *RVICTIM* in a true population of convicted murderers. Then, we imagine drawing a series of samples of convicted murderers over and over again (say...10,000 times) from this hypothetical population. What values of the] `$\chi^2$` .small[statistic might we observe?]

---
# Testing the null

In this hypothetical example of repeated sampling from a null population, we could record all 10,000 values of the `$\chi^2$` statistic
<img src="EDUC641_4_categrel_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

The histogram summarizes the natural variation that could occur in a `$\chi^2$` statistic due to ***random sampling idiosyncrasy***, after drawing repeated samples from a hypothetical population in which there is no relationship between *DEATHPEN* and *RVICTIM*.
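The repeated-sampling thought exercise can be imitated with a small simulation. This is a sketch under stated assumptions, not necessarily how the figure above was produced: it builds a null population whose cell probabilities are the product of the observed row and column proportions, draws 10,000 samples of 2,475 cases each with `rmultinom()`, and computes an uncorrected `$\chi^2$` statistic for each. The seed and all object names are arbitrary choices.

```r
set.seed(641)  # arbitrary seed so the sketch is reproducible

n     <- 2475
row_p <- c(No = 2346, Yes = 129) / n        # death-penalty proportions
col_p <- c(Black = 1506, White = 969) / n   # victim-race proportions

# Cell probabilities in a population with NO relationship
null_p <- outer(row_p, col_p)

# Uncorrected chi-squared statistic for a 2x2 table of counts
chi_sq <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Draw 10,000 samples from the null population and record each statistic
sim_chi <- replicate(10000, {
  counts <- rmultinom(1, size = n, prob = as.vector(null_p))
  chi_sq(matrix(counts, nrow = 2))
})

hist(sim_chi, breaks = 50,
     main = "Chi-squared values under the null",
     xlab = "Chi-squared statistic")
max(sim_chi)   # compare with the value observed in our sample
```

Almost all of the simulated values pile up near zero, which is what sampling idiosyncrasy alone tends to produce.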

---
# Testing the null

.blue[*If this were the histogram that could result from sampling idiosyncrasy, and this were the value of the chi-square statistic, what would you think?*]

---
# Testing the null

In fact, this is the value of the `$\chi^2$` statistic we observed:
<img src="EDUC641_4_categrel_files/figure-html/unnamed-chunk-20-1.svg" style="display: block; margin: auto;" />

.small[.blue[*This is a histogram of possible chi-square values that could result from sampling idiosyncrasy, and the actual value of the chi-squared statistic in our sample. What do you think?*]]

*WOOHOO!* In this thought exercise, you've just engaged in a rudimentary version of .red-pink[**Null-Hypothesis Significance Testing (NHST)**], the bedrock of most social science research.

---
# Testing the null (*p*-values)

In fact, we don't need to examine the full histogram. Instead, we can say that in a hypothetical exercise of sampling repeatedly from a null population, fewer than 1 in 1,000,000,000,000,000 (one quadrillion) of all accidental values of the `$\chi^2$` statistic would be larger than 103.8.

The statistic that captures the probability of observing a `$\chi^2$` statistic at least as large as the one in our particular sample, if that sample had been drawn from a null population, is called the .red-pink[***p*-value**].
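In R, this tail probability can be read off the theoretical `$\chi^2$` distribution with `pchisq()`. The sketch below assumes 1 degree of freedom (the value R reports for a 2x2 table) and plugs in the statistic of 103.82; the object name `p_value` is just a label chosen here.

```r
# Probability of a chi-squared value at least as large as 103.82
# when sampling from a null population (df = 1 for a 2x2 table)
p_value <- pchisq(103.82, df = 1, lower.tail = FALSE)
p_value

# TRUE: even smaller than the 2.2e-16 bound that R prints
p_value < 2.2e-16
```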

---
# Testing the null (*p*-values)

.blue[*At what p-value would you start to believe that the value of the*] `$\color{blue}{\chi^2}$` .blue[*statistic in your own research was "big" (i.e., was unlikely to have occurred by accident)?*]
<img src="EDUC641_4_categrel_files/figure-html/unnamed-chunk-22-1.svg" style="display: block; margin: auto;" />

---
# Testing the null (*p*-values)

.small[In social science research, it is customary to (arbitrarily) set that threshold at **5 percent (*p*<0.05)**. In other words, if a discrepancy between our observed and expected data as large as the one in our sample would occur in fewer than 1 out of 20 samples drawn at random from a null population, we conclude that the discrepancy reflects a true relationship in the population.]
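The 5 percent threshold can also be expressed as a cut-off on the `$\chi^2$` scale. This is a small sketch, assuming 1 degree of freedom as for our 2x2 table; `alpha` and `critical_value` are names chosen for illustration.

```r
alpha <- 0.05

# Value that only 5 percent of chi-squared statistics from a
# null population (df = 1) would exceed
critical_value <- qchisq(1 - alpha, df = 1)
critical_value

# Our observed statistic is far beyond this cut-off
103.82 > critical_value
```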

---
# Testing the null (*p*-values)

.blue[*In social science research, it is customary to (arbitrarily) set an alpha-threshold and conduct a Null-Hypothesis Significance Test*]
<img src="EDUC641_4_categrel_files/figure-html/unnamed-chunk-24-1.svg" style="display: block; margin: auto;" />

.blue[**Is this the right thing to do? At the end of the course, we will revisit this concept.**]

---
class: middle, inverse
# Incorporating a third variable

---
# Sub-sample comparisons

.small[There are more complex ways of doing this, but one approach is to replicate the original contingency table analysis in interesting "slices" of the sample, defined by a third variable.]

---
# Cases with Black murderers

```
#>      
#>       Black White
#>   No   1304    63
#>   Yes    18    50
```

.pull-left[
#### Black murderers, Black victims
When a Black victim is killed by a Black murderer,
$$ \frac{18}{18+1304} = 1.36 \% $$
of the murderers are sentenced to death.
]

.pull-right[
#### Black murderers, White victims
When a White victim is killed by a Black murderer,
$$ \frac{50}{50+63} = 44.25 \% $$
of the murderers are sentenced to death.
]

The percentage of Black murderers sentenced to death for killing a white victim is about 32.5 times the percentage of Black murderers sentenced to death for killing a Black victim, in Georgia.

.red-pink[**I've subset my data to only cases with Black defendants. See the [accompanying R script](./EDUC641_4_code.R) for how to do this.**]
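For readers without the script at hand, here is a rough sketch of one way such a subset could be made. It is not the course script: the data frame is rebuilt from the counts on these slides, and the column name `rdefend` (race of the defendant) is a hypothetical name that may not match the codebook.

```r
# Rebuild one row per case from the eight cell counts on the slides.
# `rdefend` is a HYPOTHETICAL column name for the defendant's race.
cells <- expand.grid(
  deathpen = c("No", "Yes"),
  rvictim  = c("Black", "White"),
  rdefend  = c("Black", "White"),
  stringsAsFactors = FALSE
)
cells$n <- c(1304, 18, 63, 50,   # Black defendants
             179, 5, 800, 56)    # White defendants

df <- cells[rep(seq_len(nrow(cells)), cells$n),
            c("deathpen", "rvictim", "rdefend")]

# Keep only the cases with Black defendants
df_b <- df[df$rdefend == "Black", ]

tab_b <- table(df_b$deathpen, df_b$rvictim)
tab_b
round(prop.table(tab_b, margin = 2) * 100, 2)
```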

---
# A statistical test

.pull-left[

Observed:

```
#>              df_b$rvictim
#> df_b$deathpen Black White
#>           No   1304    63
#>           Yes    18    50
```
Expected:

```
#>              df_b$rvictim
#> df_b$deathpen Black White
#>           No   1259   108
#>           Yes    63     5
```
`$\chi^2$` statistic:

```
#> X-squared 
#>  414.7031
```
*p*-value

```
#> [1] 3.470593e-92
```
]

.pull-right[
- `$H_{0}$`: *DEATHPEN* and *RVICTIM* are unrelated in the population of convicted Black murderers in GA
- `$\chi^2$` statistic: 414.7
- *p*-value: <0.0001
- Decision: Reject `$H_{0}$`
- Conclusion: There is a statistically significant relationship between the assignment of the death penalty and the race of the victim, on average, in the population of Black murderers in GA.
]
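These quantities can be reproduced from the observed counts alone. The sketch below enters the counts by hand into a matrix named `tab_b` (a name chosen here) and relies on the default behaviour of `chisq.test()`, which applies Yates' continuity correction to 2x2 tables.

```r
# Observed counts for cases with Black defendants
tab_b <- matrix(
  c(1304, 18, 63, 50),
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

chi_b <- chisq.test(tab_b)   # Yates' correction applied by default

round(chi_b$expected)        # expected counts under no relationship
chi_b$statistic              # chi-squared statistic
chi_b$p.value                # p-value
```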

---
# Cases with White murderers

```
#>      
#>       Black White
#>   No    179   800
#>   Yes     5    56
```

.pull-left[
#### White murderers, Black victims
When a Black victim is killed by a White murderer,
$$ \frac{5}{5+179} = 2.72 \% $$
of the murderers are sentenced to death.
]

.pull-right[
#### White murderers, White victims
When a White victim is killed by a White murderer,
$$ \frac{56}{56+800} = 6.54 \% $$
of the murderers are sentenced to death.
]

The percentage of White murderers sentenced to death for killing a White victim is about 2.4 times the percentage of White murderers sentenced to death for killing a Black victim, in Georgia.

---
# A statistical test

.pull-left[

Observed:

```
#>              df_w$rvictim
#> df_w$deathpen Black White
#>           No    179   800
#>           Yes     5    56
```
Expected:

```
#>              df_w$rvictim
#> df_w$deathpen Black White
#>           No    173   806
#>           Yes    11    50
```
`$\chi^2$` statistic:

```
#> X-squared 
#>  3.349547
```
*p*-value

```
#> [1] 0.06722351
```
]

.pull-right[
- `$H_{0}$`: *DEATHPEN* and *RVICTIM* are unrelated in the population of convicted White murderers in GA
- `$\chi^2$` statistic: 3.35
- *p*-value: 0.067
- Decision: Fail to reject `$H_{0}$`
- Conclusion: There is not a statistically significant relationship between the assignment of the death penalty and the race of the victim, on average, in the population of white murderers in GA.

- Note that we **NEVER** accept the null-hypothesis. We only ever *fail to reject* it.
]
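The decision step can be written out explicitly in R. This is a minimal sketch with hand-entered counts; `tab_w`, `alpha` and `decision_w` are illustrative names, and the 0.05 threshold is the conventional choice discussed earlier.

```r
# Observed counts for cases with White defendants
tab_w <- matrix(
  c(179, 5, 800, 56),
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

alpha <- 0.05
chi_w <- chisq.test(tab_w)   # Yates' correction applied by default

chi_w$statistic
chi_w$p.value

# We either reject H0 or fail to reject it; we never "accept" it
decision_w <- if (chi_w$p.value < alpha) "Reject H0" else "Fail to reject H0"
decision_w
```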

---
# Putting it all together

### Basic steps of classical statistical inference
1. State a research question, including a null hypothesis `$(H_{0})$` which states there exists no relationship between our variables of interest, ***on average in the population***
2. Display and describe the observed data
3. Summarize the observed data in relationship to an expected value
4. Set a threshold at which we will no longer believe that the discrepancy between the observed and expected relationship is due to sampling idiosyncrasy
5. Estimate the *p*-value
6. Reject or fail to reject the null hypothesis
7. Interpret your findings drawing explicitly on plots, summary statistics and test statistics

---
# Our interpretation

> In the population of convicted murderers in Georgia, the imposition of the death penalty and the race of the victim are, on average, related ( `$\chi^2$` = 103.8, *p*< 0.001). The percentage of convicted murderers who were sentenced to death after killing a White victim was more than 7 times the percentage of convicted murderers who were sentenced to death after killing a Black victim. In Figure 1, we show...

> This phenomenon is largely driven by the imposition of the death penalty on Black defendants. Courts sentenced Black defendants to death for killing white victims at more than 32 times the rate at which they sentenced Black defendants to death for killing Black victims ( `$\chi^2$` = 414.7, *p*<0.001); whereas, we detect no statistically significant difference on average between white defendants convicted of murdering white -- compared to Black -- victims ( `$\chi^2$` = 3.35, *p*=0.067). In Table 1, we show...

---

# Estimating the `$\chi^2$` statistic in R

```r
chi_df <- chisq.test(df$deathpen, df$rvictim)
chi_df
```

```
#> 
#> 	Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  df$deathpen and df$rvictim
#> X-squared = 103.82, df = 1, p-value < 2.2e-16
```

```r
chi_df$expected
```

```
#>            df$rvictim
#> df$deathpen      Black     White
#>         No  1427.50545 918.49455
#>         Yes   78.49455  50.50545
```

```r
round(chi_df$p.value, 5)
```

```
#> [1] 0
```
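Two details of this output are worth a short sketch. First, the test header mentions Yates' continuity correction, which `chisq.test()` applies by default to 2x2 tables; setting `correct = FALSE` gives the uncorrected Pearson statistic. Second, rounding a tiny *p*-value prints `0`, which is never literally true; `format.pval()` with an `eps` threshold is one way to report it as "less than" a bound instead. The counts are entered by hand and the object names are illustrative.

```r
# Observed counts copied from the two-way table
obs <- matrix(
  c(1483, 23, 863, 106),
  nrow = 2,
  dimnames = list(deathpen = c("No", "Yes"), rvictim = c("Black", "White"))
)

chi_yates <- chisq.test(obs)                   # default: correct = TRUE
chi_plain <- chisq.test(obs, correct = FALSE)  # no continuity correction

chi_yates$statistic
chi_plain$statistic

# Report a very small p-value as "below 0.001" rather than as 0
p_label <- format.pval(chi_yates$p.value, eps = 0.001)
p_label
```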

---
class: middle, inverse
# Synthesis and wrap-up

---
# Goals of the unit

---
# To-Dos

### Reading
- LSWR Chapter 11: hypothesis testing
- LSWR Chapter 12: categorical data analysis (chi-square test focus)
    + Please do not worry about fully understanding the discussions on sampling distributions, degrees of freedom, one- vs. two-sided tests, or variations of chi-squared calculations **(Sections 11.3, 11.4.3, 11.7, 11.8, 12.1.4-12.1.8, 12.3-12.9)**. We will (partially) cover these topics in future classes.
- Clayton (2020)
    + Last name A-L: Evans; Last Name M-Z: Clayton
    + Prep to summarize main ideas and key details

### Optional follow-up
- Complete R Bootcamp Module 6 (matrices)
- Complete R Bootcamp Module 7 (lists)

### Assignments
- Quiz on Units 1 & 2 on Oct. 17
- Assignment #2 Due October 25, 11:59pm