class: center, middle, inverse, title-slide

.title[
# Examining Relationships of Continuous Variables
]
.subtitle[
## EDUC 641: Unit 4 Part 2
]
.author[
### David D. Liebowitz
]

---
# Roadmap

<img src="Roadmap_4.png" width="90%" style="display: block; margin: auto;" />

---
# Goals of the unit

- Describe relationships between quantitative data that are continuous
- Visualize and substantively describe the relationship between two continuous variables
- Describe and interpret a fitted bivariate regression line
- Describe and interpret components of a fitted bivariate linear regression model
.grey-light[
- Visualize and substantively interpret residuals resulting from a bivariate regression model
- Conduct a statistical inference test of the slope and intercept of a bivariate regression model
]
- Write R scripts to conduct these analyses

---
## Reminder of motivating question

#### We learned a lot about the distribution of life expectancy across countries; now we turn to the relationships between life expectancy and other variables. In particular:

#### .blue[Do individuals living in countries with more total years of attendance in school experience, on average, higher life expectancy?]

#### In other words, we are asking whether the variables *SCHOOLING* and *LIFE_EXPECTANCY* are related.

---
# Materials

.large[
1. Life expectancy data (in a file called life_expectancy.csv)
2. Codebook describing the contents of these data
3. R script to conduct the data analytic tasks of the unit (in a file called EDUC641_13_code.R)
]

---
class: middle, inverse

# Our continuous relationship
### (and some data-cleaning)

---
# Reading our data in

```r
who <- read.csv(here("data/life_expectancy.csv")) %>% 
  # first making variable names take a common format
  janitor::clean_names() %>% 
  # filtering to focus only on 2015
  filter(year == 2015) %>% 
  # selecting only the variables we need
  select(country, status, schooling, life_expectancy) %>% 
  # renaming one of the variables that is really misnamed
  rename(region = country) %>%
  # rounding life expectancy to the nearest year
  mutate(life_expectancy = round(life_expectancy, digits = 0))
```

---
# First data cleaning step:
### Identify missingness

```r
sum(is.na(who$life_expectancy))
```

```
#> [1] 0
```

```r
sum(is.na(who$schooling))
```

```
#> [1] 10
```

```r
### For the really ambitious...
sapply(who, function(x) sum(is.na(x)))
```

```
#>          region          status       schooling life_expectancy 
#>               0               0              10               0
```

--

.blue[**So, some missingness... what do we do?**]

---
# Listwise vs. pairwise deletion

- .red-pink[**Listwise**]: observations with missingness (NA) on *any* of the variables to be used in the analysis are dropped. The analysis is conducted only on observations with complete data
- .red-pink[**Pairwise**]: observations with missingness on some variables are retained; they are included in the sample for any analysis that does not rely on the variables with missingness, but are necessarily excluded from analyses that do (see the sketch below)
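--

To make the contrast concrete, here is a minimal sketch (the tiny `toy` data frame is hypothetical, invented purely for illustration):

```r
toy <- data.frame(schooling       = c(10, NA, 14),
                  life_expectancy = c(70, 75, NA))

# Pairwise: each calculation drops only the rows missing that variable
mean(toy$schooling, na.rm = TRUE)        # uses rows 1 and 3
mean(toy$life_expectancy, na.rm = TRUE)  # uses rows 1 and 2

# Listwise: drop any row with missingness anywhere, then analyze
nrow(na.omit(toy))                       # keeps only row 1
```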
--

```r
mean(who$life_expectancy, na.rm = T)
```

```
#> [1] 71.63934
```

```r
mean(who$schooling, na.rm = T)
```

```
#> [1] 12.92717
```

.blue[**How have we handled our missing data in estimating these univariate measures of central tendency?**]

---
# The chainsaw approach

.pull-left[
- Generally, we want a stable analytic sample so that differences across estimation strategies reflect differences in our models rather than differences in our samples
- However, simply dropping these observations may (severely) limit our desired external generalizability
- There are imputation methods that you will explore in EDUC 645
- With large data and a small amount of missingness, it generally doesn't matter what you do
- For now, we're going to employ .red-pink[**listwise**] deletion
]
.pull-right[
<img src="bttf.gif" style="display: block; margin: auto;" />
]

--

```r
who <- filter(who, !is.na(schooling))
nrow(who)
```

```
#> [1] 173
```

---
# A reminder of our relationship

```r
biv <- ggplot(data = who, aes(x = schooling, y = life_expectancy)) + 
        geom_point()
```

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />

---
class: middle, inverse

# A gentle introduction to bivariate regression: <br>
## Ordinary-Least Squares (OLS)-fitted regression lines

---
# OLS-fitted regression line

```r
biv + geom_smooth(method = lm, se = F)
```

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

.red-pink[*The fitted regression line tells us the best prediction for the value of LIFE_EXPECTANCY at each value of SCHOOLING.*]

---
# Some intuition

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

--

We can think of the OLS-fitted regression line as a stick held in place by thumbtacks and elastic bands from each of the data points.

---
# A visualization

<iframe src="https://daviddl.shinyapps.io/line_ss/?showcase=0" width="100%" height="550px" data-external="1"></iframe>

---
# Pictures to equations

So, the Ordinary-Least Squares .red-pink[**line of best fit**] minimizes the sum of the squared vertical distances between itself and every observation in the point cloud. Critically helpful to us: this line of best fit provides a two-number .red-pink[**summary of the relationship**] between our two continuous variables.

As with any straight line, it can be characterized by a simple algebraic equation. Recall the .red-pink[**slope-intercept**] form of a linear equation from 7<sup>th</sup> grade:

`$$y=mx + b$$`

--

.blue[**What does each of these terms represent?**]

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---
# Pictures to equations

.small[.red[**HOWEVER**], we don't represent lines of best fit with equations in slope-intercept form! .blue[**Why not?**]]

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

--

.small[The slope-intercept form represents a deterministic relationship (*y* equals exactly *mx* + *b*). In statistics, we use the line of best fit to approximate the relationship. The line is straight ("smooth"), but there is a lot of variation ("roughness") around it, so we write this equation differently. We'll learn the formal way to represent this relationship in 643. For now, we'll use this slope-intercept form for convenience.]
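---
# Calculating the line by hand

The OLS slope and y-intercept have closed-form solutions. A minimal sketch, assuming the cleaned `who` data from above (the estimates should match the `lm()` output on the next slide):

```r
# OLS slope: the covariance of the two variables over the variance of the predictor
b1 <- cov(who$schooling, who$life_expectancy) / var(who$schooling)

# OLS intercept: the fitted line passes through the point of means
b0 <- mean(who$life_expectancy) - b1 * mean(who$schooling)

c(intercept = b0, slope = b1)
```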
--

.small[So we can, in fact, calculate the slope and the y-intercept of the line of best fit by hand, using each (*x*, *y*) pairing for each observation. However, as you can guess, this is much more straightforward to do using a statistical software package.]

--

.small[Turn the page to observe the wonders of our first regression fit...!]

---
# Fitting a regression in R

```r
fit <- lm(life_expectancy ~ schooling, data=who)
summary(fit)
```

```
#> 
#> Call:
#> lm(formula = life_expectancy ~ schooling, data = who)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -16.3270  -2.6565   0.1581   3.3095  10.9758 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  42.8501     1.5976   26.82   <2e-16 ***
#> schooling     2.2348     0.1206   18.53   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.606 on 171 degrees of freedom
#> Multiple R-squared:  0.6676,  Adjusted R-squared:  0.6657 
#> F-statistic: 343.5 on 1 and 171 DF,  p-value: < 2.2e-16
```

---
# Interpreting the results

```
#> 
#> Call:
#> lm(formula = life_expectancy ~ schooling, data = who)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -16.3270  -2.6565   0.1581   3.3095  10.9758 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
*#> (Intercept)  42.8501     1.5976   26.82   <2e-16 ***
*#> schooling     2.2348     0.1206   18.53   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.606 on 171 degrees of freedom
#> Multiple R-squared:  0.6676,  Adjusted R-squared:  0.6657 
#> F-statistic: 343.5 on 1 and 171 DF,  p-value: < 2.2e-16
```

These .red-pink[**coefficients**] tell you where the fitted trend line should be drawn:

$$
\small{
\left[ \textrm{Predicted value of } LIFE \\_ EXPECTANCY \right] = 42.85 + 2.23 * \left[ \textrm{Observed value of } SCHOOLING \right]
}
$$

---
# Fitted values

We can substitute values for the "predictor" `\((SCHOOLING)\)` into the fitted equation to compute the *predicted* values of `\(LIFE\_EXPECTANCY\)`.

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

--

We can do this for our old friend Chile ... and all others...

---
# Fitted values

So we can reconstruct the line of best fit from the fitted values:

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---
# Fitted values

Note that the fitted line always passes through the point of means: the average of the predictor *and* the average of the outcome

```r
mean(who$schooling)
```

```
#> [1] 12.92717
```

```r
mean(who$life_expectancy)
```

```
#> [1] 71.73988
```

---
# Fitted values

Note that the fitted line always passes through the point of means: the average of the predictor *and* the average of the outcome

<img src="EDUC641_14_regression_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" />
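---
# Fitted values in R

We can also ask R for these predictions directly. A minimal sketch (the specific `schooling` values below are hypothetical, chosen just for illustration):

```r
# predicted life expectancy at two illustrative levels of schooling
predict(fit, newdata = data.frame(schooling = c(10, 16)))

# at the mean of SCHOOLING, the prediction is exactly the mean of LIFE_EXPECTANCY
predict(fit, newdata = data.frame(schooling = mean(who$schooling)))
```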
---
class: middle, inverse

# Synthesis and wrap-up

---
# Goals of the unit

- Describe relationships between quantitative data that are continuous
- Visualize and substantively describe the relationship between two continuous variables
- Describe and interpret a fitted bivariate regression line
- Describe and interpret components of a fitted bivariate linear regression model
.grey-light[
- Visualize and substantively interpret residuals resulting from a bivariate regression model
- Conduct a statistical inference test of the slope and intercept of a bivariate regression model
]
- Write R scripts to conduct these analyses

---
# To Dos

### Reading
- LSWR Chapter 15.1 and 15.2: bivariate regression by Nov. 21 class

### Assignments
- Assignment #4 due Dec. 2 at 11:59PM

### Quizzes
- Quiz #4 (now!) due 11/15 at 5pm
- Quiz #5 (last one!!!) due 11/27 at 5pm

---
class: middle, inverse

# Quiz