Homework 6

Enter your name and EID here

This homework is due on Mar. 9, 2020 at 12:00pm. Please submit as a PDF file on Canvas.

For this homework, you will work with a dataset collected by John Holcomb from the North Carolina State Center for Health and Environmental Statistics. This data set contains 1409 birth records from North Carolina in 2001.

library(tidyverse)  # provides read_csv()
NCbirths <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Parsed with column specification:
## cols(
##   Plural = col_double(),
##   Sex = col_double(),
##   MomAge = col_double(),
##   Weeks = col_double(),
##   Gained = col_double(),
##   Smoke = col_double(),
##   BirthWeightGm = col_double(),
##   Low = col_double(),
##   Premie = col_double(),
##   Marital = col_double()
## )
head(NCbirths)
## # A tibble: 6 x 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0

The column contents are as follows:

Problem 1: (5 pts)

a. (1 pt) Make a logistic regression model that predicts premature births (Premie) from birth weight (BirthWeightGm), plural births (Plural), and weight gained during pregnancy (Gained) in the NCbirths data set. Show the summary (using summary()) of your model below.
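HINT: As a rough sketch of the general pattern, a logistic regression can be fit with glm() and family = binomial. The example below uses simulated toy data rather than NCbirths; the names toy, x1, x2, y, and toy_fit are made up for illustration.

```r
# Simulated toy data (hypothetical; this is NOT the NCbirths data)
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Binary outcome whose log-odds depend linearly on x1 and x2
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * toy$x1 - 1.0 * toy$x2))

# family = binomial uses the logit link by default
toy_fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Coefficient table, standard errors, and deviances
summary(toy_fit)
```

The same call shape should carry over to other data frames by swapping in the relevant outcome and predictor columns.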

# Your R code here

b. (1 pt) Make a plot to show how the model separates premature births from regular births. Your plot should use the fitted probabilities and/or the linear predictors, and you should color your geom by the indicator of premature births.

# Your R code here

c. (3 pts) Use a probability cut-off of 0.50 to classify each birth as premature or non-premature. Calculate the true positive rate and the false positive rate, and interpret these rates in the context of the NCbirths data set. Your interpretation should refer to premature births and the three predictors from part a.
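HINT: The true positive rate is TP / (TP + FN) and the false positive rate is FP / (FP + TN). The sketch below works through that arithmetic on simulated data (not NCbirths); all variable names are hypothetical, and it treats a fitted probability of exactly 0.50 as a positive call, which is an assumption.

```r
# Simulated data (hypothetical; this is NOT the NCbirths data)
set.seed(2)
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(2 * x))

fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)                  # fitted probabilities
pred  <- as.integer(p_hat >= 0.5)     # classify with a 0.50 cut-off

# Cells of the confusion matrix
tp <- sum(pred == 1 & y == 1)   # true positives
fn <- sum(pred == 0 & y == 1)   # false negatives
fp <- sum(pred == 1 & y == 0)   # false positives
tn <- sum(pred == 0 & y == 0)   # true negatives

tpr <- tp / (tp + fn)   # share of actual positives that are caught
fpr <- fp / (fp + tn)   # share of actual negatives that are flagged
c(TPR = tpr, FPR = fpr)
```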

# Your R code here

Your answer here.

Problem 2: (5 pts)

a. (1 pt) Plot an ROC curve for the model that you created in problem 1a. Does the model perform better than a model in which you randomly classify a birth as premature or non-premature? Explain your answer in 1-2 sentences.

HINT: Random classification would lie along y = x.
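To see where the y = x reference line comes from, an ROC curve can be traced by hand: sweep the cut-off from high to low and record the false positive rate and true positive rate at each step. The following is a minimal base-R sketch on simulated data (not NCbirths), with hypothetical variable names and a trapezoid-rule AUC; a dedicated ROC package would work equally well.

```r
# Simulated data (hypothetical; this is NOT the NCbirths data)
set.seed(3)
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
p_hat <- fitted(glm(y ~ x, family = binomial))

# Cut-offs from strictest (nothing called positive) to loosest (everything)
cutoffs <- c(Inf, sort(unique(p_hat), decreasing = TRUE), -Inf)
tpr <- sapply(cutoffs, function(cut) mean(p_hat[y == 1] >= cut))
fpr <- sapply(cutoffs, function(cut) mean(p_hat[y == 0] >= cut))

# Area under the curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # random classification: y = x, AUC = 0.5
auc
```

A curve that sits above the dashed line, with an AUC above 0.5, indicates a classifier that does better than random guessing.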

# Your R code here

Your answer here.

b. (4 pts) Use the mothers’ marital status (Marital) and the mothers’ age (MomAge) as a new set of predictor variables for premature births, and create a logistic regression model. Plot an ROC curve for your newly-created model and, on the same plot, add an ROC curve from your model in problem 1a. What can you conclude from your plot? Which model performs better and why? Support your conclusions with AUC values for each model.

# Your R code here

Your answer here.