Enter your name and EID here
This homework is due on Mar. 12, 2019 at 4:00pm. Please submit as a PDF file on Canvas.
For this homework, you will work with a data set collected by John Holcomb from the North Carolina State Center for Health and Environmental Statistics. The data set contains 1409 birth records from North Carolina in 2001.
NCbirths <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Parsed with column specification:
## cols(
## Plural = col_integer(),
## Sex = col_integer(),
## MomAge = col_integer(),
## Weeks = col_integer(),
## Gained = col_integer(),
## Smoke = col_integer(),
## BirthWeightGm = col_double(),
## Low = col_integer(),
## Premie = col_integer(),
## Marital = col_integer()
## )
head(NCbirths)
## # A tibble: 6 x 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <int> <int>  <int> <int>  <int> <int>         <dbl> <int>  <int>   <int>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0
The column contents are as follows:
Problem 1: (5 pts)
a. (1 pt) Make a logistic regression model that predicts premature births (Premie) from birth weight (BirthWeightGm), plural births (Plural), and weight gained during pregnancy (Gained) in the NCbirths data set. Show the summary (using summary()) of your model below.
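As a reminder of the general pattern, here is a minimal sketch of fitting a logistic regression with glm() on simulated toy data. The data frame toy and its columns x1, x2, and y are hypothetical and are not part of the NCbirths data set.

```r
# Toy data (simulated, NOT the NCbirths data); all names here are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Simulate a binary outcome whose log-odds depend linearly on x1 and x2.
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

# family = binomial makes glm() fit a logistic regression.
toy_fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Coefficient estimates, standard errors, and deviance information.
summary(toy_fit)
```

The same call structure should carry over to other data sets by changing the formula and the data argument.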
# your R code goes here
b. (1 pt) Make a plot of the fitted probability as a function of the linear predictor, colored by the indicator of premature births.
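For reference, the two quantities in this plot are related as sketched below, assuming the model from part a. The coefficient symbols and the subscript i (one birth record) are notation introduced here and do not appear in the assignment.

```latex
\hat{\eta}_i = \hat{\beta}_0
  + \hat{\beta}_1\,\text{BirthWeightGm}_i
  + \hat{\beta}_2\,\text{Plural}_i
  + \hat{\beta}_3\,\text{Gained}_i,
\qquad
\hat{p}_i = \frac{1}{1 + e^{-\hat{\eta}_i}}
```

In R, predict() on a glm object returns the linear predictor by default (type = "link"), while type = "response" returns the fitted probability.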
# your R code goes here
c. (3 pts) Use the probability cut-off of 0.50 to classify a birth as premature or non-premature. Calculate the true positive rate and the false positive rate and interpret these rates in the context of the NCbirths data set. Your answer should mention premature births and the three predictors from part a.
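The two rates can be written as follows, assuming that a "positive" is a birth recorded as premature (presumably Premie = 1; the column coding is an assumption here) and that a birth is classified as premature when its fitted probability is at least 0.50. TP, FN, FP, and TN are the counts of true positives, false negatives, false positives, and true negatives.

```latex
\text{TPR} = \frac{TP}{TP + FN},
\qquad
\text{FPR} = \frac{FP}{FP + TN}
```

The true positive rate is the fraction of actual premature births that the model flags as premature, and the false positive rate is the fraction of non-premature births that the model wrongly flags as premature.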
# your R code goes here
Your answer goes here. 2-3 sentences only.
Problem 2: (5 pts)
a. (1 pt) Plot an ROC curve for the model that you created in problem 1a. Does the model perform better than a model in which you randomly classify a birth as premature or non-premature? Explain your answer in 2-3 sentences.
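To illustrate what an ROC curve represents, the sketch below builds one by hand in base R on simulated toy data. This is not necessarily the plotting approach used in class; the vectors truth and score are hypothetical stand-ins for the observed outcome and the fitted probability.

```r
# Toy labels and scores (simulated, NOT the NCbirths data); names are hypothetical.
set.seed(2)
truth <- rbinom(300, size = 1, prob = 0.3)
score <- plogis(rnorm(300, mean = ifelse(truth == 1, 1, -1)))

# Sweep cut-offs from high to low so the curve runs from (0, 0) to (1, 1).
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE), -Inf)
tpr <- sapply(cutoffs, function(k) mean(score[truth == 1] >= k))
fpr <- sapply(cutoffs, function(k) mean(score[truth == 0] >= k))

# Area under the curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # diagonal reference line for random classification
auc
```

A classifier that assigns labels at random is expected to follow the dashed diagonal, which corresponds to an AUC of 0.5, so a curve above the diagonal indicates better-than-random performance.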
# your R code goes here
Your answer goes here. 2-3 sentences only.
b. (4 pts) Use the mothers’ marital status (Marital) and the mothers’ age (MomAge) as a new set of predictor variables for premature births, and create a logistic regression model. Plot an ROC curve for your newly created model and, on the same plot, add an ROC curve from your model in problem 1a. What can you conclude from your plot? Which model performs better and why? Support your conclusions with AUC values for each model.
# your R code goes here
Your answer goes here. 2-3 sentences only.