Homework 5

Enter your name and EID here

This homework is due on Feb. 27, 2018 at 7:00pm. Please submit as a PDF file on Canvas.

For this homework, you will work with a dataset collected by John Holcomb from the North Carolina State Center for Health and Environmental Statistics. The dataset contains 1409 birth records from North Carolina in 2001.

NCbirths <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
head(NCbirths)
##   Plural Sex MomAge Weeks Gained Smoke BirthWeightGm Low Premie Marital
## 1      1   1     32    40     38     0       3146.85   0      0       0
## 2      1   2     32    37     34     0       3288.60   0      0       0
## 3      1   1     27    39     12     0       3912.30   0      0       0
## 4      1   1     27    39     15     0       3855.60   0      0       0
## 5      1   1     25    39     32     0       3430.35   0      0       0
## 6      1   1     28    43     32     0       3316.95   0      0       0

The column contents are as follows:

Problem 1 (3 pts): We are interested in how the variables in the dataset NCbirths relate to the sex of the baby and to premature birth. Perform a principal components analysis (PCA) on NCbirths. Remove the columns Sex, Premie, and Weeks before performing PCA (these contain information on the sex of the baby and the length of gestation). Create a scatterplot of PC1 vs. PC2. First, color each point by sex, and then color each point by the indicator of premature birth. What do you observe? Visually, and without doing any calculations, do births of the same term (premature or full-term) cluster together in principal-component space? Do the sexes cluster together?

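library(tidyverse)  # provides %>%, select(), and ggplot()
library(ggthemes)   # provides scale_color_colorblind()

# Drop the columns tied to sex and gestation time, standardize the rest, and run PCA
NCbirths %>% select(-Premie, -Sex, -Weeks) %>% scale() %>% prcomp() -> pca

# Combine the principal-component coordinates with the original columns for plotting
pca_data <- data.frame(pca$x, NCbirths)

# PC1 vs. PC2, colored by sex of the baby
ggplot(pca_data, aes(x=PC1, y=PC2, color=factor(Sex))) + geom_point() + scale_color_colorblind()

# PC1 vs. PC2, colored by the premature-birth indicator
ggplot(pca_data, aes(x=PC1, y=PC2, color=factor(Premie))) + geom_point() + scale_color_colorblind()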
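
The workflow used here (drop the grouping column, standardize, run PCA, then re-attach the grouping column for coloring) can also be sketched in base R. The block below is a minimal illustration on R's built-in iris data, chosen only so that it runs on its own. It is not the NCbirths data, and the names toy_pca, var_explained, and toy_data are illustrative assumptions. It also computes the share of variance captured by each component, which may help in judging how much structure a PC1 vs. PC2 plot can show.

```r
# Toy illustration on R's built-in iris data (NOT the NCbirths records).
# Drop the grouping column, standardize the remaining columns, and run PCA.
toy_pca <- prcomp(scale(iris[, names(iris) != "Species"]))

# Proportion of total variance captured by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Re-attach the held-out grouping column so points can be colored by it
toy_data <- data.frame(toy_pca$x, Species = iris$Species)
head(toy_data)
```

For NCbirths, the same calculation would presumably use pca$sdev from the pca object created for Problem 1.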