## Homework 5

Enter your name and EID here

This homework is due on Mar. 2, 2020 at 12:00pm. Please submit as a PDF file on Canvas.

For this homework, recall the birth record dataset collected by John Holcomb from the North Carolina State Center for Health and Environmental Statistics. This dataset contains 1409 birth records from North Carolina in 2001.

``NCbirths <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")``
``````## Parsed with column specification:
## cols(
##   Plural = col_double(),
##   Sex = col_double(),
##   MomAge = col_double(),
##   Weeks = col_double(),
##   Gained = col_double(),
##   Smoke = col_double(),
##   BirthWeightGm = col_double(),
##   Low = col_double(),
##   Premie = col_double(),
##   Marital = col_double()
## )``````
``head(NCbirths)``
``````## # A tibble: 6 x 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0``````

The column contents are as follows:

• Plural: 1=single birth, 2=twins, 3=triplets.
• Sex: Sex of the baby, 1=male, 2=female.
• MomAge: Mother’s age (in years).
• Weeks: Completed weeks of gestation.
• Gained: Weight gained during pregnancy (in pounds).
• Smoke: Mother is a smoker: 1=yes, 0=no.
• BirthWeightGm: Birth weight in grams.
• Low: Indicator for low birth weight, 1=2500 grams or less, 0=otherwise.
• Premie: Indicator for premature birth, 1=36 weeks or sooner, 0=otherwise.
• Marital: Marital status, 0=married, 1=not married.
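
The numeric codes listed above are easier to read in plot legends and tables once they are converted to labeled factors. The following is a minimal sketch in base R, not part of the assignment itself: the data frame `first_rows` and the helper `label_codes` are hypothetical names introduced for illustration, and the values are copied from the first six rows printed by `head(NCbirths)`.

```r
# Coded columns from the first six rows printed by head(NCbirths) above.
first_rows <- data.frame(
  Plural  = c(1, 1, 1, 1, 1, 1),
  Sex     = c(1, 2, 1, 1, 1, 1),
  Smoke   = c(0, 0, 0, 0, 0, 0),
  Marital = c(0, 0, 0, 0, 0, 0)
)

# Hypothetical helper: map numeric codes to the labels from the column list.
label_codes <- function(x, codes, labels) {
  factor(x, levels = codes, labels = labels)
}

labeled <- data.frame(
  Plural  = label_codes(first_rows$Plural, c(1, 2, 3),
                        c("single", "twins", "triplets")),
  Sex     = label_codes(first_rows$Sex, c(1, 2), c("male", "female")),
  Smoke   = label_codes(first_rows$Smoke, c(0, 1), c("no", "yes")),
  Marital = label_codes(first_rows$Marital, c(0, 1),
                        c("married", "not married"))
)

print(labeled)
```

The same `factor()` call could be applied to the full `NCbirths` columns before plotting, so that legends show labels such as "married" instead of 0.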

Problem 1 (3 pts): We are interested in assessing the relationships between the variables in the dataset `NCbirths` and the mothers’ marital status, the mothers’ smoking habits, and plural births. Perform a principal components analysis (PCA) on the dataset `NCbirths`. Remove the columns `Marital`, `Smoke`, and `Plural` prior to performing PCA. Create a scatterplot of PC1 vs. PC2. First, color each point by the mother’s marital status, then color each point by the mother’s smoking habit, and then color each point by the indicator of plural births. What do you observe? Visually, and without doing any calculations, do the different types of births group together in principal-component space? Do the smokers or married mothers cluster together?

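``````# Drop the grouping variables, standardize the remaining columns, and run PCA
pca <- NCbirths %>%
  select(-Marital, -Smoke, -Plural) %>%
  scale() %>%
  prcomp()

# Combine the PC coordinates with the original columns for plotting
pca_data <- data.frame(pca$x, NCbirths)

# PC1 vs. PC2, colored by marital status
ggplot(pca_data, aes(x = PC1, y = PC2, color = factor(Marital))) +
  geom_point() +
  scale_color_manual(values = color_palette)``````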

``````ggplot(pca_data, aes(x = PC1, y = PC2, color = factor(Smoke))) +
  geom_point() +
  scale_color_manual(values = color_palette)``````
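
The PCA step above standardizes the columns with `scale()` before calling `prcomp()`, so that variables measured in different units (years, weeks, pounds, grams) contribute on a comparable footing. As a minimal sketch of what that step does, the block below compares it with the `scale. = TRUE` argument of `prcomp()`. It assumes base R only and uses three columns from the six rows printed by `head(NCbirths)`; six rows are far too few for a meaningful PCA, so this only illustrates the mechanics. The object names are hypothetical.

```r
# Three numeric columns from the first six rows printed by head(NCbirths).
first_rows <- data.frame(
  MomAge = c(32, 32, 27, 27, 25, 28),
  Weeks  = c(40, 37, 39, 39, 39, 43),
  Gained = c(38, 34, 12, 15, 32, 32)
)

# Approach used in the homework code: standardize first, then run PCA.
pca_scaled_first <- prcomp(scale(first_rows))

# Alternative: let prcomp() center and scale the columns itself.
pca_scale_arg <- prcomp(first_rows, scale. = TRUE)

# Compare the standard deviations of the principal components.
print(all.equal(pca_scaled_first$sdev, pca_scale_arg$sdev))

# With standardized columns, the PC variances sum to the number of columns.
print(sum(pca_scaled_first$sdev^2))
```

Either form should give the same components up to sign; the explicit `scale()` call simply makes the standardization visible in the pipeline.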