Enter your name and EID here

This homework is due on April 12, 2021 at 11:00pm. Please submit as a pdf file on Canvas.

For all problems in this homework, we will work with the `penguins_clean` dataset, which is a cleaned-up version of the `penguins` dataset from the palmerpenguins package.

Note: This homework is about the contents of the plots. Don’t worry about styling. It’s OK to use the default theme and plot labeling.

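``````library(tidyverse)      # provides %>%, select(), pivot_wider(), and ggplot2
library(broom)          # provides tidy() and augment() for prcomp objects
library(palmerpenguins)

penguins_clean <- penguins %>%
  select(-year) %>% # remove the year column as it is distracting here
  na.omit()         # remove any rows with missing values

penguins_clean``````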
``````## # A tibble: 333 x 7
##    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
##  1 Adelie  Torge…           39.1          18.7              181        3750
##  2 Adelie  Torge…           39.5          17.4              186        3800
##  3 Adelie  Torge…           40.3          18                195        3250
##  4 Adelie  Torge…           36.7          19.3              193        3450
##  5 Adelie  Torge…           39.3          20.6              190        3650
##  6 Adelie  Torge…           38.9          17.8              181        3625
##  7 Adelie  Torge…           39.2          19.6              195        4675
##  8 Adelie  Torge…           41.1          17.6              182        3200
##  9 Adelie  Torge…           38.6          21.2              191        3800
## 10 Adelie  Torge…           34.6          21.1              198        4400
## # … with 323 more rows, and 1 more variable: sex <fct>``````

Problem 1: (2 pts)

Perform a PCA of the `penguins_clean` dataset and make two plots: 1. A rotation plot of components 1 and 2; 2. A plot of the eigenvalues, showing the amount of variance explained by the various components.

``````# perform PCA
pca_fit <- penguins_clean %>%
  select(where(is.numeric)) %>% # retain only numeric columns
  scale() %>%                   # scale to zero mean and unit variance
  prcomp()

# make rotation plot
arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)
pca_fit %>%
  # extract rotation matrix
  tidy(matrix = "rotation") %>%
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) %>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 0, vjust = 1) +
  coord_fixed(xlim = c(-0.5, 1.2), ylim = c(-.9, .1))``````
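As a numeric cross-check of the rotation plot, the loadings can also be read directly from the `rotation` element of the `prcomp` object. The following is a minimal base-R sketch, not part of the required solution. It assumes that `prcomp(scale. = TRUE)` is equivalent to the `scale() %>% prcomp()` pipeline above, and the names `penguins_df`, `penguins_base`, `penguins_num`, `pca_check`, and `loadings` are introduced here only for illustration.

```r
library(palmerpenguins)

# rebuild the cleaned data with base R: drop `year`, then drop incomplete rows
penguins_df <- as.data.frame(penguins)
penguins_base <- na.omit(penguins_df[, names(penguins_df) != "year"])

# keep only the numeric columns
penguins_num <- penguins_base[, sapply(penguins_base, is.numeric)]

# PCA on centered and scaled variables
pca_check <- prcomp(penguins_num, scale. = TRUE)

# loadings of each variable on PC 1 and PC 2 (the arrow tips in the rotation plot)
loadings <- pca_check$rotation[, c("PC1", "PC2")]
round(loadings, 2)
```

Variables whose PC 1 loadings share a sign point in the same horizontal direction in the rotation plot. The overall sign of a component is arbitrary, so only relative signs are meaningful.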

``````# make variance explained plot
pca_fit %>%
  # extract eigenvalues
  tidy(matrix = "eigenvalues") %>%
  ggplot(aes(PC, percent)) +
  geom_col() +
  scale_x_continuous(
    # create one axis tick per PC
    breaks = 1:4
  ) +
  scale_y_continuous(
    name = "variance explained",
    # format y axis ticks as percent values
    labels = scales::label_percent(accuracy = 1)
  )``````
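The heights of the bars in the variance explained plot can be reproduced by hand: each eigenvalue is the squared standard deviation of a component, and the variance explained is that eigenvalue divided by the sum of all eigenvalues. The following base-R sketch illustrates the calculation under the same assumptions as the PCA above (numeric columns only, scaled to unit variance). The names `penguins_df`, `penguins_base`, `penguins_num`, `pca_check`, `eigenvalues`, and `var_explained` are illustrative and do not appear in the solution code.

```r
library(palmerpenguins)

# rebuild the cleaned numeric data with base R
penguins_df <- as.data.frame(penguins)
penguins_base <- na.omit(penguins_df[, names(penguins_df) != "year"])
penguins_num <- penguins_base[, sapply(penguins_base, is.numeric)]

# PCA on centered and scaled variables
pca_check <- prcomp(penguins_num, scale. = TRUE)

# eigenvalues are the squared standard deviations of the components
eigenvalues <- pca_check$sdev^2

# fraction of the total variance explained by each component
var_explained <- eigenvalues / sum(eigenvalues)
round(var_explained, 3)

# cumulative fraction of variance explained
cumsum(var_explained)
```

Because the four variables are scaled to unit variance, the eigenvalues sum to 4, the number of variables.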

Problem 2: (4 pts) Make a scatter plot of PC 2 versus PC 1 and color by penguin species. Then use the rotation plot from Problem 1 to describe the physical characteristics by which the different penguin species differ. Finally, make one more scatter plot of the raw data that can support your interpretation of the PC analysis.

``````pca_fit %>%
  # add PCs to the original dataset
  augment(penguins_clean) %>%
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = species))``````

All three species separate along PC 1. Penguins with high PC 1 values have, on average, higher body mass, longer flippers, and longer but shallower bills (lower bill depth), so PC 1 approximately represents the overall size of the penguins. Gentoo penguins are therefore substantially larger than the other two species. Chinstrap penguins, however, don’t differ from Adelie penguins in body mass. Instead, they have longer bills.

Supporting plot:

``````penguins_clean %>%
  ggplot(aes(body_mass_g, bill_length_mm)) +
  geom_point(aes(color = species))``````
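The interpretation can also be checked numerically by comparing per-species means of the raw measurements. The following is a minimal base-R sketch offered as an optional complement to the scatter plot, not as part of the required answer. The names `penguins_df`, `penguins_base`, and `species_means` are introduced here for illustration only.

```r
library(palmerpenguins)

# rebuild the cleaned data with base R: drop `year`, then drop incomplete rows
penguins_df <- as.data.frame(penguins)
penguins_base <- na.omit(penguins_df[, names(penguins_df) != "year"])

# mean of each raw measurement, separately for each species
species_means <- aggregate(
  cbind(body_mass_g, flipper_length_mm, bill_length_mm, bill_depth_mm) ~ species,
  data = penguins_base,
  FUN = mean
)
species_means
```

If the PCA interpretation holds, the Gentoo row should show the largest mean body mass, and the Chinstrap row should show a longer mean bill length than the Adelie row at a similar body mass.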