In 1898, Hermon Bumpus, an American biologist working at Brown University, collected data on one of the first examples of natural selection directly observed in nature. Immediately following a bad winter storm, he collected 136 English house sparrows, Passer domesticus, and brought them indoors. Of these birds, 64 had died during the storm, but 72 recovered and survived. By comparing measurements of physical traits, Bumpus demonstrated physical differences between the dead and living birds. He interpreted this finding as evidence for natural selection as a result of this storm:
bumpus <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/bumpus_full.csv")
## Parsed with column specification:
## cols(
## Sex = col_character(),
## Age = col_character(),
## Survival = col_character(),
## Length = col_double(),
## Wingspread = col_double(),
## Weight = col_double(),
## Skull_Length = col_double(),
## Humerus_Length = col_double(),
## Femur_Length = col_double(),
## Tarsus_Length = col_double(),
## Sternum_Length = col_double(),
## Skull_Width = col_double()
## )
head(bumpus)
## # A tibble: 6 x 12
## Sex Age Survival Length Wingspread Weight Skull_Length Humerus_Length
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male Adult Alive 154 241 24.5 31.2 17.4
## 2 Male Adult Alive 160 252 26.9 30.8 18.7
## 3 Male Adult Alive 155 243 26.9 30.6 18.6
## 4 Male Adult Alive 154 245 24.3 31.7 18.8
## 5 Male Adult Alive 156 247 24.1 31.5 18.2
## 6 Male Adult Alive 161 253 26.5 31.8 19.8
## # … with 4 more variables: Femur_Length <dbl>, Tarsus_Length <dbl>,
## # Sternum_Length <dbl>, Skull_Width <dbl>
The data set has three categorical variables (Sex
, with levels Male
and Female
, Age
, with levels Adult
and Young
, and Survival
, with levels Alive
and Dead
) and nine numerical variables that hold various aspects of the birds’ anatomy, such as wingspread, weight, etc.
Question 1: Perform a PCA on the numerical columns of this data set. Then make three plots potting the data as PC2 vs. PC1, colored by (i) sex, (ii) age, (iii) survival.
# do PCA
pca <- bumpus %>%
select(-Sex, -Age, -Survival) %>%
scale() %>%
prcomp()
pca
## Standard deviations (1, .., p=9):
## [1] 2.3106280 1.0027180 0.8144054 0.7377845 0.6802200 0.6403176 0.5074763
## [8] 0.4459542 0.3447601
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## Length 0.3113182 -0.48175513 0.02229922 -0.41815857 0.19719494
## Wingspread 0.3519669 -0.28971389 0.32621226 -0.21858032 0.25216077
## Weight 0.3140960 -0.33493529 -0.21062751 -0.15604080 -0.72876548
## Skull_Length 0.3328845 0.13688694 -0.31044883 0.30038515 -0.29893156
## Humerus_Length 0.3851414 0.21428192 0.24369845 -0.04486213 0.05060431
## Femur_Length 0.3630277 0.41484276 0.16164946 -0.03621275 0.10202003
## Tarsus_Length 0.3392712 0.46086907 0.18672729 -0.12495881 -0.16734831
## Sternum_Length 0.2962715 -0.34123813 0.19778895 0.80186706 0.14007820
## Skull_Width 0.2944947 0.07876015 -0.76996029 -0.02926814 0.46526730
## PC6 PC7 PC8 PC9
## Length -0.38191614 -0.51623648 0.15746650 0.139991140
## Wingspread -0.01722238 0.58909210 -0.38269685 -0.283946038
## Weight 0.40902438 0.03600302 0.09143718 -0.111320803
## Skull_Length -0.73394265 0.23323278 -0.03753325 -0.001420609
## Humerus_Length 0.14570457 0.30000698 0.40887962 0.680713192
## Femur_Length 0.04554965 -0.15236298 0.47981073 -0.634632805
## Tarsus_Length 0.08364904 -0.39844161 -0.63887140 0.141767330
## Sternum_Length 0.18447218 -0.23988512 -0.04919212 0.007200079
## Skull_Width 0.28903106 0.03545977 -0.10488329 0.033222346
pca$scale # information on how the variables in the dataset were scaled
## Length Wingspread Weight Skull_Length Humerus_Length
## 3.5608315 5.5122709 1.4773290 0.7054777 0.5904685
## Femur_Length Tarsus_Length Sternum_Length Skull_Width
## 0.6094387 1.0296262 1.0045593 0.3805969
pca$center # information on how the variables were centered around 0
## Length Wingspread Weight Skull_Length Humerus_Length
## 3.281280e-15 1.200531e-15 -1.119802e-15 -1.963299e-15 -2.657737e-15
## Femur_Length Tarsus_Length Sternum_Length Skull_Width
## -1.976360e-15 1.238631e-15 -7.000119e-17 -1.787173e-15
pca$rotation # resulting rotation matrix containing eigenvectors
## PC1 PC2 PC3 PC4 PC5
## Length 0.3113182 -0.48175513 0.02229922 -0.41815857 0.19719494
## Wingspread 0.3519669 -0.28971389 0.32621226 -0.21858032 0.25216077
## Weight 0.3140960 -0.33493529 -0.21062751 -0.15604080 -0.72876548
## Skull_Length 0.3328845 0.13688694 -0.31044883 0.30038515 -0.29893156
## Humerus_Length 0.3851414 0.21428192 0.24369845 -0.04486213 0.05060431
## Femur_Length 0.3630277 0.41484276 0.16164946 -0.03621275 0.10202003
## Tarsus_Length 0.3392712 0.46086907 0.18672729 -0.12495881 -0.16734831
## Sternum_Length 0.2962715 -0.34123813 0.19778895 0.80186706 0.14007820
## Skull_Width 0.2944947 0.07876015 -0.76996029 -0.02926814 0.46526730
## PC6 PC7 PC8 PC9
## Length -0.38191614 -0.51623648 0.15746650 0.139991140
## Wingspread -0.01722238 0.58909210 -0.38269685 -0.283946038
## Weight 0.40902438 0.03600302 0.09143718 -0.111320803
## Skull_Length -0.73394265 0.23323278 -0.03753325 -0.001420609
## Humerus_Length 0.14570457 0.30000698 0.40887962 0.680713192
## Femur_Length 0.04554965 -0.15236298 0.47981073 -0.634632805
## Tarsus_Length 0.08364904 -0.39844161 -0.63887140 0.141767330
## Sternum_Length 0.18447218 -0.23988512 -0.04919212 0.007200079
## Skull_Width 0.28903106 0.03545977 -0.10488329 0.033222346
head(pca$x) # these were the values that were rotated; I've wrapped it with the head() function to keep knit from printing all 136 rows
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -3.8856548 -1.3120204 -0.5218476 1.1129461 -0.1905058 -0.2336805
## [2,] 0.8224474 -0.3998802 0.7775967 -0.8865862 -0.2099604 1.2483988
## [3,] -0.6426387 0.2143840 0.1084914 0.1627538 -0.6951568 1.9420813
## [4,] -1.2227485 0.7674720 1.0376675 0.8433835 -0.5152572 -0.2262541
## [5,] -1.6146988 0.3682892 1.3968497 0.2316305 -0.3493214 -0.6703105
## [6,] 2.7783623 -0.1149411 1.0431605 0.4265468 0.3594881 0.5671766
## PC7 PC8 PC9
## [1,] 0.96374801 0.1525549 -0.57521792
## [2,] 0.02392220 -1.0953482 -0.04083636
## [3,] -0.01924521 -0.3374622 0.10601678
## [4,] 0.88126091 -0.7067789 0.74563716
## [5,] 0.56087560 -0.5365881 -0.44692206
## [6,] 0.68771676 0.7389925 0.20013441
bumpus_pca <- data.frame(bumpus, pca$x) # combine original data frame with PCs
# plot by sex
ggplot(bumpus_pca, aes(x = PC1, y = PC2, color = Sex)) +
geom_point()
# plot by age
ggplot(bumpus_pca, aes(x = PC1, y = PC2, color = Age)) +
geom_point()
# plot by survival
ggplot(bumpus_pca, aes(x = PC1, y = PC2, color = Survival)) +
geom_point()
Question 2: Now visualize the rotation matrix of the PCA obtained under Question 1.
Hint: Look at yesterday’s in class worksheet to find the code for visualizing the rotation matrix with arrows.
# capture the rotation matrix in a data frame
rotation_data <- data.frame(pca$rotation) %>%
rownames_to_column(var = "bird_feature")
# define a pleasing arrow style
arrow_style <- arrow(
length = unit(0.05, "inches"),
type = "closed"
)
# now plot, using geom_segment() for arrows and geom_text for labels
ggplot(rotation_data) +
geom_segment(aes(xend = PC1, yend = PC2), x = 0, y = 0, arrow = arrow_style) +
geom_text(aes(x = PC1, y = PC2, label = bird_feature), hjust = 0, size = 3, color = "red") +
xlim(-1., 1.25) +
ylim(-1., 1.) +
coord_fixed() # fix aspect ratio to 1:1
Question 3: Given the four plots from Questions 1 and 2, how do you interpret PC1 and PC2? What does PC1 tell you about a data point? What does PC2 tell you about a data point?
PC1 seems to measure the overall body size of the birds. All variables contribute positively to PC1, hence the larger an animal the larger its PC1 value.
PC2 seems to measure the difference between male and female birds. Most female birds score positively on PC2, and most male birds score negatively on PC2.
Question 4: What percentage of the variation in the data does PC1 explain?
Hint: Look at yesterday’s in class worksheet to find the calculation for relative variance of each principal component.
perc_variation <- 100 * (pca$sdev^2 / sum(pca$sdev^2))
perc_df <- data.frame(percent = perc_variation, princ_comp = 1:length(perc_variation))
ggplot(perc_df, aes(x = princ_comp, y = percent)) +
geom_col() +
geom_text(aes(label = round(percent, 2)), size = 4, vjust = -0.5) +
ylim(0, 70)
PC1 explains 59% of the variation in the data.
Question 5: Does the PCA suggest any specific physical characteristics for birds that survived? Consider only PC1 and PC2 for your answer.
Not really, dead and alive birds seem to be sprinkled all over the PC2-vs-PC1 plot. There is maybe a minor tendency for dead birds to score more negative on PC2.