Feb 18, 2020
In this worksheet, we will use the libraries tidyverse, patchwork, grid, and ggthemes:
library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(patchwork) # required to arrange plots side-by-side
library(grid) # required to draw arrows
library(ggthemes) # for colorblind color scale
The iris dataset has four measurements per observational unit (iris plant):
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
If we want to find out which characteristics best distinguish the iris species, we have to make many individual plots and hope we can see distinguishing patterns:
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p1 + p2 + p3 + p4 + plot_layout(ncol = 2) # arrange in a grid
In this particular case, petal length and petal width appear to separate the three species best. Principal Components Analysis (PCA) allows us to discover such patterns systematically, and it also works when there are many more than four variables.
The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():
iris %>%
  select(-Species) %>% # remove Species column
  scale() %>%          # scale to 0 mean and unit variance
  prcomp() ->          # do PCA
  pca                  # store result as `pca`
# now display the results from the PCA analysis
pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
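As a cross-check of steps (ii) and (iii), the sketch below lets prcomp() do the centering and scaling itself through its center and scale. arguments, and compares the result with the scale()-then-prcomp() pipeline. The object names iris_num, pca_piped and pca_builtin are made up for this illustration; only base R and the built-in iris data are assumed.

```r
# numeric columns only (columns 1 to 4 of the built-in iris data)
iris_num <- iris[, 1:4]

# variant 1: scale explicitly, then run PCA (as in the pipeline above)
pca_piped <- prcomp(scale(iris_num))

# variant 2: let prcomp() center and scale the data itself
pca_builtin <- prcomp(iris_num, center = TRUE, scale. = TRUE)

# the standard deviations should agree up to numerical tolerance
all.equal(pca_piped$sdev, pca_builtin$sdev)

# the rotation matrices should agree as well; abs() guards against
# a possible sign flip, since the sign of a component is arbitrary
all.equal(abs(pca_piped$rotation), abs(pca_builtin$rotation))
```

Which variant to use is largely a matter of taste; the explicit scale() call makes the preprocessing step visible in the pipeline.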
The main results from PCA are the standard deviations and the rotation matrix. We will talk about them below. First, however, let’s plot the data in the coordinates of the principal components. Specifically, we will plot PC2 vs. PC1. The rotated data are available as pca$x:
head(pca$x)
## PC1 PC2 PC3 PC4
## [1,] -2.257141 -0.4784238 0.12727962 0.024087508
## [2,] -2.074013 0.6718827 0.23382552 0.102662845
## [3,] -2.356335 0.3407664 -0.04405390 0.028282305
## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116
As we can see, these data don’t tell us which observation belongs to which species. We have to add the species information back in:
# add species information back into PCA data
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
## PC1 PC2 PC3 PC4 Species
## 1 -2.257141 -0.4784238 0.12727962 0.024087508 setosa
## 2 -2.074013 0.6718827 0.23382552 0.102662845 setosa
## 3 -2.356335 0.3407664 -0.04405390 0.028282305 setosa
## 4 -2.291707 0.5953999 -0.09098530 -0.065735340 setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870 setosa
## 6 -2.068701 -1.4842053 -0.02687825 0.006586116 setosa
Now we can plot as usual:
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point() +
  scale_color_colorblind()
In the PC2 vs PC1 plot, versicolor and virginica are much better separated.
Next, let’s look at the rotation matrix:
pca$rotation
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
It tells us how much each variable contributes to each principal component. For example, Sepal.Width contributes little to PC1 but makes up much of PC2. Often it is helpful to plot the rotation matrix as arrows. This can be done as follows:
# capture the rotation matrix in a data frame
rotation_data <- data.frame(
  pca$rotation,
  variable = row.names(pca$rotation)
)
# define a pleasing arrow style
arrow_style <- arrow(
  length = unit(0.05, "inches"),
  type = "closed"
)
# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) +
  geom_segment(aes(xend = PC1, yend = PC2), x = 0, y = 0, arrow = arrow_style) +
  geom_text(aes(x = PC1, y = PC2, label = variable), hjust = 0, size = 3, color = "red") +
  xlim(-1., 1.25) +
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1
We can now see clearly that Petal.Length, Petal.Width, and Sepal.Length all contribute to PC1, and Sepal.Width dominates PC2.
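The name "rotation matrix" can be taken literally. The following sketch assumes only base R and recomputes the PCA so that it runs on its own (scaled and scores are names chosen for illustration). It checks that the columns of the rotation matrix are orthogonal unit vectors, and that multiplying the scaled data by the rotation matrix reproduces pca$x.

```r
scaled <- scale(iris[, 1:4]) # 0 mean, unit variance
pca <- prcomp(scaled)

# t(R) %*% R should be the 4 x 4 identity matrix
round(crossprod(pca$rotation), 10)

# rotating the scaled data by hand should reproduce pca$x
scores <- scaled %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)
```

Each principal component is therefore a weighted sum of the scaled variables, with the weights given by the corresponding column of the rotation matrix.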
Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations:
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 72.9624454 22.8507618 3.6689219 0.5178709
The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:
perc_data <- data.frame(percent = percent, PC = seq_along(percent))
ggplot(perc_data, aes(x = PC, y = percent)) +
  geom_col() +
  geom_text(aes(label = round(percent, 2)), size = 4, vjust = -0.5) +
  ylim(0, 80)
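Two quick checks may help interpret these numbers. Because the data were scaled to unit variance, the squared standard deviations are the eigenvalues of the correlation matrix and sum to the number of variables (four here). The cumulative sum of the percentages shows how much variance the first few components capture together. The sketch below assumes only base R and recomputes pca and percent so that it is self-contained; eigenvalues and cumulative are illustrative names.

```r
pca <- prcomp(scale(iris[, 1:4]))
percent <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# squared standard deviations are the eigenvalues of the correlation matrix
eigenvalues <- eigen(cor(iris[, 1:4]))$values
all.equal(pca$sdev^2, eigenvalues)

# with unit-variance variables, the total variance equals the number of variables
sum(pca$sdev^2)

# cumulative percent variance explained by the first 1, 2, 3, 4 components
cumulative <- cumsum(percent)
round(cumulative, 1)
```

With the percentages shown above, PC1 and PC2 together account for about 96% of the variance, which is why the PC2 vs. PC1 plot captures most of the structure in the data.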
The biopsy data set contains data from 683 patients who had a breast biopsy performed. Each tissue sample was scored according to 9 different characteristics, each on a scale from 1 to 10. Also, for each patient the final outcome (benign/malignant) was known:
biopsy <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/biopsy.csv")
## Parsed with column specification:
## cols(
## clump_thickness = col_double(),
## uniform_cell_size = col_double(),
## uniform_cell_shape = col_double(),
## marg_adhesion = col_double(),
## epithelial_cell_size = col_double(),
## bare_nuclei = col_double(),
## bland_chromatin = col_double(),
## normal_nucleoli = col_double(),
## mitoses = col_double(),
## outcome = col_character()
## )
biopsy
## # A tibble: 683 x 10
## clump_thickness uniform_cell_si… uniform_cell_sh… marg_adhesion
## <dbl> <dbl> <dbl> <dbl>
## 1 5 1 1 1
## 2 5 4 4 5
## 3 3 1 1 1
## 4 6 8 8 1
## 5 4 1 1 3
## 6 8 10 10 8
## 7 1 1 1 1
## 8 2 1 2 1
## 9 2 1 1 1
## 10 4 2 1 1
## # … with 673 more rows, and 6 more variables: epithelial_cell_size <dbl>,
## # bare_nuclei <dbl>, bland_chromatin <dbl>, normal_nucleoli <dbl>,
## # mitoses <dbl>, outcome <chr>
Use PCA to predict the outcome (benign/malignant) from the scored characteristics.
The pottery data set contains the chemical composition of ancient pottery found at four sites in Great Britain:
pottery <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/pottery.csv")
## Parsed with column specification:
## cols(
## Site = col_character(),
## Al = col_double(),
## Fe = col_double(),
## Mg = col_double(),
## Ca = col_double(),
## Na = col_double()
## )
pottery
## # A tibble: 26 x 6
## Site Al Fe Mg Ca Na
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Llanedyrn 14.4 7 4.3 0.15 0.51
## 2 Llanedyrn 13.8 7.08 3.43 0.12 0.17
## 3 Llanedyrn 14.6 7.09 3.88 0.13 0.2
## 4 Llanedyrn 11.5 6.37 5.64 0.16 0.14
## 5 Llanedyrn 13.8 7.06 5.34 0.2 0.2
## 6 Llanedyrn 10.9 6.26 3.47 0.17 0.22
## 7 Llanedyrn 10.1 4.26 4.26 0.2 0.18
## 8 Llanedyrn 11.6 5.78 5.91 0.18 0.16
## 9 Llanedyrn 11.1 5.49 4.52 0.290 0.3
## 10 Llanedyrn 13.4 6.92 7.23 0.28 0.2
## # … with 16 more rows
Use PCA to see whether pottery found at different sites has different chemical composition.
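Both exercises follow the same recipe as the iris example. One way to avoid repeating it is to wrap the steps in a small helper. The function pca_scores() below is hypothetical (it is not part of any package) and is only a sketch: it assumes that every numeric column should enter the PCA and that the name of the grouping column is passed as a string. It is demonstrated on iris so that it runs without downloading anything.

```r
# hypothetical helper: run PCA on all numeric columns of `data` and
# return the rotated data together with the grouping column `group`
pca_scores <- function(data, group) {
  data <- as.data.frame(data)
  numeric_cols <- vapply(data, is.numeric, logical(1))
  pca <- prcomp(scale(data[, numeric_cols, drop = FALSE]))
  data.frame(pca$x, group = data[[group]])
}

# demonstration on iris; the result has columns PC1 to PC4 and group
scores <- pca_scores(iris, "Species")
head(scores)
```

Under the same assumptions, pca_scores(biopsy, "outcome") or pca_scores(pottery, "Site") would return a data frame that can be passed to ggplot() with aes(x = PC1, y = PC2, color = group).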