In-class worksheet 9

Feb 18, 2020

In this worksheet, we will use the packages tidyverse, patchwork, grid, and ggthemes:

library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(patchwork) # required to arrange plots side-by-side
library(grid) # required to draw arrows
library(ggthemes) # for colorblind color scale

1. PCA of the iris data set

The iris dataset has four measurements per observational unit (iris plant):

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

If we want to find out which characteristics best distinguish the iris species, we have to make many individual plots and hope that a pattern stands out:

p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_point() +
  scale_color_colorblind()
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  scale_color_colorblind()
p1 + p2 + p3 + p4 + plot_layout(ncol = 2) # arrange in a grid

In this particular case, petal length and petal width seem to distinguish the three species best. Principal Components Analysis (PCA) lets us discover such patterns systematically, and it also works when there are many more than four variables.
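To see why plotting every pair of variables does not scale, here is a minimal base-R sketch that lists all pairs of the four iris measurements. The name `var_pairs` is chosen only for this illustration. The four panels above show only some of the possible pairs.

```r
# All unordered pairs of the four numeric iris columns.
# `var_pairs` is an illustrative name, not part of the worksheet.
vars <- names(iris)[1:4]
var_pairs <- combn(vars, 2)

ncol(var_pairs)   # number of distinct pairwise scatter plots for 4 variables
t(var_pairs)      # one row per pair

# The number of pairs grows quadratically with the number of variables.
choose(9, 2)      # pairs for a data set with 9 variables
```

With four variables there are 6 possible pairwise plots, and with nine variables there are already 36.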

The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function prcomp():

iris %>% 
  select(-Species) %>%   # remove Species column
  scale() %>%            # scale to 0 mean and unit variance
  prcomp() ->            # do PCA
  pca                    # store result as `pca`

# now display the results of the PCA
pca
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
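As a cross-check on the scaling step, the following base-R sketch confirms that `scale()` produces columns with mean 0 and standard deviation 1. It also shows that `prcomp()` can do the scaling itself through its `center` and `scale.` arguments. The names `num`, `scaled`, `pca_manual`, and `pca_builtin` are assumptions made for this example only.

```r
# Numeric columns only (same as dropping Species).
num <- iris[, 1:4]

# Step (ii): after scaling, every column has mean 0 and standard deviation 1.
scaled <- scale(num)
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# Scaling by hand and letting prcomp() scale should give the same PCA.
pca_manual  <- prcomp(scaled)
pca_builtin <- prcomp(num, center = TRUE, scale. = TRUE)

all.equal(pca_manual$sdev, pca_builtin$sdev)
# The sign of a component is arbitrary, so compare absolute loadings.
all.equal(abs(pca_manual$rotation), abs(pca_builtin$rotation))
```

Either route should be fine. The explicit `scale()` call in the pipeline has the advantage that the preprocessing is visible to the reader.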

The main results from PCA are the standard deviations and the rotation matrix. We will talk about them below. First, however, let’s plot the data in principal-component coordinates. Specifically, we will plot PC2 vs. PC1. The rotated data are available as pca$x:

head(pca$x)
##            PC1        PC2         PC3          PC4
## [1,] -2.257141 -0.4784238  0.12727962  0.024087508
## [2,] -2.074013  0.6718827  0.23382552  0.102662845
## [3,] -2.356335  0.3407664 -0.04405390  0.028282305
## [4,] -2.291707  0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825  0.006586116

As we can see, the rotated data no longer tell us which species each observation belongs to. We have to add the species information back in:

# add species information back into PCA data
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
##         PC1        PC2         PC3          PC4 Species
## 1 -2.257141 -0.4784238  0.12727962  0.024087508  setosa
## 2 -2.074013  0.6718827  0.23382552  0.102662845  setosa
## 3 -2.356335  0.3407664 -0.04405390  0.028282305  setosa
## 4 -2.291707  0.5953999 -0.09098530 -0.065735340  setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870  setosa
## 6 -2.068701 -1.4842053 -0.02687825  0.006586116  setosa

Now we can plot as usual:

ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) + 
  geom_point() +
  scale_color_colorblind()

In the PC2 vs. PC1 plot, versicolor and virginica are separated much better than in the Sepal.Length vs. Sepal.Width plot above.

Next, let’s look at the rotation matrix:

pca$rotation
##                     PC1         PC2        PC3        PC4
## Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
## Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

It tells us how much each variable contributes to each principal component. For example, Sepal.Width contributes little to PC1 but makes up much of PC2. Often it is helpful to plot the rotation matrix as arrows. This can be done as follows:

# capture the rotation matrix in a data frame
rotation_data <- data.frame(
  pca$rotation, 
  variable = row.names(pca$rotation)
)

# define a pleasing arrow style
arrow_style <- arrow(
  length = unit(0.05, "inches"),
  type = "closed"
)

# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) + 
  geom_segment(aes(xend = PC1, yend = PC2), x = 0, y = 0, arrow = arrow_style) + 
  geom_text(aes(x = PC1, y = PC2, label = variable), hjust = 0, size = 3, color = "red") + 
  xlim(-1., 1.25) + 
  ylim(-1., 1.) +
  coord_fixed() # fix aspect ratio to 1:1

We can now see clearly that Petal.Length, Petal.Width, and Sepal.Length all contribute to PC1, and Sepal.Width dominates PC2.
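The rotation matrix also links the scaled measurements to the rotated data. The sketch below recomputes the PCA in base R so that it runs on its own, and the names `scaled` and `projected` are illustrative. It checks two properties: the columns of the rotation matrix are orthogonal unit vectors, and multiplying the scaled data by the rotation matrix reproduces `pca$x`.

```r
# Self-contained recomputation of the PCA in base R.
scaled <- scale(iris[, 1:4])
pca <- prcomp(scaled)

# The columns of the rotation matrix are unit vectors that are mutually
# orthogonal, so t(R) %*% R should be the identity matrix.
round(crossprod(pca$rotation), 10)

# Rotated data = scaled data times rotation matrix.
projected <- scaled %*% pca$rotation
all.equal(unname(projected), unname(pca$x))
```

Because each column has unit length, the arrows in the plot above can never extend beyond the unit circle. An arrow that nearly reaches it, such as the one for Sepal.Width, indicates a variable described almost entirely by PC1 and PC2.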

Finally, we want to look at the percent variance explained. The prcomp() function gives us standard deviations (stored in pca$sdev). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations:

percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
## [1] 72.9624454 22.8507618  3.6689219  0.5178709

The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%. We can visualize these results nicely in a bar plot:

perc_data <- data.frame(percent = percent, PC = seq_along(percent))
ggplot(perc_data, aes(x = PC, y = percent)) + 
  geom_col() + 
  geom_text(aes(label = round(percent, 2)), size = 4, vjust = -0.5) + 
  ylim(0, 80)
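It is often useful to look at the cumulative percent variance explained as well. The sketch below recomputes the PCA in base R so that it runs on its own, and the name `cumulative` is an assumption made for this example. It also compares the hand-computed proportions with those reported by `summary()`, which rounds them to five decimal places.

```r
# Self-contained recomputation of the PCA in base R.
pca <- prcomp(scale(iris[, 1:4]))

# Percent variance explained, as computed above.
percent <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Running total: how much variance the first k components explain together.
cumulative <- cumsum(percent)
round(cumulative, 1)

# For scaled data, the total variance equals the number of variables.
sum(pca$sdev^2)

# summary() reports the same proportions (as fractions, not percent).
summary(pca)$importance["Proportion of Variance", ]
```

The first two components together explain about 96% of the variance, which is why the PC2 vs. PC1 plot captures most of the structure in the data.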

2. Now do it yourself: The biopsy data set

The biopsy data set contains data from 683 patients who had a breast biopsy performed. Each tissue sample was scored according to 9 different characteristics, each on a scale from 1 to 10. Also, for each patient the final outcome (benign/malignant) was known:

biopsy <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/biopsy.csv")
## Parsed with column specification:
## cols(
##   clump_thickness = col_double(),
##   uniform_cell_size = col_double(),
##   uniform_cell_shape = col_double(),
##   marg_adhesion = col_double(),
##   epithelial_cell_size = col_double(),
##   bare_nuclei = col_double(),
##   bland_chromatin = col_double(),
##   normal_nucleoli = col_double(),
##   mitoses = col_double(),
##   outcome = col_character()
## )
biopsy
## # A tibble: 683 x 10
##    clump_thickness uniform_cell_si… uniform_cell_sh… marg_adhesion
##              <dbl>            <dbl>            <dbl>         <dbl>
##  1               5                1                1             1
##  2               5                4                4             5
##  3               3                1                1             1
##  4               6                8                8             1
##  5               4                1                1             3
##  6               8               10               10             8
##  7               1                1                1             1
##  8               2                1                2             1
##  9               2                1                1             1
## 10               4                2                1             1
## # … with 673 more rows, and 6 more variables: epithelial_cell_size <dbl>,
## #   bare_nuclei <dbl>, bland_chromatin <dbl>, normal_nucleoli <dbl>,
## #   mitoses <dbl>, outcome <chr>

Use PCA to see how well the scored characteristics separate the two outcomes (benign/malignant).

3. If this was easy

The pottery data set contains the chemical composition of ancient pottery found at four sites in Great Britain:

pottery <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/pottery.csv")
## Parsed with column specification:
## cols(
##   Site = col_character(),
##   Al = col_double(),
##   Fe = col_double(),
##   Mg = col_double(),
##   Ca = col_double(),
##   Na = col_double()
## )
pottery
## # A tibble: 26 x 6
##    Site         Al    Fe    Mg    Ca    Na
##    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Llanedyrn  14.4  7     4.3  0.15   0.51
##  2 Llanedyrn  13.8  7.08  3.43 0.12   0.17
##  3 Llanedyrn  14.6  7.09  3.88 0.13   0.2 
##  4 Llanedyrn  11.5  6.37  5.64 0.16   0.14
##  5 Llanedyrn  13.8  7.06  5.34 0.2    0.2 
##  6 Llanedyrn  10.9  6.26  3.47 0.17   0.22
##  7 Llanedyrn  10.1  4.26  4.26 0.2    0.18
##  8 Llanedyrn  11.6  5.78  5.91 0.18   0.16
##  9 Llanedyrn  11.1  5.49  4.52 0.290  0.3 
## 10 Llanedyrn  13.4  6.92  7.23 0.28   0.2 
## # … with 16 more rows

Use PCA to see whether pottery found at different sites differs in chemical composition.