```{r global_options, include=FALSE}
library(knitr)
opts_chunk$set(fig.align="center", fig.height=3, fig.width=4)
```
## In-class worksheet 9
**Feb 19, 2019**
In this worksheet, we will use the libraries tidyverse, egg, grid, and ggthemes:
```{r message=FALSE}
library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(egg) # required to arrange plots side-by-side
library(grid) # required to draw arrows
library(ggthemes) # for colorblind color scale
```
# 1. PCA of the iris data set
The `iris` dataset has four measurements per observational unit (iris plant):
```{r}
head(iris)
```
If we want to find out which characteristics are most distinguishing between iris plants, we have to make many individual plots and hope we can see distinguishing patterns:
```{r fig.height=6, fig.width=8}
p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p2 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
p3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point() +
scale_color_colorblind()
p4 <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
geom_point() +
scale_color_colorblind()
ggarrange(p1, p2, p3, p4, ncol = 2) # arrange in a grid
```
In this particular case, it seems that petal length and petal width are most distinct for the three species. Principal Components Analysis (PCA) allows us to systematically discover such patterns, and it works also when there are many more variables than just four.
The basic steps in PCA are to (i) prepare a data frame that holds only the numerical columns of interest, (ii) scale the data to 0 mean and unit variance, and (iii) do the PCA with the function `prcomp()`:
```{r}
iris %>%
select(-Species) %>% # remove Species column
scale() %>% # scale to 0 mean and unit variance
prcomp() -> # do PCA
pca # store result as `pca`
# now display the results from the PCA analysis
pca
```
The main results from PCA are the standard deviations and the rotation matrix. We will talk about them below. First, however, let's plot the data in the principal components. Specifically, we will plot PC2 vs. PC1. The rotated data are available as `pca$x`:
```{r}
head(pca$x)
```
As we can see, these data don't tell us to which species which observation belongs. We have to add the species information back in:
```{r}
# add species information back into PCA data
pca_data <- data.frame(pca$x, Species = iris$Species)
head(pca_data)
```
Now we can plot as usual:
```{r}
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
geom_point() +
scale_color_colorblind()
```
In the PC2 vs PC1 plot, versicolor and virginica are much better separated.
Next, let's look at the rotation matrx:
```{r}
pca$rotation
```
It tells us how much each variable contributes to each principal component. For example, `Sepal.Width` contributes little to PC1 but makes up much of PC2. Often it is helpful to plot the rotation matrix as arrows. This can be done as follows:
```{r}
# capture the rotation matrix in a data frame
rotation_data <- data.frame(
pca$rotation,
variable = row.names(pca$rotation)
)
# define a pleasing arrow style
arrow_style <- arrow(
length = unit(0.05, "inches"),
type = "closed"
)
# now plot, using geom_segment() for arrows and geom_text() for labels
ggplot(rotation_data) +
geom_segment(aes(xend = PC1, yend = PC2), x = 0, y = 0, arrow = arrow_style) +
geom_text(aes(x = PC1, y = PC2, label = variable), hjust = 0, size = 3, color = "red") +
xlim(-1., 1.25) +
ylim(-1., 1.) +
coord_fixed() # fix aspect ratio to 1:1
```
We can now see clearly that `Petal.Length`, `Petal.Width`, and `Sepal.Length` all contribute to PC1, and `Sepal.Width` dominates PC2.
Finally, we want to look at the percent variance explained. The `prcomp()` function gives us standard deviations (stored in `pca$sdev`). To convert them into percent variance explained, we square them and then divide by the sum over all squared standard deviations:
```{r}
percent <- 100*pca$sdev^2 / sum(pca$sdev^2)
percent
```
The first component explains 73% of the variance, the second 23%, the third 4% and the last 0.5%.
We can visualize these results nicely in a bar plot:
```{r}
perc_data <- data.frame(percent = percent, PC = 1:length(percent))
ggplot(perc_data, aes(x = PC, y = percent)) +
geom_col() +
geom_text(aes(label = round(percent, 2)), size = 4, vjust = -0.5) +
ylim(0, 80)
```
# 2. Now do it yourself: The biopsy data set
The biopsy data set contains data from 683 patients who had a breast biopsy performed. Each tissue sample was scored according to 9 different characteristics, each on a scale from 1 to 10. Also, for each patient the final outcome (benign/malignant) was known:
```{r}
biopsy <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/biopsy.csv")
head(biopsy)
```
Use PCA to predict the outcome (benign/malignant) from the scored characteristics.
# 3. If this was easy
The pottery data set contains the chemical composition of ancient pottery found at four sites in Great Britain:
```{r}
pottery <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/pottery.csv")
head(pottery)
```
Use PCA to see whether pottery found at different sites has different chemical composition.