Dimension reduction 1

Author

Claus O. Wilke

Introduction

In this worksheet, we will discuss how to perform principal components analysis (PCA).

First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.

Next we set up the data.

We will be working with the dataset blue_jays. It contains various measurements taken on blue jay birds.

Performing a PCA

We can perform a PCA with the function prcomp(). However, we first need to prepare the data. PCA can only take numeric columns, and it is best to scale all variables to zero mean and unit variance.

We can select all numeric columns in a dataset with select(where(is.numeric)) and we can scale an entire dataset consisting of only numeric columns with scale(). Try this on the blue_jays dataset. Modify the dataset so it is entirely numeric and properly scaled.

Hint
blue_jays |>
  select(where(is.numeric)) |>
  ___
Solution
blue_jays |>
  select(where(is.numeric)) |>
  scale()

Next run prcomp() on this modified dataset.

Hint
blue_jays |>
  select(where(is.numeric)) |>
  scale() |>
  ___
Solution
blue_jays |>
  select(where(is.numeric)) |>
  scale() |>
  prcomp()

In practice, we store the output from prcomp() in a variable for subsequent downstream manipulations:

Then we can extract useful data from this model fit object by running various functions from the broom package on it. For example, the tidy() function extracts model parameters in a tidy format. It takes an argument matrix that can take the values "scores", "rotation", and "eigenvalues". See what these different options do.

Solution
pca_fit |>
  tidy(matrix = "scores")

pca_fit |>
  tidy(matrix = "rotation")

pca_fit |>
  tidy(matrix = "eigenvalues")

We can also add the original dataset back into the PCA coordinates via the augment() function. This is helpful for example when we want to plot values from the original dataset (such as some of the categorical variables removed at the first step of the analysis) in the transformed coordinate system. Try out how augment() works.

Hint
pca_fit |>
  augment(___)
Solution
pca_fit |>
  augment(blue_jays)

Making a PCA plot

When plotting the results from a PCA, we usually make three separate plots: (i) we plot the individual data points in PC coordinates, (ii) we plot the rotation matrix, and (iii) we plot the variance explained by each components. Let’s discuss each of these in turn.

Plotting individual data points in PC coordinates

In the previous subsection, we used augment() to add the original dataset back into the PCA coordinates. The result from this computation can be used directly in ggplot. Try this out by plotting PC 2 versus PC 1 and coloring by the sex of the birds. Remember: The columns containing PC coordinates are called .fittedPC1, .fittedPC2, etc.

Hint
pca_fit |>
  augment(blue_jays) |>
  ggplot(aes(.fittedPC1, .fittedPC2, color = sex)) +
  ___
Solution
pca_fit |>
  augment(blue_jays) |>
  ggplot(aes(.fittedPC1, .fittedPC2, color = sex)) +
  geom_point() +
  coord_fixed()

Try also plotting other PC coordinates, e.g. PC 3 vs PC 2 or PC 3 vs PC 1.

Plotting the rotation matrix

To plot the rotation matrix we require a bit of boiler-plate code. It is always the same, so it’s fine to copy-and-paste when needed. This is the baseline for a rotation-matrix plot:

# define an arrow style
arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

pca_fit |>
  tidy(matrix = "rotation") |>  # extract rotation matrix
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column)) +
  coord_fixed(
    # you will generally have to set the limits appropriately
    xlim = ___,
    ylim = ___
  )

Use the above code to plot the rotation matrix for the blue jays PCA analysis. Make two customizations: 1. Change the x and y limits to appropriate values. Use hjust and/or vjust in geom_text() to aligne the text labels appropriately.

Solution
# define an arrow style
arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

pca_fit |>
  tidy(matrix = "rotation") |>  # extract rotation matrix
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  coord_fixed(
    xlim = c(-1.7, .5),
    ylim = c(-1, 1)
  )

Now do the same for PC 2 versus PC 3. (Hint: This means putting PC 3 on the x axis.)

Solution
# define an arrow style
arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)

pca_fit |>
  tidy(matrix = "rotation") |>  # extract rotation matrix
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC3, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = c(1, 0, 1, 1, 1, 1)) +
  coord_fixed(
    xlim = c(-1.3, 1.8),
    ylim = c(-1, .8)
  )

Plotting the eigenvalues (variance explained)

To plot the variance explained, we extract the eigenvalues with the function tidy(), as discussed above. As a reminder, do this one more time and inspect the structure of the output.

Hint
pca_fit |>
  tidy(___)
Solution
pca_fit |>
  tidy(matrix = "eigenvalues")

Now make a bar plot of the percent variance explained (column percent) by each PC.

Hint
pca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(___) + 
  ___
Solution
pca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col() +
  scale_x_continuous(breaks = 1:6) +
  scale_y_continuous(labels = scales::label_percent())