Clustering

Author

Claus O. Wilke

Introduction

In this worksheet, we will discuss how to perform k-means clustering.

First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.
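The setup chunk is not shown in this copy of the worksheet; a minimal sketch of the likely package setup, assuming the tidyverse (which provides ggplot2, dplyr, purrr, and tibble) and broom (which provides the augment() and tidy() functions used below):

library(tidyverse) # ggplot2, dplyr, purrr, tibble
library(broom)     # augment() and tidy() for tidying model output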

Next we set up the data.

We will be working with two datasets, spirals and iris. The dataset spirals contains made-up data in two dimensions that forms three intertwined spirals.
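The spirals dataset is not built into R; the worksheet loads a prepared copy. Purely as a hypothetical illustration, data of this general shape (columns x, y, and a group label) could be generated like so:

# hypothetical construction of three noisy, intertwined spirals;
# the actual worksheet loads a prepared dataset instead
spirals <- map_dfr(1:3, function(i) {
  t <- seq(0, 2.5 * pi, length.out = 100) # angle along the spiral
  offset <- 2 * pi * (i - 1) / 3          # rotate each spiral by 120 degrees
  tibble(
    x = (0.5 + t) * cos(t + offset) + rnorm(100, sd = 0.1),
    y = (0.5 + t) * sin(t + offset) + rnorm(100, sd = 0.1),
    group = factor(i)
  )
})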

The dataset iris contains measurements on the flowers (sepal and petal length and width) of three Iris species.

Hint: Pay attention to the column names in the iris dataset. They are all capitalized (e.g., Species), and the first four use a period as a separator (e.g., Sepal.Length). They are easy to misspell, and a misspelled column name will cause the R code to fail.

Clustering the iris dataset

We perform k-means clustering in R with the function kmeans(). It takes two important arguments: the number of clusters we want to generate (centers) and the number of times we want to re-run the clustering algorithm with different random starting points (nstart). As with PCA, we need to remove all non-numeric data columns before we can run the analysis.
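The corresponding code chunk is not shown in this copy; for the iris dataset, the call presumably looks something like this (matching the kmeans() settings used in the solutions below):

km_fit <- iris |>
  select(where(is.numeric)) |>     # remove the Species column
  kmeans(centers = 3, nstart = 10) # 3 clusters, 10 random restarts

km_fit # print the human-readable summary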

The output from the fitted object (km_fit) gives us various pieces of information in human-readable form, such as the cluster sizes, the cluster means, and the assignment of rows in the original data table to the different clusters (“clustering vector”).

At the end of the output, you see a list of “available components.” These are various pieces of information about the clustering result that you can extract from the fitted object. For example, km_fit$cluster gives you information about which row in the original data table belongs to which cluster.

Similarly, km_fit$centers gives you the positions of the cluster centers, and km_fit$tot.withinss gives you the total within sum-of-squares. Try this out. Also, see if you can figure out what the component size represents.

Hint
km_fit$cluster
___
___
Solution
km_fit$cluster      # assignment of original data rows to clusters
km_fit$tot.withinss # total within sum-of-squares
km_fit$size         # cluster sizes (number of observations in each cluster)

Next we move on to plotting the clustering output. The k-means algorithm is stochastic and produces slightly different output each time it runs. This is particularly apparent when you set nstart = 1. In that case, you may get quite different results for different random seeds. You can set the random seed via set.seed().

In the example below, try various seeds, including 2356, 2357, 2358, 2359, and see what the results are.
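The example chunk is not reproduced here; a sketch of the kind of experiment intended, assuming the broom-based plotting approach used later in this worksheet:

set.seed(2356) # also try 2357, 2358, 2359

km_fit <- iris |>
  select(where(is.numeric)) |>
  kmeans(centers = 3, nstart = 1) # nstart = 1 makes run-to-run variation visible

km_fit |>
  augment(iris) |> # add a .cluster column to the original data
  ggplot(aes(Petal.Length, Petal.Width)) +
  geom_point(aes(color = .cluster, shape = Species))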

Now set nstart = 10 and try the same random seeds once more.

Finding the appropriate number of clusters

To get a sense of the correct number of clusters for a given dataset, we can plot the total within sum-of-squares as a function of the cluster number and look for a bend (“elbow”) in the curve. Remember, the total within sum-of-squares can be obtained from the fitted object via km_fit$tot.withinss.

The following code sets up a function calc_withinss that calculates the total within sum-of-squares for an arbitrary dataset and number of clusters, and then applies it to the spirals dataset.
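The chunk defining calc_withinss is not shown in this copy; a minimal sketch of such a function, following the kmeans() settings used elsewhere in this worksheet:

# total within sum-of-squares for a given dataset and cluster count
calc_withinss <- function(data, centers) {
  km_fit <- data |>
    select(where(is.numeric)) |>
    kmeans(centers = centers, nstart = 10)
  km_fit$tot.withinss
}

calc_withinss(spirals, 3) # apply it to the spirals dataset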

Now take this code and use it to make a plot of the total within sum-of-squares against cluster number.

Hint
tibble(centers = 1:15) |>
  mutate(
    within_sum_squares = map_dbl(
      centers, ~calc_withinss(spirals, .x)
    )
  ) |>
  ggplot(aes(___)) +
  ___
Solution
tibble(centers = 1:15) |>
  mutate(
    within_sum_squares = map_dbl(
      centers, ~calc_withinss(spirals, .x)
    )
  ) |>
  ggplot(aes(centers, within_sum_squares)) +
  geom_point() +
  geom_line() +
  theme_bw()

The plot suggests that the correct number of clusters should be around 3 or 4. Now cluster the spirals dataset with this number of clusters and then plot it colored by cluster id.

Hint
km_fit <- ___ |> 
  select(where(is.numeric)) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  augment(___) |>
  ___
Hint
km_fit <- spirals |> 
  select(where(is.numeric)) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  augment(spirals) |>
  ggplot() +
  aes(___) +
  geom_point(aes(color = ___, shape = ___))
Solution
km_fit <- spirals |> 
  select(where(is.numeric)) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  augment(spirals) |>
  ggplot() +
  aes(x, y) +
  geom_point(aes(color = .cluster, shape = group))

Try a few different cluster numbers to see how the algorithm behaves. Do you think k-means clustering works on this dataset?

Combining k-means and PCA

In practice, we often perform PCA first on a dataset and then cluster the transformed coordinates. Try this out on the iris dataset. Run a PCA, then cluster the PCA coordinates, and then plot the clusters in PCA space.

As a reminder, this is how we would do a PCA on this dataset:
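The reminder chunk is not reproduced here; based on the hints below, it presumably looks like this:

pca_fit <- iris |> 
  select(where(is.numeric)) |> # retain only numeric columns
  scale() |>                   # scale to zero mean and unit variance
  prcomp()

pca_fit |>
  augment(iris) |>             # add PCA coordinates to the original data
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = Species))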

Now modify this example so you perform a k-means clustering analysis and then color by clusters rather than by species.

Hint
pca_fit <- iris |> 
  select(where(is.numeric)) |> # retain only numeric columns
  scale() |>                   # scale to zero mean and unit variance
  prcomp()

# combine iris data with PCA data (needed for plot)
iris_pca <- augment(pca_fit, iris)

# perform k-means
km_fit <- augment(pca_fit) |>
  select(-.rownames) |>
  ___
Hint
pca_fit <- iris |> 
  select(where(is.numeric)) |> # retain only numeric columns
  scale() |>                   # scale to zero mean and unit variance
  prcomp()

# combine iris data with PCA data (needed for plot)
iris_pca <- augment(pca_fit, iris)

# perform k-means
km_fit <- augment(pca_fit) |>
  select(-.rownames) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  # combine with original data and the PCA coordinates
  ___ |>
  # plot
  ___
Hint
pca_fit <- iris |> 
  select(where(is.numeric)) |> # retain only numeric columns
  scale() |>                   # scale to zero mean and unit variance
  prcomp()

# combine iris data with PCA data (needed for plot)
iris_pca <- augment(pca_fit, iris)

# perform k-means
km_fit <- augment(pca_fit) |>
  select(-.rownames) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  # combine with original data and the PCA coordinates
  augment(iris_pca) |>
  ggplot() +
  ___
Hint
pca_fit <- iris |> 
  select(where(is.numeric)) |> # retain only numeric columns
  scale() |>                   # scale to zero mean and unit variance
  prcomp()

# combine iris data with PCA data (needed for plot)
iris_pca <- augment(pca_fit, iris)

# perform k-means
km_fit <- augment(pca_fit) |>
  select(-.rownames) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  # combine with original data and the PCA coordinates
  augment(iris_pca) |>
  ggplot() +
  aes(x = .fittedPC1, y = .fittedPC2) +
  geom_point(
    aes(color = .cluster, shape = Species)
  )
Solution
pca_fit <- iris |> 
  select(where(is.numeric)) |> # retain only numeric columns
  scale() |>                   # scale to zero mean and unit variance
  prcomp()

# combine iris data with PCA data (needed for plot)
iris_pca <- augment(pca_fit, iris)

# perform k-means
km_fit <- augment(pca_fit) |>
  select(-.rownames) |>
  kmeans(centers = 3, nstart = 10)

km_fit |>
  # combine with original data and the PCA coordinates
  augment(iris_pca) |>
  ggplot() +
  aes(x = .fittedPC1, y = .fittedPC2) +
  geom_point(
    aes(color = .cluster, shape = Species)
  ) +
  geom_point(
    data = tidy(km_fit),
    aes(fill = cluster),
    shape = 21, color = "black", size = 4
  ) +
  guides(color = "none")

Change which components you plot on the x and the y axis to try to get a sense of how the clusters are located in the 4-dimensional PC space.