Dimension reduction 2
Introduction
In this worksheet, we will discuss how to perform t-SNE (t-distributed stochastic neighbor embedding), a type of non-linear dimension reduction.
First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.
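The setup chunk itself is not reproduced in this text. Judging from the functions used later in the worksheet (`select()`, `where()`, `ggplot()`, `Rtsne()`), a minimal setup would presumably look like the sketch below; the exact package list is an assumption.

```r
# Assumed package list, inferred from the functions used in this worksheet
library(tidyverse)  # select(), where(), ggplot(), geom_point()
library(Rtsne)      # Rtsne()
```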
Next we set up the data.
We will be working with two datasets, `spirals` and `blue_jays`. The dataset `spirals` contains made-up data in two dimensions that forms three intertwined spirals.
The dataset `blue_jays` contains various measurements taken on blue jay birds.
Performing t-SNE on the `spirals` dataset
We start by taking a closer look at the `spirals` dataset. It has three columns, `x`, `y`, and `group`. When we create a scatterplot of `y` against `x` and color by `group`, we see three intertwined spirals.
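The scatterplot described above can be sketched as follows. Because the worksheet's data setup is not reproduced here, the block first builds `spirals_demo`, a hypothetical stand-in with the same three columns; its shape and noise level are assumptions, not the original data. With the real data loaded, only the final `ggplot()` call is needed, with `spirals` in place of `spirals_demo`.

```r
library(tidyverse)

# Hypothetical stand-in for `spirals`: three spiral arms,
# each rotated by 120 degrees relative to the previous one
set.seed(1)
spirals_demo <- tibble(
  group = rep(c("A", "B", "C"), each = 50),
  t = rep(seq(0.5, 3 * pi, length.out = 50), times = 3),
  offset = rep(c(0, 2 * pi / 3, 4 * pi / 3), each = 50)
) |>
  mutate(
    x = t * cos(t + offset) + rnorm(n(), sd = 0.1),
    y = t * sin(t + offset) + rnorm(n(), sd = 0.1)
  ) |>
  select(x, y, group)

# scatterplot of y against x, colored by group
ggplot(spirals_demo, aes(x, y, color = group)) +
  geom_point()
```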
We perform t-SNE on this dataset with the function `Rtsne()`. Data preparation is similar to PCA: First, we discard all non-numeric columns. Then, we scale the variables to zero mean and unit variance.
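A minimal sketch of this step with all t-SNE settings left at their defaults might look like the block below. It runs on `demo`, a hypothetical stand-in for `spirals` (an assumption, since the data setup is not shown here); the seed value is simply the one used elsewhere in this worksheet.

```r
library(tidyverse)
library(Rtsne)

# Hypothetical stand-in for `spirals` (replace with the real data if loaded)
set.seed(1)
arm <- rep(0:2, each = 50)
demo <- tibble(
  group = c("A", "B", "C")[arm + 1],
  t = rep(seq(0.5, 3 * pi, length.out = 50), times = 3),
  x = t * cos(t + 2 * pi / 3 * arm) + rnorm(150, sd = 0.1),
  y = t * sin(t + 2 * pi / 3 * arm) + rnorm(150, sd = 0.1)
) |>
  select(x, y, group)

set.seed(1255)
tsne_default <- demo |>
  select(where(is.numeric)) |>  # discard non-numeric columns
  scale() |>                    # zero mean, unit variance
  Rtsne()                       # defaults: perplexity = 30, max_iter = 1000

# the embedding is stored in the matrix `Y`, one row per observation
tsne_default$Y |>
  as.data.frame() |>
  cbind(select(demo, -where(is.numeric))) |>
  ggplot(aes(V1, V2, color = group)) +
  geom_point()
```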
The result looks quite similar to the plot of the raw data. That is the case because we have not customized t-SNE. The main parameter that we change when running t-SNE is the perplexity value (`perplexity`), and its default of 30 is relatively large for the spirals data. We can also change the random seed and the number of iterations until the algorithm is considered converged (`max_iter`, higher is better).
Now, to see how the parameter settings change the t-SNE results, run the code below for a few different values of the three parameters we have set up: the random seed, `perplexity`, and `max_iter`. Pay attention to how the output changes as you change each of these parameters.
# random seed
set.seed(1255)
# run t-SNE with different perplexity and total number of iterations
tsne_fit <- spirals |>
  select(where(is.numeric)) |>
  scale() |>
  Rtsne(perplexity = 8, max_iter = 1000)
# extract coordinates from the `tsne_fit` object and plot
tsne_fit$Y |>
  as.data.frame() |>
  # put non-numeric data columns back in to the dataset
  cbind(select(spirals, -where(is.numeric))) |>
  ggplot(aes(V1, V2, color = group)) +
  geom_point()
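Instead of rerunning the code by hand for each setting, several perplexity values can be compared side by side. The sketch below is one possible way to do this and is not part of the original worksheet: `run_tsne()` is a hypothetical helper, the perplexity values 2, 8, and 30 are arbitrary choices, and `demo` is a made-up stand-in for `spirals`.

```r
library(tidyverse)
library(Rtsne)

# Hypothetical stand-in for `spirals` (replace with the real data if loaded)
set.seed(1)
arm <- rep(0:2, each = 50)
demo <- tibble(
  group = c("A", "B", "C")[arm + 1],
  t = rep(seq(0.5, 3 * pi, length.out = 50), times = 3),
  x = t * cos(t + 2 * pi / 3 * arm) + rnorm(150, sd = 0.1),
  y = t * sin(t + 2 * pi / 3 * arm) + rnorm(150, sd = 0.1)
) |>
  select(x, y, group)

# Hypothetical helper: run t-SNE once and return the coordinates
# together with the non-numeric columns and the perplexity used
run_tsne <- function(data, perplexity, seed = 1255, max_iter = 1000) {
  set.seed(seed)
  fit <- data |>
    select(where(is.numeric)) |>
    scale() |>
    Rtsne(perplexity = perplexity, max_iter = max_iter)
  fit$Y |>
    as.data.frame() |>
    cbind(select(data, -where(is.numeric))) |>
    mutate(perplexity = perplexity)
}

# one t-SNE run per perplexity value, stacked into a single data frame
results <- map(c(2, 8, 30), \(p) run_tsne(demo, p)) |>
  bind_rows()

# one panel per perplexity value; axes are free because t-SNE
# coordinates are not comparable across runs
ggplot(results, aes(V1, V2, color = group)) +
  geom_point() +
  facet_wrap(vars(perplexity), scales = "free")
```

The same helper can be called with different `seed` or `max_iter` values to see how much of the structure is stable across runs.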
Performing t-SNE on the `blue_jays` dataset
Next we will perform t-SNE on the `blue_jays` dataset. See if you can adapt the code from the spirals data to work with the `blue_jays` dataset.
# random seed
set.seed(1255)
# run t-SNE with different perplexity and total number of iterations
tsne_fit <- ___ |>
  select(where(is.numeric)) |>
  scale() |>
  Rtsne(perplexity = 8, max_iter = 1000)
# extract coordinates from the `tsne_fit` object and plot
tsne_fit$Y |>
  as.data.frame() |>
  # put non-numeric data columns back in to the dataset
  cbind(select(___, -where(is.numeric))) |>
  ggplot(aes(V1, V2, color = ___)) +
  geom_point()

# random seed
set.seed(1255)
# run t-SNE with different perplexity and total number of iterations
tsne_fit <- blue_jays |>
  select(where(is.numeric)) |>
  scale() |>
  Rtsne(perplexity = 8, max_iter = 1000)
# extract coordinates from the `tsne_fit` object and plot
tsne_fit$Y |>
  as.data.frame() |>
  # put non-numeric data columns back in to the dataset
  cbind(select(blue_jays, -where(is.numeric))) |>
  ggplot(aes(V1, V2, color = sex)) +
  geom_point()
As before, change the t-SNE configuration parameters and see what effect different choices have on the results you obtain.
# random seed
set.seed(___)
# run t-SNE with different perplexity and total number of iterations
tsne_fit <- blue_jays |>
  select(where(is.numeric)) |>
  scale() |>
  Rtsne(
    perplexity = ___,
    max_iter = ___
  )
# extract coordinates from the `tsne_fit` object and plot
tsne_fit$Y |>
  as.data.frame() |>
  # put non-numeric data columns back in to the dataset
  cbind(select(blue_jays, -where(is.numeric))) |>
  ggplot(aes(V1, V2, color = sex)) +
  geom_point()

# random seed
set.seed(9327)
# run t-SNE with different perplexity and total number of iterations
tsne_fit <- blue_jays |>
  select(where(is.numeric)) |>
  scale() |>
  Rtsne(
    perplexity = 6,
    max_iter = 2000
  )
# extract coordinates from the `tsne_fit` object and plot
tsne_fit$Y |>
  as.data.frame() |>
  # put non-numeric data columns back in to the dataset
  cbind(select(blue_jays, -where(is.numeric))) |>
  ggplot(aes(V1, V2, color = sex)) +
  geom_point()