class: center, middle, title-slide

.title[
# Dimension reduction 2
]
.author[
### Claus O. Wilke
]
.date[
### last updated: 2023-04-10
]

---

## What if a rotation cannot disentangle the data?

--

.center[
![](dimension-reduction-2_files/figure-html/spirals-1.svg)<!-- -->
]

---

## PCA of intertwined spirals is not useful

.center[
![](dimension-reduction-2_files/figure-html/pca-spirals-1.svg)<!-- -->
]

---

## One possible approach: Kernel PCA

--

- Kernel PCA performs PCA in a hypothetical, higher-dimensional space

--

- With more dimensions, data points become more separable

--

- Importantly, the space is never explicitly constructed ([kernel trick](https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick))

--

- Results from kernel PCA depend on the choice of kernel

---

## Kernel PCA can separate the spirals

.center[
![](dimension-reduction-2_files/figure-html/kpca-spirals-1.svg)<!-- -->
]

Gaussian kernel, sigma = 64

---

## But we need to choose the right sigma value

.center[
![](dimension-reduction-2_files/figure-html/kpca-spirals-grid-1.svg)<!-- -->
]

---

## Other approaches

--

- t-SNE: t-distributed stochastic neighbor embedding

--

- UMAP: Uniform manifold approximation and projection

--

Both algorithms look at the local distances between points in the original data space and try to reproduce them in the low-dimensional representation

---

## t-SNE can separate the spirals

.center[
![](dimension-reduction-2_files/figure-html/tsne-spirals-1.svg)<!-- -->
]

---

## t-SNE results depend on the perplexity value

.center[
![](dimension-reduction-2_files/figure-html/tsne-spirals-grid-1.svg)<!-- -->
]

---

## t-SNE results depend on the random starting point

.center[
![](dimension-reduction-2_files/figure-html/tsne-spirals-grid2-1.svg)<!-- -->
]

---

## UMAP can separate the spirals

.center[
![](dimension-reduction-2_files/figure-html/umap-spirals-1.svg)<!-- -->
]

---

## UMAP results depend on the number of neighbors

.center[
![](dimension-reduction-2_files/figure-html/umap-spirals-grid-1.svg)<!-- -->
]

---

## Random starting point has some impact on results

.center[
![](dimension-reduction-2_files/figure-html/umap-spirals-grid2-1.svg)<!-- -->
]

---

## What is the meaning of the tuning parameters?

--

Tuning parameters define when points are close in the original data space

--

This implicitly defines the number of clusters generated

--

These have comparable effects:

- sigma (Gaussian kernel PCA)
- perplexity (t-SNE)
- number of neighbors (UMAP)

---
class: center middle

## How do these methods perform<br>on the blue jays dataset?

---

## UMAP of blue jays

.center[
![](dimension-reduction-2_files/figure-html/umap-blue-jays-gray-1.svg)<!-- -->
]

---

## UMAP of blue jays

.center[
![](dimension-reduction-2_files/figure-html/umap-blue-jays-sex-1.svg)<!-- -->
]

---

## Kernel PCA of blue jays

.center[
![](dimension-reduction-2_files/figure-html/kpca-blue-jays-1.svg)<!-- -->
]

---

## Nonlinear methods have important downsides

--

- Results depend on parameter fine-tuning

--

- Low-dimensional embedding cannot be interpreted<br>
(no rotation matrix plot)

--

Use only when linear methods clearly aren't working

[//]: # "segment ends here"

---
class: middle center

## Doing nonlinear dimension reduction in R

---

## Getting the data

We'll be working with the `blue_jays` dataset:

.tiny-font[

```r
library(tidyverse) # provides read_csv(), %>%, select(), mutate(), and ggplot()

blue_jays <- read_csv("https://wilkelab.org/SDS375/datasets/blue_jays.csv")
blue_jays
```

```
# A tibble: 123 × 8
   bird_id    sex   bill_depth_mm bill_width_mm bill_l…¹ head_…² body_…³ skull…⁴
   <chr>      <chr>         <dbl>         <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
 1 0000-00000 M              8.26          9.21     25.9    56.6    73.3    30.7
 2 1142-05901 M              8.54          8.76     25.0    56.4    75.1    31.4
 3 1142-05905 M              8.39          8.78     26.1    57.3    70.2    31.2
 4 1142-05907 F              7.78          9.3      23.5    53.8    65.5    30.3
 5 1142-05909 M              8.71          9.84     25.5    57.3    74.9    31.8
 6 1142-05911 F              7.28          9.3      22.2    52.2    63.9    30
 7 1142-05912 M              8.74          9.28     25.4    57.1    75.1    31.8
 8 1142-05914 M              8.72          9.94     30      60.7    78.1    30.7
 9 1142-05917 F              8.2           9.01     22.8    52.8    64      30.0
10 1142-05920 F              7.67          9.31     24.6    54.9    67.3    30.3
# … with 113 more rows, and abbreviated variable names ¹bill_length_mm,
#   ²head_length_mm, ³body_mass_g, ⁴skull_size_mm
```

]

---

## Doing nonlinear dimension reduction in R

--

- All these methods require special packages:<br>
**kernlab** (kernel PCA)<br>
**Rtsne** (t-SNE)<br>
**umap** (UMAP)

--

- Code examples are somewhat messy

--

- We will do UMAP as an example

---

## Doing UMAP in R

.tiny-font[

```r
library(umap)

# set up UMAP parameters
custom.config <- umap.defaults
custom.config$n_neighbors <- 16    # number of neighbors
custom.config$n_epochs <- 500      # number of iterations for convergence
custom.config$random_state <- 1234 # random seed

# calculate UMAP fit object
umap_fit <- blue_jays %>%
  select(where(is.numeric)) %>% # retain only numeric columns
  scale() %>%                   # scale to zero mean and unit variance
  umap(config = custom.config)  # perform UMAP
```

]

---

## Doing UMAP in R

.pull-left.tiny-font[

```r
# extract data and plot
umap_fit$layout %>%
  as.data.frame() %>%
  mutate(sex = blue_jays$sex) %>%
  ggplot(aes(V1, V2, color = sex)) +
  geom_point()
```

]

.pull-right[
![](dimension-reduction-2_files/figure-html/umap-ggplot-demo-out-1.svg)<!-- -->
]

[//]: # "segment ends here"

---

## Further reading

- Wikipedia: [Nonlinear dimensionality reduction](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)
- Wikipedia: [t-distributed stochastic neighbor embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)
- Wikipedia: [Kernel principal component analysis](https://en.wikipedia.org/wiki/Kernel_principal_component_analysis)
- **kernlab** reference documentation (for kernel PCA): [pdf document](https://cran.r-project.org/web/packages/kernlab/kernlab.pdf)
- **Rtsne** reference documentation: [pdf document](https://cran.r-project.org/web/packages/Rtsne/Rtsne.pdf)
- **umap** vignette: [Uniform Manifold Approximation and Projection in
R](https://cran.r-project.org/web/packages/umap/vignettes/umap.html)
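
---

## Appendix: a t-SNE sketch in R

The UMAP workflow shown earlier carries over to t-SNE with small changes. This is a minimal sketch, not code from the original slides. It assumes the **Rtsne** package is installed. The perplexity of 16 and the seed of 1234 are hypothetical values chosen for illustration, not tuned settings.

.tiny-font[

```r
library(tidyverse)
library(Rtsne)

blue_jays <- read_csv("https://wilkelab.org/SDS375/datasets/blue_jays.csv")

# t-SNE starts from a random configuration, so fix the seed (assumed value)
set.seed(1234)

# calculate t-SNE fit object
tsne_fit <- blue_jays %>%
  select(where(is.numeric)) %>% # retain only numeric columns
  scale() %>%                   # scale to zero mean and unit variance
  Rtsne(
    dims = 2,                   # dimensions of the embedding
    perplexity = 16,            # assumed value; controls neighborhood size
    check_duplicates = FALSE    # skip the check for identical rows
  )

# extract the embedding (matrix Y) and plot
tsne_fit$Y %>%
  as.data.frame() %>%
  mutate(sex = blue_jays$sex) %>%
  ggplot(aes(V1, V2, color = sex)) +
  geom_point()
```

]

Because results depend on both the perplexity and the random starting point, it is worth rerunning the sketch with several values of each before reading anything into the embedding.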