class: center, middle, title-slide

.title[
# Visualizing distributions 2
]
.author[
### Claus O. Wilke
]
.date[
### last updated: 2024-01-29
]

---

## Reminder: Density estimates visualize distributions

.pull-left.small-font[

Mean temperatures in Lincoln, NE, in January 2016:

.center[

|date       | mean temp|
|:----------|---------:|
|2016-01-01 |        24|
|2016-01-02 |        23|
|2016-01-03 |        23|
|2016-01-04 |        17|
|2016-01-05 |        29|
|2016-01-06 |        33|
|2016-01-07 |        30|
|2016-01-08 |        25|
|2016-01-09 |         9|
|2016-01-10 |        11|
|2016-01-11 |        28|
|2016-01-12 |        24|
|2016-01-13 |        33|
|2016-01-14 |        40|
|2016-01-15 |        29|
|2016-01-16 |        19|
|2016-01-17 |         5|
|2016-01-18 |        11|
|2016-01-19 |        22|
|2016-01-20 |        28|
|2016-01-21 |        25|
|2016-01-22 |        22|
|2016-01-23 |        28|
|2016-01-24 |        30|
|2016-01-25 |        26|
|2016-01-26 |        29|
|2016-01-27 |        33|
|2016-01-28 |        41|
|2016-01-29 |        41|
|2016-01-30 |        39|
|2016-01-31 |        35|

]]

--

.pull-right[
<img src="visualizing-distributions-2_files/figure-html/temps_densities_january-1.svg" width="100%" />
]

---

## Reminder: Density estimates visualize distributions

.pull-left.small-font[

Mean temperatures in Lincoln, NE, in January 2016:

.center[

|date       | mean temp|
|:----------|---------:|
|2016-01-01 |        24|
|2016-01-02 |        23|
|2016-01-03 |        23|
|2016-01-04 |        17|
|2016-01-05 |        29|
|2016-01-06 |        33|
|2016-01-07 |        30|
|2016-01-08 |        25|
|2016-01-09 |         9|
|2016-01-10 |        11|
|2016-01-11 |        28|
|2016-01-12 |        24|
|2016-01-13 |        33|
|2016-01-14 |        40|
|2016-01-15 |        29|
|2016-01-16 |        19|
|2016-01-17 |         5|
|2016-01-18 |        11|
|2016-01-19 |        22|
|2016-01-20 |        28|
|2016-01-21 |        25|
|2016-01-22 |        22|
|2016-01-23 |        28|
|2016-01-24 |        30|
|2016-01-25 |        26|
|2016-01-26 |        29|
|2016-01-27 |        33|
|2016-01-28 |        41|
|2016-01-29 |        41|
|2016-01-30 |        39|
|2016-01-31 |        35|

]]

.pull-right[
<img src="visualizing-distributions-2_files/figure-html/temps_densities_january2-1.svg" width="100%" />

How can we compare distributions across months?
]

---

## A bad idea: Many overlapping density plots

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_densities_overlapping-1.svg" width="70%" />
]

---

## Another bad idea: Stacked density plots

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_densities_stacked-1.svg" width="70%" />
]

---

## Somewhat better: Small multiples

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_densities-1.svg" width="80%" />
]

---

## Instead: Show values along y, conditions along x

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_boxplots-1.svg" width="70%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

--

A boxplot is a crude way of visualizing a distribution.

---

## How to read a boxplot

.center[
<img src="visualizing-distributions-2_files/figure-html/boxplot-schematic-1.svg" width="70%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## If you like density plots, consider violins

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_violins-1.svg" width="70%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

--

A violin plot is a density plot rotated 90 degrees and then mirrored.

---

## How to read a violin plot

.center[
<img src="visualizing-distributions-2_files/figure-html/violin-schematic-1.svg" width="70%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## For small datasets, you can also use a strip chart

Advantage: Can see raw data points instead of abstract representation.

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_stripchart-1.svg" width="60%" />
]

???

Figure redrawn from [Claus O. Wilke.
Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

--

Horizontal jittering may be necessary to avoid overlapping points.

---

## For small datasets, you can also use a strip chart

Advantage: Can see raw data points instead of abstract representation.

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_stripchart2-1.svg" width="60%" />
]

Horizontal jittering may be necessary to avoid overlapping points.

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## For small datasets, you can also use a strip chart

Advantage: Can see raw data points instead of abstract representation.

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_stripchart3-1.svg" width="60%" />
]

Horizontal jittering may be necessary to avoid overlapping points.

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## We can also jitter points into violins

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_sina-1.svg" width="60%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

--

Such plots are called sina plots, to honor [Sina Hadi Sohi.](https://clauswilke.com/dataviz/boxplots-violins.html#fig:lincoln-temp-sina)

---

## But maybe there's hope for overlapping density plots?

.center[
<img src="visualizing-distributions-2_files/figure-html/temps_densities_overlapping2-1.svg" width="65%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

--

How about we stagger the densities vertically?

---

## Vertically staggered density plots are called ridgelines

.center[
<img src="visualizing-distributions-2_files/figure-html/lincoln-ridgeline-polished-1.svg" width="65%" />
]

???

Figure redrawn from [Claus O.
Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

--

Notice the single fill color. More colors would be distracting.

---
class: center middle

## Making boxplots, violins, etc. in **ggplot2**

---

## Getting the data

All examples will use the `lincoln_temps` dataset:

.tiny-font[
```r
library(ggplot2) # plotting functions used in all examples below

lincoln_temps <- readRDS(url("https://wilkelab.org/SDS375/datasets/lincoln_temps.rds"))
```
]

---

## Making boxplots, violins, etc. in **ggplot2**

.small-font.center[

Plot type    | Geom                    | Notes
:----------- | :---------------------- | :-------------------------
boxplot      | `geom_boxplot()`        |
violin plot  | `geom_violin()`         |
strip chart  | `geom_point()`          | Jittering requires `position_jitter()`
sina plot    | `geom_sina()`           | From package **ggforce**
ridgeline    | `geom_density_ridges()` | From package **ggridges**

]

---

## Examples: Boxplot

.tiny-font[
```r
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_boxplot(fill = "skyblue")
```
]

.center[
<img src="visualizing-distributions-2_files/figure-html/temps-examples-boxplot-out-1.svg" width="55%" />
]

---

## Examples: Violins

.tiny-font[
```r
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_violin(fill = "skyblue")
```
]

.center[
<img src="visualizing-distributions-2_files/figure-html/temps-examples-violin-out-1.svg" width="55%" />
]

---

## Examples: Strip chart (no jitter)

.tiny-font[
```r
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_point(size = 0.75) # reduce point size to minimize overplotting
```
]

.center[
<img src="visualizing-distributions-2_files/figure-html/temps-examples-stripchart-out-1.svg" width="55%" />
]

---

## Examples: Strip chart (w/ jitter)

.tiny-font[
```r
ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_point(
    size = 0.75, # reduce point size to minimize overplotting
    position = position_jitter(
      width = 0.15, # amount of jitter in horizontal direction
      height = 0    # amount of jitter in vertical direction (0 = none)
    )
  )
```
]

.center[
<img
src="visualizing-distributions-2_files/figure-html/temps-examples-stripchart-jitter-out-1.svg" width="55%" />
]

---

## Examples: Sina plot

.tiny-font[
```r
library(ggforce) # for geom_sina()

ggplot(lincoln_temps, aes(x = month, y = mean_temp)) +
  geom_violin(fill = "skyblue", color = NA) + # violins in background
  geom_sina(size = 0.75) # sina jittered points in foreground
```
]

.center[
<img src="visualizing-distributions-2_files/figure-html/temps-examples-sina-out-1.svg" width="55%" />
]

---

## Examples: Ridgeline plot

.tiny-font[
```r
library(ggridges) # for geom_density_ridges()

ggplot(lincoln_temps, aes(x = mean_temp, y = month_long)) +
  geom_density_ridges()
```
]

.center[
<img src="visualizing-distributions-2_files/figure-html/temps-examples-ridgeline-out-1.svg" width="55%" />
]

[//]: # "segment ends here"

---

## Further reading

- Fundamentals of Data Visualization: [Chapter 7: Visualizing many distributions at once](https://clauswilke.com/dataviz/boxplots-violins.html)
- **ggplot2** reference documentation: [`geom_boxplot()`](https://ggplot2.tidyverse.org/reference/geom_boxplot.html), [`geom_violin()`](https://ggplot2.tidyverse.org/reference/geom_violin.html), [`position_jitter()`](https://ggplot2.tidyverse.org/reference/position_jitter.html)
- **ggforce** reference documentation: [`geom_sina()`](https://ggforce.data-imaginist.com/reference/geom_sina.html)
- **ggridges** reference documentation: [`geom_density_ridges()`](https://wilkelab.org/ggridges/reference/geom_density_ridges.html)
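
---

## Appendix: Boxplot statistics by hand

As a companion to the "How to read a boxplot" slide, here is a minimal sketch of the numbers a boxplot summarizes, computed in base R for the January 2016 temperatures from the first slide. It assumes the common convention that the box spans the first to third quartile and the whiskers reach the most extreme points within 1.5 × IQR of the box; all variable names are illustrative and not part of the original deck.

.tiny-font[
```r
# January 2016 mean temperatures, copied from the table on the first slide
jan_temps <- c(24, 23, 23, 17, 29, 33, 30, 25, 9, 11, 28, 24, 33, 40, 29, 19,
               5, 11, 22, 28, 25, 22, 28, 30, 26, 29, 33, 41, 41, 39, 35)

# box: first quartile, median, third quartile
q <- quantile(jan_temps, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# fences: 1.5 * IQR beyond the quartiles (assumed convention)
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# whiskers end at the most extreme data points inside the fences
inside <- jan_temps[jan_temps >= lower_fence & jan_temps <= upper_fence]
whiskers <- range(inside)

# points outside the fences are drawn individually as outliers
outliers <- jan_temps[jan_temps < lower_fence | jan_temps > upper_fence]
```
]

For January this gives quartiles of 22.5, 28, and 31.5, whiskers ending at 9 and 41, and a single outlier at 5.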