class: center, middle, title-slide

.title[
# Visualizing distributions 1
]
.author[
### Claus O. Wilke
]
.date[
### last updated: 2022-08-25
]

---
class: center middle

## Histograms and density plots

---

## Passengers on the Titanic

.center.small-font[
<table>
<thead>
<tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> class </th> <th style="text-align:left;"> survived </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.17 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.33 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr>
<tr> <td style="text-align:right;"> 0.80 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.83 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.83 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.92 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 1st </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> male </td> <td
style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> class </th> <th style="text-align:left;"> survived </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1.0 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 1.5 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 1.5 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 1st </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> male </td> 
<td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> class </th> <th style="text-align:left;"> survived </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female 
</td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
</tbody>
</table>
]

---

## Histogram: Define bins and count cases

.pull-left.small-font[
<table>
<thead>
<tr> <th style="text-align:left;"> age range </th> <th style="text-align:right;"> count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 0–5 </td> <td style="text-align:right;"> 36 </td> </tr>
<tr> <td style="text-align:left;"> 6–10 </td> <td style="text-align:right;"> 19 </td> </tr>
<tr> <td style="text-align:left;"> 11–15 </td> <td style="text-align:right;"> 18 </td> </tr>
<tr> <td style="text-align:left;"> 16–20 </td> <td style="text-align:right;"> 99 </td> </tr>
<tr> <td style="text-align:left;"> 21–25 </td> <td style="text-align:right;"> 139 </td> </tr>
<tr> <td style="text-align:left;"> 26–30 </td> <td style="text-align:right;"> 121 </td> </tr>
<tr> <td style="text-align:left;"> 31–35 </td> <td style="text-align:right;"> 76 </td> </tr>
<tr> <td style="text-align:left;"> 36–40 </td> <td style="text-align:right;"> 74 </td> </tr>
</tbody>
</table>
<table>
<thead>
<tr> <th style="text-align:left;"> age range </th> <th style="text-align:right;"> count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 41–45 </td> <td style="text-align:right;"> 54 </td> </tr>
<tr> <td style="text-align:left;"> 46–50 </td> <td style="text-align:right;"> 50 </td> </tr>
<tr> <td style="text-align:left;"> 51–55 </td> <td style="text-align:right;"> 26 </td> </tr>
<tr> <td style="text-align:left;"> 56–60 </td> <td style="text-align:right;"> 22 </td> </tr>
<tr> <td style="text-align:left;"> 61–65 </td> <td style="text-align:right;"> 16 </td> </tr>
<tr> <td style="text-align:left;"> 66–70 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> 71–75 </td> <td
style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> 76–80 </td> <td style="text-align:right;"> 0 </td> </tr>
</tbody>
</table>
]

--

.pull-right[
Figure: histogram of the binned passenger ages.
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Histograms depend on the chosen bin width

.center[
Figure: histograms of passenger ages drawn with different bin widths.
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Alternative to histogram: Kernel density estimate (KDE)

.pull-left[
Figure: histogram of passenger ages.
]

--

.pull-right[
Figure: kernel density estimate of passenger ages.
]

--

Histograms show raw counts; KDEs show proportions (total area = 1).

???

Figures redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## KDEs also depend on parameter settings

.center[
Figure: kernel density estimates drawn with different parameter settings.
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Careful: KDEs can show nonsensical data

.center[
Figure: kernel density estimate showing nonsensical data.
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Careful: Are bars stacked or overlapping?
.pull-left[
Figure: histogram with stacked bars.
]

--

.pull-right[
Figure: histogram with overlapping bars.
]

--

Stacked or overlapping histograms are rarely a good choice.

???

Figures redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Alternatively: Age pyramid

.center[
Figure: age pyramid.
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Alternatively: KDEs showing proportions of total

.center[
Figure: KDEs showing proportions of total.
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

[//]: # "segment ends here"

---
class: center middle

## Histograms and density plots in **ggplot2**

---

## Getting the data

All examples will use the `titanic` dataset:

.tiny-font[
```r
library(tidyverse)  # provides read_csv(), %>%, select(), and ggplot()

titanic <- read_csv("https://wilkelab.org/DSC385/datasets/titanic.csv") %>%
  select(age, sex, class, survived)
```
]

---

## Making histograms with ggplot: `geom_histogram()`

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram()
```
]

--

.center.small-font[
```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Figure: plot produced by the code above.
]

---

## Setting the bin width

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5)
```
]

.center[
Figure: plot produced by the code above.
]

--

Do you like the bin placement?
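
One way to judge the bin placement is to compute the bin edges by hand. The sketch below uses base R only and a small made-up vector of ages (not the Titanic data). It assumes that, when only `binwidth` is given, the bins end up centered on multiples of the bin width, so that the first bin starts at -2.5; that assumption about the default is worth checking against the **ggplot2** documentation.

```r
# Hypothetical ages for illustration; these are not the Titanic data.
ages <- c(0.17, 1, 2, 3, 4, 6, 18, 22, 24, 29, 31, 45, 58, 71)
binwidth <- 5

# Bin edges shifted by half a bin width: (-2.5, 2.5], (2.5, 7.5], ...
# (assumed here to mimic the placement when only binwidth is set)
shifted_breaks <- seq(-binwidth / 2, 72.5, by = binwidth)

# Bin edges on multiples of the bin width: (0, 5], (5, 10], ...
aligned_breaks <- seq(0, 75, by = binwidth)

# Count the cases that fall into each bin
shifted_counts <- table(cut(ages, breaks = shifted_breaks))
aligned_counts <- table(cut(ages, breaks = aligned_breaks))

print(shifted_counts)
print(aligned_counts)
```

With the shifted edges, the first bin covers ages from -2.5 to 2.5, half of which is an impossible range; with the aligned edges it covers ages from 0 to 5.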
---

## Always set the center as well

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram(
    binwidth = 5,  # width of the bins
    center = 2.5   # center of the bin containing that value
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## Always set the center as well

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram(
    binwidth = 5,  # width of the bins
    center = 12.5  # center of the bin containing that value
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## Making density plots with ggplot: `geom_density()`

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(fill = "skyblue")
```
]

--

.center[
Figure: plot produced by the code above.
]

---

## Making density plots with ggplot: `geom_density()`

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_density() # without fill
```
]

.center[
Figure: plot produced by the code above.
]

---

## Modifying bandwidth (`bw`) and kernel parameters

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(
    fill = "skyblue",
    bw = 0.5,            # a small bandwidth
    kernel = "gaussian"  # Gaussian kernel (the default)
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## Modifying bandwidth (`bw`) and kernel parameters

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(
    fill = "skyblue",
    bw = 2,                 # a moderate bandwidth
    kernel = "rectangular"  # rectangular kernel
  )
```
]

.center[
Figure: plot produced by the code above.
]

[//]: # "segment ends here"

---
class: center middle

## Setting stats explicitly in **ggplot2**

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(
    stat = "density",  # the default for geom_density()
    fill = "skyblue"
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_area(  # geom_area() does not normally use stat = "density"
    stat = "density",
    fill = "skyblue"
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_line(  # neither does geom_line()
    stat = "density"
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  # we can use multiple geoms on top of each other
  geom_area(stat = "density", fill = "skyblue") +
  geom_line(stat = "density")
```
]

.center[
Figure: plot produced by the code above.
]

---

## Parameters are handed through to the stat

.pull-left.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_line(stat = "density", bw = 3)
```

.center[
Figure: plot produced by the code above.
]]

.pull-right.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_line(stat = "density", bw = 0.3)
```

.center[
Figure: plot produced by the code above.
]]

--

Here, `bw` is a parameter of `stat_density()`, not of `geom_line()`.
---

## We can explicitly map results from stat computations

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_tile(  # geom_tile() draws rectangular colored areas
    aes(
      y = 1,  # draw all tiles at the same y location
      fill = after_stat(density)  # use computed density for fill
    ),
    stat = "density",
    n = 20  # number of points calculated by stat_density()
  )
```
]

.center[
Figure: plot produced by the code above.
]

---

## We can explicitly map results from stat computations

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_tile(  # geom_tile() draws rectangular colored areas
    aes(
      y = 1,  # draw all tiles at the same y location
      fill = after_stat(density)  # use computed density for fill
    ),
    stat = "density",
    n = 200  # number of points calculated by stat_density()
  )
```
]

.center[
Figure: plot produced by the code above.
]

[//]: # "segment ends here"

---

## Further reading

- Fundamentals of Data Visualization: [Chapter 7: Visualizing distributions](https://clauswilke.com/dataviz/histograms-density-plots.html)
- Data Visualization—A Practical Introduction: [4.6 Histograms and density plots](https://socviz.co/groupfacettx.html#histograms)
- **ggplot2** reference documentation: [`geom_histogram()`](https://ggplot2.tidyverse.org/reference/geom_histogram)
- **ggplot2** reference documentation: [`geom_density()`](https://ggplot2.tidyverse.org/reference/geom_density)