class: center, middle, title-slide

.title[
# Visualizing distributions 1
]
.author[
### Claus O. Wilke
]
.date[
### last updated: 2024-01-29
]

---
class: center middle

## Histograms and density plots

---

## Passengers on the Titanic

.center.small-font[

<table>
<thead>
<tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> class </th> <th style="text-align:left;"> survived </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.17 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.33 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr>
<tr> <td style="text-align:right;"> 0.80 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.83 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.83 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 0.92 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 1st </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> male </td> <td
style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> class </th> <th style="text-align:left;"> survived </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1.0 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 1.5 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 1.5 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 1st </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> died </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> male </td> 
<td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 2.0 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align:right;"> age </th> <th style="text-align:left;"> sex </th> <th style="text-align:left;"> class </th> <th style="text-align:left;"> survived </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 2nd </td> <td style="text-align:left;"> survived </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female 
</td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:left;"> 3rd </td> <td style="text-align:left;"> survived </td> </tr>
</tbody>
</table>

]

---

## Histogram: Define bins and count cases

.pull-left.small-font[

<table>
<thead>
<tr> <th style="text-align:left;"> age range </th> <th style="text-align:right;"> count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 0–5 </td> <td style="text-align:right;"> 36 </td> </tr>
<tr> <td style="text-align:left;"> 6–10 </td> <td style="text-align:right;"> 19 </td> </tr>
<tr> <td style="text-align:left;"> 11–15 </td> <td style="text-align:right;"> 18 </td> </tr>
<tr> <td style="text-align:left;"> 16–20 </td> <td style="text-align:right;"> 99 </td> </tr>
<tr> <td style="text-align:left;"> 21–25 </td> <td style="text-align:right;"> 139 </td> </tr>
<tr> <td style="text-align:left;"> 26–30 </td> <td style="text-align:right;"> 121 </td> </tr>
<tr> <td style="text-align:left;"> 31–35 </td> <td style="text-align:right;"> 76 </td> </tr>
<tr> <td style="text-align:left;"> 36–40 </td> <td style="text-align:right;"> 74 </td> </tr>
</tbody>
</table>

<table>
<thead>
<tr> <th style="text-align:left;"> age range </th> <th style="text-align:right;"> count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> 41–45 </td> <td style="text-align:right;"> 54 </td> </tr>
<tr> <td style="text-align:left;"> 46–50 </td> <td style="text-align:right;"> 50 </td> </tr>
<tr> <td style="text-align:left;"> 51–55 </td> <td style="text-align:right;"> 26 </td> </tr>
<tr> <td style="text-align:left;"> 56–60 </td> <td style="text-align:right;"> 22 </td> </tr>
<tr> <td style="text-align:left;"> 61–65 </td> <td style="text-align:right;"> 16 </td> </tr>
<tr> <td style="text-align:left;"> 66–70 </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> 71–75 </td> <td
style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> 76–80 </td> <td style="text-align:right;"> 0 </td> </tr>
</tbody>
</table>

]

--

.pull-right[
<img src="visualizing-distributions-1_files/figure-html/titanic-age-hist-1.svg" width="100%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Histograms depend on the chosen bin width

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-age-hist-binwidth-1.svg" width="75%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Alternative to histogram: Kernel density estimate (KDE)

.pull-left[
![](visualizing-distributions-1_files/figure-html/titanic-age-hist2-1.svg)<!-- -->
]

--

.pull-right[
![](visualizing-distributions-1_files/figure-html/titanic-age-kde-1.svg)<!-- -->
]

--

Histograms show raw counts; KDEs show proportions (total area = 1).

???

Figures redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## KDEs also depend on parameter settings

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-age-kde-grid-1.svg" width="75%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Careful: KDEs can show nonsensical data

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-age-kde-wrong-1.svg" width="70%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Careful: Are bars stacked or overlapping?

.pull-left[
![](visualizing-distributions-1_files/figure-html/titanic-age-hist-stacked-1.svg)<!-- -->
]

--

.pull-right[
![](visualizing-distributions-1_files/figure-html/titanic-age-hist-overlap-1.svg)<!-- -->
]

--

Stacked or overlapping histograms are rarely a good choice.

???

Figures redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Alternatively: Age pyramid

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-age-pyramid-1.svg" width="70%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

---

## Alternatively: KDEs showing proportions of total

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-age-props-1.svg" width="75%" />
]

???

Figure redrawn from [Claus O. Wilke. Fundamentals of Data Visualization. O'Reilly, 2019.](https://clauswilke.com/dataviz)

[//]: # "segment ends here"

---
class: center middle

## Histograms and density plots in **ggplot2**

---

## Getting the data

All examples will use the `titanic` dataset:

.tiny-font[
```r
library(tidyverse) # provides read_csv(), %>%, select(), and ggplot()

titanic <- read_csv("https://wilkelab.org/SDS375/datasets/titanic.csv") %>%
  select(age, sex, class, survived)
```
]

---

## Making histograms with ggplot: `geom_histogram()`

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram()
```
]

--

.center.small-font[
```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

<img src="visualizing-distributions-1_files/figure-html/titanic-hist-ggplot-demo-out-1.svg" width="50%" />
]

---

## Setting the bin width

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5)
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-hist-ggplot-demo2-out-1.svg" width="50%" />
]

--

Do you like the bin placement?

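---

## Sketch: counting cases per bin by hand

.small-font[
A minimal sketch of why bin placement matters, using made-up ages rather than the `titanic` data. The names `ages`, `binwidth`, and `breaks` are illustrative, and the right-closed intervals are an assumption meant to mirror the `closed = "right"` default of `stat_bin()`.

```r
# Hypothetical ages, invented for illustration (not the Titanic data)
ages <- c(0.5, 3, 4, 7, 12, 18, 21, 22, 24, 29, 33, 41)

binwidth <- 5

# Place bin edges at multiples of the bin width: 0, 5, 10, ...
breaks <- seq(0, ceiling(max(ages) / binwidth) * binwidth, by = binwidth)

# Count cases per bin; intervals are right-closed, e.g. (5,10],
# and include.lowest = TRUE makes the first bin [0,5]
counts <- table(cut(ages, breaks = breaks, include.lowest = TRUE))
counts
```

Shifting `breaks` by a constant can move cases between neighboring bins, so the same data can produce a different-looking histogram.
]
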
---

## Always set the center as well

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram(
    binwidth = 5,  # width of the bins
    center = 2.5   # center of one bin; the other bins are placed relative to it
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-hist-ggplot-demo3-out-1.svg" width="50%" />
]

---

## Always set the center as well

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_histogram(
    binwidth = 5,  # width of the bins
    center = 12.5  # center of one bin; the other bins are placed relative to it
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-hist-ggplot-demo4-out-1.svg" width="50%" />
]

---

## Making density plots with ggplot: `geom_density()`

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(fill = "skyblue")
```
]

--

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-dens-ggplot-demo-out-1.svg" width="50%" />
]

---

## Making density plots with ggplot: `geom_density()`

.small-font[
```r
ggplot(titanic, aes(age)) +
  geom_density() # without fill
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-dens-ggplot-demo2-out-1.svg" width="50%" />
]

---

## Modifying bandwidth (`bw`) and kernel parameters

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(
    fill = "skyblue",
    bw = 0.5,            # a small bandwidth
    kernel = "gaussian"  # Gaussian kernel (the default)
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-dens-ggplot-demo3-out-1.svg" width="50%" />
]

---

## Modifying bandwidth (`bw`) and kernel parameters

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(
    fill = "skyblue",
    bw = 2,                 # a moderate bandwidth
    kernel = "rectangular"  # rectangular kernel
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-dens-ggplot-demo4-out-1.svg" width="50%" />
]

[//]: # "segment ends here"

---
class: center middle

## Setting stats explicitly in **ggplot2**

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_density(
    stat = "density",    # the default for geom_density()
    fill = "skyblue"
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo1-out-1.svg" width="50%" />
]

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_area(  # geom_area() does not normally use stat = "density"
    stat = "density",
    fill = "skyblue"
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo2-out-1.svg" width="50%" />
]

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_line(  # neither does geom_line()
    stat = "density"
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo3-out-1.svg" width="50%" />
]

---

## Statistical transformations (stats) can be set explicitly

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  # we can use multiple geoms on top of each other
  geom_area(stat = "density", fill = "skyblue") +
  geom_line(stat = "density")
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo4-out-1.svg" width="50%" />
]

---

## Parameters are handed through to the stat

.pull-left.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_line(stat = "density", bw = 3)
```

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo5-out-1.svg" width="90%" />
]]

.pull-right.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_line(stat = "density", bw = 0.3)
```

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo6-out-1.svg" width="90%" />
]]

--

Here, `bw` is a parameter of `stat_density()`, not of `geom_line()`.

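---

## Sketch: calling the stat directly

.tiny-font[
One way to check that `bw` belongs to the stat is to call `stat_density()` itself and compare the computed values. This is a sketch under assumptions: the built-in `faithful` dataset stands in for `titanic` so the code runs without a download, the names `p_geom`, `p_stat`, and `same_curve` are illustrative, and `position = "identity"` is set explicitly to match `geom_line()`.

```r
library(ggplot2)

# Route 1: a geom that borrows the density stat; `bw` is handed through
p_geom <- ggplot(faithful, aes(eruptions)) +
  geom_line(stat = "density", bw = 0.3)

# Route 2: the stat itself, told to draw its result with a line geom
p_stat <- ggplot(faithful, aes(eruptions)) +
  stat_density(geom = "line", position = "identity", bw = 0.3)

# layer_data() returns the values computed for a layer;
# both routes should yield the same density curve
same_curve <- isTRUE(all.equal(layer_data(p_geom)$y, layer_data(p_stat)$y))
same_curve
```
]
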
---

## We can explicitly map results from stat computations

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_tile(  # geom_tile() draws rectangular colored areas
    aes(
      y = 1,  # draw all tiles at the same y location
      fill = after_stat(density)  # use computed density for fill
    ),
    stat = "density",
    n = 20    # number of points calculated by stat_density()
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo7-out-1.svg" width="90%" />
]

---

## We can explicitly map results from stat computations

.tiny-font[
```r
ggplot(titanic, aes(age)) +
  geom_tile(  # geom_tile() draws rectangular colored areas
    aes(
      y = 1,  # draw all tiles at the same y location
      fill = after_stat(density)  # use computed density for fill
    ),
    stat = "density",
    n = 200   # number of points calculated by stat_density()
  )
```
]

.center[
<img src="visualizing-distributions-1_files/figure-html/titanic-stat-demo8-out-1.svg" width="90%" />
]

[//]: # "segment ends here"

---

## Further reading

- Fundamentals of Data Visualization: [Chapter 7: Visualizing distributions](https://clauswilke.com/dataviz/histograms-density-plots.html)
- Data Visualization: A Practical Introduction: [4.6 Histograms and density plots](https://socviz.co/groupfacettx.html#histograms)
- **ggplot2** reference documentation: [`geom_histogram()`](https://ggplot2.tidyverse.org/reference/geom_histogram)
- **ggplot2** reference documentation: [`geom_density()`](https://ggplot2.tidyverse.org/reference/geom_density)