Visualizing distributions 1

Claus O. Wilke

2025-01-05

Histograms and density plots

Passengers on the Titanic

age   sex     class  survived
0.2   female  3rd    survived
0.3   male    3rd    died
0.8   male    2nd    survived
0.8   male    2nd    survived
0.8   male    3rd    survived
0.9   male    1st    survived
1.0   female  2nd    survived
1.0   female  3rd    survived
1.0   male    2nd    survived
1.0   male    2nd    survived
1.0   male    3rd    survived
1.5   female  3rd    died
1.5   female  3rd    died
2.0   female  1st    died
2.0   female  2nd    survived
2     female  3rd    died
2     female  3rd    died
2     male    2nd    survived
2     male    2nd    survived
2     male    2nd    survived
3     female  2nd    survived
3     female  3rd    survived
3     male    2nd    survived
3     male    2nd    survived
3     male    3rd    survived
3     male    3rd    survived
4     female  2nd    survived
4     female  2nd    survived
4     female  3rd    survived
4     female  3rd    survived
4     male    1st    survived
4     male    3rd    died
4     male    3rd    survived
5     female  3rd    survived
5     female  3rd    survived
5     male    3rd    died
6     female  2nd    survived
6     female  3rd    died
6     male    1st    survived
6     male    3rd    died
6     male    3rd    died
7     female  2nd    survived
8     female  2nd    survived
8     female  2nd    survived
8     male    2nd    survived

Histogram: Define bins and count cases


age range  count
0–4        33
5–9        20
10–14      15
15–19      81
20–24      139
25–29      113
30–34      93
35–39      75
40–44      47
45–49      59
50–54      31
55–59      23
60–64      19
65–69      4
70–74      4
75–79      0
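
The binning step can be sketched in base R with cut() and table(). The ages vector below is made up for illustration (it is not the Titanic data), and treating the ranges as left-closed bins such as [0, 5) is an assumption about how the table above was built.

```r
# Hypothetical ages, for illustration only (not the Titanic data)
ages <- c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)

# Bin edges every 5 years; right = FALSE makes the bins left-closed:
# [0, 5), [5, 10), ...
breaks <- seq(0, 80, by = 5)
bins <- cut(ages, breaks = breaks, right = FALSE)

# Count the cases in each bin; empty bins are kept with a count of 0
counts <- table(bins)
print(counts)
```

Every observation lands in exactly one bin, so the counts add up to the number of cases.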

 

Histograms depend on the chosen bin width

 

Alternative to histogram: Kernel density estimate (KDE)


 

 

Histograms show raw counts, whereas KDEs show proportions:
the curve is scaled so that the total area under it equals 1.
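
A rough numerical check of the area claim, using density() from base R's stats package. The ages and the bandwidth of 5 are arbitrary choices for illustration, not values taken from the slides.

```r
# Hypothetical ages, for illustration only (not the Titanic data)
ages <- c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)

# Gaussian kernel density estimate on an evenly spaced grid of 512 points
kde <- density(ages, bw = 5, kernel = "gaussian", n = 512)

# Approximate the area under the curve with a Riemann sum
grid_step <- diff(kde$x[1:2])
area <- sum(kde$y) * grid_step

# Close to 1; a little mass is lost because the grid is cut off in the tails
print(area)
```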

KDEs also depend on parameter settings

 

Careful: KDEs can show nonsensical data
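
One way this can happen, sketched with base R's density() on made-up ages (not the Titanic data): a Gaussian kernel has unbounded support, so the estimate assigns some density to ages below zero even though no such ages exist. The bandwidth of 5 is an arbitrary choice.

```r
# Hypothetical ages, for illustration only (not the Titanic data)
ages <- c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)

kde <- density(ages, bw = 5)

# By default (cut = 3) the evaluation grid extends 3 bandwidths past the data,
# so it reaches well below age 0
print(range(kde$x))

# Approximate share of the estimated distribution at impossible (negative) ages
negative <- kde$x < 0
mass_below_zero <- sum(kde$y[negative]) * diff(kde$x[1:2])
print(mass_below_zero)
```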

 

Careful: Are bars stacked or overlapping?


 

 

Stacked or overlapping histograms are rarely a good choice.
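
For reference, a minimal sketch of how the two variants arise in ggplot2. geom_histogram() stacks groups by default (position = "stack"); overlapping bars require position = "identity". The tiny data frame is made up for illustration.

```r
library(ggplot2)

# Hypothetical passengers, for illustration only (not the Titanic data)
df <- data.frame(
  age = c(21, 22, 23, 21, 24),
  sex = c("female", "female", "female", "male", "male")
)

# Default position = "stack": bars of the two groups sit on top of each other
p_stacked <- ggplot(df, aes(age, fill = sex)) +
  geom_histogram(binwidth = 5, center = 2.5)

# position = "identity": both groups start at zero and overlap;
# alpha makes the bars partly transparent so both remain visible
p_overlapping <- ggplot(df, aes(age, fill = sex)) +
  geom_histogram(binwidth = 5, center = 2.5, position = "identity", alpha = 0.5)

# The tallest bar top differs: stacked bars add up, overlapping bars do not
top_stacked <- max(layer_data(p_stacked)$ymax)
top_overlapping <- max(layer_data(p_overlapping)$ymax)
print(c(stacked = top_stacked, overlapping = top_overlapping))
```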

Alternative: Age pyramid

 

Alternative: Densities showing proportions of total

 

Overlapping density plots usually look fine

 

Histograms and density plots in ggplot2

Getting the data

All examples will use the titanic dataset:

library(tidyverse)  # loads readr (read_csv), dplyr (select), and ggplot2

titanic <- read_csv("https://wilkelab.org/SDS366/datasets/titanic.csv") |>
  select(age, sex, class, survived)

titanic
# A tibble: 756 × 4
     age sex    class survived
   <dbl> <chr>  <chr> <chr>   
 1 29    female 1st   survived
 2  2    female 1st   died    
 3 30    male   1st   died    
 4 25    female 1st   died    
 5  0.92 male   1st   survived
 6 47    male   1st   survived
 7 63    female 1st   survived
 8 39    male   1st   died    
 9 58    female 1st   survived
10 71    male   1st   died    
# ℹ 746 more rows

Making histograms with ggplot: geom_histogram()

ggplot(titanic, aes(age)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

 

Setting the bin width

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5)

 

Do you like the bin placement?

Always set the bin center as well

ggplot(titanic, aes(age)) +
  geom_histogram(
    binwidth = 5,  # width of the bins
    center = 2.5   # center of one of the bins (here, the bin from 0 to 5)
  )

 

Always set the bin center as well

ggplot(titanic, aes(age)) +
  geom_histogram(
    binwidth = 5,  # width of the bins
    center = 12.5  # center of one of the bins (here, the bin from 10 to 15)
  )
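
A small check of what center does, on made-up ages (not the Titanic data). Any value that differs from a bin center by a multiple of binwidth describes the same bins, and boundary = 0 is an alternative way to pin them down. The helper bin_table() is hypothetical, written only for this sketch.

```r
library(ggplot2)

# Hypothetical ages, for illustration only (not the Titanic data)
df <- data.frame(
  age = c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)
)

# Hypothetical helper: build a histogram and return the bins ggplot2 computed
bin_table <- function(...) {
  p <- ggplot(df, aes(age)) + geom_histogram(binwidth = 5, ...)
  layer_data(p)[, c("xmin", "xmax", "count")]
}

bins_center_low  <- bin_table(center = 2.5)   # a bin centered on 2.5
bins_center_high <- bin_table(center = 12.5)  # a bin centered on 12.5
bins_boundary    <- bin_table(boundary = 0)   # a bin edge at 0

# All three specifications should describe the same bins
print(all.equal(bins_center_low, bins_center_high))
print(all.equal(bins_center_low, bins_boundary))
```

Only one of center or boundary can be given for a single layer.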

 

Making density plots with ggplot: geom_density()

ggplot(titanic, aes(age)) +
  geom_density(fill = "skyblue")

 

Making density plots with ggplot: geom_density()

ggplot(titanic, aes(age)) +
  geom_density() # without fill

 

Modifying bandwidth (bw) and kernel parameters

ggplot(titanic, aes(age)) +
  geom_density(
    fill = "skyblue",
    bw = 0.5,               # a small bandwidth
    kernel = "gaussian"     # Gaussian kernel (the default)
  )

 

Modifying bandwidth (bw) and kernel parameters

ggplot(titanic, aes(age)) +
  geom_density(
    fill = "skyblue",
    bw = 2,                 # a moderate bandwidth
    kernel = "rectangular"  # rectangular kernel
  )
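
When bw is not set, geom_density() defaults to bw = "nrd0", the same rule of thumb used by density() in base R. The sketch below, on made-up ages, shows that default and the effect of adjust, a multiplier on the bandwidth that geom_density() also accepts.

```r
# Hypothetical ages, for illustration only (not the Titanic data)
ages <- c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)

# Rule-of-thumb bandwidth chosen when bw is left at its default
default_bw <- bw.nrd0(ages)

# In density(), bw is the standard deviation of the smoothing kernel,
# regardless of which kernel shape is used
kde_default  <- density(ages)              # uses bw = "nrd0"
kde_smoother <- density(ages, adjust = 2)  # doubles the default bandwidth

print(c(default = kde_default$bw, doubled = kde_smoother$bw))
```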

 

Setting stats explicitly in ggplot2

Statistical transformations (stats) can be set explicitly

ggplot(titanic, aes(age)) +
  geom_density(
    stat = "density",    # the default for geom_density()
    fill = "skyblue"
  )

 

Statistical transformations (stats) can be set explicitly

ggplot(titanic, aes(age)) +
  geom_area(  # geom_area() does not normally use stat = "density"
    stat = "density",
    fill = "skyblue"
  )

 

Statistical transformations (stats) can be set explicitly

ggplot(titanic, aes(age)) +
  geom_line(  # neither does geom_line()
    stat = "density"
  )

 

Statistical transformations (stats) can be set explicitly

ggplot(titanic, aes(age)) +
  # we can use multiple geoms on top of each other
  geom_area(stat = "density", fill = "skyblue") +
  geom_line(stat = "density")

 

Parameters are handed through to the stat

ggplot(titanic, aes(age)) +
  geom_line(stat = "density", bw = 3)

 
ggplot(titanic, aes(age)) +
  geom_line(stat = "density", bw = 0.3)

 

Here, bw is a parameter of stat_density(), not of geom_line().
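
The same layer can also be written from the stat's side. The sketch below, on made-up ages, builds the curve both ways and compares the computed values; position = "identity" is set explicitly because stat_density() stacks by default.

```r
library(ggplot2)

# Hypothetical ages, for illustration only (not the Titanic data)
df <- data.frame(
  age = c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)
)

# Geom-first: choose the geom, then override its stat
p_geom <- ggplot(df, aes(age)) +
  geom_line(stat = "density", bw = 3)

# Stat-first: choose the stat, then override its geom
p_stat <- ggplot(df, aes(age)) +
  stat_density(geom = "line", position = "identity", bw = 3)

# Both layers should contain the same computed density curve
curve_geom <- layer_data(p_geom)[, c("x", "density")]
curve_stat <- layer_data(p_stat)[, c("x", "density")]
print(all.equal(curve_geom, curve_stat))
```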

We can explicitly map results from stat computations

ggplot(titanic, aes(age)) +
  geom_tile( # geom_tile() draws rectangular colored areas
    aes(
      y = 1, # draw all tiles at the same y location
      fill = after_stat(density)  # use computed density for fill
    ),
    stat = "density",
    n = 20    # number of points calculated by stat_density() 
  ) 

 

We can explicitly map results from stat computations

ggplot(titanic, aes(age)) +
  geom_tile( # geom_tile() draws rectangular colored areas
    aes(
      y = 1, # draw all tiles at the same y location
      fill = after_stat(density)  # use computed density for fill
    ),
    stat = "density",
    n = 200    # number of points calculated by stat_density() 
  ) 
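
after_stat() works with other stats as well. As one further sketch on made-up ages, mapping y to after_stat(density) in geom_histogram() rescales the bars so that their total area is 1, which puts a histogram and a density curve on the same scale. The bin and bandwidth settings are arbitrary choices.

```r
library(ggplot2)

# Hypothetical ages, for illustration only (not the Titanic data)
df <- data.frame(
  age = c(0.9, 2, 4, 7, 19, 22, 24, 25, 28, 30, 31, 36, 40, 47, 58, 63)
)

p <- ggplot(df, aes(age)) +
  # use the density computed by stat_bin() instead of the raw count
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 5, center = 2.5, fill = "skyblue"
  ) +
  # density curve drawn on the same y scale
  geom_line(stat = "density", bw = 3)

# Bar heights times bar widths should add up to 1
bars <- layer_data(p, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
print(total_area)
```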

 

Further reading