Dealing with overplotting
Introduction
In this worksheet, we will discuss how to make 2D density plots and histograms, to effectively visualize scatter plots in which many points lie on top of one another.
First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.
Next we set up the data. We will be working with data on individual penguins in Antarctica.
penguins
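The setup code itself is not shown in this worksheet. A minimal sketch of what it might look like follows, assuming the data comes from the palmerpenguins package (an assumption, although its column names match the ones used in the plots below) and that ggplot2 is loaded directly.

```r
# Assumed setup; the worksheet's own setup code is not shown
library(ggplot2)         # ggplot(), geom_density_2d(), geom_bin2d(), geom_hex()
library(palmerpenguins)  # assumed source of the penguins data frame

# Note: geom_hex() additionally requires the hexbin package to be installed

# The three columns used throughout this worksheet
head(penguins[, c("species", "body_mass_g", "bill_length_mm")])
```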
2D density plots
2D density plots are an alternative to scatter plots. They visualize the density of points in the 2D plane and are particularly useful when the point density is so high that many points lie on top of one another. We will demonstrate 2D density plots for the penguins dataset, specifically for a scatter plot of bill length versus body mass.
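To make "density of points" concrete, here is a minimal sketch of the underlying idea in base R. The ggplot2 documentation describes geom_density_2d() as performing a 2D kernel density estimate with MASS::kde2d() and drawing the result as contours. The simulated data and the grid size of 50 are illustrative choices, not part of the worksheet.

```r
# Simulated data standing in for a dense scatter plot (not the penguins data)
set.seed(1)
x <- rnorm(500)
y <- 0.5 * x + rnorm(500, sd = 0.5)

# Estimate the point density on a 50 x 50 grid
dens <- MASS::kde2d(x, y, n = 50)

# dens$x and dens$y hold the grid coordinates, dens$z the estimated density
dim(dens$z)

# Draw the points, then overlay contour lines of the estimated density
plot(x, y, pch = 16, cex = 0.5, col = "grey50")
contour(dens, add = TRUE, nlevels = 5)
```

Each contour line connects locations with the same estimated density, so tightly nested lines mark regions where many points overlap.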
To create a 2D density plot, we simply add geom_density_2d(). Try this out.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
___ +
geom_point(na.rm = TRUE) +
theme_bw()
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_density_2d(na.rm = TRUE) +
geom_point(na.rm = TRUE) +
theme_bw()
You can change the number of contour lines shown by providing the bins argument to geom_density_2d(). For example, try bins = 5.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_density_2d(na.rm = TRUE, bins = ___) +
geom_point(na.rm = TRUE) +
theme_bw()
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_density_2d(na.rm = TRUE, bins = 5) +
geom_point(na.rm = TRUE) +
theme_bw()
Also try other values for bins.
The plots we have made so far did not distinguish between penguin species, but we know from earlier worksheets that there are three penguin species with quite different values for body mass and bill length. Modify the above plot so that both points and contour lines are colored by species.
ggplot(penguins, aes(body_mass_g, bill_length_mm, color = species)) +
geom_density_2d(na.rm = TRUE, bins = 5) +
geom_point(na.rm = TRUE) +
theme_bw()
Also try faceting by species.
ggplot(penguins, aes(body_mass_g, bill_length_mm, color = species)) +
geom_density_2d(na.rm = TRUE, bins = 5) +
geom_point(na.rm = TRUE) +
theme_bw() +
facet_wrap(~species)
Finally, you can use geom_density_2d_filled() to draw filled contour bands instead of contour lines. Try this out.
Hints:
- You cannot map penguin species to color when drawing contour bands.
- It helps to set alpha transparency for contour bands, e.g. alpha = 0.5.
- You may want to make the point size smaller so you can see the contour bands underneath the points.
Note: This example may not work in the live web environment.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_density_2d_filled(___) +
geom_point(na.rm = TRUE, size = ___) +
theme_bw() +
facet_wrap(~species)
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_density_2d_filled(na.rm = TRUE, bins = 5, alpha = 0.5) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species)
2D histograms
2D histograms are very similar to 2D density plots. They are generated by simply subdividing the 2D plane into regularly shaped regions (rectangles or hexagons), counting how many data points fall into each region, and then coloring each region by its count.
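The counting step described above can be sketched in base R. This is only an illustration of the idea with simulated data and a 5 x 5 grid of rectangles. It is not how ggplot2 implements its binning.

```r
# Simulated points in the square [0, 10] x [0, 10] (not the penguins data)
set.seed(1)
x <- runif(200, 0, 10)
y <- runif(200, 0, 10)

# Split each axis into 5 equal-width intervals
breaks <- seq(0, 10, length.out = 6)
x_bin <- cut(x, breaks = breaks, include.lowest = TRUE)
y_bin <- cut(y, breaks = breaks, include.lowest = TRUE)

# Cross-tabulate: each cell holds the number of points in one rectangle
counts <- table(x_bin, y_bin)
counts

# Color each rectangle by its count and draw the points on top
image(breaks, breaks, unclass(counts), xlab = "x", ylab = "y")
points(x, y, pch = 16, cex = 0.5)
```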
To make a 2D histogram with rectangular bins, you can use geom_bin2d(). Try this out.
Hint: Set bins = 5 to get a reasonable number of bins for this dataset.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_bin2d(___) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species)
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_bin2d(na.rm = TRUE, bins = 5) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species)
It helps to make the bins partially transparent (i.e., set alpha = 0.5). Also use a different sequential color scale.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_bin2d(na.rm = TRUE, bins = 5, ___) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill____
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_bin2d(na.rm = TRUE, bins = 5, alpha = 0.5) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()
You can control bins in a more fine-grained manner by setting the binwidth argument. It takes a vector of two numbers, where the first is the width of the bins (in data units) and the second is the height (also in data units). Make bins that are 10000 units wide and 5 units tall.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_bin2d(
na.rm = TRUE,
binwidth = ___,
alpha = 0.5
) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_bin2d(
na.rm = TRUE,
binwidth = c(10000, 5),
alpha = 0.5
) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()
Also try making bins that are 30 units tall and 2500 units wide.
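To see how a pair of bin widths in data units translates into bins, consider the following base R sketch. The six data points and the widths of 1000 and 5 are hypothetical values chosen for illustration. ggplot2 picks its own bin origin, so the exact bin edges in a plot may differ from the ones computed here.

```r
# Hypothetical body masses (g) and bill lengths (mm), for illustration only
mass <- c(2900, 3400, 3450, 4700, 5200, 6100)
bill <- c(36.2, 38.9, 39.5, 46.5, 47.3, 50.8)

# Bin widths in data units: 1000 g wide, 5 mm tall
binwidth <- c(1000, 5)

# Lower edge of the bin that each point falls into
mass_bin <- floor(mass / binwidth[1]) * binwidth[1]
bill_bin <- floor(bill / binwidth[2]) * binwidth[2]

# Count the points in each (mass_bin, bill_bin) rectangle
counts <- as.data.frame(table(mass_bin, bill_bin))

# Show only the rectangles that contain at least one point
counts[counts$Freq > 0, ]
```

Wider bins pool more points into each rectangle, which smooths the picture but hides detail. Narrower bins do the opposite.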
Instead of rectangular bins, you can also make hexbins, with geom_hex(). It mostly works the same as geom_bin2d(). Try it out.
Hint: You will need to set the argument bins to an appropriate value for the hexbins to look good.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_hex(
___
) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_hex(
na.rm = TRUE,
bins = 5,
alpha = 0.5
) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()
Just as with geom_bin2d(), you can provide a binwidth argument consisting of two values, one controlling the width and the other the height. Try this out.
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_hex(
na.rm = TRUE,
binwidth = ___,
alpha = 0.5
) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_hex(
na.rm = TRUE,
binwidth = c(1000, 5),
alpha = 0.5
) +
geom_point(na.rm = TRUE, size = 0.2) +
theme_bw() +
facet_wrap(~species) +
scale_fill_viridis_c()