Working with models
Introduction
In this worksheet, we will discuss how to efficiently fit statistical models (such as linear regressions) to subsets of data and then use for plotting.
First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.
Next we set up the data. We will be working with data on individual penguins in Antarctica.
Inserting data into text output
There are several utility functions we need to understand before we can fit models, process them with broom, and ultimately plot. These include nesting and unnesting of data tables, using map()
to apply a function to all the values in a data column, and using glue()
to generate generate text. We have discussed nesting/unnesting and map()
in the lecture on functional programming, and you may want to review this material if it is unclear to you. Here, we will discuss how to insert data into text output.
The glue()
function allows you to place variables into a text string. This is frequently useful when we want to process multiple subsets of a larger data table and generate output for each subset.
For example:
This also works for vectorized input.
Try this for yourself. Create variables holding your first and last name and then print out your complete name using glue()
.
<- "Claus"
first_name <- "Wilke"
last_name
glue("My name is ___.")
<- "Claus"
first_name <- "Wilke"
last_name
glue("My name is {first_name} {last_name}.")
Cleaning up models with broom
R has powerful functions to fit statistical models to data, such as lm()
to fit linear regression models. However, many of these functions have been written for interactive use and don’t work well in an automated data processing pipeline. For example, consider the following code to perform a linear regression analysis on the penguins dataset (ignoring for a moment that there are multiple species):
The fit
object stores information about the linear regression, and summary(fit)
shows us this information in a nice, human-readable form. But what if we want the relevant information in a data table? This is where the broom package comes in. The glance()
function extracts model-level summary data from a fitted object, and the tidy()
function extracts information about individual regression coefficients.
Try this yourself. Fit a model of bill length versus bill depth (formula: bill_length_mm ~ bill_depth_mm
), look at the model fit with summary()
, and then look at the model fit via glance()
and tidy()
.
# fit linear model
<- lm(bill_length_mm ~ bill_depth_mm, data = penguins)
fit
# inspect model fit with summary()
summary(___)
# inspect model fit with glance() and tidy()
# fit linear model
<- lm(bill_length_mm ~ bill_depth_mm, data = penguins)
fit
# inspect model fit with summary()
summary(fit)
# inspect model fit with glance() and tidy()
glance(___)
tidy(___)
# fit linear model
<- lm(bill_length_mm ~ bill_depth_mm, data = penguins)
fit
# inspect model fit with summary()
summary(fit)
# inspect model fit with glance() and tidy()
glance(fit)
tidy(fit)
The real power of glance()
and tidy()
becomes apparent in a more complex data analysis pipeline, when we fit a model to subsets of data via map()
and then combine the results from the individual fits into one large table.
Now run this code yourself one line at a time and make sure you understand at each step what is happening. Review the materials from the class on functional programming if anything is unclear. Note: The individual stages of the calculation are provided as hints, so you can just click through the hints one-by-one and run each piece.
|>
penguins nest(data = -species) |> # nest the data table by species
mutate(
# use map() to fit a model to each nested data table
fit = map(data, ~lm(bill_length_mm ~ body_mass_g, data = .x))
)
|>
penguins nest(data = -species) |> # nest the data table by species
mutate(
# use map() to fit a model to each nested data table
fit = map(data, ~lm(bill_length_mm ~ body_mass_g, data = .x)),
# use map to apply glance() to each model fit
glance_out = map(fit, glance)
)
|>
penguins nest(data = -species) |> # nest the data table by species
mutate(
# use map() to fit a model to each nested data table
fit = map(data, ~lm(bill_length_mm ~ body_mass_g, data = .x)),
# use map to apply glance() to each model fit
glance_out = map(fit, glance)
|>
) unnest(cols = glance_out) # unnest output from glance
|>
penguins nest(data = -species) |> # nest the data table by species
mutate(
# use map() to fit a model to each nested data table
fit = map(data, ~lm(bill_length_mm ~ body_mass_g, data = .x)),
# use map to apply glance() to each model fit
glance_out = map(fit, glance)
|>
) unnest(cols = glance_out) |> # unnest output from glance
select(-data, -fit) # remove columns data and fit
Plotting model fits
Finally, we use the results from the model fit to plot a p value on each facet of a regression plot. The plot we’ll be working with is the following:
We can generate the fitted models as in the previous section. We will store them in the penguins_fits
:
Now, do the following. First, use mutate()
, glue()
, and select()
to convert this table into one that has four columns, species
, body_mass_g
, bill_length_mm
, and label
. The species
column holds the penguin species. The next two columns will hold the coordinates of the text label. For example, the values body_mass_g = 5500
and bill_length_mm = 32
will work. The last column will hold labels, generated with glue()
, of the form “p = 7.48e-06”. You can use signif(p.value, 3)
to round p values to three significant digits.
Once you have this table, use geom_text()
to add the labels to the above plot.
# first do the data table manipulation
<- penguins_fits |>
labels_data mutate(
body_mass_g = ___,
bill_length_mm = ___,
label = ___
)
labels_data
# then plot
# first do the data table manipulation
<- penguins_fits |>
labels_data mutate(
body_mass_g = 5500,
bill_length_mm = 32,
label = glue("p = {signif(p.value, 3)}")
|>
) select(___)
labels_data
# then plot
# first do the data table manipulation
<- penguins_fits |>
labels_data mutate(
body_mass_g = 5500,
bill_length_mm = 32,
label = glue("p = {signif(p.value, 3)}")
|>
) select(species, body_mass_g, bill_length_mm, label)
labels_data
# then plot
# first do the data table manipulation
<- penguins_fits |>
labels_data mutate(
body_mass_g = 5500,
bill_length_mm = 32,
label = glue("p = {signif(p.value, 3)}")
|>
) select(species, body_mass_g, bill_length_mm, label)
# then plot
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE) +
facet_wrap(vars(species)) +
geom_text(___)
# first do the data table manipulation
<- penguins_fits |>
labels_data mutate(
body_mass_g = 5500,
bill_length_mm = 32,
label = glue("p = {signif(p.value, 3)}")
|>
) select(species, body_mass_g, bill_length_mm, label)
# then plot
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE) +
facet_wrap(vars(species)) +
geom_text(
data = labels_data,
aes(___)
)
# first do the data table manipulation
<- penguins_fits |>
labels_data mutate(
body_mass_g = 5500,
bill_length_mm = 32,
label = glue("p = {signif(p.value, 3)}")
|>
) select(species, body_mass_g, bill_length_mm, label)
# then plot
ggplot(penguins, aes(body_mass_g, bill_length_mm)) +
geom_point(na.rm = TRUE) +
geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE) +
facet_wrap(vars(species)) +
geom_text(
data = labels_data,
aes(label = label)
)
Once you have successfully made the plot, you can try a few more things:
- Place the labels for the different facets in different locations within each facet.
- Use
hjust
andvjust
ingeom_text()
to fine-tune where labels are placed. - Make labels that contain the R2 value in addition to the p value.