Data wrangling 2

Author

Claus O. Wilke

Introduction

In this worksheet, we will continue with basic data manipulations, now moving on to grouping and summarizing, making data tables wider or longer, and joining data tables.

First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.

Next we set up the data. We will be working with data on individual penguins in Antarctica.

penguins

We’ll also be working with two datasets containing some information about Texas cities.

Analyzing subsets

In many data analysis settings, we want to break a dataset into subsets and then perform some summary calculation on each subset. The simplest example is counting, which we have done previously with the count() function.

This function subdivides the penguins dataset into subsets for each species and then calculates the number n for each subset, which is the number of observations in each subset.

The function count() here does all the work for us, but what if instead of counting we wanted to calculate the mean weight of the penguins for each species, or calculate the mean weight and count at the same time? We need a general framework that allows us to do these kinds of calculations with maximum flexibility.

The tidyverse approach is to first group a dataset with group_by() and then to calculate grouped summaries with summarize().

Grouping

Let’s first consider just grouping. If we look at the raw R output of just the penguins table or the penguins table after running it through group_by(species), we see that the table is the same, except in the second case there is a line # Groups: species [3] which indicates that the table is grouped by species and there are three groups. (Here, we need to pipe the tables into the print() function to see the raw R output instead of a formatted table that would hide the grouping information.)

We can also group by multiple data columns at once, and we can undo any grouping with ungroup().

Now try this yourself. Group the penguins dataset by sex and island.

Now undo the previous grouping.

Also verify what the output looks like when you omit the print() function at the end.

Performing summaries

Once we have set up a grouping for a data table, we can then calculate summary data with the summarize() function. This function works similarly to mutate(), in that we provide it with statements of the form <new column name> = <computation>, where <new column name> stands for the name of the new column that is being created and <computation> stands for the computation that is used to generate the values in the new column. As an example, if we want to calculate the mean weight (body mass) of penguins, we could write summarize(mean_weight = mean(body_mass_g)), and this would create a new column called mean_weight.

Try this out. First group by sex and then calculate the mean weight for each sex.

We see that male penguins on average are heavier than female penguins. We also see that there is a row containing NAs. This happens because there are a few entries in the dataset for which we know neither the penguins’ sex nor their weight.

Next, see if the pattern changes if we subdivide the dataset by species.

When running this code, you see a message that summarise() (the tidyverse uses British spelling internally) has grouped the output by species. This happens because if you group by multiple variables and then perform a summary it usually makes sense to keep all but the innermost groupings. Verify this is the case by piping the output from summarize() into print().

We can perform multiple summaries at once by adding more statements inside the summarize() function. To try this out, calculate the mean bill length in addition to the mean weight.

When performing summaries, we often want to know how many observations there are in each group (i.e., we want to count). We can do this with the function n(), which inside summarize() gives us the group size. So, we can count by adding a statement such as count = n() inside summarize(). Try this out.

Removing missing values

If we try the previous calculation but grouping only by species and not by sex, we notice a problem: Most values have been replaced by NA.

This happens because R does not like to calculate the mean of a set of numbers where at least one is missing. Whenever there are missing values, we need to think carefully whether we can just ignore them or need to do something more sophisticated. In the penguins dataset there are only a handful of missing values, and therefore ignoring them is fine.

We can ignore missing values by setting na.rm = TRUE inside the mean() function. Try this out.

Alternatively, we could filter out all rows that contain NAs in the columns of interest. We test whether a column contains NAs with is.na(<column name>), and to keep rows without NAs we use !is.na(<column name>). Try this out.

Solution

penguins |>
  filter(!is.na(body_mass_g), !is.na(bill_length_mm)) |>
  group_by(species) |>
  summarize(
    mean_weight = mean(body_mass_g),
    mean_bill_length = mean(bill_length_mm)
  )

Making tables wider or longer

For efficient data processing, we usually want tables in long form, where each columns is one variable and each row is one observation. However, in some applications, for example when making a table easier to read for humans, a wide format can be preferred. In a wide format, some variables are displayed as column names, and other variables are distributed over multiple columns.

Consider the following two versions of a summary table. The first is in long format, where sex is one column and the mean weight is another.

The second is in wide format, where the values of the sex variable (female or male) are used as column headings, and the mean weight values are distributed over these two columns.

You can turn a long table into a wide table using the function pivot_wider(), which takes two arguments: names_from specifies the long column from which the new wide column names should be taken (here e.g., "sex"), and values_from specifies the long column from which the new wide values should be taken (here e.g., "mean_weight"). Try this out.

You can similarly turn a wide table into a long one using the function pivot_longer(). It has arguments cols, names_to, and values_to. cols specifies the wide columns on which to operate, names_to specifies into which long columns the names of the wide columns should be written, and values_to specfies into which long columns the values of the wide columns should be written. Note that while names_to and values_to are specified as strings (that is, in quotes, such as names_to = "sex"), the cols argument does not use quotes (e.g., cols = c(female, male)).

Try this out.

Notice how the arguments names_from and values_from in pivot_wider() are exactly equivalent to names_to and values_to in pivot_longer().

Combining datasets with joins

Finally, we sometimes encounter the situation where we have two data sets that contain different pieces of information about the same subjects or objects, and we need to merge these tables for further analysis. In this situation, we need to perform a join, and there are multiple different types of joins available: left_join(), right_join(), inner_join(), full_join(). These joins all differ in how they handle cases where an observation is present in only one of the two tables but missing in the other.

Two explore joins, consider the following two datasets, which contain the population number of three Texas cities and the city areas, respectively. The cities in the two tables are not the same, on purpose.

Try to merge TX_area into TX_population, using left_join().

What happens if you reverse the two arguments?

Now try the same with full_join().

Try also right_join() and inner_join(). See if you can describe what each join function does and how they differ from one another.