Data wrangling 1

Author

Claus O. Wilke

Introduction

In this worksheet, we will discuss how to perform basic data manipulations, such as filtering data rows that meet certain conditions, choosing data columns, and arranging data in ascending or descending order.

First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.

Next we set up the data. We will be working with data on individual penguins in Antarctica.

penguins

The pipe (symbol `|>`, read “and then”)

When writing complex data analysis pipelines, we frequently use the pipe operator |> to move data from one analysis step to the next. The pipe is pronounced “and then”, and it takes the data on its left and uses it as the first argument for the function on its right.

For example, to see the first few lines of a dataset, we often write head(data). Instead, we can write data |> head().

Try this yourself. Write code that displays the first few lines of the penguins dataset, using |> and head():

In older R code, you may also see %>% (called the “magrittr” pipe) instead of |> (the “native” pipe). While these two operators have subtle differences, for all intents and purposes you can treat them interchangeably. Try the previous exercise with the magrittr pipe instead of the native pipe.

Choosing data rows

The function filter() allows you to find rows in a dataset that meet one or more specific conditions. The syntax is data |> filter(condition), where condition is a logical condition. For example, filter(x > 5) would pick all rows for which the value in column x is greater than 5.

As an example, the following code picks all penguins from the island Biscoe:

Now it’s your turn. Pick all penguins from the island Dream:

You can also state multiple conditions, separated by a comma. For example, filter(x > 5, y < 2) would pick all rows for which the value in the column x is greater than 5 and the value in the column y is less than 2. Note that the conditions are combined via logical AND, both need to be satisfied for the row to be picked.

To try this out, pick all penguins of species Chinstrap from the island Dream:

Choosing data columns

The function select() allows you to pick specific data columns by name. This is frequently useful when a dataset has many more columns than we are interested in at the time. For example, if we are only interested in the penguins’ species, island, and sex, we could select these three columns:

Try this yourself, picking the columns representing the penguin species (species), the bill length (bill_length_mm), and then flipper length (flipper_length_mm).

Another situation that arises frequently is one where we want to remove specific columns. We can also do this with select(), but now write select(-column) to remove one or more columns.

Try this. Remove the column species.

And now remove both species and island.

Sorting data

The function arrange() allows you to sort data by one or more columns. For example, data |> arrange(x) would sort the data by increasing values of x, and data |> arrange(x, y) would sort the data first by x and then, for ties in x, by y.

As an example, the following code sorts penguins by their flipper length:

Now it’s your turn. Sort the penguins by bill length:

To arrange data in descending order, enclose the data column in desc(). For example, data |> arrange(desc(x)) would sort the data by decreasing values of x. (desc stands for “descending”.)

Try this out. Sort the penguins by bill length, from largest to smallest:

Counting

We frequently want to count how many times a particular value or combination of values occurs in a dataset. We do this using the count() function. For example, the following code counts how many penguins of the different species there are in the penguins dataset.

Now try this yourself. Count how many male and female penguins there are.

We can also perform more fine-grained counts, by providing the count() function with more than one column name. See if you can count how many male and female penguins the dataset contains for each species.

Now count how many penguins of each species the dataset contains for each island.

Chaining analysis steps into pipelines

We can chain multiple analysis steps into a pipeline by continuing to add “and then” statements. For example, data |> count(...) |> arrange(...) would first count and then sort the data.

Try this out by counting the number of penguins of each species and then sorting by the number of penguins.

You may remember from before that the default sorting is from the smallest to the largest value. Repeat the previous exercise but now arrange the penguin species from the most frequent to the least frequent:

Creating new data columns

The function mutate() allows you to add new columns to a data table. For example, data |> mutate(sum = x + y) would create a new column sum that is the sum of the columns x and y:

Note that the part to the left of the equals sign (here, sum) is the name of the new column, and the part to the right of the equals sign (here, x + y) is an R expression that evaluates to the values in the new column.

Now apply this concept to the penguins dataset. Add a new column bill_ratio that is the ratio of bill length and bill depth:

Counting with custom conditions

It is quite common that we want to count items that meet a specific condition. For example, let’s say we want to count how many penguins of different species have flippers longer than 200mm. To do this efficiently, we first create a new column that indicates whether the condition is met or not, and we then use count with that indicator column.

The easiest way to create indicator columns is via the function if_else(), which takes three arguments: a condition, a result if the condition is met, and a result if the condition is not met. The following example shows how to create an indicator column showing whether a variable is positive or negative:

Now try this yourself. Count how many penguins of different species have flippers longer than 200mm. Then sort your results from most frequent to least frequent.

Solution

penguins |>
  mutate(
    flipper_length = if_else(flipper_length_mm > 200, "greater than 200mm", "less than 200mm")
  ) |>
  count(species, flipper_length) |>
  arrange(desc(n))

Introduction

The pipe (symbol |>, read “and then”)

Choosing data rows

Choosing data columns

Sorting data

Counting

Chaining analysis steps into pipelines

Creating new data columns

Counting with custom conditions

The pipe (symbol `|>`, read “and then”)