+ - 0:00:00
Notes for current slide
Notes for next slide

Data wrangling 2

Claus O. Wilke

last updated: 2022-09-14

Elementary data manipulations

Previous class:

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()

Elementary data manipulations

Previous class:

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()

This class:

  • Analyze subsets:
    group_by() and summarize()

Elementary data manipulations

Previous class:

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()

This class:

  • Analyze subsets:
    group_by() and summarize()
  • Reshape:
    pivot_wider(), pivot_longer()

Elementary data manipulations

Previous class:

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()

This class:

  • Analyze subsets:
    group_by() and summarize()
  • Reshape:
    pivot_wider(), pivot_longer()
  • Combine datasets:
    left_join(), inner_join(), ...

Analyze subsets: group_by() and summarize()


Example application of grouping: Counting

Previously, we counted like so:

library(palmerpenguins) # loads the `penguins` dataset
penguins %>%
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124

Example application of grouping: Counting

Previously, we counted like so:

library(palmerpenguins) # loads the `penguins` dataset
penguins %>%
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124

Now let's do it the hard way

Example application of grouping: Counting

Let's go back to the raw data:

penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>

Example application of grouping: Counting

Let's group it:

penguins %>%
group_by(species)
# A tibble: 344 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>

Example application of grouping: Counting

And summarize:

penguins %>%
group_by(species) %>%
summarize(
n = n() # n() returns the number of observations per group
)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124

Example application of grouping: Counting

Now let's group by multiple variables:

penguins %>%
group_by(species, island)
# A tibble: 344 × 8
# Groups: species, island [5]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>

Example application of grouping: Counting

And summarize:

penguins %>%
group_by(species, island) %>%
summarize(
n = n() # n() returns the number of observations per group
)
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: species [3]
species island n
<fct> <fct> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Dream 68
5 Gentoo Biscoe 124

Example application of grouping: Counting

count(...) is a short-cut for group_by(...) %>% summarize(n = n())

Example application of grouping: Counting

count(...) is a short-cut for group_by(...) %>% summarize(n = n())

penguins %>%
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124

Example application of grouping: Counting

count(...) is a short-cut for group_by(...) %>% summarize(n = n())

penguins %>%
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
penguins %>%
group_by(species) %>%
summarize(
n = n()
)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124

Example application of grouping: Counting

count(...) is a short-cut for group_by(...) %>% summarize(n = n())

penguins %>%
count(species)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
penguins %>%
group_by(species) %>%
summarize(
n = n()
)
# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124

The output is exactly the same

Performing multiple summaries at once

Performing multiple summaries at once

penguins %>%
group_by(species) %>%
summarize(
n = n(), # number of penguins
mean_mass = mean(body_mass_g), # mean body mass
max_flipper_length = max(flipper_length_mm), # max flipper length
percent_female = sum(sex == "female")/n() # percent of female penguins
)

Performing multiple summaries at once

penguins %>%
group_by(species) %>%
summarize(
n = n(), # number of penguins
mean_mass = mean(body_mass_g), # mean body mass
max_flipper_length = max(flipper_length_mm), # max flipper length
percent_female = sum(sex == "female")/n() # percent of female penguins
)
# A tibble: 3 × 5
species n mean_mass max_flipper_length percent_female
<fct> <int> <dbl> <int> <dbl>
1 Adelie 152 NA NA NA
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 NA NA NA

Performing multiple summaries at once

penguins %>%
group_by(species) %>%
summarize(
n = n(), # number of penguins
mean_mass = mean(body_mass_g), # mean body mass
max_flipper_length = max(flipper_length_mm), # max flipper length
percent_female = sum(sex == "female")/n() # percent of female penguins
)
# A tibble: 3 × 5
species n mean_mass max_flipper_length percent_female
<fct> <int> <dbl> <int> <dbl>
1 Adelie 152 NA NA NA
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 NA NA NA

Each statement in summarize() creates one new column

Performing multiple summaries at once

penguins %>%
group_by(species) %>%
summarize(
n = n(), # number of penguins
mean_mass = mean(body_mass_g), # mean body mass
max_flipper_length = max(flipper_length_mm), # max flipper length
percent_female = sum(sex == "female")/n() # percent of female penguins
)
# A tibble: 3 × 5
species n mean_mass max_flipper_length percent_female
<fct> <int> <dbl> <int> <dbl>
1 Adelie 152 NA NA NA
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 NA NA NA

Each statement in summarize() creates one new column

But why all the NAs?

Performing multiple summaries at once

penguins %>%
group_by(species) %>%
summarize(
n = n(),
mean_mass = mean(body_mass_g, na.rm = TRUE),
max_flipper_length = max(flipper_length_mm, na.rm = TRUE),
percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex))
)
# A tibble: 3 × 5
species n mean_mass max_flipper_length percent_female
<fct> <int> <dbl> <int> <dbl>
1 Adelie 152 3701. 210 0.5
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 5076. 231 0.487

We typically need to tell R how exactly NAs should be handled

Reshaping: Making tables wider or longer

Reshape: pivot_wider() and pivot_longer()


Reshaping example: Making a wide summary table

penguins %>%
count(species, island)
# A tibble: 5 × 3
species island n
<fct> <fct> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Dream 68
5 Gentoo Biscoe 124

Reshaping example: Making a wide summary table

penguins %>%
count(species, island) %>%
pivot_wider(names_from = "island", values_from = "n")
# A tibble: 3 × 4
species Biscoe Dream Torgersen
<fct> <int> <int> <int>
1 Adelie 44 56 52
2 Chinstrap NA 68 NA
3 Gentoo 124 NA NA

The NAs indicate cases that don't exist

And going back to long format

penguins_wide <- penguins %>%
count(species, island) %>%
pivot_wider(names_from = "island", values_from = "n")
penguins_wide %>%
pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA

And going back to long format

Column specifications work just like in select():

# specify columns by subtraction
penguins_wide %>%
pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA

And going back to long format

Column specifications work just like in select():

# specify columns by explicit listing
penguins_wide %>%
pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA

And going back to long format

Column specifications work just like in select():

# specify columns by range
penguins_wide %>%
pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA

Combining datasets: joins

We use joins to add columns from one table into another

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# instruments played by some band members
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# instruments played by some band members
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
left_join(band_members, band_instruments) # add right table to left
Joining, by = "name"
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# instruments played by some band members
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
right_join(band_members, band_instruments) # add left table to right
Joining, by = "name"
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# instruments played by some band members
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
inner_join(band_members, band_instruments) # keep intersection only
Joining, by = "name"
# A tibble: 2 × 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# instruments played by some band members
band_instruments
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
full_join(band_members, band_instruments) # merge all cases
Joining, by = "name"
# A tibble: 4 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar

The different joins at a glance

The different joins at a glance


Tip: In case of doubt, use left_join()

The different joins at a glance


Tip: In case of doubt, use left_join()

In my experience, left_join() is the right choice
70–80% of the time

Further reading

Elementary data manipulations

Previous class:

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow