Data wrangling 2

Claus O. Wilke

2025-01-28

Elementary data manipulations

Previous class:

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()

This class:

  • Analyze subsets:
    group_by() and summarize()
  • Reshape:
    pivot_wider(), pivot_longer()
  • Combine datasets:
    left_join(), inner_join(), ...

Analyze subsets: group_by() and summarize()

Example application of grouping: Counting

Previously, we counted like so:

library(palmerpenguins) # loads the `penguins` dataset

penguins |>
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Now let’s do it the hard way . . .

Example application of grouping: Counting

Let’s go back to the raw data:

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Example application of grouping: Counting

Let’s group it:

penguins |>
  group_by(species)
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Example application of grouping: Counting

And summarize:

penguins |>
  group_by(species) |>
  summarize(
    n = n()  # n() returns the number of observations per group
  )
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Example application of grouping: Counting

Now let’s group by multiple variables:

penguins |>
  group_by(species, island)
# A tibble: 344 × 8
# Groups:   species, island [5]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Example application of grouping: Counting

And summarize:

penguins |>
  group_by(species, island) |>
  summarize(
    n = n()  # n() returns the number of observations per group
  )
# A tibble: 5 × 3
# Groups:   species [3]
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Example application of grouping: Counting

count(...) is a short-cut for:
group_by(...) |> summarize(n = n())

#
penguins |>
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
penguins |>
  group_by(species) |>
  summarize(n = n())
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

The output is exactly the same

Performing multiple summaries at once

penguins |>
  group_by(species) |>
  summarize(
    n = n(),                                      # number of penguins
    mean_mass = mean(body_mass_g),                # mean body mass
    max_flipper_length = max(flipper_length_mm),  # max flipper length
    percent_female = sum(sex == "female")/n()     # percent of female penguins
  )
# A tibble: 3 × 5
  species       n mean_mass max_flipper_length percent_female
  <fct>     <int>     <dbl>              <int>          <dbl>
1 Adelie      152       NA                  NA           NA  
2 Chinstrap    68     3733.                212            0.5
3 Gentoo      124       NA                  NA           NA  

Each statement in summarize() creates one new column

But why all the NAs?

Performing multiple summaries at once

penguins |>
  group_by(species) |>
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    max_flipper_length = max(flipper_length_mm, na.rm = TRUE),
    percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex))
  )
# A tibble: 3 × 5
  species       n mean_mass max_flipper_length percent_female
  <fct>     <int>     <dbl>              <int>          <dbl>
1 Adelie      152     3701.                210          0.5  
2 Chinstrap    68     3733.                212          0.5  
3 Gentoo      124     5076.                231          0.487

We typically need to tell R how exactly NAs should be handled

Reshaping: Making tables wider or longer

Reshape: pivot_wider() and pivot_longer()

Reshaping example: Making a wide summary table

penguins |>
  count(species, island)
# A tibble: 5 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Reshaping example: Making a wide summary table

penguins |>
  count(species, island) |>
  pivot_wider(names_from = "island", values_from = "n")
# A tibble: 3 × 4
  species   Biscoe Dream Torgersen
  <fct>      <int> <int>     <int>
1 Adelie        44    56        52
2 Chinstrap     NA    68        NA
3 Gentoo       124    NA        NA

The NAs indicate cases that don’t exist

And going back to long format

penguins_wide <- penguins |>
  count(species, island) |>
  pivot_wider(names_from = "island", values_from = "n")

penguins_wide |> 
  pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NA

And going back to long format

Column specifications work just like in select():

# specify columns by subtraction
penguins_wide |> 
  pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NA

And going back to long format

Column specifications work just like in select():

# specify columns by explicit listing
penguins_wide |> 
  pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n")
# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NA

And going back to long format

Column specifications work just like in select():

# specify columns by range
penguins_wide |> 
  pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n")
# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NA

Combining datasets: joins

We use joins to add columns from one table into another

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
# instruments played
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar


# add right table to left
left_join(band_members, band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
# instruments played
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar


# add left table to right
right_join(band_members, band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
# instruments played
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar


# keep intersection only
inner_join(band_members, band_instruments)
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Example: band members and their instruments

# some members of famous bands
band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
# instruments played
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar


# merge all cases
full_join(band_members, band_instruments)
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

The different joins at a glance

The different joins at a glance


Tip: In case of doubt, use left_join()

In my experience, left_join() is the right choice 70–80% of the time

Further reading