# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      1242025-06-18
Previous class:
filter()select()arrange()count()mutate()This class:
group_by() and summarize()pivot_wider(), pivot_longer()left_join(), inner_join(), ...group_by() and summarize()Previously, we counted like so:
Now let’s do it the hard way . . .
Let’s go back to the raw data:
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>Let’s group it:
# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>And summarize:
Now let’s group by multiple variables:
# A tibble: 344 × 8
# Groups:   species, island [5]
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>And summarize:
count(...) is a short-cut for:group_by(...) |> summarize(n = n())
The output is exactly the same
# A tibble: 3 × 5
  species       n mean_mass max_flipper_length percent_female
  <fct>     <int>     <dbl>              <int>          <dbl>
1 Adelie      152       NA                  NA           NA  
2 Chinstrap    68     3733.                212            0.5
3 Gentoo      124       NA                  NA           NA  Each statement in summarize() creates one new column
But why all the NAs?
penguins |>
  group_by(species) |>
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    max_flipper_length = max(flipper_length_mm, na.rm = TRUE),
    percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex))
  )# A tibble: 3 × 5
  species       n mean_mass max_flipper_length percent_female
  <fct>     <int>     <dbl>              <int>          <dbl>
1 Adelie      152     3701.                210          0.5  
2 Chinstrap    68     3733.                212          0.5  
3 Gentoo      124     5076.                231          0.487We typically need to tell R how exactly NAs should be handled
pivot_wider() and pivot_longer()The NAs indicate cases that don’t exist
penguins_wide <- penguins |>
  count(species, island) |>
  pivot_wider(names_from = "island", values_from = "n")
penguins_wide |> 
  pivot_longer(cols = -species, names_to = "island", values_to = "n")# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NAColumn specifications work just like in select():
# specify columns by subtraction
penguins_wide |> 
  pivot_longer(cols = -species, names_to = "island", values_to = "n")# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NAColumn specifications work just like in select():
# specify columns by explicit listing
penguins_wide |> 
  pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n")# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NAColumn specifications work just like in select():
# specify columns by range
penguins_wide |> 
  pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n")# A tibble: 9 × 3
  species   island        n
  <fct>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Biscoe       NA
5 Chinstrap Dream        68
6 Chinstrap Torgersen    NA
7 Gentoo    Biscoe      124
8 Gentoo    Dream        NA
9 Gentoo    Torgersen    NATip: In case of doubt, use left_join()
In my experience, left_join() is the right choice 70–80% of the time
summarise()