# A tibble: 3 × 2
species n
<fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
2025-01-28
Previous class:
filter()
select()
arrange()
count()
mutate()
This class:
group_by()
and summarize()
pivot_wider()
, pivot_longer()
left_join()
, inner_join()
, ...
group_by()
and summarize()
Previously, we counted like so:
Now let’s do it the hard way . . .
Let’s go back to the raw data:
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Let’s group it:
# A tibble: 344 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
And summarize:
Now let’s group by multiple variables:
# A tibble: 344 × 8
# Groups: species, island [5]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
And summarize:
count(...)
is a short-cut for:group_by(...) |> summarize(n = n())
The output is exactly the same
# A tibble: 3 × 5
species n mean_mass max_flipper_length percent_female
<fct> <int> <dbl> <int> <dbl>
1 Adelie 152 NA NA NA
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 NA NA NA
Each statement in summarize()
creates one new column
But why all the NA
s?
penguins |>
group_by(species) |>
summarize(
n = n(),
mean_mass = mean(body_mass_g, na.rm = TRUE),
max_flipper_length = max(flipper_length_mm, na.rm = TRUE),
percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex))
)
# A tibble: 3 × 5
species n mean_mass max_flipper_length percent_female
<fct> <int> <dbl> <int> <dbl>
1 Adelie 152 3701. 210 0.5
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 5076. 231 0.487
We typically need to tell R how exactly NA
s should be handled
pivot_wider()
and pivot_longer()
The NA
s indicate cases that don’t exist
penguins_wide <- penguins |>
count(species, island) |>
pivot_wider(names_from = "island", values_from = "n")
penguins_wide |>
pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA
Column specifications work just like in select()
:
# specify columns by subtraction
penguins_wide |>
pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA
Column specifications work just like in select()
:
# specify columns by explicit listing
penguins_wide |>
pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA
Column specifications work just like in select()
:
# specify columns by range
penguins_wide |>
pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n")
# A tibble: 9 × 3
species island n
<fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA
Tip: In case of doubt, use left_join()
In my experience, left_join()
is the right choice 70–80% of the time
summarise()