Previous class:
filter()
select()
arrange()
count()
mutate()
Previous class:
filter()
select()
arrange()
count()
mutate()
This class:
group_by()
and summarize()
Previous class:
filter()
select()
arrange()
count()
mutate()
This class:
group_by()
and summarize()
pivot_wider()
, pivot_longer()
Previous class:
filter()
select()
arrange()
count()
mutate()
This class:
group_by()
and summarize()
pivot_wider()
, pivot_longer()
left_join()
, inner_join()
, ...
group_by()
and summarize()
Previously, we counted like so:
library(palmerpenguins) # loads the `penguins` datasetpenguins %>% count(species)
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
Previously, we counted like so:
library(palmerpenguins) # loads the `penguins` datasetpenguins %>% count(species)
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
Now let's do it the hard way
Let's go back to the raw data:
penguins
# A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 347510 Adelie Torgersen 42 20.2 190 4250# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Let's group it:
penguins %>% group_by(species)
# A tibble: 344 × 8# Groups: species [3] species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 347510 Adelie Torgersen 42 20.2 190 4250# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
And summarize:
penguins %>% group_by(species) %>% summarize( n = n() # n() returns the number of observations per group )
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
Now let's group by multiple variables:
penguins %>% group_by(species, island)
# A tibble: 344 × 8# Groups: species, island [5] species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 347510 Adelie Torgersen 42 20.2 190 4250# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
And summarize:
penguins %>% group_by(species, island) %>% summarize( n = n() # n() returns the number of observations per group )
`summarise()` has grouped output by 'species'. You can override using the`.groups` argument.
# A tibble: 5 × 3# Groups: species [3] species island n <fct> <fct> <int>1 Adelie Biscoe 442 Adelie Dream 563 Adelie Torgersen 524 Chinstrap Dream 685 Gentoo Biscoe 124
count(...)
is a short-cut for group_by(...) %>% summarize(n = n())
count(...)
is a short-cut for group_by(...) %>% summarize(n = n())
penguins %>% count(species)
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
count(...)
is a short-cut for group_by(...) %>% summarize(n = n())
penguins %>% count(species)
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
penguins %>% group_by(species) %>% summarize( n = n() )
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
count(...)
is a short-cut for group_by(...) %>% summarize(n = n())
penguins %>% count(species)
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
penguins %>% group_by(species) %>% summarize( n = n() )
# A tibble: 3 × 2 species n <fct> <int>1 Adelie 1522 Chinstrap 683 Gentoo 124
The output is exactly the same
penguins %>% group_by(species) %>% summarize( n = n(), # number of penguins mean_mass = mean(body_mass_g), # mean body mass max_flipper_length = max(flipper_length_mm), # max flipper length percent_female = sum(sex == "female")/n() # percent of female penguins )
penguins %>% group_by(species) %>% summarize( n = n(), # number of penguins mean_mass = mean(body_mass_g), # mean body mass max_flipper_length = max(flipper_length_mm), # max flipper length percent_female = sum(sex == "female")/n() # percent of female penguins )
# A tibble: 3 × 5 species n mean_mass max_flipper_length percent_female <fct> <int> <dbl> <int> <dbl>1 Adelie 152 NA NA NA 2 Chinstrap 68 3733. 212 0.53 Gentoo 124 NA NA NA
penguins %>% group_by(species) %>% summarize( n = n(), # number of penguins mean_mass = mean(body_mass_g), # mean body mass max_flipper_length = max(flipper_length_mm), # max flipper length percent_female = sum(sex == "female")/n() # percent of female penguins )
# A tibble: 3 × 5 species n mean_mass max_flipper_length percent_female <fct> <int> <dbl> <int> <dbl>1 Adelie 152 NA NA NA 2 Chinstrap 68 3733. 212 0.53 Gentoo 124 NA NA NA
Each statement in summarize()
creates one new column
penguins %>% group_by(species) %>% summarize( n = n(), # number of penguins mean_mass = mean(body_mass_g), # mean body mass max_flipper_length = max(flipper_length_mm), # max flipper length percent_female = sum(sex == "female")/n() # percent of female penguins )
# A tibble: 3 × 5 species n mean_mass max_flipper_length percent_female <fct> <int> <dbl> <int> <dbl>1 Adelie 152 NA NA NA 2 Chinstrap 68 3733. 212 0.53 Gentoo 124 NA NA NA
Each statement in summarize()
creates one new column
But why all the NA
s?
penguins %>% group_by(species) %>% summarize( n = n(), mean_mass = mean(body_mass_g, na.rm = TRUE), max_flipper_length = max(flipper_length_mm, na.rm = TRUE), percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex)) )
# A tibble: 3 × 5 species n mean_mass max_flipper_length percent_female <fct> <int> <dbl> <int> <dbl>1 Adelie 152 3701. 210 0.5 2 Chinstrap 68 3733. 212 0.5 3 Gentoo 124 5076. 231 0.487
We typically need to tell R how exactly NA
s should be handled
pivot_wider()
and pivot_longer()
penguins %>% count(species, island)
# A tibble: 5 × 3 species island n <fct> <fct> <int>1 Adelie Biscoe 442 Adelie Dream 563 Adelie Torgersen 524 Chinstrap Dream 685 Gentoo Biscoe 124
penguins %>% count(species, island) %>% pivot_wider(names_from = "island", values_from = "n")
# A tibble: 3 × 4 species Biscoe Dream Torgersen <fct> <int> <int> <int>1 Adelie 44 56 522 Chinstrap NA 68 NA3 Gentoo 124 NA NA
The NA
s indicate cases that don't exist
penguins_wide <- penguins %>% count(species, island) %>% pivot_wider(names_from = "island", values_from = "n")penguins_wide %>% pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3 species island n <fct> <chr> <int>1 Adelie Biscoe 442 Adelie Dream 563 Adelie Torgersen 524 Chinstrap Biscoe NA5 Chinstrap Dream 686 Chinstrap Torgersen NA7 Gentoo Biscoe 1248 Gentoo Dream NA9 Gentoo Torgersen NA
Column specifications work just like in select()
:
# specify columns by subtractionpenguins_wide %>% pivot_longer(cols = -species, names_to = "island", values_to = "n")
# A tibble: 9 × 3 species island n <fct> <chr> <int>1 Adelie Biscoe 442 Adelie Dream 563 Adelie Torgersen 524 Chinstrap Biscoe NA5 Chinstrap Dream 686 Chinstrap Torgersen NA7 Gentoo Biscoe 1248 Gentoo Dream NA9 Gentoo Torgersen NA
Column specifications work just like in select()
:
# specify columns by explicit listingpenguins_wide %>% pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n")
# A tibble: 9 × 3 species island n <fct> <chr> <int>1 Adelie Biscoe 442 Adelie Dream 563 Adelie Torgersen 524 Chinstrap Biscoe NA5 Chinstrap Dream 686 Chinstrap Torgersen NA7 Gentoo Biscoe 1248 Gentoo Dream NA9 Gentoo Torgersen NA
Column specifications work just like in select()
:
# specify columns by rangepenguins_wide %>% pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n")
# A tibble: 9 × 3 species island n <fct> <chr> <int>1 Adelie Biscoe 442 Adelie Dream 563 Adelie Torgersen 524 Chinstrap Biscoe NA5 Chinstrap Dream 686 Chinstrap Torgersen NA7 Gentoo Biscoe 1248 Gentoo Dream NA9 Gentoo Torgersen NA
# some members of famous bandsband_members
# A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles3 Paul Beatles
# some members of famous bandsband_members
# A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles3 Paul Beatles
# instruments played by some band membersband_instruments
# A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar2 Paul bass 3 Keith guitar
# some members of famous bandsband_members
# A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles3 Paul Beatles
# instruments played by some band membersband_instruments
# A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar2 Paul bass 3 Keith guitar
left_join(band_members, band_instruments) # add right table to left
Joining, by = "name"
# A tibble: 3 × 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar3 Paul Beatles bass
# some members of famous bandsband_members
# A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles3 Paul Beatles
# instruments played by some band membersband_instruments
# A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar2 Paul bass 3 Keith guitar
right_join(band_members, band_instruments) # add left table to right
Joining, by = "name"
# A tibble: 3 × 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar2 Paul Beatles bass 3 Keith <NA> guitar
# some members of famous bandsband_members
# A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles3 Paul Beatles
# instruments played by some band membersband_instruments
# A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar2 Paul bass 3 Keith guitar
inner_join(band_members, band_instruments) # keep intersection only
Joining, by = "name"
# A tibble: 2 × 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar2 Paul Beatles bass
# some members of famous bandsband_members
# A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles3 Paul Beatles
# instruments played by some band membersband_instruments
# A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar2 Paul bass 3 Keith guitar
full_join(band_members, band_instruments) # merge all cases
Joining, by = "name"
# A tibble: 4 × 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar3 Paul Beatles bass 4 Keith <NA> guitar
Tip: In case of doubt, use left_join()
Tip: In case of doubt, use left_join()
In my experience, left_join()
is the right choice
70–80% of the time
summarise()
Previous class:
filter()
select()
arrange()
count()
mutate()
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |