class: center, middle, title-slide .title[ # Data wrangling 2 ] .author[ ### Claus O. Wilke ] .date[ ### last updated: 2022-09-14 ] --- ## Elementary data manipulations .pull-left[ Previous class: - Pick rows: `filter()` - Pick columns: `select()` - Sort rows: `arrange()` - Count things: `count()` - Make new columns: `mutate()` ] --- ## Elementary data manipulations .pull-left[ Previous class: - Pick rows: `filter()` - Pick columns: `select()` - Sort rows: `arrange()` - Count things: `count()` - Make new columns: `mutate()` ] .pull-right[ This class: - Analyze subsets:<br>`group_by()` and `summarize()` ] --- ## Elementary data manipulations .pull-left[ Previous class: - Pick rows: `filter()` - Pick columns: `select()` - Sort rows: `arrange()` - Count things: `count()` - Make new columns: `mutate()` ] .pull-right[ This class: - Analyze subsets:<br>`group_by()` and `summarize()` - Reshape:<br>`pivot_wider()`, `pivot_longer()` ] --- ## Elementary data manipulations .pull-left[ Previous class: - Pick rows: `filter()` - Pick columns: `select()` - Sort rows: `arrange()` - Count things: `count()` - Make new columns: `mutate()` ] .pull-right[ This class: - Analyze subsets:<br>`group_by()` and `summarize()` - Reshape:<br>`pivot_wider()`, `pivot_longer()` - Combine datasets:<br>`left_join()`, `inner_join()`, `...` ] --- ## Analyze subsets: `group_by()` and `summarize()` <br> .center[ <img src = "data-wrangling-2_files/summarize.svg", width = 90%></img> ] --- ## Example application of grouping: Counting Previously, we counted like so: .tiny-font[ ```r library(palmerpenguins) # loads the `penguins` dataset penguins %>% count(species) ``` ``` # A tibble: 3 × 2 species n <fct> <int> 1 Adelie 152 2 Chinstrap 68 3 Gentoo 124 ``` ] -- Now let's do it the hard way --- ## Example application of grouping: Counting Let's go back to the raw data: .tiny-font[ ```r penguins ``` ``` # A tibble: 344 × 8 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 2 more variables: sex <fct>, year <int> ``` ] --- ## Example application of grouping: Counting Let's group it: .tiny-font[ ```r penguins %>% group_by(species) ``` ``` # A tibble: 344 × 8 # Groups: species [3] species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 2 more variables: sex <fct>, year <int> ``` ] --- ## Example application of grouping: Counting And summarize: .tiny-font[ ```r penguins %>% group_by(species) %>% summarize( n = n() # n() returns the number of observations per group ) ``` ``` # A tibble: 3 × 2 species n <fct> <int> 1 Adelie 152 2 Chinstrap 68 3 Gentoo 124 ``` ] --- ## Example application of grouping: Counting Now let's group by multiple variables: .tiny-font[ ```r penguins %>% group_by(species, island) ``` ``` # A tibble: 344 × 8 # Groups: species, island [5] species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 2 more variables: sex <fct>, year <int> ``` ] --- ## Example application of grouping: Counting And summarize: .tiny-font[ ```r penguins %>% group_by(species, island) %>% summarize( n = n() # n() returns the number of observations per group ) ``` ``` `summarise()` has grouped output by 'species'. You can override using the `.groups` argument. ``` ``` # A tibble: 5 × 3 # Groups: species [3] species island n <fct> <fct> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Dream 68 5 Gentoo Biscoe 124 ``` ] --- ## Example application of grouping: Counting `count(...)` is a short-cut for `group_by(...) %>% summarize(n = n())` -- .tiny-font.pull-left[ ```r penguins %>% count(species) ``` ``` # A tibble: 3 × 2 species n <fct> <int> 1 Adelie 152 2 Chinstrap 68 3 Gentoo 124 ``` ] -- .tiny-font.pull-right[ ```r penguins %>% group_by(species) %>% summarize( n = n() ) ``` ``` # A tibble: 3 × 2 species n <fct> <int> 1 Adelie 152 2 Chinstrap 68 3 Gentoo 124 ``` ] -- The output is exactly the same --- ## Performing multiple summaries at once -- .tiny-font[ ```r penguins %>% group_by(species) %>% summarize( n = n(), # number of penguins mean_mass = mean(body_mass_g), # mean body mass max_flipper_length = max(flipper_length_mm), # max flipper length percent_female = sum(sex == "female")/n() # percent of female penguins ) ``` ] -- .tiny-font[ ``` # A tibble: 3 × 5 species n mean_mass max_flipper_length percent_female <fct> <int> <dbl> <int> <dbl> 1 Adelie 152 NA NA NA 2 Chinstrap 68 3733. 212 0.5 3 Gentoo 124 NA NA NA ``` ] -- Each statement in `summarize()` creates one new column -- But why all the `NA`s? --- ## Performing multiple summaries at once .tiny-font[ ```r penguins %>% group_by(species) %>% summarize( n = n(), mean_mass = mean(body_mass_g, na.rm = TRUE), max_flipper_length = max(flipper_length_mm, na.rm = TRUE), percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex)) ) ``` ] .tiny-font[ ``` # A tibble: 3 × 5 species n mean_mass max_flipper_length percent_female <fct> <int> <dbl> <int> <dbl> 1 Adelie 152 3701. 210 0.5 2 Chinstrap 68 3733. 212 0.5 3 Gentoo 124 5076. 231 0.487 ``` ] We typically need to tell R how exactly `NA`s should be handled [//]: # "segment ends here" --- class: center middle ## Reshaping: Making tables wider or longer --- ## Reshape: `pivot_wider()` and `pivot_longer()` <br> .center[ <img src = "data-wrangling-2_files/pivot.svg", width = 50%></img> ] --- ## Reshaping example: Making a wide summary table .tiny-font[ ```r penguins %>% count(species, island) ``` ``` # A tibble: 5 × 3 species island n <fct> <fct> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Dream 68 5 Gentoo Biscoe 124 ``` ] --- ## Reshaping example: Making a wide summary table .tiny-font[ ```r penguins %>% count(species, island) %>% pivot_wider(names_from = "island", values_from = "n") ``` ``` # A tibble: 3 × 4 species Biscoe Dream Torgersen <fct> <int> <int> <int> 1 Adelie 44 56 52 2 Chinstrap NA 68 NA 3 Gentoo 124 NA NA ``` ] The `NA`s indicate cases that don't exist --- ## And going back to long format .tiny-font[ ```r penguins_wide <- penguins %>% count(species, island) %>% pivot_wider(names_from = "island", values_from = "n") penguins_wide %>% pivot_longer(cols = -species, names_to = "island", values_to = "n") ``` ``` # A tibble: 9 × 3 species island n <fct> <chr> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Biscoe NA 5 Chinstrap Dream 68 6 Chinstrap Torgersen NA 7 Gentoo Biscoe 124 8 Gentoo Dream NA 9 Gentoo Torgersen NA ``` ] --- ## And going back to long format Column specifications work just like in `select()`: .tiny-font[ ```r # specify columns by subtraction penguins_wide %>% pivot_longer(cols = -species, names_to = "island", values_to = "n") ``` ``` # A tibble: 9 × 3 species island n <fct> <chr> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Biscoe NA 5 Chinstrap Dream 68 6 Chinstrap Torgersen NA 7 Gentoo Biscoe 124 8 Gentoo Dream NA 9 Gentoo Torgersen NA ``` ] --- ## And going back to long format Column specifications work just like in `select()`: .tiny-font[ ```r # specify columns by explicit listing penguins_wide %>% pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n") ``` ``` # A tibble: 9 × 3 species island n <fct> <chr> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Biscoe NA 5 Chinstrap Dream 68 6 Chinstrap Torgersen NA 7 Gentoo Biscoe 124 8 Gentoo Dream NA 9 Gentoo Torgersen NA ``` ] --- ## And going back to long format Column specifications work just like in `select()`: .tiny-font[ ```r # specify columns by range penguins_wide %>% pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n") ``` ``` # A tibble: 9 × 3 species island n <fct> <chr> <int> 1 Adelie Biscoe 44 2 Adelie Dream 56 3 Adelie Torgersen 52 4 Chinstrap Biscoe NA 5 Chinstrap Dream 68 6 Chinstrap Torgersen NA 7 Gentoo Biscoe 124 8 Gentoo Dream NA 9 Gentoo Torgersen NA ``` ] [//]: # "segment ends here" --- class: center middle ## Combining datasets: joins --- ## We use joins to add columns from one table into another .center[ <img src = "data-wrangling-2_files/join.svg", width = 75%></img> ] --- ## Example: band members and their instruments .tiny-font.pull-left[ ```r # some members of famous bands band_members ``` ``` # A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] -- .tiny-font.pull-right[ ```r # instruments played by some band members band_instruments ``` ``` # A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ] -- .tiny-font[ ```r left_join(band_members, band_instruments) # add right table to left ``` ``` Joining, by = "name" ``` ``` # A tibble: 3 × 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass ``` ] --- ## Example: band members and their instruments .tiny-font.pull-left[ ```r # some members of famous bands band_members ``` ``` # A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .tiny-font.pull-right[ ```r # instruments played by some band members band_instruments ``` ``` # A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ] .tiny-font[ ```r right_join(band_members, band_instruments) # add left table to right ``` ``` Joining, by = "name" ``` ``` # A tibble: 3 × 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass 3 Keith <NA> guitar ``` ] --- ## Example: band members and their instruments .tiny-font.pull-left[ ```r # some members of famous bands band_members ``` ``` # A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .tiny-font.pull-right[ ```r # instruments played by some band members band_instruments ``` ``` # A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ] .tiny-font[ ```r inner_join(band_members, band_instruments) # keep intersection only ``` ``` Joining, by = "name" ``` ``` # A tibble: 2 × 3 name band plays <chr> <chr> <chr> 1 John Beatles guitar 2 Paul Beatles bass ``` ] --- ## Example: band members and their instruments .tiny-font.pull-left[ ```r # some members of famous bands band_members ``` ``` # A tibble: 3 × 2 name band <chr> <chr> 1 Mick Stones 2 John Beatles 3 Paul Beatles ``` ] .tiny-font.pull-right[ ```r # instruments played by some band members band_instruments ``` ``` # A tibble: 3 × 2 name plays <chr> <chr> 1 John guitar 2 Paul bass 3 Keith guitar ``` ] .tiny-font[ ```r full_join(band_members, band_instruments) # merge all cases ``` ``` Joining, by = "name" ``` ``` # A tibble: 4 × 3 name band plays <chr> <chr> <chr> 1 Mick Stones <NA> 2 John Beatles guitar 3 Paul Beatles bass 4 Keith <NA> guitar ``` ] --- ## The different joins at a glance .center[ <img src = "data-wrangling-2_files/joins-venn.svg", width = 65%></img> ] --- ## The different joins at a glance <img src = "data-wrangling-2_files/joins-venn2.svg", width = 35%, style = "position:absolute; top: 5%; left: 62%;"></img> <br> **Tip:** In case of doubt, use `left_join()` -- In my experience, `left_join()` is the right choice 70–80% of the time [//]: # "segment ends here" --- ## Further reading - R for Data Science: [Chapter 5.6: Grouped summaries with `summarise()`](https://r4ds.had.co.nz/transform.html?q=group_by#grouped-summaries-with-summarise) - R for Data Science: [Chapter 12.3: Pivoting](https://r4ds.had.co.nz/tidy-data.html?q=pivot#pivoting) - R for Data Science: [Chapter 13: Relational data](https://r4ds.had.co.nz/relational-data.html) - **dplyr** documentation: [Grouped data](https://dplyr.tidyverse.org/articles/grouping.html) - **tidyr** documentation: [Pivoting](https://tidyr.tidyverse.org/articles/pivot.html) - **dplyr** documentation: [Mutating joins](https://dplyr.tidyverse.org/articles/two-table.html#mutating-joins-1)