Data wrangling 2

class: center, middle, title-slide

.title[
# Data wrangling 2
]
.author[
### Claus O. Wilke
]
.date[
### last updated: 2024-02-12
]

---

## Elementary data manipulations

.pull-left[
Previous class:

- Pick rows: `filter()`
- Pick columns: `select()`
- Sort rows: `arrange()`
- Count things: `count()`
- Make new columns: `mutate()`
]

---

## Elementary data manipulations

.pull-left[
Previous class:

- Pick rows: `filter()`
- Pick columns: `select()`
- Sort rows: `arrange()`
- Count things: `count()`
- Make new columns: `mutate()`
]

.pull-right[
This class:

- Analyze subsets: `group_by()` and `summarize()`
]

---

## Elementary data manipulations

.pull-left[
Previous class:

- Pick rows: `filter()`
- Pick columns: `select()`
- Sort rows: `arrange()`
- Count things: `count()`
- Make new columns: `mutate()`
]

.pull-right[
This class:

- Analyze subsets: `group_by()` and `summarize()`
- Reshape: `pivot_wider()`, `pivot_longer()`
]

---

## Elementary data manipulations

.pull-left[
Previous class:

- Pick rows: `filter()`
- Pick columns: `select()`
- Sort rows: `arrange()`
- Count things: `count()`
- Make new columns: `mutate()`
]

.pull-right[
This class:

- Analyze subsets: `group_by()` and `summarize()`
- Reshape: `pivot_wider()`, `pivot_longer()`
- Combine datasets: `left_join()`, `inner_join()`, `...`
]

---

## Analyze subsets: `group_by()` and `summarize()`

.center[
<img src = "data-wrangling-2_files/summarize.svg", width = 90%></img>
]

---

## Example application of grouping: Counting

Previously, we counted like so:

.tiny-font[

```r
library(palmerpenguins) # loads the `penguins` dataset

penguins %>%
  count(species)
```

```
# A tibble: 3 × 2
 species n
 <fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
```
]

Now let's do it the hard way

---

## Example application of grouping: Counting

Let's go back to the raw data:

.tiny-font[

```r
penguins
```

```
# A tibble: 344 × 8
 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
 <fct> <fct> <dbl> <dbl> <int> <int>
 1 Adelie Torgersen 39.1 18.7 181 3750
 2 Adelie Torgersen 39.5 17.4 186 3800
 3 Adelie Torgersen 40.3 18 195 3250
 4 Adelie Torgersen NA NA NA NA
 5 Adelie Torgersen 36.7 19.3 193 3450
 6 Adelie Torgersen 39.3 20.6 190 3650
 7 Adelie Torgersen 38.9 17.8 181 3625
 8 Adelie Torgersen 39.2 19.6 195 4675
 9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
```
]

---

## Example application of grouping: Counting

Let's group it:

.tiny-font[

```r
penguins %>%
  group_by(species)
```

```
# A tibble: 344 × 8
# Groups: species [3]
 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
 <fct> <fct> <dbl> <dbl> <int> <int>
 1 Adelie Torgersen 39.1 18.7 181 3750
 2 Adelie Torgersen 39.5 17.4 186 3800
 3 Adelie Torgersen 40.3 18 195 3250
 4 Adelie Torgersen NA NA NA NA
 5 Adelie Torgersen 36.7 19.3 193 3450
 6 Adelie Torgersen 39.3 20.6 190 3650
 7 Adelie Torgersen 38.9 17.8 181 3625
 8 Adelie Torgersen 39.2 19.6 195 4675
 9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
```
]

---

## Example application of grouping: Counting

And summarize:

.tiny-font[

```r
penguins %>%
  group_by(species) %>%
  summarize(
    n = n()  # n() returns the number of observations per group
  )
```

```
# A tibble: 3 × 2
 species n
 <fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
```

]

---

## Example application of grouping: Counting

Now let's group by multiple variables:

.tiny-font[

```r
penguins %>%
  group_by(species, island)
```

```
# A tibble: 344 × 8
# Groups: species, island [5]
 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
 <fct> <fct> <dbl> <dbl> <int> <int>
 1 Adelie Torgersen 39.1 18.7 181 3750
 2 Adelie Torgersen 39.5 17.4 186 3800
 3 Adelie Torgersen 40.3 18 195 3250
 4 Adelie Torgersen NA NA NA NA
 5 Adelie Torgersen 36.7 19.3 193 3450
 6 Adelie Torgersen 39.3 20.6 190 3650
 7 Adelie Torgersen 38.9 17.8 181 3625
 8 Adelie Torgersen 39.2 19.6 195 4675
 9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
```
]

---

## Example application of grouping: Counting

And summarize:

.tiny-font[

```r
penguins %>%
  group_by(species, island) %>%
  summarize(
    n = n()  # n() returns the number of observations per group
  )
```

```
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
```

```
# A tibble: 5 × 3
# Groups: species [3]
 species island n
 <fct> <fct> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Dream 68
5 Gentoo Biscoe 124
```

]

---

## Example application of grouping: Counting

`count(...)` is a short-cut for `group_by(...) %>% summarize(n = n())`

.tiny-font.pull-left[

```r
penguins %>%
  count(species)
```

```
# A tibble: 3 × 2
 species n
 <fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
```

]

.tiny-font.pull-right[

```r
penguins %>%
  group_by(species) %>%
  summarize(
    n = n()
  )
```

```
# A tibble: 3 × 2
 species n
 <fct> <int>
1 Adelie 152
2 Chinstrap 68
3 Gentoo 124
```

]

The output is exactly the same

---

## Performing multiple summaries at once

.tiny-font[

```r
penguins %>%
  group_by(species) %>%
  summarize(
    n = n(),                                      # number of penguins
    mean_mass = mean(body_mass_g),                # mean body mass
    max_flipper_length = max(flipper_length_mm),  # max flipper length
    percent_female = sum(sex == "female")/n()     # percent of female penguins
  )
```
]

.tiny-font[

```
# A tibble: 3 × 5
 species n mean_mass max_flipper_length percent_female
 <fct> <int> <dbl> <int> <dbl>
1 Adelie 152 NA NA NA 
2 Chinstrap 68 3733. 212 0.5
3 Gentoo 124 NA NA NA 
```

]

Each statement in `summarize()` creates one new column

But why all the `NA`s?

---

## Performing multiple summaries at once

.tiny-font[

```r
penguins %>%
  group_by(species) %>%
  summarize(
    n = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    max_flipper_length = max(flipper_length_mm, na.rm = TRUE),
    percent_female = sum(sex == "female", na.rm = TRUE)/sum(!is.na(sex))
  )
```
]

.tiny-font[

```
# A tibble: 3 × 5
 species n mean_mass max_flipper_length percent_female
 <fct> <int> <dbl> <int> <dbl>
1 Adelie 152 3701. 210 0.5 
2 Chinstrap 68 3733. 212 0.5 
3 Gentoo 124 5076. 231 0.487
```

]

We typically need to tell R how exactly `NA`s should be handled

[//]: # "segment ends here"

---
class: center middle

## Reshaping: Making tables wider or longer

---

## Reshape: `pivot_wider()` and `pivot_longer()`

.center[
<img src = "data-wrangling-2_files/pivot.svg", width = 50%></img>
]

---

## Reshaping example: Making a wide summary table

.tiny-font[

```r
penguins %>%
  count(species, island)
```

```
# A tibble: 5 × 3
 species island n
 <fct> <fct> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Dream 68
5 Gentoo Biscoe 124
```
]

---

## Reshaping example: Making a wide summary table

.tiny-font[

```r
penguins %>%
  count(species, island) %>%
  pivot_wider(names_from = "island", values_from = "n")
```

```
# A tibble: 3 × 4
 species Biscoe Dream Torgersen
 <fct> <int> <int> <int>
1 Adelie 44 56 52
2 Chinstrap NA 68 NA
3 Gentoo 124 NA NA
```
]

The `NA`s indicate cases that don't exist

---

## And going back to long format

.tiny-font[

```r
penguins_wide <- penguins %>%
 count(species, island) %>%
 pivot_wider(names_from = "island", values_from = "n")

penguins_wide %>% 
  pivot_longer(cols = -species, names_to = "island", values_to = "n")
```

```
# A tibble: 9 × 3
 species island n
 <fct> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Biscoe NA
5 Chinstrap Dream 68
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 124
8 Gentoo Dream NA
9 Gentoo Torgersen NA
```
]

---

## And going back to long format

Column specifications work just like in `select()`:

.tiny-font[

```r
# specify columns by subtraction
penguins_wide %>% 
  pivot_longer(cols = -species, names_to = "island", values_to = "n")
```

---

## And going back to long format

Column specifications work just like in `select()`:

.tiny-font[

```r
# specify columns by explicit listing
penguins_wide %>% 
  pivot_longer(cols = c(Biscoe, Dream, Torgersen), names_to = "island", values_to = "n")
```

]

---

## And going back to long format

Column specifications work just like in `select()`:

.tiny-font[

```r
# specify columns by range
penguins_wide %>% 
  pivot_longer(cols = Biscoe:Torgersen, names_to = "island", values_to = "n")
```

]

[//]: # "segment ends here"

---
class: center middle

## Combining datasets: joins

---

## We use joins to add columns from one table into another

.center[
<img src = "data-wrangling-2_files/join.svg", width = 75%></img>
]

---

## Example: band members and their instruments

.tiny-font.pull-left[

```r
# some members of famous bands
band_members
```

```
# A tibble: 3 × 2
 name band 
 <chr> <chr> 
1 Mick Stones 
2 John Beatles
3 Paul Beatles
```
]

.tiny-font.pull-right[

```r
# instruments played by some band members
band_instruments
```

```
# A tibble: 3 × 2
 name plays 
 <chr> <chr> 
1 John guitar
2 Paul bass 
3 Keith guitar
```
]

.tiny-font[

```r
left_join(band_members, band_instruments) # add right table to left
```

```
Joining with `by = join_by(name)`
```

```
# A tibble: 3 × 3
 name band plays 
 <chr> <chr> <chr> 
1 Mick Stones <NA> 
2 John Beatles guitar
3 Paul Beatles bass 
```
]

---

## Example: band members and their instruments

.tiny-font.pull-left[

```r
# some members of famous bands
band_members
```

```
# A tibble: 3 × 2
 name band 
 <chr> <chr> 
1 Mick Stones 
2 John Beatles
3 Paul Beatles
```
]

.tiny-font.pull-right[

```r
# instruments played by some band members
band_instruments
```

```
# A tibble: 3 × 2
 name plays 
 <chr> <chr> 
1 John guitar
2 Paul bass 
3 Keith guitar
```
]

.tiny-font[

```r
right_join(band_members, band_instruments) # add left table to right
```

```
Joining with `by = join_by(name)`
```

```
# A tibble: 3 × 3
 name band plays 
 <chr> <chr> <chr> 
1 John Beatles guitar
2 Paul Beatles bass 
3 Keith <NA> guitar
```
]

---

## Example: band members and their instruments

.tiny-font.pull-left[

```r
# some members of famous bands
band_members
```

```
# A tibble: 3 × 2
 name band 
 <chr> <chr> 
1 Mick Stones 
2 John Beatles
3 Paul Beatles
```
]

.tiny-font.pull-right[

```r
# instruments played by some band members
band_instruments
```

```
# A tibble: 3 × 2
 name plays 
 <chr> <chr> 
1 John guitar
2 Paul bass 
3 Keith guitar
```
]

.tiny-font[

```r
inner_join(band_members, band_instruments) # keep intersection only
```

```
Joining with `by = join_by(name)`
```

```
# A tibble: 2 × 3
 name band plays 
 <chr> <chr> <chr> 
1 John Beatles guitar
2 Paul Beatles bass 
```
]

---

## Example: band members and their instruments

.tiny-font.pull-left[

```r
# some members of famous bands
band_members
```

```
# A tibble: 3 × 2
 name band 
 <chr> <chr> 
1 Mick Stones 
2 John Beatles
3 Paul Beatles
```
]

.tiny-font.pull-right[

```r
# instruments played by some band members
band_instruments
```

```
# A tibble: 3 × 2
 name plays 
 <chr> <chr> 
1 John guitar
2 Paul bass 
3 Keith guitar
```
]

.tiny-font[

```r
full_join(band_members, band_instruments) # merge all cases
```

```
Joining with `by = join_by(name)`
```

```
# A tibble: 4 × 3
 name band plays 
 <chr> <chr> <chr> 
1 Mick Stones <NA> 
2 John Beatles guitar
3 Paul Beatles bass 
4 Keith <NA> guitar
```
]

---

## The different joins at a glance

.center[
<img src = "data-wrangling-2_files/joins-venn.svg", width = 65%></img>
]

---

## The different joins at a glance

**Tip:** In case of doubt, use `left_join()`

In my experience, `left_join()` is the right choice  
70–80% of the time

[//]: # "segment ends here"

---

## Further reading

- R for Data Science: [Chapter 5.6: Grouped summaries with `summarise()`](https://r4ds.had.co.nz/transform.html?q=group_by#grouped-summaries-with-summarise)
- R for Data Science: [Chapter 12.3: Pivoting](https://r4ds.had.co.nz/tidy-data.html?q=pivot#pivoting)
- R for Data Science: [Chapter 13: Relational data](https://r4ds.had.co.nz/relational-data.html)
- **dplyr** documentation: [Grouped data](https://dplyr.tidyverse.org/articles/grouping.html)
- **tidyr** documentation: [Pivoting](https://tidyr.tidyverse.org/articles/pivot.html)
- **dplyr** documentation: [Mutating joins](https://dplyr.tidyverse.org/articles/two-table.html#mutating-joins-1)