Data wrangling 1

Claus O. Wilke

2025-01-28

Elementary data manipulations

  • Pick rows: filter()
  • Pick columns: select()
  • Sort rows: arrange()
  • Count things: count()
  • Make new columns: mutate()

But first: the pipe operator |>

|> is pronounced “and then”

The pipe |> feeds data into functions

library(palmerpenguins) # loads the `penguins` dataset

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

The pipe |> feeds data into functions

library(palmerpenguins) # loads the `penguins` dataset

# head(penguins)
penguins |>
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

The pipe |> feeds data into functions

ggplot(penguins, aes(bill_length_mm, bill_depth_mm, color = species)) + 
  geom_point()

 

The pipe |> feeds data into functions

penguins |>
  ggplot(aes(bill_length_mm, bill_depth_mm, color = species)) +
  geom_point()

 

Old-school “magrittr” pipe: %>%

penguins %>%
  ggplot(aes(bill_length_mm, bill_depth_mm, color = species)) +
  geom_point()

 

Which to use? Native pipe or old-school pipe?

  • The native pipe |> is the future. I’ll be using it throughout.
  • The magrittr %>% works fine. You can use it if you llike to.

Picking rows or columns, and sorting

Pick rows from a table: filter()

Filter out penguins of species Gentoo

penguins |>
  filter(species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Filter out penguins with bill length > 50 mm

penguins |>
  filter(bill_length_mm > 50)
# A tibble: 52 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           50.2          14.3               218        5700
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           50.5          15.9               222        5550
 4 Gentoo  Biscoe           50.5          15.9               225        5400
 5 Gentoo  Biscoe           50.1          15                 225        5000
 6 Gentoo  Biscoe           50.4          15.3               224        5550
 7 Gentoo  Biscoe           54.3          15.7               231        5650
 8 Gentoo  Biscoe           50.7          15                 223        5550
 9 Gentoo  Biscoe           51.1          16.3               220        6000
10 Gentoo  Biscoe           52.5          15.6               221        5450
# ℹ 42 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Pick columns from a table: select()

Pick columns species, island, and sex

penguins |>
  select(species, island, sex)
# A tibble: 344 × 3
   species island    sex   
   <fct>   <fct>     <fct> 
 1 Adelie  Torgersen male  
 2 Adelie  Torgersen female
 3 Adelie  Torgersen female
 4 Adelie  Torgersen <NA>  
 5 Adelie  Torgersen female
 6 Adelie  Torgersen male  
 7 Adelie  Torgersen female
 8 Adelie  Torgersen male  
 9 Adelie  Torgersen <NA>  
10 Adelie  Torgersen <NA>  
# ℹ 334 more rows

Sort the rows in a table: arrange()

Sort penguins by bill length, ascending

penguins |>
  arrange(bill_length_mm)
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Sort penguins by bill length, descending

penguins |>
  arrange(desc(bill_length_mm))
# A tibble: 344 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo    Biscoe           59.6          17                 230        6050
 2 Chinstrap Dream            58            17.8               181        3700
 3 Gentoo    Biscoe           55.9          17                 228        5600
 4 Chinstrap Dream            55.8          19.8               207        4000
 5 Gentoo    Biscoe           55.1          16                 230        5850
 6 Gentoo    Biscoe           54.3          15.7               231        5650
 7 Chinstrap Dream            54.2          20.8               201        4300
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Gentoo    Biscoe           53.4          15.8               219        5500
10 Chinstrap Dream            52.8          20                 205        4550
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Counting things

Count things

penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Count things

penguins |>
  count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Count things

penguins |>
  count(species, island)
# A tibble: 5 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Use the pipe to build analysis pipelines

penguins |>
  filter(species == "Adelie")
# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Use the pipe to build analysis pipelines

penguins |>
  filter(species == "Adelie") |>
  select(island, sex)
# A tibble: 152 × 2
   island    sex   
   <fct>     <fct> 
 1 Torgersen male  
 2 Torgersen female
 3 Torgersen female
 4 Torgersen <NA>  
 5 Torgersen female
 6 Torgersen male  
 7 Torgersen female
 8 Torgersen male  
 9 Torgersen <NA>  
10 Torgersen <NA>  
# ℹ 142 more rows

Use the pipe to build analysis pipelines

penguins |>
  filter(species == "Adelie") |>
  select(island, sex) |>
  count(island, sex)
# A tibble: 8 × 3
  island    sex        n
  <fct>     <fct>  <int>
1 Biscoe    female    22
2 Biscoe    male      22
3 Dream     female    27
4 Dream     male      28
5 Dream     <NA>       1
6 Torgersen female    24
7 Torgersen male      23
8 Torgersen <NA>       5

Adding new columns to a table

Make a new table column: mutate()

Example: flipper length in cm

penguins |>
  select(species, island, flipper_length_mm)
# A tibble: 344 × 3
   species island    flipper_length_mm
   <fct>   <fct>                 <int>
 1 Adelie  Torgersen               181
 2 Adelie  Torgersen               186
 3 Adelie  Torgersen               195
 4 Adelie  Torgersen                NA
 5 Adelie  Torgersen               193
 6 Adelie  Torgersen               190
 7 Adelie  Torgersen               181
 8 Adelie  Torgersen               195
 9 Adelie  Torgersen               193
10 Adelie  Torgersen               190
# ℹ 334 more rows

Example: flipper length in cm

penguins |>
  select(species, island, flipper_length_mm) |>
  mutate(flipper_length_cm = flipper_length_mm / 10)
# A tibble: 344 × 4
   species island    flipper_length_mm flipper_length_cm
   <fct>   <fct>                 <int>             <dbl>
 1 Adelie  Torgersen               181              18.1
 2 Adelie  Torgersen               186              18.6
 3 Adelie  Torgersen               195              19.5
 4 Adelie  Torgersen                NA              NA  
 5 Adelie  Torgersen               193              19.3
 6 Adelie  Torgersen               190              19  
 7 Adelie  Torgersen               181              18.1
 8 Adelie  Torgersen               195              19.5
 9 Adelie  Torgersen               193              19.3
10 Adelie  Torgersen               190              19  
# ℹ 334 more rows

Make multiple columns at once

penguins |>
  select(species, island, flipper_length_mm) |>
  mutate(
    flipper_length_cm = flipper_length_mm / 10,  # <- notice the comma
    flipper_length_in = flipper_length_mm / 25.4
  )
# A tibble: 344 × 5
   species island    flipper_length_mm flipper_length_cm flipper_length_in
   <fct>   <fct>                 <int>             <dbl>             <dbl>
 1 Adelie  Torgersen               181              18.1              7.13
 2 Adelie  Torgersen               186              18.6              7.32
 3 Adelie  Torgersen               195              19.5              7.68
 4 Adelie  Torgersen                NA              NA               NA   
 5 Adelie  Torgersen               193              19.3              7.60
 6 Adelie  Torgersen               190              19                7.48
 7 Adelie  Torgersen               181              18.1              7.13
 8 Adelie  Torgersen               195              19.5              7.68
 9 Adelie  Torgersen               193              19.3              7.60
10 Adelie  Torgersen               190              19                7.48
# ℹ 334 more rows

Further reading