class8_solutions.utf8.md

In-class worksheet 8

Feb 14, 2019

In this worksheet, we will use the library tidyverse:

library(tidyverse)

1. Making wide tables longer

Consider the following data set, which contains information about income and religious affiliation in the US:

pew <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/pew.csv")

## Parsed with column specification:
## cols(
##   religion = col_character(),
##   below10k = col_integer(),
##   from10to20k = col_integer(),
##   from20to30k = col_integer(),
##   from30to40k = col_integer(),
##   from40to50k = col_integer(),
##   from50to75k = col_integer(),
##   from75to100k = col_integer(),
##   from100to150k = col_integer(),
##   above150k = col_integer(),
##   no_answer = col_integer()
## )

head(pew)

## # A tibble: 6 x 11
##   religion below10k from10to20k from20to30k from30to40k from40to50k
##   <chr>       <int>       <int>       <int>       <int>       <int>
## 1 Agnostic       27          34          60          81          76
## 2 Atheist        12          27          37          52          35
## 3 Buddhist       27          21          30          34          33
## 4 Catholic      418         617         732         670         638
## 5 Don't k…       15          14          15          11          10
## 6 Evangel…      575         869        1064         982         881
## # … with 5 more variables: from50to75k <int>, from75to100k <int>,
## #   from100to150k <int>, above150k <int>, no_answer <int>

This table is not tidy, because income levels are used as column headers rather than as levels of an income variable.

Use gather() to turn this table into a table with three columns, one for religion, one for income (called income), and one for the count of people with the respective combination of income and religion (called count).

pew %>% 
  gather(income, count, below10k:no_answer) %>%
  head()

## # A tibble: 6 x 3
##   religion           income   count
##   <chr>              <chr>    <int>
## 1 Agnostic           below10k    27
## 2 Atheist            below10k    12
## 3 Buddhist           below10k    27
## 4 Catholic           below10k   418
## 5 Don't know/refused below10k    15
## 6 Evangelical Prot   below10k   575

Now call the income column income_level and the count column number_of_people.

pew %>%
  gather(income_level, number_of_people, below10k:no_answer) %>%
  head()

## # A tibble: 6 x 3
##   religion           income_level number_of_people
##   <chr>              <chr>                   <int>
## 1 Agnostic           below10k                   27
## 2 Atheist            below10k                   12
## 3 Buddhist           below10k                   27
## 4 Catholic           below10k                  418
## 5 Don't know/refused below10k                   15
## 6 Evangelical Prot   below10k                  575

Now, instead of gathering data from all columns, gather only the data from columns below10k, from20to30k, and from50to75k, such that your final data frame contains only these three income levels. Sort your final data frame according to religion and then income_level.

pew %>% 
  select(religion, below10k, from20to30k, from50to75k) %>% 
  gather(income_level, number_of_people, below10k:from50to75k) %>%
  arrange(religion, income_level) %>%
  head()

## # A tibble: 6 x 3
##   religion income_level number_of_people
##   <chr>    <chr>                   <int>
## 1 Agnostic below10k                   27
## 2 Agnostic from20to30k                60
## 3 Agnostic from50to75k               137
## 4 Atheist  below10k                   12
## 5 Atheist  from20to30k                37
## 6 Atheist  from50to75k                70

2. Making long tables wider

Consider the following data set, which contains information about the sex, weight, and height of 200 individuals:

persons <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/persons.csv")

## Parsed with column specification:
## cols(
##   subject = col_integer(),
##   indicator = col_character(),
##   value = col_character()
## )

head(persons)

## # A tibble: 6 x 3
##   subject indicator value
##     <int> <chr>     <chr>
## 1       1 sex       M    
## 2       1 weight    77   
## 3       1 height    182  
## 4       2 sex       F    
## 5       2 weight    58   
## 6       2 height    161

Is this data set tidy? And can you rearrange it so that you have one column for subject, one for sex, one for weight, and one for height?

The data set is not tidy, because neither indicator nor value are variables. The variables are subject, sex, weight, height. The following version of the table is tidy:

persons %>%
  spread(indicator, value) %>%
  head()

## # A tibble: 6 x 4
##   subject height sex   weight
##     <int> <chr>  <chr> <chr> 
## 1       1 182    M     77    
## 2       2 161    F     58    
## 3       3 161    F     53    
## 4       4 177    M     68    
## 5       5 157    F     59    
## 6       6 170    M     76

For the data set diamonds from the ggplot2 package, create a table displaying the mean price for each combination of cut and clarity. Then use spread() to rearrange this table into a wide format, such that there is a column of mean prices for each cut level (Fair, Good, Very Good, etc.).

price_by_cut <-
  diamonds %>%
  group_by(cut, clarity) %>% 
  summarize(mean.price = mean(price))
price_by_cut

## # A tibble: 40 x 3
## # Groups:   cut [5]
##    cut   clarity mean.price
##    <ord> <ord>        <dbl>
##  1 Fair  I1           3704.
##  2 Fair  SI2          5174.
##  3 Fair  SI1          4208.
##  4 Fair  VS2          4175.
##  5 Fair  VS1          4165.
##  6 Fair  VVS2         3350.
##  7 Fair  VVS1         3871.
##  8 Fair  IF           1912.
##  9 Good  I1           3597.
## 10 Good  SI2          4580.
## # … with 30 more rows

price_by_cut %>% spread(cut, mean.price)

## # A tibble: 8 x 6
##   clarity  Fair  Good `Very Good` Premium Ideal
##   <ord>   <dbl> <dbl>       <dbl>   <dbl> <dbl>
## 1 I1      3704. 3597.       4078.   3947. 4336.
## 2 SI2     5174. 4580.       4989.   5546. 4756.
## 3 SI1     4208. 3690.       3932.   4455. 3752.
## 4 VS2     4175. 4262.       4216.   4550. 3285.
## 5 VS1     4165. 3801.       3805.   4485. 3490.
## 6 VVS2    3350. 3079.       3038.   3795. 3250.
## 7 VVS1    3871. 2255.       2459.   2831. 2468.
## 8 IF      1912. 4098.       4396.   3856. 2273.

3. If this was easy

Take the sepal lengths from the iris dataset and put them into a wide table so that is one data column per species. You might be tempted to do this with the following code, which however doesn’t work. Can you explain why?

# If you remove the # sign in the line below you will get an error; this code doesn't work
# iris %>% select(Sepal.Length, Species) %>% spread(Species, Sepal.Length)

The problem is that spread() does not like to put data into the same row if it isn’t sure that they actually belong together. In the iris table, there is no indication which “setosa” values, for example, should go with which “versicolor” values. Therefore, spread() throws an error complaining about “duplicate identifiers for rows …”

We can avoid this issue by adding a row column that is repeated among the three groups. This column simply counts rows within each group, from 1 to 50 in this particular data set. This trick forces the data from the same rows for different species into one row. (That means, rows 1 of setosa, versicolor, and virginica get combined, then rows 2, and so on.)

iris %>%
  select(Sepal.Length, Species) %>% 
  group_by(Species) %>%
  mutate(row = 1:n()) %>%
  spread(Species, Sepal.Length)

## # A tibble: 50 x 4
##      row setosa versicolor virginica
##    <int>  <dbl>      <dbl>     <dbl>
##  1     1    5.1        7         6.3
##  2     2    4.9        6.4       5.8
##  3     3    4.7        6.9       7.1
##  4     4    4.6        5.5       6.3
##  5     5    5          6.5       6.5
##  6     6    5.4        5.7       7.6
##  7     7    4.6        6.3       4.9
##  8     8    5          4.9       7.3
##  9     9    4.4        6.6       6.7
## 10    10    4.9        5.2       7.2
## # … with 40 more rows

At the end, if you want, you can delete this column again:

iris %>%
  select(Sepal.Length, Species) %>% 
  group_by(Species) %>%
  mutate(row = 1:n()) %>%
  spread(Species, Sepal.Length) %>%
  select(-row)

## # A tibble: 50 x 3
##    setosa versicolor virginica
##     <dbl>      <dbl>     <dbl>
##  1    5.1        7         6.3
##  2    4.9        6.4       5.8
##  3    4.7        6.9       7.1
##  4    4.6        5.5       6.3
##  5    5          6.5       6.5
##  6    5.4        5.7       7.6
##  7    4.6        6.3       4.9
##  8    5          4.9       7.3
##  9    4.4        6.6       6.7
## 10    4.9        5.2       7.2
## # … with 40 more rows