Feb 13, 2020
In this worksheet, we will use the library tidyverse:
library(tidyverse)
Consider the following data set, which contains information about income and religious affiliation in the US:
pew <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/pew.csv")
## Parsed with column specification:
## cols(
## religion = col_character(),
## below10k = col_double(),
## from10to20k = col_double(),
## from20to30k = col_double(),
## from30to40k = col_double(),
## from40to50k = col_double(),
## from50to75k = col_double(),
## from75to100k = col_double(),
## from100to150k = col_double(),
## above150k = col_double(),
## no_answer = col_double()
## )
head(pew)
## # A tibble: 6 x 11
## religion below10k from10to20k from20to30k from30to40k from40to50k from50to75k
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137
## 2 Atheist 12 27 37 52 35 70
## 3 Buddhist 27 21 30 34 33 58
## 4 Catholic 418 617 732 670 638 1116
## 5 Don't k… 15 14 15 11 10 35
## 6 Evangel… 575 869 1064 982 881 1486
## # … with 4 more variables: from75to100k <dbl>, from100to150k <dbl>,
## # above150k <dbl>, no_answer <dbl>
This table is not tidy, because income levels are used as column headers rather than as levels of an income
variable.
Use pivot_longer()
to turn this table into a table with three columns, one for religion, one for income (called income
), and one for the count of people with the respective combination of income and religion (called count
).
# solution 1, specifying the columns to convert from wide to long
pew %>%
pivot_longer(below10k:no_answer, names_to = "income", values_to = "count") %>%
head()
## # A tibble: 6 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic below10k 27
## 2 Agnostic from10to20k 34
## 3 Agnostic from20to30k 60
## 4 Agnostic from30to40k 81
## 5 Agnostic from40to50k 76
## 6 Agnostic from50to75k 137
# solution 2, specifying the column not to convert (religion)
pew %>%
pivot_longer(-religion, names_to = "income", values_to = "count") %>%
head()
## # A tibble: 6 x 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic below10k 27
## 2 Agnostic from10to20k 34
## 3 Agnostic from20to30k 60
## 4 Agnostic from30to40k 81
## 5 Agnostic from40to50k 76
## 6 Agnostic from50to75k 137
Now call the income column income_level
and the count column number_of_people
.
pew %>%
pivot_longer(-religion, names_to = "income_level", values_to = "number_of_people") %>%
head()
## # A tibble: 6 x 3
## religion income_level number_of_people
## <chr> <chr> <dbl>
## 1 Agnostic below10k 27
## 2 Agnostic from10to20k 34
## 3 Agnostic from20to30k 60
## 4 Agnostic from30to40k 81
## 5 Agnostic from40to50k 76
## 6 Agnostic from50to75k 137
Now, instead of using data from all columns, use only the data from columns below10k
, from20to30k
, and from50to75k
, such that your final data frame contains only these three income levels. Sort your final data frame according to religion
and then income_level
.
pew %>%
select(religion, below10k, from20to30k, from50to75k) %>%
pivot_longer(-religion, names_to = "income_level", values_to = "number_of_people") %>%
arrange(religion, income_level) %>%
head()
## # A tibble: 6 x 3
## religion income_level number_of_people
## <chr> <chr> <dbl>
## 1 Agnostic below10k 27
## 2 Agnostic from20to30k 60
## 3 Agnostic from50to75k 137
## 4 Atheist below10k 12
## 5 Atheist from20to30k 37
## 6 Atheist from50to75k 70
Consider the following data set, which contains information about the sex, weight, and height of 200 individuals:
persons <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/persons.csv")
## Parsed with column specification:
## cols(
## subject = col_double(),
## indicator = col_character(),
## value = col_character()
## )
head(persons)
## # A tibble: 6 x 3
## subject indicator value
## <dbl> <chr> <chr>
## 1 1 sex M
## 2 1 weight 77
## 3 1 height 182
## 4 2 sex F
## 5 2 weight 58
## 6 2 height 161
Is this data set tidy? And can you rearrange it so that you have one column for subject, one for sex, one for weight, and one for height?
The data set is not tidy, because neither indicator
nor value
are variables. The variables are subject
, sex
, weight
, height
. The following version of the table is tidy:
persons %>%
pivot_wider(names_from = "indicator", values_from = "value") %>%
head()
## # A tibble: 6 x 4
## subject sex weight height
## <dbl> <chr> <chr> <chr>
## 1 1 M 77 182
## 2 2 F 58 161
## 3 3 F 53 161
## 4 4 M 68 177
## 5 5 F 59 157
## 6 6 M 76 170
For the data set diamonds
from the ggplot2 package, create a table displaying the mean price for each combination of cut and clarity. Then use pivot_wider()
to rearrange this table into a wide format, such that there is a column of mean prices for each cut level (Fair, Good, Very Good, etc.).
price_by_cut <-
diamonds %>%
group_by(cut, clarity) %>%
summarize(mean.price = mean(price))
price_by_cut
## # A tibble: 40 x 3
## # Groups: cut [5]
## cut clarity mean.price
## <ord> <ord> <dbl>
## 1 Fair I1 3704.
## 2 Fair SI2 5174.
## 3 Fair SI1 4208.
## 4 Fair VS2 4175.
## 5 Fair VS1 4165.
## 6 Fair VVS2 3350.
## 7 Fair VVS1 3871.
## 8 Fair IF 1912.
## 9 Good I1 3597.
## 10 Good SI2 4580.
## # … with 30 more rows
price_by_cut %>%
pivot_wider(names_from = "cut", values_from = "mean.price")
## # A tibble: 8 x 6
## clarity Fair Good `Very Good` Premium Ideal
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I1 3704. 3597. 4078. 3947. 4336.
## 2 SI2 5174. 4580. 4989. 5546. 4756.
## 3 SI1 4208. 3690. 3932. 4455. 3752.
## 4 VS2 4175. 4262. 4216. 4550. 3285.
## 5 VS1 4165. 3801. 3805. 4485. 3490.
## 6 VVS2 3350. 3079. 3038. 3795. 3250.
## 7 VVS1 3871. 2255. 2459. 2831. 2468.
## 8 IF 1912. 4098. 4396. 3856. 2273.
Take the sepal lengths from the iris
dataset and put them into a wide table so that is one data column per species. You might be tempted to do this with the following code, which however doesn’t work. Can you explain why?
# If you remove the # signs in the lines below you will get an error; this code doesn't work
# iris %>%
# select(Sepal.Length, Species) %>%
# pivot_wider(names_from = "Species", values_from = "Sepal.Length")
The problem is that pivot_wider()
does not like to put data into the same row if it isn’t sure that they actually belong together. In the iris table, there is no indication which “setosa” values, for example, should go with which “versicolor” values. Therefore, pivot_wider()
prints a warning about values that are not uniquely identified.
We can avoid this issue by adding a row
column that is repeated among the three groups. This column simply counts rows within each group, from 1 to 50 in this particular data set. This trick forces the data from the same rows for different species into one row. (That means, rows 1 of setosa, versicolor, and virginica get combined, then rows 2, and so on.)
iris %>%
select(Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row = 1:n()) %>%
pivot_wider(names_from = "Species", values_from = "Sepal.Length")
## # A tibble: 50 x 4
## row setosa versicolor virginica
## <int> <dbl> <dbl> <dbl>
## 1 1 5.1 7 6.3
## 2 2 4.9 6.4 5.8
## 3 3 4.7 6.9 7.1
## 4 4 4.6 5.5 6.3
## 5 5 5 6.5 6.5
## 6 6 5.4 5.7 7.6
## 7 7 4.6 6.3 4.9
## 8 8 5 4.9 7.3
## 9 9 4.4 6.6 6.7
## 10 10 4.9 5.2 7.2
## # … with 40 more rows
At the end, if you want, you can delete this column again:
iris %>%
select(Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row = 1:n()) %>%
pivot_wider(names_from = "Species", values_from = "Sepal.Length") %>%
select(-row)
## # A tibble: 50 x 3
## setosa versicolor virginica
## <dbl> <dbl> <dbl>
## 1 5.1 7 6.3
## 2 4.9 6.4 5.8
## 3 4.7 6.9 7.1
## 4 4.6 5.5 6.3
## 5 5 6.5 6.5
## 6 5.4 5.7 7.6
## 7 4.6 6.3 4.9
## 8 5 4.9 7.3
## 9 4.4 6.6 6.7
## 10 4.9 5.2 7.2
## # … with 40 more rows