In-class worksheet 6

Feb 1, 2018

In this worksheet, we will continue to work with the tidyverse libraries:

library(tidyverse)

1. The msleep dataset

The msleep dataset, provided with ggplot2, contains information about sleep and awake times of different mammals:

msleep
## # A tibble: 83 x 11
##                          name       genus  vore        order conservation
##                         <chr>       <chr> <chr>        <chr>        <chr>
##  1                    Cheetah    Acinonyx carni    Carnivora           lc
##  2                 Owl monkey       Aotus  omni     Primates         <NA>
##  3            Mountain beaver  Aplodontia herbi     Rodentia           nt
##  4 Greater short-tailed shrew     Blarina  omni Soricomorpha           lc
##  5                        Cow         Bos herbi Artiodactyla domesticated
##  6           Three-toed sloth    Bradypus herbi       Pilosa         <NA>
##  7          Northern fur seal Callorhinus carni    Carnivora           vu
##  8               Vesper mouse     Calomys  <NA>     Rodentia         <NA>
##  9                        Dog       Canis carni    Carnivora domesticated
## 10                   Roe deer   Capreolus herbi Artiodactyla           lc
## # ... with 73 more rows, and 6 more variables: sleep_total <dbl>,
## #   sleep_rem <dbl>, sleep_cycle <dbl>, awake <dbl>, brainwt <dbl>,
## #   bodywt <dbl>

Verify that the sum of total sleep time (column sleep_total) and total awake time (column awake) adds up to 24h for all animals in the msleep dataset.

msleep$sleep_total + msleep$awake
##  [1] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
## [12] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
## [23] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.05 24.00 24.00
## [34] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
## [45] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
## [56] 24.00 24.00 24.00 24.00 24.05 24.00 24.00 24.00 24.00 24.00 24.00
## [67] 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00 24.00
## [78] 24.00 24.00 24.00 24.00 24.00 24.00

Make a list of all the domesticated species in the msleep dataset, in alphabetical order. Hint: domesticated species have the entry “domesticated” in the column conservation.

msleep %>% filter(conservation=="domesticated") %>%
  select(name) %>% arrange(name)
## # A tibble: 10 x 1
##            name
##           <chr>
##  1   Chinchilla
##  2          Cow
##  3          Dog
##  4 Domestic cat
##  5       Donkey
##  6   Guinea pig
##  7        Horse
##  8          Pig
##  9       Rabbit
## 10        Sheep

For the different vore classifications, tally how many species are awake for at least 18 hours. Hint: use the function tally().

msleep %>% filter(awake>=18) %>% group_by(vore) %>% tally()
## # A tibble: 3 x 2
##    vore     n
##   <chr> <int>
## 1 carni     4
## 2 herbi    11
## 3  <NA>     1

Using the function top_n(), identify the top-10 least-awake animals, and list them from least awake to most awake. Explain why this analysis gives you 11 results instead of 10.

msleep %>% select(name, sleep_total) %>% top_n(10) %>% arrange(desc(sleep_total))
## Selecting by sleep_total
## # A tibble: 11 x 2
##                              name sleep_total
##                             <chr>       <dbl>
##  1               Little brown bat        19.9
##  2                  Big brown bat        19.7
##  3           Thick-tailed opposum        19.4
##  4                Giant armadillo        18.1
##  5         North American Opossum        18.0
##  6           Long-nosed armadillo        17.4
##  7                     Owl monkey        17.0
##  8         Arctic ground squirrel        16.6
##  9 Golden-mantled ground squirrel        15.9
## 10                          Tiger        15.8
## 11      Eastern american chipmunk        15.8

There are 11 results because there is a tie. Both the Tiger and the Eastern american chipmunk have a total sleep time of 15.8h, and they are in positions 10 and 11 of the list. Note that by default, top_n() orders based on the last variable in the table. Since we selected sleep_total as the last column before we called top_n(), we get the desired result.

Considering only carnivores and herbivores, make a plot of the percent of time each animal is in REM sleep (out of the total sleep time) vs. the animal’s total sleep time. Hint: Use the operator | to indicate logical OR in the filter() function.

msleep %>% filter(vore=='carni' | vore=='herbi') %>%
  mutate(perc.rem = sleep_rem/sleep_total) %>%
  ggplot(aes(x=sleep_total, y=perc.rem, color=vore)) + geom_point()
## Warning: Removed 17 rows containing missing values (geom_point).

2. The diamonds dataset

The diamonds dataset provided by ggplot2 provides information about quality and price of 53940 diamonds:

head(diamonds)
## # A tibble: 6 x 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

The best cuts of diamonds are “Very Good”, “Premium”, and “Ideal”. Make a table that selects only those diamonds, and find the minimum, median, and maximum price for each cut. Hint: The operator %in% is helpful for selecting the diamond cuts.

diamonds %>% filter(cut %in% c("Very Good", "Premium", "Ideal")) %>% 
  group_by(cut) %>% 
  summarize(min.price=min(price), med.price=median(price), max.price=max(price))
## # A tibble: 3 x 4
##         cut min.price med.price max.price
##       <ord>     <dbl>     <dbl>     <dbl>
## 1 Very Good       336      2648     18818
## 2   Premium       326      3185     18823
## 3     Ideal       326      1810     18806

For each of the different diamond cuts, calculate the mean carat level among the diamonds whose price falls within 10% of the most expensive diamond for that cut.

diamonds %>% group_by(cut) %>%
  filter(price > 0.9*max(price)) %>%
  summarize(mean.carat=mean(carat))
## # A tibble: 5 x 2
##         cut mean.carat
##       <ord>      <dbl>
## 1      Fair   2.729444
## 2      Good   2.083889
## 3 Very Good   1.970552
## 4   Premium   2.075451
## 5     Ideal   1.992840

For each of the different diamond cuts, calculate the mean carat level among the top-10% most expensive diamonds.

diamonds %>% group_by(cut) %>%
  mutate(price_rank = rank(desc(price))) %>% # rank diamonds by price, in descending order
  filter(price_rank < 0.1*max(price_rank)) %>% # pick the top 10%
  summarize(mean.carat=mean(carat))
## # A tibble: 5 x 2
##         cut mean.carat
##       <ord>      <dbl>
## 1      Fair   2.039813
## 2      Good   1.762193
## 3 Very Good   1.709180
## 4   Premium   1.876962
## 5     Ideal   1.584023

Make a table that contains the median price for each combination of cut and clarity, and arrange the final table in descending order of median price.

diamonds %>% group_by( cut, clarity ) %>%
  summarize(med.price = median(price)) %>% 
  arrange(desc(med.price))
## # A tibble: 40 x 3
## # Groups:   cut [5]
##          cut clarity med.price
##        <ord>   <ord>     <dbl>
##  1   Premium     SI2    4291.0
##  2     Ideal     SI2    4059.5
##  3 Very Good     SI2    4042.0
##  4      Good     SI2    3770.0
##  5      Fair     SI2    3681.0
##  6     Ideal      I1    3673.5
##  7   Premium     SI1    3618.0
##  8      Fair     SI1    3528.5
##  9 Very Good      I1    3283.0
## 10   Premium      I1    3261.0
## # ... with 30 more rows

Now arrange the same table first by cut and then within each cut group by median price.

diamonds %>% group_by( cut, clarity ) %>%
  summarize(med.price = median(price)) %>% 
  arrange(desc(cut), desc(med.price))
## # A tibble: 40 x 3
## # Groups:   cut [5]
##        cut clarity med.price
##      <ord>   <ord>     <dbl>
##  1   Ideal     SI2    4059.5
##  2   Ideal      I1    3673.5
##  3   Ideal     SI1    2537.0
##  4   Ideal     VS1    1813.0
##  5   Ideal     VS2    1689.0
##  6   Ideal    VVS2    1330.0
##  7   Ideal    VVS1    1114.0
##  8   Ideal      IF    1022.5
##  9 Premium     SI2    4291.0
## 10 Premium     SI1    3618.0
## # ... with 30 more rows

3. If this was easy

For the diamonds data set, using the function do(), fit a linear model of price vs. carat separately for each cut. Then make a table that holds the resulting intercepts and slopes.

diamonds %>% group_by(cut) %>%
  do(lm.result = lm(price ~ carat, data=.)) -> fitted.models
do(fitted.models, data.frame(cut=.$cut, 
                             intercept=coef(.$lm.result)[1], 
                             slope=coef(.$lm.result)[2]))
## Source: local data frame [5 x 3]
## Groups: <by row>
## 
## # A tibble: 5 x 3
##         cut intercept    slope
## *     <ord>     <dbl>    <dbl>
## 1      Fair -1839.074 5924.495
## 2      Good -2422.728 7479.636
## 3 Very Good -2417.660 7935.972
## 4   Premium -2379.905 7807.752
## 5     Ideal -2300.374 8192.391