Lab Worksheet 4

Part 1: The dplyr pipe

The following questions in Part 1 are from Lab Worksheet 3. Answer these questions again, but this time use the dplyr pipe (%>%) in your answer.
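As a reminder of how the pipe works, here is a minimal sketch on R's built-in mtcars data set, which is not part of this worksheet. The column name mean_hp is made up for illustration. The pipe passes the result on its left as the first argument of the function on its right, so a nested call can be rewritten as a chain that reads top to bottom.

```r
library(dplyr)

# Without the pipe: nested calls have to be read from the inside out.
nested <- summarize(
  group_by(filter(mtcars, mpg > 20), cyl),
  mean_hp = mean(hp)  # mean_hp is a hypothetical column name
)

# With the pipe: the same steps, written in the order they are applied.
piped <- mtcars %>%
  filter(mpg > 20) %>%              # keep cars with mpg above 20
  group_by(cyl) %>%                 # one group per cylinder count
  summarize(mean_hp = mean(hp))     # mean horsepower within each group

# Both versions should produce the same summary table.
all.equal(nested, piped)
```

The problems below follow the same pattern: start from the data frame and pipe it through one dplyr verb at a time.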

Problem 1: In an in-class exercise, we made the following plot of the Sitka dataset:

# load the packages used in this worksheet:
library(ggplot2)
library(dplyr)
# download the sitka data set:
sitka <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/sitka.csv")
head(sitka)

##   size Time tree treat
## 1 4.51  152    1 ozone
## 2 4.98  174    1 ozone
## 3 5.41  201    1 ozone
## 4 5.90  227    1 ozone
## 5 6.15  258    1 ozone
## 6 4.24  152    2 ozone

ggplot(sitka, aes(x=Time, y=size, group=tree)) + geom_line() + facet_wrap(~treat)

Now modify the plot so that the line for each tree is colored according to the maximum size of the tree.

# R code goes here.

Problem 2: The package nycflights13 contains information about all flights that departed from the New York City airports in 2013. In particular, the data table flights lists on-time departure and arrival information for 336,776 individual flights:

library(nycflights13)
flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      533            529         4      850
## 3   2013     1     1      542            540         2      923
## 4   2013     1     1      544            545        -1     1004
## 5   2013     1     1      554            600        -6      812
## 6   2013     1     1      554            558        -4      740
## 7   2013     1     1      555            600        -5      913
## 8   2013     1     1      557            600        -3      709
## 9   2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

We would like to collect some information about arrival delays of United Airlines (UA) flights. Do the following: pick all UA departures with non-zero arrival delay and calculate the mean arrival delay for each of the corresponding flight numbers. Which flight had the longest mean arrival delay and how long was that delay?

# R code goes here.

Discussion goes here. 1-2 sentences.

Part 2: Combining data frames with dplyr

Problem 1: Invent two simple data sets that allow you to explain the difference between the dplyr functions left_join() and inner_join(). Explain which features of your data sets affect the behavior of these two functions.

# R code goes here.

Discussion goes here. 3-4 sentences.

Problem 2: I have split the sitka data set into two data frames, sitka1 and sitka2, as defined below. First, look up the documentation for the bind_rows() function. What does bind_rows() do? Next, use bind_rows() to combine sitka1 and sitka2 back into a single data frame.

Discussion goes here. 1-2 sentences.

sitka1 <- sitka[1:100,]
sitka2 <- sitka[101:395,]
head(sitka1)

##   size Time tree treat
## 1 4.51  152    1 ozone
## 2 4.98  174    1 ozone
## 3 5.41  201    1 ozone
## 4 5.90  227    1 ozone
## 5 6.15  258    1 ozone
## 6 4.24  152    2 ozone

head(sitka2)

##     size Time tree treat
## 101 4.04  152   21 ozone
## 102 4.64  174   21 ozone
## 103 4.86  201   21 ozone
## 104 5.09  227   21 ozone
## 105 5.25  258   21 ozone
## 106 3.53  152   22 ozone

# R code goes here.