The following questions in Part 1 are from Lab Worksheet 3. Answer these questions again, but this time use the dplyr pipe (%>%) in your answers.
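As a reminder of what the pipe does, here is a minimal sketch on a small invented data frame. The names `toy`, `group`, and `value` are hypothetical and not part of the worksheet data. The sketch assumes that dplyr is installed. It computes the same grouped mean twice, first with nested function calls and then with the pipe.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 5, 7)
)

# Nested function calls: must be read from the inside out
nested <- summarize(group_by(toy, group), mean_value = mean(value))

# The same computation with the pipe: reads from top to bottom,
# because x %>% f(y) is evaluated as f(x, y)
piped <- toy %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))
```

Both versions should produce the same table. The piped version lists the steps in the order in which they are carried out, which is why it tends to be easier to read once an analysis has more than one or two steps.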
Problem 1: In an in-class exercise, we made the following plot of the Sitka dataset:
# download the sitka data set:
sitka <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/sitka.csv")
head(sitka)
## size Time tree treat
## 1 4.51 152 1 ozone
## 2 4.98 174 1 ozone
## 3 5.41 201 1 ozone
## 4 5.90 227 1 ozone
## 5 6.15 258 1 ozone
## 6 4.24 152 2 ozone
ggplot(sitka, aes(x=Time, y=size, group=tree)) + geom_line() + facet_wrap(~treat)
Now modify the plot so that the line for each tree is colored according to the maximum size of the tree.
# R code goes here.
Problem 2: The package nycflights13 contains information about all flights departing from one of the New York City airports in 2013. In particular, the data table flights lists on-time departure and arrival information for 336,776 individual flights:
library(nycflights13)
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
We would like to collect some information about arrival delays of United Airlines (UA) flights. Do the following: pick all UA departures with non-zero arrival delay and calculate the mean arrival delay for each of the corresponding flight numbers. Which flight had the longest mean arrival delay and how long was that delay?
# R code goes here.
Discussion goes here. 1-2 sentences.
Problem 1: Invent two simple data sets that allow you to explain the difference between the dplyr functions left_join() and inner_join(). Explain which features of your data sets affect the behavior of these two functions.
# R code goes here.
Discussion goes here. 3-4 sentences.
Problem 2: I have split the sitka data set into two data-frames. First, look up the documentation for the bind_rows function. What does bind_rows do? Next, use bind_rows to combine sitka1 and sitka2 back into a single data-frame.
Discussion goes here. 1-2 sentences.
sitka1 <- sitka[1:100,]
sitka2 <- sitka[101:395,]
head(sitka1)
## size Time tree treat
## 1 4.51 152 1 ozone
## 2 4.98 174 1 ozone
## 3 5.41 201 1 ozone
## 4 5.90 227 1 ozone
## 5 6.15 258 1 ozone
## 6 4.24 152 2 ozone
head(sitka2)
## size Time tree treat
## 101 4.04 152 21 ozone
## 102 4.64 174 21 ozone
## 103 4.86 201 21 ozone
## 104 5.09 227 21 ozone
## 105 5.25 258 21 ozone
## 106 3.53 152 22 ozone
# R code goes here.