```{r global_options, include=FALSE}
library(knitr)
opts_chunk$set(fig.align="center", fig.height=3, fig.width=4)
```
## In-class worksheet 7
**Feb 6, 2018**

In this worksheet, we will use the libraries tidyverse and nycflights13:
```{r message=FALSE}
library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(nycflights13)
```
The nycflights13 package contains information about all flights departing from New York City in 2013.
## 1. Joining tables
The following two tables list the population size and the area (in square miles), respectively, of three major Texas cities each:
```{r}
population <- read.csv(text=
"city,year,population
Houston,2014,2239558
San Antonio,2014,1436697
Austin,2014,912791
Austin,2010,790390", stringsAsFactors = FALSE)
population
area <- read.csv(text=
"city,area
Houston,607.5
Dallas,385.6
Austin,307.2", stringsAsFactors = FALSE)
area
```
Combine these two tables using the functions `left_join()`, `right_join()`, and `inner_join()`. How do these join functions differ in their results?
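
As a warm-up, the following is a minimal sketch of how the three joins treat rows without a match. The tables `band` and `instrument` are made up for illustration and are not part of the worksheet data.

```{r}
library(dplyr)

# Hypothetical toy tables; names and values are made up for illustration.
band <- tibble(name = c("Mick", "John", "Paul"),
               band = c("Stones", "Beatles", "Beatles"))
instrument <- tibble(name = c("John", "Paul", "Keith"),
                     plays = c("guitar", "bass", "guitar"))

left_join(band, instrument, by = "name")   # keeps every row of band
right_join(band, instrument, by = "name")  # keeps every row of instrument
inner_join(band, instrument, by = "name")  # keeps only names found in both tables
```

Rows that are kept but have no partner in the other table are filled with `NA` in the columns that come from that table.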
```{r}
# R code goes here.
```
## 2. Relationship between arrival delay and age of plane
The table `flights` from nycflights13 contains information about individual departures:
```{r}
flights
```
Individual planes are indicated by their tail number (`tailnum` in the table). Calculate the mean arrival delay (`arr_delay`) for each tail number. Do you notice anything unusual in the result? Try to calculate the mean with and without adding the option `na.rm=TRUE`.
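
To see what `na.rm=TRUE` changes in a grouped summary, here is a small sketch on made-up data. The table `toy` and its columns `id` and `delay` are hypothetical and only stand in for the real columns.

```{r}
library(dplyr)

# Hypothetical toy data: "B" has one missing delay, "C" has only missing delays.
toy <- tibble(id = c("A", "A", "B", "B", "C"),
              delay = c(10, 20, 5, NA, NA))

delay_summary <- toy %>%
  group_by(id) %>%
  summarize(mean_all = mean(delay),                # NA as soon as one value is missing
            mean_obs = mean(delay, na.rm = TRUE))  # mean of the non-missing values only
delay_summary
```

Note that a group containing only missing values still has no usable mean: with `na.rm=TRUE` the result for such a group is `NaN` rather than a number.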
```{r}
# R code goes here.
```
Information about individual planes is available in the table `planes`:
```{r}
planes
```
In particular, this table lists the year each individual plane was manufactured. Make a combined table that holds tail number, mean arrival delay, and year of manufacture for each plane. Then plot mean arrival delay vs. year of manufacture.
```{r}
# R code goes here.
```
## 3. Relationship between arrival delay and temperature
Now calculate the mean arrival delay for each day of the year, and store the result in a variable called `daily_delays`.
```{r}
# R code goes here.
```
We want to correlate these delay values with the temperature of each day. The data frame `weather` holds temperature measurements for each hour of each day:
```{r}
weather
```
First, calculate the mean temperature for each day, and store the result in a variable called `mean_temp`:
```{r}
# R code goes here.
```
Now combine the mean delay and the mean temperature into one table, and then plot mean delay vs. mean temperature.
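
When a day is identified by more than one column, all of those columns need to go into the join key. The following sketch assumes two made-up per-day summaries, `delays` and `temps`, keyed by `month` and `day`; the values are invented.

```{r}
library(dplyr)

# Hypothetical toy summaries keyed by month and day; values are made up.
delays <- tibble(month = c(1, 1, 2), day = c(1, 2, 1), delay = c(12.5, 3.0, 7.2))
temps  <- tibble(month = c(1, 1, 2), day = c(1, 2, 1), temp = c(36.1, 28.4, 41.0))

# Both columns together identify a day, so both go into `by`.
combined <- inner_join(delays, temps, by = c("month", "day"))
combined
```

Joining on `day` alone would pair rows from different months with each other.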
```{r}
# R code goes here.
```
## 4. If this was easy
Find out for how many tail numbers in the `flights` data set we have no information in the `planes` data set. What do we have to pay attention to when joining the `flights` and `planes` tables?
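
One possible approach, not necessarily the only one, is `anti_join()` from dplyr, which keeps the rows of the first table that have no match in the second. The sketch below uses two made-up tables, `trips` and `lookup`, in place of the real data.

```{r}
library(dplyr)

# Hypothetical toy tables; "N3" and the missing id have no entry in lookup.
trips  <- tibble(id = c("N1", "N1", "N2", "N3", NA))
lookup <- tibble(id = c("N1", "N2"), built = c(1999, 2004))

unmatched <- trips %>%
  distinct(id) %>%               # one row per id
  anti_join(lookup, by = "id")   # keep ids that are absent from lookup
unmatched
nrow(unmatched)
```

A missing id counts as unmatched here, so depending on the question it may make sense to filter out `NA` keys first.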
```{r}
# R code goes here.
```
Calculate the mean arrival delay by plane model and by plane engine, excluding all tail numbers for which no plane information is available. Sort the results in order of descending mean delay.
```{r}
# R code goes here.
```