In-class worksheet 7

Feb 6, 2018

In this worksheet, we will use the libraries tidyverse and nycflights13:

library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(nycflights13)

The nycflights13 package contains information about all flights departing from New York City in 2013.

1. Joining tables

The following two tables list the population size and area (in sq miles) of three major Texas cities each:

population <- read.csv(text=
"city,year,population
Houston,2014,2239558
San Antonio,2014,1436697
Austin,2014,912791
Austin,2010,790390", stringsAsFactors = FALSE)
population
##          city year population
## 1     Houston 2014    2239558
## 2 San Antonio 2014    1436697
## 3      Austin 2014     912791
## 4      Austin 2010     790390
area <- read.csv(text=
"city,area
Houston,607.5
Dallas,385.6
Austin,307.2", stringsAsFactors = FALSE)
area
##      city  area
## 1 Houston 607.5
## 2  Dallas 385.6
## 3  Austin 307.2

Combine these two tables using the functions left_join(), right_join(), and inner_join(). How do these join functions differ in their results?

# R code goes here.
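
One possible approach, as a sketch: join on the shared `city` column and compare which rows survive each join. The tables are re-created here only so that the snippet runs on its own; the names `left`, `right`, and `inner` are illustrative.

```r
library(tidyverse)

# Same tables as above, re-created so this sketch is self-contained.
population <- read.csv(text=
"city,year,population
Houston,2014,2239558
San Antonio,2014,1436697
Austin,2014,912791
Austin,2010,790390", stringsAsFactors = FALSE)
area <- read.csv(text=
"city,area
Houston,607.5
Dallas,385.6
Austin,307.2", stringsAsFactors = FALSE)

# left_join(): keeps every row of population; San Antonio has no match, so its area is NA
left <- left_join(population, area, by = "city")

# right_join(): keeps every row of area; Dallas has no match, so year and population are NA
right <- right_join(population, area, by = "city")

# inner_join(): keeps only cities that appear in both tables (Houston and Austin)
inner <- inner_join(population, area, by = "city")
```

Note that Austin appears twice in population, so each join returns two Austin rows that carry the same area value.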

2. Relationship between arrival delay and age of plane

The table flights from nycflights13 contains information about individual departures:

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Individual planes are indicated by their tail number (tailnum in the table). Calculate the mean arrival delay (arr_delay) for each tail number. Do you notice anything unusual in the result? Try to calculate the mean with and without adding the option na.rm=TRUE.

# R code goes here.
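
A minimal sketch of one way to do this with group_by() and summarize(). The result names `delay_with_na` and `delay_by_plane` are illustrative, not prescribed by the worksheet.

```r
library(tidyverse)
library(nycflights13)

# Without na.rm, a single missing arr_delay turns the mean for that plane into NA
delay_with_na <- flights %>%
  group_by(tailnum) %>%
  summarize(mean_delay = mean(arr_delay))

# With na.rm = TRUE, missing values are dropped before the mean is taken
delay_by_plane <- flights %>%
  group_by(tailnum) %>%
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE))
```

Planes whose arrival delays are all missing still yield NaN in the second table, because there is nothing left to average.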

Information about individual planes is available in the table planes:

planes
## # A tibble: 3,322 x 9
##    tailnum  year                    type     manufacturer     model
##      <chr> <int>                   <chr>            <chr>     <chr>
##  1  N10156  2004 Fixed wing multi engine          EMBRAER EMB-145XR
##  2  N102UW  1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
##  3  N103US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
##  4  N104UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
##  5  N10575  2002 Fixed wing multi engine          EMBRAER EMB-145LR
##  6  N105UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
##  7  N107US  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
##  8  N108UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
##  9  N109UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## 10  N110UW  1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214
## # ... with 3,312 more rows, and 4 more variables: engines <int>,
## #   seats <int>, speed <int>, engine <chr>

In particular, this table lists the year each individual plane was manufactured. Make a combined table that holds tail number, mean arrival delay, and year of manufacture for each plane. Then plot mean arrival delay vs. year of manufacture.

# R code goes here.
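
One way this could look, as a sketch: summarize flights by tail number first, then join the year column from planes. The name `plane_delays` is illustrative, and using inner_join() (which drops planes without a record in planes) is a choice made here, not a requirement of the exercise.

```r
library(tidyverse)
library(nycflights13)

plane_delays <- flights %>%
  group_by(tailnum) %>%
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  # keep only tailnum and year from planes; here year is the year of manufacture
  inner_join(select(planes, tailnum, year), by = "tailnum")

# Scatter plot of mean arrival delay against year of manufacture
p <- ggplot(plane_delays, aes(x = year, y = mean_delay)) +
  geom_point()
p
```

Summarizing before joining matters because flights has its own year column (the year of the flight); after summarize() only tailnum and mean_delay remain, so the two year columns cannot collide.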

3. Relationship between arrival delay and temperature

Now calculate the mean arrival delay for each day of the year, and store the result in a variable called daily_delays.

# R code goes here.
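
A minimal sketch, assuming that "each day" means each combination of year, month, and day in flights:

```r
library(tidyverse)
library(nycflights13)

daily_delays <- flights %>%
  group_by(year, month, day) %>%
  # na.rm = TRUE so that flights with a missing arr_delay do not make the daily mean NA
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ungroup()
```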

We want to correlate these delay values with the temperature of each day. The data frame weather holds temperature measurements for each hour of each day:

weather
## # A tibble: 26,130 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##     <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1    EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
##  2    EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
##  3    EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
##  4    EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
##  5    EWR  2013     1     1     4 37.94 24.08 57.04      240   14.96014
##  6    EWR  2013     1     1     6 39.02 26.06 59.37      270   10.35702
##  7    EWR  2013     1     1     7 39.02 26.96 61.63      250    8.05546
##  8    EWR  2013     1     1     8 39.02 28.04 64.43      240   11.50780
##  9    EWR  2013     1     1     9 39.92 28.04 62.21      250   12.65858
## 10    EWR  2013     1     1    10 39.02 28.04 64.43      260   12.65858
## # ... with 26,120 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

First, calculate the mean temperature for each day, and store the result in a variable called mean_temp:

# R code goes here.
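
One possible sketch. It averages over all hours and all three origin airports for each day, which is an assumption; the worksheet does not say whether airports should be kept separate. The column name `temp` for the daily mean is also a choice made here.

```r
library(tidyverse)
library(nycflights13)

mean_temp <- weather %>%
  group_by(year, month, day) %>%
  # one value per day, pooled over hours and airports
  summarize(temp = mean(temp, na.rm = TRUE)) %>%
  ungroup()
```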

Now combine the mean delay and the mean temperature into one table, and then plot mean delay vs. mean temperature.

# R code goes here.
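
A sketch of the combined analysis. The two daily summaries are recomputed so that the snippet runs on its own; `delay_temp` is an illustrative name, and pooling temperature over all three airports is an assumption.

```r
library(tidyverse)
library(nycflights13)

# Mean arrival delay per day
daily_delays <- flights %>%
  group_by(year, month, day) %>%
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ungroup()

# Mean temperature per day, pooled over hours and airports
mean_temp <- weather %>%
  group_by(year, month, day) %>%
  summarize(temp = mean(temp, na.rm = TRUE)) %>%
  ungroup()

# Join on the three date columns; days missing from either table are dropped
delay_temp <- inner_join(daily_delays, mean_temp,
                         by = c("year", "month", "day"))

p <- ggplot(delay_temp, aes(x = temp, y = mean_delay)) +
  geom_point()
p
```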

4. If this was easy

Find out for how many tail numbers in the flights data set we have no information in the planes data set. What do we have to pay attention to when joining the flights and planes tables?

# R code goes here.
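
One way to count the unmatched tail numbers, as a sketch, is anti_join(), which keeps the rows of the first table that have no match in the second. The name `missing_planes` is illustrative.

```r
library(tidyverse)
library(nycflights13)

missing_planes <- flights %>%
  distinct(tailnum) %>%               # one row per tail number (a missing tailnum counts as one value)
  anti_join(planes, by = "tailnum")   # keep tail numbers absent from planes

nrow(missing_planes)
```

When joining the two tables, note that both contain a column named year with different meanings (year of the flight in flights, year of manufacture in planes), so the join should be made explicitly with by = "tailnum" rather than on all shared column names.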

Calculate the mean arrival delay by plane model and by plane engine. Sort in order of descending mean delay. Remove all tail numbers for which no plane information is available.

# R code goes here.
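
A minimal sketch, assuming the delay is wanted for each combination of model and engine (grouping by each column separately would be an equally valid reading). The name `delay_by_model` is illustrative.

```r
library(tidyverse)
library(nycflights13)

delay_by_model <- flights %>%
  # inner_join() drops flights whose tail number has no entry in planes
  inner_join(planes, by = "tailnum") %>%
  group_by(model, engine) %>%
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(mean_delay))
```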