## Homework 4

Enter your name and EID here

This homework is due on Feb. 17, 2020 at 12:00pm. Please submit as a PDF file on Canvas.

Problem 1a: (3 pts) The following two data tables contain information about the hair and eye colors of male and female statistics students. Make these dataframes tidy and then combine them into a single dataframe using pivot_longer() and bind_rows().

Hint: Before combining the dataframes, make sure to mutate() a new column specifying whether the students are male or female. Two dataframes can be combined with bind_rows() as long as the column names are identical and contain the same types of data.

## Parsed with column specification:
## cols(
##   Hair = col_character(),
##   Brown = col_double(),
##   Blue = col_double(),
##   Hazel = col_double(),
##   Green = col_double()
## )
## Parsed with column specification:
## cols(
##   Hair = col_character(),
##   Brown = col_double(),
##   Blue = col_double(),
##   Hazel = col_double(),
##   Green = col_double()
## )
female_tidy <- female %>%
  pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
  mutate(Sex = "Female")

male_tidy <- male %>%
  pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
  mutate(Sex = "Male")

combined <- bind_rows(female_tidy, male_tidy)

combined
## # A tibble: 32 x 4
##    Hair  Eyes  Num_Students Sex
##    <chr> <chr>        <dbl> <chr>
##  1 Black Brown           36 Female
##  2 Black Blue             9 Female
##  3 Black Hazel            5 Female
##  4 Black Green            2 Female
##  5 Brown Brown           66 Female
##  6 Brown Blue            34 Female
##  7 Brown Hazel           29 Female
##  8 Brown Green           14 Female
##  9 Red   Brown           16 Female
## 10 Red   Blue             7 Female
## # … with 22 more rows
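
The code that reads `female` and `male` is not shown above. As a rough sketch for reproducing the steps without the CSV files, the wide tables can be approximated from R's built-in `HairEyeColor` table, whose female counts match the rows printed above. This is an assumption about where the data come from, and `wide_table()` is a hypothetical helper, not part of the assignment.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical helper: build one wide table (one row per hair color,
# one column per eye color) for the given sex.
# Assumption: the homework CSVs mirror R's built-in HairEyeColor table.
wide_table <- function(sex) {
  as.data.frame.matrix(HairEyeColor[, , sex]) %>%
    rownames_to_column("Hair") %>%
    as_tibble()
}

female <- wide_table("Female")
male <- wide_table("Male")

# Same tidying steps as in the solution above
female_tidy <- female %>%
  pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
  mutate(Sex = "Female")

male_tidy <- male %>%
  pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
  mutate(Sex = "Male")

combined <- bind_rows(female_tidy, male_tidy)

# 4 hair colors x 4 eye colors x 2 sexes
nrow(combined)
```

Each wide table has 4 rows and 4 eye-color columns, so each tidy table has 16 rows and the combined table has 32.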

Problem 1b: (1 pt) Using the dataframe you created above, compute the total number of students for each hair color (i.e., the sum of students that have brown, black, blond, or red hair). How many students have each color of hair?

combined %>%
  group_by(Hair) %>%
  summarize(Total = sum(Num_Students))
## # A tibble: 4 x 2
##   Hair  Total
##   <chr> <dbl>
## 1 Black   108
## 2 Blond   127
## 3 Brown   286
## 4 Red      71

One hundred eight students have black hair, 127 students have blond hair, 286 students have brown hair, and 71 students have red hair.
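
The same totals can also be computed with a weighted `count()`, which does the grouping and summing in one call. The sketch below is self-contained, so it builds a stand-in for `combined` from R's built-in `HairEyeColor` table. That stand-in is an assumption, not the homework's CSV data.

```r
library(dplyr)

# Assumption: stand-in for `combined`, built from the HairEyeColor table
combined <- as.data.frame(HairEyeColor) %>%
  as_tibble() %>%
  rename(Eyes = Eye, Num_Students = Freq)

# count() with wt sums Num_Students within each hair color
totals <- combined %>%
  count(Hair, wt = Num_Students, name = "Total")

totals
```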

Problem 2: (3 pts) The chickwts dataset contains information on the weight of chicks after being fed different feed supplements. The feed supplements are labeled casein, horsebean, linseed, meatmeal, soybean, and sunflower in the feed column. I have created a new dataframe (feed_names) that contains the abbreviated names of the feed supplements. Using one of the dplyr join functions, combine the two dataframes so that there is an additional feed_abbr column and all of the original columns and rows in chickwts are retained. Which join function is most appropriate to use and why?

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
## Parsed with column specification:
## cols(
##   feed = col_character(),
##   feed_abbr = col_character()
## )
new_feed_names <- left_join(chickwts, feed_names)
## Joining, by = "feed"
## Warning: Column feed joining factor and character vector, coercing into
## character vector
new_feed_names
##    weight      feed feed_abbr
## 1     179 horsebean      <NA>
## 2     160 horsebean      <NA>
## 3     136 horsebean      <NA>
## 4     227 horsebean      <NA>
## 5     217 horsebean      <NA>
## 6     168 horsebean      <NA>
## 7     108 horsebean      <NA>
## 8     124 horsebean      <NA>
## 9     143 horsebean      <NA>
## 10    140 horsebean      <NA>
## 11    309   linseed        ls
## 12    229   linseed        ls
## 13    181   linseed        ls
## 14    141   linseed        ls
## 15    260   linseed        ls
## 16    203   linseed        ls
## 17    148   linseed        ls
## 18    169   linseed        ls
## 19    213   linseed        ls
## 20    257   linseed        ls
## 21    244   linseed        ls
## 22    271   linseed        ls
## 23    243   soybean        sb
## 24    230   soybean        sb
## 25    248   soybean        sb
## 26    327   soybean        sb
## 27    329   soybean        sb
## 28    250   soybean        sb
## 29    193   soybean        sb
## 30    271   soybean        sb
## 31    316   soybean        sb
## 32    267   soybean        sb
## 33    199   soybean        sb
## 34    171   soybean        sb
## 35    158   soybean        sb
## 36    248   soybean        sb
## 37    423 sunflower        sf
## 38    340 sunflower        sf
## 39    392 sunflower        sf
## 40    339 sunflower        sf
## 41    341 sunflower        sf
## 42    226 sunflower        sf
## 43    320 sunflower        sf
## 44    295 sunflower        sf
## 45    334 sunflower        sf
## 46    322 sunflower        sf
## 47    297 sunflower        sf
## 48    318 sunflower        sf
## 49    325  meatmeal        mm
## 50    257  meatmeal        mm
## 51    303  meatmeal        mm
## 52    315  meatmeal        mm
## 53    380  meatmeal        mm
## 54    153  meatmeal        mm
## 55    263  meatmeal        mm
## 56    242  meatmeal        mm
## 57    206  meatmeal        mm
## 58    344  meatmeal        mm
## 59    258  meatmeal        mm
## 60    368    casein        cs
## 61    390    casein        cs
## 62    379    casein        cs
## 63    260    casein        cs
## 64    404    casein        cs
## 65    318    casein        cs
## 66    352    casein        cs
## 67    359    casein        cs
## 68    216    casein        cs
## 69    222    casein        cs
## 70    283    casein        cs
## 71    332    casein        cs

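The left_join() function is most appropriate because it keeps every row of chickwts (the left table) and adds feed_abbr wherever the feed value has a match in feed_names. Feed types with no match in feed_names, such as horsebean here, are kept with feed_abbr set to NA instead of being dropped, which is what inner_join() would do.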
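
To make the difference between the joins concrete, here is a minimal sketch comparing `left_join()` with `inner_join()`. The `feed_names` table is rebuilt by hand from the abbreviations visible in the output above (it has no horsebean entry). The names `chicks`, `left`, and `inner` are illustrative only.

```r
library(dplyr)

# Assumption: feed_names reconstructed from the printed output above
feed_names <- tibble(
  feed = c("linseed", "soybean", "sunflower", "meatmeal", "casein"),
  feed_abbr = c("ls", "sb", "sf", "mm", "cs")
)

# Convert the factor to character so both key columns have the same type
chicks <- chickwts %>%
  mutate(feed = as.character(feed))

left <- left_join(chicks, feed_names, by = "feed")
inner <- inner_join(chicks, feed_names, by = "feed")

nrow(left)                  # all rows of chickwts are kept
nrow(inner)                 # horsebean rows are dropped
sum(is.na(left$feed_abbr))  # unmatched rows get NA
```

With this `feed_names`, the left join keeps all 71 rows and the 10 horsebean rows get NA, while the inner join returns only the 61 matched rows.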

Problem 3: (3 pts) Recall the flights dataset from the lab 3 worksheet. Ask a conceptual question about the flights dataset. Your question should not repeat the questions from class materials. Describe in 1-2 sentences how you would answer this question with an analysis or a graph.

Question: Is there a relationship between the New York airport a flight departs from and the flight's air time?

Answer approach: I could plot distributions of air time for the three airports to see if these distributions differ from each other. I could also run an ANOVA to assess whether the mean air times are different between the three airports.
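
A minimal sketch of this approach follows. It assumes the flights dataset is the `flights` table from the nycflights13 package, with `origin` as the departure airport and `air_time` in minutes. The object names `flights_air`, `p`, and `fit` are illustrative only.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)  # assumption: lab 3 used this package's flights table

# Drop flights with missing air time (e.g. cancelled flights)
flights_air <- flights %>%
  filter(!is.na(air_time))

# Compare the air-time distributions of the three origin airports
p <- ggplot(flights_air, aes(x = origin, y = air_time)) +
  geom_boxplot() +
  labs(x = "Origin airport", y = "Air time (minutes)")
p

# One-way ANOVA: does mean air time differ among origins?
fit <- aov(air_time ~ origin, data = flights_air)
summary(fit)
```

Air time is strongly right-skewed, so the ANOVA result is best read alongside the plot instead of on its own.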