Homework 4

Enter your name and EID here

This homework is due on Feb. 13, 2018 at 7:00pm. Please submit as a PDF file on Canvas.

Problem 1: (4 pts) The Titanic data set contains the information about the passengers of the ocean liner Titanic.

Titanic
## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

We have extracted data from Titanic about the passengers that survived. We placed that information into two data-frames, one Adult and one Child. Using the dplyr and tidyr packages, make these data-frames tidy and then combine them into a single data-frame. Make sure that your final data-frame has an Age column indicating which data-frame the observations originally came from. HINT: You can use the bind_rows function to add rows from one data-frame onto another as long as both data-frames have identical column names.

Adult <- read.table(text="
Class  Male Female
1st    57    140
2nd    14     80
3rd    75     76
Crew  192     20
", head=T)

Child <- read.table(text="
Class  Male Female
1st     5      1
2nd    11     13
3rd    13     14
Crew    0      0
", head=T)

Adult %>% gather(Sex, Count, Male:Female) %>% mutate(Age = "Adult") -> Adult

Child %>% gather(Sex, Count, Male:Female) %>% mutate(Age = "Child") -> Child

survived <- bind_rows(Adult, Child)

survived
##    Class    Sex Count   Age
## 1    1st   Male    57 Adult
## 2    2nd   Male    14 Adult
## 3    3rd   Male    75 Adult
## 4   Crew   Male   192 Adult
## 5    1st Female   140 Adult
## 6    2nd Female    80 Adult
## 7    3rd Female    76 Adult
## 8   Crew Female    20 Adult
## 9    1st   Male     5 Child
## 10   2nd   Male    11 Child
## 11   3rd   Male    13 Child
## 12  Crew   Male     0 Child
## 13   1st Female     1 Child
## 14   2nd Female    13 Child
## 15   3rd Female    14 Child
## 16  Crew Female     0 Child

Using the data-frame you created above, compute the total counts for each age group (i.e., the sum of the counts for each age).

survived %>% group_by(Age) %>% summarize(Total_count = sum(Count)) -> survived_summary

survived_summary
## # A tibble: 2 x 2
##   Age   Total_count
##   <chr>       <int>
## 1 Adult         654
## 2 Child          57

Problem 2: (3 pts) Recall that the Alfalfa data set contains the height of alfalfa sprouts after four days (Ht4) grown indoors in different acidic environments (Acid). The column Row contains information on the amount of daylight the plants received. The rows were indicated with values a through e, where a indicates the row farthest from the window and e indicates the row closest to the window. We have created a new data-frame (Row_numbers), that contains row information as numbers. Using one of the dplyr join functions, combine the data-frames Alfalfa and Row_numbers so that there is an additional Row_num column and all of the original columns and rows in Alfalfa are retained. Which join function is most appropriate to use and why?

Alfalfa <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/Alfalfa.csv")
head(Alfalfa)
##    Ht4   Acid Row
## 1 1.45  water   a
## 2 2.79  water   b
## 3 1.93  water   c
## 4 2.33  water   d
## 5 4.85  water   e
## 6 1.00 1.5HCl   a
Row_numbers <- data.frame(Row=c("a","b","c","d","e","f","g"),
                         Row_num=c(1:7))
Row_numbers
##   Row Row_num
## 1   a       1
## 2   b       2
## 3   c       3
## 4   d       4
## 5   e       5
## 6   f       6
## 7   g       7
Alfalfa %>% left_join(Row_numbers) -> Alfalfa_row_num
## Joining, by = "Row"
## Warning: Column `Row` joining factors with different levels, coercing to
## character vector
Alfalfa_row_num
##     Ht4   Acid Row Row_num
## 1  1.45  water   a       1
## 2  2.79  water   b       2
## 3  1.93  water   c       3
## 4  2.33  water   d       4
## 5  4.85  water   e       5
## 6  1.00 1.5HCl   a       1
## 7  0.70 1.5HCl   b       2
## 8  1.37 1.5HCl   c       3
## 9  2.80 1.5HCl   d       4
## 10 1.46 1.5HCl   e       5
## 11 1.03 3.0HCl   a       1
## 12 1.22 3.0HCl   b       2
## 13 0.45 3.0HCl   c       3
## 14 1.65 3.0HCl   d       4
## 15 1.07 3.0HCl   e       5

The left_join function is most appropriate because we want to retain all of the observations in Alfalfa, while copying observations from Row_numbers when there is more than one observation for the same row.

Problem 3: (3 pts) Make up your own data set which is not tidy and input it into R via the text option of read.table(). First, explain why it is not tidy. Then, using dplyr and/or tidyr convert it into a tidy data set.

netflix <- read.table(text="
week HouseOfCards TheWalkingDead BoJackHorseman
1 2 1 0.5 
2 1 3 5 
3 4 7 9 
4 3 2 10
5 2.5 20 0.5
6 1 3 2
7 0.5 1 1
8 3 2.5 6
", head=T)

netflix
##   week HouseOfCards TheWalkingDead BoJackHorseman
## 1    1          2.0            1.0            0.5
## 2    2          1.0            3.0            5.0
## 3    3          4.0            7.0            9.0
## 4    4          3.0            2.0           10.0
## 5    5          2.5           20.0            0.5
## 6    6          1.0            3.0            2.0
## 7    7          0.5            1.0            1.0
## 8    8          3.0            2.5            6.0

I made a data-frame called netflix which contains the number of hours of different shows that I watched each week on Netflix. The data-frame is not tidy because the three columns (HouseOfCards, TheWalkingDead, BoJackHorseman) are all values of one variable (the title of the show that I watched). I made a new data-frame with the columns, week, hours_watched, and show.

netflix %>% gather(show, hours_watched, HouseOfCards:BoJackHorseman) -> tidy_netflix

tidy_netflix
##    week           show hours_watched
## 1     1   HouseOfCards           2.0
## 2     2   HouseOfCards           1.0
## 3     3   HouseOfCards           4.0
## 4     4   HouseOfCards           3.0
## 5     5   HouseOfCards           2.5
## 6     6   HouseOfCards           1.0
## 7     7   HouseOfCards           0.5
## 8     8   HouseOfCards           3.0
## 9     1 TheWalkingDead           1.0
## 10    2 TheWalkingDead           3.0
## 11    3 TheWalkingDead           7.0
## 12    4 TheWalkingDead           2.0
## 13    5 TheWalkingDead          20.0
## 14    6 TheWalkingDead           3.0
## 15    7 TheWalkingDead           1.0
## 16    8 TheWalkingDead           2.5
## 17    1 BoJackHorseman           0.5
## 18    2 BoJackHorseman           5.0
## 19    3 BoJackHorseman           9.0
## 20    4 BoJackHorseman          10.0
## 21    5 BoJackHorseman           0.5
## 22    6 BoJackHorseman           2.0
## 23    7 BoJackHorseman           1.0
## 24    8 BoJackHorseman           6.0