Enter your name and EID here
This homework is due on Feb. 17, 2020 at 12:00pm. Please submit as a PDF file on Canvas.
Problem 1a: (3 pts) The following two data tables contain information about the hair and eye colors of male and female statistics students. Make these dataframes tidy and then combine them into a single dataframe using pivot_longer() and bind_rows().
Hint: Before combining the dataframes, use mutate() to add a new column specifying whether the students are male or female. Two dataframes can be combined with bind_rows() as long as their column names are identical and the matching columns contain the same types of data.
library(tidyverse)
female <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/female_haireyecolor.csv")
## Parsed with column specification:
## cols(
## Hair = col_character(),
## Brown = col_double(),
## Blue = col_double(),
## Hazel = col_double(),
## Green = col_double()
## )
male <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/male_haireyecolor.csv")
## Parsed with column specification:
## cols(
## Hair = col_character(),
## Brown = col_double(),
## Blue = col_double(),
## Hazel = col_double(),
## Green = col_double()
## )
female_tidy <- female %>%
  pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
  mutate(Sex = "Female")
male_tidy <- male %>%
  pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
  mutate(Sex = "Male")
combined <- bind_rows(female_tidy, male_tidy)
combined
## # A tibble: 32 x 4
## Hair Eyes Num_Students Sex
## <chr> <chr> <dbl> <chr>
## 1 Black Brown 36 Female
## 2 Black Blue 9 Female
## 3 Black Hazel 5 Female
## 4 Black Green 2 Female
## 5 Brown Brown 66 Female
## 6 Brown Blue 34 Female
## 7 Brown Hazel 29 Female
## 8 Brown Green 14 Female
## 9 Red Brown 16 Female
## 10 Red Blue 7 Female
## # … with 22 more rows
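The hint's condition on column names can be illustrated with a small sketch. The counts below are made up (they are not the course data), and tidy_one() is a hypothetical helper introduced only for this example. It shows the same pivot_longer(), mutate(), bind_rows() sequence, followed by what bind_rows() does when one column name differs: it matches columns by name, keeps both columns, and fills the gaps with NA rather than stopping with an error.

```r
library(dplyr)
library(tidyr)

# Toy wide tables with made-up counts (not the course data)
f <- tibble(Hair = c("Black", "Red"), Brown = c(3, 1), Blue = c(2, 4))
m <- tibble(Hair = c("Black", "Red"), Brown = c(5, 2), Blue = c(1, 0))

# Hypothetical helper: reshape one wide table to long form and label the sex
tidy_one <- function(df, sex) {
  df %>%
    pivot_longer(-Hair, names_to = "Eyes", values_to = "Num_Students") %>%
    mutate(Sex = sex)
}

# Identical column names: rows are simply stacked
both <- bind_rows(tidy_one(f, "Female"), tidy_one(m, "Male"))

# One column renamed: bind_rows() keeps both columns and fills with NA
mismatched <- bind_rows(
  tidy_one(f, "Female"),
  tidy_one(m, "Male") %>% rename(Count = Num_Students)
)

dim(both)
dim(mismatched)
```

Because a name mismatch produces NA values silently, comparing names(female_tidy) with names(male_tidy) before combining is a cheap safeguard.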
Problem 1b: (1 pt) Using the dataframe you created above, compute the total number of students for each hair color (i.e., the number of students who have black, blond, brown, or red hair). How many students have each color of hair?
combined %>%
  group_by(Hair) %>%
  summarize(Total = sum(Num_Students))
## # A tibble: 4 x 2
## Hair Total
## <chr> <dbl>
## 1 Black 108
## 2 Blond 127
## 3 Brown 286
## 4 Red 71
One hundred eight students have black hair, 127 students have blond hair, 286 students have brown hair, and 71 students have red hair.
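As a cross-check that needs no download, the same totals can be computed from R's built-in HairEyeColor table. This is a sketch under an assumption: that the course CSV files are derived from that table, which the counts printed above are consistent with. It also uses count() with a weight as a shorter alternative to group_by() plus summarize().

```r
library(dplyr)

# Assumption: the course CSV files mirror R's built-in HairEyeColor table
hec <- as.data.frame(HairEyeColor)  # columns: Hair, Eye, Sex, Freq

# count() with wt sums Freq within each hair color
totals <- hec %>%
  count(Hair, wt = Freq, name = "Total")

totals
sum(totals$Total)  # overall number of students in the table
```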
Problem 2: (3 pts) The chickwts dataset contains information on the weight of chicks after being fed different feed supplements. The different feed supplements are labeled casein, horsebean, linseed, meatmeal, soybean, and sunflower in the feed column. I have created a new dataframe (feed_names) that contains the abbreviated names of different feed supplements. Using one of the dplyr join functions, combine the two dataframes so that there is an additional feed_abbr column and all of the original columns and rows in chickwts are retained. Which join function is most appropriate to use and why?
head(chickwts)
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
feed_names <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/feed_names.csv")
## Parsed with column specification:
## cols(
## feed = col_character(),
## feed_abbr = col_character()
## )
new_feed_names <- left_join(chickwts, feed_names)
## Joining, by = "feed"
## Warning: Column `feed` joining factor and character vector, coercing into
## character vector
new_feed_names
## weight feed feed_abbr
## 1 179 horsebean <NA>
## 2 160 horsebean <NA>
## 3 136 horsebean <NA>
## 4 227 horsebean <NA>
## 5 217 horsebean <NA>
## 6 168 horsebean <NA>
## 7 108 horsebean <NA>
## 8 124 horsebean <NA>
## 9 143 horsebean <NA>
## 10 140 horsebean <NA>
## 11 309 linseed ls
## 12 229 linseed ls
## 13 181 linseed ls
## 14 141 linseed ls
## 15 260 linseed ls
## 16 203 linseed ls
## 17 148 linseed ls
## 18 169 linseed ls
## 19 213 linseed ls
## 20 257 linseed ls
## 21 244 linseed ls
## 22 271 linseed ls
## 23 243 soybean sb
## 24 230 soybean sb
## 25 248 soybean sb
## 26 327 soybean sb
## 27 329 soybean sb
## 28 250 soybean sb
## 29 193 soybean sb
## 30 271 soybean sb
## 31 316 soybean sb
## 32 267 soybean sb
## 33 199 soybean sb
## 34 171 soybean sb
## 35 158 soybean sb
## 36 248 soybean sb
## 37 423 sunflower sf
## 38 340 sunflower sf
## 39 392 sunflower sf
## 40 339 sunflower sf
## 41 341 sunflower sf
## 42 226 sunflower sf
## 43 320 sunflower sf
## 44 295 sunflower sf
## 45 334 sunflower sf
## 46 322 sunflower sf
## 47 297 sunflower sf
## 48 318 sunflower sf
## 49 325 meatmeal mm
## 50 257 meatmeal mm
## 51 303 meatmeal mm
## 52 315 meatmeal mm
## 53 380 meatmeal mm
## 54 153 meatmeal mm
## 55 263 meatmeal mm
## 56 242 meatmeal mm
## 57 206 meatmeal mm
## 58 344 meatmeal mm
## 59 258 meatmeal mm
## 60 368 casein cs
## 61 390 casein cs
## 62 379 casein cs
## 63 260 casein cs
## 64 404 casein cs
## 65 318 casein cs
## 66 352 casein cs
## 67 359 casein cs
## 68 216 casein cs
## 69 222 casein cs
## 70 283 casein cs
## 71 332 casein cs
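The left_join function is most appropriate because it retains every row of chickwts and adds the matching feed_abbr value from feed_names to each one. When several chicks share a feed type, the single matching row of feed_names is copied to all of them. Feed types with no match in feed_names (horsebean in the output above) receive NA in feed_abbr instead of being dropped, as they would be with inner_join.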
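The choice of join can be checked by comparing row counts. The sketch below uses the built-in chickwts data and a hypothetical lookup table, typed in by hand, that stands in for feed_names.csv. It assumes, based on the NA values in the output above, that the real file has no entry for horsebean. Converting feed to character first is one way to avoid the factor/character coercion warning.

```r
library(dplyr)

# Hypothetical stand-in for feed_names.csv; assumed to lack horsebean
lookup <- tibble(
  feed      = c("casein", "linseed", "meatmeal", "soybean", "sunflower"),
  feed_abbr = c("cs", "ls", "mm", "sb", "sf")
)

# Convert the factor to character so both key columns have the same type
chicks <- chickwts %>% mutate(feed = as.character(feed))

left  <- left_join(chicks, lookup, by = "feed")   # keeps all chickwts rows
inner <- inner_join(chicks, lookup, by = "feed")  # drops unmatched rows

nrow(chicks)
nrow(left)
nrow(inner)
sum(is.na(left$feed_abbr))  # rows with no abbreviation available
```

Under this assumption, left_join() returns all 71 rows with 10 NA abbreviations, while inner_join() returns only the 61 rows whose feed has a match.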
Problem 3: (3 pts) Recall the flights dataset from the lab 3 worksheet. Ask a conceptual question about the flights dataset. Your question should not repeat the questions from class materials. Describe in 1-2 sentences how you would answer this question with an analysis or a graph.
Question: Is there a relationship between the New York airport a flight departs from and the air time of that flight?
Answer approach: I could plot distributions of air time for the three airports to see if these distributions differ from each other. I could also run an ANOVA to assess whether the mean air times are different between the three airports.
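A minimal sketch of this approach follows. It assumes that the flights dataset from the lab is the one provided by the nycflights13 package, where the departure airport is stored in origin and the time in the air (in minutes) in air_time. The names flights_complete, p, and fit are introduced here for illustration only.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)  # assumption: source of the flights dataset used in lab

# Cancelled flights have no recorded air time, so drop them first
flights_complete <- flights %>%
  filter(!is.na(air_time))

# Distribution of air time for each departure airport
p <- ggplot(flights_complete, aes(x = origin, y = air_time)) +
  geom_boxplot()
p

# One-way ANOVA: does mean air time differ among the airports?
fit <- aov(air_time ~ origin, data = flights_complete)
summary(fit)
```

With several hundred thousand flights, even a small difference in means is likely to be statistically significant, so the boxplots are useful for judging whether the difference is large enough to matter.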