Homework 3

Enter your name and EID here

This homework is due on Feb. 10, 2019 at 12:00pm. Please submit as a PDF file on Canvas.

In this homework, you are asked to evaluate two data sets and determine if they are tidy data sets. We are referring to a very specific definition of “tidy”, so if this term is unfamiliar to you, please review the lecture materials.

Problem 1: (3 pts) The dataset ldeaths built into R is a time series giving the monthly deaths from bronchitis, emphysema and asthma in the UK (1974-1979). You can run ?ldeaths to learn more about this data set. Using the variables in this dataset and the formal definition of tidy data that we learned in lecture, is this data set tidy? Explain why or why not.

ldeaths
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
## 1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
## 1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
## 1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
## 1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
## 1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915

The dataset contains three variables: year, month, and number of lung disease-related deaths. The dataset is not tidy. A tidy version would have one column for year, one column for month, and one column for the number of deaths. Instead, months are spread across the columns and years across the rows, so the death counts vary along both the rows and the columns.
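As an illustration of what a tidy layout could look like, the sketch below reshapes ldeaths into one row per year-month observation using base R only. The names ldeaths_tidy, year, month, and deaths are chosen here for illustration and are not part of the assignment; the year calculation assumes the series starts in January, which is the case for ldeaths.

```r
# ldeaths is a monthly time series (ts) object, not a data frame.
# Build one row per (year, month) observation.
n_per_year <- frequency(ldeaths)   # 12 observations per year
first_year <- start(ldeaths)[1]    # 1974; the series starts in January

ldeaths_tidy <- data.frame(
  # Assumption: the series starts in period 1 (January) of first_year
  year   = first_year + (seq_along(ldeaths) - 1) %/% n_per_year,
  month  = month.abb[as.integer(cycle(ldeaths))],
  deaths = as.numeric(ldeaths)
)

head(ldeaths_tidy)
```

In this layout each column holds exactly one variable and each row holds one monthly observation.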

The dataset airquality built into R contains daily air quality measurements in New York, May to September in 1973. You can run ?airquality to learn more about this data set. Using the variables in this dataset and the formal definition of tidy data that we learned in lecture, is this data set tidy? Explain why or why not.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

The dataset contains the variables for Ozone, Solar Radiation, Wind, Temp, Month and Day. The dataset is tidy because each column is a variable, and each row is an observation of the daily readings taken for each variable.
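One quick way to support the claim that each row is a single daily observation is to check that no Month and Day combination appears more than once. This is only a sanity check under that assumption, not a formal test of tidiness.

```r
# Each (Month, Day) pair should identify exactly one row.
# anyDuplicated() returns 0 when there are no duplicated rows.
anyDuplicated(airquality[, c("Month", "Day")])

# Number of daily observations in the data set
nrow(airquality)
```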

Problem 2: (3 pts) Listed below are three examples of code that violate the rules in section 2 of the tidyverse style guide. Name at least one style violation in each example.

ToothGrowth %>% filter(supp=="OJ") %>% head()

Spaces are missing on both sides of ==, and the whole pipeline is written on one line instead of one step per line.
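One possible rewrite of the pipeline above that avoids both violations is sketched here. It assumes dplyr is installed and loads it explicitly so the example runs on its own.

```r
library(dplyr)

# Spaces around ==, and one pipeline step per line
ToothGrowth %>%
  filter(supp == "OJ") %>%
  head()
```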

ToothGrowth[,1]

There is no space after the comma.
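With the space restored, the same subsetting call would read as follows.

```r
# Always put a space after a comma, even when the row index is empty
ToothGrowth[, 1]
```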

boxplot ( len ~ dose, data = ToothGrowth, range = 1, width = c(2, 2, 2), varwidth = TRUE, notch = FALSE, outline = TRUE )

There is a space between the function name and (, there are spaces just inside the parentheses (after ( and before )), and the code is too long to fit on a single line.
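One way to lay out the same call so it fits within the line length limit is to put each argument on its own line. The arguments are unchanged from the example above.

```r
# No space before or just inside the parentheses;
# a long call is broken into one argument per line
boxplot(
  len ~ dose,
  data = ToothGrowth,
  range = 1,
  width = c(2, 2, 2),
  varwidth = TRUE,
  notch = FALSE,
  outline = TRUE
)
```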

Problem 3: (4 pts) The NCbirths data set contains 1409 birth records from North Carolina in 2001. The column contents are as follows:

• Plural: 1=single birth, 2=twins, 3=triplets.
• Sex: Sex of the baby 1=male 2=female.
• MomAge: Mother’s age (in years).
• Weeks: Completed weeks of gestation.
• Gained: Weight gained during pregnancy (in pounds).
• BirthWeightGm: Birth weight in grams.
• Low: Indicator for low birth weight, 1=2500 grams or less, 0=otherwise.
• Premie: Indicator for premature birth, 1=36 weeks or sooner, 0=otherwise.
• Marital: Marital status: 0=married or 1=not married.

NCbirths <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
head(NCbirths)
##   Plural Sex MomAge Weeks Gained Smoke BirthWeightGm Low Premie Marital
## 1      1   1     32    40     38     0       3146.85   0      0       0
## 2      1   2     32    37     34     0       3288.60   0      0       0
## 3      1   1     27    39     12     0       3912.30   0      0       0
## 4      1   1     27    39     15     0       3855.60   0      0       0
## 5      1   1     25    39     32     0       3430.35   0      0       0
## 6      1   1     28    43     32     0       3316.95   0      0       0

Using some of the analysis functions we’ve discussed in class (e.g., mutate(), filter(), group_by(), summarize()), write code that outputs the answer to the following question:

For premature births, what are the maximum age of mothers and the mean birth weight for single babies, twins and triplets? Using the computed results, answer the question in 1-2 sentences. HINT: Use the function max() to determine the maximum age of mothers.

NCbirths %>%
  filter(Premie == 1) %>%
  group_by(Plural) %>%
  summarize(max_Mom = max(MomAge), mean_BirthWeight = mean(BirthWeightGm))
## # A tibble: 3 x 3
##   Plural max_Mom mean_BirthWeight
##    <int>   <dbl>            <dbl>
## 1      1      43            2616.
## 2      2      40            1896.
## 3      3      37            1772.

Among premature births, the maximum age of mothers is 43 years for single births, 40 years for twins, and 37 years for triplets. The mean birth weight for premature babies is 2616g for single births, 1896g for twins, and 1772g for triplets.