Homework 3

Enter your name and EID here

This homework is due on Feb. 6, 2018 at 7:00pm. Please submit as a PDF file on Canvas.

In this homework, you are asked to evaluate two data sets and determine if they are tidy data sets. We are referring to a very specific definition of “tidy”, so if this term is unfamiliar to you, please review the lecture materials.

Problem 1: (2 pts) The dataset WorldPhones built into R contains the number of telephones (in thousands) in various regions of the world for the years 1951 and 1956-1961. You can run ?WorldPhones to learn more about this data set.

WorldPhones
##      N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
## 1951  45939  21574 2876   1815    1646     89      555
## 1956  60423  29990 4708   2568    2366   1411      733
## 1957  64721  32510 5230   2695    2526   1546      773
## 1958  68484  35218 6662   2845    2691   1663      836
## 1959  71799  37598 6856   3000    2868   1769      911
## 1960  76036  40341 8220   3145    3054   1905     1008
## 1961  79831  43173 9053   3338    3224   2005     1076

Explain the variables present in this dataset. Using the variables in this dataset and the formal definition of tidy data that we learned in lecture, is this data set tidy? Explain why or why not.

The dataset contains three variables: number of telephones (in thousands), year, and region. The dataset is not tidy. In a tidy version there would be one column for number of telephones, one column for region, and one column for year. Instead, each region has its own column and each year has its own row, so the telephone counts are spread across both the rows and the columns.
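
As an illustration of the tidy layout described above, the matrix can be reshaped so that year, region, and telephone count each get their own column. This is a minimal sketch in base R, not part of the assignment; the column names `year`, `region`, and `phones_thousands` are illustrative choices and do not come from the dataset.

```r
# WorldPhones is a matrix with years as row names and regions as column names.
# as.table() followed by as.data.frame() yields one row per (year, region) cell.
phones_tidy <- as.data.frame(as.table(WorldPhones), stringsAsFactors = FALSE)

# Illustrative column names (assumed, not part of the dataset)
names(phones_tidy) <- c("year", "region", "phones_thousands")
phones_tidy$year <- as.integer(phones_tidy$year)

# 7 years x 7 regions = 49 rows, 3 columns
dim(phones_tidy)
head(phones_tidy)
```

In this form every row is one observation (one region in one year) and every column is one variable.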

The dataset ToothGrowth built into R contains data on the effect of vitamin C on tooth growth in 60 Guinea pigs. You can run ?ToothGrowth to learn more about this data set.

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Explain the variables present in this dataset. Using the variables in this dataset and the formal definition of tidy data that we learned in lecture, is this data set tidy? Explain why or why not.

The dataset contains three variables: tooth length (len), supplement type (supp), and supplement dose (dose). The dataset is tidy because each column is a variable and each row is an observation.
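
Because the data are already tidy, grouped summaries need no reshaping first. As a small illustration (not part of the assignment), the base R call below computes the mean tooth length for every combination of supplement and dose; the name `mean_len` is a hypothetical choice.

```r
# Mean tooth length for each supplement/dose combination.
# This works directly because len, supp, and dose are each a single column.
mean_len <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
mean_len
```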

Problem 2: (2 pts) The MedGPA dataset contains information about medical school admission. The dataset has 55 observations and 11 columns. It contains information on acceptance status (Accept) with levels A for accepted and D for denied, an indicator for acceptance status (Acceptance) with levels 1 for accepted and 0 for denied, sex of a student (Sex), Biology/Chemistry/Physics/Math grade point average (BCPM), college grade point average (GPA), MCAT exam’s verbal reasoning score (VR), MCAT exam’s physical sciences score (PS), MCAT exam’s writing sample score (WS), MCAT exam’s biological science score (BS), MCAT exam’s total score (MCAT, the sum of VR+PS+WS+BS), and the number of medical schools the student applied to (Apps).

MedGPA <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/MedGPA.csv")
head(MedGPA)
##   Accept Acceptance Sex BCPM  GPA VR PS WS BS MCAT Apps
## 1      D          0   F 3.59 3.62 11  9  9  9   38    5
## 2      A          1   M 3.75 3.84 12 13  8 12   45    3
## 3      A          1   F 3.24 3.23  9 10  5  9   33   19
## 4      A          1   F 3.74 3.69 12 11  7 10   40    5
## 5      A          1   F 3.53 3.38  9 11  4 11   35   11
## 6      A          1   M 3.59 3.72 10  9  7 10   36    5
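
The description above states that the total MCAT score is the sum of the four section scores. As a quick sanity check, the sketch below retypes the six rows shown by head(MedGPA) and tests that claim; `med_head` and `sums_match` are hypothetical names, and the check covers only these six rows.

```r
# The six rows printed by head(MedGPA) above, retyped by hand
med_head <- data.frame(
  VR   = c(11, 12,  9, 12,  9, 10),
  PS   = c( 9, 13, 10, 11, 11,  9),
  WS   = c( 9,  8,  5,  7,  4,  7),
  BS   = c( 9, 12,  9, 10, 11, 10),
  MCAT = c(38, 45, 33, 40, 35, 36)
)

# TRUE if the total equals the sum of the four section scores in every row
sums_match <- with(med_head, all(VR + PS + WS + BS == MCAT))
sums_match
```

With the full data loaded, the same expression can be applied to MedGPA instead of `med_head`.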

What are the mean GPA and the mean MCAT exam score for students that were accepted and for students that were denied? State your answer in a sentence.

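library(dplyr)  # provides %>%, group_by(), summarize(), and filter()

# Mean MCAT score and mean GPA for each acceptance status
MedGPA %>%
  group_by(Accept) %>%
  summarize(mean_MCAT = mean(MCAT), mean_GPA = mean(GPA))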
## # A tibble: 2 x 3
##   Accept mean_MCAT mean_GPA
##   <fct>      <dbl>    <dbl>
## 1 A           38.1     3.69
## 2 D           34.1     3.39

Accepted students have a mean MCAT score of 38.07 and a mean GPA of 3.69, while denied students have a mean MCAT score of 34.12 and a mean GPA of 3.39.

Problem 3: (3 pts) For female students that were accepted, what were the minimum and the maximum number of medical schools the students applied to? HINT: Use the functions max() and min() to determine the maximum and the minimum number of schools applied to.

MedGPA %>%
  filter(Sex == "F", Accept == "A") %>%
  summarize(min_Apps = min(Apps), max_Apps = max(Apps))
##   min_Apps max_Apps
## 1        1       19

Female students that were accepted applied to a minimum of 1 and a maximum of 19 medical schools.

Problem 4: (3 pts) Ask a question about the MedGPA data set. Your question should not repeat the questions in problems 2 or 3. Describe in 1-2 sentences how you would answer this question with an analysis or a graph.

Question: Is there a relationship between GPA and MCAT exam score?

Answer approach: I could make a scatter plot of MCAT score against GPA and look for a trend. I could also compute the correlation between the two variables.
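
The approach could be sketched as follows. To keep the example self-contained, it uses only the six rows shown by head(MedGPA), retyped by hand under the hypothetical name `med_head`; six rows are far too few to draw a conclusion, so a real analysis would use the full MedGPA data frame instead.

```r
# Six rows from head(MedGPA), retyped by hand for a self-contained demo.
# With the full data loaded, use MedGPA in place of med_head.
med_head <- data.frame(
  GPA  = c(3.62, 3.84, 3.23, 3.69, 3.38, 3.72),
  MCAT = c(38, 45, 33, 40, 35, 36)
)

# Scatter plot to look for a trend
plot(med_head$GPA, med_head$MCAT, xlab = "GPA", ylab = "MCAT total score")

# Pearson correlation coefficient (always between -1 and 1)
r <- cor(med_head$GPA, med_head$MCAT)
r
```

A value of `r` near 1 or -1 would indicate a strong linear relationship, and a value near 0 a weak one; cor.test() could be used on the same two columns to assess significance.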