Project 1

Enter your name and EID here


This knitted R Markdown document (as a PDF) and the raw R Markdown file (as .Rmd) should both be submitted to Canvas by 7:00pm on Feb 20th, 2018. These two documents will be graded jointly, so they must be consistent (as in, don’t change the R Markdown file without also updating the knitted document!).

All results presented must have corresponding code. Any answers/results given without the corresponding R code that generated the result will be considered absent. To be clear: if you do calculations by hand instead of using R and then report the results from the calculations, you will not receive credit for those calculations. All code reported in your final project document should work properly. Please do not include any extraneous code or code which produces error messages. (Code which produces warnings is acceptable, as long as you understand what the warnings mean.)

For this project, you will be using the data set ais, which contains body size and blood characteristics of Australian athletes.

ais <- read.csv("")
##    rcc wcc   hc   hg ferr   bmi   ssf pcBfat   lbm    ht   wt sex  sport
## 1 3.96 7.5 37.5 12.3   60 20.56 109.1  19.75 63.32 195.9 78.9   f B_Ball
## 2 4.41 8.3 38.2 12.7   68 20.67 102.8  21.30 58.55 189.7 74.4   f B_Ball
## 3 4.14 5.0 36.4 11.6   21 21.86 104.6  19.88 55.36 177.8 69.1   f B_Ball
## 4 4.11 5.3 37.3 12.6   69 21.88 126.4  23.66 57.18 185.0 74.9   f B_Ball
## 5 4.45 6.8 41.5 14.0   29 18.96  80.3  17.64 53.20 184.6 64.6   f B_Ball
## 6 4.10 4.4 37.4 12.5   42 21.04  75.2  15.58 53.77 174.0 63.7   f B_Ball

The column contents are as follows:

  • rcc: red blood cell count (in 10^12 per liter).
  • wcc: white blood cell count (in 10^12 per liter).
  • hc: percent hematocrit.
  • hg: hemoglobin concentration (in grams per deciliter).
  • ferr: plasma ferritins (in nanograms per decaliter).
  • bmi: body mass index (in kilograms per meter^2).
  • ssf: sum of skin folds (the units were not reported).
  • pcBfat: percent body fat.
  • lbm: lean body mass (in kilograms).
  • ht: height (in centimeters).
  • wt: weight (in kilograms).
  • sex: sex of an athlete, f=female or m=male.
  • sport: type of sport.
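
As a quick sanity check on the units, the bmi values in the first rows printed above can be reproduced from wt and ht. This is a minimal sketch that assumes BMI is weight in kilograms divided by squared height in meters; the data frame first_rows is a hypothetical name holding values copied by hand from rows 1 to 3 of the output above.

```r
# Values copied from rows 1-3 of the head() output above
first_rows <- data.frame(
  ht  = c(195.9, 189.7, 177.8),  # height in centimeters
  wt  = c(78.9, 74.4, 69.1),     # weight in kilograms
  bmi = c(20.56, 20.67, 21.86)   # reported body mass index
)

# Assumed definition: weight (kg) / height (m)^2, with height converted from cm
bmi_check <- first_rows$wt / (first_rows$ht / 100)^2

# Compare the recomputed values with the reported column, rounded to 2 decimals
round(bmi_check, 2)
first_rows$bmi
```

The recomputed values agree with the reported column to two decimal places for these three rows.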


Problem 1: (5 pts) Write R code that counts the number of athletes present for each combination of sport and sex. Order your output by sport and sex. Which sports contain observations only for one of the two sexes?

ais %>% group_by(sport, sex) %>% summarize(count=n()) %>% arrange(sport, sex)
## # A tibble: 17 x 3
## # Groups:   sport [10]
##    sport   sex   count
##    <fct>   <fct> <int>
##  1 B_Ball  f        13
##  2 B_Ball  m        12
##  3 Field   f         7
##  4 Field   m        12
##  5 Gym     f         4
##  6 Netball f        23
##  7 Row     f        22
##  8 Row     m        15
##  9 Swim    f         9
## 10 Swim    m        13
## 11 T_400m  f        11
## 12 T_400m  m        18
## 13 T_Sprnt f         4
## 14 T_Sprnt m        11
## 15 Tennis  f         7
## 16 Tennis  m         4
## 17 W_Polo  m        17

Gymnastics and netball contain observations for female athletes only, and water polo contains observations for male athletes only.
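
The single-sex sports can also be found in code rather than by reading the table. The sketch below is one possible approach, not part of the reference solution: it uses a small hypothetical data frame (toy) in place of ais, and the names toy_counts, single_sex, and n_sexes are illustrative.

```r
library(dplyr)

# Hypothetical toy data with the same sport and sex columns as ais
toy <- data.frame(
  sport = c("Gym", "Gym", "Row", "Row", "Row", "W_Polo"),
  sex   = c("f", "f", "f", "m", "m", "m")
)

# count() is shorthand for group_by() followed by summarize(n = n())
toy_counts <- count(toy, sport, sex)
toy_counts

# Keep only the sports in which a single distinct sex appears
single_sex <- toy %>%
  group_by(sport) %>%
  summarize(n_sexes = n_distinct(sex)) %>%
  filter(n_sexes == 1)
single_sex
```

Applied to ais instead of toy, the same pipeline should list the sports that have observations for only one sex.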

Problem 2: (25 pts) The following data set provides the full name for each sport. Note that it uses the same name (“Track”) for the two categories “T_Sprnt” and “T_400m”.

sport_name <- read.csv(text="
W_Polo,Water polo

Combine the data set ais with the data set sport_name such that all the information in the ais data set is retained. Using the combined data set, on the basis of the data column full_name, find the sport with the highest mean BMI for male athletes and the sport with the lowest mean BMI for male athletes. Perform a statistical test to determine whether there is a significant difference in male athletes’ mean BMI between these two sports, and interpret your findings.

ais %>% left_join(sport_name) -> ais_full
## Joining, by = "sport"
ais_full %>% 
  filter(sex=="m") %>%
  group_by(full_name) %>%
  summarize(mean_bmi=mean(bmi)) %>%
  arrange(desc(mean_bmi)) -> bmi_summary
bmi_summary

## # A tibble: 7 x 2
##   full_name  mean_bmi
##   <fct>         <dbl>
## 1 Field          28.0
## 2 Rowing         24.6
## 3 Water polo     24.5
## 4 Swimming       23.7
## 5 Basketball     23.2
## 6 Tennis         22.3
## 7 Track          22.2
ais_full %>% filter(full_name=="Field", sex=="m") -> field
ais_full %>% filter(full_name=="Track", sex=="m") -> track
t.test(field$bmi, track$bmi)
##  Welch Two Sample t-test
## data:  field$bmi and track$bmi
## t = 4.3819, df = 12.379, p-value = 0.0008309
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.915622 8.643861
## sample estimates:
## mean of x mean of y 
##  27.95250  22.17276

The sport with the highest mean BMI for male athletes is field (mean 27.95), and the sport with the lowest mean BMI for male athletes is track (mean 22.17). A Welch two-sample t-test shows that the mean BMI of male field athletes is significantly higher than that of male track athletes (t = 4.38, df = 12.4, p = 0.0008).
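
To illustrate why left_join() is the join that retains all rows of the first table, here is a minimal sketch on hypothetical toy data (toy and toy_names are made-up stand-ins, not the real ais or sport_name tables). The lookup table deliberately lacks one sport so that the difference from inner_join() is visible. The last step shows the formula interface to t.test(), which should be equivalent to passing the two bmi vectors separately.

```r
library(dplyr)

# Hypothetical toy data, not the real ais values
toy <- data.frame(
  sport = c("Field", "Field", "Field", "T_400m", "T_400m", "T_Sprnt", "Gym"),
  bmi   = c(27.1, 29.3, 28.0, 21.5, 22.4, 22.9, 18.0)
)

# Hypothetical lookup table that deliberately has no entry for "Gym"
toy_names <- data.frame(
  sport     = c("Field", "T_400m", "T_Sprnt"),
  full_name = c("Field", "Track", "Track")
)

# left_join() keeps every row of toy; unmatched sports get NA in full_name
toy_full <- left_join(toy, toy_names, by = "sport")

# inner_join() would drop the unmatched "Gym" row
toy_inner <- inner_join(toy, toy_names, by = "sport")

nrow(toy_full)
nrow(toy_inner)

# Welch two-sample t-test on the two groups, using the formula interface
toy_two <- filter(toy_full, full_name %in% c("Field", "Track"))
t.test(bmi ~ full_name, data = toy_two)
```

Passing by = "sport" explicitly also avoids relying on the automatic key detection that produces the "Joining, by" message.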

Problem 3: (40 pts)

a. (30 points) Using the ais data set, extract all the rows corresponding to sports for which there is data for both sexes. Then make one plot that visualizes the distributions of percent body fat (data column pcBfat) for each sex in the different sports. Your code should be well-commented and describe the various steps you take to create this figure.

#removing observations for gymnastics, netball, and water polo
ais %>% filter(sport != "Gym", sport != "Netball", sport != "W_Polo") -> ais_filtered

#making density plots of percent body fat, faceted by sport and colored by sex
ggplot(ais_filtered, aes(x=pcBfat, fill=sex)) + 
  geom_density(alpha=0.5) +
  facet_wrap(~sport) +
  xlab("Percent body fat")

b. (10 points) Discuss the information (overarching trends, patterns, etc.) your plot reveals. Be sure to include in your discussion the similarities/differences among the different sports and sexes. Be sure to also include a clear, logical justification for why you selected the particular geom(s) used to represent this data. Please limit your full response to a maximum of 6 sentences.

I used geom_density() because percent body fat is a continuous variable, and overlaid semi-transparent densities make it easy to compare the two sexes, which are distinguished by fill color. I faceted by sport to compare these distributions across sports. Within each sport, male athletes tend to have lower percent body fat than female athletes. This does not always hold when comparing across sports. For instance, percent body fat for female athletes in sprints (T_Sprnt) seems lower than percent body fat for male athletes in field (Field).
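
The filter in part (a) names the three single-sex sports explicitly. As an alternative, the rows for sports with both sexes can be selected with a grouped filter, so that no sport names need to be hard-coded. This is a sketch on hypothetical toy data (toy and toy_both are illustrative names, and the pcBfat values are made up), not the reference solution.

```r
library(dplyr)

# Hypothetical toy data with the same column names as ais
toy <- data.frame(
  sport  = c("Gym", "Gym", "Row", "Row", "Row", "Row", "W_Polo", "W_Polo"),
  sex    = c("f", "f", "f", "f", "m", "m", "m", "m"),
  pcBfat = c(14.2, 15.1, 17.3, 16.8, 9.5, 10.2, 11.0, 12.4)
)

# Within each sport, keep the rows only if both sexes are present
toy_both <- toy %>%
  group_by(sport) %>%
  filter(n_distinct(sex) == 2) %>%
  ungroup()

toy_both
```

The ungroup() call at the end returns an ordinary ungrouped table, which can then be passed to ggplot() in the same way as ais_filtered.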

Problem 4: (30 pts) Think of two (and only two!) conceptual questions to ask about the data set ais. Clearly state each question in the spaces provided below. For each question, use the ggplot2 library to create a plot that can help you find an answer to the question. For each plot, provide a clear explanation as to why this type of plot (e.g. boxplot, barplot, histogram, etc.) is best for providing the information you are asking about. Answer your questions by interpreting your plot and identifying any trends it reveals, or does not reveal, as the case may be. Your two plots must use different primary geoms. Please limit the discussion for each question-plot pair to 4-6 sentences.

To receive full credit for Problem 4, we look for the following for each question:

  • A clear, coherent question about the data. (Questions end in a question mark!)
  • The question should be conceptual and should not prompt a specific analysis or plot.
  • A plot that helps answer your proposed question, with a justification for why you chose to make the type of plot that you made.
  • An interpretation of your plot and a response to your proposed question.
  • Statistical analysis is not necessary. Just interpret your plot.

You cannot reuse the questions about the ais data set from the previous problems.

Question 1

State your first question here.

# R code for plot 1 creation, analysis goes here

Answer to question 1 goes here.

Question 2

State your second question here.

# R code for plot 2 creation, analysis goes here

Answer to question 2 goes here.