Enter your name and EID here
This knitted R Markdown document (as a PDF) and the raw R Markdown file (as .Rmd) should both be submitted to Canvas by 4:00pm on Feb 26th, 2019. These two documents will be graded jointly, so they must be consistent (as in, don’t change the R Markdown file without also updating the knitted document!).
All results presented must have corresponding code. Any answers/results given without the corresponding R code that generated the result will be considered absent. To be clear: if you do calculations by hand instead of using R and then report the results from the calculations, you will not receive credit for those calculations. All code reported in your final project document should work properly. Please do not include any extraneous code or code which produces error messages. (Code which produces warnings is acceptable, as long as you understand what the warnings mean.)
For this project, you will be using the dataset flavors_of_cacao. This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used, and where the beans were grown.
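library(tidyverse) # provides read_csv(), extract(), and %>%
flavors_of_cacao <-
  read_csv("https://raw.githubusercontent.com/clauswilke/dviz.supp/master/data-raw/cacao/cacao_clean.csv") %>%
  # strip the trailing "%" and convert cocoa_percent to a number
  extract(cocoa_percent, "cocoa_percent", regex = "([^%]+)%", convert = TRUE)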
## Parsed with column specification:
## cols(
## company = col_character(),
## bean_origin_detailed = col_character(),
## REF = col_integer(),
## review_date = col_integer(),
## cocoa_percent = col_character(),
## location = col_character(),
## rating = col_double(),
## bean_type = col_character(),
## bean_origin = col_character()
## )
head(flavors_of_cacao)
## # A tibble: 6 x 9
##   company bean_origin_det…   REF review_date cocoa_percent location rating
##   <chr>   <chr>            <int>       <int>         <dbl> <chr>     <dbl>
## 1 A. Mor… Agua Grande       1876        2016            63 France     3.75
## 2 A. Mor… Kpime             1676        2015            70 France     2.75
## 3 A. Mor… Atsane            1676        2015            70 France     3
## 4 A. Mor… Akata             1680        2015            70 France     3.5
## 5 A. Mor… Quilla            1704        2015            70 France     3.5
## 6 A. Mor… Carenero          1315        2014            70 France     2.75
## # ... with 2 more variables: bean_type <chr>, bean_origin <chr>
The column contents are as follows:
Problem 1: (10 pts) Write R code that counts the number of reviews for each company location and calculates the minimum and maximum rating for each company location. Filter your output to locations with more than 20 reviews, and order your output from the highest to the lowest number of reviews.
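As a reminder of the general pattern, a grouped summary can be sketched on a small made-up data frame. The data frame `toy`, its columns `group` and `score`, and the cutoff of 1 are all hypothetical and chosen only for illustration; this is one possible approach, not the required solution.

```r
library(dplyr)

# Made-up toy data (NOT the cacao data): a few scores per group
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "c"),
  score = c(3.0, 3.5, 2.5, 4.0, 3.0, 2.0)
)

toy_summary <- toy %>%
  group_by(group) %>%          # one set of summaries per group
  summarize(
    n = n(),                   # number of rows in the group
    min_score = min(score),
    max_score = max(score)
  ) %>%
  filter(n > 1) %>%            # keep groups with more than one row
  arrange(desc(n))             # largest groups first

toy_summary
```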
# R code goes here
Problem 2: (20 pts) Use the data frame you generated in Problem 1 to find a location with the highest maximum rating and a location with the lowest minimum rating. Perform a statistical test to determine whether there is a significant difference in ratings between these two locations.
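For orientation, a two-sample comparison of means might look like the following sketch. It assumes a t-test is an appropriate choice, and it uses made-up ratings for two hypothetical locations, `X` and `Y`, rather than the cacao data.

```r
# Made-up ratings for two hypothetical locations (NOT the cacao data)
toy <- data.frame(
  location = rep(c("X", "Y"), each = 5),
  rating = c(3.0, 3.5, 3.25, 3.75, 3.5, 2.5, 3.0, 2.75, 3.25, 2.5)
)

# Welch two-sample t-test (the default in t.test) comparing the
# mean rating between the two locations
result <- t.test(rating ~ location, data = toy)

result$p.value
```

The formula interface `rating ~ location` requires that the grouping column contain exactly two distinct values, so the data would need to be filtered down to the two locations of interest first.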
# R code goes here
Your answer goes here. 1-2 sentences only.
Problem 3: (40 pts) Make one plot that visualizes the relationship between the number of reviews and the maximum and minimum ratings. Use the data frame you created in Problem 1. Your code should be well-commented and describe the various steps you take to create this figure. HINT: Convert your dataset to a tidy format before you plot.
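To illustrate what converting to a tidy format means here, the sketch below reshapes a small made-up summary table so that the minimum and maximum values share a single column. The data frame `wide` and its column names are hypothetical. The sketch uses `pivot_longer()`, which requires tidyr 1.0.0 or later; with older versions of tidyr, `gather()` plays the same role.

```r
library(tidyr)

# Made-up wide summary table (NOT the cacao data)
wide <- data.frame(
  group = c("a", "b"),
  n = c(3, 2),
  min_score = c(2.5, 3.0),
  max_score = c(3.5, 4.0)
)

# Stack min_score and max_score into one value column, with a second
# column recording which statistic each value came from
long <- pivot_longer(
  wide,
  cols = c(min_score, max_score),
  names_to = "statistic",
  values_to = "score"
)

long
```

In the long form, each row holds one statistic for one group, so the `statistic` column can be mapped to an aesthetic such as color in a single ggplot2 call.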
a. (30 points)
# R code goes here
b. (10 points) Discuss the information (overarching trends, patterns, etc.) your plot reveals. Be sure to include in your discussion the similarities and differences between the minimum and maximum ratings. Your discussion should also explain the results of the t-test in Problem 2 in the context of this plot. Be sure to also include a clear, logical justification for why you selected the particular geom(s) used to represent this data. Please limit your full response to a maximum of 10 sentences.
Your answer goes here.
Problem 4: (30 pts) Think of one (and only one!) conceptual question to ask about the data set flavors_of_cacao. Clearly state your question in the space provided below. Use the ggplot2 library to create a plot that can help you find an answer to the question. For the plot, provide a clear explanation as to why this type of plot (e.g. boxplot, barplot, histogram, etc.) is best for providing the information you are asking about. Answer your question by interpreting your plot and identifying any trends it reveals, or does not reveal, as the case may be. Please limit the discussion to 4-6 sentences.
To receive full credit for Problem 4, your question must satisfy the following:
You cannot reuse the questions about the flavors_of_cacao data set from the previous problems.
Question
State your question here.
# R code for plot creation and analysis goes here
Answer to your question goes here.