Project 3 Instructions

Please use the project template R Markdown document to complete your project. The knitted R Markdown document (as a PDF) and the raw R Markdown file (as .Rmd) must be submitted to Canvas by 11:00pm on Thurs., April 18, 2024. These two documents will be graded jointly, so they must be consistent (as in, don’t change the R Markdown file without also updating the knitted document!).

All results presented must have corresponding code, and the code should be visible in the final generated PDF for ease of grading. Any answers/results given without the corresponding R code that generated the result will be considered absent. All code reported in your final project document should work properly. Please do not include any extraneous code or code that produces error messages. (Code that produces warnings is acceptable, as long as you understand what the warnings mean and explain them.)
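One way to keep code visible in the knitted PDF is to set chunk options once in a setup chunk. The following is a minimal sketch, assuming the project template does not already set these options; check the template before adding it.

```r
# Hypothetical setup chunk near the top of the .Rmd file
# (assumption: the template does not already configure these options).
knitr::opts_chunk$set(
  echo = TRUE,   # show the code that produced each result
  error = FALSE  # stop knitting if a chunk throws an error
)
```

With `error = FALSE`, knitting halts on the first error instead of printing the error message into the document, so broken code cannot slip into the submitted PDF unnoticed.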

For this project, you will choose your own dataset, subject to the following constraint: pick one of the datasets published by the Tidy Tuesday project between May 30, 2023 and December 26, 2023 (both dates inclusive). All these datasets are available here: https://github.com/rfordatascience/tidytuesday/tree/master/data/2023

The project structure will be similar to Project 2, except there will be only one question. The final project should be structured as follows: Introduction, Question, Approach, Analysis, and Discussion.

We encourage you to be concise. A paragraph should typically not be longer than 5 sentences.

Important: Your project needs to include some material from classes 17 or 19–22, i.e., either some statistical modeling applied to subsets of data or some dimension reduction or clustering. We recommend you do a PCA, but you are not required to do so if you use one of the other techniques.
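As a rough illustration of the recommended PCA, the sketch below uses base R's `prcomp()`. The built-in `iris` data is only a stand-in here (it is not a Tidy Tuesday dataset), and the object names are made up for the example.

```r
# Stand-in data: the built-in iris measurements (NOT a Tidy Tuesday dataset).
numeric_cols <- iris[, c("Sepal.Length", "Sepal.Width",
                         "Petal.Length", "Petal.Width")]

# Center and scale each column so that no variable dominates
# merely because of its units.
pca_fit <- prcomp(numeric_cols, center = TRUE, scale. = TRUE)

# Rotation matrix: how each original variable contributes to each component.
pca_fit$rotation

# Proportion of variance explained by each component.
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
var_explained

# Component scores, joined back to the grouping variable for later plotting.
pca_scores <- data.frame(pca_fit$x, Species = iris$Species)
head(pca_scores)
```

The rotation matrix can serve as a summary table, and the scores in `pca_scores` can be passed to ggplot2 for a scatterplot of the first two components.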

Instructions

In the Introduction section, write a brief introduction to the dataset and describe what parts of the dataset are necessary to answer your question. Imagine that your project is a standalone document and the grader has no prior knowledge of the dataset. Important: You must provide a detailed description of data columns you are going to use in your analysis, reproducing relevant information from the data dictionary as necessary. However, you do not need to describe variables that are never used in your analysis.

Next, you will state your question. The question should be conceptual and open-ended and should not prompt a specific analysis. In particular, make sure you understand the difference between a question and an instruction (see Project 2 instructions for details on this topic).

In the Approach section, describe what type of data wrangling and analysis/modeling you will perform and what kind of plot(s) you will generate to address your question. Provide a clear explanation as to why these plots (e.g., boxplot, barplot, histogram, etc.) are best for providing the information you are asking about. (You can draw on the materials provided here for guidance.)

In the Analysis section, provide the code that performs required data wrangling and then generates your summary table and your plots. Use scale functions to provide nice axis labels and guides. Also, use theme functions to customize the appearance of your plot. For full points, you will have to apply some unique styling to your plots; you cannot rely exclusively on preexisting theme functions. All plots must be made with ggplot2. Do not use base R plotting functions.
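The sketch below shows what combining scale functions with custom theme settings might look like. It uses the built-in `mtcars` data as a stand-in (not a Tidy Tuesday dataset), and the particular styling choices are illustrative assumptions, not a required look.

```r
library(ggplot2)

# Stand-in data: built-in mtcars (NOT a Tidy Tuesday dataset).
cars <- mtcars
cars$cyl <- factor(cars$cyl)

styled_plot <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 2) +
  # Scale functions supply readable axis labels and a legend title.
  scale_x_continuous(name = "Weight (1000 lbs)") +
  scale_y_continuous(name = "Fuel efficiency (miles per gallon)") +
  scale_color_brewer(name = "Cylinders", palette = "Dark2") +
  # Start from a preexisting theme, then customize it further.
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = "bottom",
    axis.title = element_text(face = "bold")
  )

styled_plot
```

The `theme()` call after `theme_minimal()` is the part that goes beyond a preexisting theme function; this example also uses color mapping.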

The computed summary table does not have to be complicated. It should just provide useful information about the dataset or analysis. Examples are summary statistics about different groups in the data, regression coefficients or other regression statistics if you run regression models, the composition of the rotation matrix if you run a PCA, etc. The two plots should be of different types, and at least one plot needs to use either color mapping or faceting or both. In aggregate, the summary table and the two plots should allow you to answer your question.
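A summary table of group-level statistics can be computed in a few lines. This is a minimal sketch using dplyr, again with the built-in `mtcars` data as a stand-in and made-up column names; it assumes R 4.1 or later for the native pipe `|>`.

```r
library(dplyr)

# Stand-in data: built-in mtcars (NOT a Tidy Tuesday dataset).
# One row per group, with the count, mean, and standard deviation.
summary_table <- mtcars |>
  group_by(cyl) |>
  summarize(
    n_cars   = n(),
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    .groups  = "drop"
  )

summary_table
```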

In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by your analysis. Speculate about why the data looks the way it does.