In-class worksheet 5

Feb 5, 2019

1. Tidy data

Is the iris dataset tidy? Explain why or why not.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Your answer goes here.

Is the HairEyeColor dataset tidy? Explain why or why not.

HairEyeColor

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

Your answer goes here.
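
For comparison with the printout above, here is a sketch of how the same counts look in long form. It relies on base R's as.data.frame() method for tables, which returns one row per cell with the count in a column named Freq. The name hair_eye_long is chosen here for illustration only.

```r
# HairEyeColor is a 3-dimensional table (Hair x Eye x Sex).
# as.data.frame() turns it into one row per combination,
# with the count stored in a column called Freq.
hair_eye_long <- as.data.frame(HairEyeColor)
head(hair_eye_long)
```

Comparing the two layouts may help when deciding whether the original form is tidy.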

2. Selecting rows and columns

All subsequent code uses the dplyr package, which is part of the tidyverse. So we first have to load the tidyverse:

library(tidyverse)

Now, using the dplyr function filter(), pick all the rows in the iris dataset that pertain to species setosa, and store them in a new table called iris_setosa.

# R code goes here.
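
One possible approach, as a minimal sketch: filter() keeps the rows for which a logical condition is TRUE. The block loads dplyr directly so that it runs on its own; library(tidyverse) would work as well.

```r
library(dplyr)

# Keep only the rows whose Species column equals "setosa".
iris_setosa <- filter(iris, Species == "setosa")
head(iris_setosa)
```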

Pick all the rows in the iris dataset that belong to species virginica and have a sepal length > 7.

# R code goes here.
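
A sketch of one way to combine two conditions: multiple arguments to filter() are joined with a logical AND. The name virginica_long is hypothetical.

```r
library(dplyr)

# Both conditions must hold for a row to be kept.
virginica_long <- filter(iris, Species == "virginica", Sepal.Length > 7)
virginica_long
```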

Are there any cases in the iris dataset for which the ratio of sepal length to sepal width exceeds the ratio of petal length to petal width? Use filter() to find out.

# R code goes here.
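
The condition inside filter() can be any expression built from the columns, so the two ratios can be compared directly. A minimal sketch, with ratio_cases as an illustrative name:

```r
library(dplyr)

# Compare the sepal ratio to the petal ratio row by row.
ratio_cases <- filter(
  iris,
  Sepal.Length / Sepal.Width > Petal.Length / Petal.Width
)

# The number of rows returned answers the question.
nrow(ratio_cases)
```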

Create a pared-down table which contains only data for species setosa and which only has the columns Sepal.Length and Sepal.Width. Store the result in a table called iris_pared.

# R code goes here.
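
One way this could look, assuming the pipe operator %>% is used to chain the row selection (filter()) and the column selection (select()):

```r
library(dplyr)

# First keep the setosa rows, then keep only the two sepal columns.
iris_pared <- iris %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Sepal.Width)

head(iris_pared)
```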

3. Creating new data, arranging

Using the function mutate(), create a new data column that holds the ratio of sepal length to sepal width. Store the resulting table in a variable called iris_ratio.

# R code goes here.
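
A minimal sketch of mutate() in this setting. The column name sepal_ratio is an assumption made for illustration; any valid name would do.

```r
library(dplyr)

# Add a new column computed from two existing columns.
iris_ratio <- mutate(iris, sepal_ratio = Sepal.Length / Sepal.Width)
head(iris_ratio)
```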

Order the iris_ratio table by species name and by increasing values of sepal length-to-width ratio.

# R code goes here.
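
As a sketch, arrange() sorts by its first column argument and breaks ties with the following ones; the default order is increasing. The block recreates iris_ratio so that it runs on its own, again assuming the illustrative column name sepal_ratio; iris_sorted is likewise a hypothetical name.

```r
library(dplyr)

iris_ratio <- mutate(iris, sepal_ratio = Sepal.Length / Sepal.Width)

# Sort by species first, then by increasing ratio within each species.
iris_sorted <- arrange(iris_ratio, Species, sepal_ratio)
head(iris_sorted)
```

Wrapping a column in desc() would reverse the order for that column.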

4. Grouping and summarizing

Calculate the mean and standard deviation of the sepal lengths for each species. Do this by first creating a table grouped by species, which you call iris_grouped. Then run summarize() on that table.

# R code goes here.
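
A sketch of the two-step approach described above. The summary column names mean_length and sd_length, and the result name sepal_summary, are chosen here for illustration.

```r
library(dplyr)

# Step 1: mark the table as grouped by species.
iris_grouped <- group_by(iris, Species)

# Step 2: summarize() then computes one row per group.
sepal_summary <- summarize(
  iris_grouped,
  mean_length = mean(Sepal.Length),
  sd_length = sd(Sepal.Length)
)
sepal_summary
```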

Use the function n() to count the number of observations for each species.

# R code goes here.
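
One possible way to do this: n() takes no arguments and returns the number of rows in the current group. The names species_counts and count are illustrative.

```r
library(dplyr)

# n() is evaluated once per group inside summarize().
species_counts <- iris %>%
  group_by(Species) %>%
  summarize(count = n())
species_counts
```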

For each species, calculate the percentage of cases with sepal length > 5.5.

# R code goes here.
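
A sketch that relies on the fact that the mean of a logical vector is the fraction of TRUE values, so multiplying by 100 gives a percentage. The names long_sepal_pct and pct_long are hypothetical.

```r
library(dplyr)

# Sepal.Length > 5.5 is TRUE/FALSE per row; its mean is the share of TRUE.
long_sepal_pct <- iris %>%
  group_by(Species) %>%
  summarize(pct_long = 100 * mean(Sepal.Length > 5.5))
long_sepal_pct
```

An equivalent formulation would be 100 * sum(Sepal.Length > 5.5) / n().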

5. If this was easy

Take the iris_ratio data set you have created and plot the distribution of sepal length-to-width ratios for the three species.

# R code goes here.
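
One of several reasonable plot types is an overlaid density plot, sketched below with ggplot2. The block recreates iris_ratio with the illustrative column name sepal_ratio so that it runs on its own; p_dist is also a name chosen here.

```r
library(dplyr)
library(ggplot2)

iris_ratio <- mutate(iris, sepal_ratio = Sepal.Length / Sepal.Width)

# One semi-transparent density curve per species.
p_dist <- ggplot(iris_ratio, aes(x = sepal_ratio, fill = Species)) +
  geom_density(alpha = 0.5)
p_dist
```

Boxplots or histograms faceted by species would be alternative ways to show the same distributions.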

Now plot sepal length-to-width ratios vs. sepal lengths. Does it look like there is a relationship between the length-to-width ratios and the lengths? Does it matter whether you consider each species individually or all together? How could you find out?

# R code goes here.
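
A possible starting point, as a sketch: a scatter plot colored by species, followed by a per-species correlation as one way to look at each species individually. The names sepal_ratio, p_scatter, ratio_cor, and r are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

iris_ratio <- mutate(iris, sepal_ratio = Sepal.Length / Sepal.Width)

# Scatter plot of ratio vs. length; color separates the three species.
p_scatter <- ggplot(iris_ratio,
                    aes(x = Sepal.Length, y = sepal_ratio, color = Species)) +
  geom_point()
p_scatter

# Correlation between length and ratio, computed separately per species.
ratio_cor <- iris_ratio %>%
  group_by(Species) %>%
  summarize(r = cor(Sepal.Length, sepal_ratio))
ratio_cor
```

Comparing these per-species values with cor() computed on the whole table is one way to see whether pooling the species changes the picture.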