Homework 6

Enter your name and EID here

This homework is due on Mar. 6, 2018 at 7:00pm. Please submit as a PDF file on Canvas.

For this homework you will use the wine data set, which contains concentrations of 13 different chemical compounds (chem1-chem13) in samples of wines grown in Italy. We are only interested in samples from cultivars 1 and 2, so samples from cultivar 3 have been removed, leaving 130 samples. Each row is a different sample of wine, and the cultivar column now identifies just two cultivars of wine.

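library(dplyr)  # provides %>% and filter()

# read the wine data, treat cultivar as a factor, and drop cultivar 3
wine <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/wine.csv", colClasses = c("cultivar" = "factor")) %>%
  filter(cultivar != 3)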
head(wine)
##   cultivar chem1 chem2 chem3 chem4 chem5 chem6 chem7 chem8 chem9 chem10
## 1        1 14.23  1.71  2.43  15.6   127  2.80  3.06  0.28  2.29   5.64
## 2        1 13.20  1.78  2.14  11.2   100  2.65  2.76  0.26  1.28   4.38
## 3        1 13.16  2.36  2.67  18.6   101  2.80  3.24  0.30  2.81   5.68
## 4        1 14.37  1.95  2.50  16.8   113  3.85  3.49  0.24  2.18   7.80
## 5        1 13.24  2.59  2.87  21.0   118  2.80  2.69  0.39  1.82   4.32
## 6        1 14.20  1.76  2.45  15.2   112  3.27  3.39  0.34  1.97   6.75
##   chem11 chem12 chem13
## 1   1.04   3.92   1065
## 2   1.05   3.40   1050
## 3   1.03   3.17   1185
## 4   0.86   3.45   1480
## 5   1.04   2.93    735
## 6   1.05   2.85   1450

Problem 1

A. (1 pt) Make a logistic regression model that predicts the cultivar from the concentrations of three chemical compounds of your choosing (not all of them!) in the wine data set. Show the summary (using summary) of your model below.
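
HINT: As a rough illustration of the general pattern only, the sketch below fits a logistic regression with `glm()` on simulated stand-in data. The data frame `toy`, its columns `group`, `x1`, and `x2`, and the object `toy_fit` are hypothetical names invented for this example; they are not part of the wine data set, and this is not the required solution.

```r
# Simulated stand-in data (NOT the wine data): two groups, two numeric predictors
set.seed(1)
n <- 100
toy <- data.frame(
  group = factor(rep(c(1, 2), each = n / 2)),
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 1)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 0.5))
)

# With a factor response, glm() treats the first level ("1") as failure
# and models the probability of the other level ("2")
toy_fit <- glm(group ~ x1 + x2, data = toy, family = binomial)
summary(toy_fit)
```

The same call shape should carry over to the wine data once the response and predictor columns are swapped in.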

I choose …

# your R code goes here

B. (1 pt) Make a plot of the fitted probability as a function of the linear predictor, colored by cultivar.
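
HINT: One possible way to collect the quantities for such a plot is sketched below on simulated stand-in data, using base R graphics. The names `toy`, `toy_fit`, and `pred` are hypothetical, and the choice of base `plot()` is only an assumption for the sake of a self-contained example; any plotting approach used in class works as well.

```r
# Simulated stand-in data (NOT the wine data)
set.seed(1)
n <- 100
toy <- data.frame(
  group = factor(rep(c(1, 2), each = n / 2)),
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 1))
)
toy_fit <- glm(group ~ x1, data = toy, family = binomial)

# Gather the linear predictor, fitted probability, and known group
pred <- data.frame(
  predictor = unname(toy_fit$linear.predictors),
  probability = unname(toy_fit$fitted.values),
  group = toy$group
)

# Fitted probability as a function of the linear predictor, colored by group
plot(pred$predictor, pred$probability,
     col = as.integer(pred$group), pch = 19,
     xlab = "linear predictor", ylab = "fitted probability")
legend("topleft", legend = levels(pred$group),
       col = seq_along(levels(pred$group)), pch = 19, title = "group")
```

The points fall on the logistic curve because the fitted probability is the inverse logit of the linear predictor.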

# your R code goes here

C. (3 pts) Choose a probability cut-off for classifying a given sample of wine as cultivar 1 or cultivar 2. State the cut-off that you chose. Calculate the true positive rate and false positive rate and interpret these rates in the context of the wine data set. Your answer should mention something about cultivars and the three chemical compounds you chose in part A.
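
HINT: The two rates can be computed directly from their definitions, as in the sketch below. The vectors `truth` and `prob` are simulated, hypothetical stand-ins for the known class and the fitted probabilities, and the cut-off of 0.5 is an arbitrary choice for illustration, not a recommendation.

```r
# Simulated stand-ins (NOT the wine data): 1 marks the "positive" class
set.seed(1)
truth <- rep(c(0, 1), each = 50)
prob <- plogis(rnorm(100, mean = ifelse(truth == 1, 1, -1)))

cutoff <- 0.5                              # arbitrary illustrative cut-off
predicted <- as.integer(prob >= cutoff)    # classify as positive at or above it

# True positive rate: fraction of actual positives classified as positive
tpr <- sum(predicted == 1 & truth == 1) / sum(truth == 1)
# False positive rate: fraction of actual negatives classified as positive
fpr <- sum(predicted == 1 & truth == 0) / sum(truth == 0)

c(tpr = tpr, fpr = fpr)
```

Raising the cut-off generally lowers both rates, and lowering it raises both, so the choice is a trade-off.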

I choose …

# your R code goes here

Your answer goes here. 2-3 sentences only.

Problem 2

A. (1 pt) Plot an ROC curve for the model that you created in Problem 1A. Does the model perform better than a model in which you randomly classify a wine sample as cultivar 1 or cultivar 2? Explain your answer in 1-2 sentences. HINT: To make an ROC plot, the variable for known truth needs to be converted to a factor.
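
HINT: To see what an ROC curve is made of, the sketch below builds one by hand in base R: it sweeps a cut-off over the predicted probabilities and records the true and false positive rate at each step. The vectors `truth` and `prob` are simulated, hypothetical stand-ins, and this manual construction is an assumption for illustration; a dedicated ROC plotting function should give an equivalent curve.

```r
# Simulated stand-ins (NOT the wine data): 1 marks the "positive" class
set.seed(1)
truth <- rep(c(0, 1), each = 50)
prob <- plogis(rnorm(100, mean = ifelse(truth == 1, 1, -1)))

# Sweep the cut-off from "nothing is positive" down to "everything is positive"
cutoffs <- c(Inf, sort(unique(prob), decreasing = TRUE))
tpr <- sapply(cutoffs, function(k) mean(prob[truth == 1] >= k))
fpr <- sapply(cutoffs, function(k) mean(prob[truth == 0] >= k))

plot(fpr, tpr, type = "s",
     xlab = "false positive rate", ylab = "true positive rate")
abline(0, 1, lty = 2)  # diagonal: expected curve for random classification
```

A curve that stays above the dashed diagonal indicates better-than-random classification.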

# your R code goes here

Your answer goes here. 1-2 sentences only.

B. (4 pts) Choose a new set of predictor variables (different from the variables that you chose in Problem 1A), and create a logistic regression model. Plot an ROC curve for your newly created model and, on the same plot, add the ROC curve from your model in Problem 1A. What can you conclude from your plot? Which model performs better and why? Support your conclusions with AUC values for each model.
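
HINT: The AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so it can be computed from ranks. The sketch below uses that rank-based formula on simulated scores; the helper `auc()` and the vectors `truth`, `score_strong`, and `score_weak` are hypothetical names made up for this example, and the helper is an illustrative alternative to whatever AUC function is used in class.

```r
# Rank-based AUC (Mann-Whitney form); truth is coded 0/1, higher score = more positive
auc <- function(truth, score) {
  r <- rank(score)                 # average ranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Simulated stand-ins (NOT the wine data): one well-separated and one weak score
set.seed(1)
truth <- rep(c(0, 1), each = 50)
score_strong <- rnorm(100, mean = ifelse(truth == 1, 2, 0))
score_weak <- rnorm(100, mean = ifelse(truth == 1, 0.5, 0))

auc_strong <- auc(truth, score_strong)
auc_weak <- auc(truth, score_weak)
c(strong = auc_strong, weak = auc_weak)
```

An AUC near 0.5 corresponds to random classification and an AUC near 1 to near-perfect separation, so the model with the larger AUC ranks the samples better overall.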

I choose …

# your R code goes here

Your answer goes here. 1-2 sentences only.