*Enter your name and EID here*

**This homework is due on Mar. 6, 2018 at 7:00pm. Please submit as a PDF file on Canvas.**

For this homework you will use the `wine` data set from the previous homework. This time, however, we have removed the samples from cultivar 3. The `wine` data set contains concentrations of 13 different chemical compounds (`chem1`-`chem13`) in 130 samples of wines grown in Italy. Each row is a different sample of wine, and the data set now contains just two different cultivars (`cultivar`) of wine.

```
library(tidyverse)  # ggplot2 and dplyr (filter, %>%, bind_rows, ...)
library(ggthemes)   # scale_color_colorblind()
library(plotROC)    # geom_roc() and calc_auc()

wine <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/wine.csv",
                 colClasses = c("cultivar" = "factor")) %>%
  filter(cultivar != 3)
head(wine)
```

```
##   cultivar chem1 chem2 chem3 chem4 chem5 chem6 chem7 chem8 chem9 chem10
## 1        1 14.23  1.71  2.43  15.6   127  2.80  3.06  0.28  2.29   5.64
## 2        1 13.20  1.78  2.14  11.2   100  2.65  2.76  0.26  1.28   4.38
## 3        1 13.16  2.36  2.67  18.6   101  2.80  3.24  0.30  2.81   5.68
## 4        1 14.37  1.95  2.50  16.8   113  3.85  3.49  0.24  2.18   7.80
## 5        1 13.24  2.59  2.87  21.0   118  2.80  2.69  0.39  1.82   4.32
## 6        1 14.20  1.76  2.45  15.2   112  3.27  3.39  0.34  1.97   6.75
##   chem11 chem12 chem13
## 1   1.04   3.92   1065
## 2   1.05   3.40   1050
## 3   1.03   3.17   1185
## 4   0.86   3.45   1480
## 5   1.04   2.93    735
## 6   1.05   2.85   1450
```

**Problem 1**

**A. (1 pt)** Make a logistic regression model that predicts the cultivar from the concentrations of **three chemical compounds of your choosing** (not all of them!) in the `wine` data set. Show the summary (using `summary`) of your model below.

I choose `chem1`, `chem2`, and `chem3`.

```
glm.out <- glm(cultivar ~ chem1 + chem2 + chem3, family = "binomial", data = wine)
summary(glm.out)
```

```
## 
## Call:
## glm(formula = cultivar ~ chem1 + chem2 + chem3, family = "binomial", 
##     data = wine)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.66637  -0.20594   0.03888   0.17769   3.06013  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 68.88854   12.71587   5.418 6.04e-08 ***
## chem1       -4.68710    0.91379  -5.129 2.91e-07 ***
## chem2       -0.08856    0.33612  -0.263   0.7922    
## chem3       -3.17020    1.45917  -2.173   0.0298 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 179.109  on 129  degrees of freedom
## Residual deviance:  45.927  on 126  degrees of freedom
## AIC: 53.927
## 
## Number of Fisher Scoring iterations: 7
```

**B. (1 pt)** Make a plot of the fitted probability as a function of the linear predictor, colored by cultivar.

```
lr_data <- data.frame(predictor = glm.out$linear.predictors,
                      prob = glm.out$fitted.values,
                      wine)
ggplot(lr_data, aes(x = predictor, y = prob, color = cultivar)) +
  geom_point() +
  scale_color_colorblind()
```
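The S shape of this plot comes from the logistic link: the fitted probability is the inverse logit of the linear predictor, 1/(1 + exp(-x)). As a quick sketch (the values in `lp` below are made up for illustration, not taken from the model):

```
# The inverse logit maps any linear-predictor value to a probability in (0, 1).
# `lp` holds hypothetical linear-predictor values.
lp <- c(-3, 0, 3)
manual  <- 1 / (1 + exp(-lp))  # inverse logit computed by hand
builtin <- plogis(lp)          # base R's logistic CDF gives the same result
```

For a binomial model with the default logit link, `glm.out$fitted.values` is exactly `plogis(glm.out$linear.predictors)`.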

**C. (3 pts)** Choose a probability cut-off for classifying a given sample of wine as cultivar 1 or cultivar 2. State the cut-off that you chose. Calculate the **true positive rate** and **false positive rate** and interpret these rates in the context of the `wine` data set. Your answer should mention something about cultivars and the three chemical compounds you chose in part A.

I choose a cut-off of 0.75.

```
pred_data <- data.frame(probability = glm.out$fitted.values,
                        cultivar = wine$cultivar)
# cut-off of 0.75
cutoff <- 0.75
# Number of true cultivar 1 samples identified as cultivar 1 (true positives)
pred_data %>%
  filter(probability <= cutoff & cultivar == 1) %>%
  tally() -> cultivar_1_true
# Number of true cultivar 2 samples identified as cultivar 2 (true negatives)
pred_data %>%
  filter(probability > cutoff & cultivar == 2) %>%
  tally() -> cultivar_2_true
# Total number of true cultivar 1 samples (known positives)
pred_data %>%
  filter(cultivar == 1) %>%
  tally() -> cultivar_1_total
# Total number of true cultivar 2 samples (known negatives)
pred_data %>%
  filter(cultivar == 2) %>%
  tally() -> cultivar_2_total
# True positive rate
tp <- cultivar_1_true$n / cultivar_1_total$n
# True negative rate
tn <- cultivar_2_true$n / cultivar_2_total$n
tp
```

```
## [1] 0.9830508
```

```
# False positive rate (1 - true negative rate)
(1 - tn)
```

```
## [1] 0.09859155
```

The true positive rate is 98.3%. The false positive rate is 9.9%. (Note that for a binary factor, `glm` models the probability of the second factor level, here cultivar 2, so a sample whose fitted probability is at or below the cut-off is classified as cultivar 1.) With a probability cut-off of 0.75, our regression model based on the concentrations of chemical compounds chem1, chem2, and chem3 correctly classifies 98.3% of the cultivar 1 samples. The model incorrectly classifies 9.9% of cultivar 2 samples as cultivar 1 samples.
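The same rates can be checked on a toy example. This is a sketch with made-up fitted probabilities (as in the model above, the fitted probability is the probability of cultivar 2, so values at or below the cut-off count as cultivar 1 calls):

```
# Six samples with known cultivar and hypothetical fitted probabilities
# of being cultivar 2
truth <- c(1, 1, 1, 2, 2, 2)
prob  <- c(0.10, 0.20, 0.80, 0.90, 0.95, 0.60)
cutoff <- 0.75
# classified as cultivar 1 when the fitted probability is at or below the cut-off
tpr <- sum(prob <= cutoff & truth == 1) / sum(truth == 1)  # 2/3 here
fpr <- sum(prob <= cutoff & truth == 2) / sum(truth == 2)  # 1/3 here
```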

**Problem 2**

**A. (1 pt)** Plot an ROC curve for the model that you created in Problem 1A. Does the model perform better than a model in which you randomly classify a wine sample as cultivar 1 or cultivar 2? Explain your answer in 1-2 sentences. **HINT**: To make an ROC plot, the variable for known truth needs to be converted to a factor.

```
df1 <- data.frame(predictor = predict(glm.out, wine),
                  known_truth = wine$cultivar,
                  model = "chem123")
# plot
# the aesthetic names are not the most intuitive:
# `d` (disease) holds the known truth
# `m` (marker) holds the predictor values
ggplot(df1, aes(d = factor(known_truth), m = predictor, color = model)) +
  geom_roc(n.cuts = 0) +
  coord_fixed() +
  scale_color_colorblind()
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!
```

The model I created in Problem 1A performs better than a random classifier since the ROC curve is above the y = x line. The true positive rate is always greater than the false positive rate. No matter what probability cut-off we use, the model correctly classifies cultivar 1 samples more often than it incorrectly classifies cultivar 2 samples as cultivar 1.
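The "random classifier" baseline can be made concrete with a sketch: if the score is independent of the true class, the ROC curve hugs the diagonal and the AUC comes out near 0.5. All data here are made up; `mean(outer(...))` is the rank (Wilcoxon) formulation of AUC.

```
# A score that is independent of the truth yields an AUC near 0.5.
# AUC equals the probability that a randomly chosen positive sample
# scores higher than a randomly chosen negative sample.
set.seed(123)
truth <- rep(c(0, 1), each = 50)  # 50 negatives, 50 positives
score <- runif(100)               # random scores, unrelated to `truth`
auc <- mean(outer(score[truth == 1], score[truth == 0], ">"))
```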

**B. (4 pts)** Choose a new set of predictor variables (different from the variables that you chose in Problem 1A), and create a logistic regression model. Plot an ROC curve for your newly-created model and, on the same plot, add an ROC curve from your model in Problem 1A. What can you conclude from your plot? Which model performs better and why? Support your conclusions **with AUC values for each model**.

I choose `chem4`, `chem5`, and `chem6`.

```
glm.out2 <- glm(cultivar ~ chem4 + chem5 + chem6, family = "binomial", data = wine)
df2 <- data.frame(predictor = predict(glm.out2, wine),
                  known_truth = wine$cultivar,
                  model = "chem456")
df_combined <- bind_rows(df1, df2)
```

```
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
```

```
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
```

```
# plot
# the aesthetic names are not the most intuitive
# `d` (disease) holds the known truth
# `m` (marker) holds the predictor values
p <- ggplot(df_combined, aes(d = factor(known_truth), m = predictor, color = model)) +
  geom_roc(n.cuts = 0) +
  coord_fixed() +
  scale_color_colorblind()
# calculate AUC
calc_auc(p)
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!
```

```
##   PANEL group       AUC
## 1     1     1 0.9792313
## 2     1     2 0.9047505
```

```
model <- unique(df_combined$model)
model_info <- data.frame(model,
                         group = order(model))
left_join(model_info, calc_auc(p)) %>%
  select(-group, -PANEL) %>%
  arrange(desc(AUC))
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!
```

```
## Joining, by = "group"
```

```
##     model       AUC
## 1 chem123 0.9792313
## 2 chem456 0.9047505
```

My newly-created model performs worse than the model created in Problem 1A because the AUCs are 0.90 and 0.98, respectively. I conclude that chem1, chem2, and chem3 are, collectively, better predictors of cultivar than chem4, chem5, and chem6.