Homework 6 Solutions

Enter your name and EID here

This homework is due on Mar. 6, 2018 at 7:00pm. Please submit as a PDF file on Canvas.

For this homework you will again use the wine data set from the previous homework; this time, however, we have removed the samples from cultivar 3. The wine data set contains concentrations of 13 different chemical compounds (chem1-chem13) in 130 samples of wines grown in Italy. Each row is a different sample of wine, and the data set now contains just two different cultivars (cultivar) of wine.

library(tidyverse) # for %>%, filter(), and ggplot2
library(ggthemes)  # for scale_color_colorblind()
library(plotROC)   # for geom_roc() and calc_auc()

wine <- read.csv("http://wilkelab.org/classes/SDS348/data_sets/wine.csv", colClasses = c("cultivar" = "factor")) %>% filter(cultivar != 3)
head(wine)
##   cultivar chem1 chem2 chem3 chem4 chem5 chem6 chem7 chem8 chem9 chem10
## 1        1 14.23  1.71  2.43  15.6   127  2.80  3.06  0.28  2.29   5.64
## 2        1 13.20  1.78  2.14  11.2   100  2.65  2.76  0.26  1.28   4.38
## 3        1 13.16  2.36  2.67  18.6   101  2.80  3.24  0.30  2.81   5.68
## 4        1 14.37  1.95  2.50  16.8   113  3.85  3.49  0.24  2.18   7.80
## 5        1 13.24  2.59  2.87  21.0   118  2.80  2.69  0.39  1.82   4.32
## 6        1 14.20  1.76  2.45  15.2   112  3.27  3.39  0.34  1.97   6.75
##   chem11 chem12 chem13
## 1   1.04   3.92   1065
## 2   1.05   3.40   1050
## 3   1.03   3.17   1185
## 4   0.86   3.45   1480
## 5   1.04   2.93    735
## 6   1.05   2.85   1450
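As a quick sanity check (not required by the problem), we can verify that only cultivars 1 and 2 remain in the 130 samples. Note that filter() removes the rows but not the unused factor level, so we drop it explicitly here.

# confirm cultivar 3 is gone; droplevels() removes the now-empty level "3"
table(droplevels(wine)$cultivar)
nrow(wine)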

Problem 1

A. (1 pt) Make a logistic regression model that predicts the cultivar from the concentrations of three chemical compounds of your choosing (not all of them!) in the wine data set. Show the summary (using summary) of your model below.

I chose chem1, chem2, and chem3.

glm.out <- glm(cultivar ~ chem1 + chem2 + chem3, family = "binomial", data = wine)

summary(glm.out)
## 
## Call:
## glm(formula = cultivar ~ chem1 + chem2 + chem3, family = "binomial", 
##     data = wine)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.66637  -0.20594   0.03888   0.17769   3.06013  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 68.88854   12.71587   5.418 6.04e-08 ***
## chem1       -4.68710    0.91379  -5.129 2.91e-07 ***
## chem2       -0.08856    0.33612  -0.263   0.7922    
## chem3       -3.17020    1.45917  -2.173   0.0298 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 179.109  on 129  degrees of freedom
## Residual deviance:  45.927  on 126  degrees of freedom
## AIC: 53.927
## 
## Number of Fisher Scoring iterations: 7
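As an aside (not required by the problem): because glm() treats the second factor level as "success," the coefficients are on the log-odds scale for cultivar 2. Exponentiating converts them to odds ratios; for example, a one-unit increase in chem1 multiplies the odds of cultivar 2 by exp(-4.69) ≈ 0.009.

# coefficients as odds ratios for cultivar 2 (the "success" level)
exp(coef(glm.out))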

B. (1 pt) Make a plot of the fitted probability as a function of the linear predictor, colored by cultivar.

# collect the linear predictor and fitted probabilities alongside the original data
lr_data <- data.frame(predictor=glm.out$linear.predictors, prob=glm.out$fitted.values, wine)

ggplot(lr_data, aes(x=predictor, y=prob, color=cultivar)) + geom_point() + scale_color_colorblind()
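The points trace out a sigmoid because the fitted probability is the inverse-logit of the linear predictor. A quick check (a sketch, not required by the problem):

# fitted values should equal the inverse-logit (plogis) of the linear predictor
all.equal(unname(glm.out$fitted.values),
          unname(plogis(glm.out$linear.predictors)))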

C. (3 pts) Choose a probability cut-off for classifying a given sample of wine as cultivar 1 or cultivar 2. State the cut-off that you chose. Calculate the true positive rate and false positive rate and interpret these rates in the context of the wine data set. Your answer should mention something about cultivars and the three chemical compounds you chose in part A.

I chose a cut-off of 0.75.

# glm() codes the second factor level ("2") as "success", so
# glm.out$fitted.values is the predicted probability that a sample is cultivar 2
pred_data <- data.frame(probability=glm.out$fitted.values, cultivar=wine$cultivar)

# classify as cultivar 1 when the fitted probability is at or below the cutoff
cutoff <- 0.75

# Number of true cultivar 1 samples identified as cultivar 1 (true positives)
pred_data %>% filter(probability <= cutoff & cultivar==1) %>%
  tally() -> cultivar_1_true 

# Number of true cultivar 2 samples identified as cultivar 2 (true negatives)
pred_data %>% filter(probability > cutoff & cultivar==2) %>%
  tally() -> cultivar_2_true

# Total number of true cultivar 1 samples (known positives)
pred_data %>% filter(cultivar==1) %>%
  tally() -> cultivar_1_total
# Total number of true cultivar 2 samples (known negatives)
pred_data %>% filter(cultivar==2) %>%
  tally() -> cultivar_2_total

# True positive rate
tp <- cultivar_1_true$n/(cultivar_1_total$n)
# True negative rate
tn <- cultivar_2_true$n/(cultivar_2_total$n)
tp
## [1] 0.9830508
# False positive rate (1 - true negative rate)
(1 - tn)
## [1] 0.09859155

The true positive rate is 98.3%. The false positive rate is 9.9%. With a probability cut-off of 0.75, our regression model based on the concentrations of chemical compounds chem1, chem2, and chem3 correctly classifies 98.3% of the cultivar 1 samples. The model incorrectly classifies 9.9% of cultivar 2 samples as cultivar 1 samples.
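As a cross-check (a sketch, not part of the required answer), the same rates can be read off a confusion matrix at this cutoff. Recall that the fitted probability is the probability of cultivar 2, so samples at or below the cutoff are classified as cultivar 1.

# cross-check: confusion matrix at the chosen cutoff
predicted <- ifelse(pred_data$probability <= cutoff, "1", "2")
table(predicted, truth = pred_data$cultivar)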

Problem 2

A. (1 pt) Plot an ROC curve for the model that you created in Problem 1A. Does the model perform better than a model in which you randomly classify a wine sample as cultivar 1 or cultivar 2? Explain your answer in 1-2 sentences. HINT: To make an ROC plot, the variable for known truth needs to be converted to a factor.

# predict() on a glm object returns the linear predictor by default (type = "link")
df1 <- data.frame(predictor = predict(glm.out, wine),
                  known_truth = wine$cultivar,
                  model = "chem123")
# plot

# the aesthetic names are not the most intuitive
# `d` (disease) holds the known truth
# `m` (marker) holds the predictor values 
ggplot(df1, aes(d = factor(known_truth), m = predictor, color = model)) + 
  geom_roc(n.cuts = 0) +
  coord_fixed() + scale_color_colorblind()
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!

The model I created in Problem 1A performs better than a random classifier since the ROC curve is above the y = x line. The true positive rate is always greater than the false positive rate. No matter what probability cut-off we use, the model correctly classifies cultivar 1 samples more often than it incorrectly classifies cultivar 2 samples as cultivar 1.
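We can quantify this with the AUC (a sketch using calc_auc() from plotROC, which Part B also uses): a random classifier has an AUC of 0.5, whereas this model's AUC is close to 1.

# AUC for the Problem 1A model alone; a random classifier would score 0.5
p1 <- ggplot(df1, aes(d = factor(known_truth), m = predictor)) +
  geom_roc(n.cuts = 0)
calc_auc(p1)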

B. (4 pts) Choose a new set of predictor variables (different from the variables that you chose in Problem 1A), and create a logistic regression model. Plot an ROC curve for your newly-created model and, on the same plot, add an ROC curve from your model in Problem 1A. What can you conclude from your plot? Which model performs better and why? Support your conclusions with AUC values for each model.

I chose chem4, chem5, and chem6.

glm.out2 <- glm(cultivar ~ chem4 + chem5 + chem6, data=wine, family="binomial")

df2 <- data.frame(predictor = predict(glm.out2, wine),
                  known_truth = wine$cultivar, 
                  model = "chem456")   

df_combined <- bind_rows(df1, df2)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
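These coercion warnings are harmless; the model column simply becomes character. They could be avoided by building each data frame with character columns up front, e.g. (a hypothetical alternative, shown for df1 only):

# stringsAsFactors = FALSE keeps `model` as character, so bind_rows() has nothing to coerce
df1_chr <- data.frame(predictor = predict(glm.out, wine),
                      known_truth = wine$cultivar,
                      model = "chem123", stringsAsFactors = FALSE)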
# plot both ROC curves; as before, `d` holds the known truth and `m` the predictor
p <- ggplot(df_combined, aes(d = factor(known_truth), m = predictor, color = model)) + 
  geom_roc(n.cuts = 0) +
  coord_fixed() + scale_color_colorblind()

# calculate AUC
calc_auc(p)
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!
##   PANEL group       AUC
## 1     1     1 0.9792313
## 2     1     2 0.9047505
# map the group numbers returned by calc_auc() back to the model names
model <- unique(df_combined$model)
model_info <- data.frame(model,
                         group = order(model))
left_join(model_info, calc_auc(p)) %>%
  select(-group, -PANEL) %>%
  arrange(desc(AUC))
## Warning in verify_d(data$d): D not labeled 0/1, assuming 1 = 0 and 2 = 1!
## Joining, by = "group"
##     model       AUC
## 1 chem123 0.9792313
## 2 chem456 0.9047505

My newly created model (AUC = 0.90) performs worse than the model from Problem 1A (AUC = 0.98). I conclude that chem1, chem2, and chem3 are, collectively, better predictors of cultivar than chem4, chem5, and chem6.
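As a complementary check (not required by the problem), the in-sample AIC points the same way; both models estimate the same number of parameters, so the lower AIC indicates the better fit.

# lower AIC indicates the better-fitting model
AIC(glm.out, glm.out2)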