**Feb 22, 2018**

In this worksheet, we will use the libraries tidyverse, ggthemes, and plotROC:

```
library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(ggthemes) # for scale_color_colorblind()
# install.packages("plotROC")
library(plotROC) # for geom_roc() and calc_auc()
```

We continue working with the two species virginica and versicolor from the `iris`

data set. As was done in the previous class, the following code makes a reduced data set, fits a logistic regression model (this time using only `Sepal.Length`

as predictor), and then combines the predicted probabilities and the known species identity into one data frame called `pred_data`

:

```
# make a reduced iris data set that only contains virginica and versicolor species
iris.small <- filter(iris, Species %in% c("virginica", "versicolor")) %>%
mutate(Species = factor(Species)) # this makes R forget about the 3rd species
# fit logistic regression model
glm.out <- glm(Species ~ Sepal.Length,
data = iris.small,
family = binomial)
# combine fitted values and Species identity
pred_data <- data.frame(probability = glm.out$fitted.values,
Species = iris.small$Species)
```

Pick a probability cutoff at which you would decide that a specimen belongs to virginica rather than versicolor. Calculate how many virginicas you call correctly and how many incorrectly given that cutoff. Hint: use the functions `filter()`

and `tally()`

.

```
cutoff <- 0.5 # cutoff of 0.5
pred_data %>% filter(probability >= cutoff & Species=="virginica") %>%
tally() -> virg_true
pred_data %>% filter(probability < cutoff & Species=="virginica") %>%
tally() -> virg_false
virg_true
```

```
## n
## 1 37
```

`virg_false`

```
## n
## 1 13
```

Now do the same calculation for versicolor.

```
pred_data %>% filter(probability < cutoff & Species=="versicolor") %>%
tally() -> vers_true
pred_data %>% filter(probability >= cutoff & Species=="versicolor") %>%
tally() -> vers_false
vers_true
```

```
## n
## 1 36
```

`vers_false`

```
## n
## 1 14
```

If we define a call of virginica as a positive and a call of versicolor as a negative, what are the true positive rate (sensitivity, true positives divided by all possible positives, i.e., by the total count of virginicas) and the true negative rate (specificity, true negatives divided by all possible negatives, i.e., by the total count of versicolors) in your analysis?

```
# get total counts
pred_data %>% filter(Species=="virginica") %>%
tally() -> virg_total
pred_data %>% filter(Species=="versicolor") %>%
tally() -> vers_total
tp <- virg_true$n/(virg_total$n)
tn <- vers_true$n/(vers_total$n)
tp
```

`## [1] 0.74`

`tn`

`## [1] 0.72`

The true-positive rate (sensitivity) is 0.74 and the true-negative rate (specificity) is 0.72.

Recalculate the true-positive rate and the true-negative rate for a different probability cutoff.

```
cutoff <- 0.1 # cutoff of 0.1
pred_data %>% filter(probability >= cutoff & Species=="virginica") %>%
tally() -> virg_true
pred_data %>% filter(probability < cutoff & Species=="versicolor") %>%
tally() -> vers_true
tp <- virg_true$n/(virg_total$n)
tn <- vers_true$n/(vers_total$n)
tp
```

`## [1] 0.98`

`tn`

`## [1] 0.08`

At a cutoff of 0.1, we get a true-positive rate of 0.98 and a true-negative rate of 0.08.

Next we will calculate and plot ROC curves. We do this with the function `geom_roc()`

from the package plotROC. Here is an example:

```
# fit the model
glm.out <- glm(Species ~ Sepal.Width,
data = iris.small,
family = binomial)
# make data frame of the linear predictor and the known truth
df1 <- data.frame(predictor = predict(glm.out, iris.small), # linear predictor
known_truth = iris.small$Species, # the known truth, i.e., true species assignment
model = "Sepal.Width") # an arbitrary name to distinguish different curves
# plot
# the aesthetic names are not the most intuitive
# `d` (disease) holds the known truth
# `m` (marker) holds the predictor values
ggplot(df1, aes(d = known_truth, m = predictor, color = model)) +
geom_roc(n.cuts = 0) +
coord_fixed() + scale_color_colorblind()
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming versicolor = 0 and
## virginica = 1!
```

Now make a few additional models and then make a plot showing the various ROC curves in one graph. (Hint: combine the data frames from the different models into one using the function `rbind()`

.)

```
glm.out <- glm(Species ~ Petal.Width,
data = iris.small,
family = binomial)
df2 <- data.frame(predictor = predict(glm.out, iris.small),
known_truth = iris.small$Species,
model = "Petal.Width")
glm.out <- glm(Species ~ Petal.Length,
data = iris.small,
family = binomial)
df3 <- data.frame(predictor = predict(glm.out, iris.small),
known_truth = iris.small$Species,
model = "Petal.Length")
glm.out <- glm(Species ~ Petal.Width + Petal.Length + Sepal.Width,
data = iris.small,
family = binomial)
```

`## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred`

```
df4 <- data.frame(predictor = predict(glm.out, iris.small),
known_truth = iris.small$Species,
model = "combined")
df_combined <- rbind(df1, df2, df3, df4)
ggplot(df_combined, aes(d = known_truth, m = predictor, color = model)) +
geom_roc(n.cuts = 0) +
coord_fixed() + scale_color_colorblind()
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming versicolor = 0 and
## virginica = 1!
```

Calculate the area under the ROC curve (AUC) for your first model, using the function `calc_auc()`

from the plotROC package. This function needs to be called on a plot containing a `geom_roc()`

statement.

```
# make plot and save as p
p <- ggplot(df1, aes(d = known_truth, m = predictor, color = model)) +
geom_roc(n.cuts = 0)
# calculate AUC
calc_auc(p)
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming versicolor = 0 and
## virginica = 1!
```

```
## PANEL group AUC
## 1 1 1 0.6636
```

Now calculate the AUC values for all the ROC curves you generated earlier, using the function `calc_auc()`

exactly once.

```
p <- ggplot(df_combined, aes(d = known_truth, m = predictor, color = model)) +
geom_roc(n.cuts = 0)
calc_auc(p)
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming versicolor = 0 and
## virginica = 1!
```

```
## PANEL group AUC
## 1 1 1 0.6636
## 2 1 2 0.9804
## 3 1 3 0.9822
## 4 1 4 0.9972
```

Unfortunately, because of how ggplot works, we have now lost the model names and instead just have group numbers. We can recover the connection between model name and group number by calling `unique(df_combined$model)`

to obtain the model names and `order(unique(df_combined$model))`

to obtain the corresponding group numbers. Use this information to generate a modified table that contains model name and AUC and is sorted in descending order of AUC.

```
model <- unique(df_combined$model)
model_info <- data.frame(model,
group = order(model))
left_join(model_info, calc_auc(p)) %>%
select(-group, -PANEL) %>%
arrange(desc(AUC))
```

```
## Warning in verify_d(data$d): D not labeled 0/1, assuming versicolor = 0 and
## virginica = 1!
```

`## Joining, by = "group"`

```
## model AUC
## 1 combined 0.9972
## 2 Petal.Length 0.9822
## 3 Petal.Width 0.9804
## 4 Sepal.Width 0.6636
```