Feb 26, 2019
In this worksheet, we will use the libraries tidyverse and ggthemes:
library(tidyverse)
theme_set(theme_bw(base_size = 12)) # set default ggplot2 theme
library(ggthemes)
We will work with the iris data set. Specifically, we will use a subset of the data that contains only the species virginica and versicolor:
# make a reduced iris data set that only contains virginica and versicolor species
iris_small <- iris %>%
  filter(Species %in% c("virginica", "versicolor"))
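As a quick sanity check (not part of the original worksheet), we can verify that the subset contains only the two species of interest, with 50 observations each:
# tabulate species counts; droplevels() removes the now-empty setosa level
table(droplevels(iris_small$Species))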
Fit a logistic regression model to the iris_small data set. Then successively remove predictors until only predictors with a p-value less than 0.1 remain.
# logistic regression
glm_out <- glm(
  Species ~ Sepal.Width + Sepal.Length + Petal.Width + Petal.Length,
  data = iris_small,
  family = binomial
) # family = binomial required for logistic regression
summary(glm_out)
##
## Call:
## glm(formula = Species ~ Sepal.Width + Sepal.Length + Petal.Width +
## Petal.Length, family = binomial, data = iris_small)
##
## Deviance Residuals:
##        Min        1Q    Median        3Q       Max
##   -2.01105  -0.00541  -0.00001   0.00677   1.78065
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -42.638     25.707  -1.659   0.0972 .
## Sepal.Width    -6.681      4.480  -1.491   0.1359
## Sepal.Length   -2.465      2.394  -1.030   0.3032
## Petal.Width    18.286      9.743   1.877   0.0605 .
## Petal.Length    9.429      4.737   1.991   0.0465 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 138.629  on 99  degrees of freedom
## Residual deviance:  11.899  on 95  degrees of freedom
## AIC: 21.899
##
## Number of Fisher Scoring iterations: 10
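If we wanted to automate this decision, we could extract the p-values directly from the coefficient table. A minimal sketch (not part of the original workflow):
# find the predictor with the largest p-value, ignoring the intercept
pvals <- coef(summary(glm_out))[-1, "Pr(>|z|)"]
names(which.max(pvals)) # "Sepal.Length" here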
We remove Sepal.Length as it has the highest p-value:
glm_out <- glm(
  Species ~ Sepal.Width + Petal.Width + Petal.Length,
  data = iris_small,
  family = binomial
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
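This warning indicates that some observations are fit with probabilities numerically indistinguishable from 0 or 1, which typically happens when the classes are almost perfectly separable, as they are here. For the purposes of this worksheet we can ignore it.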
summary(glm_out)
##
## Call:
## glm(formula = Species ~ Sepal.Width + Petal.Width + Petal.Length,
## family = binomial, data = iris_small)
##
## Deviance Residuals:
##        Min        1Q    Median        3Q       Max
##   -1.75795  -0.00412   0.00000   0.00290   1.92193
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -50.527     23.995  -2.106   0.0352 *
## Sepal.Width    -8.376      4.761  -1.759   0.0785 .
## Petal.Width    21.430     10.707   2.001   0.0453 *
## Petal.Length    7.875      3.841   2.050   0.0403 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 138.629  on 99  degrees of freedom
## Residual deviance:  13.266  on 96  degrees of freedom
## AIC: 21.266
##
## Number of Fisher Scoring iterations: 10
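All remaining predictors now have p-values below 0.1, so we stop removing predictors here.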
Make a plot of the fitted probability as a function of the linear predictor, colored by species identity. Hint: you will have to make a new data frame combining data from the fitted model with data from the iris_small data frame.
lr_data <- data.frame(
  predictor = glm_out$linear.predictors,
  prob = glm_out$fitted.values,
  Species = iris_small$Species
)
ggplot(lr_data, aes(x = predictor, y = prob, color = Species)) +
  geom_point() +
  scale_color_colorblind()
If we want to plot the original observations as a “rug” along the bottom or top of the plot, we first need to make two separate data frames, one with the data points for each species, and then feed them individually into the geom_rug() function. In addition, we make a data frame containing the logistic function, so that we get a smooth black line underneath the points.
# extract virginica and versicolor data
virg_data <- filter(lr_data, Species == "virginica")
vers_data <- filter(lr_data, Species == "versicolor")
# make data frame with logistic function
predictor <- (-250:320) / 10 # grid from -25 to 32, spanning the observed linear predictors
prob <- exp(predictor) / (1 + exp(predictor))
log_fun_data <- data.frame(predictor, prob)
# plot
ggplot(lr_data, aes(x = predictor, y = prob, color = Species)) +
  geom_line(data = log_fun_data, color = "black") +
  geom_point() +
  geom_rug(data = vers_data, sides = "b") +
  geom_rug(data = virg_data, sides = "t") +
  scale_color_colorblind()
Make a density plot that shows how the two species are separated by the linear predictor.
ggplot(lr_data, aes(x = predictor, fill = Species)) +
  geom_density(alpha = 0.5) +
  scale_fill_colorblind()
Assume you have obtained samples from three plants, with measurements as listed below. Predict the probability that each of these plants belongs to the species virginica.
plant1 <- data.frame(
  Sepal.Length = 6.4,
  Sepal.Width = 2.8,
  Petal.Length = 4.6,
  Petal.Width = 1.8
)
plant2 <- data.frame(
  Sepal.Length = 6.3,
  Sepal.Width = 2.5,
  Petal.Length = 4.1,
  Petal.Width = 1.7
)
plant3 <- data.frame(
  Sepal.Length = 6.7,
  Sepal.Width = 3.3,
  Petal.Length = 5.2,
  Petal.Width = 2.3
)
predict(glm_out, plant1, type = "response")
## 1
## 0.6934611
Plant 1 has a 69% probability of being virginica. It may be a virginica, but the prediction is too close to call with confidence.
predict(glm_out, plant2, type = "response")
## 1
## 0.06002675
Plant 2 has a 6% probability of being virginica. It is very likely versicolor.
predict(glm_out, plant3, type = "response")
## 1
## 0.9999943
Plant 3 has a 99.999% probability of being virginica. It is very likely virginica.
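As a small variation (output not shown in the original worksheet), the three plants could also be combined into one data frame and predicted in a single call:
# combine the three plants and compute all probabilities at once
plants <- rbind(plant1, plant2, plant3)
predict(glm_out, plants, type = "response")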
Pick a cutoff predictor value at which you would decide that a specimen belongs to virginica rather than versicolor. Calculate how many virginicas you call correctly and how many incorrectly given that choice.
cutoff <- 0 # a linear predictor of 0 corresponds to a fitted probability of 0.5
# virginicas called correctly (above the cutoff) and incorrectly (at or below it)
virg_true <- sum(lr_data$predictor > cutoff & lr_data$Species == "virginica")
virg_false <- sum(lr_data$predictor <= cutoff & lr_data$Species == "virginica")
virg_true
## [1] 49
virg_false
## [1] 1
Now do the same calculation for versicolor.
vers_true <- sum(lr_data$predictor <= cutoff & lr_data$Species == "versicolor")
vers_false <- sum(lr_data$predictor > cutoff & lr_data$Species == "versicolor")
vers_true
## [1] 48
vers_false
## [1] 2
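The same counts can also be read off a confusion matrix, for example with table(). A sketch equivalent to the sums above:
# cross-tabulate predicted calls against the true species
table(
  predicted = ifelse(lr_data$predictor > cutoff, "virginica", "versicolor"),
  actual = lr_data$Species
)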
If we define a call of virginica as a positive and a call of versicolor as a negative, what are the true positive rate (sensitivity, true positives divided by all possible positives) and the true negative rate (specificity, true negatives divided by all possible negatives) in your analysis?
tp <- virg_true / (virg_true + virg_false)
tn <- vers_true / (vers_true + vers_false)
tp
## [1] 0.98
tn
## [1] 0.96
The true-positive rate (sensitivity) is 98%.
The true-negative rate (specificity) is 96%.
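Note that both rates depend on the chosen cutoff: raising it would call fewer specimens virginica, lowering the sensitivity while raising the specificity, and vice versa.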