In 1898, Hermon Bumpus, an American biologist working at Brown University, collected data on one of the first examples of natural selection directly observed in nature. Immediately following a bad winter storm, he collected 136 English house sparrows, Passer domesticus, and brought them indoors. Of these birds, 64 had died during the storm, but 72 recovered and survived. By comparing measurements of physical traits, Bumpus demonstrated physical differences between the dead and living birds. He interpreted this finding as evidence for natural selection as a result of this storm:
bumpus <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/bumpus_full.csv")
## Parsed with column specification:
## cols(
## Sex = col_character(),
## Age = col_character(),
## Survival = col_character(),
## Length = col_double(),
## Wingspread = col_double(),
## Weight = col_double(),
## Skull_Length = col_double(),
## Humerus_Length = col_double(),
## Femur_Length = col_double(),
## Tarsus_Length = col_double(),
## Sternum_Length = col_double(),
## Skull_Width = col_double()
## )
bumpus$Survival <- factor(bumpus$Survival)
head(bumpus)
## # A tibble: 6 x 12
## Sex Age Survival Length Wingspread Weight Skull_Length Humerus_Length
## <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Male Adult Alive 154 241 24.5 31.2 17.4
## 2 Male Adult Alive 160 252 26.9 30.8 18.7
## 3 Male Adult Alive 155 243 26.9 30.6 18.6
## 4 Male Adult Alive 154 245 24.3 31.7 18.8
## 5 Male Adult Alive 156 247 24.1 31.5 18.2
## 6 Male Adult Alive 161 253 26.5 31.8 19.8
## # … with 4 more variables: Femur_Length <dbl>, Tarsus_Length <dbl>,
## # Sternum_Length <dbl>, Skull_Width <dbl>
The data set has three categorical variables (Sex
, with levels Male
and Female
, Age
, with levels Adult
and Young
, and Survival
, with levels Alive
and Dead
) and nine numerical variables that hold various aspects of the birds’ anatomy, such as wingspread, weight, etc.
Problem 1: Make a logistic regression model that can predict survival status from all other predictor variables. (Include the categorical predictors Sex
and Age
.) Then do backwards selection, removing the predictors with the highest P value one by one, until you are only left with predictors that have P<0.1. How many and which predictors remain in the final model?
glm_out_all <- glm(Survival ~ Sex +
Age +
Length +
Wingspread +
Weight +
Skull_Length +
Humerus_Length +
Femur_Length +
Tarsus_Length +
Sternum_Length +
Skull_Width,
data = bumpus,
family = "binomial")
summary(glm_out_all)
##
## Call:
## glm(formula = Survival ~ Sex + Age + Length + Wingspread + Weight +
## Skull_Length + Humerus_Length + Femur_Length + Tarsus_Length +
## Sternum_Length + Skull_Width, family = "binomial", data = bumpus)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2342 -0.7890 -0.1887 0.7655 2.1927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.79812 15.13435 -0.713 0.47555
## SexMale -1.64710 0.66562 -2.475 0.01334 *
## AgeYoung 0.32973 0.47216 0.698 0.48496
## Length 0.42375 0.10958 3.867 0.00011 ***
## Wingspread -0.01025 0.08496 -0.121 0.90394
## Weight 0.88472 0.24353 3.633 0.00028 ***
## Skull_Length -0.46347 0.46141 -1.004 0.31516
## Humerus_Length -1.66395 0.89997 -1.849 0.06447 .
## Femur_Length 0.09391 0.86933 0.108 0.91397
## Tarsus_Length -0.25479 0.39646 -0.643 0.52045
## Sternum_Length -0.67528 0.32942 -2.050 0.04037 *
## Skull_Width -0.68535 0.76052 -0.901 0.36750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 188.07 on 135 degrees of freedom
## Residual deviance: 129.56 on 124 degrees of freedom
## AIC: 153.56
##
## Number of Fisher Scoring iterations: 5
# Your R code here (remove variables that do not contribute well to the model)
Your answer here
Problem 2: Make a plot of the fitted probability as a function of the linear predictor, colored by survival. Make a density plot that shows how the two outcomes are separated by the linear predictor. Interperet your plots in 1-2 sentences. If you had to choose a cut-off value for alive or dead, where would it be?
# Your R code here
Your answer here
Problem 3: Add rugs to both the top and bottom of the probability plot above. Add a curve for the logistic function.
# Your R code here