Project 2

Enter your name and EID here


Please submit both this completed Rmarkdown document and its knitted HTML, converted to PDF, on Canvas no later than 7:00 pm on March 27th, 2018. These two documents will be graded jointly, so they must be consistent (as in, don’t change the Rmarkdown file without also updating the knitted HTML!).

All results presented must have corresponding code. Any answers/results given without the corresponding R code that generated the result will be considered absent. All code reported in your final project document should work properly. Please bear in mind that you will lose points for the following:

  • an R-code chunk with no comments
  • results without corresponding R code
  • extraneous code which does not contribute to the question
  • printing out the entire data table

For this project, you will work with a dataset that contains information about the passengers on board the ocean liner Titanic.

We have already subdivided the full data set into training and test data sets (train_data and test_data), and we also provide the full data set (Titanic). Please use the training and test data sets for Part 1 and any of these data sets for Part 2.

Titanic <- read.csv("")

train_fraction <- 0.8 # fraction of data for training purposes
set.seed(123)  # set the seed to make your partition reproducible
train_size <- floor(train_fraction * nrow(Titanic)) # number of observations in training set
train_indices <- sample(seq_len(nrow(Titanic)), size = train_size) # randomly chosen training rows

train_data <- Titanic[train_indices, ] # get training data
test_data <- Titanic[-train_indices, ] # get test data
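
As a quick sanity check on the partition above, the following sketch reproduces the index arithmetic without the data file. It assumes the 1313 observations mentioned in Part 1; `n_total` and `test_indices` are illustrative names that do not appear in the project code.

```r
# Sketch only: reproduces the split arithmetic on row indices, not on the real data.
n_total <- 1313          # assumption: number of passengers stated in Part 1
train_fraction <- 0.8    # fraction of data for training purposes

set.seed(123)            # fixed seed, so the draw is reproducible
train_size <- floor(train_fraction * n_total)
train_indices <- sample(seq_len(n_total), size = train_size)
test_indices <- setdiff(seq_len(n_total), train_indices)

# every row should land in exactly one of the two sets
length(intersect(train_indices, test_indices))
c(train = length(train_indices), test = length(test_indices))
```

With 1313 rows this gives 1050 training and 263 test observations. The training count agrees with the Part 1 model summary, where 607 observations were used and 443 were dropped for missingness.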

##                                            Name PClass   Age    Sex Survived SexCode
## 1                  Allen, Miss Elisabeth Walton    1st 29.00 female        1       1
## 2                   Allison, Miss Helen Loraine    1st  2.00 female        0       1
## 3           Allison, Mr Hudson Joshua Creighton    1st 30.00   male        0       0
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st 25.00 female        0       1
## 5                 Allison, Master Hudson Trevor    1st  0.92   male        1       0
## 6                            Anderson, Mr Harry    1st 47.00   male        1       0

The column contents are as follows:

  • Name: Passenger name.
  • PClass: Passenger class: 1st, 2nd, or 3rd.
  • Age: Age in years.
  • Sex: Sex of the passenger: male or female.
  • Survived: Survival status: 1=survived, 0=died.
  • SexCode: 1=female, 0=male.

Part 1 (40 points). We have divided the dataset, which consists of observations from 1313 individuals, into a training and a test data set. Fit a logistic regression model to predict survival status on the training data set. Justify the predictor variables you use to predict survival status. When building your model, choose predictors which are significant at your chosen significance level (be sure to report your chosen value!). Your code should be appropriately commented with high-level statements about the code’s function.

Using your final model, predict the outcome on the test data set, and plot and discuss your results. You should have two final plots: a plot with two ROC curves, one for the training and one for the test data set, and a density plot that shows how the linear predictor separates the two survival outcomes in the test data. Your discussion should, at least, cover the differences and similarities in model performance on the training vs. test data (including AUC) as well as a clear interpretation of each plot. Please limit your discussion to a maximum of 15 sentences.

Part 2 (60 points). Think of one overarching question to ask about this data set. Then, think of two (and only two) analysis questions that can jointly provide an answer to your overarching question. The overarching question must be conceptual, and the two analysis questions can be either conceptual or procedural. You are welcome to use the training, test, or full data set for this part.

For each analysis question, perform an exploratory statistical analysis (PCA, logistic regression, linear model, ANOVA, etc.) with a corresponding figure. Discuss your findings, in particular how your analysis’ results reveal (or don’t reveal) an answer to your proposed questions. Then, write a concluding discussion of how your results reveal (or don’t reveal) an answer to your overarching question. Please limit each question’s discussion to a maximum of 10 sentences.

To receive full credit for Part 2, you will have to do the following:

  • Come up with one clear, overarching conceptual question about the data, as explained above.
  • Come up with two clear, coherent, conceptual or procedural analysis questions that will help you elaborate on the overarching question.
  • The work for at least one of the two analysis questions must be multivariate (involve more than two columns of the data set at once).
  • None of your questions may repeat any part of the analysis of Part 1.
  • For each analysis question, provide one clear and easily understandable plot answering the question.
  • For each plot, provide a justification for why you chose to make the type of plot that you made.
  • For each plot/question, provide an interpretation of your results and a response to your question.
  • Use different primary geoms for the different plots.

Project responses should be entered below.

Part 1

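# load packages: plotting, ROC geoms, colorblind-safe palette, data wrangling
library(ggplot2)
library(plotROC)
library(ggthemes)
library(dplyr)
# model to use: Survived ~ Sex + PClass + Age; my chosen significance level is 0.05
glm.out.train <- glm(Survived ~ Sex + PClass + Age, data = train_data, family = binomial)
summary(glm.out.train) # report coefficient estimates and p-values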
## Call:
## glm(formula = Survived ~ Sex + PClass + Age, family = binomial, 
##     data = train_data)
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7362  -0.6697  -0.3690   0.6343   2.5755  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.797810   0.444387   8.546  < 2e-16 ***
## Sexmale     -2.739763   0.228716 -11.979  < 2e-16 ***
## PClass2nd   -1.342208   0.292350  -4.591 4.41e-06 ***
## PClass3rd   -2.575275   0.314278  -8.194 2.52e-16 ***
## Age         -0.039165   0.008562  -4.574 4.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 821.80  on 606  degrees of freedom
## Residual deviance: 543.76  on 602  degrees of freedom
##   (443 observations deleted due to missingness)
## AIC: 553.76
## Number of Fisher Scoring iterations: 5
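
The estimates above are on the log-odds scale, which is hard to read directly. As a rough illustration, the sketch below exponentiates the reported coefficients to get odds ratios and computes two predicted probabilities. The coefficient values are copied from the summary output; the two example passengers are hypothetical.

```r
# coefficient estimates copied from the summary output above (log-odds scale)
coefs <- c("(Intercept)" = 3.797810,
           "Sexmale"     = -2.739763,
           "PClass2nd"   = -1.342208,
           "PClass3rd"   = -2.575275,
           "Age"         = -0.039165)

# odds ratios: multiplicative change in the odds of survival for each predictor
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 3)

# hypothetical passengers: a 30-year-old woman in 1st class, a 30-year-old man in 3rd class
eta_woman_1st <- coefs[["(Intercept)"]] + 30 * coefs[["Age"]]
eta_man_3rd <- eta_woman_1st + coefs[["Sexmale"]] + coefs[["PClass3rd"]]

# plogis() is the inverse logit, mapping the linear predictor to a probability
round(plogis(c(woman_1st = eta_woman_1st, man_3rd = eta_man_3rd)), 3)
```

All odds ratios are below 1, so being male, travelling in a lower class, and being older each lower the estimated odds of survival relative to the reference levels (female, 1st class).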
# results data frame for training data
df_train <- data.frame(predictor = predict(glm.out.train, train_data),
                       known_truth = train_data$Survived,
                       data_name = "training")

# results data frame for test data
df_test <- data.frame(predictor = predict(glm.out.train, test_data),
                      known_truth = test_data$Survived,
                      data_name = "test")

# combining data frames with results
df_combined <- rbind(df_train, df_test)

# plot ROC curves for train and test data
p <- ggplot(df_combined, aes(d = known_truth, m = predictor, color = data_name)) +
  geom_roc(n.cuts = 0) +
  scale_color_colorblind()
p # display the ROC curves (plot 1)

# calculate AUC values
model <- unique(df_combined$data_name)
model_info <- data.frame(model,
                         group = order(model))
left_join(model_info, calc_auc(p)) %>%
  select(-group, -PANEL)
## Joining, by = "group"
##      model       AUC
## 1 training 0.8624330
## 2     test 0.8176471
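
One way to read the AUC values above: the AUC is the probability that a randomly chosen survivor receives a higher linear predictor than a randomly chosen non-survivor. The sketch below computes it from ranks on simulated data (not the Titanic data). `auc_rank` is a hypothetical helper, not part of plotROC; it is meant to illustrate the interpretation, not to replace `calc_auc()`.

```r
# Sketch on simulated data: AUC as a rank (Mann-Whitney) statistic.
set.seed(1)
known_truth <- rbinom(200, size = 1, prob = 0.4)     # synthetic 0/1 outcomes
predictor <- rnorm(200, mean = 2 * known_truth - 1)  # synthetic linear predictor

# hypothetical helper: rank-based AUC, dropping missing predictor values
auc_rank <- function(truth, score) {
  keep <- !is.na(score)
  truth <- truth[keep]
  score <- score[keep]
  n1 <- sum(truth == 1)   # number of positives (survivors)
  n0 <- sum(truth == 0)   # number of negatives (non-survivors)
  r <- rank(score)        # average ranks, so ties count as 1/2
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_rank(known_truth, predictor)
```

On the project data, the same function could be applied to `df_test$known_truth` and `df_test$predictor`; rows with a missing Age have a missing predictor and would be dropped.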
# density plot of the linear predictor in the test data, filled by survival outcome
ggplot(df_test, aes(x = predictor, fill = factor(known_truth))) +
  geom_density(alpha = 0.5)
## Warning: Removed 114 rows containing non-finite values (stat_density).

My final model predicts survival status from the passenger’s sex, class, and age, all of which are significant at my chosen significance level of 0.05. I excluded Name because a passenger’s name is irrelevant to survival, and SexCode because it encodes the same information as Sex. In plot 1, the ROC curves for the two data sets are similar, with the curve for the training data lying above the curve for the test data. These curves show that the model fits both data sets well but describes the test data less well than the training data. The AUC values support this conclusion: the AUC is 0.86 for the training data and 0.82 for the test data, so the two values are similar but the test AUC is smaller. In plot 2, the densities of the linear predictor for the two survival outcomes are somewhat separated. This pattern suggests that the model distinguishes passengers who survived from passengers who died reasonably well. Thus, the model performs well and can assign survival status with fairly high confidence based on these predictors.
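
The claim that SexCode duplicates Sex can be checked with a cross-tabulation. The sketch below uses only the six rows shown in the data preview, so it illustrates the check rather than proving it for all passengers; `preview` is an illustrative name.

```r
# the six rows printed in the data preview (Sex and SexCode columns only)
preview <- data.frame(
  Sex     = c("female", "female", "male", "female", "male", "male"),
  SexCode = c(1, 1, 0, 1, 0, 0)
)

# cross-tabulate the two columns; redundancy shows up as one non-zero cell per row
sex_table <- table(preview$Sex, preview$SexCode)
sex_table

# on the full data set, the same check would be: table(Titanic$Sex, Titanic$SexCode)
```

If every row of the table has a single non-zero cell, each value of Sex maps to exactly one value of SexCode, and including both in the model would add no information.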

Part 2

Overarching conceptual question: Please write your overarching question here.

Analysis question 1: Please write your analysis question 1 here.

# R code for analysis question 1 goes here

Discussion for analysis question 1 goes here.

Analysis question 2: Please write your analysis question 2 here.

# R code for analysis question 2 goes here

Discussion for analysis question 2 goes here.

Concluding Discussion

Discussion for the overarching question goes here.