Lab Worksheet 4

In my research, we are interested in protein interactions associated with the dynein proteins heatr2 and dnai2. To characterize these interactions, we engineered versions of heatr2 and dnai2 that are attached to a GFP tag. We put them in embryos, break open the cells, and “pull down” the tagged proteins using an antibody that specifically binds to GFP. Finally, we run this mixture through a mass spectrometer to identify the proteins that are bound to GFP-heatr2 and GFP-dnai2.

library(tidyverse)  # provides read_csv(), the dplyr join functions, and pivot_longer()

# data frames with spectral counts from the mass spectrometer (MS) for each experiment:
heatr2_df <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_apms_heatr2.csv")
## Parsed with column specification:
## cols(
##   accession = col_character(),
##   ctrl_PSMs = col_double(),
##   exp_PSMs = col_double(),
##   PSM_fc = col_double(),
##   PSM_log2fc = col_double(),
##   PSM_zscore = col_double()
## )
head(heatr2_df)
## # A tibble: 6 x 6
##   accession         ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>                 <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 ENOG411CUGY               0      274 241.         7.91      16.0 
## 2 ENOG411CWA4              31       79   2.19       1.13       3.98
## 3 ENOG411CTH9              17       54   2.68       1.42       3.94
## 4 ENOG411CSRQ              17       49   2.43       1.28       3.49
## 5 Xelaev18037980m.g         6       28   3.63       1.86       3.49
## 6 ENOG411CTE0               0       10   9.64       3.27       3.05
dnai2_df <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_apms_dnai2.csv")
## Parsed with column specification:
## cols(
##   accession = col_character(),
##   ctrl_PSMs = col_double(),
##   exp_PSMs = col_double(),
##   PSM_fc = col_double(),
##   PSM_log2fc = col_double(),
##   PSM_zscore = col_double()
## )
head(dnai2_df)
## # A tibble: 6 x 6
##   accession   ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
##   <chr>           <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
## 1 ENOG411DGHA         0      202 200.        7.64       14.2 
## 2 ENOG411CW1E        80      265   3.23      1.69        9.90
## 3 ENOG411DGH7         0       46  46.2       5.53        6.76
## 4 ENOG411CT94         1       35  17.7       4.15        5.64
## 5 ENOG411CQ77         9       52   5.21      2.38        5.47
## 6 ENOG411DGHH        72      133   1.81      0.853       4.16
# data frame with annotations for each protein in the frog proteome:
frog_annotations <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_annotations.csv")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   XENLA_XenBase_GeneNames = col_character(),
##   XENLA_GenBank_Description = col_character(),
##   HUMAN_UniProt_GeneNames = col_character(),
##   eggNOG_annotation = col_character()
## )
head(frog_annotations)
## # A tibble: 6 x 5
##   ID    XENLA_XenBase_G… XENLA_GenBank_D… HUMAN_UniProt_G… eggNOG_annotati…
##   <chr> <chr>            <chr>            <chr>            <chr>           
## 1 ENOG… rhpn1.L, rhpn2.… rhophilin-2-A [… RHPN2P1, RHPN2,… Rhophilin, Rho …
## 2 ENOG… arl14epl.L, arl… <NA>             ARL14EPL         ADP-ribosylatio…
## 3 ENOG… myh1-2-1.S, myh… myosin, heavy p… MYH3, MYH7B      myosin, heavy c…
## 4 ENOG… loc101733025.L,… <NA>             SHISAL1          kiaa1644        
## 5 ENOG… epha3.L, epha4.… ephrin type-A r… EPHA5, EPHA3, E… Eph receptor    
## 6 ENOG… txndc11.L        <NA>             TXNDC11          thioredoxin dom…

Problem 1: The experiment data frames are already sorted so the best hits are at the top. However, the protein identifiers in the column accession are essentially uninterpretable without annotations. Use left_join() to join the annotation data frame to each experiment data frame, and save each result in a new data frame. What kinds of proteins do we see at the top? HINT: The accession column in the experiment data frames holds the same identifiers as the ID column in the annotation data frame.
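
As a warm-up for the hint above, here is a minimal sketch of a left join where the key column has a different name in each table. The data frames `hits` and `annotations`, their values, and the `gene` column are hypothetical toy data, not taken from the worksheet files.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (not from the real experiment files)
hits <- tibble(accession = c("P1", "P2", "P3"),
               exp_PSMs  = c(40, 12, 7))
annotations <- tibble(ID   = c("P1", "P2", "P4"),
                      gene = c("geneA", "geneB", "geneD"))

# Keep every row of `hits`; match `accession` (left) against `ID` (right)
annotated <- left_join(hits, annotations, by = c("accession" = "ID"))
annotated
```

Rows of the left table without a match (here P3) are kept, and their annotation columns are filled with NA.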

# Your R code goes here

Your answer goes here.

Problem 2: Use anti_join() on both data frames to see which protein identifications are unique to each experiment. How many protein identifications are unique to heatr2? How many are unique to dnai2? Finally, use inner_join() to find the proteins the experiments have in common. How many proteins are found in both experiments?
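
The difference between the two joins can be seen on a small example first. This is only a sketch: `exp_a`, `exp_b`, and their values are hypothetical toy data, and the `suffix` argument is used just to keep the two count columns apart.

```r
library(dplyr)
library(tibble)

# Hypothetical toy experiments sharing the key column `accession`
exp_a <- tibble(accession = c("P1", "P2", "P3"),
                exp_PSMs  = c(40, 12, 7))
exp_b <- tibble(accession = c("P2", "P3", "P4", "P5"),
                exp_PSMs  = c(9, 30, 5, 2))

# anti_join() keeps rows of the first table with no match in the second
only_a <- anti_join(exp_a, exp_b, by = "accession")
only_b <- anti_join(exp_b, exp_a, by = "accession")

# inner_join() keeps only rows whose key appears in both tables
shared <- inner_join(exp_a, exp_b, by = "accession", suffix = c("_a", "_b"))

nrow(only_a)  # 1 (P1)
nrow(only_b)  # 2 (P4, P5)
nrow(shared)  # 2 (P2, P3)
```

Note that anti_join() is not symmetric: swapping the two arguments answers a different question.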

# Your R code goes here

Your answer goes here.

Problem 3: I have already precomputed statistics for each experiment data frame and stored them in the column PSM_zscore. The z-score describes how many standard deviations each point lies from the population mean, so a higher z-score means stronger enrichment. We are only interested in proteins that are positively enriched, so we use a one-tailed test, for which a z-score of 1.65 corresponds to a p-value < 0.05 (we use a special z-score formula that ensures this is multiple-hypothesis corrected).
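
The link between the 1.65 cutoff and p < 0.05 can be checked with the standard normal distribution. This sketch assumes a plain standard normal and does not reproduce the special multiple-hypothesis-corrected z-score formula mentioned above; the variable names are illustrative.

```r
# One-tailed (upper-tail) p-value for a z-score of 1.65,
# assuming a standard normal distribution
z_cutoff <- 1.65
p_one_tailed <- pnorm(z_cutoff, lower.tail = FALSE)
p_one_tailed  # slightly below 0.05

# Going the other way: the z-score that leaves 5% in the upper tail
z_for_p05 <- qnorm(0.05, lower.tail = FALSE)
z_for_p05     # slightly below 1.65
```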

Use filter() on the column PSM_zscore in both data frames so you only keep z-scores >= 1.65. Then, use bind_rows() to combine them. Before binding the rows, make sure to mutate() a new column in each data frame containing the experiment identifier, e.g., mutate(exp_id = "heatr2").
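
The filter, label, and stack pattern described above looks like this on toy data. The data frames `exp_a` and `exp_b`, their values, and the labels "exp_a" and "exp_b" are hypothetical and only illustrate the order of the steps.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with the same z-score column name as the worksheet
exp_a <- tibble(accession  = c("P1", "P2", "P3"),
                PSM_zscore = c(5.2, 1.65, 0.4))
exp_b <- tibble(accession  = c("P2", "P4"),
                PSM_zscore = c(2.1, -0.3))

# Keep significant rows, then label each table before stacking
sig_a <- exp_a %>%
  filter(PSM_zscore >= 1.65) %>%
  mutate(exp_id = "exp_a")
sig_b <- exp_b %>%
  filter(PSM_zscore >= 1.65) %>%
  mutate(exp_id = "exp_b")

# bind_rows() stacks the tables and matches columns by name
combined <- bind_rows(sig_a, sig_b)
combined
```

Adding exp_id before binding matters because a protein such as P2 can appear in both experiments, and without the label the two rows could not be told apart.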

# Your R code goes here

Problem 4: Recall the ldeaths dataset from last week’s homework. This dataset is untidy, and we are going to make it tidy with rownames_to_column() and pivot_longer(). Take a few minutes to read the documentation on these two functions.

ldeaths
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
## 1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
## 1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
## 1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
## 1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
## 1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
ldeaths_table <- read.table(text = "
Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
")
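
Before working on the full table, the two functions named in Problem 4 can be tried on a much smaller one. This is a sketch on hypothetical toy data: the data frame `wide`, its values, and the output column names `year`, `month`, and `deaths` are illustrative choices, not prescribed by the worksheet.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical wide table: years as row names, one column per month
wide <- data.frame(Jan = c(10, 12),
                   Feb = c(20, 18),
                   row.names = c("2001", "2002"))

long <- wide %>%
  rownames_to_column(var = "year") %>%   # row names become a real column
  pivot_longer(cols = -year,             # gather every column except year
               names_to = "month",
               values_to = "deaths")
long
```

Each row of the result holds one observation (one year and month combination), which is what makes the table tidy.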

# Your R code goes here