Lab Worksheet 4

In my research, we are interested in protein interactions associated with the dynein proteins heatr2 and dnai2. In order to characterize these interactions, we engineered versions of heatr2 and dnai2 that are attached to a GFP tag. We put them in embryos, then break open cells, and can “pull down” the tagged proteins using an antibody that specifically binds to GFP. Finally, we run this mixture through a mass spectrometer to identify the proteins that are bound to GFP-heatr2 and GFP-dnai2.

# data frames with spectral counts from the mass spectrometer (MS) for each experiment:
heatr2_df <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_apms_heatr2.csv")
## Parsed with column specification:
## cols(
##   accession = col_character(),
##   ctrl_PSMs = col_double(),
##   exp_PSMs = col_double(),
##   PSM_fc = col_double(),
##   PSM_log2fc = col_double(),
##   PSM_zscore = col_double()
## )
dnai2_df <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_apms_dnai2.csv")
## Parsed with column specification:
## cols(
##   accession = col_character(),
##   ctrl_PSMs = col_double(),
##   exp_PSMs = col_double(),
##   PSM_fc = col_double(),
##   PSM_log2fc = col_double(),
##   PSM_zscore = col_double()
## )
# data frame with annotations for each protein in the frog proteome:
frog_annotations <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_annotations.csv")
## Parsed with column specification:
## cols(
##   ID = col_character(),
##   XENLA_XenBase_GeneNames = col_character(),
##   XENLA_GenBank_Description = col_character(),
##   HUMAN_UniProt_GeneNames = col_character(),
##   eggNOG_annotation = col_character()
## )

Problem 1: The experiment data frames are already sorted so the best hits are at the top. However, the protein identifiers in the column accession are essentially uninterpretable without annotations. Use left_join to join the annotation data frame to both experiments and save them into a new dataframe. What kind of proteins do we see at the top? What data was lost from the annotation data frame using the left_join() function?

HINT: The accession column in the experiment data frames is the same as the ID column in the annotation dataframe. You’ll need to make them the same using rename(), or specify that they are the same inside the left_join() function.

heatr2_annotated <- heatr2_df %>%
  rename(ID = accession) %>%
  left_join(frog_annotations, by = c("ID"))

heatr2_annotated
## # A tibble: 1,690 x 10
##    ID    ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore XENLA_XenBase_G…
##    <chr>     <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>           
##  1 ENOG…         0      274 241.         7.91      16.0  dnaaf5.L, dnaaf…
##  2 ENOG…        31       79   2.19       1.13       3.98 gcn1.S, gcn1.L  
##  3 ENOG…        17       54   2.68       1.42       3.94 ipo7.L          
##  4 ENOG…        17       49   2.43       1.28       3.49 ipo9.L, ipo9.S  
##  5 Xela…         6       28   3.63       1.86       3.49 ubr4.S          
##  6 ENOG…         0       10   9.64       3.27       3.05 myo5b.S, myo5c.…
##  7 ENOG…         5       22   3.36       1.75       3.01 hsph1.L, hsph1.…
##  8 Xela…         2       15   4.67       2.22       2.97 mdn1.L          
##  9 ENOG…         2       15   4.67       2.22       2.97 Xelaev18029563m…
## 10 ENOG…         6       23   3.00       1.59       2.88 pgm1.L, pgm5.S,…
## # … with 1,680 more rows, and 3 more variables:
## #   XENLA_GenBank_Description <chr>, HUMAN_UniProt_GeneNames <chr>,
## #   eggNOG_annotation <chr>
dnai2_annotated <- dnai2_df %>%
  rename(ID = accession) %>%
  left_join(frog_annotations, by = c("ID"))

dnai2_annotated
## # A tibble: 1,703 x 10
##    ID    ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore XENLA_XenBase_G…
##    <chr>     <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>           
##  1 ENOG…         0      202 200.        7.64       14.2  dnai2.S         
##  2 ENOG…        80      265   3.23      1.69        9.90 loc100494633.L,…
##  3 ENOG…         0       46  46.2       5.53        6.76 dnai1.L         
##  4 ENOG…         1       35  17.7       4.15        5.64 loc101730961.L,…
##  5 ENOG…         9       52   5.21      2.38        5.47 cps1.L, cad.L   
##  6 ENOG…        72      133   1.81      0.853       4.16 pc.S            
##  7 ENOG…       158      241   1.50      0.582       4.02 tubb.S, tubb2b.…
##  8 ENOG…        65      117   1.76      0.815       3.76 MGC97820.L      
##  9 ENOG…         1       16   8.36      3.06        3.62 oplah.L         
## 10 ENOG…        18       46   2.43      1.28        3.45 gfpt1.S, gfpt2.L
## # … with 1,693 more rows, and 3 more variables:
## #   XENLA_GenBank_Description <chr>, HUMAN_UniProt_GeneNames <chr>,
## #   eggNOG_annotation <chr>

We see the original bait proteins at the top of each data frame, which is great–exactly what we expect to be the most abundant. The left_join() function only joined on annotations for protein IDs present in the experimental dataframes, so any unobserved proteins were not included in the join.

Problem 2: Use anti-join() on both dataframes to see which proteins identifications are unique to each experiment. How many protein identifications are unique to heatr2? How many protein identifications are unique to dnai2? Finally, use inner_join() to see how many proteins the experiments have in common. How many proteins are found in both experiments?

dnai2_annotated %>% anti_join(heatr2_annotated, by = c("ID"))
## # A tibble: 424 x 10
##    ID    ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore XENLA_XenBase_G…
##    <chr>     <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>           
##  1 ENOG…         0      202 200.         7.64      14.2  dnai2.S         
##  2 ENOG…         0       46  46.2        5.53       6.76 dnai1.L         
##  3 ENOG…         1       35  17.7        4.15       5.64 loc101730961.L,…
##  4 ENOG…         1       16   8.36       3.06       3.62 oplah.L         
##  5 ENOG…         2       13   4.59       2.20       2.82 ranbp2.L, ranbp…
##  6 ENOG…         0        7   7.87       2.98       2.63 wdr78.S         
##  7 ENOG…         0        6   6.89       2.78       2.44 nme9.L          
##  8 ENOG…         0        5   5.90       2.56       2.23 tes.S           
##  9 ENOG…         0        5   5.90       2.56       2.23 irf2bpl.S       
## 10 ENOG…         0        5   5.90       2.56       2.23 arhgap35.L, arh…
## # … with 414 more rows, and 3 more variables:
## #   XENLA_GenBank_Description <chr>, HUMAN_UniProt_GeneNames <chr>,
## #   eggNOG_annotation <chr>
heatr2_annotated %>% anti_join(dnai2_annotated, by = c("ID"))
## # A tibble: 411 x 10
##    ID    ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore XENLA_XenBase_G…
##    <chr>     <dbl>    <dbl>  <dbl>      <dbl>      <dbl> <chr>           
##  1 ENOG…         0      274 241.         7.91      16.0  dnaaf5.L, dnaaf…
##  2 Xela…         2       15   4.67       2.22       2.97 mdn1.L          
##  3 ENOG…         0        4   4.38       2.13       1.93 pigs.L          
##  4 ENOG…         0        4   4.38       2.13       1.93 <NA>            
##  5 ENOG…         0        3   3.50       1.81       1.67 anp32c.L, anp32…
##  6 ENOG…         0        3   3.50       1.81       1.67 Xelaev18032090m…
##  7 ENOG…         0        3   3.50       1.81       1.67 sord.L, sord.S  
##  8 ENOG…         0        3   3.50       1.81       1.67 loc101732146.L,…
##  9 ENOG…         0        3   3.50       1.81       1.67 tmx3.S, tmx3.L  
## 10 ENOG…         0        3   3.50       1.81       1.67 ap3d1.L, ap3d1.S
## # … with 401 more rows, and 3 more variables:
## #   XENLA_GenBank_Description <chr>, HUMAN_UniProt_GeneNames <chr>,
## #   eggNOG_annotation <chr>
dnai2_annotated %>% inner_join(heatr2_annotated, by = c("ID"))
## # A tibble: 1,279 x 19
##    ID    ctrl_PSMs.x exp_PSMs.x PSM_fc.x PSM_log2fc.x PSM_zscore.x
##    <chr>       <dbl>      <dbl>    <dbl>        <dbl>        <dbl>
##  1 ENOG…          80        265     3.23        1.69          9.90
##  2 ENOG…           9         52     5.21        2.38          5.47
##  3 ENOG…          72        133     1.81        0.853         4.16
##  4 ENOG…         158        241     1.50        0.582         4.02
##  5 ENOG…          65        117     1.76        0.815         3.76
##  6 ENOG…          18         46     2.43        1.28          3.45
##  7 ENOG…          14         39     2.62        1.39          3.39
##  8 ENOG…           0         11    11.8         3.56          3.30
##  9 ENOG…           0         10    10.8         3.44          3.15
## 10 ENOG…           2         15     5.25        2.39          3.13
## # … with 1,269 more rows, and 13 more variables:
## #   XENLA_XenBase_GeneNames.x <chr>, XENLA_GenBank_Description.x <chr>,
## #   HUMAN_UniProt_GeneNames.x <chr>, eggNOG_annotation.x <chr>,
## #   ctrl_PSMs.y <dbl>, exp_PSMs.y <dbl>, PSM_fc.y <dbl>,
## #   PSM_log2fc.y <dbl>, PSM_zscore.y <dbl>,
## #   XENLA_XenBase_GeneNames.y <chr>, XENLA_GenBank_Description.y <chr>,
## #   HUMAN_UniProt_GeneNames.y <chr>, eggNOG_annotation.y <chr>

Using anti_join(), we can see there are are 424 proteins uniquely observed with dnai2 and 411 proteins uniquely observed with heatr2. Using inner_join(), we can see that two experiments have 1279 proteins in common.

Problem 3: I have already precomputed statistics in the experimental data frames in the column PSM_zscore. The z-score describes how many standard deviations each point is away from the population mean, i.e., the higher the better. We are only interested in proteins that are positively enriched, so for a one-tailed test a z-score of 1.65 corresponds to a p-value < 0.05 (we use a special z-score formula that ensures this is multiple-hypothesis corrected).

Use filter() on the column PSM_zscores in both dataframes so you only keep z-scores >= 1.65. Then, use bind_rows() to combine them. Before binding the rows, make sure to mutate() a new column in each data frame containing the experiment identifier, i.e., mutate(exp_id = "heatr2").

heatr2_filtered <- heatr2_annotated %>%
  filter(PSM_zscore >= 1.65) %>%
  mutate(exp_id = "heatr2")

dnai2_filtered <- dnai2_annotated %>%
  filter(PSM_zscore >= 1.65) %>%
  mutate(exp_id = "dnai2")

combined <- bind_rows(dnai2_filtered, heatr2_filtered)

Problem 4: Recall the ldeaths dataset from last week’s homework. This dataset is is untidy, and we are going to make it tidy with rownames_to_column() and pivot_longer(). Take a few minutes to read the documentation on these two functions.

ldeaths
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
## 1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
## 1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
## 1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
## 1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
## 1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
ldeaths_table <- read.table(text = "
Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
")

ldeaths_tidy <- ldeaths_table %>%
  rownames_to_column(var = "year") %>%
  pivot_longer(-year, names_to = "month")

ldeaths_tidy
## # A tibble: 72 x 3
##    year  month value
##    <chr> <chr> <int>
##  1 1974  Jan    3035
##  2 1974  Feb    2552
##  3 1974  Mar    2704
##  4 1974  Apr    2554
##  5 1974  May    2014
##  6 1974  Jun    1655
##  7 1974  Jul    1721
##  8 1974  Aug    1524
##  9 1974  Sep    1596
## 10 1974  Oct    2074
## # … with 62 more rows