In my research, we are interested in protein interactions associated with the dynein proteins heatr2 and dnai2. To characterize these interactions, we engineered versions of heatr2 and dnai2 that are attached to a GFP tag. We put them in embryos, break open the cells, and “pull down” the tagged proteins using an antibody that specifically binds to GFP. Finally, we run this mixture through a mass spectrometer to identify the proteins that are bound to GFP-heatr2 and GFP-dnai2.
library(tidyverse) # provides read_csv() and the dplyr/tidyr verbs used in the problems
# data frames with spectral counts from the mass spectrometer (MS) for each experiment:
heatr2_df <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_apms_heatr2.csv")
## Parsed with column specification:
## cols(
## accession = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double()
## )
head(heatr2_df)
## # A tibble: 6 x 6
## accession ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENOG411CUGY 0 274 241. 7.91 16.0
## 2 ENOG411CWA4 31 79 2.19 1.13 3.98
## 3 ENOG411CTH9 17 54 2.68 1.42 3.94
## 4 ENOG411CSRQ 17 49 2.43 1.28 3.49
## 5 Xelaev18037980m.g 6 28 3.63 1.86 3.49
## 6 ENOG411CTE0 0 10 9.64 3.27 3.05
dnai2_df <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_apms_dnai2.csv")
## Parsed with column specification:
## cols(
## accession = col_character(),
## ctrl_PSMs = col_double(),
## exp_PSMs = col_double(),
## PSM_fc = col_double(),
## PSM_log2fc = col_double(),
## PSM_zscore = col_double()
## )
head(dnai2_df)
## # A tibble: 6 x 6
## accession ctrl_PSMs exp_PSMs PSM_fc PSM_log2fc PSM_zscore
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENOG411DGHA 0 202 200. 7.64 14.2
## 2 ENOG411CW1E 80 265 3.23 1.69 9.90
## 3 ENOG411DGH7 0 46 46.2 5.53 6.76
## 4 ENOG411CT94 1 35 17.7 4.15 5.64
## 5 ENOG411CQ77 9 52 5.21 2.38 5.47
## 6 ENOG411DGHH 72 133 1.81 0.853 4.16
# data frame with annotations for each protein in the frog proteome:
frog_annotations <- read_csv("http://wilkelab.org/classes/SDS348/data_sets/frog_annotations.csv")
## Parsed with column specification:
## cols(
## ID = col_character(),
## XENLA_XenBase_GeneNames = col_character(),
## XENLA_GenBank_Description = col_character(),
## HUMAN_UniProt_GeneNames = col_character(),
## eggNOG_annotation = col_character()
## )
head(frog_annotations)
## # A tibble: 6 x 5
## ID XENLA_XenBase_G… XENLA_GenBank_D… HUMAN_UniProt_G… eggNOG_annotati…
## <chr> <chr> <chr> <chr> <chr>
## 1 ENOG… rhpn1.L, rhpn2.… rhophilin-2-A [… RHPN2P1, RHPN2,… Rhophilin, Rho …
## 2 ENOG… arl14epl.L, arl… <NA> ARL14EPL ADP-ribosylatio…
## 3 ENOG… myh1-2-1.S, myh… myosin, heavy p… MYH3, MYH7B myosin, heavy c…
## 4 ENOG… loc101733025.L,… <NA> SHISAL1 kiaa1644
## 5 ENOG… epha3.L, epha4.… ephrin type-A r… EPHA5, EPHA3, E… Eph receptor
## 6 ENOG… txndc11.L <NA> TXNDC11 thioredoxin dom…
Problem 1: The experiment data frames are already sorted so the best hits are at the top. However, the protein identifiers in the column accession are essentially uninterpretable without annotations. Use left_join() to join the annotation data frame to both experiments and save the results in new data frames. What kind of proteins do we see at the top? HINT: The accession column in the experiment data frames is the same as the ID column in the annotation data frame.
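As a minimal sketch of the technique (not the solution), the example below joins two toy tibbles whose key columns have different names, mirroring accession and ID. The tibbles toy_hits and toy_annotations and all of their values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for an experiment table and an annotation table
toy_hits <- tibble(
  accession = c("P1", "P2", "P3"),
  PSM_zscore = c(9.1, 4.2, 2.0)
)
toy_annotations <- tibble(
  ID = c("P1", "P2", "P4"),
  description = c("dynein chain", "chaperone", "kinase")
)

# A named vector in `by` matches a left-hand column to a differently named
# right-hand column: "accession" in toy_hits against "ID" in toy_annotations
toy_annotated <- left_join(toy_hits, toy_annotations, by = c("accession" = "ID"))
toy_annotated
```

left_join() keeps every row of the left table in its original order, so the sorting by best hit is preserved; identifiers without a matching annotation (P3 here) get NA in the annotation columns.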
# Your R code goes here
Your answer goes here.
Problem 2: Use anti_join() on both data frames to see which protein identifications are unique to each experiment. How many protein identifications are unique to heatr2? How many protein identifications are unique to dnai2? Finally, use inner_join() to see how many proteins the experiments have in common. How many proteins are found in both experiments?
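The sketch below shows how the two verbs behave on toy data; it is an illustration under assumed inputs, not the answer for the real experiments. The tibbles toy_a and toy_b and their values are hypothetical.

```r
library(dplyr)

# Hypothetical experiment tables that share some accessions
toy_a <- tibble(accession = c("P1", "P2", "P3"), exp_PSMs = c(10, 5, 2))
toy_b <- tibble(accession = c("P2", "P3", "P4", "P5"), exp_PSMs = c(7, 1, 9, 3))

# anti_join() keeps rows of the first table with no match in the second
only_a <- anti_join(toy_a, toy_b, by = "accession")
only_b <- anti_join(toy_b, toy_a, by = "accession")

# inner_join() keeps only accessions present in both tables; `suffix`
# disambiguates the non-key columns that share a name
shared <- inner_join(toy_a, toy_b, by = "accession", suffix = c("_a", "_b"))

nrow(only_a)
nrow(only_b)
nrow(shared)
```

Note that anti_join() is not symmetric: the order of the two arguments decides which experiment's unique rows are returned, so it has to be run once in each direction.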
# Your R code goes here
Your answer goes here.
Problem 3: I have already precomputed statistics in the experimental data frames in the column PSM_zscore. The z-score describes how many standard deviations each point is away from the population mean, i.e., the higher the better. We are only interested in proteins that are positively enriched, so we use a one-tailed test, for which a z-score of 1.65 corresponds to a p-value < 0.05 (we use a special z-score formula that ensures this is multiple-hypothesis corrected).
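To see where the 1.65 cutoff comes from, the upper-tail probability of a standard normal distribution can be checked with base R. This is only the plain one-tailed calculation; it assumes a standard normal reference distribution and does not reproduce the multiple-hypothesis-corrected z-score formula mentioned above, which is not given here.

```r
# Upper-tail (one-tailed) probability of observing z >= 1.65
# under a standard normal distribution
z_cutoff <- 1.65
p_one_tailed <- pnorm(z_cutoff, lower.tail = FALSE)
p_one_tailed
```

The result is just under 0.05, which is why 1.65 serves as the threshold for a one-tailed test at that level.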
Use filter() on the column PSM_zscore in both data frames so you only keep z-scores >= 1.65. Then, use bind_rows() to combine them. Before binding the rows, make sure to mutate() a new column in each data frame containing the experiment identifier, e.g., mutate(exp_id = "heatr2").
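The filter, label, and stack pattern can be sketched on toy data as follows. The tibbles toy_a and toy_b, their values, and the labels "toy_a" and "toy_b" are hypothetical; only the column name PSM_zscore and the 1.65 cutoff come from the problem.

```r
library(dplyr)

# Hypothetical experiment tables with precomputed z-scores
toy_a <- tibble(accession = c("P1", "P2", "P3"), PSM_zscore = c(5.0, 1.65, 0.4))
toy_b <- tibble(accession = c("P2", "P4"), PSM_zscore = c(2.3, -1.0))

# Keep rows at or above the cutoff, then tag each row with its experiment
toy_a_sig <- toy_a %>%
  filter(PSM_zscore >= 1.65) %>%
  mutate(exp_id = "toy_a")
toy_b_sig <- toy_b %>%
  filter(PSM_zscore >= 1.65) %>%
  mutate(exp_id = "toy_b")

# Stack the two filtered tables; columns are matched by name
toy_combined <- bind_rows(toy_a_sig, toy_b_sig)
toy_combined
```

Adding exp_id before binding matters because the same accession (P2 here) can appear in both experiments, and after stacking the label is the only way to tell the rows apart.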
# Your R code goes here
Problem 4: Recall the ldeaths dataset from last week’s homework. This dataset is untidy, and we are going to make it tidy with rownames_to_column() and pivot_longer(). Take a few minutes to read the documentation on these two functions.
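As a rough illustration of how the two functions fit together, the sketch below tidies a small made-up table with the same shape as the one in this problem (years as row names, months as columns). The object names toy_wide, toy_named, and toy_long, the column names year, month, and count, and all values are assumptions chosen for the example.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: the header has one fewer field than the data rows,
# so read.table() uses the first field of each row as the row name
toy_wide <- read.table(text = "
Jan Feb Mar
2001 10 20 30
2002 40 50 60
")

# Step 1: move the row names into an ordinary column
toy_named <- rownames_to_column(toy_wide, var = "year")

# Step 2: gather every column except `year` into month/count pairs
toy_long <- pivot_longer(toy_named, cols = -year,
                         names_to = "month", values_to = "count")
toy_long
```

The tidy result has one row per year and month combination (six rows here), with the former column headers stored as values in the month column.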
ldeaths
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
## 1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
## 1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
## 1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
## 1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
## 1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
ldeaths_table <- read.table(text = "
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
")
# Your R code goes here