Project 3

Enter your name and EID here

Instructions

After completing this Jupyter notebook, please convert it to pdf and submit both the pdf and the original notebook on Canvas no later than 12:00 pm on May 7, 2020. The two documents will be graded jointly, so they must be consistent (as in, don't change the Jupyter notebook without also updating the pdf!).

All results presented must have corresponding code. Any answer or result given without the Python code that generated it will be considered absent. All code reported in your final project document should work properly.

Before submitting the Jupyter notebook part, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

The project consists of two problems. For both problems, please follow these guidelines:

  • Final output needs to be nicely formatted and human readable. For example, if your result is a count, don't just print the value of the count, print "The count is: ...".
  • For each problem, limit your total code to fewer than 100 lines.
  • Write comments and explanatory text, so we understand what you are doing.
  • Do not print out large datasets, such as thousands of publications, an entire genome, or a list of all genes in a genome, etc.
  • Verify that nothing of importance (code, comments, other text) is cut off in your final pdf.
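
As an illustration of the formatting guideline above, here is a minimal sketch of labeled output. The variable name and the value are hypothetical placeholders, not results from any real search.

```python
# Hypothetical value standing in for a result computed earlier in the notebook.
pub_count = 1234

# Wrap the bare number in a descriptive sentence instead of printing it alone.
message = f"The count is: {pub_count}"
print(message)
```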

Part 1

(50 pts) Since the start of the SARS-CoV-2 pandemic, the number of academic publications concerning the virus has increased significantly. Using Python, answer the following questions:

  • (1) How many total publications have the term "SARS-CoV-2" in the title?
  • (2) How many publications were published per year with the term "SARS-CoV-2" in the title?
  • (3) Using the code you wrote for (1) and (2), perform the same search for "H1N1" in the publication title. What do you notice about the number of publications per year? Note: the same code may take longer to run for the H1N1 search because the number of publications is much larger.

Hints: Set retmax = 10000 to ensure you get all of the publications for your searches. In your search terms, do not restrict the publication date; instead, download all records that match the search term and extract the publication date from the records. You should end up with two PMID lists, one for SARS-CoV-2 and one for H1N1. The date of publication is stored in record['DP']. To extract the year from that full date, you'll need to use a regular expression. As an example, record['DP'] could be either "2017 Feb 12" or "Winter 2017"; your code should match 2017 in both cases.
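
The year-extraction hint can be sketched as follows. This is one possible approach, not the required solution. It assumes each record behaves like a dictionary with an optional 'DP' key, as the records yielded by Medline.parse do. The sample records, the helper extract_year, and the restriction to years beginning with 19 or 20 are illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: a publication year is a standalone four-digit number
# starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(dp):
    """Return the first four-digit year found in a 'DP' string, or None."""
    match = YEAR_PATTERN.search(dp)
    return match.group(0) if match else None


# Made-up stand-ins for parsed Medline records; real records would come
# from downloading and parsing the search results.
sample_records = [
    {"DP": "2017 Feb 12"},
    {"DP": "Winter 2017"},
    {"DP": "2009 Nov-Dec"},
    {},  # a record with no 'DP' field
]

# Tally publications per year, skipping records without a usable date.
years_counter = Counter()
for record in sample_records:
    year = extract_year(record.get("DP", ""))
    if year is not None:
        years_counter[year] += 1

# Report the counts in chronological order, one labeled line per year.
for year in sorted(years_counter):
    print(f"Publications in {year}: {years_counter[year]}")
```

Using record.get("DP", "") rather than record['DP'] is a defensive choice here: it lets the loop skip records that lack a date instead of raising a KeyError.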

Approach: Provide a brief description (1-2 paragraphs) of your strategy for answering the above questions.

In [1]:
# you will need the following libraries to answer these questions
from Bio import Entrez, Medline
import re
In [2]:
# your code goes here
In [3]:
# your code goes here
In [4]:
# your code goes here

Discussion: Provide a brief conclusion (1-2 paragraphs) explaining what you have learned about this question from your code.

Part 2

(50 pts)

Ask one bioinformatics question using the resources we've discussed in class (e.g., a literature search or a genomic query for your favorite organism). You may use the literature search you performed above, but query something different about the publication list. Then, write Python code to answer your question.

For full credit, the answer code must meet the following conditions:

  • contains at least one for loop
  • contains at least one if statement
  • uses at least one list or dictionary
  • uses at least one regular expression
  • visualizations are not required
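
To show what code meeting the conditions above might look like, here is a small sketch using a for loop, an if statement, a dictionary, and a regular expression. The titles and the keyword are invented for illustration. This is not a suggested project question.

```python
import re

# Made-up publication titles standing in for real search results.
titles = [
    "A vaccine candidate for H1N1 influenza",
    "Structural analysis of the spike protein",
    "Vaccination outcomes in a 2009 cohort",
]

# Case-insensitive pattern matching "vaccine", "vaccination", and similar words.
pattern = re.compile(r"vaccin", re.IGNORECASE)

# Dictionary holding the tally for each category.
counts = {"matching": 0, "other": 0}

for title in titles:
    if pattern.search(title):
        counts["matching"] += 1
    else:
        counts["other"] += 1

print(f"Titles mentioning vaccines: {counts['matching']}")
print(f"Other titles: {counts['other']}")
```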

Question: Your question goes here.

Approach: Provide a brief description (1-2 paragraphs) of your strategy for answering the above question.

In [5]:
# your code goes here
In [6]:
# your code goes here
In [7]:
# your code goes here

Discussion: Provide a brief conclusion (1-2 paragraphs) explaining what you have learned about your question from your code.