Enter your name and EID here
This homework is due on April 20, 2020 at 12:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."
Problem 1 (4 points): The complete genome of the virus SARS-CoV-2 can be accessed from the NCBI Entrez website with the ID NC_045512. Using Biopython and Entrez, download the GenBank record associated with SARS-CoV-2. Then, for each CDS in the record, print the locus tag and the name of the protein product associated with the gene at that locus.
# you will need Entrez and SeqIO to solve this problem
from Bio import Entrez, SeqIO
Entrez.email = "your.email@utexas.edu" # put your email here
# your code here
Problem 2 (2 points): Frances Arnold is an American chemical engineer who won the 2018 Nobel Prize in Chemistry for using directed evolution to engineer enzymes. Using Biopython and the PubMed database, calculate the average number of papers per year that Dr. Arnold published from 2015 to 2019 (inclusive, so that's 5 years total).
Hints: Dr. Arnold will always appear as "Arnold FH" in the PubMed database. Also, make sure to set the retmax argument to at least 100 in Entrez.esearch() so that you retrieve all of the papers. See the Class 21 Worksheet as an additional resource for the syntax required to access these publications.
# you will need Entrez and Medline to solve this problem
from Bio import Entrez, Medline
Entrez.email = "your.email@utexas.edu" # put your email here
# your code here
Problem 3 (4 points): From 2015 to 2019 (inclusive), how many of Dr. Arnold's papers contain the terms "evolution" or "evolutionary" in the abstract? Use Python and regular expressions to find an answer.
Hint #1: In Class 21, we parsed the results of a literature search with Medline.parse(). This allows us to look at the references we found and to retrieve different parts of each reference with a key. For example, to retrieve the abstract, we would write record['AB'].
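As a rough illustration of Hint #1, the sketch below parses two made-up records in MEDLINE text format from an in-memory string rather than from a live PubMed search. The PMIDs, titles, and abstract text are invented for the example. It also assumes that a record without an abstract simply lacks the 'AB' key, so it uses record.get('AB', '') instead of record['AB'] to avoid a KeyError.

```python
from io import StringIO
from Bio import Medline

# Two hypothetical records in MEDLINE text format.
# The second record deliberately has no abstract (no "AB" field).
medline_text = """PMID- 00000001
TI  - A hypothetical title.
AB  - This made-up abstract spans
      two lines.

PMID- 00000002
TI  - Another hypothetical title.
"""

# Medline.parse() yields one dictionary-like record per reference.
records = list(Medline.parse(StringIO(medline_text)))

# .get() returns an empty string when a record has no abstract.
abstracts = [record.get("AB", "") for record in records]

for record, abstract in zip(records, abstracts):
    print(record["PMID"].strip(), "->", abstract.strip())
```

With real search results, the same loop would run over the records returned by Medline.parse() on the handle from Entrez.efetch().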
Hint #2: In a regular expression, you can match the same word with slightly different endings using the "|" (or) operator. For example, the regex "bacteri(a|um)" would match both "bacteria" and "bacterium".
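The alternation in Hint #2 can be tried out on a few test strings before applying it to real abstracts. This is a minimal sketch using the hint's own pattern. The example sentences are made up for illustration.

```python
import re

# Alternation inside a group: matches "bacteria" or "bacterium".
pattern = re.compile(r"bacteri(a|um)")

# Hypothetical test sentences.
sentences = [
    "A single bacterium was isolated.",
    "The bacteria grew overnight.",
    "No microbes were detected.",
]

# re.search() returns a match object if the pattern occurs anywhere
# in the string, and None otherwise.
matches = [bool(pattern.search(sentence)) for sentence in sentences]
print(matches)  # [True, True, False]
```

Note that the match is case-sensitive by default, so "Bacteria" at the start of a sentence would not match unless the flag re.IGNORECASE is passed to re.compile().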
# you will need the module `re` for regular expressions to solve this problem
import re
Entrez.email = "your.email@utexas.edu" # put your email here
# your code here