Enter your name and EID here
This homework is due on April 20, 2020 at 12:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."
Problem 1 (4 points): The complete genome of the virus SARS-CoV-2 can be accessed from the NCBI Entrez/PubMed website with the ID NC_045512. Using Biopython and Pubmed, download the GenBank record associated with SARS-CoV-2. Then, for each CDS in the record, print the locus tag and the name of the protein product associated with the gene at that locus.
from Bio import Entrez, SeqIO
Entrez.email = "rachaelcox@utexas.edu" # put your email here
# download sequence record for genbank id NC_045512
handle = Entrez.efetch(db="nucleotide", id="NC_045512", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
# loop over all features in the record
for feature in record.features:
    if feature.type == 'CDS':
        # extract locus tag and protein product info
        locus_tag = feature.qualifiers['locus_tag'][0]
        product = feature.qualifiers['product'][0]
        print(locus_tag + ": " + product)
Problem 2 (2 points): Frances Arnold is an American chemical engineer who won the 2018 Nobel Prize in Chemistry for using directed evolution to engineer enzymes. Using Biopython and the Pubmed database, calculate the average number of papers per year that Dr. Arnold published from 2015 to 2019 (inclusive, so that's 5 years total).
Hints: Dr. Arnold will always appear as "Arnold FH" in the Pubmed database. Also, make sure to set the retmax argument to at least 100 in Entrez.esearch() so that you retrieve all of the papers. See the Class 21 Worksheet as an additional resource for the syntax required to access these publications.
# you will need Entrez and Medline to solve this problem
from Bio import Entrez, Medline
Entrez.email = "rachaelcox@utexas.edu"
handle = Entrez.esearch(db="pubmed", # database to search
                        term="Arnold FH[Author] AND 2015[Date - Publication]:2019[Date - Publication]", # search term
                        retmax=100 # maximum number of results to return
                        )
record = Entrez.read(handle)
handle.close()
# search returns PubMed IDs (pmids)
pmid_list = record["IdList"]
print('total # of papers =', len(pmid_list))
# compute the average number of papers per year over the 5-year span
average = len(pmid_list)/5
print('average # of papers/year =', average)
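The retmax hint matters because the average is only correct if every matching paper was returned. A minimal sketch of a sanity check follows, assuming the parsed esearch result exposes the total number of hits under the "Count" key alongside "IdList". The helper name is_truncated and the two sample dictionaries are hypothetical and only stand in for a real search result.

```python
def is_truncated(search_result):
    """Return True if the search matched more IDs than were returned."""
    # "Count" is assumed to hold the total number of hits (as a string)
    total_hits = int(search_result["Count"])
    returned = len(search_result["IdList"])
    return total_hits > returned

# Hypothetical results, invented for illustration only
complete_result = {"Count": "3", "IdList": ["111", "222", "333"]}
cut_off_result = {"Count": "5", "IdList": ["111", "222", "333"]}

complete_flag = is_truncated(complete_result)
cut_off_flag = is_truncated(cut_off_result)
print(complete_flag, cut_off_flag)
```

If a check like this returns True for the real search result, raising retmax and repeating the search would be the natural next step.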
Problem 3 (4 points): From 2015-2019 (inclusive), how many of Dr. Arnold's papers contain the terms "evolution" or "evolutionary" in the abstract? Use python and regular expressions to find an answer.
Hint #1: In class 21, we parsed the results of a literature search with Medline.parse(). This allows us to look at the references we found and to retrieve different parts of the reference with a key. For example, to retrieve the abstract, we would write record['AB'].
Hint #2: In a regular expression, you can match the same word with slightly different endings using the "|" (or) operator. For example, the regex "bacteri(a|um)" would match both "bacteria" and "bacterium".
# you will need the module `re` for regular expressions to solve this problem
import re
from Bio import Entrez, Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
ab_count = 0 # start a counter
rec_count = 0 # start a counter for record number
for record in records:
    rec_count += 1
    # check if a record has an abstract
    if "AB" in record:
        # Check for the term "evolution" or "evolutionary" in the abstract
        match = re.search(r"evolution(\b|ary)", record["AB"].lower())
        # if "evolution" or "evolutionary" is in the abstract, increment the count by 1
        if match:
            ab_count += 1
            print("The term '{}' was found in record #{}.".format(match.group(), rec_count))
# close the efetch handle
handle.close()
print()
print('Thus, {} of the {} papers contain "evolution" or "evolutionary" in the abstract.'.format(ab_count, rec_count))
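The counting logic can also be checked offline, without querying PubMed. The sketch below assumes the parsed records behave like dictionaries, as the "AB" in record check above suggests. The function name count_abstract_matches and the sample records are hypothetical and were made up for illustration.

```python
import re

def count_abstract_matches(records, pattern=r"evolution(\b|ary)"):
    """Return (records whose abstract matches, total records seen)."""
    regex = re.compile(pattern)
    matched = 0
    total = 0
    for rec in records:
        total += 1
        # records without an abstract are treated as an empty string
        abstract = rec.get("AB", "")
        if regex.search(abstract.lower()):
            matched += 1
    return matched, total

# Hypothetical records, invented for illustration only
sample_records = [
    {"AB": "Directed evolution of an enzyme."},
    {"AB": "An Evolutionary approach to protein design."},
    {"AB": "A study of catalysis."},
    {"TI": "A record with no abstract"},
]

result = count_abstract_matches(sample_records)
print(result)
```

Note that the pattern has no leading word boundary, so a word such as "coevolution" would also count as a match. Prefixing the pattern with \b would be one way to exclude such cases if that is not desired.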