Homework 9 Solutions

Enter your name and EID here

This homework is due on April 20, 2020 at 12:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (4 points): The complete genome of the virus SARS-CoV-2 is available from the NCBI Nucleotide database under the accession number NC_045512. Using Biopython and Entrez, download the GenBank record associated with SARS-CoV-2. Then, for each CDS in the record, print the locus tag and the name of the protein product associated with the gene at that locus.

In [1]:
from Bio import Entrez, SeqIO
Entrez.email = "rachaelcox@utexas.edu" # put your email here

# download sequence record for genbank id NC_045512
handle = Entrez.efetch(db="nucleotide", id="NC_045512", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# loop over all features in the record
for feature in record.features:
    if feature.type == 'CDS':
        # extract locus tag and protein product info
        locus_tag = feature.qualifiers['locus_tag'][0]
        product = feature.qualifiers['product'][0]
        print(locus_tag + ": " + product)
GU280_gp01: ORF1ab polyprotein
GU280_gp01: ORF1a polyprotein
GU280_gp02: surface glycoprotein
GU280_gp03: ORF3a protein
GU280_gp04: envelope protein
GU280_gp05: membrane glycoprotein
GU280_gp06: ORF6 protein
GU280_gp07: ORF7a protein
GU280_gp08: ORF7b
GU280_gp09: ORF8 protein
GU280_gp10: nucleocapsid phosphoprotein
GU280_gp11: ORF10 protein
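
The solution indexes each qualifier with [0] because Biopython stores every qualifier value as a list of strings. A feature that lacks one of the keys would raise a KeyError, so a more defensive lookup can be sketched as follows. This is only an illustration: the plain dictionaries stand in for feature.qualifiers, the helper name describe is hypothetical, and the second entry is a made-up case with no 'product' qualifier.

```python
# Hypothetical stand-ins for feature.qualifiers: each qualifier maps to a
# list of strings, which is why the solution indexes with [0].
cds_qualifiers = [
    {"locus_tag": ["GU280_gp02"], "product": ["surface glycoprotein"]},
    {"locus_tag": ["GU280_gp04"]},  # made-up case: no 'product' qualifier
]

def describe(qualifiers):
    """Return 'locus_tag: product', with placeholders for missing qualifiers."""
    locus_tag = qualifiers.get("locus_tag", ["unknown locus"])[0]
    product = qualifiers.get("product", ["unknown product"])[0]
    return f"{locus_tag}: {product}"

for q in cds_qualifiers:
    print(describe(q))
```

For the SARS-CoV-2 record above, every CDS carries both qualifiers, so the direct indexing in the solution works as written.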

Problem 2 (2 points): Frances Arnold is an American chemical engineer who won the 2018 Nobel Prize in Chemistry for using directed evolution to engineer enzymes. Using Biopython and the PubMed database, calculate the average number of papers per year that Dr. Arnold published from 2015 to 2019 (inclusive, so that's 5 years total).

Hints: Dr. Arnold will always appear as "Arnold FH" in the PubMed database. Also, make sure to set the retmax argument to at least 100 in Entrez.esearch() so that you retrieve all of the papers. See the Class 21 Worksheet as an additional resource for the syntax required to access these publications.

In [2]:
# you will need Entrez and Medline to solve this problem
from Bio import Entrez, Medline

Entrez.email = "rachaelcox@utexas.edu"

handle = Entrez.esearch(db="pubmed",  # database to search
                        term="Arnold FH[Author] AND 2015[Date - Publication]:2019[Date - Publication]",  # search term
                        retmax=100 # maximum number of results to return
                        )
record = Entrez.read(handle)
handle.close()

# search returns PubMed IDs (pmids)
pmid_list = record["IdList"]
print('total # of papers =', len(pmid_list))

# compute the average number of papers per year over the 5-year window (2015-2019)
average = len(pmid_list)/5

print('average # of papers/year =', average)
total # of papers = 62
average # of papers/year = 12.4
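
The hint about retmax matters because esearch silently truncates the ID list when there are more hits than retmax allows. One way to check for truncation is to compare the length of "IdList" with the "Count" field of the parsed result, which reports the total number of hits as a string-like value. The sketch below assumes that shape; the helper name retrieved_all is hypothetical, and the two dictionaries are made-up stand-ins for the output of Entrez.read(), with placeholder IDs.

```python
def retrieved_all(search_result):
    """Return True when the number of returned IDs equals the total hit count."""
    total_hits = int(search_result["Count"])  # "Count" arrives as a string
    return len(search_result["IdList"]) == total_hits

# Made-up results shaped like the output of Entrez.read(); IDs are placeholders
complete = {"Count": "3", "IdList": ["id1", "id2", "id3"]}
truncated = {"Count": "5", "IdList": ["id1", "id2", "id3"]}

print(retrieved_all(complete))   # all hits were returned
print(retrieved_all(truncated))  # retmax was too small
```

In the solution above, the same check would be retrieved_all(record) right after Entrez.read(handle). Since 62 papers were returned with retmax=100, the list was presumably complete.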

Problem 3 (4 points): From 2015-2019 (inclusive), how many of Dr. Arnold's papers contain the terms "evolution" or "evolutionary" in the abstract? Use Python and regular expressions to find an answer.

Hint #1: In class 21, we parsed the results of a literature search with Medline.parse(). This allows us to look at the references we found and to retrieve different parts of the reference with a key. For example, to retrieve the abstract, we would write record['AB'].

Hint #2: In a regular expression, you can match the same word with slightly different endings using the "|" (or) operator. For example, the regex "bacteri(a|um)" would match both "bacteria" and "bacterium".
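
The alternation described in Hint #2 can be tried out on a few words. This is a minimal sketch; the word list is made up for illustration.

```python
import re

# "(a|um)" matches either ending after the shared stem "bacteri"
pattern = re.compile(r"bacteri(a|um)")

for word in ["bacteria", "bacterium", "bacterial", "bacterio"]:
    m = pattern.search(word)
    print(word, "->", m.group() if m else None)
```

Note that re.search() looks for the pattern anywhere in the string, so "bacterial" still matches through its first eight letters, while "bacterio" does not match at all.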

In [3]:
# you will need the module `re` for regular expressions to solve this problem
import re
from Bio import Entrez, Medline

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

ab_count = 0 # start a counter
rec_count = 0 # start a counter for record number

for record in records:
    
    rec_count += 1
    
    # check if a record has an abstract
    if "AB" in record:
        
        # Check for the term "evolution" or "evolutionary" in the abstract 
        match = re.search(r"evolution(\b|ary)", record["AB"].lower())
        
        # if "evolution" or "evolutionary" is in the abstract, increment the count by 1 
        if match: 
            ab_count += 1
            print("The term '{}' was found in record #{}.".format(match.group(), rec_count))

# close the efetch handle    
handle.close()

print()
print('Thus, {} of the {} abstracts contain "evolution" or "evolutionary."'.format(ab_count, rec_count))
The term 'evolutionary' was found in record #1.
The term 'evolution' was found in record #2.
The term 'evolution' was found in record #3.
The term 'evolution' was found in record #5.
The term 'evolution' was found in record #7.
The term 'evolution' was found in record #8.
The term 'evolution' was found in record #9.
The term 'evolution' was found in record #11.
The term 'evolution' was found in record #12.
The term 'evolution' was found in record #13.
The term 'evolution' was found in record #15.
The term 'evolution' was found in record #16.
The term 'evolution' was found in record #22.
The term 'evolution' was found in record #24.
The term 'evolution' was found in record #26.
The term 'evolution' was found in record #27.
The term 'evolution' was found in record #29.
The term 'evolution' was found in record #31.
The term 'evolution' was found in record #33.
The term 'evolution' was found in record #36.
The term 'evolution' was found in record #37.
The term 'evolution' was found in record #40.
The term 'evolution' was found in record #42.
The term 'evolution' was found in record #43.
The term 'evolution' was found in record #47.
The term 'evolutionary' was found in record #51.
The term 'evolution' was found in record #52.
The term 'evolution' was found in record #53.
The term 'evolution' was found in record #55.
The term 'evolution' was found in record #56.
The term 'evolution' was found in record #59.
The term 'evolutionary' was found in record #61.
The term 'evolution' was found in record #62.

Thus, 33 of the 62 abstracts contain "evolution" or "evolutionary."