Homework 9 Solutions

Enter your name and EID here

This homework is due on April 10, 2018 at 7:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (2 points): Using Biopython and the Pubmed database, calculate the average number of papers per year that Dr. Wilke published from 2013 to 2017 (inclusive, so that's 5 years total).

Hints: Dr. Wilke will always appear as "Wilke CO" in the Pubmed database. Also, make sure to set the retmax argument to at least 60 in Entrez.esearch() so that you retrieve all of the papers.
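The arithmetic behind this problem can be sketched without touching the network. The snippet below is only an illustration: `years_inclusive`, `is_truncated`, and `example_result` are hypothetical names, and the dictionary is a made-up stand-in for what `Entrez.read()` returns for a search (a "Count" of all matches plus the returned "IdList"). Comparing the two is one way to check that retmax was large enough.

```python
def years_inclusive(first_year, last_year):
    """Number of calendar years in an inclusive range."""
    return last_year - first_year + 1


def is_truncated(search_result):
    """True if the search matched more records than were returned."""
    return int(search_result["Count"]) > len(search_result["IdList"])


# Hypothetical stand-in for the dictionary returned by Entrez.read();
# the IDs and count are made up for illustration
example_result = {"Count": "3", "IdList": ["111", "222", "333"]}

n_years = years_inclusive(2013, 2017)
average_per_year = len(example_result["IdList"]) / n_years
print(n_years, average_per_year, is_truncated(example_result))
```

If `is_truncated` returned True for a real search result, the ID list would be incomplete and the average would be too low, so retmax would need to be raised.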

In [1]:
# You will need Entrez and Medline to solve this problem
from Bio import Entrez, Medline

Entrez.email = "dariya.k.sydykova@gmail.com"

handle = Entrez.esearch(db="pubmed",  # database to search
                        term="Wilke CO[Author] AND 2013:2017[Date - Publication]",  # search term
                        retmax=60)  # maximum number of results to return
record = Entrez.read(handle)
handle.close()

# search returns PubMed IDs (pmids)
pmid_list = record["IdList"]

# Average papers per year: number of PMIDs divided by the 5 years in the range
average = len(pmid_list)/5
print(average)


Problem 2 (4 points): From the years 2013-2017 (inclusive), in which journals did Dr. Wilke's papers appear and how many times in each journal did his papers appear? Print out each journal and the number of times a paper appeared in that journal. Make sure you don't print the same journal name twice.

Hint: In class 21, we parsed the results of a literature search with Medline.parse(). This allows us to look at the references we found and to retrieve different parts of the reference with a key. For example, to retrieve the abstract, we would write record['AB']. You can find a list of possible keys here.
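As a minimal sketch of key-based access and counting, the snippet below uses plain dictionaries as stand-ins for parsed Medline records. The records, journal names, and titles are made up, and `collections.Counter` is shown as one possible way to tally values, not as the required approach.

```python
from collections import Counter

# Hypothetical records: plain dictionaries standing in for Medline records,
# with made-up journal names ('JT') and titles ('TI')
example_records = [
    {"JT": "Journal A", "TI": "First made-up title"},
    {"JT": "Journal B", "TI": "Second made-up title"},
    {"JT": "Journal A", "TI": "Third made-up title"},
]

# Look up the journal title of each record by its key and tally the values
journal_counts = Counter(record["JT"] for record in example_records)

# Each journal appears once as a key, so no name is printed twice
for journal, count in journal_counts.items():
    print(journal + ":", str(count) + "x")
```

A Counter handles the "first time seen" case automatically, so no explicit check for whether the key already exists is needed.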

In [2]:
# Your code goes here
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

# Create an empty dictionary to keep journal names and counts
journal_dict = {}
for record in records:
    # retrieve a journal name 
    title = record['JT']
    # check if journal name is in the dictionary
    if title in journal_dict:
        journal_dict[title] += 1 # increment the count of journal by 1
    else:
        journal_dict[title] = 1 # set the count of journal to 1

# Close the efetch handle
handle.close()

# print final journal name and count
print("Dr. Wilke's paper(s) appear in journals:")
for title in journal_dict:
    print(" ", title + ":", str(journal_dict[title]) + "x")
Dr. Wilke's paper(s) appear in journals:
  F1000Research: 1x
  Cell reports: 1x
  eLife: 3x
  G3 (Bethesda, Md.): 1x
  PeerJ: 7x
  Scientific reports: 1x
  BMC genomics: 1x
  PloS one: 4x
  Annual review of biophysics: 1x
  Journal of the Royal Society, Interface: 2x
  Virus evolution: 2x
  Genetics: 1x
  Molecular biology and evolution: 4x
  PLoS biology: 1x
  Proteins: 1x
  Protein science : a publication of the Protein Society: 1x
  Nature reviews. Genetics: 1x
  Proceedings of the National Academy of Sciences of the United States of America: 2x
  Journal of virology: 2x
  PLoS computational biology: 1x
  PLoS pathogens: 1x
  Science (New York, N.Y.): 1x
  Physical biology: 1x
  Journal of molecular evolution: 2x
  AIDS research and human retroviruses: 1x
  Epidemics: 1x
  The Laryngoscope: 1x
  Biology direct: 1x
  International forum of allergy & rhinology: 1x
  Philosophical transactions of the Royal Society of London. Series B, Biological sciences: 1x
  The Journal of general virology: 1x

Problem 3 (4 points): From 2013-2017 (inclusive), how many of Dr. Wilke's papers contain the terms "virus" or "viral" in the title? Use python and regular expressions to find an answer.

Hint: In a regular expression, you can match the same word with slightly different endings using the "|" (or) operator. For example, the regex "bacteri(a|um)" would match both "bacteria" and "bacterium".
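The "|" operator from the hint can be tried on a few strings. This is only a sketch with made-up example text. Note that `re.search()` matches anywhere in the string, so "bacteri(a|um)" also matches inside "bacterial". The second pattern adds `\b` word boundaries to show how whole-word matching would differ.

```python
import re

# Pattern from the hint, plus a stricter variant with word boundaries
pattern = re.compile(r"bacteri(a|um)")
whole_word = re.compile(r"\bbacteri(a|um)\b")

# Made-up example strings
examples = [
    "Bacteria in soil",
    "a single bacterium",
    "bacterial growth",
    "yeast cells",
]

# Lowercase each string first so the match is case-insensitive
matches = [text for text in examples if pattern.search(text.lower())]
strict_matches = [text for text in examples if whole_word.search(text.lower())]

print(matches)
print(strict_matches)
```

Whether substring matches such as "bacterial" (or, for this problem, "viruses" and "antiviral") should count depends on how the question is read. The unanchored pattern counts them.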

In [3]:
# You'll need the module re for regular expressions
import re

handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)

# Start a counter
title_count = 0

for record in records:
    # Check for the term "virus" or "viral" in the title 
    match1 = re.search(r"vir(us|al)", record['TI'].lower())

    # if "virus" or "viral" is in the title, increment the count by 1 
    if match1: 
        title_count += 1

# Close the efetch handle
handle.close()

print(title_count, 'of the titles contain "virus" or "viral"')
14 of the titles contain "virus" or "viral"