April 14, 2020
Every time we download data through the Entrez module, we can interact with the results in different ways. First, we can just use the handle we obtain as an ordinary file handle and just store or process the raw data provided by Entrez. Second, we can process the data with an appropriate, existing Biopython module. The latter will generally be preferable if an appropriate module exists. However, the various choices that are available may make things confusing.
As an example, we will again download the genbank record with the ID "KT220438", containing an influenza HA protein. We will consider four different ways of looking at the data. First, we use retmode="text"
in Entrez.efetch()
and just download the raw data and print it. We get a regular genbank file as output:
from Bio import Entrez, SeqIO
Entrez.email = "wilke@austin.utexas.edu" # put your email here
# Download sequence record for genbank id KT220438 (HA from influenza A)
# Using text mode
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = handle.read() # read file directly
print(record)
handle.close()
We can also, as we have done before, process this file using the SeqIO
module:
# Download sequence record for genbank id KT220438 (HA from influenza A)
# Using text mode
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank") # parse with SeqIO
print(record)
handle.close()
In addition to text mode, we can also download the data in XML mode. XML is a structured data format that allows for easy machine-processing of complex data files. If we just print the raw data, though, it doesn't look appealing:
# Download sequence record for genbank id KT220438 (HA from influenza A)
# Using XML mode
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="xml")
record = handle.read() # read file directly
print(record)
handle.close()
The advantage of XML mode is that there is the generic Entrez.parse()
function that can parse XML files returned from Entrez.efetch()
. Also, some modules in Biopython cannot work with files obtained in text mode, they can only work on files obtained in XML mode. The documentation will generally tell you for each module what kind of input it expects.
Reading the above example with Entrez.parse()
gives us the following:
# Download sequence record for genbank id KT220438 (HA from influenza A)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="xml")
parsed = Entrez.parse(handle)
record = list(parsed)[0] # We need to convert the parsed contents into a list. Here we want just the 0th element.
handle.close()
print(record)
While this output may not seem useful, we now just have a set of nested dictionaries that we can interrogate using standard Python techniques:
print(list(record.keys())) # print out all the keys in the dictionary
print(record['GBSeq_sequence']) # print out the sequence
features = record['GBSeq_feature-table'] # extract all the features
for feature in features: # loop over features and print feature key and feature location
print(feature['GBFeature_key'] + ": " + feature['GBFeature_location'])
So far we have only downloaded specific records from Entrez. In addition to just downloading records, however, we can also run searches directly from python. Any query that you can do on the Entrez website (https://www.ncbi.nlm.nih.gov/) can also be executed directly from python. This allows you to find a large number of records all at once and process them in an automated fashion.
For example, below we will see how to automatically run and retrieve the results for the following search term: "influenza a virus texas h1n1 hemagglutinin complete cds". A direct link to the search results on the Entrez website is here:
https://www.ncbi.nlm.nih.gov/nuccore/?term=influenza+a+virus+texas+h1n1+hemagglutinin+complete+cds
(Note that in the following Python code, we limit the number of search hits returned to the first 10.)
# let's do a search for complete genomes of the SARS-COV2 virus
handle = Entrez.esearch(
db="nucleotide", # database to search
term="sars-cov2 complete cds", # search term
retmax=10 # maximum number of results that are returned
)
record = Entrez.read(handle)
handle.close()
gi_list = record["IdList"] # list of genbank identifiers found
print(gi_list)
Note that even though NCBI is phasing out sequence GI numbers, for now the esearch()
function still returns GI numbers (numerical sequence identifiers without version information).
We can download all the genbank records in the list of identifiers using the Entrez.efetch()
function. This function provides us with a handle that needs to be processed with SeqIO.parse()
. (We used SeqIO.read()
previously, which reads a single record. SeqIO.parse()
reads multiple records. See here for details.)
handle = Entrez.efetch(db="nucleotide", id=gi_list, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "genbank")
for record in records:
print(record.description)
handle.close() # important, close the handle only after you have iterated over the records. Otherwise you will get an error!
As another example, let's search the "pubmed" database (database of scientific publications) for papers from 2015 written by "Wilke CO". The exact search term we need to use is the following: "Wilke CO[Author] AND 2015[Date - Publication]"
You can click here to see the result from that search online.
handle = Entrez.esearch(
db="pubmed", # database to search
term="Wilke CO[Author] AND 2015[Date - Publication]", # search term
retmax=10 # number of results that are returned
)
record = Entrez.read(handle)
handle.close()
# search returns PubMed IDs (pmids)
pmid_list = record["IdList"]
print(pmid_list)
Just like with genes and genomes, we can download the records corresponding to these identifiers. They are references. We'll print the author(s), title, and reference (source).
from Bio import Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
for record in records:
print(record['AU']) # author list
print(record['TI']) # title
print(record['SO']) # source (reference)
print()
handle.close()
Problem 1
Use the following code to download the genbank record KT220438
in XML and parse it with the Entrez.parse()
function:
# Download sequence record for genbank id KT220438 (HA from influenza A)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="xml")
parsed = Entrez.parse(handle)
record = list(parsed)[0] # Convert the parsed contents into a list and take element 0.
handle.close()
Then:
(a) Print out the value for the key GBSeq_definition
.
(b) Find the CDS
feature and print out all its qualifiers. Note that qualifiers are provided under the keyword GBFeature_quals
.
# Problem 1a
# Your code goes here.
# Problem 1b
# Your code goes here.
Problem 2:
(a) Use an Entrez esearch query of the pubmed database to find out how many publications "Spielman SJ" wrote in 2015.
(b) From the results of part (a), compile a combined list of all the co-authors of "Spielman SJ" in 2015.
# Problem 2a
# Your code goes here.
# Problem 2b
# Your code goes here.
Problem 3:
For larger searches, NCBI wants you to use the WebEnv method to download all your search results. This is explained in the Biopython tutorial here. Rewrite the SARS-COV2 search from the section "Running search queries on through Entrez" in such a way that it uses the WebEnv method. For this downloading method, it makes sense to write all the results into a file and then read the results back in.
# Problem 3
# Your code goes here.