Class 21: Searching the NCBI databases

April 14, 2020

Different ways of processing Entrez results

Every time we download data through the Entrez module, we can interact with the results in different ways. First, we can treat the handle we obtain as an ordinary file handle and store or process the raw data provided by Entrez. Second, we can process the data with an existing Biopython module. The second option is generally preferable when a suitable module exists. However, the number of available choices can make things confusing.

As an example, we will again download the GenBank record with the ID "KT220438", containing the gene for an influenza HA protein. We will consider four different ways of looking at the data. First, we use retmode="text" in Entrez.efetch(), download the raw data, and print it. We get a regular GenBank file as output:

In [1]:
from Bio import Entrez, SeqIO
Entrez.email = "wilke@austin.utexas.edu" # put your email here

# Download sequence record for genbank id KT220438 (HA from influenza A)
# Using text mode
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = handle.read() # read file directly
print(record)
handle.close()
LOCUS       KT220438                1701 bp    cRNA    linear   VRL 20-JUL-2015
DEFINITION  Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4
            hemagglutinin (HA) gene, complete cds.
ACCESSION   KT220438
VERSION     KT220438.1
KEYWORDS    .
SOURCE      Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))
  ORGANISM  Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))
            Viruses; Riboviria; Negarnaviricota; Polyploviricotina;
            Insthoviricetes; Articulavirales; Orthomyxoviridae;
            Alphainfluenzavirus.
REFERENCE   1  (bases 1 to 1701)
  AUTHORS   Sitz,C.R., Thammavong,H.L., Balansay-Ames,M.S., Hawksworth,A.W.,
            Myers,C.A. and Brice,G.T.
  TITLE     GEISS Influenza Surveillance Response Program
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1701)
  AUTHORS   Sitz,C.R., Thammavong,H.L., Balansay-Ames,M.S., Hawksworth,A.W.,
            Myers,C.A. and Brice,G.T.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-JUN-2015) Operational Infectious Diseases, Naval
            Health Research Center, 140 Sylvester Rd., San Diego, CA 92106, USA
COMMENT     ##Assembly-Data-START##
            Sequencing Technology :: Sanger dideoxy sequencing
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..1701
                     /organism="Influenza A virus (A/New
                     Jersey/NHRC_93219/2015(H3N2))"
                     /mol_type="viral cRNA"
                     /strain="A/NewJersey/NHRC_93219/2015"
                     /serotype="H3N2"
                     /isolation_source="nasopharyngeal swab"
                     /host="Homo sapiens"
                     /db_xref="taxon:1682360"
                     /segment="4"
                     /lab_host="MDCK"
                     /country="USA: New Jersey"
                     /collection_date="17-Jan-2015"
     gene            1..1701
                     /gene="HA"
     CDS             1..1701
                     /gene="HA"
                     /function="receptor binding and fusion protein"
                     /codon_start=1
                     /product="hemagglutinin"
                     /protein_id="AKQ43545.1"
                     /translation="MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKT
                     ITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDL
                     FVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSS
                     SSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRI
                     TVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKI
                     RSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLAT
                     GMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQ
                     INGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQ
                     HTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYR
                     DEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI
                     "
     mat_peptide     49..1035
                     /gene="HA"
                     /product="HA1"
     mat_peptide     1036..1698
                     /gene="HA"
                     /product="HA2"
ORIGIN      
        1 atgaagacta tcattgcttt gagctacatt ctatgtctgg ttttcgctca aaaaattcct
       61 ggaaatgaca atagcacggc aacgctgtgc cttgggcacc atgcagtacc aaacggaacg
      121 atagtgaaaa caatcacaaa tgaccgaatt gaagttacta atgctactga gctggttcag
      181 aattcctcaa taggtgaaat atgcgacagt cctcatcaga tccttgatgg agaaaactgc
      241 acactaatag atgctctatt gggagaccct cagtgtgatg gctttcaaaa taagaaatgg
      301 gacctttttg ttgaacgaag caaagcctac agcaactgct acccttatga tgtgccggat
      361 tatgcctccc ttaggtcact agttgcctca tccggcacac tggagtttaa caatgaaagc
      421 ttcaattgga ctggagtcac tcaaaacgga acaagttctg cttgcataag gagatctagt
      481 agtagtttct ttagtagatt aaattggttg acccacttaa actacacata cccagcattg
      541 aacgtgacta tgccaaacaa tgaacaattt gacaaattgt acatttgggg ggttcaccac
      601 ccgggtacgg acaaggacca aatcttcctg tatgctcaat catcaggaag aatcacagta
      661 tctaccaaaa gaagccaaca agctgtaatc ccaaatatcg gatctagacc cagaataagg
      721 gatatcccta gcagaataag catctattgg acaatagtaa aaccgggaga catacttttg
      781 attaacagca cagggaatct aattgctcct aggggttact tcaaaatacg aagtgggaaa
      841 agctcaataa tgagatcaga tgcacccatt ggcaaatgca agtctgaatg catcactcca
      901 aatggaagca ttcccaatga caaaccattc caaaatgtaa acaggatcac atacggggcc
      961 tgtcccagat atgttaagca tagcactcta aaattggcaa caggaatgcg aaatgtacca
     1021 gagaaacaaa ctagaggcat atttggcgca atagcgggtt tcatagaaaa tggttgggag
     1081 ggaatggtgg atggttggta cggtttcagg catcaaaatt ctgagggaag aggacaagca
     1141 gcagatctca aaagcactca agcagcaatc gatcaaatca atgggaagct gaatcgattg
     1201 atcgggaaaa ccaacgagaa attccatcag attgaaaaag aattctcaga agtagaagga
     1261 agaattcagg accttgagaa atatgttgag gacactaaaa tagatctctg gtcatacaac
     1321 gcggagcttc ttgttgccct ggagaaccaa catacarttg atctaactga ctcagaaatg
     1381 aacaaactgt ttgaaaaaac aaagaagcaa ctgagggaaa atgctgagga tatgggaaat
     1441 ggttgtttca aaatatacca caaatgtgac aatgcctgca taggatcaat aagaaatgga
     1501 acttatgacc acaatgtgta cagggatgaa gcattaaaca accggttcca gatcaaggga
     1561 gttgagctga agtcagggta caaagattgg atcctatgga tttcctytgc catatcatgt
     1621 tttttgcttt gtgttgcttt gttggggttc atcatgtggg cctgccaaaa gggcaacatt
     1681 aggtgcaaca tttgcatttg a
//


We can also, as we have done before, process this file using the SeqIO module:

In [2]:
# Download sequence record for genbank id KT220438 (HA from influenza A)
# Using text mode
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank") # parse with SeqIO
print(record)
handle.close()
ID: KT220438.1
Name: KT220438
Description: Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4 hemagglutinin (HA) gene, complete cds
Number of features: 5
/molecule_type=cRNA
/topology=linear
/data_file_division=VRL
/date=20-JUL-2015
/accessions=['KT220438']
/sequence_version=1
/keywords=['']
/source=Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))
/organism=Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))
/taxonomy=['Viruses', 'Riboviria', 'Negarnaviricota', 'Polyploviricotina', 'Insthoviricetes', 'Articulavirales', 'Orthomyxoviridae', 'Alphainfluenzavirus']
/references=[Reference(title='GEISS Influenza Surveillance Response Program', ...), Reference(title='Direct Submission', ...)]
/structured_comment=OrderedDict([('Assembly-Data', OrderedDict([('Sequencing Technology', 'Sanger dideoxy sequencing')]))])
Seq('ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTTTTCGCTCAAAAA...TGA', IUPACAmbiguousDNA())

In addition to text mode, we can also download the data in XML mode. XML is a structured data format that allows for easy machine-processing of complex data files. If we just print the raw data, though, it doesn't look appealing:

In [3]:
# Download sequence record for genbank id KT220438 (HA from influenza A)
# Using XML mode
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="xml")
record = handle.read() # read file directly
print(record)
handle.close()
<?xml version="1.0" encoding="UTF-8"  ?>
<!DOCTYPE GBSet PUBLIC "-//NCBI//NCBI GBSeq/EN" "https://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
<GBSet>
  <GBSeq>

    <GBSeq_locus>KT220438</GBSeq_locus>
    <GBSeq_length>1701</GBSeq_length>
    <GBSeq_strandedness>single</GBSeq_strandedness>
    <GBSeq_moltype>cRNA</GBSeq_moltype>
    <GBSeq_topology>linear</GBSeq_topology>
    <GBSeq_division>VRL</GBSeq_division>
    <GBSeq_update-date>20-JUL-2015</GBSeq_update-date>
    <GBSeq_create-date>20-JUL-2015</GBSeq_create-date>
    <GBSeq_definition>Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4 hemagglutinin (HA) gene, complete cds</GBSeq_definition>
    <GBSeq_primary-accession>KT220438</GBSeq_primary-accession>
    <GBSeq_accession-version>KT220438.1</GBSeq_accession-version>
    <GBSeq_other-seqids>
      <GBSeqid>gb|KT220438.1|</GBSeqid>
      <GBSeqid>gi|887493048</GBSeqid>
    </GBSeq_other-seqids>
    <GBSeq_source>Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))</GBSeq_source>
    <GBSeq_organism>Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))</GBSeq_organism>
    <GBSeq_taxonomy>Viruses; Riboviria; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus</GBSeq_taxonomy>
    <GBSeq_references>
      <GBReference>
        <GBReference_reference>1</GBReference_reference>
        <GBReference_position>1..1701</GBReference_position>
        <GBReference_authors>
          <GBAuthor>Sitz,C.R.</GBAuthor>
          <GBAuthor>Thammavong,H.L.</GBAuthor>
          <GBAuthor>Balansay-Ames,M.S.</GBAuthor>
          <GBAuthor>Hawksworth,A.W.</GBAuthor>
          <GBAuthor>Myers,C.A.</GBAuthor>
          <GBAuthor>Brice,G.T.</GBAuthor>
        </GBReference_authors>
        <GBReference_title>GEISS Influenza Surveillance Response Program</GBReference_title>
        <GBReference_journal>Unpublished</GBReference_journal>
      </GBReference>
      <GBReference>
        <GBReference_reference>2</GBReference_reference>
        <GBReference_position>1..1701</GBReference_position>
        <GBReference_authors>
          <GBAuthor>Sitz,C.R.</GBAuthor>
          <GBAuthor>Thammavong,H.L.</GBAuthor>
          <GBAuthor>Balansay-Ames,M.S.</GBAuthor>
          <GBAuthor>Hawksworth,A.W.</GBAuthor>
          <GBAuthor>Myers,C.A.</GBAuthor>
          <GBAuthor>Brice,G.T.</GBAuthor>
        </GBReference_authors>
        <GBReference_title>Direct Submission</GBReference_title>
        <GBReference_journal>Submitted (29-JUN-2015) Operational Infectious Diseases, Naval Health Research Center, 140 Sylvester Rd., San Diego, CA 92106, USA</GBReference_journal>
      </GBReference>
    </GBSeq_references>
    <GBSeq_comment>##Assembly-Data-START## ; Sequencing Technology :: Sanger dideoxy sequencing ; ##Assembly-Data-END##</GBSeq_comment>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_location>1..1701</GBFeature_location>
        <GBFeature_intervals>
          <GBInterval>
            <GBInterval_from>1</GBInterval_from>
            <GBInterval_to>1701</GBInterval_to>
            <GBInterval_accession>KT220438.1</GBInterval_accession>
          </GBInterval>
        </GBFeature_intervals>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>organism</GBQualifier_name>
            <GBQualifier_value>Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>mol_type</GBQualifier_name>
            <GBQualifier_value>viral cRNA</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>strain</GBQualifier_name>
            <GBQualifier_value>A/NewJersey/NHRC_93219/2015</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>serotype</GBQualifier_name>
            <GBQualifier_value>H3N2</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>isolation_source</GBQualifier_name>
            <GBQualifier_value>nasopharyngeal swab</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>host</GBQualifier_name>
            <GBQualifier_value>Homo sapiens</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>db_xref</GBQualifier_name>
            <GBQualifier_value>taxon:1682360</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>segment</GBQualifier_name>
            <GBQualifier_value>4</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>lab_host</GBQualifier_name>
            <GBQualifier_value>MDCK</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>country</GBQualifier_name>
            <GBQualifier_value>USA: New Jersey</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>collection_date</GBQualifier_name>
            <GBQualifier_value>17-Jan-2015</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
      <GBFeature>
        <GBFeature_key>gene</GBFeature_key>
        <GBFeature_location>1..1701</GBFeature_location>
        <GBFeature_intervals>
          <GBInterval>
            <GBInterval_from>1</GBInterval_from>
            <GBInterval_to>1701</GBInterval_to>
            <GBInterval_accession>KT220438.1</GBInterval_accession>
          </GBInterval>
        </GBFeature_intervals>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>gene</GBQualifier_name>
            <GBQualifier_value>HA</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
      <GBFeature>
        <GBFeature_key>CDS</GBFeature_key>
        <GBFeature_location>1..1701</GBFeature_location>
        <GBFeature_intervals>
          <GBInterval>
            <GBInterval_from>1</GBInterval_from>
            <GBInterval_to>1701</GBInterval_to>
            <GBInterval_accession>KT220438.1</GBInterval_accession>
          </GBInterval>
        </GBFeature_intervals>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>gene</GBQualifier_name>
            <GBQualifier_value>HA</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>function</GBQualifier_name>
            <GBQualifier_value>receptor binding and fusion protein</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>codon_start</GBQualifier_name>
            <GBQualifier_value>1</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>transl_table</GBQualifier_name>
            <GBQualifier_value>1</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>product</GBQualifier_name>
            <GBQualifier_value>hemagglutinin</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>protein_id</GBQualifier_name>
            <GBQualifier_value>AKQ43545.1</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>translation</GBQualifier_name>
            <GBQualifier_value>MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
      <GBFeature>
        <GBFeature_key>mat_peptide</GBFeature_key>
        <GBFeature_location>49..1035</GBFeature_location>
        <GBFeature_intervals>
          <GBInterval>
            <GBInterval_from>49</GBInterval_from>
            <GBInterval_to>1035</GBInterval_to>
            <GBInterval_accession>KT220438.1</GBInterval_accession>
          </GBInterval>
        </GBFeature_intervals>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>gene</GBQualifier_name>
            <GBQualifier_value>HA</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>product</GBQualifier_name>
            <GBQualifier_value>HA1</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>peptide</GBQualifier_name>
            <GBQualifier_value>QKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTR</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
      <GBFeature>
        <GBFeature_key>mat_peptide</GBFeature_key>
        <GBFeature_location>1036..1698</GBFeature_location>
        <GBFeature_intervals>
          <GBInterval>
            <GBInterval_from>1036</GBInterval_from>
            <GBInterval_to>1698</GBInterval_to>
            <GBInterval_accession>KT220438.1</GBInterval_accession>
          </GBInterval>
        </GBFeature_intervals>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>gene</GBQualifier_name>
            <GBQualifier_value>HA</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>product</GBQualifier_name>
            <GBQualifier_value>HA2</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>peptide</GBQualifier_name>
            <GBQualifier_value>GIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
    <GBSeq_sequence>atgaagactatcattgctttgagctacattctatgtctggttttcgctcaaaaaattcctggaaatgacaatagcacggcaacgctgtgccttgggcaccatgcagtaccaaacggaacgatagtgaaaacaatcacaaatgaccgaattgaagttactaatgctactgagctggttcagaattcctcaataggtgaaatatgcgacagtcctcatcagatccttgatggagaaaactgcacactaatagatgctctattgggagaccctcagtgtgatggctttcaaaataagaaatgggacctttttgttgaacgaagcaaagcctacagcaactgctacccttatgatgtgccggattatgcctcccttaggtcactagttgcctcatccggcacactggagtttaacaatgaaagcttcaattggactggagtcactcaaaacggaacaagttctgcttgcataaggagatctagtagtagtttctttagtagattaaattggttgacccacttaaactacacatacccagcattgaacgtgactatgccaaacaatgaacaatttgacaaattgtacatttggggggttcaccacccgggtacggacaaggaccaaatcttcctgtatgctcaatcatcaggaagaatcacagtatctaccaaaagaagccaacaagctgtaatcccaaatatcggatctagacccagaataagggatatccctagcagaataagcatctattggacaatagtaaaaccgggagacatacttttgattaacagcacagggaatctaattgctcctaggggttacttcaaaatacgaagtgggaaaagctcaataatgagatcagatgcacccattggcaaatgcaagtctgaatgcatcactccaaatggaagcattcccaatgacaaaccattccaaaatgtaaacaggatcacatacggggcctgtcccagatatgttaagcatagcactctaaaattggcaacaggaatgcgaaatgtaccagagaaacaaactagaggcatatttggcgcaatagcgggtttcatagaaaatggttgggagggaatggtggatggttggtacggtttcaggcatcaaaattctgagggaagaggacaagcagcagatctcaaaagcactcaagcagcaatcgatcaaatcaatgggaagctgaatcgattgatcgggaaaaccaacgagaaattccatcagattgaaaaagaattctcagaagtagaaggaagaattcaggaccttgagaaatatgttgaggacactaaaatagatctctggtcatacaacgcggagcttcttgttgccctggagaaccaacatacarttgatctaactgactcagaaatgaacaaactgtttgaaaaaacaaagaagcaactgagggaaaatgctgaggatatgggaaatggttgtttcaaaatataccacaaatgtgacaatgcctgcataggatcaataagaaatggaacttatgaccacaatgtgtacagggatgaagcattaaacaaccggttccagatcaagggagttgagctgaagtcagggtacaaagattggatcctatggatttcctytgccatatcatgttttttgctttgtgttgctttgttggggttcatcatgtgggcctgccaaaagggcaacattaggtgcaacatttgcatttga</GBSeq_sequence>
  </GBSeq>

</GBSet>
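
To illustrate what "structured" means here, the following is a minimal sketch that reads a few fields out of the XML using only Python's standard-library xml.etree.ElementTree module rather than Biopython. The XML string is a small, hand-copied excerpt of the output above (not the full record), and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# A small, hand-copied excerpt of the XML shown above (not the full record).
xml_excerpt = """<GBSet>
  <GBSeq>
    <GBSeq_locus>KT220438</GBSeq_locus>
    <GBSeq_length>1701</GBSeq_length>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>gene</GBFeature_key>
        <GBFeature_location>1..1701</GBFeature_location>
      </GBFeature>
      <GBFeature>
        <GBFeature_key>mat_peptide</GBFeature_key>
        <GBFeature_location>49..1035</GBFeature_location>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

root = ET.fromstring(xml_excerpt)  # root is the <GBSet> element
seq = root.find("GBSeq")           # the single <GBSeq> child
locus = seq.findtext("GBSeq_locus")
length = int(seq.findtext("GBSeq_length"))  # element text is a string
print(locus, length)

# iter() walks the whole subtree and yields every <GBFeature> element
feature_summary = [
    (f.findtext("GBFeature_key"), f.findtext("GBFeature_location"))
    for f in seq.iter("GBFeature")
]
print(feature_summary)
```

Every value is addressed by its tag name, so no fixed-column text parsing is needed. In practice a Biopython parser is usually the more convenient choice, since it handles the complete record for us.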

The advantage of XML mode is that the generic Entrez.parse() function can parse XML files returned from Entrez.efetch(). Also, some modules in Biopython cannot work with files obtained in text mode; they only work with files obtained in XML mode. The documentation will generally tell you which kind of input each module expects.

Reading the above example with Entrez.parse() gives us the following:

In [4]:
# Download sequence record for genbank id KT220438 (HA from influenza A)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="xml")
parsed = Entrez.parse(handle)
record = list(parsed)[0] # We need to convert the parsed contents into a list. Here we want just the 0th element.
handle.close()
print(record)
DictElement({'GBSeq_locus': 'KT220438', 'GBSeq_length': '1701', 'GBSeq_strandedness': 'single', 'GBSeq_moltype': 'cRNA', 'GBSeq_topology': 'linear', 'GBSeq_division': 'VRL', 'GBSeq_update-date': '20-JUL-2015', 'GBSeq_create-date': '20-JUL-2015', 'GBSeq_definition': 'Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4 hemagglutinin (HA) gene, complete cds', 'GBSeq_primary-accession': 'KT220438', 'GBSeq_accession-version': 'KT220438.1', 'GBSeq_other-seqids': ['gb|KT220438.1|', 'gi|887493048'], 'GBSeq_source': 'Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))', 'GBSeq_organism': 'Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))', 'GBSeq_taxonomy': 'Viruses; Riboviria; Negarnaviricota; Polyploviricotina; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus', 'GBSeq_references': [DictElement({'GBReference_reference': '1', 'GBReference_position': '1..1701', 'GBReference_authors': ['Sitz,C.R.', 'Thammavong,H.L.', 'Balansay-Ames,M.S.', 'Hawksworth,A.W.', 'Myers,C.A.', 'Brice,G.T.'], 'GBReference_title': 'GEISS Influenza Surveillance Response Program', 'GBReference_journal': 'Unpublished'}, attributes={}), DictElement({'GBReference_reference': '2', 'GBReference_position': '1..1701', 'GBReference_authors': ['Sitz,C.R.', 'Thammavong,H.L.', 'Balansay-Ames,M.S.', 'Hawksworth,A.W.', 'Myers,C.A.', 'Brice,G.T.'], 'GBReference_title': 'Direct Submission', 'GBReference_journal': 'Submitted (29-JUN-2015) Operational Infectious Diseases, Naval Health Research Center, 140 Sylvester Rd., San Diego, CA 92106, USA'}, attributes={})], 'GBSeq_comment': '##Assembly-Data-START## ; Sequencing Technology :: Sanger dideoxy sequencing ; ##Assembly-Data-END##', 'GBSeq_feature-table': [DictElement({'GBFeature_key': 'source', 'GBFeature_location': '1..1701', 'GBFeature_intervals': [DictElement({'GBInterval_from': '1', 'GBInterval_to': '1701', 'GBInterval_accession': 'KT220438.1'}, attributes={})], 'GBFeature_quals': [DictElement({'GBQualifier_name': 
'organism', 'GBQualifier_value': 'Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))'}, attributes={}), DictElement({'GBQualifier_name': 'mol_type', 'GBQualifier_value': 'viral cRNA'}, attributes={}), DictElement({'GBQualifier_name': 'strain', 'GBQualifier_value': 'A/NewJersey/NHRC_93219/2015'}, attributes={}), DictElement({'GBQualifier_name': 'serotype', 'GBQualifier_value': 'H3N2'}, attributes={}), DictElement({'GBQualifier_name': 'isolation_source', 'GBQualifier_value': 'nasopharyngeal swab'}, attributes={}), DictElement({'GBQualifier_name': 'host', 'GBQualifier_value': 'Homo sapiens'}, attributes={}), DictElement({'GBQualifier_name': 'db_xref', 'GBQualifier_value': 'taxon:1682360'}, attributes={}), DictElement({'GBQualifier_name': 'segment', 'GBQualifier_value': '4'}, attributes={}), DictElement({'GBQualifier_name': 'lab_host', 'GBQualifier_value': 'MDCK'}, attributes={}), DictElement({'GBQualifier_name': 'country', 'GBQualifier_value': 'USA: New Jersey'}, attributes={}), DictElement({'GBQualifier_name': 'collection_date', 'GBQualifier_value': '17-Jan-2015'}, attributes={})]}, attributes={}), DictElement({'GBFeature_key': 'gene', 'GBFeature_location': '1..1701', 'GBFeature_intervals': [DictElement({'GBInterval_from': '1', 'GBInterval_to': '1701', 'GBInterval_accession': 'KT220438.1'}, attributes={})], 'GBFeature_quals': [DictElement({'GBQualifier_name': 'gene', 'GBQualifier_value': 'HA'}, attributes={})]}, attributes={}), DictElement({'GBFeature_key': 'CDS', 'GBFeature_location': '1..1701', 'GBFeature_intervals': [DictElement({'GBInterval_from': '1', 'GBInterval_to': '1701', 'GBInterval_accession': 'KT220438.1'}, attributes={})], 'GBFeature_quals': [DictElement({'GBQualifier_name': 'gene', 'GBQualifier_value': 'HA'}, attributes={}), DictElement({'GBQualifier_name': 'function', 'GBQualifier_value': 'receptor binding and fusion protein'}, attributes={}), DictElement({'GBQualifier_name': 'codon_start', 'GBQualifier_value': '1'}, attributes={}), 
DictElement({'GBQualifier_name': 'transl_table', 'GBQualifier_value': '1'}, attributes={}), DictElement({'GBQualifier_name': 'product', 'GBQualifier_value': 'hemagglutinin'}, attributes={}), DictElement({'GBQualifier_name': 'protein_id', 'GBQualifier_value': 'AKQ43545.1'}, attributes={}), DictElement({'GBQualifier_name': 'translation', 'GBQualifier_value': 'MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI'}, attributes={})]}, attributes={}), DictElement({'GBFeature_key': 'mat_peptide', 'GBFeature_location': '49..1035', 'GBFeature_intervals': [DictElement({'GBInterval_from': '49', 'GBInterval_to': '1035', 'GBInterval_accession': 'KT220438.1'}, attributes={})], 'GBFeature_quals': [DictElement({'GBQualifier_name': 'gene', 'GBQualifier_value': 'HA'}, attributes={}), DictElement({'GBQualifier_name': 'product', 'GBQualifier_value': 'HA1'}, attributes={}), DictElement({'GBQualifier_name': 'peptide', 'GBQualifier_value': 'QKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTR'}, attributes={})]}, attributes={}), DictElement({'GBFeature_key': 'mat_peptide', 'GBFeature_location': '1036..1698', 'GBFeature_intervals': [DictElement({'GBInterval_from': '1036', 'GBInterval_to': '1698', 
'GBInterval_accession': 'KT220438.1'}, attributes={})], 'GBFeature_quals': [DictElement({'GBQualifier_name': 'gene', 'GBQualifier_value': 'HA'}, attributes={}), DictElement({'GBQualifier_name': 'product', 'GBQualifier_value': 'HA2'}, attributes={}), DictElement({'GBQualifier_name': 'peptide', 'GBQualifier_value': 'GIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI'}, attributes={})]}, attributes={})], 'GBSeq_sequence': 'atgaagactatcattgctttgagctacattctatgtctggttttcgctcaaaaaattcctggaaatgacaatagcacggcaacgctgtgccttgggcaccatgcagtaccaaacggaacgatagtgaaaacaatcacaaatgaccgaattgaagttactaatgctactgagctggttcagaattcctcaataggtgaaatatgcgacagtcctcatcagatccttgatggagaaaactgcacactaatagatgctctattgggagaccctcagtgtgatggctttcaaaataagaaatgggacctttttgttgaacgaagcaaagcctacagcaactgctacccttatgatgtgccggattatgcctcccttaggtcactagttgcctcatccggcacactggagtttaacaatgaaagcttcaattggactggagtcactcaaaacggaacaagttctgcttgcataaggagatctagtagtagtttctttagtagattaaattggttgacccacttaaactacacatacccagcattgaacgtgactatgccaaacaatgaacaatttgacaaattgtacatttggggggttcaccacccgggtacggacaaggaccaaatcttcctgtatgctcaatcatcaggaagaatcacagtatctaccaaaagaagccaacaagctgtaatcccaaatatcggatctagacccagaataagggatatccctagcagaataagcatctattggacaatagtaaaaccgggagacatacttttgattaacagcacagggaatctaattgctcctaggggttacttcaaaatacgaagtgggaaaagctcaataatgagatcagatgcacccattggcaaatgcaagtctgaatgcatcactccaaatggaagcattcccaatgacaaaccattccaaaatgtaaacaggatcacatacggggcctgtcccagatatgttaagcatagcactctaaaattggcaacaggaatgcgaaatgtaccagagaaacaaactagaggcatatttggcgcaatagcgggtttcatagaaaatggttgggagggaatggtggatggttggtacggtttcaggcatcaaaattctgagggaagaggacaagcagcagatctcaaaagcactcaagcagcaatcgatcaaatcaatgggaagctgaatcgattgatcgggaaaaccaacgagaaattccatcagattgaaaaagaattctcagaagtagaaggaagaattcaggaccttgagaaatatgttgaggacactaaaatagatctctggtcatacaacgcggagcttcttgttgccctggagaaccaacatacarttgatctaactgactcagaaatgaacaaactgtttgaaaaaacaaaga
agcaactgagggaaaatgctgaggatatgggaaatggttgtttcaaaatataccacaaatgtgacaatgcctgcataggatcaataagaaatggaacttatgaccacaatgtgtacagggatgaagcattaaacaaccggttccagatcaagggagttgagctgaagtcagggtacaaagattggatcctatggatttcctytgccatatcatgttttttgctttgtgttgctttgttggggttcatcatgtgggcctgccaaaagggcaacattaggtgcaacatttgcatttga'}, attributes={})

Although this output may not look useful, it is simply a set of nested dictionaries and lists that we can interrogate using standard Python techniques:

In [5]:
print(list(record.keys())) # print out all the keys in the dictionary
['GBSeq_locus', 'GBSeq_length', 'GBSeq_strandedness', 'GBSeq_moltype', 'GBSeq_topology', 'GBSeq_division', 'GBSeq_update-date', 'GBSeq_create-date', 'GBSeq_definition', 'GBSeq_primary-accession', 'GBSeq_accession-version', 'GBSeq_other-seqids', 'GBSeq_source', 'GBSeq_organism', 'GBSeq_taxonomy', 'GBSeq_references', 'GBSeq_comment', 'GBSeq_feature-table', 'GBSeq_sequence']
In [6]:
print(record['GBSeq_sequence']) # print out the sequence
atgaagactatcattgctttgagctacattctatgtctggttttcgctcaaaaaattcctggaaatgacaatagcacggcaacgctgtgccttgggcaccatgcagtaccaaacggaacgatagtgaaaacaatcacaaatgaccgaattgaagttactaatgctactgagctggttcagaattcctcaataggtgaaatatgcgacagtcctcatcagatccttgatggagaaaactgcacactaatagatgctctattgggagaccctcagtgtgatggctttcaaaataagaaatgggacctttttgttgaacgaagcaaagcctacagcaactgctacccttatgatgtgccggattatgcctcccttaggtcactagttgcctcatccggcacactggagtttaacaatgaaagcttcaattggactggagtcactcaaaacggaacaagttctgcttgcataaggagatctagtagtagtttctttagtagattaaattggttgacccacttaaactacacatacccagcattgaacgtgactatgccaaacaatgaacaatttgacaaattgtacatttggggggttcaccacccgggtacggacaaggaccaaatcttcctgtatgctcaatcatcaggaagaatcacagtatctaccaaaagaagccaacaagctgtaatcccaaatatcggatctagacccagaataagggatatccctagcagaataagcatctattggacaatagtaaaaccgggagacatacttttgattaacagcacagggaatctaattgctcctaggggttacttcaaaatacgaagtgggaaaagctcaataatgagatcagatgcacccattggcaaatgcaagtctgaatgcatcactccaaatggaagcattcccaatgacaaaccattccaaaatgtaaacaggatcacatacggggcctgtcccagatatgttaagcatagcactctaaaattggcaacaggaatgcgaaatgtaccagagaaacaaactagaggcatatttggcgcaatagcgggtttcatagaaaatggttgggagggaatggtggatggttggtacggtttcaggcatcaaaattctgagggaagaggacaagcagcagatctcaaaagcactcaagcagcaatcgatcaaatcaatgggaagctgaatcgattgatcgggaaaaccaacgagaaattccatcagattgaaaaagaattctcagaagtagaaggaagaattcaggaccttgagaaatatgttgaggacactaaaatagatctctggtcatacaacgcggagcttcttgttgccctggagaaccaacatacarttgatctaactgactcagaaatgaacaaactgtttgaaaaaacaaagaagcaactgagggaaaatgctgaggatatgggaaatggttgtttcaaaatataccacaaatgtgacaatgcctgcataggatcaataagaaatggaacttatgaccacaatgtgtacagggatgaagcattaaacaaccggttccagatcaagggagttgagctgaagtcagggtacaaagattggatcctatggatttcctytgccatatcatgttttttgctttgtgttgctttgttggggttcatcatgtgggcctgccaaaagggcaacattaggtgcaacatttgcatttga
In [7]:
features = record['GBSeq_feature-table'] # extract all the features
for feature in features: # loop over features and print feature key and feature location
    print(feature['GBFeature_key'] + ": " + feature['GBFeature_location'])
source: 1..1701
gene: 1..1701
CDS: 1..1701
mat_peptide: 49..1035
mat_peptide: 1036..1698
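
The location strings printed above use GenBank's 1-based, inclusive coordinates, so a feature at 49..1035 covers 1035 - 49 + 1 bases. As a minimal sketch, the locations from the output can be turned into integers and lengths as follows. The helper names parse_simple_location() and location_length() are made up for this illustration, and the code assumes plain "start..end" locations only; it does not handle join(...), complement(...), or partial locations.

```python
# Feature keys and locations copied from the output above.
feature_locations = [
    ("source", "1..1701"),
    ("gene", "1..1701"),
    ("CDS", "1..1701"),
    ("mat_peptide", "49..1035"),
    ("mat_peptide", "1036..1698"),
]

def parse_simple_location(location):
    """Split a plain 'start..end' location string into two integers.

    Hypothetical helper: handles only simple ranges, not join(...),
    complement(...), or partial locations such as '<1..200'.
    """
    start, end = location.split("..")
    return int(start), int(end)

def location_length(location):
    """Number of bases covered; GenBank positions are 1-based and inclusive."""
    start, end = parse_simple_location(location)
    return end - start + 1

for key, location in feature_locations:
    print(f"{key}: {location} ({location_length(location)} bases)")
```

The two mature peptides cover 987 and 663 bases, both multiples of three. The CDS extends three bases past the end of the second mature peptide, which matches the stop codon (tga) at the end of the sequence printed earlier.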

Running search queries through Entrez

So far we have only downloaded specific records from Entrez. However, we can also run searches directly from Python. Any query that you can run on the Entrez website (https://www.ncbi.nlm.nih.gov/) can also be executed from Python. This allows you to find a large number of records at once and process them in an automated fashion.

For example, below we will see how to automatically run and retrieve the results for the following search term: "sars-cov2 complete cds". A direct link to the search results on the Entrez website is here:
https://www.ncbi.nlm.nih.gov/nuccore/?term=sars-cov2+complete+cds

(Note that in the following Python code, we limit the number of search hits returned to the first 10.)

In [8]:
# let's do a search for SARS-CoV-2 records with complete coding sequences (cds)
handle = Entrez.esearch(
    db="nucleotide",  # database to search
    term="sars-cov2 complete cds",  # search term
    retmax=10  # maximum number of results that are returned
)
record = Entrez.read(handle)
handle.close()

gi_list = record["IdList"] # list of genbank identifiers found
print(gi_list)
['1829138230', '1829138218', '1829138206', '1829138194', '1829138182', '1829138170', '1829138158', '1829138146', '1829138134', '1829138121']

Note that even though NCBI is phasing out sequence GI numbers, for now the esearch() function still returns GI numbers (numerical sequence identifiers without version information).

We can download all the genbank records in the list of identifiers using the Entrez.efetch() function. This function returns a handle that we process with SeqIO.parse(). (We used SeqIO.read() previously, which reads a single record. SeqIO.parse() reads multiple records. See the Biopython SeqIO documentation for details.)

In [9]:
handle = Entrez.efetch(db="nucleotide", id=gi_list, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "genbank")

for record in records:
    print(record.description)
    
handle.close() # Important: close the handle only after you have iterated over the records. Otherwise you will get an error!
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0009/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0008/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0007/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0006/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0005/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0004/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0003/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0002/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/MI-SC2-0001/2020 ORF1ab polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UNC_200189/2020, complete genome
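
The reason the handle must stay open is that SeqIO.parse() returns an iterator that reads from the handle only when we loop over it. The following sketch illustrates the same behavior without any network access. It uses an in-memory io.StringIO object as a stand-in for the Entrez handle and a simple generator, parse_lines(), as a stand-in for SeqIO.parse(); both are assumptions made for illustration and not part of Biopython.

```python
import io

def parse_lines(handle):
    """Yield stripped lines one at a time; nothing is read until iteration starts."""
    for line in handle:
        yield line.strip()

# Closing first, iterating second: this fails.
handle = io.StringIO("record 1\nrecord 2\n")
records = parse_lines(handle)  # no data has been read yet
handle.close()
try:
    list(records)  # the first read happens here, on a closed handle
    error_message = None
except ValueError as err:
    error_message = str(err)
print("closed too early:", error_message)

# Iterating first (here by building a list), closing second: this works.
handle = io.StringIO("record 1\nrecord 2\n")
records = list(parse_lines(handle))
handle.close()
print(records)
```

With real data, one option is to write records = list(SeqIO.parse(handle, "genbank")) before closing the handle, at the cost of holding all records in memory at once.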

As another example, let's search the "pubmed" database (database of scientific publications) for papers from 2015 written by "Wilke CO". The exact search term we need to use is the following: "Wilke CO[Author] AND 2015[Date - Publication]"

You can click here to see the result from that search online.

In [10]:
handle = Entrez.esearch(
    db="pubmed",  # database to search
    term="Wilke CO[Author] AND 2015[Date - Publication]",  # search term
    retmax=10  # number of results that are returned
)
record = Entrez.read(handle)
handle.close()

# search returns PubMed IDs (pmids)
pmid_list = record["IdList"]
print(pmid_list)
['26770819', '26468068', '26430238', '26397960', '26355089', '26275208', '26020774', '25999509', '25787027', '25737813']

Just as with genes and genomes, we can download the records corresponding to these identifiers. Here, the records are literature references. For each one, we'll print the author list, the title, and the source (the journal reference).

In [11]:
from Bio import Medline
handle = Entrez.efetch(db="pubmed", id=pmid_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
for record in records:
    print(record['AU']) # author list
    print(record['TI']) # title
    print(record['SO']) # source (reference)
    print()
handle.close()
['Meyer AG', 'Spielman SJ', 'Bedford T', 'Wilke CO']
Time dependence of evolutionary metrics during the 2009 pandemic influenza virus outbreak.
Virus Evol. 2015 Jan;1(1). doi: 10.1093/ve/vev006. Epub 2015 Jan 1.

['Meyer AG', 'Wilke CO']
The utility of protein structure as a predictor of site-wise dN/dS varies widely among HIV-1 proteins.
J R Soc Interface. 2015 Oct 6;12(111):20150579. doi: 10.1098/rsif.2015.0579.

['Wilke CO']
Evolutionary paths of least resistance.
Proc Natl Acad Sci U S A. 2015 Oct 13;112(41):12553-4. doi: 10.1073/pnas.1517390112. Epub 2015 Oct 1.

['Spielman SJ', 'Wilke CO']
Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies.
PLoS One. 2015 Sep 23;10(9):e0139047. doi: 10.1371/journal.pone.0139047. eCollection 2015.

['Kerr SA', 'Jackson EL', 'Lungu OI', 'Meyer AG', 'Demogines A', 'Ellington AD', 'Georgiou G', 'Wilke CO', 'Sawyer SL']
Computational and Functional Analysis of the Virus-Receptor Interface Reveals Host Range Trade-Offs in New World Arenaviruses.
J Virol. 2015 Nov;89(22):11643-53. doi: 10.1128/JVI.01408-15. Epub 2015 Sep 9.

['Houser JR', 'Barnhart C', 'Boutz DR', 'Carroll SM', 'Dasgupta A', 'Michener JK', 'Needham BD', 'Papoulas O', 'Sridhara V', 'Sydykova DK', 'Marx CJ', 'Trent MS', 'Barrick JE', 'Marcotte EM', 'Wilke CO']
Controlled Measurement and Comparative Analysis of Cellular Components in E. coli Reveals Broad Regulatory Changes in Response to Glucose Starvation.
PLoS Comput Biol. 2015 Aug 14;11(8):e1004400. doi: 10.1371/journal.pcbi.1004400. eCollection 2015 Aug.

['Meyer AG', 'Wilke CO']
Geometric Constraints Dominate the Antigenic Evolution of Influenza H3N2 Hemagglutinin.
PLoS Pathog. 2015 May 28;11(5):e1004940. doi: 10.1371/journal.ppat.1004940. eCollection 2015 May.

['Kachroo AH', 'Laurent JM', 'Yellman CM', 'Meyer AG', 'Wilke CO', 'Marcotte EM']
Evolution. Systematic humanization of yeast genes reveals conserved functions and genetic modularity.
Science. 2015 May 22;348(6237):921-5. doi: 10.1126/science.aaa0769.

['Echave J', 'Jackson EL', 'Wilke CO']
Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites.
Phys Biol. 2015 Mar 19;12(2):025002. doi: 10.1088/1478-3975/12/2/025002.

['Spielman SJ', 'Kumar K', 'Wilke CO']
Comprehensive, structurally-informed alignment and phylogeny of vertebrate biogenic amine receptors.
PeerJ. 2015 Feb 17;3:e773. doi: 10.7717/peerj.773. eCollection 2015.
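
The records returned by Medline.parse() behave like dictionaries keyed by field tags such as 'AU', 'TI', and 'SO'. Not every PubMed record necessarily carries every field, so indexing with record['AU'] can raise a KeyError. A minimal sketch of a more defensive approach uses .get() with a default value. The helper format_citation() is hypothetical, and the example uses a plain dict filled with the first reference from the output above in place of a downloaded record.

```python
# First reference from the output above, written as a plain dict.
example_record = {
    "AU": ["Meyer AG", "Spielman SJ", "Bedford T", "Wilke CO"],
    "TI": "Time dependence of evolutionary metrics during the 2009 "
          "pandemic influenza virus outbreak.",
    "SO": "Virus Evol. 2015 Jan;1(1). doi: 10.1093/ve/vev006. Epub 2015 Jan 1.",
}

def format_citation(record):
    """Build a one-line citation, substituting placeholders for missing fields."""
    authors = ", ".join(record.get("AU", ["[no authors listed]"]))
    title = record.get("TI", "[no title]")
    source = record.get("SO", "[no source]")
    return f"{authors}. {title} {source}"

print(format_citation(example_record))
print(format_citation({}))  # a record with no fields still formats cleanly
```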

Problems

Problem 1

Use the following code to download the genbank record KT220438 in XML and parse it with the Entrez.parse() function:

In [12]:
# Download sequence record for genbank id KT220438 (HA from influenza A)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="xml")
parsed = Entrez.parse(handle)
record = list(parsed)[0] # Convert the parsed contents into a list and take element 0.
handle.close()

Then:

(a) Print out the value for the key GBSeq_definition.

(b) Find the CDS feature and print out all its qualifiers. Note that qualifiers are stored under the key GBFeature_quals.

In [13]:
# Problem 1a

# Your code goes here.
In [14]:
# Problem 1b

# Your code goes here.

Problem 2

(a) Use an Entrez esearch query of the pubmed database to find out how many publications "Spielman SJ" wrote in 2015.

(b) From the results of part (a), compile a combined list of all the co-authors of "Spielman SJ" in 2015.

In [15]:
# Problem 2a

# Your code goes here.
In [16]:
# Problem 2b

# Your code goes here.

If this was easy

Problem 3

For larger searches, NCBI wants you to use the WebEnv method to download all your search results. This is explained in the Biopython tutorial, in the section on using the history and WebEnv. Rewrite the SARS-COV2 search from the section "Running search queries through Entrez" so that it uses the WebEnv method. For this downloading method, it makes sense to write all the results into a file and then read the results back in.

In [17]:
# Problem 3

# Your code goes here.