Class 26: BLAST

April 19, 2018

The web interface to BLAST is available here: http://blast.ncbi.nlm.nih.gov/Blast.cgi

Let's search for proteins related to the following query sequence, which is the glycoprotein of Machupo virus (causative agent of Bolivian hemorrhagic fever):

>GI:45825963|Machupo virus glycoprotein
MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCS
DGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYYMKGGANIFLIRVSDVS
VLMKEYDVSVYEPEDLGNCLNKSDSSWAIHWFSIALGHDWLMDPPMLCRNKTKKEGSN
IQFNISKADESRVYGKKIRNGMRHLFRGFYDPCEEGKVCYVTINQCGDPSSFEYCGTN
YLSKCQFDHVNTLHFLVRSKTHLNF

We can download the BLAST results from the NCBI website in XML format and store them as Machupo_BLAST.xml. A copy of this file is available on the course website; the code below downloads it.

Now we can process this file with Biopython.

In [1]:
from Bio.Blast import NCBIXML
from urllib.request import urlretrieve # to download xml file

# download file from course website and store locally
urlretrieve('http://wilkelab.org/classes/SDS348/data_sets/Machupo_BLAST.xml', 'Machupo_BLAST.xml')

# open the downloaded file and parse with NCBIXML.read();
# the with statement closes the file automatically
with open("Machupo_BLAST.xml") as blast_handle:
    blast_record = NCBIXML.read(blast_handle)

imax = 30 # process the first 30 alignments
i = 0
for alignment in blast_record.alignments:
    i += 1
    if i > imax:
        break
    # we need a for loop here because in theory we could have
    # more than one hsp (High-scoring Segment Pair) per alignment
    for hsp in alignment.hsps:
        print('\n****Alignment****')
        print('sequence ID:', alignment.title)
        print('length:', alignment.length)
        print('score:', hsp.score)
        print('e value:', hsp.expect)
        print("Query:", hsp.query[0:100] + '...')
        print("Match:", hsp.match[0:100] + '...')
        print("  Hit:", hsp.sbjct[0:100] + '...')
****Alignment****
sequence ID: gi|45825964|gb|AAS77647.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1381.0
e value: 0.0
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45826506|gb|AAS77879.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1379.0
e value: 0.0
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825936|gb|AAS77633.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1274.0
e value: 4.8109e-175
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQL+SFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLVSFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825934|gb|AAS77632.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1274.0
e value: 5.5461e-175
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825948|gb|AAS77639.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825950|gb|AAS77640.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1269.0
e value: 3.05564e-174
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825952|gb|AAS77641.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825954|gb|AAS77642.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825956|gb|AAS77643.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825958|gb|AAS77644.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1266.0
e value: 7.65872e-174
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825944|gb|AAS77637.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825946|gb|AAS77638.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1262.0
e value: 3.31687e-173
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825960|gb|AAS77645.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1258.0
e value: 1.2876e-172
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLC+LNN+FYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCILNNNFYY...

****Alignment****
sequence ID: gi|45825932|gb|AAS77631.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1257.0
e value: 1.86764e-172
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825912|gb|AAS77621.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825914|gb|AAS77622.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825916|gb|AAS77623.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825918|gb|AAS77624.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825920|gb|AAS77625.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825922|gb|AAS77626.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825924|gb|AAS77627.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825926|gb|AAS77628.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825928|gb|AAS77629.1| glycoprotein 1, partial [Machupo mammarenavirus] >gi|45825930|gb|AAS77630.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1253.0
e value: 9.52979e-172
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKG+INLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHS+ELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGVINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSSELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825942|gb|AAS77636.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1250.0
e value: 2.60674e-171
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|62766416|gb|AAX99337.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1272.0
e value: 4.20087e-171
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825938|gb|AAS77634.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1248.0
e value: 4.31124e-171
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825940|gb|AAS77635.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1242.0
e value: 3.92735e-170
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|34365533|ref|NP_899212.1| glycoprotein precursor [Machupo mammarenavirus] >gi|22901291|gb|AAN09942.1| glycoprotein precursor [Machupo mammarenavirus] >gi|23307851|gb|AAN05425.1| glycoprotein precursor [Machupo mammarenavirus] >gi|45826503|gb|AAS77877.1| glycoprotein precursor [Machupo mammarenavirus] >gi|48095766|gb|AAT40451.1| glycoprotein precursor [Machupo mammarenavirus] >gi|62766413|gb|AAX99335.1| glycoprotein precursor [Machupo mammarenavirus] >gi|365976987|gb|AEX08372.1| glycoprotein precursor [Machupo mammarenavirus] >gi|666915575|gb|AIG51558.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1265.0
e value: 6.06597e-170
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|62766419|gb|AAX99339.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1253.0
e value: 3.68973e-168
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|62766404|gb|AAX99329.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1249.0
e value: 1.37897e-167
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKG+INLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHS+ELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGVINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSSELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|45825962|gb|AAS77646.1| glycoprotein 1, partial [Machupo mammarenavirus]
length: 257
score: 1224.0
e value: 1.91596e-167
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|82002961|sp|Q6IUF7.1|GLYC_MACHU RecName: Full=Pre-glycoprotein polyprotein GP complex; Contains: RecName: Full=Stable signal peptide; Short=SSP; Contains: RecName: Full=Glycoprotein G1; Short=GP1; Contains: RecName: Full=Glycoprotein G2; Short=GP2 >gi|48525711|gb|AAT45081.1| glycoprotein precursor [Machupo mammarenavirus] >gi|62766401|gb|AAX99327.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1248.0
e value: 2.10907e-167
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|62766410|gb|AAX99333.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1238.0
e value: 6.66728e-166
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|48095772|gb|AAT40455.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1213.0
e value: 4.05713e-162
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNS YY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSLYY...

****Alignment****
sequence ID: gi|62766407|gb|AAX99331.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1212.0
e value: 5.15547e-162
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYY...

****Alignment****
sequence ID: gi|365976993|gb|AEX08376.1| glycoprotein precursor [Machupo mammarenavirus]
length: 496
score: 1197.0
e value: 9.28986e-160
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFL L GRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNS YY...
  Hit: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLLLXGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSXYY...

****Alignment****
sequence ID: gi|255648557|gb|ACU24736.1| glycoprotein precursor, partial [Machupo mammarenavirus]
length: 473
score: 1093.0
e value: 2.75951e-144
Query: VAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYD...
Match: VAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQ LLANHSNELPSLCMLNNSFYYMKGG N FLIRVSD+SVLMKE+D...
  Hit: VAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQGLLANHSNELPSLCMLNNSFYYMKGGVNTFLIRVSDISVLMKEHD...

****Alignment****
sequence ID: gi|255648545|gb|ACU24728.1| glycoprotein precursor, partial [Machupo mammarenavirus] >gi|255648548|gb|ACU24730.1| glycoprotein precursor, partial [Machupo mammarenavirus] >gi|255648551|gb|ACU24732.1| glycoprotein precursor, partial [Machupo mammarenavirus]
length: 473
score: 1086.0
e value: 3.4405e-143
Query: VAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYD...
Match: VAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQRLLANHSNELPSLCMLNNSFYYMKGG N FLIRVS +SVL +E+D...
  Hit: VAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQRLLANHSNELPSLCMLNNSFYYMKGGVNTFLIRVSSISVLSREHD...

****Alignment****
sequence ID: gi|255648554|gb|ACU24734.1| glycoprotein precursor, partial [Machupo mammarenavirus]
length: 473
score: 1048.0
e value: 1.77582e-137
Query: VAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYD...
Match: VAVSLIAVIKGIINLYKSGLFQFIFFL LAGRSCSDGTFKIGLHTEFQSVT TMQ LLANHSNELPSLCMLNNSFYYMKGG N FLIRVS VSV+ +E+D...
  Hit: VAVSLIAVIKGIINLYKSGLFQFIFFLLLAGRSCSDGTFKIGLHTEFQSVTLTMQGLLANHSNELPSLCMLNNSFYYMKGGVNTFLIRVSSVSVVSREHD...

****Alignment****
sequence ID: gi|240104274|pdb|2WFO|A Chain A, Crystal Structure Of Machupo Virus Envelope Glycoprotein Gp1
length: 182
score: 938.0
e value: 4.9368e-125
Query: ELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYDVSVYEPEDLGNCLNKSDSSWAIHWFSIALGHDWLMDPPMLCRNKTKKEGSNIQFNISKADESR...
Match: ELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYDVSVYEPEDLGNCLNKSDSSWAIHWFSIALGHDWLMDPPMLCRNKTKKEGSNIQFNISKADESR...
  Hit: ELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYDVSVYEPEDLGNCLNKSDSSWAIHWFSIALGHDWLMDPPMLCRNKTKKEGSNIQFNISKADESR...

****Alignment****
sequence ID: gi|290790109|pdb|3KAS|B Chain B, Machupo Virus Gp1 Bound To Human Transferrin Receptor 1
length: 162
score: 841.0
e value: 1.04041e-110
Query: NHSNELPSLCMLNNSFYYMKGGANIFLIRVSDVSVLMKEYDVSVYEPEDLGNCLNKSDSSWAIHWFSIALGHDWLMDPPMLCRNKTKKEGSNIQFNISKA...
Match: NHSNELPSLCMLNNSFYYM+GG N FLIRVSD+SVLMKEYDVS+YEPEDLGNCLNKSDSSWAIHWFS ALGHDWLMDPPMLCRNKTKKEGSNIQFNISKA...
  Hit: NHSNELPSLCMLNNSFYYMRGGVNTFLIRVSDISVLMKEYDVSIYEPEDLGNCLNKSDSSWAIHWFSNALGHDWLMDPPMLCRNKTKKEGSNIQFNISKA...

****Alignment****
sequence ID: gi|40807309|ref|NP_955756.1| glycoprotein G1 [Junin mammarenavirus]
length: 247
score: 710.0
e value: 1.29708e-89
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQ ISF QEIP FLQEALNIALVAVSLIA+IKGI+NLYKSGLFQF  FL LAGRSC++  FKIGLHTEFQ+V+F+M  L +N+ ++LP LC LN S  Y...
  Hit: MGQFISFMQEIPTFLQEALNIALVAVSLIAIIKGIVNLYKSGLFQFFVFLALAGRSCTEEAFKIGLHTEFQTVSFSMVGLFSNNPHDLPLLCTLNKSHLY...

****Alignment****
sequence ID: gi|115510974|gb|ABI99475.1| glycoprotein precursor [Junin mammarenavirus]
length: 485
score: 718.0
e value: 1.11343e-87
Query: MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQFIFFLFLAGRSCSDGTFKIGLHTEFQSVTFTMQRLLANHSNELPSLCMLNNSFYY...
Match: MGQ ISF QEIP FLQEALNIALVAVSLIA+IKGI+NLYKSGLFQF  FL LAGRSC++  FKIGLHTEFQ+V+F+M  LL+N  ++LP LC LN S  Y...
  Hit: MGQFISFMQEIPTFLQEALNIALVAVSLIAIIKGIVNLYKSGLFQFFVFLALAGRSCTEEAFKIGLHTEFQTVSFSMVGLLSNSPHDLPLLCTLNKSHLY...

Problems

Problem 1:

Count the number of hits with an E value of less than or equal to 1e-100.
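
Hint: E values are ordinary Python floats (`hsp.expect` in the code above), so they can be compared against a threshold directly. The following is only a minimal sketch of that comparison. It uses a few E values copied by hand from the output above instead of the parsed BLAST record, and the names `sample_evalues`, `threshold`, and `n_below` are placeholders chosen for illustration.

```python
# A few E values copied by hand from the output above (illustration only;
# the actual exercise should read them from blast_record instead).
sample_evalues = [0.0, 4.8109e-175, 1.04041e-110, 1.29708e-89, 1.11343e-87]
threshold = 1e-100

# Each comparison yields True or False; sum() counts the True values.
n_below = sum(e <= threshold for e in sample_evalues)
print(n_below)  # prints 3
```

The same comparison should work inside a loop over alignments and HSPs, with `hsp.expect` in place of the hand-copied values.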

In [2]:
# Your code goes here.

Problem 2:

Extract the GenBank identifiers (written as gb|string|, where string is the actual identifier, consisting of letters, numbers, and the period symbol) for all matches with an E value of less than or equal to 1e-100, and store them in a Python list. For matches that list multiple GenBank identifiers, extract only the first one.
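
Hint: One possible way to pull out the identifier is a regular expression. The sketch below applies it to a single title string copied from the output above, not to the parsed BLAST record. The pattern and the names `title`, `pattern`, and `first_id` are assumptions made for illustration, and other approaches (for example splitting the title on `|`) would work as well.

```python
import re

# One alignment title copied from the output above (illustration only).
title = (
    "gi|45825948|gb|AAS77639.1| glycoprotein 1, partial [Machupo mammarenavirus] "
    ">gi|45825950|gb|AAS77640.1| glycoprotein 1, partial [Machupo mammarenavirus]"
)

# gb|...| where ... consists of letters, digits, and periods;
# the parentheses capture the identifier itself.
pattern = re.compile(r"gb\|([A-Za-z0-9.]+)\|")

# search() returns the first occurrence only, or None if there is none.
match = pattern.search(title)
first_id = match.group(1) if match else None
print(first_id)  # prints AAS77639.1
```

Some titles in the output above (for example the pdb entries) contain no gb|...| field at all, which is why the sketch checks for a missing match before calling `group()`.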

In [3]:
# Your code goes here.

If this was easy

Problem 3:

Using the list of GenBank identifiers obtained in the previous exercise, download the corresponding sequences from GenBank and print them out in FASTA format. Hint: You will have to specify the database as "protein" for this to work, since the previous exercise generated identifiers for protein sequences.

Hint: Use the function SeqIO.write() to output your results in FASTA format, and use sys.stdout from the sys module as your output handle.

In [4]:
# Your code goes here.