The web interface to BLAST is available here: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Let's search for proteins related to the following query sequence, which is the human chemokine receptor 4 (a receptor that plays a fundamental role in the immune system):
>human
MSIPLPLLQIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIVGN
GLVILVMGYQKKLRSMTDKYRLHLSVADLLFVITLPFWAVDAVANWYFGNFLCKAVHVIY
TVNLYSSVLILAFISLDRYLAIVHATNSQRPRKLLAEKVVYVGVWIPALLLTIPDFIFAN
VSEADDRYICDRFYPNDLWVVVFQFQHIMVGLILPGIVILSCYCIIISKLSHSKGHQKRK
ALKTTVILILAFFACWLPYYIGISIDSFILLEIIKQGCEFENTVHKWISITEALAFFHCC
LNPILYAFLGAKFKTSAQHALTSVSRGSSLKILSKGKRGGHSSVSTESESSSFHSS
Problem 1:
Download the blast results from the NCBI website in XML format and store them as cxcr4_BLAST.xml
. Extract the genbank identifiers (written as gb|string|, where string is the actual identifier, consisting of letters, numbers, and the period symbol) for all matches with a score greater than or equal to 1600 and less than or equal 1800, and store them in a python list. For matches that list multiple genbank identifiers, only extract the first one.
from Bio.Blast import NCBIXML
import re
# open the downloaded file and parse with NCBIXML.read()
blast_handle = open("cxcr4_BLAST.xml")
blast_record = NCBIXML.read(blast_handle)
blast_handle.close()
gb_list = []
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.score >= 1600 and hsp.score <= 1800:
match = re.search(r'gb\|([\w\d\.]+)\|', alignment.title)
if match:
gb_id = match.group(1)
gb_list.append(gb_id)
print(gb_list)
Problem 2:
Using the list of genbank identifiers obtained in the previous exercise, download the corresponding sequences from genbank and print them out in FASTA format.
Hints:
SeqIO.write()
to output your results in FASTA format, and use sys.stdout
from the sys
module as your output handle.from Bio import Entrez, SeqIO
import sys
Entrez.email = "dariya.k.sydykova@gmail.com" # put your email here
handle = Entrez.efetch(db="protein", id=gb_list, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "genbank")
for record in records:
SeqIO.write(record, sys.stdout, "fasta")
handle.close() # important, close the handle only after you have iterated over the records. Otherwise you will get an error!
Problem 3:
Use the FASTA format of the sequences from problem 2 and make a multiple sequence alignment and a phylogenetic tree with the Clustal Omega web interface: http://www.ebi.ac.uk/Tools/msa/clustalo/.