Lab Worksheet 14

The web interface to BLAST is available here: http://blast.ncbi.nlm.nih.gov/Blast.cgi

Let's search for proteins related to the following query sequence, the human C-X-C chemokine receptor type 4 (CXCR4), a receptor that plays a fundamental role in the immune system:

>human
MSIPLPLLQIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIVGN
GLVILVMGYQKKLRSMTDKYRLHLSVADLLFVITLPFWAVDAVANWYFGNFLCKAVHVIY
TVNLYSSVLILAFISLDRYLAIVHATNSQRPRKLLAEKVVYVGVWIPALLLTIPDFIFAN
VSEADDRYICDRFYPNDLWVVVFQFQHIMVGLILPGIVILSCYCIIISKLSHSKGHQKRK
ALKTTVILILAFFACWLPYYIGISIDSFILLEIIKQGCEFENTVHKWISITEALAFFHCC
LNPILYAFLGAKFKTSAQHALTSVSRGSSLKILSKGKRGGHSSVSTESESSSFHSS

Problems

Problem 1:

Download the BLAST results from the NCBI website in XML format and store them as cxcr4_BLAST.xml. Extract the GenBank identifiers (written as gb|string|, where string is the actual identifier, consisting of letters, numbers, and the period symbol) for all matches with a score greater than or equal to 1600 and less than or equal to 1800, and store them in a Python list. For matches that list multiple GenBank identifiers, extract only the first one.

# Your code goes here
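
One possible approach to Problem 1, as a minimal sketch rather than a reference solution. It uses only the Python standard library and rests on several assumptions: that the hits are stored in the NCBI BLAST XML elements Hit, Hit_id, Hit_def, and Hsp_score; that "score" means the raw HSP score rather than the bit score; and that a hit with several HSPs is judged by its highest-scoring one. The function name extract_gb_ids and the identifiers in SAMPLE are hypothetical and exist only for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches gb|IDENTIFIER|, where IDENTIFIER is letters, digits, and periods.
GB_PATTERN = re.compile(r"gb\|([A-Za-z0-9.]+)\|")


def extract_gb_ids(xml_source, min_score=1600, max_score=1800):
    """Return the first GenBank identifier of every hit whose best raw
    HSP score lies in the closed interval [min_score, max_score].

    xml_source may be a file name or an open file-like object.
    """
    root = ET.parse(xml_source).getroot()
    ids = []
    for hit in root.iter("Hit"):
        scores = [float(s.text) for s in hit.iter("Hsp_score")]
        # Assumption: a hit is judged by its highest-scoring HSP.
        if not scores or not (min_score <= max(scores) <= max_score):
            continue
        # Hit_id comes first; Hit_def may list further identifiers.
        text = hit.findtext("Hit_id", "") + " " + hit.findtext("Hit_def", "")
        match = GB_PATTERN.search(text)  # search() returns the first match only
        if match:
            ids.append(match.group(1))
    return ids


# Tiny made-up document with hypothetical identifiers, for illustration only.
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_id>gi|1|gb|ABC123.1|</Hit_id>
     <Hit_def>first &gt;gi|2|gb|DEF456.1| second</Hit_def>
     <Hit_hsps><Hsp><Hsp_score>1700</Hsp_score></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>gi|5|gb|JKL000.1|</Hit_id>
     <Hit_def>too high</Hit_def>
     <Hit_hsps><Hsp><Hsp_score>1900</Hsp_score></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>gi|3|ref|NP_000001.1|</Hit_id>
     <Hit_def>x &gt;gi|4|gb|GHI789.2| y</Hit_def>
     <Hit_hsps><Hsp><Hsp_score>1600</Hsp_score></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

demo_ids = extract_gb_ids(io.StringIO(SAMPLE))
print(demo_ids)
```

For the real data, the call would be extract_gb_ids("cxcr4_BLAST.xml"). If the raw-score assumption does not match the intended meaning of "score", the element name in the sketch is the only place that needs to change.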

Problem 2:

Using the list of GenBank identifiers obtained in Problem 1, download the corresponding sequences from GenBank and print them out in FASTA format.

Hints:

You will have to specify the database as "protein" for this to work, since the previous exercise generated identifiers for protein sequences.
Use the function SeqIO.write() to output your results in FASTA format, and use sys.stdout from the sys module as your output handle.

# Your code goes here
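
A hedged sketch of how the hints might be combined, assuming Biopython is installed and a network connection to NCBI is available. The function name print_fasta and the email parameter are hypothetical. The sketch fetches the records in GenBank format and converts them with SeqIO.write(), which is one of several workable choices.

```python
import sys


def print_fasta(gb_ids, email):
    """Fetch protein records for gb_ids from GenBank and write them to
    standard output in FASTA format. Returns the list of fetched records."""
    if not gb_ids:
        return []  # nothing to fetch, so skip the network request

    # Imported here so the function can be defined even where Biopython
    # is not installed; it is only needed once a download is requested.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address with each request
    handle = Entrez.efetch(db="protein", id=",".join(gb_ids),
                           rettype="gb", retmode="text")
    records = list(SeqIO.parse(handle, "genbank"))
    handle.close()

    # sys.stdout acts as the output handle, as suggested in the hints.
    SeqIO.write(records, sys.stdout, "fasta")
    return records
```

A call would look like print_fasta(ids, "your.name@example.com"), where ids is the list from Problem 1 and the address is a placeholder for your own.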

Problem 3:

Using the FASTA-formatted sequences from Problem 2, build a multiple sequence alignment and a phylogenetic tree with the Clustal Omega web interface: http://www.ebi.ac.uk/Tools/msa/clustalo/.