Lab Worksheet 14

The web interface to BLAST is available here: http://blast.ncbi.nlm.nih.gov/Blast.cgi

Let's search for proteins related to the following query sequence, which is the human chemokine receptor 4 (a receptor that plays a fundamental role in the immune system):

>human
MSIPLPLLQIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIVGN
GLVILVMGYQKKLRSMTDKYRLHLSVADLLFVITLPFWAVDAVANWYFGNFLCKAVHVIY
TVNLYSSVLILAFISLDRYLAIVHATNSQRPRKLLAEKVVYVGVWIPALLLTIPDFIFAN
VSEADDRYICDRFYPNDLWVVVFQFQHIMVGLILPGIVILSCYCIIISKLSHSKGHQKRK
ALKTTVILILAFFACWLPYYIGISIDSFILLEIIKQGCEFENTVHKWISITEALAFFHCC
LNPILYAFLGAKFKTSAQHALTSVSRGSSLKILSKGKRGGHSSVSTESESSSFHSS

Problems

Problem 1:

Download the blast results from the NCBI website in XML format and store them as cxcr4_BLAST.xml. Extract the genbank identifiers (written as gb|string|, where string is the actual identifier, consisting of letters, numbers, and the period symbol) for all matches with a score greater than or equal to 1600 and less than or equal 1800, and store them in a python list. For matches that list multiple genbank identifiers, only extract the first one.

In [1]:
# Your code goes here

Problem 2:

Using the list of genbank identifiers obtained in the previous exercise, download the corresponding sequences from genbank and print them out in FASTA format.

Hints:

  • You will have to specify the database as "protein" for this to work, since the previous exercise generated identifiers for protein sequences.
  • Use the function SeqIO.write() to output your results in FASTA format, and use sys.stdout from the sys module as your output handle.
In [2]:
# Your code goes here

Problem 3:

Use the FASTA format of the sequences from problem 2 and make a multiple sequence alignment and phylogenetic tree with the Clustal Omega web interface: http://www.ebi.ac.uk/Tools/msa/clustalo/.