Mar 27, 2018
The Biopython package, available at biopython.org, consists of a large set of helpful functions and tools to solve frequently encountered problems in computational biology. In particular, it has excellent functionality to download and analyze sequence data. It also has a useful module for carrying out analysis of protein structure.
Here, we will start by doing some basic sequence analysis. We will use the biopython modules
from Bio import Entrez, SeqIO
Entrez provides a computational interface to the widely-use entrez online database provided by the National Institutes of Health: http://www.ncbi.nlm.nih.gov/pubmed. In this database you can find almost any information of interest in biological research.
As an example, let's look at the gene of a recent influenza virus. Specifically, we look at hemagglutinin (HA) from a 2009 swine-flu strain. It is listed by ID number
FJ966082.1, and it can be found online here: http://www.ncbi.nlm.nih.gov/nuccore/FJ966082.1
We can download this record directly from python using the following Biopython code:
Entrez.email = "firstname.lastname@example.org" # put your email here # Download sequence record for genbank id KT220438 # This is HA gene of Influenza A virus, strain A/NewJersey/NHRC_93219/2015(H3N2) handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text") record = SeqIO.read(handle, "genbank") handle.close() # The sequence record is now stored in the variable `record` and # we can print it to see what it contains print(record)
ID: KT220438.1 Name: KT220438 Description: Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4 hemagglutinin (HA) gene, complete cds Number of features: 5 /molecule_type=cRNA /topology=linear /data_file_division=VRL /date=20-JUL-2015 /accessions=['KT220438'] /sequence_version=1 /keywords=[''] /source=Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2)) /organism=Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2)) /taxonomy=['Viruses', 'ssRNA viruses', 'ssRNA negative-strand viruses', 'Orthomyxoviridae', 'Influenzavirus A'] /references=[Reference(title='GEISS Influenza Surveillance Response Program', ...), Reference(title='Direct Submission', ...)] /structured_comment=OrderedDict([('Assembly-Data', OrderedDict([('Sequencing Technology', 'Sanger dideoxy sequencing')]))]) Seq('ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTTTTCGCTCAAAAA...TGA', IUPACAmbiguousDNA())
Since the record is a simple python object, we can access elements of it as usual:
print("- ID of the record:") print(record.id) print("\n- Brief description of the record:") print(record.description) print("\n- Annotations that come with the record (given as a python dictionary):") print(record.annotations) print("\n- The sequence in this record:") print(record.seq)
We can also do things like extract the DNA sequence and translate it into a protein sequence:
# extract the sequence from the record DNA_seq = record.seq # translate into a protein sequence protein_seq = DNA_seq.translate() print(protein_seq)
(a) Download the sequence record with the ID number
FJ966082 and print it out. What kind of a sequence is this?
(b) Print out the comments section of the annotation
(c) Translate the DNA sequence into a protein sequence and print it out
# Problem 1a # you will need to import the appropriate modules for this to work from Bio import Entrez, SeqIO # always give Entrez your email Entrez.email = "Your.Name.Here@example.org" # replace with your email # your code goes here
# Problem 1b # your code goes here
# Problem 1c # your code goes here
Print the record you downloaded under Problem 1 in FASTA format. This means that you need to first print a line starting with ">" plus some description of the record. Then you need to print a line containing the sequence in the record.
# your code goes here
Write a function that takes a sequence record as input and prints it out in FASTA format. Write the function in such a way that it breaks the sequence over multiple lines, such that each line contains at most 60 characters.
# your code goes here
Write a function that takes a DNA sequence as input, translates the sequence in all three reading frames, and counts the number of stop codons found in each translation. Remember that stop codons are indicated by a "*".
Note that biopython doesn't like to translate sequences whose length is not a multiple of three, so you will have to pad the sequence with trailing Ns to avoid a warning or error.
Hint: The solution to this problem will be simpler if you first write a function that can count the stop codons in one sequence.
# your code goes here