Apr 7, 2020
The Biopython package, available at biopython.org, consists of a large set of helpful functions and tools to solve frequently encountered problems in computational biology. In particular, it has excellent functionality to download and analyze sequence data. It also has a useful module for carrying out analysis of protein structure.
Here, we will start by doing some basic sequence analysis. We will use the biopython modules Entrez
and SeqIO
:
from Bio import Entrez, SeqIO
Entrez
provides a computational interface to the widely-use entrez online database provided by the National Institutes of Health: http://www.ncbi.nlm.nih.gov/pubmed. In this database you can find almost any information of interest in biological research.
As an example, let's look at the gene of a recent influenza virus. Specifically, we look at hemagglutinin (HA) from a 2009 swine-flu strain. It is listed by ID number FJ966082.1
, and it can be found online here: http://www.ncbi.nlm.nih.gov/nuccore/FJ966082.1
We can download this record directly from python using the following Biopython code:
Entrez.email = "wilke@austin.utexas.edu" # put your email here
# Download sequence record for genbank id KT220438
# This is HA gene of Influenza A virus, strain A/NewJersey/NHRC_93219/2015(H3N2)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
# The sequence record is now stored in the variable `record` and
# we can print it to see what it contains
print(record)
Since the record is a simple python object, we can access elements of it as usual:
print("- ID of the record:")
print(record.id)
print("\n- Brief description of the record:")
print(record.description)
print("\n- Annotations that come with the record (given as a python dictionary):")
print(record.annotations)
print("\n- The sequence in this record:")
print(record.seq)
We can also do things like extract the DNA sequence and translate it into a protein sequence:
# extract the sequence from the record
DNA_seq = record.seq
# translate into a protein sequence
protein_seq = DNA_seq.translate()
print(protein_seq)
Problem 1:
(a) Download the sequence record with the ID number FJ966082
and print it out. What kind of a sequence is this?
(b) Print out the comments section of the annotation
(c) Translate the DNA sequence into a protein sequence and print it out
# Problem 1a
# you will need to import the appropriate modules for this to work
from Bio import Entrez, SeqIO
# always give Entrez your email
Entrez.email = "Your.Name.Here@example.org" # replace with your email
# your code goes here
# Problem 1b
# your code goes here
# Problem 1c
# your code goes here
Problem 2:
Print the record you downloaded under Problem 1 in FASTA format. This means that you need to first print a line starting with ">" plus some description of the record. Then you need to print a line containing the sequence in the record.
# your code goes here
Problem 3:
Write a function that takes a sequence record as input and prints it out in FASTA format. Write the function in such a way that it breaks the sequence over multiple lines, such that each line contains at most 60 characters.
# your code goes here
Problem 4:
Write a function that takes a DNA sequence as input, translates the sequence in all three reading frames, and counts the number of stop codons found in each translation. Remember that stop codons are indicated by a "*".
Note that biopython doesn't like to translate sequences whose length is not a multiple of three, so you will have to pad the sequence with trailing Ns to avoid a warning or error.
Hint: The solution to this problem will be simpler if you first write a function that can count the stop codons in one sequence.
# your code goes here