In-class worksheet 19

Apr 2, 2019

Introduction to Biopython

The Biopython package, available at biopython.org, consists of a large set of helpful functions and tools to solve frequently encountered problems in computational biology. In particular, it has excellent functionality to download and analyze sequence data. It also has a useful module for carrying out analysis of protein structure.

Here, we will start by doing some basic sequence analysis. We will use the biopython modules Entrez and SeqIO:

In [1]:
from Bio import Entrez, SeqIO

Entrez provides a computational interface to the widely-use entrez online database provided by the National Institutes of Health: http://www.ncbi.nlm.nih.gov/pubmed. In this database you can find almost any information of interest in biological research.

As an example, let's look at the gene of a recent influenza virus. Specifically, we look at hemagglutinin (HA) from a 2009 swine-flu strain. It is listed by ID number FJ966082.1, and it can be found online here: http://www.ncbi.nlm.nih.gov/nuccore/FJ966082.1

We can download this record directly from python using the following Biopython code:

In [2]:
Entrez.email = "wilke@austin.utexas.edu" # put your email here

# Download sequence record for genbank id KT220438
# This is HA gene of Influenza A virus, strain A/NewJersey/NHRC_93219/2015(H3N2)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# The sequence record is now stored in the variable `record` and
# we can print it to see what it contains 
print(record)
ID: KT220438.1
Name: KT220438
Description: Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4 hemagglutinin (HA) gene, complete cds
Number of features: 5
/structured_comment=OrderedDict([('Assembly-Data', OrderedDict([('Sequencing Technology', 'Sanger dideoxy sequencing')]))])
/organism=Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))
/accessions=['KT220438']
/topology=linear
/keywords=['']
/source=Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))
/date=20-JUL-2015
/data_file_division=VRL
/taxonomy=['Viruses', 'ssRNA viruses', 'ssRNA negative-strand viruses', 'Orthomyxoviridae', 'Influenzavirus A']
/molecule_type=cRNA
/sequence_version=1
/references=[Reference(title='GEISS Influenza Surveillance Response Program', ...), Reference(title='Direct Submission', ...)]
Seq('ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTTTTCGCTCAAAAA...TGA', IUPACAmbiguousDNA())

Since the record is a simple python object, we can access elements of it as usual:

In [3]:
print("- ID of the record:")
print(record.id)

print("\n- Brief description of the record:")
print(record.description)

print("\n- Annotations that come with the record (given as a python dictionary):")
print(record.annotations)

print("\n- The sequence in this record:")
print(record.seq)
- ID of the record:
KT220438.1

- Brief description of the record:
Influenza A virus (A/NewJersey/NHRC_93219/2015(H3N2)) segment 4 hemagglutinin (HA) gene, complete cds

- Annotations that come with the record (given as a python dictionary):
{'structured_comment': OrderedDict([('Assembly-Data', OrderedDict([('Sequencing Technology', 'Sanger dideoxy sequencing')]))]), 'organism': 'Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))', 'accessions': ['KT220438'], 'topology': 'linear', 'keywords': [''], 'source': 'Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))', 'date': '20-JUL-2015', 'data_file_division': 'VRL', 'taxonomy': ['Viruses', 'ssRNA viruses', 'ssRNA negative-strand viruses', 'Orthomyxoviridae', 'Influenzavirus A'], 'molecule_type': 'cRNA', 'sequence_version': 1, 'references': [Reference(title='GEISS Influenza Surveillance Response Program', ...), Reference(title='Direct Submission', ...)]}

- The sequence in this record:
ATGAAGACTATCATTGCTTTGAGCTACATTCTATGTCTGGTTTTCGCTCAAAAAATTCCTGGAAATGACAATAGCACGGCAACGCTGTGCCTTGGGCACCATGCAGTACCAAACGGAACGATAGTGAAAACAATCACAAATGACCGAATTGAAGTTACTAATGCTACTGAGCTGGTTCAGAATTCCTCAATAGGTGAAATATGCGACAGTCCTCATCAGATCCTTGATGGAGAAAACTGCACACTAATAGATGCTCTATTGGGAGACCCTCAGTGTGATGGCTTTCAAAATAAGAAATGGGACCTTTTTGTTGAACGAAGCAAAGCCTACAGCAACTGCTACCCTTATGATGTGCCGGATTATGCCTCCCTTAGGTCACTAGTTGCCTCATCCGGCACACTGGAGTTTAACAATGAAAGCTTCAATTGGACTGGAGTCACTCAAAACGGAACAAGTTCTGCTTGCATAAGGAGATCTAGTAGTAGTTTCTTTAGTAGATTAAATTGGTTGACCCACTTAAACTACACATACCCAGCATTGAACGTGACTATGCCAAACAATGAACAATTTGACAAATTGTACATTTGGGGGGTTCACCACCCGGGTACGGACAAGGACCAAATCTTCCTGTATGCTCAATCATCAGGAAGAATCACAGTATCTACCAAAAGAAGCCAACAAGCTGTAATCCCAAATATCGGATCTAGACCCAGAATAAGGGATATCCCTAGCAGAATAAGCATCTATTGGACAATAGTAAAACCGGGAGACATACTTTTGATTAACAGCACAGGGAATCTAATTGCTCCTAGGGGTTACTTCAAAATACGAAGTGGGAAAAGCTCAATAATGAGATCAGATGCACCCATTGGCAAATGCAAGTCTGAATGCATCACTCCAAATGGAAGCATTCCCAATGACAAACCATTCCAAAATGTAAACAGGATCACATACGGGGCCTGTCCCAGATATGTTAAGCATAGCACTCTAAAATTGGCAACAGGAATGCGAAATGTACCAGAGAAACAAACTAGAGGCATATTTGGCGCAATAGCGGGTTTCATAGAAAATGGTTGGGAGGGAATGGTGGATGGTTGGTACGGTTTCAGGCATCAAAATTCTGAGGGAAGAGGACAAGCAGCAGATCTCAAAAGCACTCAAGCAGCAATCGATCAAATCAATGGGAAGCTGAATCGATTGATCGGGAAAACCAACGAGAAATTCCATCAGATTGAAAAAGAATTCTCAGAAGTAGAAGGAAGAATTCAGGACCTTGAGAAATATGTTGAGGACACTAAAATAGATCTCTGGTCATACAACGCGGAGCTTCTTGTTGCCCTGGAGAACCAACATACARTTGATCTAACTGACTCAGAAATGAACAAACTGTTTGAAAAAACAAAGAAGCAACTGAGGGAAAATGCTGAGGATATGGGAAATGGTTGTTTCAAAATATACCACAAATGTGACAATGCCTGCATAGGATCAATAAGAAATGGAACTTATGACCACAATGTGTACAGGGATGAAGCATTAAACAACCGGTTCCAGATCAAGGGAGTTGAGCTGAAGTCAGGGTACAAAGATTGGATCCTATGGATTTCCTYTGCCATATCATGTTTTTTGCTTTGTGTTGCTTTGTTGGGGTTCATCATGTGGGCCTGCCAAAAGGGCAACATTAGGTGCAACATTTGCATTTGA

We can also do things like extract the DNA sequence and translate it into a protein sequence:

In [4]:
# extract the sequence from the record
DNA_seq = record.seq

# translate into a protein sequence
protein_seq = DNA_seq.translate()
print(protein_seq)
MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI*

Problems

Problem 1:

(a) Download the sequence record with the ID number FJ966082 and print it out. What kind of a sequence is this?

(b) Print out the comments section of the annotation

(c) Translate the DNA sequence into a protein sequence and print it out

In [5]:
# Problem 1a

# you will need to import the appropriate modules for this to work
from Bio import Entrez, SeqIO

# always give Entrez your email
Entrez.email = "Your.Name.Here@example.org" # replace with your email

# your code goes here
In [6]:
# Problem 1b

# your code goes here
In [7]:
# Problem 1c

# your code goes here

Problem 2:

Print the record you downloaded under Problem 1 in FASTA format. This means that you need to first print a line starting with ">" plus some description of the record. Then you need to print a line containing the sequence in the record.

In [8]:
# your code goes here

If this was easy

Problem 3:

Write a function that takes a sequence record as input and prints it out in FASTA format. Write the function in such a way that it breaks the sequence over multiple lines, such that each line contains at most 60 characters.

In [9]:
# your code goes here

Problem 4:

Write a function that takes a DNA sequence as input, translates the sequence in all three reading frames, and counts the number of stop codons found in each translation. Remember that stop codons are indicated by a "*".

Note that biopython doesn't like to translate sequences whose length is not a multiple of three, so you will have to pad the sequence with trailing Ns to avoid a warning or error.

Hint: The solution to this problem will be simpler if you first write a function that can count the stop codons in one sequence.

In [10]:
# your code goes here