March 29, 2018
Many important pieces of information in genbank files are stored in features. These features can be queried through Biopython by working with the
features list of a genbank record object (
record.features). More information about sequence features is available here. We will usually iterate over all features in the list with a
from Bio import Entrez, SeqIO Entrez.email = "firstname.lastname@example.org" # put your email here # Download sequence record for genbank id KT220438 # This is HA gene of Influenza A virus, strain A/NewJersey/NHRC_93219/2015(H3N2) handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text") record = SeqIO.read(handle, "genbank") handle.close() # Loop over all features in the record to see # what features there are and what information # they contain for feature in record.features: print(feature)
type: source location: [0:1701](+) qualifiers: Key: collection_date, Value: ['17-Jan-2015'] Key: country, Value: ['USA: New Jersey'] Key: db_xref, Value: ['taxon:1682360'] Key: host, Value: ['Homo sapiens'] Key: isolation_source, Value: ['nasopharyngeal swab'] Key: lab_host, Value: ['MDCK'] Key: mol_type, Value: ['viral cRNA'] Key: organism, Value: ['Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))'] Key: segment, Value: ['4'] Key: serotype, Value: ['H3N2'] Key: strain, Value: ['A/NewJersey/NHRC_93219/2015'] type: gene location: [0:1701](+) qualifiers: Key: gene, Value: ['HA'] type: CDS location: [0:1701](+) qualifiers: Key: codon_start, Value: ['1'] Key: function, Value: ['receptor binding and fusion protein'] Key: gene, Value: ['HA'] Key: product, Value: ['hemagglutinin'] Key: protein_id, Value: ['AKQ43545.1'] Key: translation, Value: ['MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI'] type: mat_peptide location: [48:1035](+) qualifiers: Key: gene, Value: ['HA'] Key: product, Value: ['HA1'] type: mat_peptide location: [1035:1698](+) qualifiers: Key: gene, Value: ['HA'] Key: product, Value: ['HA2']
The type of each feature is stored in
feature.type, and its location on the DNA sequence is stored in
feature.location. The additional feature information is stored in a dictionary called
qualifiers. We can query this dictionary to retrieve individual elements of a feature. Note that these elements are all stored as lists, and so we generally have to add
 at the end of our query to retrieve the first element in the list.
for feature in record.features: if feature.type == "CDS": print("Coding sequence found at position:", feature.location) print("Gene product:", feature.qualifiers["product"]) print("Protein ID:", feature.qualifiers["protein_id"]) print("Protein sequence:", feature.qualifiers["translation"])
Coding sequence found at position: [0:1701](+) Gene product: hemagglutinin Protein ID: AKQ43545.1 Protein sequence: MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI
When we are working with larger genbank files (e.g. entire genomes), we generally first download them and store them as a file on our local drive. Then we work with those local files. To practice this workflow, write code that downloads the complete genome of bacteriophage T7 and stores it in a file called
T7.gb. The accession number for that genome is NC_001604. Make sure you can find the file you have created on your harddrive. Also verify that the file contains the same genome you can see on the NCBI website.
# Problem 1 # your code goes here.
Now read in the T7 genome, and for each coding sequence (CDS) in that genome print out the name (
locus_tag) and the product name (
record.featuresgives you a list of all these features.
feature.type. We are only interested in features of type "CDS".
product, etc., are stored in the dictionary
# Problem 2 # your code goes here.
Calculate the mean gene length (in amino acids) for all CDSs in the T7 genome.
Hint: Use the function
round() to round your result to 2 decimal digits.
# Problem 3 # your code goes here.
(a) Download the E. coli K12 genome, available here: http://www.ncbi.nlm.nih.gov/nuccore/CP009685, and store it in a file called
(b) Calculate the fraction of coding sequences that are marked as "hypothetical protein". (This information is provided in the "product" qualifier of a "CDS" feature.)
(c) Calculate the GC content of the entire genome sequence. GC content is the fraction of G and C nucleotides in the sequence.
# Problem 4a # your code goes here.
# Problem 4b # your code goes here.
# Problem 4c # your code goes here.
In the T7 genome, find the CDS feature corresponding to locus tag "T7p45" (major capsid protein), and then use the
feature.extract() function to extract the DNA sequence corresponding to that gene. Then translate that sequence to verify (by inspection) that it corresponds to the protein sequence that is listed for gene T7p45.
# Problem 5 # your code goes here.
For the influenza HA gene with ID "KT220438", write python code that checks whether the protein sequence listed in the CDS feature corresponds to the protein sequence you obtain when you translate the gene sequence. The code should print "The two translations match" or "The two translations do not match" depending on whether there is a match or not.
# Problem 5 # your code goes here.