Class 20: Working with gene features and genomes

Apr 9, 2020

Features in genbank files

Many important pieces of information in genbank files are stored in features. These features can be queried through Biopython by working with the features list of a genbank record object (record.features). More information about sequence features is available here. We will usually iterate over all features in the list with a for loop:

In [1]:
from Bio import Entrez, SeqIO
Entrez.email = "wilke@austin.utexas.edu" # put your email here

# Download sequence record for genbank id KT220438
# This is HA gene of Influenza A virus, strain A/NewJersey/NHRC_93219/2015(H3N2)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# Loop over all features in the record to see
# what features there are and what information
# they contain
for feature in record.features:
    print(feature)
type: source
location: [0:1701](+)
qualifiers:
    Key: collection_date, Value: ['17-Jan-2015']
    Key: country, Value: ['USA: New Jersey']
    Key: db_xref, Value: ['taxon:1682360']
    Key: host, Value: ['Homo sapiens']
    Key: isolation_source, Value: ['nasopharyngeal swab']
    Key: lab_host, Value: ['MDCK']
    Key: mol_type, Value: ['viral cRNA']
    Key: organism, Value: ['Influenza A virus (A/New Jersey/NHRC_93219/2015(H3N2))']
    Key: segment, Value: ['4']
    Key: serotype, Value: ['H3N2']
    Key: strain, Value: ['A/NewJersey/NHRC_93219/2015']

type: gene
location: [0:1701](+)
qualifiers:
    Key: gene, Value: ['HA']

type: CDS
location: [0:1701](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: function, Value: ['receptor binding and fusion protein']
    Key: gene, Value: ['HA']
    Key: product, Value: ['hemagglutinin']
    Key: protein_id, Value: ['AKQ43545.1']
    Key: translation, Value: ['MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI']

type: mat_peptide
location: [48:1035](+)
qualifiers:
    Key: gene, Value: ['HA']
    Key: product, Value: ['HA1']

type: mat_peptide
location: [1035:1698](+)
qualifiers:
    Key: gene, Value: ['HA']
    Key: product, Value: ['HA2']

The type of each feature is stored in feature.type, and its location on the DNA sequence is stored in feature.location. The additional feature information is stored in a dictionary called qualifiers. We can query this dictionary to retrieve individual elements of a feature. Note that these elements are all stored as lists, and so we generally have to add [0] at the end of our query to retrieve the first element in the list.

In [2]:
for feature in record.features:
    if feature.type == "CDS":
        print("Coding sequence found at position:", feature.location)
        print("Gene product:", feature.qualifiers["product"][0])
        print("Protein ID:", feature.qualifiers["protein_id"][0])
        print("Protein sequence:", feature.qualifiers["translation"][0])
Coding sequence found at position: [0:1701](+)
Gene product: hemagglutinin
Protein ID: AKQ43545.1
Protein sequence: MKTIIALSYILCLVFAQKIPGNDNSTATLCLGHHAVPNGTIVKTITNDRIEVTNATELVQNSSIGEICDSPHQILDGENCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSSSSFFSRLNWLTHLNYTYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQSSGRITVSTKRSQQAVIPNIGSRPRIRDIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCKSECITPNGSIPNDKPFQNVNRITYGACPRYVKHSTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTXDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHNVYRDEALNNRFQIKGVELKSGYKDWILWISXAISCFLLCVALLGFIMWACQKGNIRCNICI

Problems

Problem 1:

When we are working with larger genbank files (e.g. entire genomes), we generally first download them and store them as a file on our local drive. Then we work with those local files. To practice this workflow, write code that downloads the complete genome of bacteriophage T7 and stores it in a file called T7.gb. The accession number for that genome is NC_001604. Make sure you can find the file you have created on your harddrive. Also verify that the file contains the same genome you can see on the NCBI website.

In [3]:
# Problem 1

# your code goes here.

Problem 2:

Now read in the T7 genome, and for each coding sequence (CDS) in that genome print out the name (locus_tag) and the product name (product).

Hints:

  • First you need to parse the genbank file into a record, using SeqIO.read().
  • All gene annotations are stored as features, and record.features gives you a list of all these features.
  • Each feature has a type, accessible through feature.type. We are only interested in features of type "CDS".
  • The actual annotations we are interested in, such as locus_tag, product, etc., are stored in the dictionary feature.qualifiers.
In [4]:
# Problem 2

# your code goes here.

Problem 3:

Calculate the mean gene length (in amino acids) for all CDSs in the T7 genome.

Hint: Use the function round() to round your result to 2 decimal digits.

In [5]:
# Problem 3

# your code goes here.

Problem 4:

(a) Download the E. coli K12 genome, available here: http://www.ncbi.nlm.nih.gov/nuccore/CP009685, and store it in a file called Ecoli_K12.gb.

(b) Calculate the fraction of coding sequences that are marked as "hypothetical protein". (This information is provided in the "product" qualifier of a "CDS" feature.)

(c) Calculate the GC content of the entire genome sequence. GC content is the fraction of G and C nucleotides in the sequence.

In [6]:
# Problem 4a

# your code goes here.
In [7]:
# Problem 4b

# your code goes here.
In [8]:
# Problem 4c

# your code goes here.

If this was easy

Problem 5:

In the T7 genome, find the CDS feature corresponding to locus tag "T7p45" (major capsid protein), and then use the feature.extract() function to extract the DNA sequence corresponding to that gene. Then translate that sequence to verify (by inspection) that it corresponds to the protein sequence that is listed for gene T7p45.

In [9]:
# Problem 5

# your code goes here.

Problem 6:

For the influenza HA gene with ID "KT220438", write python code that checks whether the protein sequence listed in the CDS feature corresponds to the protein sequence you obtain when you translate the gene sequence. The code should print "The two translations match" or "The two translations do not match" depending on whether there is a match or not.

In [10]:
# Problem 5

# your code goes here.