Apr 9, 2020
Many important pieces of information in genbank files are stored in features. These features can be queried through Biopython by working with the features
list of a genbank record object (record.features
). More information about sequence features is available here. We will usually iterate over all features in the list with a for
loop:
from Bio import Entrez, SeqIO
Entrez.email = "wilke@austin.utexas.edu" # put your email here
# Download sequence record for genbank id KT220438
# This is HA gene of Influenza A virus, strain A/NewJersey/NHRC_93219/2015(H3N2)
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
# Loop over all features in the record to see
# what features there are and what information
# they contain
for feature in record.features:
print(feature)
The type of each feature is stored in feature.type
, and its location on the DNA sequence is stored in feature.location
. The additional feature information is stored in a dictionary called qualifiers
. We can query this dictionary to retrieve individual elements of a feature. Note that these elements are all stored as lists, and so we generally have to add [0]
at the end of our query to retrieve the first element in the list.
for feature in record.features:
if feature.type == "CDS":
print("Coding sequence found at position:", feature.location)
print("Gene product:", feature.qualifiers["product"][0])
print("Protein ID:", feature.qualifiers["protein_id"][0])
print("Protein sequence:", feature.qualifiers["translation"][0])
Problem 1:
When we are working with larger genbank files (e.g. entire genomes), we generally first download them and store them as a file on our local drive. Then we work with those local files. To practice this workflow, write code that downloads the complete genome of bacteriophage T7 and stores it in a file called T7.gb
. The accession number for that genome is NC_001604. Make sure you can find the file you have created on your harddrive. Also verify that the file contains the same genome you can see on the NCBI website.
# Problem 1
# Download T7 genome:
download_handle = Entrez.efetch(db="nucleotide", id="NC_001604", rettype="gb", retmode="text")
data = download_handle.read()
download_handle.close()
# Store data into file "T7.gb":
out_handle = open("T7.gb", "w")
out_handle.write(data)
out_handle.close()
Problem 2:
Now read in the T7 genome, and for each coding sequence (CDS) in that genome print out the name (locus_tag
) and the product name (product
).
Hints:
SeqIO.read()
.record.features
gives you a list of all these features.feature.type
. We are only interested in features of type "CDS".locus_tag
, product
, etc., are stored in the dictionary feature.qualifiers
.# Problem 2
in_handle = open("T7.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
for feature in record.features:
if feature.type == 'CDS':
# all the qualifiers are returned as lists, so we have to add `[0]` at the end to just get
# the first element of each list
locus_tag = feature.qualifiers['locus_tag'][0]
product = feature.qualifiers['product'][0]
print(locus_tag + ": " + product)
Problem 3:
Calculate the mean gene length (in amino acids) for all CDSs in the T7 genome.
Hint: Use the function round()
to round your result to 2 decimal digits.
in_handle = open("T7.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
count = 0 # counts the number of genes in the genome
length_sum = 0. # holds the total sum of all lengths of all genes
for feature in record.features:
if feature.type == 'CDS':
seq = feature.qualifiers['translation'][0]
length_sum += len(seq)
count += 1
print("mean length:", round(length_sum/count, 2), "amino acids")
Problem 4:
(a) Download the E. coli K12 genome, available here: http://www.ncbi.nlm.nih.gov/nuccore/CP009685, and store it in a file called Ecoli_K12.gb
.
(b) Calculate the fraction of coding sequences that are marked as "hypothetical protein". (This information is provided in the "product" qualifier of a "CDS" feature.)
(c) Calculate the GC content of the entire genome sequence. GC content is the fraction of G and C nucleotides in the sequence.
# Problem 4a
# Download E. coli K12 genome:
download_handle = Entrez.efetch(db="nucleotide", id="CP009685", rettype="gb", retmode="text")
data = download_handle.read()
download_handle.close()
# Store data into file "Ecoli_K12.gb":
out_handle = open("Ecoli_K12.gb", "w")
out_handle.write(data)
out_handle.close()
# Problem 4b
# Read in E. coli K12 genome:
in_handle = open("Ecoli_K12.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
# count all CDSs and the hypothetical ones
CDS_count = 0
hyp_CDS_count = 0
for feature in record.features:
if feature.type == 'CDS':
CDS_count += 1
if "product" in feature.qualifiers:
if feature.qualifiers["product"][0] == "hypothetical protein":
hyp_CDS_count += 1
print("Total coding sequences in the E. coli K12 genome:", CDS_count)
print("Number of hypothetical proteins in the E. coli K12 genome:", hyp_CDS_count)
print("Fraction:", round(100*hyp_CDS_count/CDS_count, 2), "%")
# Problem 4c
# Read in E. coli K12 genome:
in_handle = open("Ecoli_K12.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
# Count all the G and C nucleotides
GC_count = 0
for nucleotide in record.seq:
if nucleotide=='G' or nucleotide=='C':
GC_count += 1
# Calculate GC percentage
GC_perc = round(100*GC_count/len(record.seq), 2)
print("GC content: " + str(GC_perc) + "%")
Problem 5:
In the T7 genome, find the CDS feature corresponding to locus tag "T7p45" (major capsid protein), and then use the feature.extract()
function to extract the DNA sequence corresponding to that gene. Then translate that sequence to verify (by inspection) that it corresponds to the protein sequence that is listed for gene T7p45.
in_handle = open("T7.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
for feature in record.features:
if feature.type == 'CDS':
locus_tag = feature.qualifiers['locus_tag'][0]
if locus_tag == 'T7p45':
target_feature = feature
break
print(target_feature)
DNA_seq = target_feature.extract(record).seq
print("DNA:", str(DNA_seq))
print("protein:", DNA_seq.translate())
Problem 6:
For the influenza HA gene with ID "KT220438", write python code that checks whether the protein sequence listed in the CDS feature corresponds to the protein sequence you obtain when you translate the gene sequence. The code should print "The two translations match" or "The two translations do not match" depending on whether there is a match or not.
# download the record again
handle = Entrez.efetch(db="nucleotide", id="KT220438", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
# find the CDS feature
for feature in record.features:
if feature.type == "CDS":
CDS_feature = feature
# Extract the amino-acid sequence from the CDS feature. We must not forget the [0]
# at the end to extract the first element of the list
CDS_feature_seq = CDS_feature.qualifiers["translation"][0]
# Extract the translated sequence from the DNA sequence in the record.
# We need to remove the trailing '*' indicating a stop codon,
# and we also need to convert the whole thing into a string.
translated_seq = str(record.seq.translate()[:-1])
if CDS_feature_seq == translated_seq:
print("The two translations match.")
else:
print("The two translations do not match.")