Problem 1: Write a function that searches the NCBI Nucleotide Database for any given search term. Your function should require the following parameters: an email, a search term, and the max number of results to return. The function should return a list of GenBank accession numbers.
# you will need Entrez and Medline to solve this problem
from Bio import Entrez, SeqIO
def get_accessions(email, search_term, max_return):
Entrez.email = email
handle = Entrez.esearch(db = "nucleotide", # database to search
term = search_term, # search term
retmax = max_return) # maximum number of results to return
record = Entrez.read(handle)
handle.close()
# search returns PubMed IDs (pmids)
accession_list = record["IdList"]
print('total number of results from search term "{}" = {}'.format(search_term, len(accession_list)))
print('accession list =', accession_list)
return(accession_list)
archaea = get_accessions('rachaelcox@utexas.edu', 'Archaea[Organism] AND tRNA[Feature Key] AND "complete genome"', 5)
bacteria = get_accessions('rachaelcox@utexas.edu', 'Bacteria[Organism] AND tRNA[Feature Key] AND "complete genome"', 5)
Problem 2: Write a function that takes a list of GenBank identifiers (obtained from your function above) and fetches the records associated with those IDs from the NCBI Nucleotide database. Your function should require the following parameters: an email, a list of GenBank accessions, and the name of an output file to hold the records. The function should write a file containing all of the data from each GenBank record given to the function.
def write_gb_file(email, accession_list, outfile_name):
# fetch records from the database
Entrez.email = email
handle = Entrez.efetch(db = "nucleotide",
id = accession_list,
rettype = "gbwithparts",
retmode = "text")
data = handle.read()
handle.close() # close the handle
# write data to a file
with open(outfile_name, "w") as outfile:
outfile.write(data)
print("GenBank files successfully written to {}".format(outfile_name))
write_gb_file('rachaelcox@utexas.edu', archaea, 'archaea_genomes.gb')
write_gb_file('rachaelcox@utexas.edu', bacteria, 'bacteria_genomes.gb')
Problem 3: Write a function that takes a GenBank file and computes the tRNA count for each record in that file. The code in your function should find the "tRNA" type
in record.features
and count the number of occurrences for each record. Your function should require one parameter: a file containing one or more GenBank records.
def get_tRNA_counts(gb_file):
with open (gb_file, 'r') as infile:
# use `SeqIO.parse` to read a file with many GenBank records
all_records = SeqIO.parse(infile, "genbank")
count_dict = {} # dictionary to hold the tRNA counts for each GenBank accession
# loop through each record and count the number of tRNAs found
for record in all_records:
i = 0 # counter that will keep track of the number of tRNAs
for feature in record.features:
# match if the feature type is "tRNA"
if feature.type == "tRNA":
i += 1
# add the final count `i` to the dictionary for each record
count_dict[record.id] = i
print("The record specifying the \"{}\" has {} tRNAs.".format(record.description, i))
return(count_dict)
get_tRNA_counts('archaea_genomes.gb')
get_tRNA_counts('bacteria_genomes.gb')