Lab Worksheet 12 Solutions

Problem 1: Write a function that searches the NCBI Nucleotide Database for any given search term. Your function should require the following parameters: an email, a search term, and the max number of results to return. The function should return a list of GenBank accession numbers.

In [1]:
# you will need Entrez and Medline to solve this problem
from Bio import Entrez, SeqIO

def get_accessions(email, search_term, max_return):

    Entrez.email = email
    handle = Entrez.esearch(db = "nucleotide",  # database to search
                        term = search_term,  # search term
                        retmax = max_return) # maximum number of results to return
    record = Entrez.read(handle)
    handle.close()

    # search returns PubMed IDs (pmids)
    accession_list = record["IdList"]
    print('total number of results from search term "{}" = {}'.format(search_term, len(accession_list)))
    print('accession list =', accession_list)
    
    return(accession_list)
In [2]:
archaea = get_accessions('rachaelcox@utexas.edu', 'Archaea[Organism] AND tRNA[Feature Key] AND "complete genome"', 5)
total number of results from search term "Archaea[Organism] AND tRNA[Feature Key] AND "complete genome"" = 5
accession list = ['1147362626', '1125709198', '313124788', '14518450', '1832147357']
In [3]:
bacteria = get_accessions('rachaelcox@utexas.edu', 'Bacteria[Organism] AND tRNA[Feature Key] AND "complete genome"', 5)
total number of results from search term "Bacteria[Organism] AND tRNA[Feature Key] AND "complete genome"" = 5
accession list = ['1832554406', '1832553669', '1832553647', '1832478644', '1832159315']

Problem 2: Write a function that takes a list of GenBank identifiers (obtained from your function above) and fetches the records associated with those IDs from the NCBI Nucleotide database. Your function should require the following parameters: an email, a list of GenBank accessions, and the name of an output file to hold the records. The function should write a file containing all of the data from each GenBank record given to the function.

In [4]:
def write_gb_file(email, accession_list, outfile_name):
    
    # fetch records from the database
    Entrez.email = email
    handle = Entrez.efetch(db = "nucleotide", 
                           id = accession_list, 
                           rettype = "gbwithparts", 
                           retmode = "text")
    data = handle.read()
    handle.close() # close the handle

    # write data to a file
    with open(outfile_name, "w") as outfile:
        outfile.write(data)
    
    print("GenBank files successfully written to {}".format(outfile_name))
In [5]:
write_gb_file('rachaelcox@utexas.edu', archaea, 'archaea_genomes.gb')
GenBank files successfully written to archaea_genomes.gb
In [6]:
write_gb_file('rachaelcox@utexas.edu', bacteria, 'bacteria_genomes.gb')
GenBank files successfully written to bacteria_genomes.gb

Problem 3: Write a function that takes a GenBank file and computes the tRNA count for each record in that file. The code in your function should find the "tRNA" type in record.features and count the number of occurrences for each record. Your function should require one parameter: a file containing one or more GenBank records.

In [7]:
def get_tRNA_counts(gb_file):
    
    with open (gb_file, 'r') as infile:
        
        # use `SeqIO.parse` to read a file with many GenBank records
        all_records = SeqIO.parse(infile, "genbank")
        count_dict = {} # dictionary to hold the tRNA counts for each GenBank accession
        
        # loop through each record and count the number of tRNAs found
        for record in all_records:

            i = 0 # counter that will keep track of the number of tRNAs
            for feature in record.features:
                
                # match if the feature type is "tRNA"
                if feature.type == "tRNA":
                    i += 1
            
            # add the final count `i` to the dictionary for each record
            count_dict[record.id] = i

            print("The record specifying the \"{}\" has {} tRNAs.".format(record.description, i))
    
    return(count_dict)
In [8]:
get_tRNA_counts('archaea_genomes.gb')
The record specifying the "Haloterrigena daqingensis strain JX313 chromosome, complete genome" has 46 tRNAs.
The record specifying the "Ignicoccus islandicus DSM 13165 strain DSM 13166 chromosome, complete genome" has 45 tRNAs.
The record specifying the "Halogeometricum borinquense DSM 11551, complete sequence" has 52 tRNAs.
The record specifying the "Pyrococcus abyssi GE5, complete genome" has 46 tRNAs.
The record specifying the "Acidianus brierleyi strain DSM 1651 chromosome, complete genome" has 46 tRNAs.
Out[8]:
{'NZ_CP019327.1': 46,
 'NZ_CP006867.1': 45,
 'NC_014729.1': 52,
 'NC_000868.1': 46,
 'NZ_CP029289.2': 46}
In [9]:
get_tRNA_counts('bacteria_genomes.gb')
The record specifying the "Limnospira fusiformis SAG 85.79 chromosome, complete genome" has 41 tRNAs.
The record specifying the "Pectobacterium punjabense strain SS95 chromosome, complete genome" has 76 tRNAs.
The record specifying the "Halomonas sp. NBT06E8 chromosome, complete genome" has 60 tRNAs.
The record specifying the "[Clostridium] innocuum strain ATCC 14501 chromosome, complete genome" has 48 tRNAs.
The record specifying the "Halomonas sp. NBT06E8 chromosome, complete genome" has 60 tRNAs.
Out[9]:
{'CP051185.1': 41,
 'NZ_CP038498.1': 76,
 'NZ_CP048602.1': 60,
 'NZ_CP048838.1': 48,
 'CP048602.1': 60}