In this class, we will discuss a few more real-world scenarios of how we can use regular expressions to analyze data. We will work with the E. coli genome. As usual, we first download it:
from Bio import Entrez Entrez.email = "firstname.lastname@example.org" # put your email here # Download E. coli K12 genome: download_handle = Entrez.efetch(db="nucleotide", id="CP009685", rettype="gb", retmode="text") data = download_handle.read() download_handle.close() # Store data into file "Ecoli_K12.gb": out_handle = open("Ecoli_K12.gb", "w") out_handle.write(data) out_handle.close()
Let's assume that we want to find E. coli genes that are enzymes. Enzymes can be identified because their name ends in "ase". The gene name is stored in the "product" feature qualifier of CDS features.
We will write code that loops over all CDS features in the genome, find the protein-coding sequences (CDSs), and analyze their product feature. To analyze the name of the product, we will use the following regular expression:
r"ase($|\s)". Remember that the vertical line
| indicates logical or. So this regular expression searches for two alternative patterns. The first pattern,
r"ase$" looks for strings that end in
ase. The second pattern,
r"ase\s" looks for strings that contain a word ending in
ase. (Word ends are indicated by subsequent whitespace, which is matched by
Note that we will limit our search to the first 100 protein-coding sequences only, to make the code run more quickly.
import re from Bio import SeqIO # read in the E. coli genome from local storage in_handle = open("Ecoli_K12.gb", "r") record = SeqIO.read(in_handle, "genbank") in_handle.close() max_i = 100 # number of protein-coding sequences we will analyze i = 0 # counter that will keep track of the number of CDSs found enzyme_count = 0 # number of enzymes found for feature in record.features: if feature.type == 'CDS': i += 1 # we can only proceed if the CDS has a 'product' qualifier if "product" in feature.qualifiers: product = feature.qualifiers["product"] # the heart of the matter. does the product string end in 'ase' # or contain a word that ends in 'ase'? match = re.search(r"ase($|\s)", product) if match: # yes, we found something that looks like an enzyme print(product) enzyme_count += 1 # stop after max_i CDSs have been processed if i >= max_i: break print("\nTotal number of probable enzymes found:", enzyme_count)
cellulose synthase cellulose synthase endo-1,4-D-glucanase cellulose synthase ketodeoxygluconokinase ketodeoxygluconokinase c-di-GMP phosphodiesterase trehalase cytochrome C peroxidase glutamate decarboxylase transposase arsenate reductase glutathione reductase ribosomal RNA large subunit methyltransferase J oligopeptidase A methyltransferase peptide ABC transporter permease nickel transporter permease NikC nickel transporter permease NikB ACP synthase permease zinc ABC transporter ATPase 16S rRNA methyltransferase RNA polymerase factor sigma-32 branched-chain amino acid transporter permease subunit LivH leucine/isoleucine/valine transporter permease subunit glycerol-3-phosphate transporter permease glycerophosphodiester phosphodiesterase gamma-glutamyltranspeptidase transposase transposase Total number of probable enzymes found: 31
Find out if there are any products that contain the letters "ase" in the middle of a word. For example, the word "based" contains these letters but does not end in them.
max_i=5000 to search the entire genome.
# your code goes here
Find products whose description starts with the letters "RNA". Again search the entire genome.
# your code goes here
Transcriptional regulators can belong to different families. These families are generally listed in the
product field, e.g. "LysR family transcriptional regulator" or "AraC family transcriptional regulator". Write a program that extracts the family name for each transcriptional regulator and then counts how many regulators for each family are found.
# your code goes here
Write a function that takes a string holding a full name as input and that prints the first name as output. The function should be able to handle the following cases:
In all cases, the output should be "first". Assume that initials are given as one letter and a period.
Hint: First separate the last name from first + initial, and then extract the first name from first + initial.
def extract_first_name(name): # your code goes here pass # delete this extract_first_name("John Smith") extract_first_name("Miller, Jack") extract_first_name("Susie R. Benner") extract_first_name("Smith, April B.") extract_first_name("Miller, R. Ben") extract_first_name("A. Jane Doe") extract_first_name("abcde") # not a valid name, creates an error