April 21, 2020
In this class, we will discuss a few more real-world scenarios of how we can use regular expressions to analyze data. We will work with the E. coli genome. As usual, we first download it:
from Bio import Entrez
Entrez.email = "wilke@austin.utexas.edu" # put your email here
# Download E. coli K12 genome:
download_handle = Entrez.efetch(db="nucleotide", id="CP009685", rettype="gb", retmode="text")
data = download_handle.read()
download_handle.close()
# Store data into file "Ecoli_K12.gb":
out_handle = open("Ecoli_K12.gb", "w")
out_handle.write(data)
out_handle.close()
Let's assume that we want to find E. coli genes that are enzymes. Enzymes can be identified because their name ends in "ase". The gene name is stored in the "product" feature qualifier of CDS features.
We will write code that loops over all CDS features in the genome, find the protein-coding sequences (CDSs), and analyze their product feature. To analyze the name of the product, we will use the following regular expression: r"ase($|\s)"
. Remember that the vertical line |
indicates logical or. So this regular expression searches for two alternative patterns. The first pattern, r"ase$"
looks for strings that end in ase
. The second pattern, r"ase\s"
looks for strings that contain a word ending in ase
. (Word ends are indicated by subsequent whitespace, which is matched by \s
.)
Note that we will limit our search to the first 100 protein-coding sequences only, to make the code run more quickly.
import re
from Bio import SeqIO
# read in the E. coli genome from local storage
in_handle = open("Ecoli_K12.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
max_i = 100 # number of protein-coding sequences we will analyze
i = 0 # counter that will keep track of the number of CDSs found
enzyme_count = 0 # number of enzymes found
for feature in record.features:
if feature.type == 'CDS':
i += 1
# we can only proceed if the CDS has a 'product' qualifier
if "product" in feature.qualifiers:
product = feature.qualifiers["product"][0]
# the heart of the matter. does the product string end in 'ase'
# or contain a word that ends in 'ase'?
match = re.search(r"ase($|\s)", product)
if match:
# yes, we found something that looks like an enzyme
print(product)
enzyme_count += 1
# stop after max_i CDSs have been processed
if i >= max_i:
break
print("\nTotal number of probable enzymes found:", enzyme_count)
Problem 1:
Find out if there are any products that contain the letters "ase" in the middle of a word. For example, the word "based" contains these letters but does not end in them.
Hint: Set max_i=5000
to search the entire genome.
# your code goes here
Problem 2:
Find products whose description starts with the letters "RNA". Again search the entire genome.
# your code goes here
Problem 3:
Transcriptional regulators can belong to different families. These families are generally listed in the product
field, e.g. "LysR family transcriptional regulator" or "AraC family transcriptional regulator". Write a program that extracts the family name for each transcriptional regulator and then counts how many regulators for each family are found.
# your code goes here
Problem 4:
Write a function that takes a string holding a full name as input and that prints the first name as output. The function should be able to handle the following cases:
In all cases, the output should be "first". Assume that initials are given as one letter and a period.
Hint: First separate the last name from first + initial, and then extract the first name from first + initial.
def extract_first_name(name):
# your code goes here
pass # delete this
extract_first_name("John Smith")
extract_first_name("Miller, Jack")
extract_first_name("Susie R. Benner")
extract_first_name("Smith, April B.")
extract_first_name("Miller, R. Ben")
extract_first_name("A. Jane Doe")
extract_first_name("abcde") # not a valid name, creates an error