April 23, 2020
In this class, we will continue discuss a few more real-world scenarios of how we can use regular expressions to analyze data. We will also encounter some other useful methods and techniques, such as reading a file directly from the internet.
To read a file from the internet (i.e., from a URL), we can use the urlopen()
function that is available in the package urllib.request
. One subtle difficulty with this function is that it returns encoded strings (also called "byte objects") rather than raw strings. For example, the following code downloads the poem "The Road Not Taken" from the class website and prints it out. You can see how the output is encapsulated in quotes with a 'b' in front (for "byte object"), and also that the newline character is explicitly written out as '\n'.
from urllib.request import urlopen
file_URL = "http://wilkelab.org/classes/SDS348/data_sets/road_not_taken.txt"
with urlopen(file_URL) as infile:
for line_encoded in infile:
print (line_encoded)
To convert the encoded byte objects into raw strings that you can work with, you can call the function decode()
on the byte object:
with urlopen(file_URL) as infile:
for line_encoded in infile:
line = line_encoded.decode()
print(line, end='') # need to set end='' because each line comes with a newline character already
Problem 1:
Write a function that can tell whether a string is DNA, RNA, or neither. Strings that contain some DNA/RNA and some other stuff, such as "This is a gene: AGTACCGTAG", should be flagged as neither.
import re
def what_is_it(string):
if re.search(r"^[AGCTagct]+$", string):
print("DNA:", string)
elif re.search(r"^[AGCUagcu]+$", string):
print("RNA:", string)
else:
print("neither:", string)
what_is_it("AGCTCGAGCTA") # DNA
what_is_it("AGCUCGAGCUA") # RNA
what_is_it("AGCTCGAGCUA") # neither, uses Ts and Us at the same time
what_is_it("This is a gene: AGTACCGTAG") # neither, contains other characters
what_is_it("ucgcuucgacacgu") # RNA
what_is_it("atgtctacact") # DNA
Problem 2:
Write code that counts the number of lines in Robert Frost's "The Road Not Taken" that contain the article "the". Also keep track of the words appearing immediately after the "the", by recording them in a list.
Hint 1: Each line contains the article "the" either 0 or 1 times. You don't have to worry about multiple occurrences of "the" in the same line. However, make sure you don't count "them", "there", or similar words that contain "the" as a substring.
Hint 2: Make sure that you capture both lower-case, upper-case, and all-caps spellings of "the". See if you can do this with a single regular expression.
# We will use urlopen to read the file directly from the class website
from urllib.request import urlopen
file_URL = "http://wilkelab.org/classes/SDS348/data_sets/road_not_taken.txt"
count = 0
wordlist = []
with urlopen(file_URL) as infile:
for line_encoded in infile:
# urllib returns encoded strings (byte objects), and
# we need to use the `decode()` function to turn them
# into strings we can work with
line = line_encoded.decode()
match = re.search(r"\s+[tT][hH][eE]\s+(\w+)", line)
if match:
count += 1
wordlist.append(match.group(1))
print('Number of lines containing the article "the":', count)
print('Words appearing after "the":', wordlist)
Problem 3:
For this problem, we will parse the strain names of influenza A virus. First, we will perform an Entrez search to retrieve influenza A viruses from 2014 for which we have a complete hemagglutinin coding sequence. (We limit the search to 50 results to keep things manageable.)
from Bio import Entrez, SeqIO
Entrez.email = 'wilke@austin.utexas.edu'
# let's do a search for influenza H1N1 viruses from 2014
handle = Entrez.esearch(
db="nucleotide", # database to search
term="influenza a virus 2014 h1n1 hemagglutinin complete cds", # search term
retmax=50 # limit the search to the first 50 results
)
record = Entrez.read(handle)
handle.close()
# download the results and turn into a list of records
gi_list = record["IdList"] # list of genbank identifiers found
handle = Entrez.efetch(db="nucleotide", id=gi_list, rettype="gb", retmode="text")
records = list(SeqIO.parse(handle, "genbank")) # the list(...) statement ensures we will still have
# access to the results after handle.close()
handle.close()
for record in records:
print(record.description)
Now write code that goes over all records in the list and counts (1) how many human strains there are and (2) how many strains were collected in Texas. To do this, you need to know how influenza strains are named. The naming scheme is "A/host (if not human)/region of origin/lineage number/year/(antigen type)" (see also here). The tricky part is that the host field is optional. For example, A/swine/Mexico/10466772/2014(H1N1) is a 2014 H1N1 swine strain from Mexico, but A/Texas/6180/2017(H1N1) is a 2017 H1N1 human strain from Texas. Note that even though we searched for 2014, we are retrieving some strains for 2017.
human_count = 0
TX_count = 0
for record in records:
match = re.search(r"^Influenza A virus \(A/(\w*/)*(\w.*)/[-\w]+/20\d\d\s*\(H1N1\)\)", record.description)
if match: # should always be true
if match.group(1) == None: # for human strains, match group 1 will be empty
human_count += 1
if match.group(2) == 'Texas':
TX_count += 1
else:
print("Cannot match:", record.description)
print("Total number of human strains:", human_count)
print("Total number of strains from Texas:", TX_count)