Enter your name and EID here
This homework is due on Apr. 27, 2020 at 12:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."
Problem 1 (5 pts): Often in bioinformatics, we need to format unique gene and/or protein identifiers. For instance, FASTA files downloaded from the UniProt database will have sequence identifiers that look like >sp|Q8WZ42|TITIN_HUMAN. For cross-referencing purposes (i.e., the way this ID is stored in other databases), we just need the Q8WZ42 part. Write code that extracts this group between the | characters. Use this code to extract the UniProt IDs from both strings given below.
Hint: Remember, the | symbol is normally used to say "this or this" or this|this. To match | in a string, as opposed to using it as a Boolean operator, you will need to escape the character with a backslash like so: \|.
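To see what the backslash changes, here is a minimal sketch. The variable names are illustrative and not part of the assignment.

```python
import re

header = ">sp|Q8WZ42|TITIN_HUMAN"

# Unescaped: "sp|Q8" means "sp" OR "Q8", so the first hit is just "sp"
unescaped = re.search(r"sp|Q8", header).group(0)

# Escaped: "sp\|Q8" means the literal five-character text "sp|Q8"
escaped = re.search(r"sp\|Q8", header).group(0)

print(unescaped)
print(escaped)
```

The first pattern stops at whichever alternative matches first, while the second only matches when a real | character sits between the two pieces.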
# You will need re to solve this problem
import re
titin_human = ">sp|Q8WZ42|TITIN_HUMAN"
lysozyme_frog = ">tr|A0A060A0J8|A0A060A0J8_XENLA"
def extract_id(input_string):
    pattern = r'^>(tr|sp)\|(.*)\|\S*'
    match = re.search(pattern, input_string)
    if match:
        print(match.group(2))
extract_id(titin_human)
extract_id(lysozyme_frog)
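One possible variation on the solution above, offered as a sketch rather than the expected answer: the greedy (.*) works here because each header contains exactly two | characters, but a negated character class such as [^|]+ stops at the first closing | regardless. The function name get_uniprot_id is hypothetical, and it returns the ID instead of printing it so the result can be reused.

```python
import re

def get_uniprot_id(header):
    """Return the text between the first two '|' characters, or None."""
    # (?:sp|tr) groups the database prefix without capturing it;
    # [^|]+ matches one or more characters that are not '|'
    match = re.search(r"^>(?:sp|tr)\|([^|]+)\|", header)
    return match.group(1) if match else None

headers = [">sp|Q8WZ42|TITIN_HUMAN", ">tr|A0A060A0J8|A0A060A0J8_XENLA"]
ids = [get_uniprot_id(h) for h in headers]
print(ids)
```

Returning None for a header that does not match lets the caller decide how to handle malformed input.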
Problem 2 (5 pts): We will work with the Microcystis aeruginosa genome. This cyanobacterium is partially responsible (along with Anabaena) for the toxic "blue-green algal" blooms affecting bodies of water in Central Texas in the latter half of 2019. First, we download the genome and save it locally (note that this code may take a minute or two to run):
from Bio import Entrez
Entrez.email = "rachaelcox@utexas.edu" # put your email here
# download Microcystis aeruginosa genome & save it locally:
with open("Maeruginosa.gb", "w") as outfile:
    handle = Entrez.efetch(db="nucleotide", id="NC_010296", rettype="gbwithparts", retmode="text")
    data = handle.read()
    outfile.write(data)
    handle.close()
Write code that loops over all features in the M. aeruginosa genome, and counts the number of tRNAs and rRNAs that are contained within it. Use regular expressions to find the answer.
# you will need re and SeqIO to solve this problem
import re
from Bio import SeqIO
# read in the M. aeruginosa genome from local storage
in_handle = open("Maeruginosa.gb", "r")
record = SeqIO.read(in_handle, "genbank")
in_handle.close()
i = 0 # counter that will keep track of the number of tRNAs and rRNAs found
for feature in record.features:
    # match if the feature type starts with "t" or "r" and ends with "RNA"
    match = re.search(r"^(t|r)RNA$", feature.type)
    if match:
        # yes, we found a match
        i += 1
print(i)
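If separate tallies for tRNAs and rRNAs are wanted instead of one combined total, the same regular expression can feed a Counter. The sketch below is an assumption-laden illustration: it uses a small made-up list of stand-in objects in place of record.features, so it runs without the downloaded genome, and the counts it prints say nothing about M. aeruginosa.

```python
import re
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for record.features; only the .type attribute is used
example_types = ["gene", "CDS", "tRNA", "gene", "rRNA", "tRNA", "ncRNA"]
features = [SimpleNamespace(type=t) for t in example_types]

counts = Counter()
for feature in features:
    # the anchors ^ and $ keep types such as "ncRNA" from matching
    match = re.search(r"^[tr]RNA$", feature.type)
    if match:
        counts[match.group(0)] += 1

print(counts["tRNA"], counts["rRNA"])
```

With the real genome, replacing features with record.features should give the per-type breakdown, and the two counts should sum to the value of i printed above.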