Enter your name and EID here
This homework is due on April 13, 2020 at 12:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."
Problem 1 (3 pts): The interleukin 6 gene (IL6) encodes a cytokine that mediates a variety of immune response pathways in humans. Download the file "IL6_gene_human_lower.txt" to your computer, upload the file to your Jupyter session, then read the sequence in line by line using open() and readlines(). Print out the sequence of the gene such that all nucleotides have been converted to uppercase and white space has been removed.
# open the file and read in its contents as a list of lines
with open("IL6_gene_human_lower.txt", "r") as handle:
    il6_lines = handle.readlines()

# strip whitespace from each line, convert it to uppercase, and join the lines
new_seq = ''
for line in il6_lines:
    new_seq += line.strip().upper()
print(new_seq)
Problem 2 (4 points): In bioinformatics, k-mers are all the contiguous subsequences of length k contained in a read obtained through DNA sequencing. For example, if the DNA sequencing read is "ATCATCATG", then the 3-mers in that read are "ATC" (which occurs twice), "TCA" (which occurs twice), "CAT" (which occurs twice), and "ATG" (which occurs once). You can read more about k-mers on Wikipedia.
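As a minimal sketch of the definition above, the 3-mers of the example read can be listed by sliding a window of width 3 along the string, one position at a time. The names read, k, and kmers are illustrative choices and not part of the assignment.

```python
# Example read from the problem statement
read = "ATCATCATG"
k = 3

# A read of length n contains n - k + 1 overlapping windows of width k
kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
print(kmers)
# ['ATC', 'TCA', 'CAT', 'ATC', 'TCA', 'CAT', 'ATG']
```

The read has 9 nucleotides, so there are 9 - 3 + 1 = 7 windows, which matches the counts given in the example (2 + 2 + 2 + 1).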
a) Write a function that takes a string of nucleotides as input and returns a dictionary with all 3-mers present in that string, and the number of times that each 3-mer occurs. Then, validate your function by finding the 3-mers in the DNA sequence test_seq defined below.
The output of your function should be a dictionary that is structured like this (although it will have several more entries):
{"ATC": 2, "TCA": 2, "CAT": 2, "ATG": 1}
where each key is a 3-mer itself (e.g., "ATC") and each value is the number of times that 3-mer occurs. Visually inspect the output of your function to ensure it is counting the 3-mers in the test sequence correctly. *HINT: You will need to use range() and len() to loop through 3-mer slices of a sequence.*
# test case; verify your code works by finding all 3-mers in this sequence
test_seq = "ATCATGCGCATG"

def find_3mer(seq):
    # create an empty dictionary to hold 3-mers
    out_dict = {}
    # loop over every start position that leaves room for a full 3-mer
    # (i.e., every position except the last 2)
    for i in range(len(seq) - 2):
        kmer = seq[i:i+3]
        # check if 3-mer is already in the output dictionary
        if kmer in out_dict:
            out_dict[kmer] += 1
        else:
            out_dict[kmer] = 1
    return out_dict

print(find_3mer(test_seq))
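One way to cross-check the visual inspection is to count the same windows with collections.Counter from the standard library. The sketch below is an alternative, not the assignment's required approach; the helper name count_kmers and its k parameter are hypothetical and introduced only for illustration.

```python
from collections import Counter

def count_kmers(seq, k=3):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    # Counter tallies each window produced by the generator expression
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

test_seq = "ATCATGCGCATG"
counts = count_kmers(test_seq)
print(dict(counts))
# {'ATC': 1, 'TCA': 1, 'CAT': 2, 'ATG': 2, 'TGC': 1, 'GCG': 1, 'CGC': 1, 'GCA': 1}
```

The counts sum to 10, which equals the number of 3-mer windows in a 12-nucleotide sequence (12 - 3 + 1), a quick sanity check that no window was skipped or counted twice.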
Problem 3 (3 points): Download the file "covid19_genome.txt" to your computer, upload the file to your Jupyter session, then read and load the sequence in using open() and read(). Use your function to count the different 3-mers in the sequence.
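# open the file and read in its contents
with open("covid19_genome.txt", "r") as handle:
    # drop all whitespace, including any line breaks inside the sequence,
    # so that no 3-mer contains a newline character
    covid19_genome = "".join(handle.read().split())

# execute function
covid19_3mers = find_3mer(covid19_genome)
print(covid19_3mers)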
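A dictionary with up to 64 distinct 3-mers is hard to read when printed in one line. As an optional sketch, the entries can be sorted by count before printing. The dictionary example_counts below is a small hypothetical stand-in, not output from the genome file; in practice the same sorting could be applied to the dictionary returned by the function.

```python
# Hypothetical stand-in for a 3-mer count dictionary (illustration only)
example_counts = {"ATC": 1, "TCA": 1, "CAT": 2, "ATG": 2, "TGC": 1}

# Sort by count in descending order, breaking ties alphabetically
ranked = sorted(example_counts.items(), key=lambda item: (-item[1], item[0]))
for kmer, count in ranked:
    print(kmer, count)
```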