Homework 8 Solutions

Enter your name and EID here

This homework is due on April 13, 2020 at 12:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (3 points): The interleukin 6 gene (IL6) encodes a cytokine that mediates a variety of immune response pathways in humans. Download the file "IL6_gene_human_lower.txt" to your computer, upload the file to your Jupyter session, then read the sequence in line by line using open() and readlines(). Print out the sequence of the gene such that all nucleotides have been converted to uppercase and all white space has been removed.

In [1]:
# open the file and read in its contents as a list of lines
handle = open("IL6_gene_human_lower.txt", "r")
il6_sequence = handle.readlines()
handle.close()

# strip trailing white space from each line, convert it to uppercase,
# and append it to a single sequence string
new_seq = ''
for line in il6_sequence:
    line = line.rstrip().upper()
    new_seq += line

print(new_seq)
ATTCTGCCCTCGAGCCCACCGGGAACGAAAGAGAAGCTCTATCTCCCCTCCAGGAGCCCAGCTATGAACTCCTTCTCCACAAGTAAGTGCAGGAAATCCTTAGCCCTGGAACTGCCAGCGGCGGTCGAGCCCTGTGTGAGGGAGGGGTGTGTGGCCCAGGGAGGGCTGGCGGGCGGCCAGCAGCAGAGGCAGGCTCCCAGCTGTGCTGTCAGCTCACCCCTGCGCTCGCTCCCCTCCGGCACAGGCGCCTTCGGTCCAGTTGCCTTCTCCCTGGGGCTGCTCCTGGTGTTGCCTGCTGCCTTCCCTGCCCCAGTACCCCCAGGAGAAGATTCCAAAGATGTAGCCGCCCCACACAGACAGCCACTCACCTCTTCAGAACGAATTGACAAACAAATTCGGTACATCCTCGACGGCATCTCAGCCCTGAGAAAGGAGGTGGGTAGGCTTGGCGATGGGGTTGAAGGGCCCGGTGCGCATGCGTTCCCCTTGCCCCTGCGTGTGGCCGGGGGCTGCCTGCATTAGGAGGTCTTTGCTGGGTTCTAGAGCACTGTAGATTTGAGGCCAACGGGGCCGACTAGACTGACTTCTGTATTTATCCTTTGCTGGTGTCAGGAAGTTCCTTTCCTTTCTGGAAAATGCAGAATGGGTCTGAAATCCATGCCCACCTTTGGCATGAGCTGAGGGTTATTGCTTCTCAGGGCTTCCTTTTCCCTTTCCAAAAAATTAGGTCTGTGAAGCTCCTTTTTGTCCCCCGGGCTTTGGAAGGACTAGAAAAGTGCCACCTGAAAGGCATGTTCAGCTTCTCAGAGCAGTTGCAGTACTTTTTGGTTATGTAAACTCAATGGCTAGGATTCCTCAAAGCCATTCCAGCTAAGATTCATACCTCAGAGCCCACCAAAGTGGCAAATCATAAATAGGTTAAAGCATCTCCCCACTTTCAATGCAAGGTATTTTGGTCCTGTTTGGTAGAAAGAAAAGAACACAGGAGGGGAGATTGGGAGCCCACACTCGAATTCTGGTTCTGCCAAACCAGCCTTGTGATCTTGGGTAAATTCCCTACCACCTCTGGACTCCATCAGTAAAATTGGGCGTGGACTAGGTGATCTCATAGATCCTTCCTGCTGGAACATTCTATGGCTTGAATTATATTCTCCTAATTATTGTCAAAATTGCTGTTATTAAGTATCTACTGTGTGCCAGGCACTTTAAATAAATATTGTGTCTAATCTTCAAAACAAATTTGCAAGGAAGGTTTTTGGAGATAAGGAAACTGAGACTCAGGATTAAGTAACACACCTAAAGTCACAGGTGAGCTTGGAACTGAACCCAAGTGTGCCCCCACTCCACTGGAATTTGCTTGCCAGGATGCCAATGAGTTGTAGCTTCATTTTTCTTAGAGACTTTCCTGGCTGTGGTTGAACAATGAAAAGGCCCTCTAGTGGTGTTTGTTTTAGGGACACTTAGGTGATAACAATTCTGGTATTCTTTCCCAGACATGTAACAAGAGTAACATGTGTGAAAGCAGCAAAGAGGCACTGGCAGAAAACAACCTGAACCTTCCAAAGATGGCTGAAAAAGATGGATGCTTCCAATCTGGATTCAATGAGGTACCAACTTGTCGCACTCACTTTTCACTATTCCTTAGGCAAAACTTCTCCCTCTTGCATGCAGTGCCTGTATACATATAGATCCAGGCAGCAACAAAAAGTGGGTAAATGTAAAGAATGTTATGTAAATTTCATGAGGAGGCCAACTTCAAGCTTTTTTAAAGGCAGTTTATTCTTGGACAGGTATGGCCAGAGATGGTGCCACTGTGGTGAGATTTTAACAACTGTCAAATGTTTAAAACTCCCACAGGTTTAATTAGTTCATCCTGGGAAAGGTACTCTCAGGGCCTTTTCCCTCTCTGGCTGCCCCTGGCAGGGTCCAGGTCTGCCCTCCCTCCCTGCCCAGCTCATTCTCCACAGTGAGATAACCTGCACTGTCTTCTGATTATTTTA
TAAAAGGAGGTTCCAGCCCAGCATTAACAAGGGCAAGAGTGCAGGAAGAACATCAAGGGGGACAATCAGAGAAGGATCCCCATTGCCACATTCTAGCATCTGTTGGGCTTTGGATAAAACTAATTACATGGGGCCTCTGATTGTCCAGTTATTTAAAATGGTGCTGTCCAATGTCCCAAAACATGCTGCCTAAGAGGTACTTGAAGTTCTCTAGAGGAGCAGAGGGAAAAGATGTCGAACTGTGGCAATTTTAACTTTTCAAATTGATTCTATCTCCTGGCGATAACCAATTTTCCCACCATCTTTCCTCTTAGGAGACTTGCCTGGTGAAAATCATCACTGGTCTTTTGGAGTTTGAGGTATACCTAGAGTACCTCCAGAACAGATTTGAGAGTAGTGAGGAACAAGCCAGAGCTGTGCAGATGAGTACAAAAGTCCTGATCCAGTTCCTGCAGAAAAAGGTGGGTGTGTCCTCATTCCCTCAACTTGGTGTGGGGGAAGACAGGCTCAAAGACAGTGTCCTGGACAACTCAGGGATGCAATGCCACTTCCAAAAGAGAAGGCTACACGTAAACAAAAGAGTCTGAGAAATAGTTTCTGATTGTTATTGTTAAATCTTTTTTTGTTTGTTTGGTTGGTTGGCTCTCTTCTGCAAAGGACATCAATAACTGTATTTTAAACTATATATTAACTGAGGTGGATTTTAACATCAATTTTTAATAGTGCAAGAGATTTAAAACCAAAGGCGGGGGGGCGGGCAGAAAAAAGTGCATCCAACTCCAGCCAGTGATCCACAGAAACAAAGACCAAGGAGCACAAAATGATTTTAAGATTTTAGTCATTGCCAAGTGACATTCTTCTCACTGTGGTTGTTTCAATTCTTTTTCCTACCTTTTACCAGAGAGTTAGTTCAGAGAAATGGTCAGAGACTCAAGGGTGGAAAGAGGTACCAAAGGCTTTGGCCACCAGTAGCTGGCTATTCAGACAGCAGGGAGTAGACTTGCTGGCTAGCATGTGGAGGAGCCAAAGCTCAATAAGAAGGGGCCTAGAATGAAACCCTTGGTGCTGATCCTGCCTCTGCCATTTCTACTTAAGCCAGGGTTTCTCATATGTTAACATGCATGGGAATTCCCTGGGCATCTTCTTGTGGTGTGGAGTCTGACTTAGCAAGCCTCGGGTGGGTTTGAGGGTCAAATTTCTACCAGGCTTATATCCCTGGTGATGCTGCAGAATTCCAGGACCACACTTGGAGGTTTAAGGCCTTCCACAAGTTACTTATCCCATATGGTGGGTCTATGGAAAGGTGTTTCCCAGTCCTCTTTACACCACCGGATCAGTGGTCTTTCAACAGATCCTAAAGGGATGGTGAGAGGGAAACTGGAGAAAAGTATCAGATTTAGAGGCCACTGAAGAACCCATATTAAAATGCCTTTAAGTATGGGCTCTTCATTCATATACTAAATATGAACTATGTGCCAGGCATTATTTCATATGACAGAATACAAACAAATAAGATAGTGATGCTGGTCAGGCTTGGTGGCTCATGCCTGTATTCCCTAAACTTTGGGAGCCTAAGGTGAGAACTCCTTGAACTCCTAAGGCCAGGAGTTCAAGACCAGCCTGGATAACATAGCAAGACCCCATCTCTACAAAAAACCAAAACCAAACAAACAAAAATGATAGTGGTGCTTCCCTCAGGATGCTTGTGGTCTAATGGGAGACAGAACAGCAAAGGGATGATTAGAAGTTGGTTGCTGTGAGCCAGGCACAGTGCTGATATAATCCCAGCGCTATGGGAGGCTGAGGTGGGTGGATCATTTGAGGCCAGGAGTTTAAGACCAGCCTGGTCAACATGGTAAAACCCCATCTCTACTTAAAAATACAAAAAAGTTAGCCAGGCATGGTGGCATACACCTGTAACCCAGCTACTCAGGAGGCTGAGGCACATGAATCACTTGAACCCAGGAGGCAGAGGTTGCTGTGCACCACTGCACTCCAGC
CTGGGTGACAGAACGAGACCTTGACTCAAAAAAAAAAAAAAGAAGTTTGTTGCTATGGAAGGGTCCTACTCAGAGCAGGCACCCCAGTTAATCTCATTCACCCCACATTTCACATTTGAACATCATCCCATAGCCCAGAGCATCCCTCCACTGCAAAGGATTTATTCAACATTTAAACAATCCTTTTTACTTTCATTTTCCTTCAGGCAAAGAATCTAGATGCAATAACCACCCCTGACCCAACCACAAATGCCAGCCTGCTGACGAAGCTGCAGGCACAGAACCAGTGGCTGCAGGACATGACAACTCATCTCATTCTGCGCAGCTTTAAGGAGTTCCTGCAGTCCAGCCTGAGGGCTCTTCGGCAAATGTAGCATGGGCACCTCAGATTGTTGTTGTTAATGGGCATTCCTTCTTCTGGTCAGAAACCTGTCCACTGGGCACAGAACTTATGTTGTTCTCTATGGAGAACTAAAAGTATGAGCGTTAGGACACTATTTTAATTATTTTTAATTTATTAATATTTAAATATGTGAAGCTGAGTTAATTTATGTAAGTCATATTTATATTTTTAAGAAGTACCACTTGAAACATTTTATGTATTAGTTTTGAAATAATAATGGAAAGTGGCTATGCAGTTTGAATATCCTTTGTTTCAGAGCCAGATCATTTCTTGGAAAGTGTAGGCTTACCTCAAATAAATGGCTAACTTATACATATTTTTAAAGAAATATTTATATTGTATTTATATAATGTATAAATGGTTTTTATACCAATAAATGGCATTTTAAAAAATTCA
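
A common alternative to calling close() by hand is a `with` block, which closes the file automatically when the block ends. The sketch below is only an illustration: it first writes a small stand-in file to a temporary directory so that it runs anywhere, and the file name `example_lower.txt` and its contents are made up here and are not the IL6 data.

```python
import os
import tempfile

# write a tiny stand-in file (hypothetical contents, not the IL6 gene)
path = os.path.join(tempfile.mkdtemp(), "example_lower.txt")
with open(path, "w") as handle:
    handle.write("attctg \n  ccctcg\n")

# read the lines back; the file is closed when the with block ends
with open(path, "r") as handle:
    lines = handle.readlines()

# remove all white space from each line, then convert to uppercase
new_seq = "".join("".join(line.split()).upper() for line in lines)
print(new_seq)
```

Unlike rstrip(), which only trims the end of a line, split() followed by join() also drops leading and internal white space.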

Problem 2 (4 points): In bioinformatics, k-mers are all the contiguous subsequences of length k contained in a read obtained through DNA sequencing. For example, if the DNA sequencing read is "ATCATCATG", then the 3-mers in that read are "ATC" (which occurs twice), "TCA" (which occurs twice), "CAT" (which occurs twice), and "ATG" (which occurs once). You can read more about k-mers on Wikipedia.

a) Write a function that takes a string of nucleotides as input and returns a dictionary with all 3-mers present in that string, and the number of times that each 3-mer occurs. Then, validate your function by finding the 3-mers in the DNA sequence test_seq defined below.

The output of your function should be a dictionary that is structured like this (although it will have several more entries):

{"ATC": 2, "TCA": 2, "CAT": 2, "ATG": 1}

where each key is a 3-mer itself (e.g., "ATC") and each value is the number of times that 3-mer occurs. Visually inspect the output of your function to ensure it is counting the 3-mers in the test sequence correctly. *HINT: You will need to use range() and len() to loop through 3-mer slices of a sequence.*
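
As a minimal sketch of the hint, the loop below lists every 3-mer window in the example read "ATCATCATG" from the problem statement. The names `read`, `k`, and `windows` are illustrative only and are not part of the assignment.

```python
# Illustrative sketch: enumerate the 3-mer windows of the example read.
read = "ATCATCATG"
k = 3

windows = []
# a k-mer can start at positions 0 through len(read) - k, inclusive
for i in range(len(read) - k + 1):
    windows.append(read[i:i + k])

print(windows)
# ['ATC', 'TCA', 'CAT', 'ATC', 'TCA', 'CAT', 'ATG']
```

A read of length 9 therefore contains 9 - 3 + 1 = 7 overlapping 3-mers, which matches the counts 2 + 2 + 2 + 1 in the example.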

In [2]:
# test case; verify your code works by finding all 3-mers in this sequence
test_seq = "ATCATGCGCATG"

def find_3mer(seq):
    # create an empty dictionary to hold 3-mers
    out_dict = {}
    # loop over every position where a 3-mer can start (all but the last 2)
    for i in range(len(seq) - 2):
        kmer = seq[i:i+3]
        # check if 3-mer is already in the output dictionary
        if kmer in out_dict:
            out_dict[kmer] += 1
        else:
            out_dict[kmer] = 1
    return out_dict

print(find_3mer(test_seq))
{'ATC': 1, 'TCA': 1, 'CAT': 2, 'ATG': 2, 'TGC': 1, 'GCG': 1, 'CGC': 1, 'GCA': 1}
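
The same counting can be written more compactly with `collections.Counter` from the standard library and generalized to any k. This is an alternative sketch, not the required solution; the function name `count_kmers` and its `k` parameter are assumptions introduced here.

```python
from collections import Counter

def count_kmers(seq, k=3):
    """Count every k-mer (contiguous substring of length k) in seq."""
    # a sequence shorter than k gives an empty range, so no k-mers are counted
    return dict(Counter(seq[i:i + k] for i in range(len(seq) - k + 1)))

test_seq = "ATCATGCGCATG"
print(count_kmers(test_seq))
```

For `test_seq` this should produce the same eight 3-mers and counts as the loop-based function.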

Problem 3 (3 points): Download the file "covid19_genome.txt" to your computer, upload the file to your Jupyter session, then read the sequence in using open() and read(). Use your function from Problem 2 to count the different 3-mers in the sequence.

In [3]:
# open the file and read in its contents
handle = open("covid19_genome.txt", "r")
covid19_genome = handle.read().strip()
handle.close()

# execute function
covid19_3mers = find_3mer(covid19_genome)
print(covid19_3mers)
{'ATT': 773, 'TTA': 876, 'TAA': 719, 'AAA': 923, 'AAG': 580, 'AGG': 329, 'GGT': 454, 'GTT': 700, 'TTT': 1004, 'TAT': 622, 'ATA': 471, 'TAC': 609, 'ACC': 376, 'CCT': 344, 'CTT': 738, 'TTC': 518, 'TCC': 209, 'CCC': 116, 'CCA': 354, 'CAG': 438, 'GTA': 469, 'AAC': 615, 'ACA': 809, 'CAA': 703, 'ACT': 674, 'TCG': 113, 'CGA': 95, 'GAT': 440, 'ATC': 339, 'TCT': 542, 'CTC': 287, 'TTG': 817, 'TGT': 858, 'TAG': 427, 'AGA': 605, 'CTG': 495, 'CTA': 561, 'ACG': 164, 'GAA': 535, 'AAT': 761, 'GTG': 552, 'TGG': 554, 'GGC': 223, 'GCT': 521, 'GTC': 269, 'TCA': 549, 'CAC': 459, 'CGG': 76, 'TGC': 547, 'GCA': 372, 'CAT': 484, 'ATG': 725, 'AGT': 507, 'CGC': 97, 'CGT': 171, 'TGA': 630, 'GAC': 340, 'GGA': 282, 'GAG': 297, 'CCG': 74, 'AGC': 301, 'GCC': 187, 'GGG': 134, 'GCG': 88}
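
Two quick sanity checks can be applied to any 3-mer count. A sequence of length n contains n - 2 overlapping 3-mers, so the counts should sum to n - 2, and with only the four nucleotides A, C, G, and T there can be at most 4^3 = 64 distinct 3-mers (the output above lists 64 entries). The sketch below runs both checks on the short test sequence rather than the genome file, and redefines the counting function in a condensed form so that the block runs on its own.

```python
def find_3mer(seq):
    # same counting logic as the Problem 2 solution, condensed with dict.get()
    out_dict = {}
    for i in range(len(seq) - 2):
        kmer = seq[i:i+3]
        out_dict[kmer] = out_dict.get(kmer, 0) + 1
    return out_dict

seq = "ATCATGCGCATG"
counts = find_3mer(seq)

# check 1: the counts sum to the number of 3-mer windows
total = sum(counts.values())
print(total == len(seq) - 2)

# check 2: no more than 64 distinct 3-mers over the alphabet A, C, G, T
print(len(counts) <= 4 ** 3)
```

Both checks should print True. Note that the second check assumes the sequence contains only A, C, G, and T; ambiguity codes such as N or stray newline characters would add extra keys.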