# Homework 8

Enter your name and EID here

This homework is due on April 9, 2019 at 4:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (5 points): In bioinformatics, k-mers are all the possible substrings of length k contained in a read obtained through DNA sequencing. For example, if the DNA sequencing read is "ATCATCATG", then the 3-mers in that read are "ATC" (which occurs twice), "TCA" (which occurs twice), "CAT" (which occurs twice), and "ATG" (which occurs once). You can read more about k-mers on Wikipedia.
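The 3-mer counts quoted above can be checked with a short sliding-window sketch. The function name `count_kmers` and the use of `collections.Counter` are choices made here for illustration and are not part of the assignment. The sketch assumes k-mers overlap, so a read of length n contains n - k + 1 of them.

```python
from collections import Counter


def count_kmers(read, k):
    """Count every overlapping substring of length k in read.

    Hypothetical helper for illustration; the name is not from the assignment.
    """
    # Slide a window of width k one position at a time across the read.
    # A read of length n yields n - k + 1 windows (none if k > n).
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


# The example read from the problem statement
counts = count_kmers("ATCATCATG", 3)
print(dict(counts))  # {'ATC': 2, 'TCA': 2, 'CAT': 2, 'ATG': 1}
```

A `Counter` is a dictionary subclass, so the result can be used anywhere a plain dictionary of k-mer counts is expected.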

a) Write a function that takes a string of nucleotides as input and returns a dictionary with all 2-mers present in that string, and the number of times that each 2-mer occurs. Then use your function to find the 2-mers in the DNA sequence my_seq defined below.

The output of your function should be a dictionary that is structured like this (although it will have several more entries):

{"AT": 2, "TC": 2, "CA": 1}

where each key is a 2-mer itself (e.g., "AT") and each value is the number of times that 2-mer occurs.

b) Come up with a short DNA sequence and use it to verify manually that your function generates the correct result. Explain your reasoning in 2-3 sentences.

In [1]:
# Find all 2-mers in this sequence
my_seq = "CCTCTCCCTTATCGTCAATCTTCTCGAGGATTGGGGACCCTGCGCTGAACATGGAGAACATCACATCAGG"


Problem 2 (5 points): DNA sequences are typically stored in a format called FASTA (pronounced fast-ay). A single FASTA file may contain many different sequences. For example, you may have a FASTA file for a mouse, in which each mouse gene sequence is stored as a separate entry. Every sequence in a FASTA file begins on a new line with a greater-than symbol ">" (without quotes).
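To make the layout concrete, here is a minimal sketch that splits FASTA-formatted text into named sequences. The two records, their names, and the helper `read_fasta` are invented for illustration and do not come from the assignment. The sketch assumes that the text after ">" on a header line names the sequence and that a sequence may continue over several lines until the next header.

```python
# Hypothetical two-record FASTA text, invented for illustration
fasta_text = """>gene_1 example description
ATGCCGTA
GGCTTAAC
>gene_2
TTGACCGA
"""


def read_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequence lines are joined until the next header appears
            records[name] += line
    return records


for header, sequence in read_fasta(fasta_text).items():
    print(header, sequence)
```

For real files, an established parser such as the one in Biopython is usually a safer choice than a hand-written loop.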
# Your code goes here