Homework 8

Enter your name and EID here

This homework is due on April 9, 2019 at 4:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (5 points): In bioinformatics, k-mers are all the overlapping substrings of length k contained in a read obtained through DNA sequencing. For example, if the DNA sequencing read is "ATCATCATG", then the 3-mers in that read are "ATC" (which occurs twice), "TCA" (which occurs twice), "CAT" (which occurs twice), and "ATG" (which occurs once). You can read more about k-mers on Wikipedia.
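
As a rough illustration of the 3-mer example above, the sketch below slides a window of length k along the read and tallies each substring. The function name count_kmers and the use of collections.Counter are choices made for this illustration only, not a required approach for the problem.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq (hypothetical helper)."""
    # A sequence of length n contains n - k + 1 overlapping windows of length k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# The read from the example above, with k = 3.
counts = count_kmers("ATCATCATG", 3)
print(dict(counts))
```

The windows overlap: each one starts one position after the previous one, which is why a read of 9 nucleotides yields 7 3-mers in total.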

a) Write a function that takes a string of nucleotides as input and returns a dictionary with all 2-mers present in that string, and the number of times that each 2-mer occurs. Then use your function to find the 2-mers in the DNA sequence my_seq defined below.

The output of your function should be a dictionary that is structured like this (although it will have several more entries):

{"AT": 2, "TC": 2, "CA": 1}

where each key is a 2-mer itself (e.g., "AT") and each value is the number of times that 2-mer occurs.

b) Come up with a short DNA sequence and use it to verify manually that your function generates the correct result. Explain your reasoning in 2-3 sentences.

In [1]:
# Find all 2-mers in this sequence
my_seq = "CCTCTCCCTTATCGTCAATCTTCTCGAGGATTGGGGACCCTGCGCTGAACATGGAGAACATCACATCAGG"

# Your code goes here

Your answer goes here

Problem 2 (5 points): DNA sequences are typically stored in a format called FASTA (pronounced fast-ay). A single FASTA file may contain many different sequences. For example, a FASTA file for a mouse may store each mouse gene as a separate sequence. Each sequence in a FASTA file is preceded by a header line that begins with a greater-than symbol ">" (without quotes).
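
To make the layout concrete, the sketch below walks through a tiny FASTA-formatted string held in memory and groups the sequence lines under their headers. The record names and sequences are made up for illustration, and the variable names fasta_text and records are hypothetical; nothing here comes from the homework data file.

```python
# Hypothetical two-record FASTA text, made up for illustration only.
fasta_text = """>gene_1 example description
ATGCCGTA
GGCTTAA
>gene_2
TTGACC
"""

records = {}   # maps each header (without ">") to its full sequence
header = None  # header of the record currently being read

for line in fasta_text.splitlines():
    line = line.strip()
    if not line:
        continue  # skip blank lines
    if line.startswith(">"):
        # A ">" line starts a new record; the rest of the line is its header.
        header = line[1:]
        records[header] = ""
    elif header is not None:
        # A sequence may be wrapped over several lines, so join them.
        records[header] += line

print(records)
```

Note that a single sequence can span several lines, so the number of lines in a file is generally larger than the number of sequences; only the header lines mark where a new sequence begins.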

Write a function that takes the name of a FASTA file as input, opens that file, counts the number of sequences in the file (by counting the number of lines in the file that start with a ">" symbol), and returns the count. Download the file "hepatitis_b_genome.fasta" to your computer and use your function to count the number of sequences in the file.

In [2]:
# Your code goes here