Homework 8

Enter your name and EID here

This homework is due on April 3, 2018 at 7:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (5 points): In bioinformatics, k-mers are all the contiguous subsequences of length k contained in a read obtained through DNA sequencing. For example, if the DNA sequencing read is "ATCATCATG", then the 3-mers in that read are "ATC" (which occurs twice), "TCA" (which occurs twice), "CAT" (which occurs twice), and "ATG" (which occurs once). You can read more about k-mers on Wikipedia.
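As a quick illustration of the definition, the short sketch below lists the 3-mers of the example read by sliding a window of width k along it. The variable names (read, k, three_mers, counts) are chosen here for illustration only, and this is one possible way to enumerate k-mers rather than a required approach.

```python
from collections import Counter

read = "ATCATCATG"  # the example read from the text above
k = 3

# Slide a window of width k along the read.
# A read of length n contains n - k + 1 windows (here 9 - 3 + 1 = 7).
three_mers = [read[i:i + k] for i in range(len(read) - k + 1)]

# Tally how often each distinct 3-mer appears.
counts = Counter(three_mers)

print(three_mers)
print(dict(counts))
```

The tallies should agree with the counts given in the example: "ATC", "TCA", and "CAT" twice each, and "ATG" once.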

a) Write a function that takes a string of nucleotides as input and returns a dictionary with all 4-mers present in that string, and the number of times that each 4-mer occurs. Then use your function to find the 4-mers in the DNA sequence my_seq defined below.

The output of your function should be a dictionary that is structured like this (although it will have several more entries):

{"ATCA": 2, "TCAT": 2, "CATC": 1}

where each key is a 4-mer itself (e.g., "ATCA") and each value is the number of times that 4-mer occurs.

b) Come up with a short DNA sequence and use it to verify manually that your function generates the correct result. Explain your reasoning in 2-3 sentences.

In [1]:
# Find all 4-mers in this sequence

# Your code goes here

Your answer goes here

Problem 2 (5 points): DNA sequences are typically stored in a format called FASTA (pronounced fast-ay). A single FASTA file may contain many different sequences. For example, you may have a FASTA file for a mouse in which each mouse gene sequence is stored as a separate sequence. Each sequence in a FASTA file begins with a header line that starts with a greater-than symbol ">" (without quotes); the sequence itself follows on the next line or lines.
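To make the layout concrete, here is a minimal sketch that reads a small FASTA-formatted string held in memory and groups each sequence under its header. The record names and sequences are invented for illustration and do not come from any real file, and the variable names (fasta_text, records, header) are assumptions of this sketch.

```python
import io

# Hypothetical two-record FASTA text; names and sequences are made up.
fasta_text = """>gene_1 example mouse gene
ATGGCCATTG
TAATGGGCC
>gene_2 another example
ATGAAACGC
"""

records = {}   # maps header text (without ">") to the joined sequence
header = None

for line in io.StringIO(fasta_text):
    line = line.rstrip("\n")
    if line.startswith(">"):
        # A ">" line opens a new record.
        header = line[1:]
        records[header] = ""
    elif header is not None:
        # Sequence lines may wrap, so append them to the current record.
        records[header] += line

for name, seq in records.items():
    print(name, len(seq))
```

Note that a sequence can span several lines, which is why the header lines, not the sequence lines, mark where one record ends and the next begins.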

Write a function that takes the name of a FASTA file as input, opens that file, counts the number of sequences in the file (by counting the number of lines in the file that start with a ">" symbol), and returns the count. Download the file "CD4.fasta" to your computer and use your function to count the number of sequences in the file. The file CD4.fasta contains amino acid sequences of the CD4 membrane protein, which is found on the surface of immune cells.

In [2]:
# Your code goes here