Homework 8 Solutions

Enter your name and EID here

This homework is due on April 3, 2018 at 7:00pm. Please submit as a PDF file on Canvas. Before submission, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

Problem 1 (5 points): In bioinformatics, k-mers refer to all the possible subsequences (of length k) from a read obtained through DNA sequencing. For example, if the DNA sequencing read is "ATCATCATG", then the 3-mers in that read include "ATC" (which occurs twice), "TCA" (which occurs twice), "CAT" (occurs twice), and "ATG" (occurs once). You can read more about k-mers on Wikipedia.

a) Write a function that takes a string of nucleotides as input and returns a dictionary with all 4-mers present in that string, and the number of times that each 4-mer occurs. Then use your function to find the 4-mers in the DNA sequence my_seq defined below.

The output of your function should be a dictionary that is structured like this (although it will have several more entries):

{"ATCA": 2, "TCAT": 2, "CATC": 1}

where each key is a 4-mer itself (e.g., "ATCA") and each value is the number of times that 4-mer occurs.

b) Come up with a short DNA sequence and use it to verify manually that your function generates the correct result. Explain your reasoning in 2-3 sentences.

In [1]:
# Find all 4-mers in this sequences
# test case
test_seq = "ATCATCAT"

def find_4mer(seq):
    # Create an empty dictionary to hold 4-mers
    out_dict = {}
    # Loop over every position in the sequence except for the last 3
    for i in range(len(seq) - 3):
        # Check if 4-mer is already in the output dictionary
        if seq[i:i+4] in out_dict:
            out_dict[seq[i:i+4]] += 1
            out_dict[seq[i:i+4]] = 1
    return out_dict

{'CAGC': 1, 'AGCC': 1, 'GCCC': 1, 'CCCA': 1, 'CCAA': 1, 'CAAT': 1, 'AATC': 1, 'ATCA': 1, 'TCAG': 1, 'CAGG': 2, 'AGGC': 1, 'GGCT': 1, 'GCTC': 1, 'CTCT': 1, 'TCTA': 1, 'CTAC': 1, 'TACT': 2, 'ACTG': 1, 'CTGC': 1, 'TGCC': 1, 'GCCA': 1, 'CCAC': 1, 'CACT': 1, 'ACTA': 1, 'CTAA': 1, 'TAAA': 1, 'AAAC': 1, 'AACT': 1, 'ACTT': 1, 'CTTA': 1, 'TTAC': 2, 'TACG': 2, 'ACGC': 2, 'CGCA': 1, 'GCAG': 1, 'AGGA': 1, 'GGAT': 1, 'GATA': 1, 'ATAT': 2, 'TATA': 1, 'TATT': 1, 'ATTT': 1, 'TTTA': 1, 'CGCC': 1, 'GCCG': 1, 'CCGA': 1, 'CGAC': 1, 'GACG': 1, 'ACGT': 1, 'CGTA': 1, 'GTAC': 1}
{'ATCA': 2, 'TCAT': 2, 'CATC': 1}

By looking at the sequence "ATCATCAT", we can see that "ATCA" occurs twice, "TCAT" occurs twice, and "CATC" occurs once. Our function reproduces these counts.

Problem 2 (5 points): DNA sequences are typically stored in a format called FASTA (pronounced fast-ay). A single FASTA file may contain many different sequences. For example, you may have a FASTA file for a mouse, and each mouse gene sequence is stored as a separate sequence in that FASTA file. All sequences in a FASTA file begin on a new line with a greater-than symbol ">" (without quotes).

Write a function that takes the name of a FASTA file as input, opens that file, counts the number of sequences in the file (by counting the number of lines in the file that start with a “>” symbol), and returns the count. Download the file "CD4.fasta" to your computer and use your function to count the number of sequences in the file. The file CD4.fasta contains amino acid sequences of the CD4 membrane protein that is found on the surface of the immune cells.

In [2]:
def count_seqs(fasta):
    # Set counter to 0
    seq_count = 0
    # Open the fasta file
    # We use 'r' here because we only want to read the contents of the file, not modify the file
    with open(fasta, 'r') as file:
        # Loop over each line in the file
        for line in file:
            # If the first character of the line is '>', add one to the counter
            if line[0] == '>':
                seq_count += 1
    return seq_count

In [ ]: