Lab Worksheet 10 Solutions

Problem 1: In bioinformatics, we are often interested in determining whether or not two DNA or amino acid sequences are similar. One simple measure of similarity is called pairwise sequence identity. To calculate pairwise sequence identity, we take two sequences, count the number of positions in which both sequences share the same nucleotide or amino acid, and then divide by the total number of positions. For example, say we have these two DNA sequences:

      Position: 1 2 3 4 5 6
    Sequence 1: A T C G T A
    Sequence 2: A T G A G A
Identical(y/n): y y n n n y

There are 3 positions that match out of 6 total positions, so the sequence identity is 50% (3/6).

Write a function that calculates the pairwise sequence identity for any two sequences of the same length. (Do not worry about properly aligning the two sequences. Sequence alignment is a concept we will return to later.) Your function should take two arguments: seq1 and seq2. Make sure that your function checks for equal sequence lengths. If the input sequences are of different lengths, your function should return an error message. Otherwise, your function should return the pairwise sequence identity as a percentage.

Finally, use your function to calculate the pairwise sequence identity of the two amino acid sequences below.

In [1]:
mouse_histone = "MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPLTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
human_histone = "MARTKQTARKSTGGKAPRKQLATKAQRKSARATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVTEIAQDFKTDLRFQSSAVNALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"

# Your code goes here.

def calc_seq_identity(seq1, seq2):
    # Check that input sequences are the same length
    if len(seq1) != len(seq2):
        return "Error, input sequences are of different lengths."
    # Start a counter for the number of matches that we find
    matches = 0
    # Loop over the sequences
    for i in range(len(seq1)):
        # At each position i, check if letters are identical
        if seq1[i] == seq2[i]:
            matches += 1
    # Calculate a percentage
    return (matches/len(seq1))*100

calc_seq_identity(mouse_histone, human_histone)
    
Out[1]:
96.32352941176471



Problem 2: Open the file "road_not_taken.txt" from In-class Worksheet 19. Randomly shuffle the order of the lines in the poem, then save them to a file called "road_not_taken_shuffled.txt". Open your new file and print the first 10 lines to make sure that the poem has been shuffled. HINT: You can use the function shuffle() to shuffle the items in a list.

In [2]:
# You will need the library "random" to shuffle the lines in the poem.
from random import shuffle

# Your code goes here.

# Read in the poem
with open("road_not_taken.txt", "r") as file:
    lines = file.readlines()

# Shuffle the lines
# The shuffle function shuffles a list in place
shuffle(lines)

# Write out shuffled poem
with open("read_not_taken_shuffle.txt", "w") as file:
    file.writelines(lines)

# Open file with shuffled poem and print the first 10 lines
with open("read_not_taken_shuffle.txt", "r") as file:
    count = 1
    for line in file:
        if count <= 10:
            print(line)
            count += 1
        else:
            break
Oh, I kept the first for another day!

Then took the other, as just as fair,

And be one traveler, long I stood

And that has made all the difference.

And sorry I could not travel both

Somewhere ages and ages hence:

        THE ROAD NOT TAKEN

And having perhaps the better claim,



Though as for that the passing there

In [ ]: