Problem 1: In bioinformatics, we are often interested in determining whether or not two DNA or amino acid sequences are similar. One simple measure of similarity is called pairwise sequence identity. To calculate pairwise sequence identity, we take two sequences, count the number of positions in which both sequences share the same nucleotide or amino acid, and then divide by the total number of positions. For example, say we have these two DNA sequences:
Position: 1 2 3 4 5 6 Sequence 1: A T C G T A Sequence 2: A T G A G A Identical(y/n): y y n n n y
There are 3 positions that match out of 6 total positions, so the sequence identity is 50% (3/6).
Write a function that calculates the pairwise sequence identity for any two sequences of the same length. (Do not worry about properly aligning the two sequences. Sequence alignment is a concept we will return to later.) Your function should take two arguments:
seq2. Make sure that your function checks for equal sequence lengths. If the input sequences are of different lengths, your function should return an error message. Otherwise, your function should return the pairwise sequence identity as a percentage.
Finally, use your function to calculate the pairwise sequence identity of the two amino acid sequences below.
mouse_histone = "MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPLTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA" human_histone = "MARTKQTARKSTGGKAPRKQLATKAQRKSARATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVTEIAQDFKTDLRFQSSAVNALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA" # Your code goes here. def calc_seq_identity(seq1, seq2): # Check that input sequences are the same length if len(seq1) != len(seq2): return "Error, input sequences are of different lengths." # Start a counter for the number of matches that we find matches = 0 # Loop over the sequences for i in range(len(seq1)): # At each position i, check if letters are identical if seq1[i] == seq2[i]: matches += 1 # Calculate a percentage return (matches/len(seq1))*100 calc_seq_identity(mouse_histone, human_histone)
Problem 2: Open the file "
road_not_taken.txt" from In-class Worksheet 19. Randomly shuffle the order of the lines in the poem, then save them to a file called "
road_not_taken_shuffled.txt". Open your new file and print the first 10 lines to make sure that the poem has been shuffled. HINT: You can use the function
shuffle() to shuffle the items in a list.
# You will need the library "random" to shuffle the lines in the poem. from random import shuffle # Your code goes here. # Read in the poem with open("road_not_taken.txt", "r") as file: lines = file.readlines() # Shuffle the lines # The shuffle function shuffles a list in place shuffle(lines) # Write out shuffled poem with open("read_not_taken_shuffle.txt", "w") as file: file.writelines(lines) # Open file with shuffled poem and print the first 10 lines with open("read_not_taken_shuffle.txt", "r") as file: count = 1 for line in file: if count <= 10: print(line) count += 1 else: break
Oh, I kept the first for another day! Then took the other, as just as fair, And be one traveler, long I stood And that has made all the difference. And sorry I could not travel both Somewhere ages and ages hence: THE ROAD NOT TAKEN And having perhaps the better claim, Though as for that the passing there