In bioinformatics, we are often interested in determining whether or not two DNA or amino acid sequences are similar. One simple measure of similarity is called pairwise sequence identity. To calculate pairwise sequence identity, we take two sequences, count the number of positions in which both sequences share the same nucleotide or amino acid, and then divide by the total number of positions. For example, say we have these two DNA sequences:
Position: 1 2 3 4 5 6
Sequence 1: A T C G T A
Sequence 2: A T G A G A
Identical(y/n): y y n n n y
There are 3 positions that match out of 6 total positions, so the sequence identity is 50% (3/6).
Problem 1: Write code that outputs the example above to a CSV file. The CSV file should be written in tidy format, and it should contain one column for position
, one column for Sequence 1
, one column for Sequence 2
, and one column for Identical
. Use sequence1
and sequence2
to write a CSV file. Your code should generate a number for a position and check each position for a match. Once you produced a file, verify that the file was written correctly by opening it in R Studio.
sequence1 = "ATCGTA"
sequence2 = "ATGAGA"
# Your code goes here.
Problem 2: Write a function that calculates the pairwise sequence identity for any two sequences of the same length. (Do not worry about properly aligning the two sequences. Sequence alignment is a concept we will return to later.) Your function should take two arguments: seq1
and seq2
. Make sure that your function checks for equal sequence lengths. If the input sequences are of different lengths, your function should return an error message. Otherwise, your function should return the pairwise sequence identity as a percentage.
Finally, use your function to calculate the pairwise sequence identity of the two amino acid sequences below.
mouse_histone = "MARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPLTVALREIRRYQKSTELLIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
human_histone = "MARTKQTARKSTGGKAPRKQLATKAQRKSARATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLVTEIAQDFKTDLRFQSSAVNALQEACEAYLVGLFEDTNLCAIHAKRVTIMPKDIQLARRIRGERA"
# Your code goes here.
Problem 3: Write a function that counts the occurence of each amino acid in a sequence and writes them to a CSV file. Amino acids in the output file should appear in the alphabetical order. Your function should write a file in tidy format. Use your function to output amino acid counts in sequences mouse_histone
and human_histone
from the previous problem. HINT: You can use sorted()
to sort dictionary keys.
# Your code goes here.