April 28, 2020
In this class, our goal is to implement the Needleman-Wunsch algorithm in Python. You can read more about the Needleman-Wunsch algorithm on Wikipedia. The Wikipedia page contains pseudo-code which you might find helpful.
Since the Needleman-Wunsch algorithm works on a matrix of scores, it is helpful to discuss how to represent a matrix in Python. We can do so by creating a list of lists. Let's say we want to make a matrix that looks like this:
In Python, we would create it as follows:
my_matrix =  # Fill out the 0th row my_matrix.append([1, 3, 5, 7]) # Fill out the 1st row my_matrix.append([2, 3, 4, 5]) # Fill out the 2nd row my_matrix.append([5, 2, 20, 3]) print(my_matrix)
[[1, 3, 5, 7], [2, 3, 4, 5], [5, 2, 20, 3]]
However, when we print the matrix, it is not easy to read, because it is shown as a list of lists. So it is helpful to define a function
print_matrix() that prints a matrix in a nicer form:
def print_matrix(mat): # Loop over all rows for i in range(0, len(mat)): print("[", end = "") # Loop over each column in row i for j in range(0, len(mat[i])): # Print out the value in row i, column j print(mat[i][j], end = "") # Only add a tab if we're not in the last column if j != len(mat[i]) - 1: print("\t", end = "") print("]\n") print_matrix(my_matrix)
[1 3 5 7] [2 3 4 5] [5 2 20 3]
To retrieve an indiviual value from a matrix, we use code of the form
my_matrix[row][column]. So, to retrieve for example the value in the 0th column of the 2nd row, we write:
print("The value in the 2nd row and the 0th column is:", my_matrix)
The value in the 2nd row and the 0th column is: 5
It will also be helpful to have a function that can create a matrix of zeros.
# A function for making a matrix of zeroes def zeros(rows, cols): return [[0 for i in range(cols)] for j in range(rows)] zero_matrix = zeros(3, 5) # make zero matrix with 3 rows and 5 columns print_matrix(zero_matrix)
[0 0 0 0 0] [0 0 0 0 0] [0 0 0 0 0]
Write a function that takes two sequences as input and returns a matrix of Needleman-Wunsch scores. You do not have to do the back-tracing, just fill out the matrix.
Break the problem down into as many small steps as possible. Here are a few hints:
zeros()function defined below.
max()function will return the maximum value from a list of values. For example
max(1, 7, 3)will return
print_matrix()function to print out your matrix as frequently as possible. Always make sure that your code is doing what you think it's doing!
# Use these values to calculate scores gap_penalty = -1 match_award = 1 mismatch_penalty = -1 # Make a score matrix with these two sequences seq1 = "ATTACA" seq2 = "ATGCT" # A function for determining the score between any two bases in alignment def match_score(alpha, beta): if alpha == beta: return match_award elif alpha == '-' or beta == '-': return gap_penalty else: return mismatch_penalty # The function that actually fills out the matrix of scores def needleman_wunsch(seq1, seq2): # length of the two sequences n = len(seq1) m = len(seq2) # Generate matrix of zeros to store scores score = zeros(m+1, n+1) # Calculate score table # Fill out first column for i in range(0, m + 1): score[i] = gap_penalty * i # Fill out first row for j in range(0, n + 1): score[j] = gap_penalty * j # Fill out all other values in the score matrix for i in range(1, m + 1): for j in range(1, n + 1): # Calculate the score by checking the top, left, and diagonal cells match = score[i - 1][j - 1] + match_score(seq1[j-1], seq2[i-1]) delete = score[i - 1][j] + gap_penalty insert = score[i][j - 1] + gap_penalty # Record the maximum score from the three possible scores calculated above score[i][j] = max(match, delete, insert) return score # Test out the needleman_wunsch() function print_matrix(needleman_wunsch(seq1, seq2))
[0 -1 -2 -3 -4 -5 -6] [-1 1 0 -1 -2 -3 -4] [-2 0 2 1 0 -1 -2] [-3 -1 1 1 0 -1 -2] [-4 -2 0 0 0 1 0] [-5 -3 -1 1 0 0 0]
Modify your code from Problem 1 to back-trace through the score matrix and print out the final alignment. Hint: For the back-tracing, you'll want to use a
while loop (or several of them).
def nw_backtrace(seq1, seq2): # Calculate score table using the function we wrote earlier score = needleman_wunsch(seq1, seq2) # Create variables to store alignment align1 = "" align2 = "" # Start from the bottom right cell in matrix i = len(seq2) j = len(seq1) # We'll use i and j to keep track of where we are in the matrix, just like above while i > 0 and j > 0: # end touching the top or the left edge score_current = score[i][j] score_diagonal = score[i-1][j-1] score_up = score[i][j-1] score_left = score[i-1][j] # Check to figure out which cell the current score was calculated from, # then update i and j to correspond to that cell. if score_current == score_diagonal + match_score(seq1[j-1], seq2[i-1]): align1 += seq1[j-1] align2 += seq2[i-1] i -= 1 j -= 1 elif score_current == score_up + gap_penalty: align1 += seq1[j-1] align2 += '-' j -= 1 elif score_current == score_left + gap_penalty: align1 += '-' align2 += seq2[i-1] i -= 1 # Finish tracing up to the top left cell while j > 0: align1 += seq1[j-1] align2 += '-' j -= 1 while i > 0: align1 += '-' align2 += seq2[i-1] i -= 1 # Since we traversed the score matrix from the bottom right, our two sequences will be reversed. # These two lines reverse the order of the characters in each sequence. align1 = align1[::-1] align2 = align2[::-1] return(align1, align2) output1, output2 = nw_backtrace(seq1, seq2) print(output1 + "\n" + output2)