Class 25: Needleman-Wunsch

April 28, 2020

In this class, our goal is to implement the Needleman-Wunsch algorithm in Python. You can read more about the Needleman-Wunsch algorithm on Wikipedia. The Wikipedia page contains pseudo-code which you might find helpful.

Since the Needleman-Wunsch algorithm works on a matrix of scores, it is helpful to discuss how to represent a matrix in Python. We can do so by creating a list of lists. Let's say we want to make a matrix that looks like this:

1 3 5 7
2 3 4 5
5 2 20 3

In Python, we would create it as follows:

In [1]:
my_matrix = []
# Fill out the 0th row
my_matrix.append([1, 3, 5, 7])
# Fill out the 1st row
my_matrix.append([2, 3, 4, 5])
# Fill out the 2nd row
my_matrix.append([5, 2, 20, 3])

print(my_matrix)
[[1, 3, 5, 7], [2, 3, 4, 5], [5, 2, 20, 3]]

However, when we print the matrix, it is not easy to read, because it is shown as a list of lists. So it is helpful to define a function print_matrix() that prints a matrix in a nicer form:

In [2]:
def print_matrix(mat):
    # Loop over all rows
    for i in range(0, len(mat)):
        print("[", end = "")
        # Loop over each column in row i
        for j in range(0, len(mat[i])):
            # Print out the value in row i, column j
            print(mat[i][j], end = "")
            # Only add a tab if we're not in the last column
            if j != len(mat[i]) - 1:
                print("\t", end = "")
        print("]\n")

print_matrix(my_matrix)
[1	3	5	7]

[2	3	4	5]

[5	2	20	3]

To retrieve an indiviual value from a matrix, we use code of the form my_matrix[row][column]. So, to retrieve for example the value in the 0th column of the 2nd row, we write:

In [3]:
print("The value in the 2nd row and the 0th column is:", my_matrix[2][0])
The value in the 2nd row and the 0th column is: 5

It will also be helpful to have a function that can create a matrix of zeros.

In [4]:
# A function for making a matrix of zeroes
def zeros(rows, cols):
    return [[0 for i in range(cols)] for j in range(rows)]

zero_matrix = zeros(3, 5) # make zero matrix with 3 rows and 5 columns
print_matrix(zero_matrix)
[0	0	0	0	0]

[0	0	0	0	0]

[0	0	0	0	0]

Problems

Problem 1:

Write a function that takes two sequences as input and returns a matrix of Needleman-Wunsch scores. You do not have to do the back-tracing, just fill out the matrix.

Break the problem down into as many small steps as possible. Here are a few hints:

  • Before you calculate any scores, make an empty matrix of the appropriate size using the zeros() function defined below.
  • Fill out the 0th row and 0th column before you calculate any other scores.
  • The max() function will return the maximum value from a list of values. For example max(1, 7, 3) will return 7.
  • Make liberal use of the range() function.
  • Use the print_matrix() function to print out your matrix as frequently as possible. Always make sure that your code is doing what you think it's doing!
  • Remember, in Python, we start counting from 0.
In [5]:
# Use these values to calculate scores
gap_penalty = -1
match_award = 1
mismatch_penalty = -1

# Make a score matrix with these two sequences
seq1 = "ATTACA"
seq2 = "ATGCT"

# A function for determining the score between any two bases in alignment
def match_score(alpha, beta):
    if alpha == beta:
        return match_award
    elif alpha == '-' or beta == '-':
        return gap_penalty
    else:
        return mismatch_penalty

# The function that actually fills out the matrix of scores
def needleman_wunsch(seq1, seq2):
    
    # length of the two sequences
    n = len(seq1)
    m = len(seq2)  
    
    # Generate matrix of zeros to store scores
    score = zeros(m+1, n+1)
   
    # Calculate score table
    
    # Fill out first column
    for i in range(0, m + 1):
        score[i][0] = gap_penalty * i
    
    # Fill out first row
    for j in range(0, n + 1):
        score[0][j] = gap_penalty * j
    
    # Fill out all other values in the score matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Calculate the score by checking the top, left, and diagonal cells
            match = score[i - 1][j - 1] + match_score(seq1[j-1], seq2[i-1])
            delete = score[i - 1][j] + gap_penalty
            insert = score[i][j - 1] + gap_penalty
            # Record the maximum score from the three possible scores calculated above
            score[i][j] = max(match, delete, insert)

    return score

# Test out the needleman_wunsch() function
print_matrix(needleman_wunsch(seq1, seq2))
[0	-1	-2	-3	-4	-5	-6]

[-1	1	0	-1	-2	-3	-4]

[-2	0	2	1	0	-1	-2]

[-3	-1	1	1	0	-1	-2]

[-4	-2	0	0	0	1	0]

[-5	-3	-1	1	0	0	0]

If this was easy

Problem 2:

Modify your code from Problem 1 to back-trace through the score matrix and print out the final alignment. Hint: For the back-tracing, you'll want to use a while loop (or several of them).

In [6]:
def nw_backtrace(seq1, seq2):
    # Calculate score table using the function we wrote earlier
    score = needleman_wunsch(seq1, seq2)
    
    # Create variables to store alignment
    align1 = ""
    align2 = ""
    
    # Start from the bottom right cell in matrix
    i = len(seq2)
    j = len(seq1)
    
    # We'll use i and j to keep track of where we are in the matrix, just like above
    while i > 0 and j > 0: # end touching the top or the left edge
        score_current = score[i][j]
        score_diagonal = score[i-1][j-1]
        score_up = score[i][j-1]
        score_left = score[i-1][j]
        
        # Check to figure out which cell the current score was calculated from,
        # then update i and j to correspond to that cell.
        if score_current == score_diagonal + match_score(seq1[j-1], seq2[i-1]):
            align1 += seq1[j-1]
            align2 += seq2[i-1]
            i -= 1
            j -= 1
        elif score_current == score_up + gap_penalty:
            align1 += seq1[j-1]
            align2 += '-'
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1 += '-'
            align2 += seq2[i-1]
            i -= 1

    # Finish tracing up to the top left cell
    while j > 0:
        align1 += seq1[j-1]
        align2 += '-'
        j -= 1
    while i > 0:
        align1 += '-'
        align2 += seq2[i-1]
        i -= 1
    
    # Since we traversed the score matrix from the bottom right, our two sequences will be reversed.
    # These two lines reverse the order of the characters in each sequence.
    align1 = align1[::-1]
    align2 = align2[::-1]
    
    return(align1, align2)

output1, output2 = nw_backtrace(seq1, seq2)

print(output1 + "\n" + output2)
ATTACA
A-TGCT