Class 25: Needleman-Wunsch¶

April 28, 2020

In this class, our goal is to implement the Needleman-Wunsch algorithm in Python. You can read more about the Needleman-Wunsch algorithm on Wikipedia. The Wikipedia page contains pseudo-code which you might find helpful.

Since the Needleman-Wunsch algorithm works on a matrix of scores, it is helpful to discuss how to represent a matrix in Python. We can do so by creating a list of lists. Let's say we want to make a matrix that looks like this:

1	3	5	7
2	3	4	5
5	2	20	3

In Python, we would create it as follows:

my_matrix = []
# Fill out the 0th row
my_matrix.append([1, 3, 5, 7])
# Fill out the 1st row
my_matrix.append([2, 3, 4, 5])
# Fill out the 2nd row
my_matrix.append([5, 2, 20, 3])

print(my_matrix)

[[1, 3, 5, 7], [2, 3, 4, 5], [5, 2, 20, 3]]

However, when we print the matrix, it is not easy to read, because it is shown as a list of lists. So it is helpful to define a function print_matrix() that prints a matrix in a nicer form:

def print_matrix(mat):
    # Loop over all rows
    for i in range(0, len(mat)):
        print("[", end = "")
        # Loop over each column in row i
        for j in range(0, len(mat[i])):
            # Print out the value in row i, column j
            print(mat[i][j], end = "")
            # Only add a tab if we're not in the last column
            if j != len(mat[i]) - 1:
                print("\t", end = "")
        print("]\n")

print_matrix(my_matrix)

[1	3	5	7]

[2	3	4	5]

[5	2	20	3]

To retrieve an indiviual value from a matrix, we use code of the form my_matrix[row][column]. So, to retrieve for example the value in the 0th column of the 2nd row, we write:

print("The value in the 2nd row and the 0th column is:", my_matrix[2][0])

The value in the 2nd row and the 0th column is: 5

It will also be helpful to have a function that can create a matrix of zeros.

# A function for making a matrix of zeroes
def zeros(rows, cols):
    return [[0 for i in range(cols)] for j in range(rows)]

zero_matrix = zeros(3, 5) # make zero matrix with 3 rows and 5 columns
print_matrix(zero_matrix)

[0	0	0	0	0]

[0	0	0	0	0]

[0	0	0	0	0]

Problems¶

Problem 1:

Write a function that takes two sequences as input and returns a matrix of Needleman-Wunsch scores. You do not have to do the back-tracing, just fill out the matrix.

Break the problem down into as many small steps as possible. Here are a few hints:

Before you calculate any scores, make an empty matrix of the appropriate size using the zeros() function defined below.
Fill out the 0th row and 0th column before you calculate any other scores.
The max() function will return the maximum value from a list of values. For example max(1, 7, 3) will return 7.
Make liberal use of the range() function.
Use the print_matrix() function to print out your matrix as frequently as possible. Always make sure that your code is doing what you think it's doing!
Remember, in Python, we start counting from 0.

# Use these values to calculate scores
gap_penalty = -1
match_award = 1
mismatch_penalty = -1

# Make a score matrix with these two sequences
seq1 = "ATTACA"
seq2 = "ATGCT"

# A function for determining the score between any two bases in alignment
def match_score(alpha, beta):
    if alpha == beta:
        return match_award
    elif alpha == '-' or beta == '-':
        return gap_penalty
    else:
        return mismatch_penalty

# The function that actually fills out the matrix of scores
def needleman_wunsch(seq1, seq2):
    
    # length of the two sequences
    n = len(seq1)
    m = len(seq2)  
    
    # Generate matrix of zeros to store scores
    score = zeros(m+1, n+1)
   
    # Calculate score table
    
    # Fill out first column
    for i in range(0, m + 1):
        score[i][0] = gap_penalty * i
    
    # Fill out first row
    for j in range(0, n + 1):
        score[0][j] = gap_penalty * j
    
    # Fill out all other values in the score matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Calculate the score by checking the top, left, and diagonal cells
            match = score[i - 1][j - 1] + match_score(seq1[j-1], seq2[i-1])
            delete = score[i - 1][j] + gap_penalty
            insert = score[i][j - 1] + gap_penalty
            # Record the maximum score from the three possible scores calculated above
            score[i][j] = max(match, delete, insert)

    return score

# Test out the needleman_wunsch() function
print_matrix(needleman_wunsch(seq1, seq2))

[0	-1	-2	-3	-4	-5	-6]

[-1	1	0	-1	-2	-3	-4]

[-2	0	2	1	0	-1	-2]

[-3	-1	1	1	0	-1	-2]

[-4	-2	0	0	0	1	0]

[-5	-3	-1	1	0	0	0]

If this was easy¶

Problem 2:

Modify your code from Problem 1 to back-trace through the score matrix and print out the final alignment. Hint: For the back-tracing, you'll want to use a while loop (or several of them).

def nw_backtrace(seq1, seq2):
    # Calculate score table using the function we wrote earlier
    score = needleman_wunsch(seq1, seq2)
    
    # Create variables to store alignment
    align1 = ""
    align2 = ""
    
    # Start from the bottom right cell in matrix
    i = len(seq2)
    j = len(seq1)
    
    # We'll use i and j to keep track of where we are in the matrix, just like above
    while i > 0 and j > 0: # end touching the top or the left edge
        score_current = score[i][j]
        score_diagonal = score[i-1][j-1]
        score_up = score[i][j-1]
        score_left = score[i-1][j]
        
        # Check to figure out which cell the current score was calculated from,
        # then update i and j to correspond to that cell.
        if score_current == score_diagonal + match_score(seq1[j-1], seq2[i-1]):
            align1 += seq1[j-1]
            align2 += seq2[i-1]
            i -= 1
            j -= 1
        elif score_current == score_up + gap_penalty:
            align1 += seq1[j-1]
            align2 += '-'
            j -= 1
        elif score_current == score_left + gap_penalty:
            align1 += '-'
            align2 += seq2[i-1]
            i -= 1

    # Finish tracing up to the top left cell
    while j > 0:
        align1 += seq1[j-1]
        align2 += '-'
        j -= 1
    while i > 0:
        align1 += '-'
        align2 += seq2[i-1]
        i -= 1
    
    # Since we traversed the score matrix from the bottom right, our two sequences will be reversed.
    # These two lines reverse the order of the characters in each sequence.
    align1 = align1[::-1]
    align2 = align2[::-1]
    
    return(align1, align2)

output1, output2 = nw_backtrace(seq1, seq2)

print(output1 + "\n" + output2)

ATTACA
A-TGCT