Here we store material for the course CB2442, Bioinformatics.
This project is maintained by kth-gt
Your task is to write a function that can align two sequences using the Needleman-Wunch algorithm for global sequence alignment.
Begin with downloading the project to your local computer by using this link.
Unzip the files into a directory and open the directory in VScode.
$ unzip 'kth-gt cb2442 main prog-p2.zip'
$ code .
Set the list authors
to contain all the group members names.
Add a python function to the file labp2.py
named
def initiate_global_dp(len_str_a, len_str_b):
that takes the length of two sequences, a
and b
, as input, and outpus a (dynamic programming) matrix of dimentions [len_str_a,len_str_b]
as well as a tracer matrix, both initiated to perform a Needleman-Wunsch global alignment.
To your aid, you have the function gap()
that returns the penalty of a gap of one amino acid.
Remember that in a Needleman-Wunch Algorithm, we initiate the dynamic programming matrix, $S$, is initiated as:
Furthermore the trace matrix should be initited in such a way that the alignment, once it reaches the first colum or row, goes back to (0,0), i.e. if you follow the suggested notation of the module, you should set trace(x,y,1) = True for first column, and trace(x,y,2) = True for first row (except for 0,0)
The function initiate_global_dp
should return a tuple with the dynamic programming matrix and the trace matrix.
S
:G | C | T | ||
---|---|---|---|---|
0 | 0 | gap*1 | gap*2 | gap*3 |
G | gap*1 | |||
A | gap*2 | |||
T | gap*3 |
gap
represents the gap penalty function gap()
.Think about whether the gap score should be positive or negative!
G | C | T | ||
---|---|---|---|---|
0 | (F,F,F) | (F,F,T) | (F,F,T) | (F,F,T) |
G | (F,T,F) | |||
A | (F,T,F) | |||
T | (F,T,F) |
It is shaped like the Score matrix S
but with an extra dimension. This extra dimension tells us if we had a match/mismatch, insertion or deletion, depending on the position of the “True”.
(F,F,F)
No True –> indicates no match, insertion, or deletion at the origin.(T,F,F)
True at position 0 –> indicates a match/mismatch penalty(F,T,F)
True at position 1 –> indicates a gap penalty (insert). This is done in agreement with how S
was initialized(F,F,T)
True at position 2 –> indicates a gap penalty (deletion)[len_str_a + 1, len_str_b + 1]
, meaning it has an extra row and column to account for the initial gap penalties at the start of the sequences. Be careful with this and make checks to ensure your indices are consistent with this! Think about what part of the matrix is filled, and which part is unfilled. Also, remember that Python starts its index counting at 0, unlike Matlab or Julia.Now implement the function
def global_align(seqA,seqB):
that fills in the rest of the dynamic programming matrix using the recursion:
\[S_{ij} = \max \begin{cases} S_{i-1,j-1} + d(a_i,b_j) & \text{(match/mismatch)} \\ S_{i-1,j} + d(a_i,-) & \text{(deletion)} \\ S_{i,j-1} + d(-,b_j) & \text{(insertion)} \end{cases}\]a_i
: Character at position i
in sequence A
.b_j
: Character at position j
in sequence B
.d(a_i, b_j)
: Substitution cost between characters a_i
and b_j
.
d(a_i, b_j) > 0
when a_i = b_j
. match_score function
d(a_i, b_j) < 0
when a_i ≠ b_j
. match_score function
-
gap function
Example of match score function:
d(A, A) = +1
(match)d(A, G) = -1
(mismatch)Also, fill in the trace matrix, so that it follows the maximal paths through the matrix. I.e. set the trace to be
The function should return the dynamic programming matrix, the trace matrix, and the score of the alignment
You can make an initial execution of your dna2aa
function by running the ain function of the python file itself by executing the line,
$ python labp2.py
However ther final test of the code is done by executing the runnerp2.py
executable, which can be exected from command line as,
$ python runnerp2.py
or just
$ ./runnerp2.py
This executes the code in labp2.py
, and validates the results against some known test vectors.
If you implemented the function right, you will see your names apearing.
If you have extra time, you can try out one or more of the following excercises:
global
indicating if you want a global or local alignment and add the extra code for making a local alignment.