Here we store material for the course CB2442, Bioinformatics.
This project is maintained by kth-gt
Hierarchical clustering is widely used in bioinformatics. In the CB2442 course, you have encountered it both in the Sequence feature module and in the Phylogenetics module. In this lab, your task is to write a function that, given a pairwise distance matrix and a list of names of the corresponding objects (in this case, sequences), performs hierarchical clustering using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) method. The function should output the result as a tree in the Newick format. Before starting, please read this thoroughly
Begin with downloading the project to your local computer by using this link.
Unzip the files into a directory and open the directory in VScode.
$ unzip 'kth-gt cb2442 main prog-p1.zip'
$ code .
In the labp4.py
file, modify the function
def upgma(dist_matr, names_list):
that takes a pairwise distance matrix (as a 2-dimensional numpy array) and a list of sequence names (as a list) as input and should return a tree in Newick format (as a string). The output tree should not include branch lengths. Also, set the list authors
in the beginning of the file to contain the group members’ names.
Initialization:
nwk_list
: A list containing the names of the sequences (e.g., ['S1', 'S2', 'S3', ...]
).cluster_list
: A list of lists, each with a single sequence index (e.g., [[0], [1], [2], ...]
). This is the same as the nwk_list
but it has indices instead of names.updated_dist_matr
: A copy of the original distance matrix (dist_matr
) that will be modified during clustering.Iterative Clustering:
cluster_list
:
updated_dist_matr
. Remember that each row/column represents a cluster![0,1]
cluster_list
. [[0], [1], [2], ...]
–> [[2], ...]
cluster_list
. [[2], [0,1], ...]
nwk_list
:
'(S1,S2)'
).nwk_list
.nwk_list
.updated_dist_matr
:
cluster_list
, the nwk_list
, and the updated_dist_matr
to check they are changing according to your plan.i=j
?for
loops, and make sure they are behaving as you want. Python starts at 0
and finishes at len-1
by default, but it can be changed.The last step is to update the distance matrix by filling the row and column initialized at values of 0
.
For each other cluster:
dist_matr
).This gives you the average distance between the new cluster and the other cluster.
To be more clear: If the new cluster consists of the original indices [0,1]
, you start by computing the average distance between 0
and 2
, and then between 1
and 2
(where 2
is other_cluster
). This average distance is a value that goes in the updated_dist_matr
at the index [new_cluster, other_cluster]
… Remember that the matrix is symmetric though! Repeat for every other cluster.
You can make an initial execution of your upgma
function by running the main function of the python file itself by executing the line
$ python3 labp4.py
However, the final test of the code is done by executing the runner4.py
executable, which can be executed from command line as
$ python3 runner4.py
or just
$ ./runner4.py
This executes the upgma
function in labp4.py
using a couple of test inputs and validates the results.
If you have implemented the function correctly, you will see your names appear.
Good luck with the lab!