I recently had to implement the kMeans algorithm to cluster genes based on their profiles for one of my bioinformatics homework assignments. Even though I had written my own code, I wanted to compare the results against Biopython. For those who do not know, Biopython is a set of Python libraries for writing bioinformatics code. It implements most of the fundamental algorithms in bioinformatics. For me it was a great lifesaver, as I can test my ideas in a few lines of Python code.
On Ubuntu, you can install it with the command: sudo apt-get install python-biopython (on newer releases the Python 3 package is python3-biopython).
Anyway, I searched the net for a code snippet that does kMeans using Biopython and somehow did not find any, so I wrote one myself using the documentation. I thought I would post the code on the blog so that if anyone needs it in the future, it's a Google search away 🙂
Biopython's kMeans code requires the input to be in the format accepted by Michael Eisen's Cluster/TreeView programs, so that needed some data massaging. The code itself is very simple. It passes the data file, the number of clusters, and the number of runs to try (npass) as input. Alternatively, it can pass an array of initial cluster assignments, one entry per gene (initialid), instead of relying on random restarts; that call is left commented out in the snippet.
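    from Bio import Cluster

    # Read the tab-separated data file (Cluster/TreeView format).
    with open("gData1.csv") as f:
        record = Cluster.Record(f)

    numClusters = 4
    numRecords = 12

    # Optional initial assignment: the first 3 genes go to cluster 0, the next 3 to cluster 1, and so on.
    initialId = []
    for i in range(numClusters):
        for j in range(numRecords // numClusters):
            initialId.append(i)

    # To start from initialId instead of using random restarts:
    # (clusterAssignment, totalError, numFound) = record.kcluster(nclusters=numClusters, initialid=initialId)
    # The third return value is the number of times the best solution was found.
    (clusterAssignment, totalError, numFound) = record.kcluster(nclusters=numClusters, npass=10)

    # Group the gene names by the cluster they were assigned to.
    geneNames = record.genename
    g = [[] for i in range(numClusters)]
    for numIndex, i in enumerate(clusterAssignment):
        g[i].append(geneNames[numIndex])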
The input is a tab-separated file whose first two fields are special: the gene ID and the gene name. It looks like this:
    GENEID    NAME     PARAM1    PARAM2
    BLAH1     BLAH2    -0.43     0.3
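The data massaging mostly comes down to writing your profiles out in that layout. A rough sketch using the standard csv module is below; `write_cluster_input` is a hypothetical helper name, and the column names and values are just the placeholders from the example above. I have only shown the header and row layout described here, so check it against your own data before feeding it to Biopython.

```python
import csv
import io


def write_cluster_input(handle, param_names, rows):
    """Write a tab-separated table: gene ID, gene name, then one value per parameter."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["GENEID", "NAME"] + list(param_names))
    for gene_id, name, values in rows:
        writer.writerow([gene_id, name] + list(values))


# Placeholder row taken from the example above; written to memory instead of disk.
buf = io.StringIO()
write_cluster_input(buf, ["PARAM1", "PARAM2"], [("BLAH1", "BLAH2", [-0.43, 0.3])])
text = buf.getvalue()
print(text)
```

To produce a real file, pass an open file handle (for example `open("gData1.csv", "w", newline="")`) instead of the `StringIO` buffer.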
Hope the snippet is useful to someone!