K-Means Clustering Algorithm

January 27, 2010 by Saravanan Thirumuruganathan

Data Mining is one of my favorite areas in CS. I have so far taken courses on intro data mining, web mining and in this semester (Spring 2010) an advanced data mining course using Graphs and Matrices . IMO, one of the problems with data mining and machine learning is that most of the algorithms / concepts are stuffed with mathematics , even though the basic ideas are very simple. So I am hoping to write some simple to read posts on some of the cool topics in data mining (DM for short) /machine learning (ML for short) . So the writing will be very informal and may lack in rigor.

In this first post, I plan to write on K-Means algorithm. But before talking about the algorithm some basics of Data Mining and Machine Learning. Two very common tasks in Data Mining are Classification and Clustering.

Classification

In Classification , we are given some training data where someone manually had classified each data in it to some category (aka class). Our aim is to do the classify the future data (aka testing data) into categories/classes as accurately as possible. Classification is a technique in Supervised Learning. Common practical examples include finding if a tumor is malignant or is a particular credit card transaction fraudulent. You can even consider, handwriting recognition as classification.

Clustering

Clustering , on the other hand , is an unsupervised learning technique . The aim here is to group the data points into clusters such that similar items are lumped together in the same cluster. Here , nobody "trains" the algorithm and it is expected to do a good job. Clustering is one of the important tools in exploratory analysis.

There are two broad types of Clustering. One is called Hierarchical clustering where you try to create a hierarchy of related objects or clusters. One can consider biological taxonomy as a simple example of clustering where similar organisms are lumped into same species . Then all similar species are put into same genus and so on. A more recent example is that of Phylogenetic trees where geneticists find the evolutionary relation between species . Here organisms with common ancestors are clubbed together.

If the above examples seem very biological consider Google News . Google News groups articles on similar topics together. For eg all articles that talk about India’s cricket victory today (around 1000 news articles) are grouped into a single cluster , as their primary topic is very similar . Other sports clusters are similarly formed. Now , since all these sports clusters have a common theme , they will be clubbed and put into a bigger cluster called Sports , Similarly , all articles on Sri Lankan elections are grouped together into a cluster. Then other political clusters are grouped to form the Politics section.

The other major type of Clustering is called partitional or iterative clustering. Consider you are given a bin of colored balls. If you separate them into piles of same colored balls , you are doing a rudimentary partitional clustering. I will talk more about Hierarchical clustering in some future post. Here ,I will talk about K-Means which is the most popular partitional clustering algorithm. It is considered as one of the top 10 data mining algorithms .

Distance Measures

If we want to compare two items we need a notion of similarity or dissimilarity (aka distance). Most of the time, if we know one of these values , we can calculate the other as they related in a antagonistic way. Let us focus on distance. If the items are simple points in a n-dimensional space, then a simple metric will be the Euclidean distance. If we consider p,q as two points ,

$d(p,q) = \sqrt{\displaystyle \sum_{i=1}^{n}{(p_{i}-q_{i})^{2}}}$

There are other distance measures like Manhattan Distance , Mahalanobis distance. Popular similarity measures are Jaccard coefficient , Cosine similarity etc.

Things are a bit complicated when the items are not points but objects with some attributes (for eg news articles). In this we might need to design our own distance measures. For checking if news articles talk about same topic, a simple distance measure can be to find the entities in an article and compare their frequencies. Two articles with common words like India, Bangladesh, Cricket, Victor, Test "might" talk about the same topic. For a real world application , finding a good distance measure requires lot of work – The distance measure should be efficient but also reasonably accurate.

K-Means algorithm

Let us talk about the K-means algorithm. The algorithm accepts two inputs. The data itself, and "k", the number of clusters. We will talk about the implications of specifying "k" later. The output is k clusters with input data partitioned among them.

The aim of K-means (or clustering) is this : We want to group the items into k clusters such that all items in same cluster are as similar to each other as possible. And items not in same cluster are as different as possible. We use the distance measures to calculate similarity and dissimilarity. One of the important concept in K-means is that of centroid. Each cluster has a centroid. You can consider it as the point that is most representative of the cluster. Equivalently, centroid is point that is the "center" of a cluster.

Algorithm

1. Randomly choose k items and make them as initial centroids.
2. For each point, find the nearest centroid and assign the point to the cluster associated with the nearest centroid.
3. Update the centroid of each cluster based on the items in that cluster. Typically, the new centroid will be the average of all points in the cluster.
4. Repeats steps 2 and 3, till no point switches clusters.

As you can see, the algorithm is extremely simple. After some iterations, we will get k-clusters within which each points are similar. It is quite obvious that the algorithm should work. There is also a formal proof for convergence based on error criterion but we will skip it.

Some Random Points

1. The algorithm converges even though sometimes it may take exponential time. Although, in practice, it converges very fast.
2. The centroid of a cluster need not correspond to any point in the cluster. Initially, the centroid is a point in the cluster but after some iterations it need not be the case. If you force the algorithm to always select an item in the cluster as its centroid then we get the k-medoids algorithm.
3. There are lot of good things about this algorithm. It is simple to understand, easy to code , scalable , efficient etc.
4. Let N be the number of data points. k be the number of clusters. i be the number of iterations. The time complexity is O(Nki).
5. Although not very obvious, some times, the K-Means algorithm gets stuck in a local minima and might return a bad clustering.
6. The selection of initial points is very critical to avoid local minima. Basic algorithm selects them randomly but much better heuristics exist. A good set of initial points are important as it influences both quality of clusters and also the running time of the algorithm.
7. Finding the correct number of clusters is one of the big problems in K-means. A wrong value of k can give you a sub optimal result. There are many techniques to predict the number of k , for eg by "preprocessing" the data using hierarchical clustering etc.
8. K-means is lot in common with other algorithms like PCA and EM . Both of them are really cool ideas and I will post about them sometime in the future. One interesting tidbit is that professor Chris Ding who is taking my Advanced Data Mining course , proved the equivalence of K-means and PCA.

My Experience With K-Means

As my data mining course project, I did Clustering of wikipedia results. The task is something like this : Lets say you search for the word Jaguar in wikipedia. It will return a list of articles that contain Jaguar. The word Jaguar has multiple interpretation – as an animal , as a car , or as a version of Mac OS . My project took these pages as input and applied K-means algorithm on the pages . It will group pages that talk about similar concepts and more interestingly , it will also try to give a name for the cluster. (One of the thing about K-means is that it forms the clusters based on similarity – interpreting what each cluster means is up to us ) . So for Jaguar, it will group all articles that talk about Jaguar as a car into one cluster and so on.

Infact, it worked well even if you give a more abstract query term like Obama or McCain . (Remember, it was written in Fall 2008 , when there was US election fever 🙂 ) . So for McCain, it will group pages into clusters talking about his military career, senate career, presidential election and so on. That project was real fun and I learnt about lot of things in data mining while doing it. In case you are interested, the project report is here.

Code

I wrote a simple code in java to do clustering but I found a more generic code at Antonio Gulli’s blog . The code is here.

51 Responses

on February 4, 2010 at 10:58 pm | Reply Introduction to PCA – Part 1 « God, Your Book Is Great !!

[…] got lot of useful feedback on my post on K-means. Most common were to discuss the basic intuition in more detail and to include more figures […]
on April 1, 2010 at 9:36 pm | Reply Introduction To Mean Shift Algorithm « God, Your Book Is Great !!

[…] : I have discussed K-Means at K-Means Clustering Algorithm. You can use it to brush it up if you […]
on June 17, 2010 at 1:29 am | Reply Analyzing Blog Traffic using WordPress Stats API « God, Your Book Is Great !!

[…] at a level higher that the typical user and I have to explain it better. For eg , in my post of K-Means , this was the reason I explain supervised and unsupervised learning at the start even though most […]
on July 21, 2010 at 11:10 am | Reply nghiahiep

Hi!
I am student from VietNam. I have project with K-mean. Would you can send me your’s project source code?. Thank you so much.
Sorry about my’s English not good.
My email: nghiahiep512@gmail.com
Hope to receive your email soon.
Thank
- on July 21, 2010 at 1:51 pm | Reply Saravanan Thirumuruganathan
  
  Hello nghiahiep,
  
  I lost my code during a hard disk failure. I guess one possibility is to use the Antonio Gulli’s code to which i have linked in the blog . If you search the net, you should be able to find lot of code samples ! Sorry about not being able to help you !
on July 22, 2010 at 4:19 am | Reply nghiahiep

Oh,no problem. Thank you for help me!
on October 21, 2010 at 4:41 am | Reply digambar

Hello sir,
We are doing project on cbir in vb.net using K-mean. plz send me project source code?. Thank you so much.
- on October 22, 2010 at 1:36 pm | Reply Saravanan Thirumuruganathan
  
  Hello digambar,
  
  As I said in one my previous comments, I lost the code during a hard disk failure. I have linked Antonio Gulli’s code. You can always search the net for additional code samples.
on October 25, 2010 at 8:44 am | Reply neha

hi can u help me regarding hierarchical clustering algo imprementation
- on October 27, 2010 at 11:00 pm | Reply Saravanan Thirumuruganathan
  
  Hello Neha,
  
  I am not free at the moment. I am sure you will be able to find lot of sample code and tutorials on Heirarchical clustering. If you are in interested in tutorial/theoretical pointers, I can provide some of them.
on October 30, 2010 at 2:17 pm | Reply ranjana

could u plz help me out by sending the limitations of k means n different techniques to improve the algo…hv got semminar on these topics
- on November 2, 2010 at 6:30 pm | Reply Saravanan Thirumuruganathan
  
  Hi ranjana,
  
  you may want to take a look at http://geomblog.blogspot.com/search/label/clustering for a detailed discussion about different aspects of KMeans. Some of the limitations are : Choice of k, hard assignment of points, assuming spherical shapes etc. But there are different techniques to overcome each of them. For eg, you can use elbow method to choose k , use soft clustering , gaussian mixtures etc
on November 2, 2010 at 5:09 pm | Reply jaggu

sir can u please send me sample c code for kmeans clustering
algorithm ..its urgent
- on November 2, 2010 at 6:22 pm | Reply Saravanan Thirumuruganathan
  
  Hi Jaggu,
  
  As I have said in previous comments, I do not have the code for KMeans. I have linked an alternate C++ code. You may use it if you want.
on November 3, 2010 at 9:13 am | Reply jaggu

thanks for responding sir..
on December 24, 2010 at 12:07 am | Reply raji87bdvt@gmail.com

hi sir am doing mtech in davangere so i decided do the technical seminar on the topic k-means clustering algorithm so can you send me that code for it ether in java or c
please…………
on December 24, 2010 at 12:09 am | Reply raji87bdvt@gmail.com

hi sir am doing mtech in davangere so i decided do the technical seminar on the topic k-means clustering algorithm so can you send me that code for it ether in java or c
please…………
- on December 24, 2010 at 8:50 pm | Reply Saravanan Thirumuruganathan
  
  Hi Raji,
  
  As I have said in previous comments, I do not have the code for KMeans. I have linked an alternate C++ code from Antonio Gulli’s blog. You may use it if you want.
on March 29, 2011 at 2:41 pm | Reply hank

I am very interesting in clustering wikipedia search, can you share your sourcecode, or point me to the website where I can get some information. thanks.
- on March 29, 2011 at 5:02 pm | Reply Saravanan Thirumuruganathan
  
  Hank,
  
  As I said in one of the previous comments, I lost the code in a hard disk failure but it was not very hard to implement. The two challenges were to find a suitable “distance” measure and proper cluster initialization and scaling it to run for millions of records. Let me know if you need any specific help.
on March 30, 2011 at 8:07 am | Reply Hank

I am still trying to find a suitable “distance” measure for each vector, there are vectors for each document, 1 means words in dictionary, 0 means not, if there are 650 documents, then there will be 650 vectors corresponding to each, cosin will help me to get similarity rate between two documents, but I dont know how to use k means method to cluster these docs from here

below article help me but I am still finding the answer:

http://www.scribd.com/doc/32930615/Clustering-Unstructured-Data-Flat-Files-An-Implementation-in-Text-Mining-Tool
- on April 7, 2011 at 11:17 am | Reply Saravanan Thirumuruganathan
  
  Hank,
  
  I do not think that metric alone will give great results. You may also want to use other meta information like their wikipedia category, no of hops it takes to reach one url to another and so on. Finding a good distance metric is pretty hard. You may want to take a look at my project report I linked in the post.
on May 20, 2011 at 11:50 am | Reply Ravi

This is exactly the type of explanation I was looking for. Amazing job! Have you ever considered writing a book? You should! 🙂
- on May 22, 2011 at 9:15 pm | Reply Saravanan Thirumuruganathan
  
  Ravi,
  
  Thanks for the kind words. One of the original reason for starting the blog is to write about ml / data mining algorithms in simple language. But somehow got busy with research. May be i will try to write one post on ML a week.
  
  Lets see how it goes !
on June 20, 2011 at 2:55 am | Reply Sowmiya Rajagopalan

hi am sowmiya and doing my first course in data mining. Am new to this field ya. And need to submit a project end of this week. Almost done working on half the project. But the final part of the project is to do k-means algorithm same as what you ve done in ur project. Enaku romba helpful ah irukum if u can send me ur java code for your project. my email id is sowmiyaraj88@gmail.com . Can you help me plz. Thanks a lot. 🙂

Sowmiya
on June 28, 2011 at 12:36 am | Reply Anonymous

Hi saravanan,

This is revathi.

I want to do phd in datamining. could you suggest the topics in clustering
- on June 29, 2011 at 4:38 pm | Reply Saravanan Thirumuruganathan
  
  Hi Revathi,
  
  Sent a mail to your email id about the topics.
on June 29, 2011 at 9:01 am | Reply Anonymous

Revathi

Hi saravanan,

I want to do phd in K-means clustering
Can you suggest the topics
Regards, Revathi
on July 13, 2011 at 5:25 am | Reply dung

Hi you! you have opensource of SVM-KNN for me!
on July 22, 2011 at 10:20 am | Reply Mo Firouz

Hi!

I am implementing the k-medoids algorithm. I was wondering if you could tell me whether you had published any papers on this?

Thanks,
- on July 25, 2011 at 6:40 pm | Reply Saravanan Thirumuruganathan
  
  @Mo Firouz,
  
  I have not released any papers on K-Medoids. I would suggest you to search in Google Scholar. Also you can check in Suresh Venkatasubramanian’s blog. I have linked this site in the references.
- on December 15, 2013 at 3:59 am | Reply Adiran
  
  Hey , can u give me k-medoids algorithm ? i need that …. pls ..
on September 25, 2011 at 11:51 am | Reply nidhya

i need k mean clustering code in c language but the input should be in characters not the integers. plz help me
on November 7, 2011 at 8:07 am | Reply Shambu

hey,I am working on Modified K-Means.Do u have any articles or some related material related to this topic ??
- on November 19, 2011 at 2:43 am | Reply Saravanan Thirumuruganathan
  
  Hi,
  
  I would suggest you to take a look at Suresh’s Geomblog (given in the More readings section). He also conducted a entire graduate level seminar on clustering. hope this is useful.
on November 7, 2011 at 1:33 pm | Reply Anonymous

hey ,,,some has deleted my comments…;)
- on November 19, 2011 at 2:44 am | Reply Saravanan Thirumuruganathan
  
  Hi,
  
  Comments are sent to the moderator for approval. I try to approve 2-3 times a day and hence this causes the delay.
on December 3, 2011 at 2:34 pm | Reply Humberto Fioravante Ferro

Hi, Thirumuruganathan!
Your article is a good job. Thanks for making it available, I enjoyed it a lot!
on January 15, 2012 at 3:23 am | Reply koushik

hi,
this is koushik
I am currently doing project in data mining and i need to cluster web pages using k-means fuzzy algorithm , plz help me with the java code of this algorithm.
Moreover am having problem relating to the opening of your project report showing this message-“This file is not plain text (only UTF-8 and Latin-1 text encodings are currently supported).” plz help me out with this problem too!

thanks in advance:-)
on March 16, 2012 at 2:56 pm | Reply chandan

HI,
Sarvanan

i want to know that why mean shift is better than other clustering algorithm.
- on March 18, 2012 at 8:25 pm | Reply Saravanan Thirumuruganathan
  
  Chandan,
  
  Mean shift by itself is not necessarily better than other clustering algorithms. The data distribution defines which is best. For a brief discussion of the pros and cons please see http://metaoptimize.com/qa/questions/668/why-use-k-means-in-place-of-mean-shift-clustering
on July 24, 2012 at 11:52 am | Reply shiva

Hi,
I am looking out for neuro-fuzzy clustering algorithms. Please share the link to the same. Also share code if u have come across. Also, I am looking for web log data sets. please share the link
shiva
on September 7, 2012 at 3:35 am | Reply Anonymous

Excellent explanation!
Richa
on November 6, 2012 at 1:50 am | Reply isha

hi.. this is isha…. i want to do a project in clustering… my moto is to compare the other clustering methods nd kmeans… nd hav a comparitive study…. wil it b a gu projct??? pls suggest…. u sum1 have differnt sggstions then please suggest me… send me email… id is: isha.ganguli@gmail.com
on December 27, 2012 at 11:22 am | Reply sraves

hai this is sraves i am doing my project in kmeaans custering so can you please send me the code related to kmeans clustering for an image please i tried a lot but not yet geting.so,please help me and for suggestions plz send me at psravanthi73@gmail.com
on March 13, 2013 at 11:09 pm | Reply tumblr webcam girls

What’s up to every body, it’s my first visit of this web site;
this webpage carries remarkable and actually excellent stuff in support of readers.
on April 14, 2013 at 9:43 pm | Reply flight distance calculator

Excellent blog you have here but I was wondering if
you knew of any community forums that cover the same topics talked about here?

I’d really like to be a part of group where I can get feed-back from other knowledgeable individuals that share the same interest. If you have any recommendations, please let me know. Thanks!
on April 17, 2013 at 11:27 am | Reply Exposed Aggregate

I create a leave a response whenever I appreciate
a post on a site or if I have something to add to
the discussion. Usually it is caused by the passion communicated in the article I browsed.
And on this post K-Means Clustering Algorithm | God, Your Book Is
Great !!. I was moved enough to leave a leave a responsea response 😉 I do
have 2 questions for you if you do not mind.

Could it be just me or do a few of the responses appear
as if they are written by brain dead folks?
😛 And, if you are writing at other sites, I would like to follow anything fresh you have to post.
Would you make a list the complete urls of your communal sites like your linkedin profile, Facebook page
or twitter feed?
on April 19, 2013 at 2:53 pm | Reply free video porn free

I do not know if it’s just me or if perhaps everyone else encountering issues with your website. It looks like some of the text in your posts are running off the screen. Can someone else please comment and let me know if this is happening to them as well? This could be a problem with my web browser because I’ve had this happen before.
Many thanks
on April 23, 2013 at 9:19 pm | Reply Galen

I loved as much as you will receive carried
out right here. The sketch is attractive, your authored subject matter stylish.
nonetheless, you command get got an impatience over that you wish be delivering the following.
unwell unquestionably come more formerly again as exactly the
same nearly a lot often inside case you shield this
increase.
on October 2, 2013 at 2:57 am | Reply Clara

hello Saravanan,

thank you so much for this post.
Am masters student and I wanna do my research on clustering.
Can u suggest me any topic on k-means?

Thanks in advance

Clara