Data Mining is one of my favorite areas in CS. I have so far taken courses on intro data mining, web mining and in this semester (Spring 2010) an advanced data mining course using Graphs and Matrices . IMO, one of the problems with data mining and machine learning is that most of the algorithms / concepts are stuffed with mathematics , even though the basic ideas are very simple. So I am hoping to write some simple to read posts on some of the cool topics in data mining (DM for short) /machine learning (ML for short) . So the writing will be very informal and may lack in rigor.
In this first post, I plan to write on K-Means algorithm. But before talking about the algorithm some basics of Data Mining and Machine Learning. Two very common tasks in Data Mining are Classification and Clustering.
Classification
In Classification , we are given some training data where someone manually had classified each data in it to some category (aka class). Our aim is to do the classify the future data (aka testing data) into categories/classes as accurately as possible. Classification is a technique in Supervised Learning. Common practical examples include finding if a tumor is malignant or is a particular credit card transaction fraudulent. You can even consider, handwriting recognition as classification.
Clustering
Clustering , on the other hand , is an unsupervised learning technique . The aim here is to group the data points into clusters such that similar items are lumped together in the same cluster. Here , nobody "trains" the algorithm and it is expected to do a good job. Clustering is one of the important tools in exploratory analysis.
There are two broad types of Clustering. One is called Hierarchical clustering where you try to create a hierarchy of related objects or clusters. One can consider biological taxonomy as a simple example of clustering where similar organisms are lumped into same species . Then all similar species are put into same genus and so on. A more recent example is that of Phylogenetic trees where geneticists find the evolutionary relation between species . Here organisms with common ancestors are clubbed together.
If the above examples seem very biological consider Google News . Google News groups articles on similar topics together. For eg all articles that talk about India’s cricket victory today (around 1000 news articles) are grouped into a single cluster , as their primary topic is very similar . Other sports clusters are similarly formed. Now , since all these sports clusters have a common theme , they will be clubbed and put into a bigger cluster called Sports , Similarly , all articles on Sri Lankan elections are grouped together into a cluster. Then other political clusters are grouped to form the Politics section.
The other major type of Clustering is called partitional or iterative clustering. Consider you are given a bin of colored balls. If you separate them into piles of same colored balls , you are doing a rudimentary partitional clustering. I will talk more about Hierarchical clustering in some future post. Here ,I will talk about K-Means which is the most popular partitional clustering algorithm. It is considered as one of the top 10 data mining algorithms .
Distance Measures
If we want to compare two items we need a notion of similarity or dissimilarity (aka distance). Most of the time, if we know one of these values , we can calculate the other as they related in a antagonistic way. Let us focus on distance. If the items are simple points in a n-dimensional space, then a simple metric will be the Euclidean distance. If we consider p,q as two points ,
There are other distance measures like Manhattan Distance , Mahalanobis distance. Popular similarity measures are Jaccard coefficient , Cosine similarity etc.
Things are a bit complicated when the items are not points but objects with some attributes (for eg news articles). In this we might need to design our own distance measures. For checking if news articles talk about same topic, a simple distance measure can be to find the entities in an article and compare their frequencies. Two articles with common words like India, Bangladesh, Cricket, Victor, Test "might" talk about the same topic. For a real world application , finding a good distance measure requires lot of work – The distance measure should be efficient but also reasonably accurate.
K-Means algorithm
Let us talk about the K-means algorithm. The algorithm accepts two inputs. The data itself, and "k", the number of clusters. We will talk about the implications of specifying "k" later. The output is k clusters with input data partitioned among them.
The aim of K-means (or clustering) is this : We want to group the items into k clusters such that all items in same cluster are as similar to each other as possible. And items not in same cluster are as different as possible. We use the distance measures to calculate similarity and dissimilarity. One of the important concept in K-means is that of centroid. Each cluster has a centroid. You can consider it as the point that is most representative of the cluster. Equivalently, centroid is point that is the "center" of a cluster.
Algorithm
1. Randomly choose k items and make them as initial centroids.
2. For each point, find the nearest centroid and assign the point to the cluster associated with the nearest centroid.
3. Update the centroid of each cluster based on the items in that cluster. Typically, the new centroid will be the average of all points in the cluster.
4. Repeats steps 2 and 3, till no point switches clusters.
As you can see, the algorithm is extremely simple. After some iterations, we will get k-clusters within which each points are similar. It is quite obvious that the algorithm should work. There is also a formal proof for convergence based on error criterion but we will skip it.
Some Random Points
1. The algorithm converges even though sometimes it may take exponential time. Although, in practice, it converges very fast.
2. The centroid of a cluster need not correspond to any point in the cluster. Initially, the centroid is a point in the cluster but after some iterations it need not be the case. If you force the algorithm to always select an item in the cluster as its centroid then we get the k-medoids algorithm.
3. There are lot of good things about this algorithm. It is simple to understand, easy to code , scalable , efficient etc.
4. Let N be the number of data points. k be the number of clusters. i be the number of iterations. The time complexity is O(Nki).
5. Although not very obvious, some times, the K-Means algorithm gets stuck in a local minima and might return a bad clustering.
6. The selection of initial points is very critical to avoid local minima. Basic algorithm selects them randomly but much better heuristics exist. A good set of initial points are important as it influences both quality of clusters and also the running time of the algorithm.
7. Finding the correct number of clusters is one of the big problems in K-means. A wrong value of k can give you a sub optimal result. There are many techniques to predict the number of k , for eg by "preprocessing" the data using hierarchical clustering etc.
8. K-means is lot in common with other algorithms like PCA and EM . Both of them are really cool ideas and I will post about them sometime in the future. One interesting tidbit is that professor Chris Ding who is taking my Advanced Data Mining course , proved the equivalence of K-means and PCA.
My Experience With K-Means
As my data mining course project, I did Clustering of wikipedia results. The task is something like this : Lets say you search for the word Jaguar in wikipedia. It will return a list of articles that contain Jaguar. The word Jaguar has multiple interpretation – as an animal , as a car , or as a version of Mac OS . My project took these pages as input and applied K-means algorithm on the pages . It will group pages that talk about similar concepts and more interestingly , it will also try to give a name for the cluster. (One of the thing about K-means is that it forms the clusters based on similarity – interpreting what each cluster means is up to us ) . So for Jaguar, it will group all articles that talk about Jaguar as a car into one cluster and so on.
Infact, it worked well even if you give a more abstract query term like Obama or McCain . (Remember, it was written in Fall 2008 , when there was US election fever 🙂 ) . So for McCain, it will group pages into clusters talking about his military career, senate career, presidential election and so on. That project was real fun and I learnt about lot of things in data mining while doing it. In case you are interested, the project report is here.
Code
I wrote a simple code in java to do clustering but I found a more generic code at Antonio Gulli’s blog . The code is here.
More Readings
There are lot of good resources on K-Means. One of them is Geomblog’s post on k-means. This blog also contains some posts on Clustering.
Update : [Apr 23 2010] :
In my blog’s dashboard, I see lot of people coming to my post with the query, how to select k. There are two good posts that discuss about this issue. You can check them out at Geomblog – Elbow Method and Diminishing Returns/ROC curve.
If you liked this post , please subscribe to the RSS feed.
[…] got lot of useful feedback on my post on K-means. Most common were to discuss the basic intuition in more detail and to include more figures […]
[…] : I have discussed K-Means at K-Means Clustering Algorithm. You can use it to brush it up if you […]
[…] at a level higher that the typical user and I have to explain it better. For eg , in my post of K-Means , this was the reason I explain supervised and unsupervised learning at the start even though most […]
Hi!
I am student from VietNam. I have project with K-mean. Would you can send me your’s project source code?. Thank you so much.
Sorry about my’s English not good.
My email: nghiahiep512@gmail.com
Hope to receive your email soon.
Thank
Hello nghiahiep,
I lost my code during a hard disk failure. I guess one possibility is to use the Antonio Gulli’s code to which i have linked in the blog . If you search the net, you should be able to find lot of code samples ! Sorry about not being able to help you !
Oh,no problem. Thank you for help me!
Hello sir,
We are doing project on cbir in vb.net using K-mean. plz send me project source code?. Thank you so much.
Hello digambar,
As I said in one my previous comments, I lost the code during a hard disk failure. I have linked Antonio Gulli’s code. You can always search the net for additional code samples.
hi can u help me regarding hierarchical clustering algo imprementation
Hello Neha,
I am not free at the moment. I am sure you will be able to find lot of sample code and tutorials on Heirarchical clustering. If you are in interested in tutorial/theoretical pointers, I can provide some of them.
could u plz help me out by sending the limitations of k means n different techniques to improve the algo…hv got semminar on these topics
Hi ranjana,
you may want to take a look at http://geomblog.blogspot.com/search/label/clustering for a detailed discussion about different aspects of KMeans. Some of the limitations are : Choice of k, hard assignment of points, assuming spherical shapes etc. But there are different techniques to overcome each of them. For eg, you can use elbow method to choose k , use soft clustering , gaussian mixtures etc
sir can u please send me sample c code for kmeans clustering
algorithm ..its urgent
Hi Jaggu,
As I have said in previous comments, I do not have the code for KMeans. I have linked an alternate C++ code. You may use it if you want.
thanks for responding sir..
hi sir am doing mtech in davangere so i decided do the technical seminar on the topic k-means clustering algorithm so can you send me that code for it ether in java or c
please…………
hi sir am doing mtech in davangere so i decided do the technical seminar on the topic k-means clustering algorithm so can you send me that code for it ether in java or c
please…………
Hi Raji,
As I have said in previous comments, I do not have the code for KMeans. I have linked an alternate C++ code from Antonio Gulli’s blog. You may use it if you want.
I am very interesting in clustering wikipedia search, can you share your sourcecode, or point me to the website where I can get some information. thanks.
Hank,
As I said in one of the previous comments, I lost the code in a hard disk failure but it was not very hard to implement. The two challenges were to find a suitable “distance” measure and proper cluster initialization and scaling it to run for millions of records. Let me know if you need any specific help.
I am still trying to find a suitable “distance” measure for each vector, there are vectors for each document, 1 means words in dictionary, 0 means not, if there are 650 documents, then there will be 650 vectors corresponding to each, cosin will help me to get similarity rate between two documents, but I dont know how to use k means method to cluster these docs from here
below article help me but I am still finding the answer:
http://www.scribd.com/doc/32930615/Clustering-Unstructured-Data-Flat-Files-An-Implementation-in-Text-Mining-Tool
Hank,
I do not think that metric alone will give great results. You may also want to use other meta information like their wikipedia category, no of hops it takes to reach one url to another and so on. Finding a good distance metric is pretty hard. You may want to take a look at my project report I linked in the post.
This is exactly the type of explanation I was looking for. Amazing job! Have you ever considered writing a book? You should! 🙂
Ravi,
Thanks for the kind words. One of the original reason for starting the blog is to write about ml / data mining algorithms in simple language. But somehow got busy with research. May be i will try to write one post on ML a week.
Lets see how it goes !
hi am sowmiya and doing my first course in data mining. Am new to this field ya. And need to submit a project end of this week. Almost done working on half the project. But the final part of the project is to do k-means algorithm same as what you ve done in ur project. Enaku romba helpful ah irukum if u can send me ur java code for your project. my email id is sowmiyaraj88@gmail.com . Can you help me plz. Thanks a lot. 🙂
Sowmiya
Hi saravanan,
This is revathi.
I want to do phd in datamining. could you suggest the topics in clustering
Hi Revathi,
Sent a mail to your email id about the topics.
Revathi
Hi saravanan,
I want to do phd in K-means clustering
Can you suggest the topics
Regards, Revathi
Hi you! you have opensource of SVM-KNN for me!
Hi!
I am implementing the k-medoids algorithm. I was wondering if you could tell me whether you had published any papers on this?
Thanks,
@Mo Firouz,
I have not released any papers on K-Medoids. I would suggest you to search in Google Scholar. Also you can check in Suresh Venkatasubramanian’s blog. I have linked this site in the references.
Hey , can u give me k-medoids algorithm ? i need that …. pls ..
i need k mean clustering code in c language but the input should be in characters not the integers. plz help me
hey,I am working on Modified K-Means.Do u have any articles or some related material related to this topic ??
Hi,
I would suggest you to take a look at Suresh’s Geomblog (given in the More readings section). He also conducted a entire graduate level seminar on clustering. hope this is useful.
hey ,,,some has deleted my comments…;)
Hi,
Comments are sent to the moderator for approval. I try to approve 2-3 times a day and hence this causes the delay.
Hi, Thirumuruganathan!
Your article is a good job. Thanks for making it available, I enjoyed it a lot!
hi,
this is koushik
I am currently doing project in data mining and i need to cluster web pages using k-means fuzzy algorithm , plz help me with the java code of this algorithm.
Moreover am having problem relating to the opening of your project report showing this message-“This file is not plain text (only UTF-8 and Latin-1 text encodings are currently supported).” plz help me out with this problem too!
thanks in advance:-)
HI,
Sarvanan
i want to know that why mean shift is better than other clustering algorithm.
Chandan,
Mean shift by itself is not necessarily better than other clustering algorithms. The data distribution defines which is best. For a brief discussion of the pros and cons please see http://metaoptimize.com/qa/questions/668/why-use-k-means-in-place-of-mean-shift-clustering
Hi,
I am looking out for neuro-fuzzy clustering algorithms. Please share the link to the same. Also share code if u have come across. Also, I am looking for web log data sets. please share the link
shiva
Excellent explanation!
Richa
hi.. this is isha…. i want to do a project in clustering… my moto is to compare the other clustering methods nd kmeans… nd hav a comparitive study…. wil it b a gu projct??? pls suggest…. u sum1 have differnt sggstions then please suggest me… send me email… id is: isha.ganguli@gmail.com
hai this is sraves i am doing my project in kmeaans custering so can you please send me the code related to kmeans clustering for an image please i tried a lot but not yet geting.so,please help me and for suggestions plz send me at psravanthi73@gmail.com
What’s up to every body, it’s my first visit of this web site;
this webpage carries remarkable and actually excellent stuff in support of readers.
Excellent blog you have here but I was wondering if
you knew of any community forums that cover the same topics talked about here?
I’d really like to be a part of group where I can get feed-back from other knowledgeable individuals that share the same interest. If you have any recommendations, please let me know. Thanks!
I create a leave a response whenever I appreciate
a post on a site or if I have something to add to
the discussion. Usually it is caused by the passion communicated in the article I browsed.
And on this post K-Means Clustering Algorithm | God, Your Book Is
Great !!. I was moved enough to leave a leave a responsea response 😉 I do
have 2 questions for you if you do not mind.
Could it be just me or do a few of the responses appear
as if they are written by brain dead folks?
😛 And, if you are writing at other sites, I would like to follow anything fresh you have to post.
Would you make a list the complete urls of your communal sites like your linkedin profile, Facebook page
or twitter feed?
I do not know if it’s just me or if perhaps everyone else encountering issues with your website. It looks like some of the text in your posts are running off the screen. Can someone else please comment and let me know if this is happening to them as well? This could be a problem with my web browser because I’ve had this happen before.
Many thanks
I loved as much as you will receive carried
out right here. The sketch is attractive, your authored subject matter stylish.
nonetheless, you command get got an impatience over that you wish be delivering the following.
unwell unquestionably come more formerly again as exactly the
same nearly a lot often inside case you shield this
increase.
hello Saravanan,
thank you so much for this post.
Am masters student and I wanna do my research on clustering.
Can u suggest me any topic on k-means?
Thanks in advance
Clara