Data Mining is one of my favorite areas in CS. I have so far taken courses on intro data mining, web mining, and, this semester (Spring 2010), an advanced data mining course using graphs and matrices. IMO, one of the problems with data mining and machine learning is that most of the algorithms/concepts are stuffed with mathematics, even though the basic ideas are very simple. So I am hoping to write some easy-to-read posts on some of the cool topics in data mining (DM for short) / machine learning (ML for short). The writing will be very informal and may lack rigor.
In this first post, I plan to write about the K-Means algorithm. But before talking about the algorithm, some basics of data mining and machine learning. Two very common tasks in data mining are Classification and Clustering.
In Classification, we are given some training data in which someone has manually classified each item into some category (aka class). Our aim is to classify future data (aka testing data) into those categories/classes as accurately as possible. Classification is a supervised learning technique. Common practical examples include finding whether a tumor is malignant, or whether a particular credit card transaction is fraudulent. You can even consider handwriting recognition as classification.
Clustering, on the other hand, is an unsupervised learning technique. The aim here is to group the data points into clusters such that similar items are lumped together in the same cluster. Here, nobody "trains" the algorithm, yet it is expected to do a good job. Clustering is one of the important tools in exploratory analysis.
There are two broad types of clustering. One is called Hierarchical clustering, where you try to create a hierarchy of related objects or clusters. One can consider biological taxonomy as a simple example: similar organisms are lumped into the same species, then all similar species are put into the same genus, and so on. A more recent example is that of phylogenetic trees, where geneticists find the evolutionary relations between species. Here, organisms with common ancestors are clubbed together.
If the above examples seem too biological, consider Google News. Google News groups articles on similar topics together. For example, all articles that talk about India's cricket victory today (around 1000 news articles) are grouped into a single cluster, as their primary topic is very similar. Other sports clusters are formed similarly. Since all these sports clusters have a common theme, they are clubbed together into a bigger cluster called Sports. Similarly, all articles on the Sri Lankan elections are grouped into one cluster, and the other political clusters are grouped to form the Politics section.
The other major type of clustering is called partitional or iterative clustering. Suppose you are given a bin of colored balls. If you separate them into piles of same-colored balls, you are doing a rudimentary partitional clustering. I will talk more about hierarchical clustering in some future post. Here, I will talk about K-Means, which is the most popular partitional clustering algorithm. It is considered one of the top 10 data mining algorithms.
If we want to compare two items, we need a notion of similarity or dissimilarity (aka distance). Most of the time, if we know one of these values, we can calculate the other, as they are related in an antagonistic way. Let us focus on distance. If the items are simple points in an n-dimensional space, then a simple metric is the Euclidean distance. If we consider p, q as two points, the distance is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2).
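For concreteness, here is a minimal Python sketch of the Euclidean distance (the function name is mine, just for illustration):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # prints 5.0
```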
Things are a bit more complicated when the items are not points but objects with attributes (for example, news articles). In that case we might need to design our own distance measures. For checking if news articles talk about the same topic, a simple distance measure could be to find the entities in each article and compare their frequencies. Two articles with common words like India, Bangladesh, Cricket, Victory, Test "might" talk about the same topic. For a real-world application, finding a good distance measure requires a lot of work – the distance measure should be efficient but also reasonably accurate.
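As a rough sketch of such a measure (a toy example of my own, not from any real news system), one could treat each article as a bag of word counts and use one minus the cosine similarity of the counts as the distance:

```python
from collections import Counter

def word_counts(text):
    """Bag-of-words frequency vector for a piece of text."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity of the two word-frequency vectors."""
    ca, cb = word_counts(a), word_counts(b)
    dot = sum(ca[w] * cb[w] for w in set(ca) & set(cb))
    norm_a = sum(v * v for v in ca.values()) ** 0.5
    norm_b = sum(v * v for v in cb.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# articles on the same topic share words, so their distance is smaller
d_same = cosine_distance("India beat Bangladesh in the cricket test",
                         "Cricket victory for India over Bangladesh")
d_diff = cosine_distance("India beat Bangladesh in the cricket test",
                         "Sri Lankan elections held today")
```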
Let us talk about the K-Means algorithm. The algorithm accepts two inputs: the data itself, and "k", the number of clusters. We will talk about the implications of specifying "k" later. The output is k clusters with the input data partitioned among them.
The aim of K-Means (or clustering) is this: we want to group the items into k clusters such that all items in the same cluster are as similar to each other as possible, and items not in the same cluster are as different as possible. We use the distance measure to calculate similarity and dissimilarity. One of the important concepts in K-Means is that of the centroid. Each cluster has a centroid. You can consider it as the point that is most representative of the cluster. Equivalently, the centroid is the point at the "center" of the cluster.
1. Randomly choose k items and make them the initial centroids.
2. For each point, find the nearest centroid and assign the point to the cluster associated with the nearest centroid.
3. Update the centroid of each cluster based on the items in that cluster. Typically, the new centroid will be the average of all points in the cluster.
4. Repeat steps 2 and 3 until no point switches clusters.
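The four steps above can be sketched in Python as follows (a toy implementation on 2-D points; the names and the squared-distance helper are my own):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance (enough for nearest-centroid comparisons)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Basic K-Means; returns (centroids, cluster index per point)."""
    random.seed(0)  # fixed seed so the toy run is reproducible
    centroids = random.sample(points, k)     # step 1: random initial centroids
    assignments = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # step 2: assign each point to the cluster of its nearest centroid
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            if assignments[i] != nearest:
                assignments[i] = nearest
                changed = True
        # step 3: recompute each centroid as the mean of its cluster's points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
        if not changed:  # step 4: stop once no point switches clusters
            break
    return centroids, assignments

# two well-separated groups of points should end up in two clusters
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, labels = kmeans(pts, 2)
```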
As you can see, the algorithm is extremely simple. After some iterations, we will get k clusters within which the points are similar. It is intuitively plausible that the algorithm should work, and there is also a formal proof of convergence based on an error criterion, but we will skip it.
Some Random Points
1. The algorithm converges, even though it may sometimes take exponential time. In practice, though, it converges very fast.
2. The centroid of a cluster need not correspond to any point in the cluster. Initially, each centroid is a data point, but after some iterations this need not be the case. If you force the algorithm to always select an item in the cluster as its centroid, you get the k-medoids algorithm.
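To make the k-medoids distinction concrete, here is a tiny sketch (my own illustration) of how a medoid is picked – unlike a centroid, it is always an actual member of the cluster:

```python
def medoid(cluster_points, dist):
    """The member of the cluster minimizing total distance to all members."""
    return min(cluster_points,
               key=lambda p: sum(dist(p, q) for q in cluster_points))

pts = [(0, 0), (0, 2), (1, 1), (10, 10)]
sq_dist = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
print(medoid(pts, sq_dist))  # prints (1, 1), the most central actual point
```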
3. There are a lot of good things about this algorithm: it is simple to understand, easy to code, scalable, efficient, etc.
4. Let N be the number of data points, k the number of clusters, and i the number of iterations. The time complexity is O(Nki).
5. Although it is not very obvious, the K-Means algorithm sometimes gets stuck in a local minimum and returns a bad clustering.
6. The selection of initial points is critical for avoiding local minima. The basic algorithm selects them randomly, but much better heuristics exist. A good set of initial points is important, as it influences both the quality of the clusters and the running time of the algorithm.
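One well-known such heuristic (not covered above, so treat this as a sketch of the general idea) is k-means++-style seeding: each new initial centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to spread the seeds out:

```python
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """k-means++-style seeding: prefer points far from existing centroids."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # squared distance from each point to its nearest chosen centroid
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points]
        # sample a point with probability proportional to d2
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids

seeds = kmeans_pp_init([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```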
7. Finding the correct number of clusters is one of the big problems in K-Means. A wrong value of k can give you a suboptimal result. There are many techniques to predict a good k, for example by "preprocessing" the data using hierarchical clustering.
8. K-Means has a lot in common with other algorithms like PCA and EM. Both of them are really cool ideas and I will post about them sometime in the future. One interesting tidbit is that Professor Chris Ding, who teaches my Advanced Data Mining course, proved the equivalence of K-Means and PCA.
My Experience With K-Means
For my data mining course project, I clustered Wikipedia search results. The task is something like this: let's say you search for the word Jaguar in Wikipedia. It returns a list of articles that contain Jaguar. The word Jaguar has multiple interpretations – as an animal, as a car, or as a version of Mac OS. My project took these pages as input and applied the K-Means algorithm to them. It groups pages that talk about similar concepts and, more interestingly, it also tries to give a name to each cluster. (One thing about K-Means is that it forms the clusters based on similarity – interpreting what each cluster means is up to us.) So for Jaguar, it groups all articles that talk about Jaguar as a car into one cluster, and so on.
In fact, it worked well even for a more abstract query term like Obama or McCain. (Remember, it was written in Fall 2008, when there was US election fever 🙂 ) So for McCain, it groups pages into clusters about his military career, senate career, presidential election, and so on. That project was real fun and I learned a lot about data mining while doing it. In case you are interested, the project report is here.
Update [Apr 23, 2010]:
In my blog’s dashboard, I see a lot of people coming to this post with the query "how to select k". There are two good posts that discuss this issue. You can check them out at Geomblog – Elbow Method and Diminishing Returns/ROC curve.
If you liked this post, please subscribe to the RSS feed.