Feeds:
Posts

## Introduction To Mean Shift Algorithm

Its been quite some time since I wrote a Data Mining post . Today, I intend to post on Mean Shift – a really cool but not very well known algorithm. The basic idea is quite simple but the results are amazing. It was invented long back in 1975 but was not widely used till two papers applied the algorithm to Computer Vision.

I learned this algorithm in my Advanced Data Mining course and I wrote the lecture notes on it. So here I am trying to convert my lecture notes to a post. I have tried to simplify it – but this post is quite involved than the other posts.

It is quite sad that there exists no good post on such a good algorithm. While writing my lecture notes, I struggled a lot for good resources 🙂 . The 3 “classic" papers on Mean Shift are quite hard to understand. Most of the other resources are usually from Computer Vision courses where Mean Shift is taught lightly as yet another technique for vision tasks  (like segmentation) and contains only the main intuition and the formulas.

As a disclaimer, there might be errors in my exposition – so if you find anything wrong please let me know and I will fix it. You can always check out the reference for more details. I have not included any graphics in it but you can check the ppt given in the references for an animation of Mean Shift.

### Introduction

Mean Shift is a powerful and versatile non parametric iterative algorithm that can be used for lot of purposes like finding modes, clustering etc. Mean Shift was introduced in Fukunaga and Hostetler [1] and has been extended to be applicable in other fields like Computer Vision.This document will provide a discussion of Mean Shift , prove its convergence and slightly discuss its important applications.

### Intuitive Idea of Mean Shift

This section provides an intuitive idea of Mean shift and the later sections will expand the idea. Mean shift considers feature space as a empirical probability density function. If the input is a set of points then Mean shift considers them as sampled from the underlying probability density function. If dense regions (or clusters) are present in the feature space , then they correspond to the mode (or local maxima) of the probability density function. We can also identify clusters associated with the given mode using Mean Shift.

For each data point, Mean shift associates it with the nearby peak of the dataset’s probability density function. For each data point, Mean shift defines a window around it and computes the mean of the data point . Then it shifts the center of the window to the mean and repeats the algorithm till it converges. After each iteration, we can consider that the window shifts to a more denser region of the dataset.

At the high level, we can specify Mean Shift as follows :
1. Fix a window around each data point.
2. Compute the mean of data within the window.
3. Shift the window to the mean and repeat till convergence.

### Preliminaries

#### Kernels :

A kernel is a function that satisfies the following requirements :

1. $\int_{R^{d}}\phi(x)=1$

2. $\phi(x)\geq0$

Some examples of kernels include :

1. Rectangular $\phi(x)=\begin{cases} 1 & a\leq x\leq b\\ 0 & else\end{cases}$

2. Gaussian $\phi(x)=e^{-\frac{x^{2}}{2\sigma^{2}}}$

3. Epanechnikov $\phi(x)=\begin{cases} \frac{3}{4}(1-x^{2}) & if\;|x|\leq1\\ 0 & else\end{cases}$

Kernel Density Estimation

Kernel density estimation is a non parametric way to estimate the density function of a random variable. This is usually called as the Parzen window technique. Given a kernel K, bandwidth parameter h , Kernel density estimator for a given set of d-dimensional points is

${\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)}$

### Gradient Ascent Nature of Mean Shift

Mean shift can be considered to based on Gradient ascent on the density contour. The generic formula for gradient ascent is ,

$x_{1}=x_{0}+\eta f'(x_{0})$

Applying it to kernel density estimator,

${\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)}$

$\bigtriangledown{\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)}$

Setting it to 0 we get,

${\displaystyle \sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x}=\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x_{i}}}$

Finally , we get

${\displaystyle \overrightarrow{x}=\frac{\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x_{i}}}{\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)}}$

### Mean Shift

As explained above, Mean shift treats the points the feature space as an probability density function . Dense regions in feature space corresponds to local maxima or modes. So for each data point, we perform gradient ascent on the local estimated density until convergence. The stationary points obtained via gradient ascent represent the modes of the density function. All points associated with the same stationary point belong to the same cluster.

Assuming $g(x)=-K'(x)$ , we have

${\displaystyle m(x)=\frac{\sum_{i=1}^{n}g\left(\frac{x-x_{i}}{h}\right)x_{i}}{\sum_{i=1}^{n}g\left(\frac{x-x_{i}}{h}\right)}-x}$

The quantity $m(x)$ is called as the mean shift. So mean shift procedure can be summarized as : For each point $x_{i}$

1. Compute mean shift vector $m(x_{i}^{t})$

2. Move the density estimation window by $m(x_{i}^{t})$

3. Repeat till convergence

Using a Gaussian kernel as an example,

1. $y_{i}^{0}=x_{i}$
2. ${\displaystyle y_{i}^{t+1}=\frac{\sum_{i=1}^{n}x_{j}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}{\sum_{i=1}^{n}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}}$

### Proof Of Convergence

Using the kernel profile,

${\displaystyle y^{t+1}=\frac{\sum_{i=1}^{n}x_{i}k(||\frac{y^{t}-x_{i}}{h}||^{2})}{\sum_{i=1}^{n}k(||\frac{y^{t}-x_{i}}{h}||^{2})}}$

To prove the convergence , we have to prove that $f(y^{t+1})\geq f(y^{t})$

$f(y^{t+1})-f(y^{t})={\displaystyle \sum_{i=1}^{n}}k(||\frac{y^{t+1}-x_{i}}{h}||^{2})-{\displaystyle \sum_{i=1}^{n}}k(||\frac{y^{t}-x_{i}}{h}||^{2})$

But since the kernel is a convex function we have ,

$k(y^{t+1})-k(y^{t})\geq k'(y^{t})(y^{t+1}-y^{t})$

Using it ,

$f(y^{t+1})-f(y^{t})\geq{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(||\frac{y^{t+1}-x_{i}}{h}||^{2}-||\frac{y^{t}-x_{i}}{h}||^{2})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-2y^{t+1}x_{i}+x_{i}^{2}-(y^{t^{2}}-2y^{t}x_{i}+x_{i}^{2}))$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2(y^{t+1}-y^{t})^{T}x_{i})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2(y^{t+1}-y^{t})^{T}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2(y^{(t+1)^{2}}-y^{t}y^{t+1}))$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2y^{(t+1)^{2}}+2y^{t}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(-y^{(t+1)^{2}}-y^{t^{2}}+2y^{t}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(-1)(y^{(t+1)^{2}}+y^{t^{2}}-2y^{t}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}-}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(||y^{t+1}-y^{t}||^{2})$

$\geq0$

Thus we have proven that the sequence $\{f(j)\}_{j=1,2...}$is convergent. The second part of the proof in [2] which tries to prove the sequence $\{y_{j}\}_{j=1,2,...}$ is convergent is wrong.

### Improvements to Classic Mean Shift Algorithm

The classic mean shift algorithm is time intensive. The time complexity of it is given by $O(Tn^{2})$ where $T$ is the number of iterations and $n$ is the number of data points in the data set. Many improvements have been made to the mean shift algorithm to make it converge faster.

One of them is the adaptive Mean Shift where you let the bandwidth parameter vary for each data point. Here, the $h$ parameter is calculated using kNN algorithm. If $x_{i,k}$is the k-nearest neighbor of $x_{i}$ then the bandwidth is calculated as

$h_{i}=||x_{i}-x_{i,k}||$

Here we use $L_{1}$or $L_{2}$ norm to find the bandwidth.

An alternate way to speed up convergence is to alter the data points
during the course of Mean Shift. Again using a Gaussian kernel as
an example,

1. $y_{i}^{0}=x_{i}$
2. ${\displaystyle y_{i}^{t+1}=\frac{\sum_{i=1}^{n}x_{j}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}{\sum_{i=1}^{n}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}}$
3. $x_{i}=y_{i}^{t+1}$

### Other Issues

1. Even though mean shift is a non parametric algorithm , it does require the bandwidth parameter h to be tuned. We can use kNN to find out the bandwidth. The choice of bandwidth in influences convergence rate and the number of clusters.
2. Choice of bandwidth parameter h is critical. A large h might result in incorrect
clustering and might merge distinct clusters. A very small h might result in too many clusters.

3. When using kNN to determining h, the choice of k influences the value of h. For good results, k has to increase when the dimension of the data increases.
4. Mean shift might not work well in higher dimensions. In higher dimensions , the number of local maxima is pretty high and it might converge to a local optima soon.
5. Epanechnikov kernel has a clear cutoff and is optimal in bias-variance tradeoff.

### Applications of Mean Shift

Mean shift is a versatile algorithm that has found a lot of practical applications – especially in the computer vision field. In the computer vision, the dimensions are usually low (e.g. the color profile of the image). Hence mean shift is used to perform lot of common tasks in vision.

Clustering

The most important application is using Mean Shift for clustering. The fact that Mean Shift does not make assumptions about the number of clusters or the shape of the cluster makes it ideal for handling clusters of arbitrary shape and number.

Although, Mean Shift is primarily a mode finding algorithm , we can find clusters using it. The stationary points obtained via gradient ascent represent the modes of the density function. All points associated with the same stationary point belong to the same cluster.

An alternate way is to use the concept of Basin of Attraction. Informally, the set of points that converge to the same mode forms the basin of attraction for that mode. All the points in the same basin of attraction are associated with the same cluster. The number of clusters is obtained by the number of modes.

Computer Vision Applications

Mean Shift is used in multiple tasks in Computer Vision like segmentation, tracking, discontinuity preserving smoothing etc. For more details see [2],[8].

### Comparison with K-Means

Note : I have discussed K-Means at K-Means Clustering Algorithm. You can use it to brush it up if you want.

K-Means is one of most popular clustering algorithms. It is simple,fast and efficient. We can compare Mean Shift with K-Means on number of parameters.

One of the most important difference is that K-means makes two broad assumptions – the number of clusters is already known and the clusters are shaped spherically (or elliptically). Mean shift , being a non parametric algorithm, does not assume anything about number of clusters. The number of modes give the number of clusters. Also, since it is based on density estimation, it can handle arbitrarily shaped clusters.

K-means is very sensitive to initializations. A wrong initialization can delay convergence or some times even result in wrong clusters. Mean shift is fairly robust to initializations. Typically, mean shift is run for each point or some times points are selected uniformly from the feature space [2] . Similarly, K-means is sensitive to outliers but Mean Shift is not very sensitive.

K-means is fast and has a time complexity $O(knT)$ where k is the number of clusters, n is the number of points and T is the number of iterations. Classic mean shift is computationally expensive with a time complexity $O(Tn^{2})$.

Mean shift is sensitive to the selection of bandwidth, $h$. A small $h$ can slow down the convergence. A large $h$ can speed up convergence but might merge two modes. But still, there are many techniques to determine $h$ reasonably well.

Update [30 Apr 2010] : I did not expect this reasonably technical post to become very popular, yet it did ! Some of the people who read it asked for a sample source code. I did write one in Matlab which randomly generates some points according to several gaussian distribution and the clusters using Mean Shift . It implements both the basic algorithm and also the adaptive algorithm. You can download my Mean Shift code here. Comments are as always welcome !

### References

1. Fukunaga and Hostetler, "The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition", IEEE Transactions on Information Theory vol 21 , pp 32-40 ,1975
2. Dorin Comaniciu and Peter Meer, Mean Shift : A Robust approach towards feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 24 No 5 May 2002.
3. Yizong Cheng , Mean Shift, Mode Seeking, and Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 17 No 8 Aug 1995.
4. Mean Shift Clustering by Konstantinos G. Derpanis
5. Chris Ding Lectures CSE 6339 Spring 2010.
6. Dijun Luo’s presentation slides.
7. cs.nyu.edu/~fergus/teaching/vision/12_segmentation.ppt

8. Dorin Comaniciu, Visvanathan Ramesh and Peter Meer, Kernel-Based Object Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 25 No 5 May 2003.
9. Dorin Comaniciu, Visvanathan Ramesh and Peter Meer, The Variable Bandwidth Mean Shift and Data-Driven Scale Selection, ICCV 2001.

1. Choosing the number of clusters I: The Elbow Method
A post with some ideas on choosing number of clusters in a principled way.

2. 4 Chatboxes for wordpress.com blogs
An interesting way to embed chats in blogs. I have added a chat window in my blog to chat with me just for the fun of it. Lets see how it goes.

3. This seat is reserved  and More women, more power?
Two interesting posts on Women reservation bill in India. Its a shame that the bill is not passed.

Another nice post on complex numbers by Steven Strogatz. It was a nice refresher to learn again how to take roots (square , cube et al) for imaginary numbers.

5. reddit.com Interviews Peter Norvig
An interesting interview with Peter Norvig. He answered some interesting questions . For eg I liked his answer that linear classifiers have progressed beyond everyone’s expectations. Another good question was why Lisp is not much used in Google. For the list of questions see reddit site.I found this interview through this site.

6. A collection of code competition sites
Good collection of code competition sites.

## K-Means Clustering Algorithm

Data Mining is one of my favorite areas in CS. I have so far taken courses on intro data mining, web mining and in this semester (Spring 2010) an advanced data mining course using Graphs and Matrices . IMO, one of the problems with data mining and machine learning is that most of the algorithms / concepts are stuffed with mathematics , even though the basic ideas are very simple. So I am hoping to write some simple to read posts on some of the cool topics in data mining (DM for short) /machine learning (ML for short) . So the writing will be very informal and may lack in rigor.

In this first post, I plan to write on K-Means algorithm. But before talking about the algorithm some basics of Data Mining and Machine Learning. Two very common tasks in Data Mining are Classification and Clustering.

### Classification

In Classification , we are given some training data where someone manually had classified each data in it to some category (aka class). Our aim is to do the classify the future data (aka testing data) into categories/classes as accurately as possible. Classification is a technique in Supervised Learning. Common practical examples include finding if a tumor is malignant or is a particular credit card transaction fraudulent. You can even consider, handwriting recognition as classification.

### Clustering

Clustering , on the other hand , is an unsupervised learning technique . The aim here is to group the data points into clusters such that similar items are lumped together in the same cluster. Here , nobody "trains" the algorithm and it is expected to do a good job. Clustering is one of the important tools in exploratory analysis.

There are two broad types of Clustering. One is called Hierarchical clustering where you try to create a hierarchy of related objects or clusters. One can consider biological taxonomy as a simple example of clustering where similar organisms are lumped into same species . Then all similar species are put into same genus and so on. A more recent example is that of Phylogenetic trees where geneticists find the evolutionary relation between species . Here organisms with common ancestors  are clubbed together.

If the above examples seem very biological consider Google News . Google News groups articles on similar topics together.  For eg all articles that talk about India’s cricket victory today (around 1000 news articles) are grouped into a single cluster , as their primary topic is very similar . Other sports clusters are similarly formed. Now , since all these sports clusters have a common theme , they will be clubbed and put into a bigger cluster called Sports , Similarly , all articles on Sri Lankan elections are grouped together into a cluster. Then other political clusters are grouped to form the Politics section.

The other major type of Clustering is called partitional or iterative clustering. Consider you are given a bin of colored balls. If you separate them into piles of same colored balls , you are doing a rudimentary partitional clustering. I will talk more about Hierarchical clustering in some future post. Here ,I will talk about K-Means which is the most popular partitional clustering algorithm. It is considered as one of the top 10 data mining algorithms .

### Distance Measures

If we want to compare two items we need a notion of similarity or dissimilarity (aka distance). Most of the time, if we know one of these values , we can calculate the other as they related in a antagonistic way. Let us focus on distance. If the items are simple points in a n-dimensional space, then a simple metric will be the Euclidean distance. If we consider p,q as two points ,

$d(p,q) = \sqrt{\displaystyle \sum_{i=1}^{n}{(p_{i}-q_{i})^{2}}}$

There are other distance measures like Manhattan Distance , Mahalanobis distance. Popular similarity measures are Jaccard coefficient , Cosine similarity etc.

Things are a bit complicated when the items are not points but objects with some attributes (for eg news articles). In this we might need to design our own distance measures. For checking if news articles talk about same topic, a simple distance measure can be to find the entities in an article and compare their frequencies. Two articles with common words like India, Bangladesh, Cricket, Victor, Test "might" talk about the same topic. For a real world application , finding a good distance measure requires lot of work – The distance measure should be efficient but also reasonably accurate.

### K-Means algorithm

Let us talk about the K-means algorithm. The algorithm accepts two inputs. The data itself, and "k", the number of clusters. We will talk about the implications of specifying "k" later. The output is k clusters with input data partitioned among them.

The aim of K-means (or clustering) is this : We want to group the items into k clusters such that all items in same cluster are as similar to each other as possible. And items not in same cluster are as different as possible. We use the distance measures to calculate similarity and dissimilarity. One of the important concept in K-means is that of centroid. Each cluster has a centroid. You can consider it as the point that is most representative of the cluster. Equivalently, centroid is point that is the "center" of a cluster.

### Algorithm

1. Randomly choose k items and make them as initial centroids.
2. For each point, find the nearest centroid and assign the point to the cluster associated with the nearest centroid.
3. Update the centroid of each cluster based on the items in that cluster. Typically, the new centroid will be the average of all points in the cluster.
4. Repeats steps 2 and 3, till no point switches clusters.

As you can see, the algorithm is extremely simple. After some iterations, we will get k-clusters within which each points are similar. It is quite obvious that the algorithm should work. There is also a formal proof for convergence based on error criterion but we will skip it.

### Some Random Points

1. The algorithm converges even though sometimes it may take exponential time. Although, in practice, it converges very fast.
2. The centroid of a cluster need not correspond to any point in the cluster. Initially, the centroid is a point in the cluster but after some iterations it need not be the case. If you force the algorithm to always select an item in the cluster as its centroid then we get the k-medoids algorithm.
3. There are lot of good things about this algorithm. It is simple to understand, easy to code , scalable , efficient etc.
4. Let N be the number of data points. k be the number of clusters. i be the number of iterations. The time complexity is O(Nki).
5. Although not very obvious, some times, the K-Means algorithm gets stuck in a local minima and might return a bad clustering.
6. The selection of initial points is very critical to avoid local minima. Basic algorithm selects them randomly but much better heuristics exist. A good set of initial points are important as it influences both quality of clusters and also the running time of the algorithm.
7. Finding the correct number of clusters is one of the big problems in K-means. A wrong value of k can give you a sub optimal result. There are many techniques to predict the number of k , for eg by "preprocessing" the data using hierarchical clustering etc.
8. K-means is lot in common with other algorithms like PCA and EM . Both of them are really cool ideas and I will post about them sometime in the future. One interesting tidbit is that professor Chris Ding who is taking my Advanced Data Mining course , proved the equivalence of K-means and PCA.

### My Experience With K-Means

As my data mining course project, I did Clustering of wikipedia results. The task is something like this : Lets say you search for the word Jaguar in wikipedia. It will return a list of articles that contain Jaguar. The word Jaguar has multiple interpretation – as an animal , as a car , or as a version of Mac OS . My project took these pages as input and applied K-means algorithm on the pages . It will group pages that talk about similar concepts and more interestingly , it will also try to give a name for the cluster. (One of the thing about K-means is that it forms the clusters based on similarity – interpreting what each cluster means is up to us ) . So for Jaguar, it will group all articles that talk about Jaguar as a car into one cluster and so on.

Infact, it worked well even if you give a more abstract query term like Obama or McCain . (Remember, it was written in Fall 2008 , when there was US election fever 🙂 ) . So for McCain, it will group pages into clusters talking about his military career, senate career, presidential election and so on. That project was real fun and I learnt about lot of things in data mining while doing it. In case you are interested, the project report is here.

### Code

I wrote a simple code in java to do clustering but I found a more generic code at Antonio Gulli’s blog . The code is here.