Feeds:
Posts

## Tutorial on Sampling Theory for Approximate Query Processing

### Prelude

This blog post is based on the lecture notes I prepared for one of the courses in which I am the teaching assistant. Hopefully, this notes will be useful for anyone interested in either Approximate Query Processing (AQP) or basic Sampling theory as applied in Databases.

### Approximate Query Processing

Consider a database D which contains billions of tuples. Such a large database is not uncommon in data warehousing. When we want to get some answers from the database, we construct a SQL query to get the results. Due to the large size of the database, any query should take quite a bit of time. This is regardless of the use of techniques like indexing which can speed up the processing time but does not really reduce the time asymptotically.

One observation is that the queries that are posed to the database returns very precise and accurate answers – probably after looking at each and every tuple. For a lot of use cases of OLAP we may not need such a precise answer. Consider some sample queries like – what is the ratio of male to female in UTA ? What percentage of US people live in Texas? What is the average salary of all employees in the company? and so on.

Notice that for such queries we really do not need answers that are correct to the last decimal point. Also notice that each of those query is an aggregate over some column. Approximate query processing is a viable technique to use in these cases. A slightly less accurate result but which is computed instantly is desirable in these cases. This is because most analysts are performing exploratory operation on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most of the use cases.

There are multiple techniques to perform approximate query processing. The two most popular involve histograms and sampling. Histograms store some statistic about the database and then use it to answer queries approximately. Sampling creates a very small subset of the database and then uses this smaller database to answer the query. In this course we will focus on AQP techniques that use sampling. Sampling is a very well studied problem backed up by a rich theory that can guide us in selecting the subset so that we can provide reasonably accurate results and also provide statistical error bounds.

### Introduction to Sampling

The idea behind sampling is very simple. We want to estimate some specific characteristic of a population. For eg this might be the fraction of people who support some presidential candidate or the fraction of people who work in a specific field or fraction of people infected with a disease and so on. The naive strategy is to evaluate the entire population. Most of the time , this is infeasible due to constraints on time, cost or other factors.

An alternate approach that is usually used is to pick a subset of people . This subset is usually called a sample. The size of the sample is usually an order of magnitude smaller than the population. We then use the sample to perform our desired evaluation. Once we get some result, we can use this to estimate the characteristic for the entire population. Sampling theory helps, among other things, on how to select the subset ,  what is the size of population, how to extrapolate the result from sample to original population , how to estimate the confidence interval of our prediction etc.

The process of randomly picking a subset of the population and using it to perform some estimation should appear very strange. On a first glance it might look that we might get wildly inaccurate results. We will later see how to give statistical guarantees over our prediction. Sampling is a very powerful and popular technique. More and more problems in the real world are being solved using sampling. Lots of recent problems in data mining and machine learning essentially use sampling and randomization to approximately solve very complex problems which are not at all solvable otherwise. This is for this reason that most DBMS provide sampling as a fundamental operator. (Eg Oracle provides a SAMPLE operator in select and also dbms_stats package. SQL Server provides TABLESAMPLE operator and so on).

We represent the population with P and the sample with S. N represents the size of population and n represents the size of the sample. We will use these letters to denote statistics on the population and sample. For eg, $\mu_P$ represents the mean of the population and $\mu_S$ represents the mean of the sample. Similarly, $\sigma_P$ and $\sigma_S$ represent the standard deviation of the population and sample respectively.

### Types of Sampling

There are different ways to perform sampling. The ones that we will use most in AQP are :

### Sampling With/Without Replacement

The aim of sampling is to get a subset (sample) from a larger population. There are two ways to go about it. In the first approach, we randomly pick some entity, perform measurements if any and add it to the sample. We then replace the entity back to the original population and repeat the experiment. This means that same entity can come in the sample multiple times. This approach is called Sampling with replacement. This is the simplest approach to sampling. There is no additional overhead to check if an entity is already in sample or not. Typically, sampling with replacement is modeled with binomial distribution.

In the second approach, we explicitly make sure that an entity does not appear in the sample more than once. So we randomly pick an entity from the population, verify it is not already in the sample, perform measurement and so on. Alternatively, we can remove entities that were added to sample from the population. This approach is called Sampling without replacement and is usually modeled with an hypergeometric distribution.

### Bernoulli Sampling

Bernoulli Trial : A Bernoulli Trial is an experiment which has exactly two outcomes – Success or failure. Each of these outcomes has an associated probability that completely determines the trial. For eg consider a coin which produces head with probability 0.6 and tail with probability 0.4 . This constitutes a bernoulli trial as there are exactly two outcomes.

In Bernoulli Sampling, we perform a Bernoulli trial on each tuple in the database. Each tuple can be selected into a sample with uniform probability. If the trial is a success, the tuple is added to the sample. Else it is not. The most important thing to notice is that all the tuples have exactly the sample probability of getting into the sample. Alternatively, the success probability for Bernoulli trial remains the same for each tuple. Pseudo code for Bernoulli Sampling is given below :

success probability SP = $\frac{n}{N}$
for i = 1 to N
Perform a Bernoulli trial with success probability SP
If outcome is success add i-th tuple to sample

It is important to note that Bernoulli sampling falls under Sampling without replacement. Size of the sample follows a binomial distribution with parameters $B(N,\frac{n}{N})$. ie it can vary between 0 and N-1 with the expected size of sample as n.

### Uniform Random Sampling

In Uniform Random Sampling, we pick each tuple in the database with a constant probability. This means that the probability of any tuple entering the sample is constant. Typically, this is implemented as sampling with replacement. Pseudo code for Uniform Random Sampling is given below :

1. Generate a set of n random numbers S between 1 and N.
2. Select tuples with index in S and add it to sample.

Note that in this case we have exactly n tuples in the sample. We can also notice that sample tuple might appear multiple times. The number of times a tuple appears in the sample forms a binomial distribution with parameters $B(n,\frac{1}{n})$.

### Weighted Random Sampling

In Weighted Random Sampling, we perform a Bernoulli trial on each tuple in the database. The difference with Uniform random sampling that the success probability for each Bernoulli trial varies. In other words, each tuple has a different probability of getting into the sample.

### Stratified Sampling

If the population can be subdivided into sub population that are distinct and non overlapping , stratified sampling can be used. In Stratified sampling, we split the population into a bunch of strata and then form sampling over each strata independently. For eg a political poll can be stratified on gender, race , state etc.

There are multiple advantages in using stratified sampling. For one, this allows the convenience to use different sampling techniques over each strata. If there is some specific strata that might be under represented in a general random sampling, we can easily provide additional weights for the samples taken from that. It is also possible to vary the number of samples from a strata to minimize the error. For eg, we can take less number of samples from a strata with low variance while preserving them for strata with high variance.

Stratified sampling is not a panacea because getting a good stratification strategy may not be obvious. In lot of population, the best feature to stratify is not obvious. Even worse, the population may not contain subgroups that are homogeneous and non overlapping.

Given a population , sample size and strata there are many ways to allocate the sample across different strata. The two commonly used strategies are :

1. Proportional Allocation : In this approach the contribution of each strata to the sample is proportional to the size of the strata. So a strata that accounts to 10\% of the population will use 10\% of the sample. Note that this approach does not use any information about a sub population other than its size.
2. Variance based Allocation : This strategy allocates samples in proportion to their variance. So a strata with high variance will have higher representation than one with smaller variance. This is logical as we need few samples to accurately estimate the parameters of a sub population when its variance is low. Any additional samples do not add much additional information or reduce the final estimation dramatically.

### Reservoir Sampling

Reservoir Sampling is an algorithm that is widely used to take n random samples from a large database of size N. The real utility of reservoir sampling is realized when N is a large number or is not really known at sample time. This scenario is quite common when the input is a streaming data or when the database is frequently updated . Running the simple uniform random sampling algorithm (say Bernoulli sampling) is inefficient as N is large or the old tuples may be purged (or goes out of Sliding Window). Reservoir sampling allows you to get the random sample in a linear pass such that you only inspect any tuple at most once.

### Reservoir Sampling with Single Sample

Consider the following contrived example. We have a database which is constantly updated and we want to have a single random tuple from it.

The base case occurs when there is only one tuple in the database. In this case our random sample is the first tuple. Hence the sample  S = t1.

Lets say the database is updated and a new tuple  t2 is added. The naive approach is to restart the entire sampling process. Instead, in reservoir sampling, we accept the new tuple as the random sample with probability $\frac{1}{2}$. ie toss a coin which returns head with probability 0.5 and if it returns head , then replace t1 with t2.

We can see why S is a uniform sample. The probability that S contains t1 or t2 remains the same.

1. $Pr(S=t1) = 1 * \frac{1}{2} = \frac{1}{2}$ . The random sample is t1 when it was selected first into S (with probability 1) and then not rejected by t2 with probability $1-\frac{1}{2}$.
2. $Pr(S=t2) = \frac{1}{2}$ . The random sample is t2 when it replaces  t1 in the second step. This occurs with probability $\frac{1}{2}$

The database is updated and lets assume a new tuple t3 is added. We accept the new tuple as the random sample with probability $\frac{1}{3}$. ie toss a coin which returns head with probability 0.33 and if it returns head , then replace the previous value of S (t1 or t2) with t3. More generally when inspecting the i-th tuple, accept it with probability $\frac{1}{i}$.

It might look as if we are treating t3 unfairly because we only accept it with probability 0.33. But we can show probabilistically that the sample is still uniform. The probability that S is t1 or t2 or t3 remains the same.

1. $Pr(S=t1) = 1 \times \frac{1}{2} \times \frac{2}{3}=\frac{1}{3}$ . The only scenario when random sample is still t1 occurs when it was selected first into S (with probability 1) ,not rejected by t2 with probability $1-\frac{1}{2}$ and not rejected by t3 with probability $1-\frac{1}{3} = \frac{2}{3}$.
2. $Pr(S=t2) = \frac{1}{2} \times \frac{2}{3} = \frac{1}{3}$ . The random sample is 2 when it replaces t1 in the second step. This occurs with probability $\frac{1}{2}$. Then in the next step is not replaced by t3. This occurs with probability $1-\frac{1}{3} = \frac{2}{3}$.
3. $Pr(S=t3) = \frac{1}{3}$ . The random sample is 3 when S contains either t1 or t2 and it is replaced by t3. This occurs with probability $\frac{1}{3}$.

The pseudo code looks like :

S = t1
for i = 2 to N
Replace S with tuple t_i with probability $\frac{1}{i}$

### Reservoir Sampling with Multiple Samples

A very similar approach works when the sample size is more than 1. Lets say that we need a sample of size n. Then we initially set the first n tuples of the database as the sample. The next steps is a bit different. In the previous case, there was only one sample so we replaced the sample with the selected tuple. When sample size is more than 1, then this steps splits to two parts :

1. Acceptance : For any new tuple t_i, we need to decide if this tuple enters the sample. This occurs with probability $\frac{n}{i}$.
2. Replacement : Once we decided to accept the new tuple into the sample, some tuple already in the sample needs to make way. This is usually done randomly. We randomly pick a tuple in the sample and replace it with tuple t_i.

The pseudo code looks like :

Store first n elements into S
for i = n+1 to N
Accept tuple t_i with probability $\frac{n}{i}$
If accepted, replace a random tuple in S with tuple t_i

A coding trick that avoids the "coin tossing" by generating a random index and then accepts it if it is less than our sample size. The pseudo code looks like :

Store first n elements into S
for i = n+1 to N
randIndex = random number between 1 and i
if randIndex <= n
replace tuple at index "randIndex" in the sample with tuple t_i

We can similarly analyze that the classical reservoir sampling does provide a uniform random sample. Please refer to the paper  Random Sampling with a Reservoir by Jeffrey Vitter for additional details.

### Sampling in AQP

As discussed above, our main aim is to discuss how sampling techniques is used in AQP. Let us assume that we have a sample S of size n. The usual strategy that we will follow is to apply any aggregate query on the sample S instead of database D. We then use the result of the query from S to estimate the result for D.

One thing to note is that we will only use aggregate queries for approximate processing. Specifically we will look at COUNT, SUM and AVERAGE. The formulas for estimating the values of the aggregate query for the entire database from the sample for these 3 operators is well studied. For additional details refer to the paper “Random Sampling from Databases" by Frank Olken.

Uniform Random Sample

1. Count : $\frac{\sum_{i=1}^{n} T_i p_i}{n} = \frac{\sum_{i=1}^{n} T_i \frac{1}{N}}{n}$ where $T_i$ is an indicator random variable that is 1 when tuple t_i satisfies our clause. $p_i$ is the probability that tuple will be selected into the sample. Intuitively, the formula finds the fraction of tuples in Sample which satisfied the query and applies the same fraction to the entire database.
2. Sum : $\frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n} = \frac{\sum_{i=1}^{n} x_i N}{n}$
3. Average : $\frac{Sum}{Count}$
\end{itemize}

Weighted Random Sample

1. Count : $\frac{\sum_{i=1}^{n} T_i p_i}{n}$ where $T_i$ is an indicator random variable that is 1 when tuple t_i satisfies our clause. $p_i$ is the probability that tuple will be selected into the sample. Intuitively, the formula reweighs each tuple according to the selection probability of the tuple.
2. Sum : $\frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n}$
3. Average : $\frac{Sum}{Count}$

### Probability/Statistics Refresher

Here are few commonly used equations related to Expectation and variance.

1. E[X] = $\sum_{i=1}^{n} x_i p_i$
2. E[aX+b] = aE[X] + b
3. E[X+Y] = E[X] + E[Y] (also called Linearity of Expectations)
4. Var[X] = $E[ (X - E[X] ) ^2]$
5. Var[X+a] = Var[X]
6. Var[aX] = $a^2$ Var[X]
7. Var[X+Y] = Var[X] + Var[Y] if X and Y are independent.

Law of Large Numbers : This law is one of the fundamental laws in probability. Let $X_1,X_2 \ldots , X_n$ be random variables drawn iid. Very informally, as n increases, the average of the variables approaches the expected value of the distribution from which the variables are drawn. For eg, if we have a coin which provides head with probability 0.5 and toss it 10000 times, the number of heads will be very close to 5000.

Binomial Distribution: Suppose you repeat a Bernoulli trial with success probability of p , n times. The distribution of the number of successes in the n trials is provided by binomial distribution B(n,p). This is a very important distribution for modeling sampling with replacement. For eg if we perform Bernoulli sampling, the final size of the sample is a Binomial distribution. Also if we randomly pick n tuples from database of size N , the number of times any tuple is repeated in the sample can also modeled by Binomial distribution. The expected value is given by $E[X]=np$ and variance is given by $np(1-p)$.

Normal Distribution : Normal distribution , aka Gaussian distribution, is one of the most important probability distributions. It is usually represented with parameters $N(\mu,\sigma^2)$. It has the familiar bell curve shape. $\mu$ determines the center of the normal curve and $\sigma^2$ determines the width of it. A smaller value results in a tighter curve while a larger value results in a more flat/wide curve.

Equations (2) and (7) above gives us some detail about how expected value and variance of sum of two random variables can be computed from the expected value/variance of the constituent random variables . We can extend by induction for the rules to hold for the sum of arbitrarily large number of random variables. This introduces us to the concept of Central Limit Theorem.

Central Limit Theorem :  This is one of the most fundamental rules in probability. Assume that we have n different random variables $X_1,X_2 \ldots , X_n$. Each of these random variables have mean $\mu$ and variance $\sigma^2$. For a large n, the distribution of sums $X = X_1 + X_2 + \ldots + X_n$ is normal with parameters $N(n \mu, n \sigma^2)$. Similarly, the distribution of the averages , $X = \frac{X_1 + X_2 + \ldots + X_n}{n}$ is normal with parameters $N(\mu, \frac{\sigma^2}{n})$.

Law of large numbers and Central limit theorem are two important results that allow analysis of the results. law of large number says that if we pick a large enough sample then the average of the sample will be closer to the true average of population. Central limit theorem states that if you repeat the sampling experiment multiple times and plot the distribution of the average value of the samples, they follow a normal distribution. Jointly, they allow you to derive the expression for Standard error.

### Standard Error

The essential idea behind sampling is to perform the experiment on a randomly chosen subset of the population instead of the original population. So far we discussed how to perform the sampling and how to estimate the value from sample to the larger population. In this section let us discuss about the error in our prediction. The concept of standard error is usually used for the same.

Lets say we want to estimate the mean of the population $\mu_P$ . We performed the sampling and found the sample mean as $\mu_S$. Since the sample mean is an unbiased estimator of the population mean we can announce that they are the same. But it need not always be the case that these two values are same. There are two common ways to analyze the error.

1. Absolute Error : $|\mu_P - \mu_S|^2$
2. Relative Error : $\frac{|\mu_P - \mu_S|^2}{\mu_P}$

The primary problem we have in estimating these error metrics is that we do not know the value of $\mu_P$. We will use an indirect way to estimate the error.

Consider the sampling process again. We picked a sample $S={X_1,X_2,\ldots,X_n}$ of random tuples from the database and computed the sample mean as $\mu_S=\frac{\sum_{i=1}^{n} X_i}{n}$. But since the tuples in S are picked randomly, the sample mean changes based on the sample. This means that the sample mean $\mu_S$ is itself a random variable.

Let the population $P={X_1,X_2,\ldots,X_N}$. Then the population mean $\mu_P = \overline{X} = \frac{\sum_{i=1}^{N} X_i}{N}$.  Let the sample be $S={Y_1,Y_2,\ldots,Y_n}$ and sample mean is $\mu_S = \frac{\sum_{i=1}^{n} Y_i}{n}$.

### Standard Error when Sample Size = 1

Consider the case where the sample S consists of only one element. Our aim is to find population mean. As per our approach, we pick a random element $Y_1$ and then announce it as the sample mean ie ${\mu}_S=Y_1$ . The error in this case is the difference between $Y_1$ and $\mu_P$. Since the sample is randomly picked , the sample mean is a random variable. Then it also implies that the error is also a random variable.

We can derive the expected value for the error as follows :
$E[err^2] = \sum_{i=1}^{N}\frac{(\overline{X}-X_i)^2}{N} = \sigma_P^2$.

We can see that the expected value of the error is nothing but the standard deviation of the population !

### Standard Error when Sample Size = 2

Consider the case where the sample S consists of exactly two elements. Our aim is to find population mean. As per our approach, we pick two random elements $Y_1,Y_2$ and then announce the sample mean as ${\mu}_S = \frac{Y_1+Y_2}{2}$. The error in this case is the difference between $\mu_S$ and $\mu_P$. Since the samples are randomly picked , the sample mean is again a random variable. Then it also implies that the error is also a random variable.

We can derive the expected value for the error as $E[err^2] = E[(\overline{X} - \overline{Y})^2] = Var[\overline{Y}]$. We can see that the sample can be any one of $N^2$ different combination of elements . The calculation for variance might be tricky if not for the independence of samples. Since the two elements were picked randomly , they two are independent and we can use that to estimate the variance easily.

$Var[\overline{Y}] = Var [\frac{Y_1 + Y_2}{2}] = Var [\frac{Y_1}{2} + \frac{Y_2}{2}] = \frac{1}{4} Var[Y_1] + \frac{1}{4} Var[Y_2] = \frac{\sigma_P^2}{4} + \frac{\sigma_P^2}{4} = \frac{\sigma_P^2}{2}$

Note that we used rules 3, 6 and 7 from above.

### Standard Error when Sample Size = n

We will not derive the formula here but we can easily show that when the sample contains n different elements the standard error is given by $E[err^2] = \frac{\sigma_P^2}{n} = \frac{Variance\;of\;population}{Sample\;size}$.

There are some interesting things to note here :

1. The expected error when using 2 samples is less than that of the expected error when we used only one sample.
2. As the size of sample increases the error decreases even faster. The rate of decrease is inversely proportional to square of size of the sample size.
3. This also formalizes the intuition that if the variance of the population is less, we need less number of samples to provide estimates with small error.
4. Usually our work will dictate the tolerable error and we can use the formula to find the appropriate n that will make standard error less than our tolerance factor.

Hope this post was useful 🙂

## Impressions on MIT Machine Learning Video Lectures

The title of this post is very similar to my old berkeley ML lectures post. Machine Learning is one of the coolest fields in CS and I really want to master it. But so far I was having trouble finding the basic intuitions behind the course. When I read the books, it somehow feels that I am missing some overarching intuition behind the field and that is the root cause behind all my troubles. I "think" I now know what it is. My math background is not as strong as I want to be. I have been putting lot of efforts over the last few months to rectify it. I have been poring over books on linear algebra, probability and statistics. I am still lacking a bit in Calculus but I am sure I will be making up soon.

I took two great courses this semester and they played a role in bridging the gap . They are Computational Methods and Bioinformatics. Both of them are taught by great teachers and had lot of fun topics. Computational Methods was an exploratory course discussing about lot of math needed for CS – things like root finding , linear and non linear equation solving, least squares, Eigen values , SVD and monte-carlo methods . As you can see it is a great sampling of techniques widely used in machine learning and other related fields (eg robotics, vision etc). The text books and other reference materials did a good job of pointing applications in machine learning. So I think I now appreciate many of the derivations used in machine learning better now. Similarly my bioinformatics course showed the various ways in which machine learning techniques are adapted to fit to that domain. As a combo they did a world of good to me.

Since I am now more confident, I started searching for Machine Learning lectures to restart my quest. There are lot of exceptional course websites and notes, but I preferred the courses with video/audio lectures. The most popular seems to be Stanford’s CS 229. It is great and has an excellent collection of notes. I have listened to few of the lectures before bailing out. Hopefully this time , I will be able to complete them. Since I already had (and partially listened to ) Berkeley lecutures, I was searching for similar courses in MIT to complete the triumvirate. I was surprised that OCW did not have a video lecture for data mining / machine learning. It is a pity as I feel these two fields are becoming very important in CS now. As I was systematically going through old courses , I hit a jackpot.

I found MIT’s 6.867 Machine Learning course taught by Professor Leslie Kaelbling in fall of 2005. They had video lectures but all their links were broken. I spent quite some time hunting for good sources and with some luck found them ! One of the things I liked about this course is this : It was taught in 2005 – Slightly old but not too old – and hence the course topics looks relatively approachable by my standards. Additionally the initial lectures discuss the necessary background which is an additional plus.

The archived course website is at 6.867 Machine Learning Fall 2005 . The video links are at SMA’s course website. Please note that the webpage explicitly states that it is copyrighted 😦 Hopefully they wont mind using it for academic purposes. Surprisingly the videos are again in rm format which I hate. One of the reasons is that it causes my Ubuntu based players to crash occasionally. The steps to download are a bit different from the one in Berkeley course. My instructions below are for ISO section. I downloaded a video from each and was not able to find any difference and the ISO files were smaller than the prog files. Also note that the URLs are provided in reverse chrological order. So the first link in the website corresponds to last lecture.

1. Find the url of the file. For eg the url of first lecture in ISO recording is http://smasvr.nus.edu.sg:8080/ramgen/sma/2005-2006/sma5514-fall/sma-5514-lec-mit-54100-07sep2005-0930-iso.rm .

2. Extract the rm file name from the url . Here it is sma-5514-lec-mit-54100-07sep2005-0930-iso.rm .

3. Construct the rtsp url. As of now, the rtsp url looks like this – rtsp://137.132.165.11:554/sma/2005-2006/sma5514-fall/lectFile . So for our case it is rtsp://137.132.165.11:554/sma/2005-2006/sma5514-fall/sma-5514-lec-mit-54100-07sep2005-0930-iso.rm .

4. Download the file using mplayer. The one that worked for me is this : mplayer -noframedrop -dumpfile Lect1.rm -dumpstream rtsp://137.132.165.11:554/sma/2005-2006/sma5514-fall/sma-5514-lec-mit-54100-07sep2005-0930-iso.rm . Remember to replace the arguments for dumpfile and dumpstream for each download.

5. Convert to other formats. Use mencoder to convert to other formats. For some reason, all the formats I tried resulted in very large files. If anyone found a way to convert it effectively, please let me know.

6. Enjoy the videos !

### Other Notes

1. The files will download in a slow fashion. The reason is that mplayer will try to download them as if you are actually watching them. I tried to play with the caching by setting a large value for the bandwidth parameter but it did caused some premature termination. If anyone got it working please comment .

2. I was also not able to convert the rm files to other formats avi or mp4 correctly. The files were of 2-3 times the size of the original rm file. If anyone found the proper incantation please let me know.

I plan to listen to them in December and January. If I learn any cool ideas, I will blog about them. Have fun with the video lectures !

## Code Snippet to run kMeans clustering in BioPython

I recently had to implement kMeans algorithm for clustering genes based on their profiles for one of my bioinformatics homework. Even though, I implemented my code, I wanted to compare the results using BioPython. For those, who do not know, BioPython is a set of libraries that allow you to write bioinformatics code. They have implemented most of the fundamental algorithms in bioinformatics. To me it was a great lifesaver as I can test out my ideas in few lines of Python code.

In Ubuntu, if you want to install it use the command : sudo apt-get install python-biopython .

Anyway, I was searching in net to get a code snippet to do kMeans using BioPython and somehow did not find any. So I wrote one myself using the documentation. I thought I will post code in the blog so that if anyone needs it in the future its a google search away 🙂

Biopython’s kMeans code requires the input to be in the format accepted by Eisner’s treeview program so that needed some data massaging. The code per se is very simple. It provides the data file, number of clusters and the number of runs to try as input. Additionally it also passes an array to initialize the cluster centroids.

from Bio import Cluster
f = open("gData1.csv")
record = Cluster.Record(f)
initialId = []
numClusters = 4
numRecords = 12
for i in range(numClusters):
for j in range(numRecords/numClusters):
initialId.append(i)
#(clusterAssignment, totalError, numPasses) = record.kcluster(nclusters=numClusters,initialid=initialId)
(clusterAssignment, totalError, numPasses) = record.kcluster(nclusters=numClusters,npass=10)
geneNames = record.genename
g = [ [] for i in range(numClusters)]
numIndex = 0
for i in clusterAssignment :
g[i].append(geneNames[numIndex])
numIndex += 1
f.close()


The input file looks like this : It is a tab separated file with the first two special fields : geneid and gene name.

GENEID  NAME    PARAM1      PARAM2
BLAH1    BLAH2   -0.43          0.3


Hope the snippet is useful to some !

## How to analyze your Blog Traffic using WordPress Stats API

Analyzing my blog traffic is one of my favorite past times. Seeing traffic surge and strangers using my posts gives me gratification. Initially, my analysis was fairly low tech – checking WordPress stats page periodically. After doing it a few times, I realized I could do more methodically – and I can use the statistics to do some rudimentary data mining.

As you know, I have a WordPress.com blog. WordPress exposes the statistics in two ways : As a flash chart in the dashboard and as API service.

### Viewing Blog Stats – Low Tech Way

If you want to take a look at your blog’s stats today , then you can do it in the dashboard. Assuming you are logged into your WordPress login, go to your blog. You will see WordPress toolbar at the top of the page. Click "My Dashboard". It will show a chart of total daily visits for the past 15 days. If you want more details, hover your mouse over "My Account" and select "Stats".

This page shows a wealth of information about your blog but it is limited to 2 days – today and yesterday. There are atleast 6 important statistics in that page.

### WordPress Stats Sections

a) Views Per Day
This is probably the statistic you are most interested in. This plots the total number of visits each day for the last 30 days. This also has multiple tabs which allows you to aggregate the statistics information. For example you can view the total weekly and monthly visits . This gives a basic view about how your blog is faring.

b) Referrers
Referrers are basically web pages from which visitors reached your blog. There are different types of referrers : pingbacks, related posts, explicit linking etc. If a certain website is driving lot of visitors to your blog , it is time to notice ! Referrers allows you to get that insight. This panel shows the top 10 referrers. You can click on the "Referrers" link to get the full listing.

c) Top Posts and Pages
This shows the top 10 posts for the given day. One interesting thing is that they track home page separately. Also if you have different pages (eg About), then visits to these pages are also tracked. Again, if you want to see the number of visits to each page, you can click on the title link.

One interesting thing is that each of the posts/pages in this section also has an addition chart icon near them. Clicking on them gives the stats for that particular post alone. The new page shows a chart detailing how many visits occurred to that page in the last few weeks. Most interestingly , they also show the average visits per day for the last few weeks and months. This gives a glimpse into the lasting popularity of your post.

d) Search Engine Terms
I am not very clear what this exactly means – The instruction says "These are terms people used to find your blog." . I am not sure if this corresponds to search terms typed in search engines or in WordPress search boxes etc . Anyway, this panel gives information about the search terms that fetch visits to your blog.

e) Clicks
Clicks panel shows the list of links in your blog that your visitors clicked. In a way , you can consider it as an inverse of referrers. In this case, you act as a referrer for some other blog. This post gives some hints about the type of visitors to your blog.

f) Aggregate Statistics
There is also another panel that shows some aggregate stats. This shows the total number of views to your blog so far , number of blogs and posts , email subscribers etc.

### A Better Way

Using WordPress Stats page to get your data is fairly low tech. This gives some data which can only give you an instinct of how things go. But it will not give you any deeper insights. For doing more data mining, you need data – lots of data. Fortunately, WordPress makes available a stats API which you can query to get regular data. In the rest of the post , we will talk about the API and how to use the data that you get out of it.

### Using WordPress Stats API

The primary url which provides the stats is http://stats.wordpress.com/csv.php . You can click on it to see the required parameters. There are 4 important parameters to this API.

a) api_key  : api_key  is a mandatory parameter and ensures that only owner queries the website. There are three ways to get this information. This key is emailed to you at the time you created your blog. Or you can access it at My Dashboard -> Users -> Personal Settings. This will show your api key. For the truly lazy click this url .

b) blog_id : This is a number which uniquely identifies your blog. Either blog_id or blog_uri is mandatory. I generally prefer blog_id. Finding blog_id is a bit tricky. Go to the blog stats page (My Account -> Stats). Click on the title link of "Top Posts and Pages". This will open a new page which shows the statistics for last 7 days. If you look at the page’s url, it will have a parameter called blog. The value of this parameter is the blog_id . Atleast for my blog , it is a 8 digit number.

c) blog_uri : If you do not want to take all the trouble of getting blog_id , use blog_uri. This is nothing but the url of your blog (http://blah.wordpress.com).

d) table : This field identifies the exact statistic you want. One of views, postviews, referrers, searchterms, clicks. Each of these correspond to the sections of WordPress stats discussed above. If table is not specified , views is selected as the default table.

You can get more details from the stats API page given above.

### Sample Python Scripts to fetch WordPress Stats

I have written a few scripts which fetch each of the WordPress stats.  One of them run every hour and gets the total number of views so far for the whole blog. The other scripts runs once a day and fetch the total clicks, search terms, referrer and top posts for that day. All of these store the data as a csv file which lends itself to analysis.

If you are interested in the scripts , the links to them are  :

1. getBlogDaysStats.py  : Fetches the total views for the whole blog at the current time. For best results run every hour.
2. getBlogReferrers.py : Fetches all the referrers to your blog.
3. getBlogPostViews.py : Fetches the number of views for individual blog posts and pages.
4. getBlogSearchTerms.py : Fetches all the search terms used to find your blog today.
5. getBlogClicks.py : Fetches the urls that people who visited your blog clicked.

### How to Collect WordPress Statistics

The first step is of course to collect data periodically. I use cron to run the scripts. My crontab file looks like this :

11 * * * * /usr/bin/python scriptpath/getBlogDaysStats.py
12 0 * * * /usr/bin/python scriptpath/getBlogClicks.py
14 0 * * * /usr/bin/python scriptpath/getBlogPostViews.py
15 0 * * * /usr/bin/python scriptpath/getBlogReferrers.py
16 0 * * * /usr/bin/python scriptpath/getBlogSearchTerms.py

Basically, I run the getBlogDaysStats every hour and other scripts every day. I also run the rest of scripts at early morning so that it fetches the previous day’s data.

### How to Use WordPress Statistics

If you run the scripts for few days, you will have lot of data. The amount of analysis you can make is limited only by your creativity. In this section, I will tell some of the ways I use the stats instead of giving an explicit how-to.

1. Views per day : It is collected by getBlogDaysStats.py. The most basic stuff is to chart them. This will give a glimpse of your trend – If it is static or climbing, then good news. If it is falling down it is something to worry about. I must also mention that have a more or less a plateau in your chart happens often. For eg in my blog, the charts follow a pattern – It increases for quite some time , then stays at the same level for a long time and then increases again. Also , worrying about individual day’s statistics is not a good idea. Try to aggregate them into weekly and monthly values as they give a less noisy view of your blog traffic.

Another common thing to do is to analyze per hour traffic. This can be easily derived from the output of the script. Basically, if m is the number of views at time a and n is the number of views at time b , then you received n-m views in b-a hours. I usually calculate it for every hour. This gives a *basic* idea of peak time for your blog – You can also infer your primary audience , although the interpretation is ambiguous. As an example , I get most of my traffic at night – especially between 1 AM – 9 AM. Morning time traffic is pretty weak and it picks up again in the evening. Interpreting this is hard as my blog covers a lot of topics – but if your blog is more focused you learn a lot about your visitors.

2. Referrers : This is a very useful statistic if you do some marketing for your blog. For best results, you may want to use just the domain of the url instead of the whole url for analysis. Using it you can figure out which sites drive traffic to your blog. If it is another blog, then it is a good idea to cultivate some friendship with that blog’s owner. For eg, for my blog , I found that digg drives more traffic that reddit. Also facebook drives some traffic to my blog – so I use WordPress’s facebook publicize feature. I also find that I get some traffic due to WordPress’s related posts feature which means that I must use good use of categories and tags. Your mileage may vary but I hope the basic idea is clear.

3. Individual Post Views : This is probably the most useful set of statistics. Basically , it allows you to analyze the traffic of individual posts over a period of time. I have a file which associates a blog post with extra information : For eg it stores the categories, tags, original creation date, all modification dates etc. (If you are curious , I store the information in JSON format). Once you have this information lot of analysis is possible.

a. You can figure out your audience type. If for a post, you have lot of audience in the first week and almost no audience from then on – then most likely your audience is driven by subscription. If it keeps having a regular traffic, then probably it has some useful stuff and traffic is constantly driven to it by search engines. For eg, my Biweekly links belong to the first scenario : When I publish one, lot of people visit it and then after a few days it gets practically no visits. In the other case, my post of Mean Shift gets a steady stream of views every week. If you want to sustain a good viewership, you may want to write more posts which can attract long term views.

b. If you use categories and tags wisely, you can tally the number of views per each category. This will give you an idea of the blog posts which users prefer. I noticed that my audience seems to like my Linux / Data Mining posts than other categories. So it is a good idea to write more of those posts.

c. You can kind of see a pareto effect in your blog posts. For eg, my top 10 blogs account for atleast 70% of my blog traffic. So if I could identify them correctly, I can write lesser posts but still maintain my blog traffic 😉

You can do lot more than these simple analysis but this is just a start.

4. Search Terms : This is another neat statistic. You can use it to figure out the primary way in which users access your blog. For eg, the ratio of total blog post view for a day and number of search terms for the day is quite interesting. If the ratio is high, then most of the people find your blog using search engines. In a way , this a potential transient audience whom you can convert to regular audience. If the ratio is small , then your blog gets views by referrers and regular viewers. This will assure you a steady audience , but it is slightly hard to get new people "find" your blog.

This statistic also tells you which keywords the viewers use to find my blog. You can gleam lot of interesting things from this. For eg, almost all of my search terms are 3-5 words long and usually very specific. This either means that the user is an expert and has crafted specific query. It may also mean that user rewrote the query and my blog was not found in the general query. I also use the terms to figure out if the user would have been satisfied with my blog. For eg, I know that a user searching "install matlab to 64-bit" will be satisfied while some one who searches "k means determine k" will not be. You can do either of two things : augment your blog post to add information that users are searching , or point users to resources that satisfies their query. For eg, I found lot of people reached my blog searching for how to find k. I found geomblog had couple of good posts on it and updated my blog to link to these posts. Some times, I may add a FAQ if same search query comes multiple times and if my page contains the information but is obscure. Eg : lot of people reached my blog searching for Empathy’s chat log location. My post of Empathy  had it but not in a prominent fashion. So I added a FAQ which points the answer immediately.

5. Clicks : This statistic is tangentially useful in finding out which links the user clicks. One way I use it is to gauge the "tech" level of my reader who visits my blog using search engines. I usually link to Wikipedia articles for common terms. If the user clicks on these basic terms often ,then it might mean that I write articles at a level higher that the typical user and I have to explain it better. For eg , in my post of K-Means , this was the reason I explain supervised and unsupervised learning at the start even though most people learning k-means already know it.

### Other Resources for Blog Traffic

There are other locations that give some useful information about your traffic. Some of them are :

a. Google Web Master Site : This is arguably one of the most comprehensive information about your blog post’s performance in Google. You can see it a Google Webmaster Page -> Your site on web -> Search queries. It has information like impressions, click throughs, average position etc. You can download all of them in a csv file too ! Literally a gold mine for data lovers.
b. Feedburner : Even though, WordPress has a feed, I switched to FeedBurner. One of the main reason was that it gave me a csv file detailing the number of views by my subscribers.
c. Quantcast : Useful for aggregate information. It has multiple charts that detail the number of views, unique visitors etc. The Quantcast data might not be accurate as it is usually estimated – but it gives a broad gauge of your blog. It also has some statistic which says how many of your visitors are addicts , how many are pass throughs etc. Quite useful !
d. Alexa : Similar to Quantcast . I primarily use my Alexa Rank for motivation to improve my blog ranking.

### Pivot Tables

I primarily use Python prompt to play with data. If you are not comfortable with programmatic tweaking, use spreadsheets to do these analysis. If you have Windows, you can do lot of powerful analysis by importing the statistics obtained by the scripts into Microsoft Excel. Excel has a neat feature called Pivot tables. It is also an advanced topic that I will not discuss here. You can do some fantastic analysis using pivots. They also give you the ability to view the same data from multiple perspectives.

In this post, I have barely scratched the surface – You can do lot of amazing analysis using WordPress Stats API . I will talk about more complex analysis in a later post. Have fun with the data !

## A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm

K Nearest Neighbor (KNN from now on) is one of those algorithms that are very simple to understand but works incredibly well in practice. Also it is surprisingly versatile and its applications range from vision to proteins to computational geometry to graphs and so on . Most people learn the algorithm and do not use it much which is a pity as a clever use of KNN can make things very simple. It also might surprise many to know that KNN is one of the top 10 data mining algorithms. Lets see why this is the case !

In this post, I will talk about KNN and how to apply it in various scenarios. I will focus primarily on classification even though it can also be used in regression). I also will not discuss much about Voronoi diagram or  tessellation.

### KNN Introduction

KNN is an non parametric lazy learning algorithm. That is a pretty concise statement. When you say a technique is non parametric , it means that it does not make any assumptions on the underlying data distribution. This is pretty useful , as in the real world , most of the practical data does not obey the typical theoretical assumptions made (eg gaussian mixtures, linearly separable etc) . Non parametric algorithms like KNN come to the rescue here.

It is also a lazy algorithm. What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase or it is very minimal. This means the training phase is pretty fast . Lack of generalization means that KNN keeps all the training data. More exactly, all the training data is needed during the testing phase. (Well this is an exaggeration, but not far from truth). This is in contrast to other techniques like SVM where you can discard all non support vectors without any problem.  Most of the lazy algorithms – especially KNN – makes decision based on the entire training data set (in the best case a subset of them).

The dichotomy is pretty obvious here – There is a non existent or minimal training phase but a costly testing phase. The cost is in terms of both time and memory. More time might be needed as in the worst case, all data points might take point in decision. More memory is needed as we need to store all training data.

### Assumptions in KNN

Before using KNN, let us revisit some of the assumptions in KNN.

KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or possibly even multidimensional vectors. Since the points are in feature space, they have a notion of distance – This need not necessarily be Euclidean distance although it is the one commonly used.

Each of the training data consists of a set of vectors and class label associated with each vector. In the simplest case , it will be either + or – (for positive or negative classes). But KNN , can work equally well with arbitrary number of classes.

We are also given a single number "k" . This number decides how many neighbors (where neighbors is defined based on the distance metric) influence the classification. This is usually a odd number if the number of classes is 2. If k=1 , then the algorithm is simply called the nearest neighbor algorithm.

### KNN for Density Estimation

Although classification remains the primary application of KNN, we can use it to do density estimation also. Since KNN is non parametric, it can do estimation for arbitrary distributions. The idea is very similar to use of Parzen window . Instead of using hypercube and kernel functions, here we do the estimation as follows – For estimating the density at a point x, place a hypercube centered at x and keep increasing its size till k neighbors are captured. Now estimate the density using the formula,

$p(x) = \frac{k/n}{V}$

Where n is the total number of V is the volume of the hypercube. Notice that the numerator is essentially a constant and the density is influenced by the volume. The intuition is this : Lets say density at x is very high. Now, we can find k points near x very quickly . These points are also very close to x (by definition of high density). This means the volume of hypercube is small and the resultant density is high. Lets say the density around x is very low. Then the volume of the hypercube needed to encompass k nearest neighbors is large and consequently, the ratio is low.

The volume performs a job similar to the bandwidth parameter in kernel density estimation. In fact , KNN is one of common methods to estimate the bandwidth (eg adaptive mean shift) .

### KNN for Classification

Lets see how to use KNN for classification. In this case, we are given some data points for training and also a new unlabelled data for testing. Our aim is to find the class label for the new point. The algorithm has different behavior based on k.

### Case 1 : k = 1 or Nearest Neighbor Rule

This is the simplest scenario. Let x be the point to be labeled . Find the point closest to x . Let it be y. Now nearest neighbor rule asks to assign the label of y to x. This seems too simplistic and some times even counter intuitive. If you feel that this procedure will result a huge error , you are right – but there is a catch. This reasoning holds only when the number of data points is not very large.

If the number of data points is very large, then there is a very high chance that label of x and y are same. An example might help – Lets say you have a (potentially) biased coin. You toss it for 1 million time and you have got head 900,000 times. Then most likely your next call will be head. We can use a similar argument here.

Let me try an informal argument here -  Assume all points are in a D dimensional plane . The number of points is reasonably large. This means that the density of the plane at any point is fairly high. In other words , within any subspace there is adequate number of points. Consider a point x in the subspace which also has a lot of neighbors. Now let y be the nearest neighbor. If x and y are sufficiently close, then we can assume that probability that x and y belong to same class is fairly same – Then by decision theory, x and y have the same class.

The book "Pattern Classification" by Duda and Hart has an excellent discussion about this Nearest Neighbor rule. One of their striking results is to obtain a fairly tight error bound to the Nearest Neighbor rule. The bound is

$P^* \leq P \leq P^* ( 2 - \frac{c}{c-1} P^*)$

Where $P^*$ is the Bayes error rate, c is the number of classes and P is the error rate of Nearest Neighbor. The result is indeed very striking (atleast to me) because it says that if the number of points is fairly large then the error rate of Nearest Neighbor is less that twice the Bayes error rate. Pretty cool for a simple algorithm like KNN. Do read the book for all the juicy details.

### Case 2 : k = K or k-Nearest Neighbor Rule

This is a straightforward extension of 1NN. Basically what we do is that we try to find the k nearest neighbor and do a majority voting. Typically k is odd when the number of classes is 2. Lets say k = 5 and there are 3 instances of C1 and 2 instances of C2. In this case , KNN says that new point has to labeled as C1 as it forms the majority. We follow a similar argument when there are multiple classes.

One of the straight forward extension is not to give 1 vote to all the neighbors. A very common thing to do is weighted kNN where each point has a weight which is typically calculated using its distance. For eg under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point to be classified. This means that neighboring points have a higher vote than the farther points.

It is quite obvious that the accuracy *might* increase when you increase k but the computation cost also increases.

### Some Basic Observations

1. If we assume that the points are d-dimensional, then the straight forward implementation of finding k Nearest Neighbor takes O(dn) time.
2. We can think of KNN in two ways  – One way is that KNN tries to estimate the posterior probability of the point to be labeled (and apply bayesian decision theory based on the posterior probability). An alternate way is that KNN calculates the decision surface (either implicitly or explicitly) and then uses it to decide on the class of the new points.
3. There are many possible ways to apply weights for KNN – One popular example is the Shephard’s method.
4. Even though the naive method takes O(dn) time, it is very hard to do better unless we make some other assumptions. There are some efficient data structures like KD-Tree  which can reduce the time complexity but they do it at the cost of increased training time and complexity.
5. In KNN, k is usually chosen as an odd number if the number of classes is 2.
6. Choice of k is very critical – A small value of k means that noise will have a higher influence on the result. A large value make it computationally expensive and kinda defeats the basic philosophy behind KNN (that points that are near might have similar densities or classes ) .A simple approach to select k is set  $k=\sqrt{n}$
7. There are some interesting data structures and algorithms when you apply KNN on graphs – See Euclidean minimum spanning tree and Nearest neighbor graph .

8. There are also some nice techniques like condensing, search tree and partial distance that try to reduce the time taken to find the k nearest neighbor. Duda et al has a discussion of all these techniques.

### Applications of KNN

KNN is a versatile algorithm and is used in a huge number of fields. Let us take a look at few uncommon and non trivial applications.

1. Nearest Neighbor based Content Retrieval
This is one the fascinating applications of KNN – Basically we can use it in Computer Vision for many cases – You can consider handwriting detection as a rudimentary nearest neighbor problem. The problem becomes more fascinating if the content is a video – given a video find the video closest to the query from the database – Although this looks abstract, it has lot of practical applications – Eg : Consider ASL (American Sign Language)  . Here the communication is done using hand gestures.

So lets say if we want to prepare a dictionary for ASL so that user can query it doing a gesture. Now the problem reduces to find the (possibly k) closest gesture(s) stored in the database and show to user. In its heart it is nothing but a KNN problem. One of the professors from my dept , Vassilis Athitsos , does research in this interesting topic – See Nearest Neighbor Retrieval and Classification for more details.

2. Gene Expression
This is another cool area where many a time, KNN performs better than other state of the art techniques . In fact a combination of KNN-SVM is one of the most popular techniques there. This is a huge topic on its own and hence I will refrain from talking much more about it.

3. Protein-Protein interaction and 3D structure prediction
Graph based KNN is used in protein interaction prediction. Similarly KNN is used in structure prediction.

### References

There are lot of excellent references for KNN. Probably, the finest is the book "Pattern Classification" by Duda and Hart. Most of the other references are application specific. Computational Geometry has some elegant algorithms for KNN range searching. Bioinformatics and Proteomics also has lot of topical references.

After this post, I hope you had a better appreciation of KNN !

If you liked this post , please subscribe to the RSS feed.

## Introduction To Mean Shift Algorithm

Its been quite some time since I wrote a Data Mining post . Today, I intend to post on Mean Shift – a really cool but not very well known algorithm. The basic idea is quite simple but the results are amazing. It was invented long back in 1975 but was not widely used till two papers applied the algorithm to Computer Vision.

I learned this algorithm in my Advanced Data Mining course and I wrote the lecture notes on it. So here I am trying to convert my lecture notes to a post. I have tried to simplify it – but this post is quite involved than the other posts.

It is quite sad that there exists no good post on such a good algorithm. While writing my lecture notes, I struggled a lot for good resources 🙂 . The 3 “classic" papers on Mean Shift are quite hard to understand. Most of the other resources are usually from Computer Vision courses where Mean Shift is taught lightly as yet another technique for vision tasks  (like segmentation) and contains only the main intuition and the formulas.

As a disclaimer, there might be errors in my exposition – so if you find anything wrong please let me know and I will fix it. You can always check out the reference for more details. I have not included any graphics in it but you can check the ppt given in the references for an animation of Mean Shift.

### Introduction

Mean Shift is a powerful and versatile non parametric iterative algorithm that can be used for lot of purposes like finding modes, clustering etc. Mean Shift was introduced in Fukunaga and Hostetler [1] and has been extended to be applicable in other fields like Computer Vision.This document will provide a discussion of Mean Shift , prove its convergence and slightly discuss its important applications.

### Intuitive Idea of Mean Shift

This section provides an intuitive idea of Mean shift and the later sections will expand the idea. Mean shift considers feature space as a empirical probability density function. If the input is a set of points then Mean shift considers them as sampled from the underlying probability density function. If dense regions (or clusters) are present in the feature space , then they correspond to the mode (or local maxima) of the probability density function. We can also identify clusters associated with the given mode using Mean Shift.

For each data point, Mean shift associates it with the nearby peak of the dataset’s probability density function. For each data point, Mean shift defines a window around it and computes the mean of the data point . Then it shifts the center of the window to the mean and repeats the algorithm till it converges. After each iteration, we can consider that the window shifts to a more denser region of the dataset.

At the high level, we can specify Mean Shift as follows :
1. Fix a window around each data point.
2. Compute the mean of data within the window.
3. Shift the window to the mean and repeat till convergence.

### Preliminaries

#### Kernels :

A kernel is a function that satisfies the following requirements :

1. $\int_{R^{d}}\phi(x)=1$

2. $\phi(x)\geq0$

Some examples of kernels include :

1. Rectangular $\phi(x)=\begin{cases} 1 & a\leq x\leq b\\ 0 & else\end{cases}$

2. Gaussian $\phi(x)=e^{-\frac{x^{2}}{2\sigma^{2}}}$

3. Epanechnikov $\phi(x)=\begin{cases} \frac{3}{4}(1-x^{2}) & if\;|x|\leq1\\ 0 & else\end{cases}$

Kernel Density Estimation

Kernel density estimation is a non parametric way to estimate the density function of a random variable. This is usually called as the Parzen window technique. Given a kernel K, bandwidth parameter h , Kernel density estimator for a given set of d-dimensional points is

${\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)}$

### Gradient Ascent Nature of Mean Shift

Mean shift can be considered to based on Gradient ascent on the density contour. The generic formula for gradient ascent is ,

$x_{1}=x_{0}+\eta f'(x_{0})$

Applying it to kernel density estimator,

${\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K\left(\frac{x-x_{i}}{h}\right)}$

$\bigtriangledown{\displaystyle \hat{f}(x)=\frac{1}{nh^{d}}\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)}$

Setting it to 0 we get,

${\displaystyle \sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x}=\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x_{i}}}$

Finally , we get

${\displaystyle \overrightarrow{x}=\frac{\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)\overrightarrow{x_{i}}}{\sum_{i=1}^{n}K'\left(\frac{x-x_{i}}{h}\right)}}$

### Mean Shift

As explained above, Mean shift treats the points the feature space as an probability density function . Dense regions in feature space corresponds to local maxima or modes. So for each data point, we perform gradient ascent on the local estimated density until convergence. The stationary points obtained via gradient ascent represent the modes of the density function. All points associated with the same stationary point belong to the same cluster.

Assuming $g(x)=-K'(x)$ , we have

${\displaystyle m(x)=\frac{\sum_{i=1}^{n}g\left(\frac{x-x_{i}}{h}\right)x_{i}}{\sum_{i=1}^{n}g\left(\frac{x-x_{i}}{h}\right)}-x}$

The quantity $m(x)$ is called as the mean shift. So mean shift procedure can be summarized as : For each point $x_{i}$

1. Compute mean shift vector $m(x_{i}^{t})$

2. Move the density estimation window by $m(x_{i}^{t})$

3. Repeat till convergence

Using a Gaussian kernel as an example,

1. $y_{i}^{0}=x_{i}$
2. ${\displaystyle y_{i}^{t+1}=\frac{\sum_{i=1}^{n}x_{j}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}{\sum_{i=1}^{n}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}}$

### Proof Of Convergence

Using the kernel profile,

${\displaystyle y^{t+1}=\frac{\sum_{i=1}^{n}x_{i}k(||\frac{y^{t}-x_{i}}{h}||^{2})}{\sum_{i=1}^{n}k(||\frac{y^{t}-x_{i}}{h}||^{2})}}$

To prove the convergence , we have to prove that $f(y^{t+1})\geq f(y^{t})$

$f(y^{t+1})-f(y^{t})={\displaystyle \sum_{i=1}^{n}}k(||\frac{y^{t+1}-x_{i}}{h}||^{2})-{\displaystyle \sum_{i=1}^{n}}k(||\frac{y^{t}-x_{i}}{h}||^{2})$

But since the kernel is a convex function we have ,

$k(y^{t+1})-k(y^{t})\geq k'(y^{t})(y^{t+1}-y^{t})$

Using it ,

$f(y^{t+1})-f(y^{t})\geq{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(||\frac{y^{t+1}-x_{i}}{h}||^{2}-||\frac{y^{t}-x_{i}}{h}||^{2})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-2y^{t+1}x_{i}+x_{i}^{2}-(y^{t^{2}}-2y^{t}x_{i}+x_{i}^{2}))$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2(y^{t+1}-y^{t})^{T}x_{i})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2(y^{t+1}-y^{t})^{T}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2(y^{(t+1)^{2}}-y^{t}y^{t+1}))$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(y^{(t+1)^{2}}-y^{t^{2}}-2y^{(t+1)^{2}}+2y^{t}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(-y^{(t+1)^{2}}-y^{t^{2}}+2y^{t}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(-1)(y^{(t+1)^{2}}+y^{t^{2}}-2y^{t}y^{t+1})$

$=\frac{1}{h^{2}}{\displaystyle \sum_{i=1}^{n}-}k'(||\frac{y^{t}-x_{i}}{h}||^{2})(||y^{t+1}-y^{t}||^{2})$

$\geq0$

Thus we have proven that the sequence $\{f(j)\}_{j=1,2...}$is convergent. The second part of the proof in [2] which tries to prove the sequence $\{y_{j}\}_{j=1,2,...}$ is convergent is wrong.

### Improvements to Classic Mean Shift Algorithm

The classic mean shift algorithm is time intensive. The time complexity of it is given by $O(Tn^{2})$ where $T$ is the number of iterations and $n$ is the number of data points in the data set. Many improvements have been made to the mean shift algorithm to make it converge faster.

One of them is the adaptive Mean Shift where you let the bandwidth parameter vary for each data point. Here, the $h$ parameter is calculated using kNN algorithm. If $x_{i,k}$is the k-nearest neighbor of $x_{i}$ then the bandwidth is calculated as

$h_{i}=||x_{i}-x_{i,k}||$

Here we use $L_{1}$or $L_{2}$ norm to find the bandwidth.

An alternate way to speed up convergence is to alter the data points
during the course of Mean Shift. Again using a Gaussian kernel as
an example,

1. $y_{i}^{0}=x_{i}$
2. ${\displaystyle y_{i}^{t+1}=\frac{\sum_{i=1}^{n}x_{j}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}{\sum_{i=1}^{n}e^{\frac{-|y_{i}^{t}-x_{j}|^{2}}{h^{2}}}}}$
3. $x_{i}=y_{i}^{t+1}$

### Other Issues

1. Even though mean shift is a non parametric algorithm , it does require the bandwidth parameter h to be tuned. We can use kNN to find out the bandwidth. The choice of bandwidth in influences convergence rate and the number of clusters.
2. Choice of bandwidth parameter h is critical. A large h might result in incorrect
clustering and might merge distinct clusters. A very small h might result in too many clusters.

3. When using kNN to determining h, the choice of k influences the value of h. For good results, k has to increase when the dimension of the data increases.
4. Mean shift might not work well in higher dimensions. In higher dimensions , the number of local maxima is pretty high and it might converge to a local optima soon.
5. Epanechnikov kernel has a clear cutoff and is optimal in bias-variance tradeoff.

### Applications of Mean Shift

Mean shift is a versatile algorithm that has found a lot of practical applications – especially in the computer vision field. In the computer vision, the dimensions are usually low (e.g. the color profile of the image). Hence mean shift is used to perform lot of common tasks in vision.

Clustering

The most important application is using Mean Shift for clustering. The fact that Mean Shift does not make assumptions about the number of clusters or the shape of the cluster makes it ideal for handling clusters of arbitrary shape and number.

Although, Mean Shift is primarily a mode finding algorithm , we can find clusters using it. The stationary points obtained via gradient ascent represent the modes of the density function. All points associated with the same stationary point belong to the same cluster.

An alternate way is to use the concept of Basin of Attraction. Informally, the set of points that converge to the same mode forms the basin of attraction for that mode. All the points in the same basin of attraction are associated with the same cluster. The number of clusters is obtained by the number of modes.

Computer Vision Applications

Mean Shift is used in multiple tasks in Computer Vision like segmentation, tracking, discontinuity preserving smoothing etc. For more details see [2],[8].

### Comparison with K-Means

Note : I have discussed K-Means at K-Means Clustering Algorithm. You can use it to brush it up if you want.

K-Means is one of most popular clustering algorithms. It is simple,fast and efficient. We can compare Mean Shift with K-Means on number of parameters.

One of the most important difference is that K-means makes two broad assumptions – the number of clusters is already known and the clusters are shaped spherically (or elliptically). Mean shift , being a non parametric algorithm, does not assume anything about number of clusters. The number of modes give the number of clusters. Also, since it is based on density estimation, it can handle arbitrarily shaped clusters.

K-means is very sensitive to initializations. A wrong initialization can delay convergence or some times even result in wrong clusters. Mean shift is fairly robust to initializations. Typically, mean shift is run for each point or some times points are selected uniformly from the feature space [2] . Similarly, K-means is sensitive to outliers but Mean Shift is not very sensitive.

K-means is fast and has a time complexity $O(knT)$ where k is the number of clusters, n is the number of points and T is the number of iterations. Classic mean shift is computationally expensive with a time complexity $O(Tn^{2})$.

Mean shift is sensitive to the selection of bandwidth, $h$. A small $h$ can slow down the convergence. A large $h$ can speed up convergence but might merge two modes. But still, there are many techniques to determine $h$ reasonably well.

Update [30 Apr 2010] : I did not expect this reasonably technical post to become very popular, yet it did ! Some of the people who read it asked for a sample source code. I did write one in Matlab which randomly generates some points according to several gaussian distribution and the clusters using Mean Shift . It implements both the basic algorithm and also the adaptive algorithm. You can download my Mean Shift code here. Comments are as always welcome !

### References

1. Fukunaga and Hostetler, "The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition", IEEE Transactions on Information Theory vol 21 , pp 32-40 ,1975
2. Dorin Comaniciu and Peter Meer, Mean Shift : A Robust approach towards feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 24 No 5 May 2002.
3. Yizong Cheng , Mean Shift, Mode Seeking, and Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 17 No 8 Aug 1995.
4. Mean Shift Clustering by Konstantinos G. Derpanis
5. Chris Ding Lectures CSE 6339 Spring 2010.
6. Dijun Luo’s presentation slides.
7. cs.nyu.edu/~fergus/teaching/vision/12_segmentation.ppt

8. Dorin Comaniciu, Visvanathan Ramesh and Peter Meer, Kernel-Based Object Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence vol 25 No 5 May 2003.
9. Dorin Comaniciu, Visvanathan Ramesh and Peter Meer, The Variable Bandwidth Mean Shift and Data-Driven Scale Selection, ICCV 2001.

## Impressions on Berkeley Machine Learning Workshop

Machine Learning is one of my areas of interest and I am trying to learn as much as possible. It is a very demanding but also a fun field .

Recently , I came across a huge list of video lectures on Machine Learning from the blog Onionesque Reality. More exactly, the post titled Demystifying Support Vector Machines for Beginners had a section “Webcasts/ Video Lectures on Learning Theory, Support Vector Machines and related ideas” where links to lot of lectures were provided. I sampled the lectures given here and I have to admit most of them are really high quality stuff.

As a first step, I decided to finish the Berkeley’s Machine Learning workshop. The website for video lectures is here. It contains 11 lectures over two days. I find it incredible that they were able to finish this off in two days. If I had been there my brain would have exploded with all this information  🙂

The courses cover most of the important areas of Machine Learning. Best of all, two of the lectures – on Graphical Models (my area of research) and Non Parametric Bayesian Models were taken by Professor Michael Jordan. Well, he is an excellent teacher and it is no wonder that many of his students are leaders in Machine Learning.

Overall , the courses were good – some excellent and some good.

The lectures were encoded in real player format which surprised me. It is hard but not impossible to play rm files in Linux. Even though all of  totem, mplayer or vlc can play it , the navigation was very awkward and clumsy and is prone to frequent crashes. I am thinking of converting them to mp4 format but have not found the exact way to do it .

The lectures were streamed via rtsp protocol. If you are using Linux and wanted to download the lectures there is an easy way to do it. Just click on one of the lectures and find its rtsp url. (Most of the time it is done by right clicking and saying Copy Url ).  You can then use mplayer to dump the file . The syntax is

mplayer -noframedrop -dumpfile Manifold_Learning_and_Visualization.rm -dumpstream rtsp://169.229.131.16:554/events/eecs/eecs_20070824_2.rm

The dumpfile argument takes the output file into which the lecture will be downloaded into. The dumpstream takes the rtsp url as input.

Have fun with the lectures !