Feeds:
Posts

## A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm

K Nearest Neighbor (KNN from now on) is one of those algorithms that are very simple to understand but works incredibly well in practice. Also it is surprisingly versatile and its applications range from vision to proteins to computational geometry to graphs and so on . Most people learn the algorithm and do not use it much which is a pity as a clever use of KNN can make things very simple. It also might surprise many to know that KNN is one of the top 10 data mining algorithms. Lets see why this is the case !

In this post, I will talk about KNN and how to apply it in various scenarios. I will focus primarily on classification even though it can also be used in regression). I also will not discuss much about Voronoi diagram or  tessellation.

### KNN Introduction

KNN is an non parametric lazy learning algorithm. That is a pretty concise statement. When you say a technique is non parametric , it means that it does not make any assumptions on the underlying data distribution. This is pretty useful , as in the real world , most of the practical data does not obey the typical theoretical assumptions made (eg gaussian mixtures, linearly separable etc) . Non parametric algorithms like KNN come to the rescue here.

It is also a lazy algorithm. What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase or it is very minimal. This means the training phase is pretty fast . Lack of generalization means that KNN keeps all the training data. More exactly, all the training data is needed during the testing phase. (Well this is an exaggeration, but not far from truth). This is in contrast to other techniques like SVM where you can discard all non support vectors without any problem.  Most of the lazy algorithms – especially KNN – makes decision based on the entire training data set (in the best case a subset of them).

The dichotomy is pretty obvious here – There is a non existent or minimal training phase but a costly testing phase. The cost is in terms of both time and memory. More time might be needed as in the worst case, all data points might take point in decision. More memory is needed as we need to store all training data.

### Assumptions in KNN

Before using KNN, let us revisit some of the assumptions in KNN.

KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or possibly even multidimensional vectors. Since the points are in feature space, they have a notion of distance – This need not necessarily be Euclidean distance although it is the one commonly used.

Each of the training data consists of a set of vectors and class label associated with each vector. In the simplest case , it will be either + or – (for positive or negative classes). But KNN , can work equally well with arbitrary number of classes.

We are also given a single number "k" . This number decides how many neighbors (where neighbors is defined based on the distance metric) influence the classification. This is usually a odd number if the number of classes is 2. If k=1 , then the algorithm is simply called the nearest neighbor algorithm.

### KNN for Density Estimation

Although classification remains the primary application of KNN, we can use it to do density estimation also. Since KNN is non parametric, it can do estimation for arbitrary distributions. The idea is very similar to use of Parzen window . Instead of using hypercube and kernel functions, here we do the estimation as follows – For estimating the density at a point x, place a hypercube centered at x and keep increasing its size till k neighbors are captured. Now estimate the density using the formula, $p(x) = \frac{k/n}{V}$

Where n is the total number of V is the volume of the hypercube. Notice that the numerator is essentially a constant and the density is influenced by the volume. The intuition is this : Lets say density at x is very high. Now, we can find k points near x very quickly . These points are also very close to x (by definition of high density). This means the volume of hypercube is small and the resultant density is high. Lets say the density around x is very low. Then the volume of the hypercube needed to encompass k nearest neighbors is large and consequently, the ratio is low.

The volume performs a job similar to the bandwidth parameter in kernel density estimation. In fact , KNN is one of common methods to estimate the bandwidth (eg adaptive mean shift) .

### KNN for Classification

Lets see how to use KNN for classification. In this case, we are given some data points for training and also a new unlabelled data for testing. Our aim is to find the class label for the new point. The algorithm has different behavior based on k.

### Case 1 : k = 1 or Nearest Neighbor Rule

This is the simplest scenario. Let x be the point to be labeled . Find the point closest to x . Let it be y. Now nearest neighbor rule asks to assign the label of y to x. This seems too simplistic and some times even counter intuitive. If you feel that this procedure will result a huge error , you are right – but there is a catch. This reasoning holds only when the number of data points is not very large.

If the number of data points is very large, then there is a very high chance that label of x and y are same. An example might help – Lets say you have a (potentially) biased coin. You toss it for 1 million time and you have got head 900,000 times. Then most likely your next call will be head. We can use a similar argument here.

Let me try an informal argument here -  Assume all points are in a D dimensional plane . The number of points is reasonably large. This means that the density of the plane at any point is fairly high. In other words , within any subspace there is adequate number of points. Consider a point x in the subspace which also has a lot of neighbors. Now let y be the nearest neighbor. If x and y are sufficiently close, then we can assume that probability that x and y belong to same class is fairly same – Then by decision theory, x and y have the same class.

The book "Pattern Classification" by Duda and Hart has an excellent discussion about this Nearest Neighbor rule. One of their striking results is to obtain a fairly tight error bound to the Nearest Neighbor rule. The bound is $P^* \leq P \leq P^* ( 2 - \frac{c}{c-1} P^*)$

Where $P^*$ is the Bayes error rate, c is the number of classes and P is the error rate of Nearest Neighbor. The result is indeed very striking (atleast to me) because it says that if the number of points is fairly large then the error rate of Nearest Neighbor is less that twice the Bayes error rate. Pretty cool for a simple algorithm like KNN. Do read the book for all the juicy details.

### Case 2 : k = K or k-Nearest Neighbor Rule

This is a straightforward extension of 1NN. Basically what we do is that we try to find the k nearest neighbor and do a majority voting. Typically k is odd when the number of classes is 2. Lets say k = 5 and there are 3 instances of C1 and 2 instances of C2. In this case , KNN says that new point has to labeled as C1 as it forms the majority. We follow a similar argument when there are multiple classes.

One of the straight forward extension is not to give 1 vote to all the neighbors. A very common thing to do is weighted kNN where each point has a weight which is typically calculated using its distance. For eg under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point to be classified. This means that neighboring points have a higher vote than the farther points.

It is quite obvious that the accuracy *might* increase when you increase k but the computation cost also increases.

### Some Basic Observations

1. If we assume that the points are d-dimensional, then the straight forward implementation of finding k Nearest Neighbor takes O(dn) time.
2. We can think of KNN in two ways  – One way is that KNN tries to estimate the posterior probability of the point to be labeled (and apply bayesian decision theory based on the posterior probability). An alternate way is that KNN calculates the decision surface (either implicitly or explicitly) and then uses it to decide on the class of the new points.
3. There are many possible ways to apply weights for KNN – One popular example is the Shephard’s method.
4. Even though the naive method takes O(dn) time, it is very hard to do better unless we make some other assumptions. There are some efficient data structures like KD-Tree  which can reduce the time complexity but they do it at the cost of increased training time and complexity.
5. In KNN, k is usually chosen as an odd number if the number of classes is 2.
6. Choice of k is very critical – A small value of k means that noise will have a higher influence on the result. A large value make it computationally expensive and kinda defeats the basic philosophy behind KNN (that points that are near might have similar densities or classes ) .A simple approach to select k is set $k=\sqrt{n}$
7. There are some interesting data structures and algorithms when you apply KNN on graphs – See Euclidean minimum spanning tree and Nearest neighbor graph .

8. There are also some nice techniques like condensing, search tree and partial distance that try to reduce the time taken to find the k nearest neighbor. Duda et al has a discussion of all these techniques.

### Applications of KNN

KNN is a versatile algorithm and is used in a huge number of fields. Let us take a look at few uncommon and non trivial applications.

1. Nearest Neighbor based Content Retrieval
This is one the fascinating applications of KNN – Basically we can use it in Computer Vision for many cases – You can consider handwriting detection as a rudimentary nearest neighbor problem. The problem becomes more fascinating if the content is a video – given a video find the video closest to the query from the database – Although this looks abstract, it has lot of practical applications – Eg : Consider ASL (American Sign Language)  . Here the communication is done using hand gestures.

So lets say if we want to prepare a dictionary for ASL so that user can query it doing a gesture. Now the problem reduces to find the (possibly k) closest gesture(s) stored in the database and show to user. In its heart it is nothing but a KNN problem. One of the professors from my dept , Vassilis Athitsos , does research in this interesting topic – See Nearest Neighbor Retrieval and Classification for more details.

2. Gene Expression
This is another cool area where many a time, KNN performs better than other state of the art techniques . In fact a combination of KNN-SVM is one of the most popular techniques there. This is a huge topic on its own and hence I will refrain from talking much more about it.

3. Protein-Protein interaction and 3D structure prediction
Graph based KNN is used in protein interaction prediction. Similarly KNN is used in structure prediction.

### References

There are lot of excellent references for KNN. Probably, the finest is the book "Pattern Classification" by Duda and Hart. Most of the other references are application specific. Computational Geometry has some elegant algorithms for KNN range searching. Bioinformatics and Proteomics also has lot of topical references.

After this post, I hope you had a better appreciation of KNN !

If you liked this post , please subscribe to the RSS feed.

## Introduction to Bayesian Decision Theory – Part 1

In this series of articles , I intend to discuss Bayesian Decision Theory and its most important basic ideas. The articles are mostly based on the classic book "Pattern Classification" by Duda,Hart and Stork. If you want the ideas in all its glory, go get the book !

As I was reading the book, I realized that in its heart , this field is a set of mostly common sense ideas validated by rigorous mathematics. So I will try to discuss the basic ideas in plain english without much mathematical rigor. Since I am still learning how to best explain these complex ideas, any comments on how to improve will be welcome !

You may ask what happened to PCA . Well, I still intend to write more on it but have not found enough time to sit and write it all. I have written a draft version of it but felt it was too technical and not very intuitive.So I am hoping to rewrite it again.

### Background

You would need to know basics of probability to understand the following. In particular the ideas of Prior , Posterior  and the idea of Likelihood  . Of course if you know all these , you will know the big idea of Bayes Theorem . I will try to explain them lightly but if you have any doubts check out Wikipedia or some old books.

We will take the example used in Duda et al. There are two types of fish : Sea Bass and Salmon. We catch a lot of fish of these two types but which are mixed together and our aim is to separate them by automation. We have a conveyor belt in which  the fishes come one by one and we need to decide if the current fish is sea bass or salmon. Of course, we want to make it as accurate as possible but also don’t want to spend lot of money on this project. This is in its heart a classification project. We will be given a few examples of both sea bass and salmon and based on it we need to infer the general characteristics using which we can distinguish them.

### Basic Probability Ideas

Now let us be slightly more formal. We say that there are two "classes" of fish – sea bass and salmon. According to our system, there is no other type of fish. If we treat it as a state machine , then our system has two states. The book uses a notation of $\omega_1 \; and \; \omega_2$ to represent them. We will use the names seabass and salmon as it is more intuitive.

The first basic idea is that of prior probability  . This is represented as $P(seabass) \; and \; P(salmon)$ which basically give the probability that the next fish in the conveyor is seabass or salmon. Of course, both of them have to sum to one. From the Bayesian perspective, this probability is usually obtained from prior (domain) knowledge. We will not talk about Frequentist interpretation as we will focus on Bayesian decision theory.

Let us assume that we use length of the fish to differentiate the fishes. So whenever the fish comes to the conveyor belt, we calculate its length (how, we don’t really care here ) . So we have transformed the fish into a simple representation using a single number, its length. So the length is a feature  that we use to classify and the step of converting  the fish into length is called feature extraction .

In a real life scenario, we will have multiple features and the input will converted to a vector. For eg we may use length , lightness of skin , fin length etc as feature. In this case , the fish will be transformed into a triplet. Converting the input to a feature vector makes further processing easy and more robust. We will usually use the letter x to represent the feature. So you can consider $P(x)$ is the probability of evidence. Eg lets say we got a fish (we dont know what it is yet) of length 5 inches. Now $P(x)$ gives the probability that some fish (either seabass or salmon) has the length 5 inches.

The next idea is that of likelihood . It is also called class conditional probability. It is represented as either $P(x|seabass) \; or \; P(x|salmon)$ . The interpretation is simple. This answers the question that if the fish is seabass what is the probability that it will have length $x$ inches (ditto salmon). Alternatively , what is the probability that a 5 inch seabass exists and so on. Or even how "likely" is a 5 inch seabass ?

The posterior probability is the other side of the story. This is represented by $P(seabass|x) \; or \; P(salmon|x)$ . Intuitively, given that we have a fish of length $x$ inches , what is the probability that it is a seabass (or salmon).  The interesting thing is that knowing prior probability and likelihood we can calculate posterior probability using the famous "Bayes Theorem". We can represent it in words as , $posterior = \frac{likelihood \times prior}{evidence}$

This gives another rationale for the word "likelihood". Among all other things being equal , the item with higher likelihood is more "likely" to final result. For eg if the likelihood of a 10 inch seabass is more than that of salmon then when we observe an unknown fish of length 10 inches , it is most likely a seabass.

PS : There is an excellent (but long) tutorial on Bayes Theorem at "An Intuitive Explanation of Bayes’ Theorem" . True to its title, it does try to explain the bizarre (atleast initially) result of Bayes Theorem using multiple examples. I highly recommend reading it.

### Bayesian Decision Theory

Let us enter into the decision theory at last. In a very high level definition, you can consider decision theory as a field which studies about "decisions" (to classify as seabass or not to be) – more exactly, it considers these decisions in terms of cost or loss functions. (More on that later). In essence , you can think of decision theory as providing a decision rule which tells us what action to taken when we make a particular observation. Decision theory can be thought of as all about evaluating decision rules. (Of course, I am grossly simplifying things, but I think I have conveyed the essence).

### Informal Discussion of Decision Theory for Two Class System with Single Feature

Let us take a look at the simplest application of decision theory to our problem. We have a two class system (seabass,salmon) and we are using a single feature (length) to make a decision. Be aware that length is not an ideal feature because many a time you will be having both seabass and salmon of same length (say 5 inches). So when we come across a fish with length 5 inches, we are stuck. We don’t know what decision to take as we know both seabass and salmon can be 5 inches. Decision theory to the rescue !

Instead of providing the theoretical ideas, I will discuss various scenarios and what is the best decision theoretic action to do. In all the scenarios let us assume that we want to be as accurate as possible.

### Case I : We don’t know anything and we are not allowed to see the fish

This is the worst case to be in. We have no idea about seabass and salmon (a vegetarian , perhaps ? 🙂 ). You are also not allowed to see the fish. But you are asked is the next fish in conveyor a seabass or salmon ? All is not lost – The best thing to do is to randomize. So the decision rule is with probability 50% say the next fish is seabass and  with probability 50% say it is salmon.

Convince yourself that this is the best thing to do – Not only when the seabass and salmon are in 50:50 , even when they are in 90:10 ratio.

### Case II : You know the prior probability but still you don’t see the fish

We are in a slightly better position here. We don’t get to see the fish yet , but we know the prior probability that the next fish is a seabass or salmon. ie We are give $P(seabass) \; and \; P(salmon)$ Remember, we want to be as accurate as possible and we want to as reliable about accuracy rate as possible.

A common mistake to do is to randomize again. ie with $P(seabass)$ say that the next fish is seabass and salmon otherwise. For eg let us say, $P(seabass) = .70 \; and \; P(salmon) = 0.3$ . Let me attempt an informal reasoning – In the (sample) worst case, you will get first 40 as seabass, next 30 as salmon and next 30 as seabass. But you say first 30 as salmon and next 70 as seabass. In this hypothetical example you are only at the most 40% accurate even though you can theoretically do better.

What does decision theory say here ? If $P(seabass) > P(salmon)$ then ALWAYS say seabass. Else ALWAYS say salmon. In this case the accuracy rate is $max(P(seabass),P(salmon))$ . Conversely, the error rate is the minimum of both the prior probabilities. Convince yourself that this is the best you can do . It sure is counterintuitive to always say seabass when you know you will get salmon too. But we can easily prove that this is the best you can do "reliably".

Mathematically, decision rule is decide seabass if $P(seabass) > P(salmon)$  else decide salmon .

Error is $min(P(seabass),P(salmon)$ .

### Case III : You know the likelihood function and the length but not the prior probability

This case is really hypothetical. ie we can see the fish and hence find its length. Let say x inches. We have $P(x inches|seabass) \; and \; P(x inches | salmon)$ but we don’t know the prior probability. The decision rule here is : For each fish , find the appropriate likelihood values. If the likelihood of seabass is higher than that of salmon , say the fish is seabass and salmon otherwise.

Note that we are making a decision based on our "observation" in contrast to previous cases. Unless, you are really unlucky and the prior probabilities are really skewed you can do well with this decision rule.

Mathematically, decision rule is decide seabass if $P(x|seabass) > P(x|salmon)$  else  decide  salmon .

### Case IV : You know the length, prior probability and the likelihood function

This is the scenario we are mostly in. We know the length of the fish (say 5 inches). We know the prior probability (say 60% salmon and 40% seabass). We also know the likelihood of them. (say P(5 inches|seabass) is 60% and P(5 inches|salmon) is 10% )

Now we can apply our favorite Bayes rule to get posterior probability. If the length of the fish is 5 inches then what is the probability that it a seabass ? A salmon ? Once you can calculate the posterior , the decision rule becomes simpler. If the posterior probability of seabass is higher than say the fish is seabass else say it is salmon.

Mathematically, decision rule is decide seabass if $P(seabass|x) > P(salmon|x)$ else decide salmon . This rule is very important and is called as Bayes Decision Rule.

For this decision rule, the error is $min(P(seabass|x),P(salmon|x)$

We can expand Bayes decision rule using Bayes theorem.

Decide seabass if $p(x|seabass)P(seabass) > p(x|salmon)P(salmon)$   else decide salmon.

There are two special cases.
1. If likelihood are equal then our decision depends on prior probabilities.
2. If prior probabilities are equal then our decisions depend on likelihoods.

We have only scratched the surface of decision theory. In particular we did not focus much on bounding the error today. Also we did not discuss the cases where there are multiple classes or features. Hopefully, I will discuss them in a future post.

### Reference

Pattern Classification by Duda,Hart and Stork. Chapter 2.

## K-Means Clustering Algorithm

Data Mining is one of my favorite areas in CS. I have so far taken courses on intro data mining, web mining and in this semester (Spring 2010) an advanced data mining course using Graphs and Matrices . IMO, one of the problems with data mining and machine learning is that most of the algorithms / concepts are stuffed with mathematics , even though the basic ideas are very simple. So I am hoping to write some simple to read posts on some of the cool topics in data mining (DM for short) /machine learning (ML for short) . So the writing will be very informal and may lack in rigor.

In this first post, I plan to write on K-Means algorithm. But before talking about the algorithm some basics of Data Mining and Machine Learning. Two very common tasks in Data Mining are Classification and Clustering.

### Classification

In Classification , we are given some training data where someone manually had classified each data in it to some category (aka class). Our aim is to do the classify the future data (aka testing data) into categories/classes as accurately as possible. Classification is a technique in Supervised Learning. Common practical examples include finding if a tumor is malignant or is a particular credit card transaction fraudulent. You can even consider, handwriting recognition as classification.

### Clustering

Clustering , on the other hand , is an unsupervised learning technique . The aim here is to group the data points into clusters such that similar items are lumped together in the same cluster. Here , nobody "trains" the algorithm and it is expected to do a good job. Clustering is one of the important tools in exploratory analysis.

There are two broad types of Clustering. One is called Hierarchical clustering where you try to create a hierarchy of related objects or clusters. One can consider biological taxonomy as a simple example of clustering where similar organisms are lumped into same species . Then all similar species are put into same genus and so on. A more recent example is that of Phylogenetic trees where geneticists find the evolutionary relation between species . Here organisms with common ancestors  are clubbed together.

If the above examples seem very biological consider Google News . Google News groups articles on similar topics together.  For eg all articles that talk about India’s cricket victory today (around 1000 news articles) are grouped into a single cluster , as their primary topic is very similar . Other sports clusters are similarly formed. Now , since all these sports clusters have a common theme , they will be clubbed and put into a bigger cluster called Sports , Similarly , all articles on Sri Lankan elections are grouped together into a cluster. Then other political clusters are grouped to form the Politics section.

The other major type of Clustering is called partitional or iterative clustering. Consider you are given a bin of colored balls. If you separate them into piles of same colored balls , you are doing a rudimentary partitional clustering. I will talk more about Hierarchical clustering in some future post. Here ,I will talk about K-Means which is the most popular partitional clustering algorithm. It is considered as one of the top 10 data mining algorithms .

### Distance Measures

If we want to compare two items we need a notion of similarity or dissimilarity (aka distance). Most of the time, if we know one of these values , we can calculate the other as they related in a antagonistic way. Let us focus on distance. If the items are simple points in a n-dimensional space, then a simple metric will be the Euclidean distance. If we consider p,q as two points , $d(p,q) = \sqrt\sum_{i=1}^{n}{(p_{i}-q_{i})^{2}}$

There are other distance measures like Manhattan Distance , Mahalanobis distance. Popular similarity measures are Jaccard coefficient , Cosine similarity etc.

Things are a bit complicated when the items are not points but objects with some attributes (for eg news articles). In this we might need to design our own distance measures. For checking if news articles talk about same topic, a simple distance measure can be to find the entities in an article and compare their frequencies. Two articles with common words like India, Bangladesh, Cricket, Victor, Test "might" talk about the same topic. For a real world application , finding a good distance measure requires lot of work – The distance measure should be efficient but also reasonably accurate.

### K-Means algorithm

Let us talk about the K-means algorithm. The algorithm accepts two inputs. The data itself, and "k", the number of clusters. We will talk about the implications of specifying "k" later. The output is k clusters with input data partitioned among them.

The aim of K-means (or clustering) is this : We want to group the items into k clusters such that all items in same cluster are as similar to each other as possible. And items not in same cluster are as different as possible. We use the distance measures to calculate similarity and dissimilarity. One of the important concept in K-means is that of centroid. Each cluster has a centroid. You can consider it as the point that is most representative of the cluster. Equivalently, centroid is point that is the "center" of a cluster.

### Algorithm

1. Randomly choose k items and make them as initial centroids.
2. For each point, find the nearest centroid and assign the point to the cluster associated with the nearest centroid.
3. Update the centroid of each cluster based on the items in that cluster. Typically, the new centroid will be the average of all points in the cluster.
4. Repeats steps 2 and 3, till no point switches clusters.

As you can see, the algorithm is extremely simple. After some iterations, we will get k-clusters within which each points are similar. It is quite obvious that the algorithm should work. There is also a formal proof for convergence based on error criterion but we will skip it.

### Some Random Points

1. The algorithm converges even though sometimes it may take exponential time. Although, in practice, it converges very fast.
2. The centroid of a cluster need not correspond to any point in the cluster. Initially, the centroid is a point in the cluster but after some iterations it need not be the case. If you force the algorithm to always select an item in the cluster as its centroid then we get the k-medoids algorithm.
3. There are lot of good things about this algorithm. It is simple to understand, easy to code , scalable , efficient etc.
4. Let N be the number of data points. k be the number of clusters. i be the number of iterations. The time complexity is O(Nki).
5. Although not very obvious, some times, the K-Means algorithm gets stuck in a local minima and might return a bad clustering.
6. The selection of initial points is very critical to avoid local minima. Basic algorithm selects them randomly but much better heuristics exist. A good set of initial points are important as it influences both quality of clusters and also the running time of the algorithm.
7. Finding the correct number of clusters is one of the big problems in K-means. A wrong value of k can give you a sub optimal result. There are many techniques to predict the number of k , for eg by "preprocessing" the data using hierarchical clustering etc.
8. K-means is lot in common with other algorithms like PCA and EM . Both of them are really cool ideas and I will post about them sometime in the future. One interesting tidbit is that professor Chris Ding who is taking my Advanced Data Mining course , proved the equivalence of K-means and PCA.

### My Experience With K-Means

As my data mining course project, I did Clustering of wikipedia results. The task is something like this : Lets say you search for the word Jaguar in wikipedia. It will return a list of articles that contain Jaguar. The word Jaguar has multiple interpretation – as an animal , as a car , or as a version of Mac OS . My project took these pages as input and applied K-means algorithm on the pages . It will group pages that talk about similar concepts and more interestingly , it will also try to give a name for the cluster. (One of the thing about K-means is that it forms the clusters based on similarity – interpreting what each cluster means is up to us ) . So for Jaguar, it will group all articles that talk about Jaguar as a car into one cluster and so on.

Infact, it worked well even if you give a more abstract query term like Obama or McCain . (Remember, it was written in Fall 2008 , when there was US election fever 🙂 ) . So for McCain, it will group pages into clusters talking about his military career, senate career, presidential election and so on. That project was real fun and I learnt about lot of things in data mining while doing it. In case you are interested, the project report is here.

### Code

I wrote a simple code in java to do clustering but I found a more generic code at Antonio Gulli’s blog . The code is here.