
## Posts Tagged ‘genomics’

1. Diaspora : Developer Release
As promised, the folks behind Diaspora have released the code. Interestingly, it uses Ruby! I got the code running but could not add any of my friends. Their architecture is pretty interesting – the idea of decentralization permeates the design. One of their next steps is to integrate/interoperate with Facebook. It should be interesting to see Facebook's response, given that they deny access to Google and Twitter 🙂 I also joined Diaspora's mailing list. It is quite high volume (around 100+ mails a day), although I expect it to drop to acceptable levels soon. The Diaspora code uses so many interesting technologies that just scanning the mails introduces me to new ones – WebID and OStatus are the most recent. Hopefully, I will learn many more ideas along the way!

2. How do I become a data scientist?
An interesting discussion on how to become a better data scientist. One of the responses is very comprehensive, and other respected people have also chimed in.

3. Testing, testing…YouTube begins trial of new live streaming platform
A very interesting move by Google. I am not sure of the long-term implications, but it looks to me like a potential game changer.

4. User Experiences: Customizing Pinned Sites
In the demo during the IE9 beta release, one of the examples showed WordPress and Amazon using pinned sites creatively. This post explains how other sites can use pinned sites more effectively. I am impressed with the ability of sites to provide custom jump-list commands, and the thumbnail toolbar commands are even cooler! IE9 seems to have done a nice job here.

5. At last, Twitter becomes more user friendly. The proposed redesign looks promising – let us see how it changes the way people use Twitter applications.

6. Precursor to H.I.V. Was in Monkeys for Millenniums
Some powerful conclusions from very basic observations 🙂 Hopefully it will lead to new insights on acquiring immunity against HIV, if not a cure!

1. Visions for Theoretical Computer Science
The basic aim of this page is to highlight the broad research directions in TCS in such a way that even lay persons can understand them. I think they have done an admirable job.

2. Express social objects in Atom format
I learnt about Activity Streams from Diaspora's page on the technology they are evaluating. I also played around with Google Buzz's activity stream data. The idea looks very cool, even though I feel the standard is less expressive.

3. How to Make an Artificial Cell
A detailed discussion of the recent breakthrough by Craig Venter and his team.

4. Bringing improved PDF support to Google Chrome
Good move. I would be glad if they brought Flash support too 😉 In other Google news, they introduced a command-line tool to control many of their services. It is limited to the GData APIs, but there are still a lot of them. See more details at Introducing the Google Command Line Tool.

5. Tech News
Lots of interesting stuff is happening in the tech world. There is a new price war between Amazon and Barnes & Noble – see E-Reader Prices Are Slashed. Also check out Swype and SeaMicro.

1. Quantum teleportation achieved over ten miles of free space
I could not find a better article on this incredible feat. The methodology seems interesting, although I have concerns about how to scale the size of the particles and the distance. Let's see how this effort is bettered!

2. What are the lesser known but cool data structures?
Another fantastic discussion from Stack Overflow. The list does come up with some novel data structures.

3. How We Created the First Synthetic Cell
This is a more equanimous op-ed from Craig Venter in the WSJ. The difference between the shrill hype and this post is striking. He explains in layman's terms what has been done and, most importantly, what they have not done (creating artificial life, for one). It is a pity that this genius has not (and may never) received a Nobel Prize.

On a related note, Safety Rules Can't Keep Up With Biotech Industry discusses some safety issues in a field innovating at break-neck pace.

4. How (and Why) to Stop Multitasking
A nice post from an HBR blog. It talks about some advantages of forcing yourself to single-task. The claim is that you become more efficient and less stressed. I have been trying this for a couple of days – I will share the results after a week.

5. MasterCard Wants Programmers to Use Its Payment Technology
This looks promising. I hope it is developer friendly and less bureaucratic. I am quite interested in this API, partly because I have been dabbling recently with OFX. I will describe my experience in a future post.

If you liked this post , please subscribe to the RSS feed.

1. ‘Artificial life’ breakthrough announced by scientists
Probably the biggest news of this week. Craig Venter seems to have done it again! The potential looks exciting. In other genomics-related news, Berkeley Asks Freshmen for DNA Samples. The start of an era of personalized medicine?

2. The last week seemed to be all about Google. Google had so many announcements that detailing them individually would make this post long 🙂

a. Google got FTC approval for the AdMob acquisition.
c. New Google Latitude API – Google Latitude kind of makes me uncomfortable, but I have to admit that the potential apps are truly mind-blowing (check them out in the link!)
d. Google Prediction API – This looks like a game changer, but I need to try it out before commenting.
e. Google releases WebM and VP8 and makes them open source – details here. A more critical look at VP8 is here.
f. Lastly, Google's Pacman doodle was way too cool. The best part was that it was fully done using HTML/JS. Details here.

3. Amazon S3 Reduced Redundancy Storage
I think this is a smart move – recognizing the different types of storage data and pricing accordingly. Very clever!

4. The Life of a Typeahead Query
Has some interesting stuff about how Facebook handles the typeahead searches on its home page. On a related note, Reclaim Privacy is a bookmarklet that checks your Facebook privacy settings. There is a high chance that Facebook will simplify their privacy controls – but this will be useful until then.

5. I’ll Be Bach
I never knew computational music had come this far!

6. STEPHEN HAWKING: How to build a time machine
A nice article by Stephen Hawking where he discusses time travel. He also introduces the Mad Scientist Paradox, which he says is simpler than the Grandfather Paradox. I personally feel it is the other way round. It is hard to understand the central argument of the Mad Scientist Paradox immediately (namely, who fired the shot) – the reasoning goes along the lines of "causes must occur before effects". Enjoy the article for the fun of it!


## A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm

K-Nearest Neighbor (KNN from now on) is one of those algorithms that is very simple to understand but works incredibly well in practice. It is also surprisingly versatile: its applications range from vision to proteins to computational geometry to graphs, and so on. Most people learn the algorithm and then do not use it much, which is a pity, as a clever use of KNN can make things very simple. It might also surprise many to know that KNN is one of the top 10 data mining algorithms. Let's see why this is the case!

In this post, I will talk about KNN and how to apply it in various scenarios. I will focus primarily on classification, even though it can also be used in regression. I also will not discuss much about Voronoi diagrams or tessellation.

### KNN Introduction

KNN is a non-parametric, lazy learning algorithm. That is a pretty concise statement. When you say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution. This is pretty useful, as in the real world most practical data does not obey the typical theoretical assumptions (e.g. Gaussian mixtures, linear separability, etc.). Non-parametric algorithms like KNN come to the rescue here.

It is also a lazy algorithm. What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase, or it is very minimal. This means the training phase is pretty fast. The lack of generalization means that KNN keeps all the training data. More exactly, all the training data is needed during the testing phase (well, this is an exaggeration, but not far from the truth). This is in contrast to other techniques like SVM, where you can discard all non-support vectors without any problem. Most lazy algorithms – especially KNN – make decisions based on the entire training data set (or, in the best case, a subset of it).

The dichotomy is pretty obvious here: a non-existent or minimal training phase, but a costly testing phase. The cost is in terms of both time and memory. More time might be needed because, in the worst case, all data points might take part in the decision. More memory is needed because we need to store all the training data.

### Assumptions in KNN

Before using KNN, let us revisit some of the assumptions in KNN.

KNN assumes that the data is in a feature space. More exactly, the data points are in a metric space. The data can be scalars or possibly even multidimensional vectors. Since the points are in a feature space, they have a notion of distance – this need not necessarily be Euclidean distance, although that is the one commonly used.

Each item of the training data consists of a vector and the class label associated with it. In the simplest case, the label will be either + or – (for positive or negative classes). But KNN can work equally well with an arbitrary number of classes.

We are also given a single number "k". This number decides how many neighbors (where a neighbor is defined by the distance metric) influence the classification. This is usually an odd number if the number of classes is 2. If k=1, then the algorithm is simply called the nearest neighbor algorithm.

### KNN for Density Estimation

Although classification remains the primary application of KNN, we can also use it for density estimation. Since KNN is non-parametric, it can do estimation for arbitrary distributions. The idea is very similar to the use of Parzen windows. Instead of using a fixed hypercube and kernel functions, here we do the estimation as follows: to estimate the density at a point x, place a hypercube centered at x and keep increasing its size until k neighbors are captured. Then estimate the density using the formula

$p(x) = \frac{k/n}{V}$

where n is the total number of data points and V is the volume of the hypercube. Notice that the numerator is essentially a constant, so the density is driven by the volume. The intuition is this: say the density at x is very high. Then we can find k points near x very quickly, and these points are also very close to x (by the definition of high density). This means the volume of the hypercube is small, and the resultant density estimate is high. Now say the density around x is very low. Then the volume of the hypercube needed to encompass the k nearest neighbors is large and, consequently, the ratio is low.

The volume plays a role similar to the bandwidth parameter in kernel density estimation. In fact, KNN is one of the common methods to estimate the bandwidth (e.g. adaptive mean shift).
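To make the formula concrete, here is a minimal 1-D sketch of the estimator, where the "hypercube" reduces to an interval of length 2r, with r the distance to the k-th nearest sample. The function name and use of NumPy are my own; this is an illustration, not a library API.

```python
import numpy as np

def knn_density_1d(x, data, k):
    """KNN density estimate p(x) = (k/n) / V, where V is the length of
    the smallest interval centered at x containing k sample points."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    # distances from x to every sample, sorted ascending
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]        # radius that just captures the k-th neighbor
    volume = 2.0 * r        # in 1-D the "hypercube" is an interval
    return (k / n) / volume

# Sanity check: for standard normal samples, the estimate near the mean
# should be much larger than the estimate out in the tail.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 1000)
print(knn_density_1d(0.0, samples, k=30))   # roughly the N(0,1) peak ~0.4
print(knn_density_1d(3.0, samples, k=30))   # much smaller
```

Note how the intuition from the text shows up directly: near the mode the k neighbors are captured by a tiny interval (small V, large estimate), while in the tail the interval must grow (large V, small estimate).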

### KNN for Classification

Let's see how to use KNN for classification. In this case, we are given some labeled data points for training and a new unlabeled data point for testing. Our aim is to find the class label for the new point. The algorithm behaves differently based on k.

### Case 1 : k = 1 or Nearest Neighbor Rule

This is the simplest scenario. Let x be the point to be labeled. Find the point closest to x; let it be y. The nearest neighbor rule asks us to assign the label of y to x. This seems too simplistic and sometimes even counter-intuitive. If you feel that this procedure will result in a huge error, you are right – but there is a catch. This reasoning holds only when the number of data points is not very large.

If the number of data points is very large, then there is a very high chance that the labels of x and y are the same. An example might help: let's say you have a (potentially) biased coin. You toss it a million times and get heads 900,000 times. Then, most likely, your next toss will be heads. We can use a similar argument here.

Let me try an informal argument here. Assume all points are in a D-dimensional plane, and the number of points is reasonably large. This means that the density of the plane at any point is fairly high; in other words, within any subspace there is an adequate number of points. Consider a point x in such a subspace, and let y be its nearest neighbor. If x and y are sufficiently close, then we can assume that the class probabilities at x and y are nearly the same – and then, by decision theory, x and y receive the same label.

The book "Pattern Classification" by Duda and Hart has an excellent discussion about this Nearest Neighbor rule. One of their striking results is to obtain a fairly tight error bound to the Nearest Neighbor rule. The bound is

$P^* \leq P \leq P^* ( 2 - \frac{c}{c-1} P^*)$

where $P^*$ is the Bayes error rate, c is the number of classes, and P is the error rate of the nearest neighbor rule. The result is indeed very striking (at least to me) because it says that if the number of points is fairly large, then the error rate of the nearest neighbor rule is less than twice the Bayes error rate. Pretty cool for a simple algorithm like KNN. Do read the book for all the juicy details.
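The 1-NN rule itself is only a few lines of code. Here is a small sketch (function name and toy data are my own) using plain Euclidean distance, as the text assumes:

```python
import numpy as np

def nearest_neighbor_classify(x, points, labels):
    """1-NN rule: assign x the label of its closest training point.
    Uses Euclidean distance, the common default."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - x, axis=1)   # distance to every point
    return labels[int(np.argmin(dists))]         # label of the closest one

# Two well-separated clusters: '-' near the origin, '+' near (5, 5).
train = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
train_labels = ['-', '-', '+', '+']
print(nearest_neighbor_classify(np.array([0.3, 0.1]), train, train_labels))  # '-'
print(nearest_neighbor_classify(np.array([4.8, 5.1]), train, train_labels))  # '+'
```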

### Case 2 : k = K or k-Nearest Neighbor Rule

This is a straightforward extension of 1NN. Basically, we find the k nearest neighbors and do a majority vote. Typically k is odd when the number of classes is 2. Let's say k = 5, and among the neighbors there are 3 instances of C1 and 2 instances of C2. In this case, KNN says the new point has to be labeled as C1, as it forms the majority. We follow a similar argument when there are multiple classes.

One straightforward extension is to not give an equal vote of 1 to every neighbor. A very common approach is weighted kNN, where each point has a weight that is typically calculated from its distance. For example, under inverse distance weighting, each point has a weight equal to the inverse of its distance to the point being classified. This means that nearby points get a higher vote than farther points.

It is quite obvious that the accuracy *might* increase when you increase k, but the computation cost also increases.
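The majority vote and the inverse-distance weighting described above can be combined in one short function. This is a sketch under my own naming; the small `eps` guards against a zero distance when the query coincides with a training point:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(x, points, labels, k=5, eps=1e-12):
    """kNN with inverse-distance weighting: each of the k nearest
    neighbors votes with weight 1/distance, so closer points count more.
    Plain majority voting is the special case of giving every vote weight 1."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - x, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    votes = defaultdict(float)
    for i in nearest:
        votes[labels[i]] += 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)             # class with the largest total

# The k = 5 example from the text: 3 nearby C1 points and 2 far C2 points.
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = ['C1', 'C1', 'C1', 'C2', 'C2', 'C2']
print(weighted_knn_classify(np.array([0.5, 0.5]), train, train_labels, k=5))  # 'C1'

# Weighting matters: one very close C2 point can outvote a far C1 majority.
pts2 = [[0, 0], [10, 10], [10, 11]]
lbl2 = ['C2', 'C1', 'C1']
print(weighted_knn_classify(np.array([0.1, 0.0]), pts2, lbl2, k=3))  # 'C2'
```

The second call shows exactly why one might prefer weighted voting: an unweighted k=3 vote would pick C1 (two votes to one), while inverse-distance weights let the much closer C2 point win.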

### Some Basic Observations

1. If we assume that the points are d-dimensional, then the straightforward implementation of finding the k nearest neighbors takes O(dn) time.
2. We can think of KNN in two ways – one way is that KNN tries to estimate the posterior probability of the point to be labeled (and applies Bayesian decision theory based on that posterior). An alternate way is that KNN computes the decision surface (either implicitly or explicitly) and then uses it to decide on the class of new points.
3. There are many possible ways to apply weights in KNN – one popular example is Shepard's method.
4. Even though the naive method takes O(dn) time, it is very hard to do better unless we make some other assumptions. There are efficient data structures like the KD-tree which can reduce the time complexity, but they do so at the cost of increased training time and complexity.
5. In KNN, k is usually chosen as an odd number if the number of classes is 2.
6. The choice of k is very critical – a small value of k means that noise will have a higher influence on the result, while a large value makes the computation expensive and somewhat defeats the basic philosophy behind KNN (that points that are near are likely to have similar densities or classes). A simple approach is to set $k=\sqrt{n}$.
7. There are some interesting data structures and algorithms when you apply KNN on graphs – see Euclidean minimum spanning tree and nearest neighbor graph.
8. There are also some nice techniques like condensing, search trees, and partial distances that try to reduce the time taken to find the k nearest neighbors. Duda et al. have a discussion of all these techniques.
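The O(dn) cost from observation 1 is easy to see in code: the naive search computes one d-dimensional distance per training point and then ranks them. A minimal sketch (names mine):

```python
import numpy as np

def k_nearest(x, points, k):
    """Naive k-nearest-neighbor search: one pass over all n points,
    each distance costing O(d), i.e. O(dn) overall (plus the sort
    used here to rank the distances)."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - x, axis=1)   # n distances, O(dn)
    order = np.argsort(dists)[:k]                # indices of the k closest
    return order, dists[order]

pts = [[0, 0], [1, 1], [2, 2], [10, 10]]
idx, d = k_nearest(np.array([0.2, 0.1]), pts, k=2)
print(idx)   # the two closest points, nearest first
```

Structures like the KD-tree mentioned in observation 4 avoid touching all n points per query, at the cost of building the tree up front.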

### Applications of KNN

KNN is a versatile algorithm and is used in a huge number of fields. Let us take a look at a few uncommon and non-trivial applications.

1. Nearest Neighbor based Content Retrieval
This is one of the fascinating applications of KNN – basically, we can use it in computer vision for many tasks. You can consider handwriting detection as a rudimentary nearest neighbor problem. The problem becomes more fascinating if the content is a video: given a video, find the video closest to the query in the database. Although this sounds abstract, it has lots of practical applications – e.g. consider ASL (American Sign Language), where communication is done using hand gestures.

So let's say we want to prepare a dictionary for ASL so that a user can query it by performing a gesture. The problem then reduces to finding the (possibly k) closest gesture(s) stored in the database and showing them to the user. At its heart, this is nothing but a KNN problem. One of the professors from my department, Vassilis Athitsos, does research on this interesting topic – see Nearest Neighbor Retrieval and Classification for more details.

2. Gene Expression
This is another cool area where, many a time, KNN performs better than other state-of-the-art techniques. In fact, a combination of KNN and SVM is one of the most popular techniques there. This is a huge topic on its own, so I will refrain from saying much more about it.

3. Protein-Protein interaction and 3D structure prediction
Graph-based KNN is used in protein interaction prediction. Similarly, KNN is used in 3D structure prediction.

### References

There are lots of excellent references for KNN. Probably the finest is the book "Pattern Classification" by Duda and Hart. Most of the other references are application specific. Computational geometry has some elegant algorithms for KNN range searching. Bioinformatics and proteomics also have lots of topical references.

After this post, I hope you have a better appreciation of KNN!
