## Posts Tagged ‘Data Mining’

1. Desktop Notifications: Now available to extensions in Chrome
Google is adding more APIs to Chrome at a rapid clip. This looks pretty neat and might be a good feature for Chrome Nanny.

2. How to Reboot Your Sleep Cycle and Get the Rest You Deserve
I guess this has some good ideas about bringing your sleep cycle back to normal. Some of the ideas seem promising!

3. How a Search Engine Might Identify Possible Query Suggestions
Has a good discussion of how search engines might improve the user’s search experience: showing possibly related queries, giving query suggestions that frame the intent better, etc. Many of the techniques seem quite straightforward. I am sure the current state of the art is more complex though!

4. Sam’s Club Personalizes Discounts for Buyers
Possibly another big piece of news in data mining. Even for the relatively small target population (Sam’s Club’s Plus members), the permutations were mind boggling. I hope this gives some experience that can be used to scale it further (applying it to Costco or even Walmart). In related news, Web Start-Ups Offer Bargains for Users’ Data.

5. In graphics: Supercomputing superpowers
A neat graphic about the Top 500 supercomputers. It’s a pity that India seems to be far behind the leaders. The graphic allows different filters. It was most satisfying to see Linux dominating as the OS for supercomputers.

If you liked this post , please subscribe to the RSS feed.

1. An Open Mind
This NYTimes article talks about open courseware like MIT OCW and asks the critical question – what next? One glaring issue with OCW is that even if you master the course contents you are not really “certified”. It also talks about other initiatives like CMU’s OLI (its statistics and causal reasoning courses are among my favorites) and Peer 2 Peer University. I was not aware of P2PU and it looks like a terrific concept. I am hoping to take part in its next offering. Let’s see how it goes.

2. Web Coupons Know Lots About You, and They Tell
Another interesting article that talks about Web coupons, the information in them, and profile matching. The Facebook scenario is the scariest, although I think the anonymity set in the other case should also be small. RevTrax may be the most important company we were never aware of 🙂

3. The History of Beauty
A post about a book on the beauty industry. It is a $30 billion industry without much analysis. The post has some interesting points.

4. Why Linux Is Not Attracting Young Developers
Has some very interesting insights.

5. XAuth: The Open Web Fires a Shot Against Facebook Connect
Looks like a promising concept. I think Facebook and Twitter will ultimately join this even while supporting their own protocols.


1. Choosing the number of clusters II: Diminishing Returns and the ROC curve
A nice intuition for the ROC curve and how to use it to choose the number of clusters.

2. A beautiful model (of the stock market)
Quite an interesting post which argues that, at the macro level, the stock market is quite well behaved and predictable 🙂

3. Square Dancing
An intuitive proof of the Pythagorean theorem.

4. Law Enforcement Appliance Subverts SSL
This attack is quite clever – a straightforward man-in-the-middle attack which allows anyone to observe all the transactions you make with any targeted website. It does require cooperation from both ISPs and CAs, but that should be easy for at least the feds. A more shocking vulnerability is this – Vulnerabilities Allow Attacker to Impersonate Any Website. Clever use of nulls to get a certificate for a fake domain from CAs.

5. Interesting Math Problems
A nice set of interesting math problems. Got the link from Sriram.

6. A single sperm has 37.5MB of DNA information in it. That means that a normal ejaculation represents a data transfer of 1,587.5TB
Mostly tongue in cheek geek humor.

1. Choosing the number of clusters I: The Elbow Method
A post with some ideas on choosing the number of clusters in a principled way.

2. 4 Chatboxes for wordpress.com blogs
An interesting way to embed chats in blogs. I have added a chat window to my blog, to chat with me just for the fun of it. Let’s see how it goes.

3. This seat is reserved  and More women, more power?
Two interesting posts on the Women’s Reservation Bill in India. It’s a shame that the bill was not passed.

4. Another nice post on complex numbers by Steven Strogatz
It was a nice refresher to learn again how to take roots (square, cube et al) of imaginary numbers.

5. reddit.com Interviews Peter Norvig
An interesting interview with Peter Norvig. He answered some interesting questions. For example, I liked his answer that linear classifiers have progressed beyond everyone’s expectations. Another good question was why Lisp is not much used at Google. For the list of questions see the reddit site. I found this interview through this site.

6. A collection of code competition sites
Good collection of code competition sites.

1. Fill in the Blanks: Using Math to Turn Lo-Res Datasets Into Hi-Res Samples
An interesting article from Wired about Compressive Sensing. For more details see Igor’s page. I don’t claim to understand CS fully, but I guess I know the major ideas. CS is an exciting field and what it does borders on the impossible – pure magic! The topic has been on my to-do list for a long time, and hopefully this article will be the trigger to learn it more thoroughly.

2. Cellphones Let Shoppers Point, Click and Purchase
An interesting use of Smartphones for shopping. I guess coupled with Predictive analysis, this will make shopping more compelling.

3. Math Vs Massive Data Overload
An interesting article about a new algorithm from IBM Research for doing data mining on large datasets. I also have to say that I really liked the article’s title. A Forbes article on the same topic is here. The basic idea is to avoid matrix factorizations, which are costly, and move towards stochastic estimation of the diagonal of the inverse covariance matrix. Clever idea indeed!
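I have not seen IBM’s actual code, but the general stochastic diagonal-estimation trick (due to Bekas, Kokiopoulou, and Saad) can be sketched in a few lines: probe the matrix with random ±1 vectors, and each probe only costs a linear solve rather than a full inversion. The function name and the toy matrix below are my own illustration, not IBM’s algorithm.

```python
import numpy as np

def stochastic_diag_inverse(A, n_probes=500, rng=None):
    """Estimate diag(A^-1) without explicitly forming A^-1.

    For Rademacher probes v (entries +/-1), E[v * (A^-1 v)] = diag(A^-1),
    so averaging v * solve(A, v) over many probes converges to the
    diagonal. Each probe costs one linear solve.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    num = np.zeros(n)
    den = np.zeros(n)
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        x = np.linalg.solve(A, v)            # x = A^-1 v
        num += v * x
        den += v * v
    return num / den

# Toy covariance-like matrix: compare estimate against the exact diagonal
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 5))
A = M.T @ M + np.eye(5)  # symmetric positive definite
est = stochastic_diag_inverse(A, n_probes=2000, rng=1)
exact = np.diag(np.linalg.inv(A))
```

For a genuinely large sparse matrix one would replace `np.linalg.solve` with an iterative solver, which is where the cost savings over factorization actually show up.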

4. The Game of Punctuality: Coincidence or Equilibrium? and Words as Networks in Spam Detection
Two interesting applications of Game Theory and Networks. I really like the idea of using word networks to combat spam. The idea is intriguing, but I am not sure how this technique will handle posts like mine where each paragraph talks about a different idea.

5. Fun Posts

Controversy is the new advertising from Krish Ashok.

Vinnai Thaandi Varuvaaya: Poetry – A review of VTV. I liked the movie even though I felt the second half was slightly long. The songs were a delight – especially Aaromale. The post has some good observations, like Gautam Menon’s uncanny ability for perfect dialogues and his directorial approach which makes even movies with clichéd topics seem less clichéd. Like the author, I also observed that the movie was less crowded. It was so obsessively focused on Simbhu and Trisha that other characters hardly had screen space or impact. This is in contrast to old Balachander movies where even minor characters had a well developed sketch. Not that I am complaining!

Guys are Guys – A cartoon from DogHouse.

Two great cartoons from Tom Toles – here and here. To fully enjoy them you might need to know the current happenings in US politics!

## Impressions on Berkeley Machine Learning Workshop

Machine Learning is one of my areas of interest and I am trying to learn as much as possible. It is a very demanding but also fun field.

Recently, I came across a huge list of video lectures on Machine Learning from the blog Onionesque Reality. More exactly, the post titled Demystifying Support Vector Machines for Beginners had a section “Webcasts/Video Lectures on Learning Theory, Support Vector Machines and related ideas” where links to a lot of lectures were provided. I sampled the lectures given there, and I have to admit most of them are really high quality stuff.

As a first step, I decided to finish Berkeley’s Machine Learning workshop. The website for the video lectures is here. It contains 11 lectures over two days. I find it incredible that they were able to finish this off in two days. If I had been there my brain would have exploded with all this information 🙂

The courses cover most of the important areas of Machine Learning. Best of all, two of the lectures – on Graphical Models (my area of research) and Non-Parametric Bayesian Models – were given by Professor Michael Jordan. Well, he is an excellent teacher and it is no wonder that many of his students are leaders in Machine Learning.

Overall , the courses were good – some excellent and some good.

The lectures were encoded in RealPlayer format, which surprised me. It is hard but not impossible to play rm files on Linux. Even though totem, mplayer, and vlc can all play them, navigation was very awkward and clumsy, and prone to frequent crashes. I am thinking of converting them to mp4 format but have not found the exact way to do it.

The lectures were streamed via the rtsp protocol. If you are using Linux and want to download the lectures, there is an easy way to do it. Just click on one of the lectures and find its rtsp URL. (Most of the time this is done by right clicking and choosing Copy URL.) You can then use mplayer to dump the file. The syntax is

```
mplayer -noframedrop -dumpfile Manifold_Learning_and_Visualization.rm -dumpstream rtsp://169.229.131.16:554/events/eecs/eecs_20070824_2.rm
```

The -dumpfile argument specifies the output file into which the lecture will be downloaded. The -dumpstream argument takes the rtsp URL as input.

Have fun with the lectures !

## Introduction to Bayesian Decision Theory – Part 1

In this series of articles, I intend to discuss Bayesian Decision Theory and its most important basic ideas. The articles are mostly based on the classic book "Pattern Classification" by Duda, Hart, and Stork. If you want the ideas in all their glory, go get the book!

As I was reading the book, I realized that at its heart this field is a set of mostly common sense ideas validated by rigorous mathematics. So I will try to discuss the basic ideas in plain English without much mathematical rigor. Since I am still learning how best to explain these complex ideas, any comments on how to improve are welcome!

You may ask what happened to PCA. Well, I still intend to write more on it but have not found enough time to sit and write it all. I have written a draft version of it but felt it was too technical and not very intuitive. So I am hoping to rewrite it.

### Background

You will need to know the basics of probability to understand the following – in particular, the ideas of prior, posterior, and likelihood. Of course, if you know all these, you will know the big idea of Bayes’ Theorem. I will try to explain them lightly, but if you have any doubts check out Wikipedia or some old books.

We will take the example used in Duda et al. There are two types of fish: sea bass and salmon. We catch a lot of fish of these two types, mixed together, and our aim is to separate them by automation. We have a conveyor belt on which the fish come one by one, and we need to decide if the current fish is a sea bass or a salmon. Of course, we want to make this as accurate as possible, but we also don’t want to spend a lot of money on the project. This is, at its heart, a classification problem. We will be given a few examples of both sea bass and salmon, and based on them we need to infer the general characteristics by which we can distinguish them.

### Basic Probability Ideas

Now let us be slightly more formal. We say that there are two "classes" of fish – sea bass and salmon. According to our system, there is no other type of fish. If we treat it as a state machine, then our system has two states. The book uses the notation $\omega_1$ and $\omega_2$ to represent them. We will use the names seabass and salmon as they are more intuitive.

The first basic idea is that of prior probability. This is represented as $P(seabass)$ and $P(salmon)$, which give the probability that the next fish on the conveyor is a seabass or a salmon. Of course, the two must sum to one. From the Bayesian perspective, this probability is usually obtained from prior (domain) knowledge. We will not talk about the Frequentist interpretation, as we will focus on Bayesian decision theory.

Let us assume that we use the length of the fish to differentiate them. Whenever a fish comes down the conveyor belt, we measure its length (how, we don’t really care here). So we have transformed the fish into a simple representation using a single number: its length. The length is a feature that we use to classify, and the step of converting the fish into a length is called feature extraction.

In a real life scenario, we will have multiple features and the input will be converted to a vector. For example, we may use length, lightness of skin, fin length, etc. as features; in this case, the fish will be transformed into a triplet. Converting the input to a feature vector makes further processing easier and more robust. We will usually use the letter $x$ to represent the feature. So you can consider $P(x)$ the probability of the evidence. For example, let’s say we got a fish (we don’t know what it is yet) of length 5 inches. Then $P(x)$ gives the probability that some fish (either seabass or salmon) has length 5 inches.
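To make the triplet idea concrete, here is a tiny sketch of a feature-extraction step; the field names and the numeric values are invented for illustration.

```python
import numpy as np

def extract_features(fish):
    """Turn a raw fish record into a fixed-length feature vector.

    The three features (length, skin lightness, fin length) follow the
    triplet example in the text; the concrete values are made up.
    """
    return np.array([fish["length"], fish["lightness"], fish["fin_length"]])

# One fish measured on the conveyor belt becomes a point in 3-D space
x = extract_features({"length": 5.0, "lightness": 0.4, "fin_length": 1.2})
```

From this point on, the classifier never sees the fish – only the vector `x`.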

The next idea is that of likelihood. It is also called the class conditional probability. It is represented as $P(x|seabass)$ or $P(x|salmon)$. The interpretation is simple: if the fish is a seabass, what is the probability that it has length $x$ inches (ditto salmon)? Alternatively, what is the probability that a 5 inch seabass exists? Or even: how "likely" is a 5 inch seabass?

The posterior probability is the other side of the story. This is represented by $P(seabass|x)$ or $P(salmon|x)$. Intuitively: given that we have a fish of length $x$ inches, what is the probability that it is a seabass (or salmon)? The interesting thing is that knowing the prior probability and the likelihood, we can calculate the posterior probability using the famous "Bayes Theorem". We can represent it in words as,

$posterior = \frac{likelihood \times prior}{evidence}$

This gives another rationale for the word "likelihood". All other things being equal, the item with the higher likelihood is more "likely" to be the final result. For example, if the likelihood of a 10 inch seabass is higher than that of a salmon, then when we observe an unknown fish of length 10 inches, it is most likely a seabass.

PS: There is an excellent (but long) tutorial on Bayes Theorem, "An Intuitive Explanation of Bayes’ Theorem". True to its title, it does try to explain the bizarre (at least initially) results of Bayes Theorem using multiple examples. I highly recommend reading it.

### Bayesian Decision Theory

Let us enter decision theory at last. At a very high level, you can consider decision theory a field which studies "decisions" (to classify as seabass or not to be) – more exactly, it considers these decisions in terms of cost or loss functions (more on that later). In essence, you can think of decision theory as providing a decision rule which tells us what action to take when we make a particular observation. Decision theory can be thought of as all about evaluating decision rules. (Of course, I am grossly simplifying things, but I think I have conveyed the essence.)

### Informal Discussion of Decision Theory for Two Class System with Single Feature

Let us take a look at the simplest application of decision theory to our problem. We have a two class system (seabass, salmon) and we are using a single feature (length) to make a decision. Be aware that length is not an ideal feature, because often you will have both seabass and salmon of the same length (say 5 inches). So when we come across a fish of length 5 inches, we are stuck. We don’t know what decision to take, as we know both seabass and salmon can be 5 inches long. Decision theory to the rescue!

Instead of presenting the theoretical ideas, I will discuss various scenarios and the best decision theoretic action in each. In all the scenarios, let us assume that we want to be as accurate as possible.

### Case I : We don’t know anything and we are not allowed to see the fish

This is the worst case to be in. We have no idea about seabass and salmon (a vegetarian, perhaps? 🙂 ). You are also not allowed to see the fish. But you are asked whether the next fish on the conveyor is a seabass or a salmon. All is not lost – the best thing to do is to randomize. So the decision rule is: with probability 50% say the next fish is a seabass, and with probability 50% say it is a salmon.

Convince yourself that this is the best thing to do – not only when seabass and salmon are in a 50:50 ratio, but even when they are at 90:10.
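A quick simulation helps with the "convince yourself" part: blind 50/50 guessing is 50% accurate no matter what the true ratio is, which is exactly what makes it the safe choice when you know nothing. The function name and the sample size are my own choices for the sketch.

```python
import random

def random_guess_accuracy(p_seabass, n=100_000, seed=0):
    """Accuracy of guessing seabass/salmon with probability 1/2 each,
    ignoring everything about the fish and the class ratio."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        actual = "seabass" if rng.random() < p_seabass else "salmon"
        guess = "seabass" if rng.random() < 0.5 else "salmon"
        correct += (guess == actual)
    return correct / n

# Accuracy hovers around 50% whether the split is 50:50 or 90:10
acc_even = random_guess_accuracy(0.5)
acc_skewed = random_guess_accuracy(0.9)
```

The algebra behind it: P(correct) = p·½ + (1−p)·½ = ½ for any class ratio p.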

### Case II : You know the prior probability but still you don’t see the fish

We are in a slightly better position here. We don’t get to see the fish yet, but we know the prior probability that the next fish is a seabass or a salmon, i.e., we are given $P(seabass)$ and $P(salmon)$. Remember, we want to be as accurate as possible, and we want our accuracy rate to be as reliable as possible.

A common mistake is to randomize again, i.e., with probability $P(seabass)$ say that the next fish is a seabass, and salmon otherwise. For example, let us say $P(seabass) = 0.7$ and $P(salmon) = 0.3$. Let me attempt an informal argument. In a (sample) worst case, the first 40 fish are seabass, the next 30 are salmon, and the last 30 are seabass. But you say the first 30 are salmon and the next 70 are seabass. In this hypothetical example you are only 40% accurate, even though you can theoretically do better.

What does decision theory say here? If $P(seabass) > P(salmon)$ then ALWAYS say seabass; else ALWAYS say salmon. In this case the accuracy rate is $\max(P(seabass),P(salmon))$. Conversely, the error rate is the minimum of the two prior probabilities. Convince yourself that this is the best you can do. It sure is counterintuitive to always say seabass when you know you will get salmon too, but we can easily prove that this is the best you can do "reliably".
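The randomize-with-the-prior strategy (often called probability matching) versus always-pick-the-majority can also be compared empirically, using the 0.7/0.3 numbers from the text. The helper function below is my own sketch.

```python
import random

def simulate(p_seabass, strategy, n=100_000, seed=0):
    """Accuracy of a guessing strategy over n fish.

    strategy(rng) -> True means 'guess seabass', False means 'guess salmon'.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        actual_seabass = rng.random() < p_seabass
        correct += (strategy(rng) == actual_seabass)
    return correct / n

p = 0.7
matching = simulate(p, lambda rng: rng.random() < p)  # randomize with the prior
always_max = simulate(p, lambda rng: True)            # always say seabass
```

Matching the prior lands near p² + (1−p)² = 0.58, while always saying seabass lands near p = 0.70, which is exactly the $\max$ accuracy the decision rule promises.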

Mathematically, the decision rule is: decide seabass if $P(seabass) > P(salmon)$, else decide salmon.

The error is $\min(P(seabass),P(salmon))$.

### Case III : You know the likelihood function and the length but not the prior probability

This case is really hypothetical, i.e., we can see the fish and hence find its length – say $x$ inches. We have $P(x|seabass)$ and $P(x|salmon)$, but we don’t know the prior probability. The decision rule here is: for each fish, find the corresponding likelihood values. If the likelihood of seabass is higher than that of salmon, say the fish is a seabass, and salmon otherwise.

Note that we are making a decision based on our "observation", in contrast to the previous cases. Unless you are really unlucky and the prior probabilities are really skewed, you can do well with this decision rule.

Mathematically, the decision rule is: decide seabass if $P(x|seabass) > P(x|salmon)$, else decide salmon.

### Case IV : You know the length, prior probability and the likelihood function

This is the scenario we are mostly in. We know the length of the fish (say 5 inches). We know the prior probabilities (say 60% salmon and 40% seabass). We also know the likelihoods (say P(5 inches|seabass) is 60% and P(5 inches|salmon) is 10%).

Now we can apply our favorite Bayes rule to get the posterior probability: if the length of the fish is 5 inches, what is the probability that it is a seabass? A salmon? Once you can calculate the posterior, the decision rule becomes simple: if the posterior probability of seabass is higher, say the fish is a seabass, else say it is a salmon.

Mathematically, the decision rule is: decide seabass if $P(seabass|x) > P(salmon|x)$, else decide salmon. This rule is very important and is called the Bayes Decision Rule.

For this decision rule, the error is $\min(P(seabass|x),P(salmon|x))$.

We can expand Bayes decision rule using Bayes theorem.

Decide seabass if  $p(x|seabass)P(seabass) > p(x|salmon)P(salmon)$   else decide salmon.
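Plugging in the numbers from this scenario makes the rule concrete: the seabass side is 0.6 × 0.4 = 0.24 and the salmon side is 0.1 × 0.6 = 0.06, so we decide seabass with posterior 0.24/0.30 = 0.8. A minimal sketch of the computation (the function name is mine):

```python
def bayes_decision(prior, likelihood):
    """Bayes decision rule: pick the class maximizing likelihood * prior.

    Returns the decided class and the full posterior distribution.
    """
    joint = {c: likelihood[c] * prior[c] for c in prior}   # P(x|c) * P(c)
    evidence = sum(joint.values())                         # P(x)
    posterior = {c: joint[c] / evidence for c in joint}
    decision = max(posterior, key=posterior.get)
    return decision, posterior

# Numbers from the text: 40% seabass prior, P(5in|seabass)=0.6, P(5in|salmon)=0.1
prior = {"seabass": 0.4, "salmon": 0.6}
likelihood = {"seabass": 0.6, "salmon": 0.1}
decision, posterior = bayes_decision(prior, likelihood)
```

Note that the evidence term $P(x)$ only normalizes the posterior; the decision itself depends purely on comparing the two likelihood-times-prior products.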

There are two special cases.
1. If the likelihoods are equal, then our decision depends on the prior probabilities.
2. If the prior probabilities are equal, then our decision depends on the likelihoods.

We have only scratched the surface of decision theory. In particular, we did not focus much on bounding the error, and we did not discuss the cases with multiple classes or features. Hopefully, I will discuss them in a future post.

### Reference

Pattern Classification by Duda, Hart, and Stork. Chapter 2.