I recently installed Ubuntu 12.10 on my system and found that the upgrade broke a lot of things. A reasonable number of the fixes were simple, but some were very annoying. In this post I want to discuss two specific errors I faced and how to fix them.

I use the student version of Matlab (32-bit) on a 64-bit machine. As part of the upgrade to Ubuntu 12.10, the installer removed multiple 32-bit files, which caused the issue. When I tried to run Matlab, I got the following errors:

~/matlab/bin/matlab: 1: ~/matlab/bin/util/oscheck.sh: /lib/libc.so.6: not found

~/matlab/bin/glnx86/MATLAB: error while loading shared libraries: libXpm.so.4: cannot open shared object file: No such file or directory

The first error is mostly harmless; I will describe how to fix it later in this post. Fixing the second error required installing the i386 (32-bit) versions of a few packages. Of course, fixing one missing library exposed the next, and so on. In the interest of time, I have put all the packages into a single command.

sudo apt-get install libxpm4:i386 libxmu6:i386 libxp6:i386

Running the above command worked and allowed Matlab to run. However, I then faced another issue: when I tried to save a plot, it failed with the following errors (fixing the first exposed the second):

MATLAB:dispatcher:loadLibrary Can’t load ‘~/matlab/bin/glnx86/libmwdastudio.so’: libXfixes.so.3: cannot open shared object file: No such file or directory.
??? Error while evaluating uipushtool ClickedCallback

MATLAB:dispatcher:loadLibrary Can’t load ‘~/matlab/bin/glnx86/libmwdastudio.so’: libGLU.so.1: cannot open shared object file: No such file or directory.
??? Error while evaluating uipushtool ClickedCallback


To fix this, run the following command :

sudo apt-get install libxfixes3:i386 libglu1-mesa:i386

Finally, to fix the innocuous error :

~matlab/bin/matlab: 1: /home/neo/gLingua/matlab/bin/util/oscheck.sh: /lib/libc.so.6: not found

do the following :

sudo ln -s /lib/x86_64-linux-gnu/libc-2.15.so /lib/libc.so.6

Of course, make sure the libc-2.xx.so version is the correct one before running this command.


Hope this post helped !

As part of my research, I collaborate and share datasets with fellow researchers. These datasets are usually huge and hence cannot be shared via email. Since I started using Dropbox, I have found it to be a convenient mechanism to share them.

The workflow I came up with was very simple. Create a folder inside my Dropbox public folder, put the dataset in it and share the link with my friends. Sounds simple enough. Although in practice I hit a snag.

I was surprised to find that I was not able to get a shareable link for folders within the Public folder. For any other folder in my Dropbox, I was able to right click and get a shareable link. Long story short, the way to enable it is very simple. You just need to visit this site :



Now you can right click on any folder (either using the client or in Web UI) to get the shareable link. An interesting side effect was that I was able to get the shareable link for any file/folder in my Dropbox folder. That’s right – I was able to share a folder that was not in my Public folder. This is a bit surprising but could be useful someday nevertheless.

I have been a happy Amazon Prime customer for the last couple of years. One of the biggest perks is the large number of videos available for instant watching. In fact, I watched almost all episodes of Star Trek (TOS to Voyager) this way.

Sometime in the second or third week of January, this method broke down. Whenever I tried to play an episode of Voyager, I got an error in the Flash player. It would open a dialog box saying ‘Updating Player’, which would soon error out saying “an error occurred and your player could not be updated”. If I retried, it would get stuck at ‘Updating Player’.

I was using Ubuntu 11.10 on a 64-bit machine. I tried a lot of things and nothing really worked. I installed and reinstalled the Adobe Flash plugin and other codecs and basically made a mess of my system. Finally, I found a simple solution in the Amazon Instant Video forum in an unrelated thread. The link is here. The solution is very simple: install the hal and libhal1 packages for your distro. If you are using Ubuntu, the command is

sudo apt-get install libhal1 hal


A few of my friends also had this issue, and installing these packages seems to fix it. Unfortunately, this useful tip seems buried under other noise, hence I decided to write a separate blog post. If this does not fix the issue, I recommend looking at Adobe’s “Problems playing protected video content on 64-bit Ubuntu Linux” page, which has some additional information on making Flash work.

Author Note: My blog is getting its first guest post! This post is written by my dear friend and fellow geek Kripa, who just graduated from Columbia University. He specializes in Machine Learning, NLP and Data Mining. Over the last few months, he has been doing some amazing exploratory work in Data Mining, and this blog post describes one such project: how he applied LDA to movie scripts to determine their mixture of genres. Aside from being a cool application of LDA, finding such a genre mixture has multiple applications. For example, it can provide a mechanism to improve movie recommendations. You can even use it to deconstruct the work of your favorite artist!

Every movie predominantly falls into a particular genre such as war, crime, adventure, thriller, comedy or romance. Sometimes a movie belongs to more than one genre. For example, a movie like Minority Report, though predominantly sci-fi, has very strong elements of crime in its storyline. Given a movie, I wanted to look at the approximate percentages of the genres contained in it.

If I throw a movie like Pearl Harbor at my system, I expect it to return an output like:

war: 35%
crime: 13%
comedy: 13%
horror: 9%
romance: 19%
sci-fi: 11%

I decided to consider only these six genres as others like Adventure, Action, Thriller, etc are mostly overlaps. For example, action is a broad category – war and crime could also be considered action; and a thriller could be a medley of action and crime. For my experiment I considered only those genres which I felt were distinct from each other.


The first question is what kind of data to go after. Movie plots are available from the IMDB dataset. They express the major genre well, but they don’t contain enough information to get the minor genres out. Wikipedia usually contains a detailed plot, but I found something better: the Internet Movie Script Database (imsdb henceforth). This is an online repository of movie scripts, which contain dialogues as well as contextual scene information.

For example consider this scene from The Patriot:

Marion leaves the workshop with Susan at his side. Nathan and Samuel walk past, exhausted from their day in the field.
NATHAN: Father, I saw a post rider at the house.
MARION: Thank you. Did you finish the upper field?
SAMUEL: We got it all cut and we bundled half of it.
MARION: Those swimming breaks cut into the day, don’t they?

This contains both the dialogues and the description of the scene. It struck me as the right kind of data I should be looking at. Also this had a good amount of scripts (~1000 movies) for running a quick experiment. I did a quick and dirty wget of all the webpages followed by a bunch of shell scripts to weed out html tags and expressions such as "CAMERA PANS OUT" and "SCENE SHIFTS TO" that add no information to the script. I also removed the stop words and proper nouns. After pre-processing I had just the text of dialogues and the scene description.

Now that I have data, how did I go about finding the composition of genres in a movie?

Supervised vs Unsupervised

Let us define the problem elaborately. I have the pre-processed imsdb content which is the data and its genre is its label. A simple classification task will involve extracting features from the labeled data and training it using a classifier to build a model. This model can then be used to predict the genre of a movie that is out of the training data. This is a supervised approach where we already have the labeled data – genre of a movie. For my task of detecting a mixture of genres, all I would need to do is to tweak the classifier output so that it gives the individual probabilities of each genre a movie can possibly belong to. But is this the right approach? Will the output probabilities be close to the actual representation of genres in a movie?

There are a couple of issues with this supervised approach.

  • To train a classifier, I need to assign a label to each script. For this problem I should really be assigning more than one label to an instance, which a simple single-label classifier does not allow. If during training I were to assign a single label (genre) to a whole movie, I would be making a strong (and wrong!) assumption that each movie script represents one genre.
  • Consider an example like Minority Report. Here imdb does not list ‘Crime’ as a category at all! We know very well that Minority Report is a ‘Sci-Fi Crime’ movie. The validity of the labels itself poses an issue here.

    My expectation from an unsupervised approach is simple. I have a huge collection of unlabeled scripts. I’m assuming that each script contains a mixture of genres.

    I now need to extract features (words) that are relevant to each genre from this corpus. In other words, given a corpus of scripts, the unsupervised system needs to discover a bunch of topics from the corpus and list the most relevant features (words) that belong to each topic. I would then be able to use these features to infer the topic mixture of each individual script. This is known as Topic Modelling. Intuitively, I’m trying to generate a bag of words model out of a corpus of scripts.

    Latent Dirichlet Allocation

    LDA is a topic model, and one of the most widely used today. I will not spend much effort on a rigorous treatment of LDA since there are excellent resources available online. I’m listing those that I found extremely useful for understanding LDA.

    1. Prof. David Blei gave a talk on LDA and its applications at NYU during the NYC Machine Learning Week sometime in February 2011. There is a nice video recording of it: Part 1 and Part 2
    2. You can find Prof.Blei’s implementation of LDA in C on his webpage.
    3. Edwin Chen’s blog article is one of the best sources for grasping the intuition behind LDA even if you are mathematically less inclined.

    A word of warning here! I will write a brief note about my understanding of LDA. Honestly, there are better ways of explaining LDA. At this point all you need to know is that LDA is a statistical black box: throw a bunch of documents at it, specify ‘k’ (the number of topics you think are represented by the documents), and LDA will output ‘k’ topic vectors. Each topic vector contains words arranged in decreasing probability of being associated with that particular topic. You need to know ONLY THIS! So feel free to skip to the next section.

    In the case of Maximum Likelihood given n data points, we assume the underlying distribution that generated this data is a Gaussian and fit the data to the best Gaussian possible. LDA also makes a similar assumption that there is a hidden structure to the data. And that hidden structure is a multinomial whose parameter \theta comes from a Dirichlet Prior.

    Let us say that I want to generate a random document; I don’t care if it’s meaningful or not. I first fix the number of words I want to generate in that document. Alternatively, I can draw this number at random, say from a Poisson. Once I have the number of words (N) to be generated, I go ahead and generate that many words from the corpus.

    Each word is generated thus: draw a \theta from a Dirichlet. (Dirichlet is a distribution over the simplex.) Consider \alpha as the parameter that decides the shape of the Dirichlet similar to how mean and variance decide the shape of the Gaussian bell curve. In a 3-D space, for some choice of \alpha, consider the probability to be more near (1,0,0), (0,1,0) and (0,0,1). For some other choice of \alpha all points in the 3-D simplex might get the same probability! This represents what kind of topic mixtures I can generally expect! (If my initial guess is that each document has only one topic, mostly I will choose an \alpha such that I get more probability on the (1,0,0) points. This is just a prior which could be wrong! And in this way it is not strictly analogous to Maximum Likelihood).

    So I have an \alpha now and I draw a sample from the Dirichlet. What I actually get is a vector that sums up to 1. I call this \theta .

    Remember that I’m trying to generate a random document and I haven’t generated a single word yet! The \theta I have is my guess at the topic vector. I have obtained this \theta by sampling from a k-dimensional Dirichlet (k=3 in the above example).

    Now \theta represents a topic vector which can also be re-imagined as a probability distribution. Because any draw is guaranteed to be from the simplex, I can use this drawn vector (\theta) as the weights of a loaded ‘k’ faced die. And I throw this die! Let us say it shows up 5 (a number between 1 and k). I will now say that the word I’m going to generate belongs to Topic-5.

    I have not yet generated a word! To generate a word that belongs to a topic, I need a |V| faced die. |V| is the size of the vocabulary of the corpus. How do I get such a huge die?!

    I will get that in a similar way as for the topic vector. I will sample again from a Dirichlet – but a different Dirichlet – one that is over a v-dimensional simplex. Any draw from this Dirichlet will give a v-faced die. Call this the Dirichlet \beta. For each topic (\theta) you need a different v-faced die (\beta). Thus I end up drawing ‘k’ such ‘v’-faced dice.

    So for topic-5, I throw the 5th v-faced die. Let us say it shows 42; I then go to the vocabulary and pick the 42nd word!  I will do this whole process ‘N’ times (N was the number of words to be generated for the random document.)

    The crux of this discussion is this: for every document, the dice (i.e. samples from dirichlet(\alpha) and dirichlet(\beta)) are generated only once. To generate each word, the dice are then rolled twice: once to get a topic and once to get a word given this topic!
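
The dice-rolling story above can be sketched in a few lines of Python. This is only an illustration of the generative process as described in this post, not Prof. Blei's code: the function names, the symmetric parameters `alpha` and `beta`, the toy vocabulary and the fixed word count (instead of a Poisson draw) are all assumptions made for the example. The Dirichlet draw uses the standard trick of normalising independent Gamma draws.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one point on the simplex by normalising independent Gamma draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocabulary, k, n_words, alpha=0.5, beta=0.1, seed=None):
    """Generate n_words words following the dice analogy (hypothetical helper)."""
    rng = random.Random(seed)
    # The loaded k-faced topic die, drawn once for the document.
    theta = sample_dirichlet([alpha] * k, rng)
    # One loaded |V|-faced word die per topic, also drawn once.
    word_dice = [sample_dirichlet([beta] * len(vocabulary), rng) for _ in range(k)]
    words = []
    for _ in range(n_words):
        # First roll: which topic does this word belong to?
        topic = rng.choices(range(k), weights=theta)[0]
        # Second roll: which word does that topic's die show?
        words.append(rng.choices(vocabulary, weights=word_dice[topic])[0])
    return theta, words

theta, words = generate_document(["war", "love", "laser", "ghost"], k=3, n_words=10, seed=1)
print(theta, words)
```

Small values of `alpha` push most of the weight of `theta` onto one topic, which corresponds to the prior belief that a document is mostly about a single topic.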

    Running LDA on Movie Scripts

    I converted the imsdb data to a format accepted by Prof. Blei’s code. For more information you should look at his README. I initially ran the code with k = 10 (guessing that the movie scripts could represent a mixture of 10 genres). I will jump from here to give a sneak peek at the end results. The end results have been very encouraging. A sample of words clustered under different topics is here:


    At this point, eyeballing the clusters, it is apparent that starting from the red cloud and going clockwise, the clusters represent romance, sci-fi, war, horror, comedy and crime respectively.

    Just running LDA on the data with k=10 did not turn out such relevant results. So what was that extra mile which had to be crossed?

    LDA does not always guarantee relevant features. Nor is every topic discovered meaningful. After the first run, by eyeballing the 10 topics, I was able to distinguish 3 topics very easily: war, romance and sci-fi. Playing around with different ‘k’ values did not yield more discernible topics. There are reasons for this:

  • ‘war’ and ‘sci-fi’ are distinctly different from other genres – they are pretty exclusive.
  • words belonging to ‘romance’ were getting repeated in other topics. This clearly indicated a skew in the data.
  • so if I remove the scripts of genres that have been already detected from the dataset, maybe LDA will detect more latent topics due to a change in the underlying data distribution caused by removal of data.

    I removed the scripts corresponding to war and sci-fi from the dataset. This was achieved by comparing each script in the dataset against the top features from these topics. Each script was scored based on the occurrence of features, and I removed the scripts that scored above a threshold. The new dataset D* contained scripts other than war and science fiction movies. Then I ran LDA on this new dataset D*. Now I found that the topics pertaining to crime and comedy were getting discovered, but their features were mixed up with romance. This feature set was improved by removing words which were common to most categories and were not exclusive to one particular category.
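
A minimal sketch of this filtering step is below. The post does not say exactly how the score was computed, so a raw count of feature occurrences is an assumption here, as are the function names, the tokenisation and the example feature words.

```python
import re

def feature_score(script_text, features):
    """Count how many tokens of the script belong to the topic's feature set."""
    tokens = re.findall(r"[a-z']+", script_text.lower())
    feature_set = set(features)
    return sum(1 for tok in tokens if tok in feature_set)

def remove_detected(scripts, features, threshold):
    """Keep only the scripts whose score does not exceed the threshold."""
    return {title: text for title, text in scripts.items()
            if feature_score(text, features) <= threshold}

# Hypothetical toy data: two tiny "scripts" and a made-up war feature list.
scripts = {
    "war movie": "The soldiers marched to battle. The battle raged.",
    "rom com": "She smiled and laughed.",
}
war_features = ["soldiers", "battle", "army"]
print(remove_detected(scripts, war_features, threshold=2))
```

In practice the score would probably need to be normalised by script length so that long scripts are not removed just for being long.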

    This paper deals with using human judgments to examine topics. This is one way of picking the odd feature out of a topic vector.  The final feature set looked something like this.


    I wrote a simple python script that does soft-clustering based on the occurrence and count of features. A sample of the results is here:


    The results were quite convincing to look at, but I wanted to obtain some hard metrics. IMSDB already categorises movies based on their genre, so I consider that category to be the actual label. A movie is considered to be rightly classified if the label matches one of the top two genres detected by my system. The following are some encouraging results!

    These proportions cannot be considered accurate, but they do give an idea of what to expect from a movie! The complete list of results is here .
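
For readers who want to try something similar, here is a rough sketch of what such soft-clustering and the top-two check could look like. It is not the original script: the function names, the count-and-normalise scoring and the toy feature lists are assumptions made for illustration.

```python
import re
from collections import Counter

def genre_mixture(script_text, genre_features):
    """Return {genre: share}; shares sum to 1, or are all 0 if nothing matched."""
    counts = Counter(re.findall(r"[a-z']+", script_text.lower()))
    raw = {genre: sum(counts[w] for w in words)
           for genre, words in genre_features.items()}
    total = sum(raw.values())
    if total == 0:
        return {genre: 0.0 for genre in raw}
    return {genre: score / total for genre, score in raw.items()}

def top_two_match(mixture, label):
    """True if the labelled genre is among the two highest-scoring genres."""
    top_two = sorted(mixture, key=mixture.get, reverse=True)[:2]
    return label in top_two

# Hypothetical feature lists, far smaller than real LDA topic vectors.
features = {"war": ["battle", "soldier"], "romance": ["love", "kiss"], "crime": ["police"]}
mixture = genre_mixture("battle battle soldier love police police police", features)
print(mixture, top_two_match(mixture, "war"))
```

The fraction of movies for which `top_two_match` returns True is then the accuracy figure described above.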

    I wrote a Chrome extension called Chrome Nanny. Some time back, Google took it down as it was against their branding guidelines … So I have created another extension with the same code but with a new name: Nanny for Google Chrome. You can find the extension at https://chrome.google.com/webstore/detail/cljcgchbnolheggdgaeclffeagnnmhno .


    All new development will take place in this extension. I am still thinking about how to create a migration plan for Chrome Nanny users, and I will post an update when I have more information. I will also update older blog posts to point to the new url. Please spread the word!

    As many of you noticed, this blog has been very silent for the past 3 months. This was partly due to one of my research papers and a few other minor reasons. Now that I know the drill, hopefully future papers will not result in full silence on the blog 🙂 Let us see how it goes….

    I learned a lot of very fascinating things in data mining, machine learning etc. Expect a few good blog posts. I will try to add a page discussing the next few topics to be blogged. If you want me to discuss any topic, feel free to post it in the comments!

    During the past 3 months, my blog readership has almost doubled .. Hi to all my new readers ! Hope you enjoy my new posts too !


    This blog post is based on the lecture notes I prepared for one of the courses in which I am the teaching assistant. Hopefully, these notes will be useful for anyone interested in either Approximate Query Processing (AQP) or basic sampling theory as applied in databases.

    Approximate Query Processing

    Consider a database D which contains billions of tuples. Such a large database is not uncommon in data warehousing. When we want to get some answers from the database, we construct a SQL query to get the results. Due to the large size of the database, any query will take quite a bit of time. This holds even with techniques like indexing, which can speed up processing but do not really reduce the time asymptotically.

    One observation is that the queries posed to the database return very precise and accurate answers, probably after looking at each and every tuple. For a lot of OLAP use cases we may not need such a precise answer. Consider some sample queries: What is the ratio of males to females at UTA? What percentage of people in the US live in Texas? What is the average salary of all employees in the company? And so on.

    Notice that for such queries we really do not need answers that are correct to the last decimal point. Also notice that each of those queries is an aggregate over some column. Approximate query processing is a viable technique in these cases: a slightly less accurate result that is computed almost instantly is preferable, because most analysts are performing exploratory operations on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most of these use cases.

    There are multiple techniques to perform approximate query processing. The two most popular involve histograms and sampling. Histograms store some statistic about the database and then use it to answer queries approximately. Sampling creates a very small subset of the database and then uses this smaller database to answer the query. In this course we will focus on AQP techniques that use sampling. Sampling is a very well studied problem backed up by a rich theory that can guide us in selecting the subset so that we can provide reasonably accurate results and also provide statistical error bounds.

    Introduction to Sampling

    The idea behind sampling is very simple. We want to estimate some specific characteristic of a population. For eg this might be the fraction of people who support some presidential candidate or the fraction of people who work in a specific field or fraction of people infected with a disease and so on. The naive strategy is to evaluate the entire population. Most of the time , this is infeasible due to constraints on time, cost or other factors.

    An alternate approach that is usually used is to pick a subset of people. This subset is called a sample. The size of the sample is usually an order of magnitude smaller than the population. We then use the sample to perform our desired evaluation. Once we get some result, we can use it to estimate the characteristic for the entire population. Sampling theory helps, among other things, to decide how to select the subset, what the size of the sample should be, how to extrapolate the result from the sample to the original population, how to estimate the confidence interval of our prediction, etc.

    The process of randomly picking a subset of the population and using it to perform some estimation may appear very strange. At first glance it might look as if we would get wildly inaccurate results. We will later see how to give statistical guarantees for our prediction. Sampling is a very powerful and popular technique. More and more problems in the real world are being solved using sampling. Lots of recent problems in data mining and machine learning essentially use sampling and randomization to approximately solve very complex problems that are not practically solvable otherwise. It is for this reason that most DBMSs provide sampling as a fundamental operator. (E.g. Oracle provides a SAMPLE operator in select and also the dbms_stats package, SQL Server provides a TABLESAMPLE operator, and so on.)

    We represent the population with P and the sample with S. N represents the size of population and n represents the size of the sample. We will use these letters to denote statistics on the population and sample. For eg, \mu_P represents the mean of the population and \mu_S represents the mean of the sample. Similarly, \sigma_P and \sigma_S represent the standard deviation of the population and sample respectively.

    Types of Sampling

    There are different ways to perform sampling. The ones that we will use most in AQP are :

    Sampling With/Without Replacement

    The aim of sampling is to get a subset (sample) from a larger population. There are two ways to go about it. In the first approach, we randomly pick some entity, perform measurements if any and add it to the sample. We then put the entity back into the original population and repeat the experiment. This means that the same entity can appear in the sample multiple times. This approach is called sampling with replacement. It is the simplest approach to sampling, since there is no additional overhead to check whether an entity is already in the sample. Typically, sampling with replacement is modeled with a binomial distribution.

    In the second approach, we explicitly make sure that an entity does not appear in the sample more than once. So we randomly pick an entity from the population, verify that it is not already in the sample, perform the measurement and so on. Alternatively, we can remove entities that were added to the sample from the population. This approach is called sampling without replacement and is usually modeled with a hypergeometric distribution.
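
As a small illustration of the difference, the sketch below draws a sample both ways using Python's standard `random` module. The function names are made up for this example; `random.sample` does the "not already in the sample" bookkeeping internally.

```python
import random

def sample_with_replacement(population, n, rng=random):
    """Each pick is independent, so the same entity may show up several times."""
    return [rng.choice(population) for _ in range(n)]

def sample_without_replacement(population, n, rng=random):
    """No entity is picked twice; requires n <= len(population)."""
    return rng.sample(population, n)

people = ["ann", "bob", "cid", "dee", "eve"]
print(sample_with_replacement(people, 8))     # n may exceed the population size
print(sample_without_replacement(people, 3))  # three distinct people
```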

    Bernoulli Sampling

    Bernoulli Trial : A Bernoulli Trial is an experiment which has exactly two outcomes – Success or failure. Each of these outcomes has an associated probability that completely determines the trial. For eg consider a coin which produces head with probability 0.6 and tail with probability 0.4 . This constitutes a bernoulli trial as there are exactly two outcomes.

    In Bernoulli Sampling, we perform a Bernoulli trial on each tuple in the database. If the trial is a success, the tuple is added to the sample; otherwise it is not. The most important thing to notice is that all the tuples have exactly the same probability of getting into the sample. In other words, the success probability of the Bernoulli trial remains the same for each tuple. Pseudo code for Bernoulli Sampling is given below :

    success probability SP = \frac{n}{N}
    for i = 1 to N
        Perform a Bernoulli trial with success probability SP
        If outcome is success add i-th tuple to sample

    It is important to note that Bernoulli sampling falls under sampling without replacement. The size of the sample follows a binomial distribution B(N,\frac{n}{N}), i.e. it can vary between 0 and N, with the expected sample size being n.
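
The pseudo code above translates almost line by line into Python. This is a sketch that assumes the table fits in a list; the function name is made up for the example.

```python
import random

def bernoulli_sample(tuples, n, rng=random):
    """Keep each tuple independently with success probability n / N."""
    tuples = list(tuples)
    success_probability = n / len(tuples)
    # rng.random() is uniform in [0, 1), so the test succeeds with probability n / N.
    return [t for t in tuples if rng.random() < success_probability]

rows = list(range(1000))
sample = bernoulli_sample(rows, 100)
print(len(sample))  # close to 100 on average, but it varies from run to run
```

Running it a few times shows the point made above: the sample size is itself random and only equals n in expectation.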

    Uniform Random Sampling

    In Uniform Random Sampling, we pick each tuple in the database with a constant probability. This means that the probability of any tuple entering the sample is constant. Typically, this is implemented as sampling with replacement. Pseudo code for Uniform Random Sampling is given below :

    1. Generate a set of n random numbers S between 1 and N.
    2. Select tuples with index in S and add it to sample.

    Note that in this case we have exactly n tuples in the sample. We can also notice that the same tuple might appear multiple times. The number of times a given tuple appears in the sample follows a binomial distribution B(n,\frac{1}{N}), since each of the n draws picks that tuple with probability \frac{1}{N}.

    Weighted Random Sampling

    In Weighted Random Sampling, we perform a Bernoulli trial on each tuple in the database. The difference from uniform random sampling is that the success probability varies from one Bernoulli trial to the next. In other words, each tuple can have a different probability of getting into the sample.
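
A minimal sketch of this idea follows. How the per-tuple probability is chosen depends on the application, so here it is simply passed in as a function; that function, the tuple layout and the names are assumptions made for the example.

```python
import random

def weighted_bernoulli_sample(tuples, inclusion_probability, rng=random):
    """Run one Bernoulli trial per tuple with a tuple-specific success probability."""
    return [t for t in tuples if rng.random() < inclusion_probability(t)]

# Hypothetical (name, salary) tuples: favour high salaries, which dominate a SUM.
employees = [("ann", 40_000), ("bob", 45_000), ("cid", 400_000)]
sample = weighted_bernoulli_sample(employees, lambda t: min(1.0, t[1] / 400_000))
print(sample)  # "cid" is always included; the others only about 10% of the time
```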

    Stratified Sampling

    If the population can be subdivided into sub populations that are distinct and non overlapping, stratified sampling can be used. In stratified sampling, we split the population into a bunch of strata and then perform sampling over each stratum independently. For eg a political poll can be stratified on gender, race, state etc.

    There are multiple advantages to using stratified sampling. For one, it allows us to use a different sampling technique for each stratum. If there is some specific stratum that might be under represented in a general random sample, we can easily give additional weight to the samples taken from it. It is also possible to vary the number of samples per stratum to minimize the error. For eg, we can take fewer samples from a stratum with low variance and reserve them for strata with high variance.

    Stratified sampling is not a panacea, because a good stratification strategy may be hard to find. In a lot of populations, the best feature to stratify on is not obvious. Even worse, the population may not contain subgroups that are homogeneous and non overlapping.

    Given a population , sample size and strata there are many ways to allocate the sample across different strata. The two commonly used strategies are :

    1. Proportional Allocation : In this approach the contribution of each stratum to the sample is proportional to the size of the stratum. So a stratum that accounts for 10\% of the population will get 10\% of the sample. Note that this approach does not use any information about a sub population other than its size.
    2. Variance based Allocation : This strategy allocates samples to strata in proportion to their variance. So a stratum with high variance will have higher representation than one with smaller variance. This is logical, as we need only a few samples to accurately estimate the parameters of a sub population when its variance is low. Any additional samples neither add much information nor reduce the final estimation error dramatically.
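
The two allocation rules can be sketched as follows. The function names and the rounding scheme (largest remainder, so that the counts always add up to n) are choices made for this example. The variance rule is implemented exactly as stated above; the textbook Neyman allocation weights each stratum by its size times its standard deviation instead.

```python
def _largest_remainder(shares, n):
    """Turn fractional shares (summing to 1) into integer counts summing to n."""
    exact = {k: s * n for k, s in shares.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts

def proportional_allocation(stratum_sizes, n):
    """Each stratum's share of the sample equals its share of the population."""
    total = sum(stratum_sizes.values())
    return _largest_remainder({k: size / total for k, size in stratum_sizes.items()}, n)

def variance_allocation(stratum_variances, n):
    """Each stratum's share of the sample is proportional to its variance."""
    total = sum(stratum_variances.values())
    return _largest_remainder({k: var / total for k, var in stratum_variances.items()}, n)

print(proportional_allocation({"TX": 500, "CA": 300, "NY": 200}, 100))
print(variance_allocation({"low spread": 1.0, "high spread": 9.0}, 50))
```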

    Reservoir Sampling

    Reservoir Sampling is an algorithm that is widely used to take n random samples from a large database of size N. The real utility of reservoir sampling is realized when N is a large number or is not really known at sampling time. This scenario is quite common when the input is streaming data or when the database is frequently updated. Running a simple random sampling algorithm (say Bernoulli sampling) is inefficient here, as N is large and old tuples may be purged (or go out of the sliding window). Reservoir sampling allows you to get the random sample in a single linear pass, inspecting each tuple at most once.

    Reservoir Sampling with Single Sample

    Consider the following contrived example. We have a database which is constantly updated and we want to have a single random tuple from it.

    The base case occurs when there is only one tuple in the database. In this case our random sample is the first tuple. Hence the sample  S = t1.

    Let us say the database is updated and a new tuple t2 is added. The naive approach is to restart the entire sampling process. Instead, in reservoir sampling, we accept the new tuple as the random sample with probability \frac{1}{2}, i.e. we toss a coin which returns head with probability 0.5 and, if it returns head, replace t1 with t2.

    We can see why S is a uniform sample. The probability that S contains t1 or t2 remains the same.

    1. Pr(S=t1) = 1 * \frac{1}{2} = \frac{1}{2} . The random sample is t1 when it was selected first into S (with probability 1) and then not rejected by t2 with probability 1-\frac{1}{2}.
    2. Pr(S=t2) = \frac{1}{2} . The random sample is t2 when it replaces  t1 in the second step. This occurs with probability \frac{1}{2}

    The database is updated again; let us assume a new tuple t3 is added. We accept the new tuple as the random sample with probability \frac{1}{3}, i.e. we toss a coin which returns head with probability \frac{1}{3} and, if it returns head, replace the previous value of S (t1 or t2) with t3. More generally, when inspecting the i-th tuple, accept it with probability \frac{1}{i}.

    It might look as if we are treating t3 unfairly because we only accept it with probability 0.33. But we can show probabilistically that the sample is still uniform. The probability that S is t1 or t2 or t3 remains the same. 

    1. Pr(S=t1) = 1 \times \frac{1}{2} \times \frac{2}{3}=\frac{1}{3} . The only scenario in which the random sample is still t1 is when it was selected first into S (with probability 1), not replaced by t2 (with probability 1-\frac{1}{2}) and not replaced by t3 (with probability 1-\frac{1}{3} = \frac{2}{3}).
    2. Pr(S=t2) = \frac{1}{2} \times \frac{2}{3} = \frac{1}{3} . The random sample is t2 when it replaces t1 in the second step. This occurs with probability \frac{1}{2}. Then in the next step it must not be replaced by t3. This occurs with probability 1-\frac{1}{3} = \frac{2}{3}.
    3. Pr(S=t3) = \frac{1}{3} . The random sample is t3 when S contains either t1 or t2 and it is replaced by t3. This occurs with probability \frac{1}{3}.
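
    The same argument extends to any number of tuples. As a short sketch (N is the number of tuples seen so far and j is the position of a tuple, both symbols introduced here for illustration): t_j must be accepted when it arrives and then survive every later tuple, and the resulting product telescopes.

```latex
\Pr(S = t_j) = \frac{1}{j} \prod_{k=j+1}^{N} \left(1 - \frac{1}{k}\right)
             = \frac{1}{j} \prod_{k=j+1}^{N} \frac{k-1}{k}
             = \frac{1}{j} \cdot \frac{j}{N}
             = \frac{1}{N}
```

    So every tuple ends up as the sample with the same probability \frac{1}{N}, regardless of when it arrived.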

    The pseudo code looks like :

    S = t1
    for i = 2 to N
        Replace S with tuple t_i with probability \frac{1}{i}
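
    The pseudo code above could be written in Python roughly as follows. This is only a minimal sketch: the function name reservoir_sample_one and the optional rng argument are my own choices for illustration, and the loop at the end is just an empirical sanity check on three tuples.

```python
import random

def reservoir_sample_one(stream, rng=random):
    """Return one item chosen uniformly at random from an iterable of unknown length."""
    sample = None
    for i, item in enumerate(stream, start=1):
        # Accept the i-th tuple with probability 1/i.
        # For i = 1 this is always true, so the first tuple is always taken.
        if rng.random() < 1.0 / i:
            sample = item
    return sample

# Empirical check: each of t1, t2, t3 should be chosen about one third of the time.
rng = random.Random(0)
counts = {"t1": 0, "t2": 0, "t3": 0}
for _ in range(30000):
    counts[reservoir_sample_one(["t1", "t2", "t3"], rng)] += 1
print(counts)
```

    Note that the function never needs to know N in advance, which is exactly why this approach suits a database that keeps growing.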

    Reservoir Sampling with Multiple Samples

    A very similar approach works when the sample size is more than 1. Let's say that we need a sample of size n. Then we initially set the first n tuples of the database as the sample. The next step is a bit different. In the previous case, there was only one sample so we replaced the sample with the selected tuple. When the sample size is more than 1, this step splits into two parts :

    1. Acceptance : For any new tuple t_i, we need to decide if this tuple enters the sample. This occurs with probability \frac{n}{i}.
    2. Replacement : Once we decided to accept the new tuple into the sample, some tuple already in the sample needs to make way. This is usually done randomly. We randomly pick a tuple in the sample and replace it with tuple t_i.

    The pseudo code looks like :

    Store first n elements into S
    for i = n+1 to N
        Accept tuple t_i with probability \frac{n}{i}
        If accepted, replace a random tuple in S with tuple t_i

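    A coding trick avoids the "coin tossing": generate a uniformly random integer index between 1 and i and accept the tuple if the index is at most our sample size n. This happens with probability \frac{n}{i}, and the index itself tells us which tuple to replace. The pseudo code looks like :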

    Store first n elements into S
    for i = n+1 to N
        randIndex = uniformly random integer between 1 and i (inclusive)
        if randIndex <= n
            replace tuple at index "randIndex" in the sample with tuple t_i
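
    Here is a minimal Python sketch of the random index trick above. The function name reservoir_sample and the rng argument are assumptions of mine, not part of any library. Python lists are 0-based, so the 1-based randIndex from the pseudo code is shifted by one when writing into the sample.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Return a uniform random sample of up to n items from an iterable."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= n:
            # Store the first n elements directly.
            sample.append(item)
        else:
            # randint is inclusive on both ends, so this is uniform on 1..i.
            rand_index = rng.randint(1, i)
            if rand_index <= n:
                # Accepted with probability n/i; the index also picks the victim.
                sample[rand_index - 1] = item
    return sample

rng = random.Random(1)
print(reservoir_sample(range(1, 101), 5, rng))
```

    If the stream has fewer than n elements, this sketch simply returns all of them.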

    A similar analysis shows that classical reservoir sampling does provide a uniform random sample. Please refer to the paper Random Sampling with a Reservoir by Jeffrey Vitter for additional details.

    Sampling in AQP

    As discussed above, our main aim is to discuss how sampling techniques are used in AQP. Let us assume that we have a sample S of size n. The usual strategy that we will follow is to apply any aggregate query on the sample S instead of the database D. We then use the result of the query from S to estimate the result for D.

    One thing to note is that we will only use aggregate queries for approximate processing. Specifically we will look at COUNT, SUM and AVERAGE. The formulas for estimating the values of the aggregate query for the entire database from the sample for these 3 operators are well studied. For additional details refer to the paper "Random Sampling from Databases" by Frank Olken.

    Uniform Random Sample

    1. Count : \frac{\sum_{i=1}^{n} T_i \frac{1}{p_i}}{n} = \frac{\sum_{i=1}^{n} T_i N}{n} where T_i is an indicator random variable that is 1 when tuple t_i satisfies our clause. p_i is the probability that the tuple is selected in a single draw, which is \frac{1}{N} for a uniform sample. Intuitively, the formula finds the fraction of tuples in the sample which satisfy the query and applies the same fraction to the entire database.
    2. Sum : \frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n} = \frac{\sum_{i=1}^{n} x_i N}{n}
    3. Average : \frac{Sum}{Count}
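
    To make the scaling concrete, here is a small Python sketch that estimates COUNT, SUM and AVERAGE from a uniform sample by multiplying the sample totals by \frac{N}{n}. The table, the predicate and the function name estimate_from_uniform_sample are all made up for illustration. I use random.sample (which draws without replacement) to obtain the uniform sample; the estimators are the same.

```python
import random

def estimate_from_uniform_sample(sample, population_size, predicate, value):
    """Estimate COUNT, SUM and AVERAGE over the tuples that satisfy predicate."""
    n = len(sample)
    scale = population_size / n  # each sampled tuple stands for N/n tuples
    matching = [t for t in sample if predicate(t)]
    est_count = scale * len(matching)
    est_sum = scale * sum(value(t) for t in matching)
    est_avg = est_sum / est_count if matching else float("nan")
    return est_count, est_sum, est_avg

# Hypothetical table with a single numeric column holding values 0..99.
table = [i % 100 for i in range(100000)]
rng = random.Random(7)
sample = rng.sample(table, 5000)

est_count, est_sum, est_avg = estimate_from_uniform_sample(
    sample, len(table), predicate=lambda v: v < 50, value=lambda v: v
)
true_count = sum(1 for v in table if v < 50)
true_sum = sum(v for v in table if v < 50)
print("count:", est_count, "true:", true_count)
print("sum:  ", est_sum, "true:", true_sum)
print("avg:  ", est_avg, "true:", true_sum / true_count)
```

    Notice that the factor \frac{N}{n} cancels in the average, so AVERAGE can be estimated without knowing the database size.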

    Weighted Random Sample

    1. Count : \frac{\sum_{i=1}^{n} T_i \frac{1}{p_i}}{n} where T_i is an indicator random variable that is 1 when tuple t_i satisfies our clause. p_i is the probability that the tuple is selected in a single draw. Intuitively, the formula reweighs each tuple by the inverse of its selection probability.
    2. Sum : \frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n}
    3. Average : \frac{Sum}{Count}
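
    The weighted case can be sketched the same way. The assumptions here are mine: tuples are drawn with replacement, p_i is the probability of picking tuple i in a single draw, and the selection probability is proportional to the value (a hypothetical choice for the demo). Each sampled tuple is weighted by \frac{1}{p_i}.

```python
import random

def weighted_estimates(draws, predicate, value):
    """Estimate COUNT, SUM and AVERAGE from draws made with replacement.

    draws is a list of (tuple, p) pairs, where p is the probability of
    selecting that tuple in a single draw.
    """
    n = len(draws)
    est_count = sum(1.0 / p for t, p in draws if predicate(t)) / n
    est_sum = sum(value(t) / p for t, p in draws if predicate(t)) / n
    est_avg = est_sum / est_count if est_count else float("nan")
    return est_count, est_sum, est_avg

# Hypothetical table: values 1..1000, selection probability proportional to value.
values = list(range(1, 1001))
total = sum(values)
probs = [v / total for v in values]

rng = random.Random(11)
picked = rng.choices(range(len(values)), weights=probs, k=20000)
draws = [(values[i], probs[i]) for i in picked]

est_count, est_sum, est_avg = weighted_estimates(
    draws, predicate=lambda v: v > 500, value=lambda v: v
)
true_count = sum(1 for v in values if v > 500)
true_sum = sum(v for v in values if v > 500)
print("count:", est_count, "true:", true_count)
print("sum:  ", est_sum, "true:", true_sum)
print("avg:  ", est_avg, "true:", true_sum / true_count)
```

    Large values are sampled more often here, but dividing by p_i cancels that bias out.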

    Probability/Statistics Refresher

    Here are a few commonly used equations related to expectation and variance.

    1. E[X] = \sum_{i=1}^{n} x_i p_i
    2. E[aX+b] = aE[X] + b
    3. E[X+Y] = E[X] + E[Y] (also called Linearity of Expectations)
    4. Var[X] = E[ (X - E[X] ) ^2]
    5. Var[X+a] = Var[X]
    6. Var[aX] = a^2 Var[X]
    7. Var[X+Y] = Var[X] + Var[Y] if X and Y are independent.

    Law of Large Numbers : This law is one of the fundamental laws in probability. Let X_1,X_2 \ldots , X_n be i.i.d. (independent and identically distributed) random variables. Very informally, as n increases, the average of the variables approaches the expected value of the distribution from which the variables are drawn. For example, if we have a coin which returns heads with probability 0.5 and toss it 10000 times, the fraction of heads will be very close to 0.5, i.e. we will see about 5000 heads.

    Binomial Distribution: Suppose you repeat a Bernoulli trial with success probability p a total of n times. The distribution of the number of successes in the n trials is given by the binomial distribution B(n,p). This is a very important distribution for modeling sampling with replacement. For example, if we perform Bernoulli sampling, the final size of the sample follows a binomial distribution. Also, if we randomly pick n tuples with replacement from a database of size N, the number of times any given tuple is repeated in the sample can also be modeled by a binomial distribution. The expected value is given by E[X]=np and the variance is given by np(1-p).
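
    As a quick illustration of the Bernoulli sampling remark, the following sketch includes each tuple independently with probability p and looks at how the sample size varies across repetitions. The helper name bernoulli_sample and the numbers used are arbitrary choices of mine. With N tuples, the size should have mean Np and variance Np(1-p).

```python
import random

def bernoulli_sample(population, p, rng=random):
    """Include each tuple independently with probability p."""
    return [t for t in population if rng.random() < p]

rng = random.Random(1)
N, p, trials = 1000, 0.1, 2000

# Repeat the sampling many times and record only the sample size.
sizes = [len(bernoulli_sample(range(N), p, rng)) for _ in range(trials)]
mean_size = sum(sizes) / trials
var_size = sum((s - mean_size) ** 2 for s in sizes) / trials

print("mean size:", mean_size, "expected:", N * p)
print("variance: ", var_size, "expected:", N * p * (1 - p))
```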

    Normal Distribution : The normal distribution, aka Gaussian distribution, is one of the most important probability distributions. It is usually represented with parameters N(\mu,\sigma^2). It has the familiar bell curve shape. \mu determines the center of the normal curve and \sigma^2 determines the width of it. A smaller value of \sigma^2 results in a tighter curve while a larger value results in a flatter and wider curve.

    Equations (3) and (7) above tell us how the expected value and variance of the sum of two random variables can be computed from the expected value and variance of the constituent random variables. By induction, the rules extend to the sum of an arbitrarily large number of random variables. This introduces us to the concept of the Central Limit Theorem.

    Central Limit Theorem : This is one of the most fundamental results in probability. Assume that we have n independent and identically distributed random variables X_1,X_2 \ldots , X_n, each with mean \mu and variance \sigma^2. For large n, the distribution of the sum X_1 + X_2 + \ldots + X_n is approximately normal with parameters N(n \mu, n \sigma^2). Similarly, the distribution of the average \frac{X_1 + X_2 + \ldots + X_n}{n} is approximately normal with parameters N(\mu, \frac{\sigma^2}{n}).

    The Law of Large Numbers and the Central Limit Theorem are two important results that allow us to analyze the quality of our estimates. The Law of Large Numbers says that if we pick a large enough sample then the average of the sample will be close to the true average of the population. The Central Limit Theorem states that if you repeat the sampling experiment multiple times and plot the distribution of the average value of the samples, they approximately follow a normal distribution. Jointly, they allow you to derive the expression for the standard error.
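
    Both results are easy to see in a small simulation. The sketch below uses a deliberately skewed, made-up population (exponential values) and repeats the sampling experiment many times; the sizes n = 50 and 5000 repetitions are arbitrary. The sample means should be centered on the population mean, have variance close to \frac{\sigma^2}{n}, and roughly 95% of them should fall within 1.96 standard deviations, as a normal distribution predicts.

```python
import math
import random
import statistics

rng = random.Random(42)

# Hypothetical skewed population.
population = [rng.expovariate(1.0) for _ in range(10000)]
mu = sum(population) / len(population)
sigma2 = statistics.pvariance(population)

n, trials = 50, 5000
# Each experiment draws n values with replacement and records the sample mean.
means = [sum(rng.choices(population, k=n)) / n for _ in range(trials)]

mean_of_means = sum(means) / trials
var_of_means = statistics.pvariance(means)
std_error = math.sqrt(sigma2 / n)
within = sum(abs(m - mu) <= 1.96 * std_error for m in means) / trials

print("population mean:", mu, "mean of sample means:", mean_of_means)
print("sigma^2 / n:", sigma2 / n, "variance of sample means:", var_of_means)
print("fraction within 1.96 standard errors:", within)
```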

    Standard Error

    The essential idea behind sampling is to perform the experiment on a randomly chosen subset of the population instead of the original population. So far we discussed how to perform the sampling and how to extrapolate the value from the sample to the larger population. In this section let us discuss the error in our estimate. The concept of standard error is usually used to quantify it.

    Let's say we want to estimate the mean of the population \mu_P . We performed the sampling and found the sample mean to be \mu_S. Since the sample mean is an unbiased estimator of the population mean, we can report \mu_S as our estimate of \mu_P. But the two values need not be exactly the same. There are two common ways to analyze the error.

    1. Absolute Error : |\mu_P - \mu_S|
    2. Relative Error : \frac{|\mu_P - \mu_S|}{|\mu_P|}

    The primary problem we have in estimating these error metrics is that we do not know the value of \mu_P. We will use an indirect way to estimate the error.

    Consider the sampling process again. We picked a sample S={Y_1,Y_2,\ldots,Y_n} of random tuples from the database and computed the sample mean as \mu_S=\frac{\sum_{i=1}^{n} Y_i}{n}. But since the tuples in S are picked randomly, the sample mean changes based on the sample. This means that the sample mean \mu_S is itself a random variable.

    Let the population be P={X_1,X_2,\ldots,X_N}. Then the population mean is \mu_P = \overline{X} = \frac{\sum_{i=1}^{N} X_i}{N}. Let the sample be S={Y_1,Y_2,\ldots,Y_n}, so that the sample mean is \mu_S = \overline{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}.

    Standard Error when Sample Size = 1

    Consider the case where the sample S consists of only one element. Our aim is to find the population mean. As per our approach, we pick a random element Y_1 and then announce it as the sample mean, i.e. {\mu}_S=Y_1 . The error in this case is the difference between Y_1 and \mu_P. Since the sample is randomly picked, the sample mean is a random variable, which implies that the error is also a random variable.

    We can derive the expected value for the error as follows :
    E[err^2] = \sum_{i=1}^{N}\frac{(\overline{X}-X_i)^2}{N} = \sigma_P^2.

    We can see that the expected value of the squared error is nothing but the variance of the population ! Its square root, the typical size of the error, is the standard deviation \sigma_P.

    Standard Error when Sample Size = 2

    Consider the case where the sample S consists of exactly two elements. Our aim is to find the population mean. As per our approach, we pick two random elements Y_1,Y_2 and then announce the sample mean as {\mu}_S = \frac{Y_1+Y_2}{2}. The error in this case is the difference between \mu_S and \mu_P. Since the samples are randomly picked, the sample mean is again a random variable, which implies that the error is also a random variable.

    We can derive the expected value for the error as E[err^2] = E[(\overline{X} - \overline{Y})^2] = Var[\overline{Y}], where \overline{Y} = \mu_S is the sample mean and the last step uses E[\overline{Y}] = \overline{X}. The sample can be any one of N^2 ordered pairs of elements, since we draw with replacement. The calculation of the variance would be tricky if not for the independence of the samples. Since the two elements are picked independently of each other, we can use that to compute the variance easily.

    Var[\overline{Y}] = Var [\frac{Y_1 + Y_2}{2}] = Var [\frac{Y_1}{2} + \frac{Y_2}{2}] = \frac{1}{4} Var[Y_1] + \frac{1}{4} Var[Y_2] = \frac{\sigma_P^2}{4} + \frac{\sigma_P^2}{4} = \frac{\sigma_P^2}{2}

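    Note that we used rules 6 and 7 from above. Rules 2 and 3 give E[\overline{Y}] = \overline{X}, which is why the expected squared error equals the variance of \overline{Y}.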
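
    For a tiny population we can verify this result exactly by enumerating every possible ordered sample. The sketch below is only for illustration: the population values are made up and the function name expected_squared_error is my own. It uses exact fractions so that the comparison does not suffer from rounding.

```python
from fractions import Fraction
from itertools import product

def expected_squared_error(population, n):
    """Exact E[(sample mean - population mean)^2] over all N**n ordered
    samples of size n drawn with replacement."""
    N = len(population)
    mu = Fraction(sum(population), N)
    total = Fraction(0)
    for sample in product(population, repeat=n):
        total += (Fraction(sum(sample), n) - mu) ** 2
    return total / N ** n

population = [2, 4, 4, 5, 7, 9]  # hypothetical values
mu = Fraction(sum(population), len(population))
variance = sum((x - mu) ** 2 for x in population) / len(population)

print("population variance:", variance)
print("E[err^2] for n = 1: ", expected_squared_error(population, 1))
print("E[err^2] for n = 2: ", expected_squared_error(population, 2))
```

    Enumeration takes N^n steps, so this only works for toy inputs, but it confirms the algebra without relying on simulation.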

    Standard Error when Sample Size = n

    We will not derive the formula here, but we can easily show that when the sample contains n independently drawn elements the expected squared error is E[err^2] = \frac{\sigma_P^2}{n} = \frac{Variance\;of\;population}{Sample\;size}. The standard error is its square root, \frac{\sigma_P}{\sqrt{n}}.
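
    For readers who want the missing step, here is a sketch of the derivation. It assumes that the n elements are drawn independently (with replacement), so that each Y_i has variance \sigma_P^2, and it follows the same pattern as the two element case using rules 6 and 7. SE is my shorthand for the standard error.

```latex
Var[\overline{Y}] = Var\left[\frac{1}{n}\sum_{i=1}^{n} Y_i\right]
                  = \frac{1}{n^2} \sum_{i=1}^{n} Var[Y_i]
                  = \frac{1}{n^2} \cdot n \sigma_P^2
                  = \frac{\sigma_P^2}{n}
\qquad\Rightarrow\qquad
SE = \sqrt{Var[\overline{Y}]} = \frac{\sigma_P}{\sqrt{n}}
```

    Taking the square root gives the standard deviation of the sample mean, which is the usual definition of the standard error.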

    There are some interesting things to note here : 

    1. The expected squared error with a sample of size 2 is half of that with a sample of size 1.
    2. As the size of the sample increases, the error keeps decreasing. The expected squared error is inversely proportional to the sample size n, so the standard error shrinks as \frac{1}{\sqrt{n}}: to halve the standard error we need four times as many samples.
    3. This also formalizes the intuition that if the variance of the population is small, we need fewer samples to provide estimates with small error.
    4. Usually our work will dictate the tolerable error and we can use the formula to find the appropriate n that will make standard error less than our tolerance factor.
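
    As an example of the last point, the formula can be inverted to get the smallest n whose standard error \frac{\sigma_P}{\sqrt{n}} is within a given tolerance. This is a sketch with a hypothetical helper name. In practice \sigma_P^2 is not known and would itself have to be estimated, for example from a small pilot sample.

```python
import math

def required_sample_size(population_variance, max_standard_error):
    """Smallest n with sqrt(population_variance / n) <= max_standard_error."""
    if max_standard_error <= 0:
        raise ValueError("max_standard_error must be positive")
    return math.ceil(population_variance / max_standard_error ** 2)

# Halving the tolerated error quadruples the required sample size.
print(required_sample_size(25.0, 1.0))
print(required_sample_size(25.0, 0.5))
```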

    Hope this post was useful 🙂