I use the student version of Matlab (32 bit) on a 64 bit machine. As part of the upgrade to Ubuntu 12.10, the installer removed multiple 32 bit files, which caused the issue. When I tried to run Matlab, I got the following errors:
~/matlab/bin/matlab: 1: ~/matlab/bin/util/oscheck.sh: /lib/libc.so.6: not found
~/matlab/bin/glnx86/MATLAB: error while loading shared libraries: libXpm.so.4: cannot open shared object file: No such file or directory
The first error is mostly harmless – I will describe how to fix it later in the blog post. Fixing the second error required installing the i386 (32 bit) versions of a few packages. Of course, fixing one missing library exposed the next, and so on. In the interest of time, I will put all the packages in a single command.
sudo apt-get install libxpm4:i386 libxmu6:i386 libxp6:i386
Running the above command worked and allowed Matlab to run. However, I then faced another issue – when I tried to save a plot, it failed again with the following errors (here too, fixing the first exposed the second):
MATLAB:dispatcher:loadLibrary Can’t load ‘~/matlab/bin/glnx86/libmwdastudio.so’: libXfixes.so.3: cannot open shared object file: No such file or directory.
??? Error while evaluating uipushtool ClickedCallback
MATLAB:dispatcher:loadLibrary Can’t load ‘~/matlab/bin/glnx86/libmwdastudio.so’: libGLU.so.1: cannot open shared object file: No such file or directory.
??? Error while evaluating uipushtool ClickedCallback
To fix this, run the following command :
sudo apt-get install libxfixes3:i386 libglu1-mesa:i386
Finally, to fix the innocuous first error:
~/matlab/bin/matlab: 1: /home/neo/gLingua/matlab/bin/util/oscheck.sh: /lib/libc.so.6: not found
do the following:
sudo ln -s /lib/x86_64-linux-gnu/libc-2.15.so /lib/libc.so.6
Of course, make sure the libc-2.xx.so version is the correct one before running this command.
Hope this post helped!
The workflow I came up with was very simple: create a folder inside my Dropbox Public folder, put the dataset in it and share the link with my friends. Sounds simple enough, although in practice I hit a snag.
I was surprised to find that I was not able to get a shareable link for folders within Public folder. For any other folder in my Dropbox, I was able to right click and get a shareable link. Long story short, the way to enable it is very simple. You just need to visit this site :
Now you can right click on any folder (either using the client or in Web UI) to get the shareable link. An interesting side effect was that I was able to get the shareable link for any file/folder in my Dropbox folder. That’s right – I was able to share a folder that was not in my Public folder. This is a bit surprising but could be useful someday nevertheless.
Sometime in the second or third week of January, this method broke down. Whenever I tried to play the episodes of Voyager, I got an error in Flash player. It would open a dialog box saying ‘Updating Player’, which would soon error out saying "an error occurred and your player could not be updated". If you retried, it got stuck at ‘Updating Player’.
I was using Ubuntu 11.10 on a 64 bit machine. I tried a lot of things and nothing really worked. I installed and reinstalled the Adobe Flash plugin and other codecs and basically made a mess of my system. Finally, I found a simple solution in the Amazon Instant Video forum in an unrelated thread. The link is here . The solution is very simple: install the hal and libhal1 packages for your distro. If you are using Ubuntu, the command is
sudo apt-get install libhal1 hal
A few of my friends also had this issue, and installing these packages seems to fix it. Unfortunately, this useful tip seems buried under other noise, hence this separate blog post. If this did not fix the issue, I recommend looking at Adobe’s "Problems playing protected video content on 64-bit Ubuntu Linux" page, which has some additional information on making Flash work.
Every movie predominantly falls into a particular genre, one of war, crime, adventure, thriller, comedy, romance, etc. Sometimes a movie can belong to more than one genre. For example, a movie like Minority Report, though predominantly sci-fi, has very strong elements of crime in the storyline. Given a movie, I wanted to look at some approximate percentages of the genres contained in it.
If I throw a movie like Pearl Harbor at my system, I expect it to return an output like:
war: 35%
crime: 13%
comedy: 13%
horror: 9%
romance: 19%
sci-fi: 11%
I decided to consider only these six genres, as others like Adventure, Action and Thriller are mostly overlaps. For example, action is a broad category – war and crime could also be considered action; and a thriller could be a medley of action and crime. For my experiment I considered only those genres which I felt were distinct from each other.
The first question is: what kind of data to go after? Movie plots are available from the IMDB dataset. They express the major genre well but do not contain enough information to pull out the minor genres. Wikipedia usually contains a detailed plot, but I found something better: the Internet Movie Script Database (imsdb henceforth). This is an online repository of movie scripts which contain dialogues and also contextual scene information.
For example consider this scene from The Patriot:
Marion leaves the workshop with Susan at his side. Nathan and Samuel walk past, exhausted from their day in the field.
NATHAN: Father, I saw a post rider at the house.
MARION: Thank you. Did you finish the upper field?
SAMUEL: We got it all cut and we bundled half of it.
MARION: Those swimming breaks cut into the day, don’t they?
This contains both the dialogues and the description of the scene. It struck me as the right kind of data I should be looking at. Also this had a good amount of scripts (~1000 movies) for running a quick experiment. I did a quick and dirty wget of all the webpages followed by a bunch of shell scripts to weed out html tags and expressions such as "CAMERA PANS OUT" and "SCENE SHIFTS TO" that add no information to the script. I also removed the stop words and proper nouns. After pre-processing I had just the text of dialogues and the scene description.
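The cleanup step can be sketched in a few lines of Python. The stop-word list, the stage-direction cues, and the sample line below are illustrative stand-ins for the much longer lists the real shell scripts used:

```python
import re

# Hypothetical stop-word list and direction cues; the real lists were longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "we", "it"}
DIRECTIONS = re.compile(r"CAMERA PANS OUT|SCENE SHIFTS TO|CUT TO|FADE IN")
TAGS = re.compile(r"<[^>]+>")

def preprocess(raw_html: str) -> str:
    """Strip HTML tags, stage directions, and stop words from a script page."""
    text = TAGS.sub(" ", raw_html)
    text = DIRECTIONS.sub(" ", text)
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return " ".join(words)

print(preprocess("<b>FADE IN</b> We got it all cut and we bundled half of it."))
# -> got all cut bundled half
```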
Now that I have the data, how do I go about finding the composition of genres in a movie?
Let us define the problem more precisely. I have the pre-processed imsdb content, which is the data, and each movie's genre, which is its label. A simple classification task would involve extracting features from the labeled data and training a classifier to build a model. This model could then be used to predict the genre of a movie outside the training data. This is a supervised approach, where we already have the labeled data – the genre of a movie. For my task of detecting a mixture of genres, all I would need to do is tweak the classifier output so that it gives the individual probability of each genre a movie can possibly belong to. But is this the right approach? Will the output probabilities be close to the actual representation of genres in a movie?
There are a couple of issues with this supervised approach.
My expectation from an unsupervised approach is simple. I have a huge collection of unlabeled scripts. I’m assuming that each script contains a mixture of genres.
I now need to extract features (words) that are relevant to each genre from this corpus. In other words, given a corpus of scripts, the unsupervised system needs to discover a set of topics and list the most relevant features (words) belonging to each topic. I would then be able to use the features to infer the topic mixture of each individual script. This is known as Topic Modelling. Intuitively, I’m trying to generate a bag-of-words model out of a corpus of scripts.
LDA is a topic model, and the one most widely used today. I will not spend much effort giving a rigorous treatment of LDA since there are excellent resources available online. I’m listing those that I found extremely useful for understanding LDA.
A word of warning here! I will write a brief note about my understanding of LDA. Honestly, there are better ways of explaining LDA. At this point all you need to know is that LDA is a statistical black box – throw a bunch of documents at it, specify ‘k’ – the number of topics you think are represented by the documents – and LDA will output ‘k’ topic vectors. Each topic vector contains words arranged in decreasing probability of being associated with that particular topic. You need to know ONLY THIS! So feel free to skip to the next section.
In the case of Maximum Likelihood, given n data points, we assume the underlying distribution that generated the data is a Gaussian and fit the best Gaussian possible to the data. LDA makes a similar assumption that there is a hidden structure to the data: that hidden structure is a multinomial whose parameter comes from a Dirichlet prior.
Let us say that I want to generate a random document; I don’t care whether it is meaningful or not. I first fix the number of words I want to generate in that document. I can, for instance, draw this number from a Poisson. Once I have the number of words (N) to be generated, I go ahead and generate that many words from the corpus.
Each word is generated thus: draw a θ from a Dirichlet(α). (The Dirichlet is a distribution over the simplex.) Consider α as the parameter that decides the shape of the Dirichlet, similar to how the mean and variance decide the shape of the Gaussian bell curve. In a 3-D space, for some choice of α, the probability mass is concentrated near (1,0,0), (0,1,0) and (0,0,1). For some other choice of α, all points in the 3-D simplex might get the same probability! This represents what kind of topic mixtures I can generally expect. (If my initial guess is that each document has only one topic, I will mostly choose an α that puts more probability on points like (1,0,0). This is just a prior, which could be wrong! And in this way it is not strictly analogous to Maximum Likelihood.)
So I now have an α, and I draw a sample from the Dirichlet. What I actually get is a vector that sums up to 1. I call this θ.
Remember that I’m trying to generate a random document and I haven’t generated a single word yet! The θ I have is my guess at the topic vector! I obtained it by sampling from a k-dimensional Dirichlet (k=3 in the above example).
Now θ represents a topic vector, which can also be re-imagined as a probability distribution, and because any draw is guaranteed to come from the simplex, I can use this drawn vector θ as the weights of a loaded k-faced die. And I throw this die! Let’s say it shows 5 (a number between 1 and k). I will now say that the word I’m going to generate belongs to Topic-5.
I still have not generated a word! To generate a word that belongs to a topic, I need a |V|-faced die, where |V| is the size of the vocabulary of the corpus. How do I get such a huge die?!
I get it in a similar way as for the topic vector. I sample again from a Dirichlet – but a different Dirichlet, one over the V-dimensional simplex. Any draw from this Dirichlet gives a V-faced die. Call this Dirichlet(η). For each topic k you need a different V-faced die β_k; thus I end up drawing k such V-faced dice.
So for Topic-5, I throw the 5th V-faced die. Let us say it shows 42; I then go to the vocabulary and pick the 42nd word! I repeat this whole process N times (N was the number of words to be generated for the random document).
The crux of this discussion is this: for every document, the dice (i.e., the samples θ from Dirichlet(α) and β_1, ..., β_k from Dirichlet(η)) are generated only once. It is just that to generate each word, the dice are rolled multiple times: once to get a topic and once to get a word given this topic!
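The generative story above can be sketched directly in Python. Everything here is a toy: the vocabulary, the α and η values, and the fixed document length are made up for illustration, and the Dirichlet draw is built from Gamma draws:

```python
import random

random.seed(0)

# Toy setup; the vocabulary, alpha=0.5 and eta=0.1 are illustrative choices.
vocab = ["gun", "love", "ship", "laugh", "blood", "war"]
K, V = 3, len(vocab)

def dirichlet(alphas):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def roll(weights):
    """Throw a loaded die whose face probabilities are `weights`."""
    return random.choices(range(len(weights)), weights=weights)[0]

beta = [dirichlet([0.1] * V) for _ in range(K)]   # k V-faced word dice
theta = dirichlet([0.5] * K)                      # one k-faced topic die

doc = []
for _ in range(12):          # generate a 12-word random document
    z = roll(theta)          # roll the topic die
    w = roll(beta[z])        # roll that topic's word die
    doc.append(vocab[w])
print(doc)
```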
I converted the imsdb data to a format accepted by Prof. Blei’s code. For more information you should look at his README. I initially ran the code with k = 10 (guessing that the movie scripts could represent a mixture of 10 genres). I will jump ahead here to give a sneak peek at the end results, which have been very encouraging. A sample of words clustered under different topics is here:
At this point, eyeballing the clusters, it is apparent that starting from the red cloud and moving clockwise, the clusters represent romance, sci-fi, war, horror, comedy and crime respectively.
Just running LDA on the data with k=10 did not turn out such relevant results. So what was that extra mile which had to be crossed?
LDA does not always guarantee relevant features, nor is every discovered topic meaningful. After the first run, by eyeballing the 10 topics, I was able to distinguish 3 topics very easily – war, romance and sci-fi. Playing around with different ‘k’ values did not yield more discernible topics. There are reasons for this:
I removed the scripts corresponding to war and sci-fi from the dataset. This was achieved by comparing each script in the dataset against the top features from these topics. Each script was scored based on the occurrence of those features, and I removed the scripts that scored greater than a threshold. The new dataset D* contained scripts other than war and science fiction movies. Then I ran LDA on this new dataset D*. Now I found that the topics pertaining to crime and comedy were getting discovered, but their features were mangled with romance. The feature set was improved by removing words which were common to most categories and not exclusive to one particular category.
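The filtering step can be sketched as follows. The feature set, the two miniature "scripts", and the threshold are all hypothetical stand-ins for the real data:

```python
# Hypothetical top features for the war topic; the threshold is illustrative.
war_features = {"soldier", "battle", "enemy", "captain", "fire"}

def score(script_words, features):
    """Fraction of tokens in the script that are topic features."""
    if not script_words:
        return 0.0
    hits = sum(1 for w in script_words if w in features)
    return hits / len(script_words)

dataset = {
    "patriot": "soldier battle enemy field captain fire fire".split(),
    "notebook": "love letter summer dance love rain".split(),
}
THRESHOLD = 0.5
d_star = {name: words for name, words in dataset.items()
          if score(words, war_features) <= THRESHOLD}
print(sorted(d_star))   # scripts kept for the second LDA run: ['notebook']
```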
This paper deals with using human judgments to examine topics. This is one way of picking the odd feature out of a topic vector. The final feature set looked something like this.
I wrote a simple Python script that does soft clustering based on the occurrence and count of features. A sample of the results is here:
The results were quite convincing to look at, but I wanted some hard metrics. IMSDb already categorises movies by genre, so I consider that category as the actual label. A movie is considered rightly classified if the label matches one of the top two genres detected by my system. The following are some encouraging results!
These proportions cannot be considered accurate, but they do give an idea of what to expect from a movie! The complete list of results is here .
All new development will take place in this extension . I am still thinking about how to create a migration plan for Chrome Nanny users. I will post as I have more information. I will also update older blog posts to point to the new URL. Please spread the word!
I learned a lot of very fascinating things in data mining, machine learning, etc. Expect a few good blog posts. I will try to add a page discussing the next few topics to be blogged. If you want me to discuss any topic, feel free to post it in the comments!
During the past 3 months, my blog readership has almost doubled. Hi to all my new readers! Hope you enjoy my new posts too!
This blog post is based on the lecture notes I prepared for one of the courses for which I am the teaching assistant. Hopefully, these notes will be useful for anyone interested in either Approximate Query Processing (AQP) or basic sampling theory as applied to databases.
Consider a database D which contains billions of tuples; such a large database is not uncommon in data warehousing. When we want answers from the database, we construct a SQL query to get them. Due to the large size of the database, any query will take quite a bit of time. This holds regardless of techniques like indexing, which speed up processing but do not reduce the running time asymptotically.
One observation is that the queries posed to the database return very precise and accurate answers – probably after looking at each and every tuple. For a lot of OLAP use cases we may not need such precise answers. Consider some sample queries: What is the ratio of males to females in UTA? What percentage of US people live in Texas? What is the average salary of all employees in the company? And so on.
Notice that for such queries we really do not need answers that are correct to the last decimal point. Also notice that each of those queries is an aggregate over some column. Approximate query processing is a viable technique in these cases: a slightly less accurate result that is computed instantly is desirable, because most analysts are performing exploratory operations on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most use cases.
There are multiple techniques to perform approximate query processing. The two most popular involve histograms and sampling. Histograms store some statistic about the database and then use it to answer queries approximately. Sampling creates a very small subset of the database and then uses this smaller database to answer the query. In this course we will focus on AQP techniques that use sampling. Sampling is a very well studied problem backed up by a rich theory that can guide us in selecting the subset so that we can provide reasonably accurate results and also provide statistical error bounds.
The idea behind sampling is very simple. We want to estimate some specific characteristic of a population. For example, this might be the fraction of people who support some presidential candidate, the fraction of people who work in a specific field, or the fraction of people infected with a disease. The naive strategy is to evaluate the entire population. Most of the time, this is infeasible due to constraints on time, cost or other factors.
An alternate approach is to pick a subset of people, usually called a sample. The size of the sample is usually an order of magnitude smaller than the population. We then use the sample to perform our desired evaluation and use the result to estimate the characteristic for the entire population. Sampling theory helps with, among other things, how to select the subset, how large the sample should be, how to extrapolate the result from the sample to the original population, and how to estimate the confidence interval of our prediction.
The process of randomly picking a subset of the population and using it to perform some estimation may seem strange. At first glance it might look like we would get wildly inaccurate results. We will later see how to give statistical guarantees for our predictions. Sampling is a very powerful and popular technique, and more and more problems in the real world are being solved using it. Lots of recent problems in data mining and machine learning essentially use sampling and randomization to approximately solve very complex problems that are not at all solvable otherwise. It is for this reason that most DBMSs provide sampling as a fundamental operator. (For example, Oracle provides a SAMPLE clause in SELECT and also the dbms_stats package; SQL Server provides a TABLESAMPLE operator; and so on.)
We represent the population with P and the sample with S. N represents the size of the population and n the size of the sample. We will use Greek letters for statistics on the population and Roman letters for statistics on the sample. For example, μ represents the mean of the population and x̄ the mean of the sample. Similarly, σ and s represent the standard deviation of the population and the sample respectively.
There are different ways to perform sampling. The ones that we will use most in AQP are :
The aim of sampling is to get a subset (sample) from a larger population. There are two ways to go about it. In the first approach, we randomly pick some entity, perform measurements if any, and add it to the sample. We then place the entity back into the original population and repeat the experiment. This means that the same entity can appear in the sample multiple times. This approach is called sampling with replacement. It is the simplest approach to sampling: there is no additional overhead to check whether an entity is already in the sample. Typically, sampling with replacement is modeled with a binomial distribution.
In the second approach, we explicitly make sure that an entity does not appear in the sample more than once. So we randomly pick an entity from the population, verify it is not already in the sample, perform measurements, and so on. Alternatively, we can remove entities that were added to the sample from the population. This approach is called sampling without replacement and is usually modeled with a hypergeometric distribution.
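Python's standard library illustrates the two schemes directly: random.choices samples with replacement, while random.sample samples without. The ten-entity population is just a toy:

```python
import random

random.seed(42)
population = list(range(1, 11))   # ten entity ids

# Sampling with replacement: the same entity may appear more than once.
with_repl = random.choices(population, k=5)

# Sampling without replacement: every entity appears at most once.
without_repl = random.sample(population, k=5)

print(with_repl, without_repl)
```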
Bernoulli Trial: A Bernoulli trial is an experiment which has exactly two outcomes, success or failure. Each of these outcomes has an associated probability that completely determines the trial. For example, consider a coin which produces heads with probability 0.6 and tails with probability 0.4. This constitutes a Bernoulli trial, as there are exactly two outcomes.
In Bernoulli sampling, we perform a Bernoulli trial on each tuple in the database. If the trial is a success, the tuple is added to the sample; else it is not. The most important thing to notice is that all the tuples have exactly the same probability of getting into the sample. In other words, the success probability of the Bernoulli trial remains the same for each tuple. Pseudo code for Bernoulli sampling is given below:
success probability SP = n/N
for i = 1 to N
Perform a Bernoulli trial with success probability SP
If outcome is success add i-th tuple to sample
It is important to note that Bernoulli sampling falls under sampling without replacement. The size of the sample follows a binomial distribution B(N, n/N), i.e., it can vary between 0 and N, with the expected sample size being n.
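The pseudo code above translates to a short Python sketch; the database of 1000 integers and the target size of 100 are illustrative:

```python
import random

def bernoulli_sample(tuples, target_size):
    """One Bernoulli trial per tuple, with success probability n/N."""
    sp = target_size / len(tuples)
    return [t for t in tuples if random.random() < sp]

random.seed(7)
db = list(range(1000))
sample = bernoulli_sample(db, 100)
print(len(sample))   # close to 100, but itself a random quantity
```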
In Uniform Random Sampling, we pick each tuple in the database with a constant probability. This means that the probability of any tuple entering the sample is constant. Typically, this is implemented as sampling with replacement. Pseudo code for Uniform Random Sampling is given below :
1. Generate n random numbers between 1 and N (duplicates allowed); call this multiset S.
2. Select the tuples whose indices are in S and add them to the sample.
Note that in this case we have exactly n tuples in the sample. We can also notice that a sample tuple might appear multiple times. The number of times a given tuple appears in the sample follows a binomial distribution B(n, 1/N).
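The same pseudo code in Python; the 100 tuple ids are illustrative:

```python
import random

def uniform_sample(tuples, n):
    """Draw n indices uniformly at random; duplicates are possible."""
    return [tuples[random.randrange(len(tuples))] for _ in range(n)]

random.seed(3)
db = [f"t{i}" for i in range(1, 101)]
s = uniform_sample(db, 10)
print(len(s))   # always exactly 10, unlike Bernoulli sampling
```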
In Weighted Random Sampling, we perform a Bernoulli trial on each tuple in the database. The difference from uniform random sampling is that the success probability of the Bernoulli trial varies per tuple. In other words, each tuple has a different probability of getting into the sample.
If the population can be subdivided into sub-populations that are distinct and non-overlapping, stratified sampling can be used. In stratified sampling, we split the population into a number of strata and then sample from each stratum independently. For example, a political poll can be stratified on gender, race, state, etc.
There are multiple advantages to using stratified sampling. For one, it allows the convenience of using different sampling techniques for each stratum. If some specific stratum might be under-represented in general random sampling, we can easily give additional weight to the samples taken from it. It is also possible to vary the number of samples per stratum to minimize the error. For example, we can take fewer samples from a stratum with low variance, reserving them for strata with high variance.
Stratified sampling is not a panacea, because a good stratification strategy may not be obvious. For many populations, the best feature to stratify on is not clear. Even worse, the population may not contain subgroups that are homogeneous and non-overlapping.
Given a population, a sample size and a set of strata, there are many ways to allocate the sample across the strata. Two commonly used strategies are:
1. Proportional Allocation: In this approach, the contribution of each stratum to the sample is proportional to the size of the stratum. So a stratum that accounts for 10% of the population will use 10% of the sample. Note that this approach uses no information about a sub-population other than its size.
2. Variance-based Allocation: This strategy allocates samples in proportion to the variance of each stratum. So a stratum with high variance will have higher representation than one with small variance. This is logical, as we need only a few samples to accurately estimate the parameters of a sub-population when its variance is low; additional samples would not add much information or improve the final estimate dramatically.
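The two allocation strategies can be sketched as follows; the strata sizes, standard deviations, and sample budget are made-up numbers:

```python
# Illustrative strata sizes and standard deviations; not from the notes.
sizes = {"male": 6000, "female": 4000}
stds = {"male": 5.0, "female": 15.0}
n = 100   # total sample budget

total = sum(sizes.values())
proportional = {k: round(n * v / total) for k, v in sizes.items()}

# Variance-based (Neyman-style) allocation: weight each stratum by size * std.
weights = {k: sizes[k] * stds[k] for k in sizes}
wsum = sum(weights.values())
variance_based = {k: round(n * w / wsum) for k, w in weights.items()}

print(proportional, variance_based)
# -> {'male': 60, 'female': 40} {'male': 33, 'female': 67}
```

Note how the high-variance "female" stratum gets a larger share under the second strategy even though it is the smaller sub-population.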
Reservoir sampling is an algorithm widely used to take n random samples from a large database of size N. Its real utility is realized when N is very large or not known at sampling time. This scenario is quite common when the input is streaming data or when the database is frequently updated. Running a simple uniform random sampling algorithm (say Bernoulli sampling) is inefficient, since N is large or old tuples may be purged (or fall out of the sliding window). Reservoir sampling gives you the random sample in a single linear pass, inspecting each tuple at most once.
Consider the following contrived example. We have a database which is constantly updated and we want to have a single random tuple from it.
The base case occurs when there is only one tuple in the database. In this case our random sample is the first tuple. Hence the sample S = t1.
Let’s say the database is updated and a new tuple t2 is added. The naive approach is to restart the entire sampling process. Instead, in reservoir sampling, we accept the new tuple as the random sample with probability 1/2, i.e., toss a coin which returns heads with probability 0.5, and if it returns heads, replace t1 with t2.
We can see why S is a uniform sample. The probability that S contains t1 or t2 remains the same.
1. P(S = t1) = 1 × 1/2 = 1/2. The random sample is t1 when it was selected first into S (with probability 1) and then not replaced by t2 (with probability 1/2).
2. P(S = t2) = 1/2. The random sample is t2 when it replaces t1 in the second step, which occurs with probability 1/2.
The database is updated and let’s assume a new tuple t3 is added. We accept the new tuple as the random sample with probability 1/3, i.e., toss a coin which returns heads with probability 0.33, and if it returns heads, replace the previous value of S (t1 or t2) with t3. More generally, when inspecting the i-th tuple, accept it with probability 1/i.
It might look as if we are treating t3 unfairly because we accept it only with probability 0.33. But we can show probabilistically that the sample is still uniform: the probability that S is t1, t2 or t3 remains the same.
1. P(S = t1) = 1 × 1/2 × 2/3 = 1/3. The random sample is still t1 only when it was selected first into S (with probability 1), not replaced by t2 (probability 1/2), and not replaced by t3 (probability 2/3).
2. P(S = t2) = 1/2 × 2/3 = 1/3. The random sample is t2 when it replaces t1 in the second step (probability 1/2) and is then not replaced by t3 (probability 2/3).
3. P(S = t3) = 1/3. The random sample is t3 when S contains either t1 or t2 and it is replaced by t3. This occurs with probability 1/3.
The pseudo code looks like :
S = t1
for i = 2 to N
Replace S with tuple t_i with probability 1/i
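The pseudo code above translates to a few lines of Python; the repeated runs at the end are just a quick sanity check that each of the three tuples is picked roughly equally often:

```python
import random

def reservoir_one(stream):
    """Keep one uniform random tuple from a stream of unknown length."""
    sample = None
    for i, t in enumerate(stream, start=1):
        if random.random() < 1 / i:   # accept the i-th tuple w.p. 1/i
            sample = t
    return sample

random.seed(1)
counts = {}
for _ in range(3000):
    s = reservoir_one(["t1", "t2", "t3"])
    counts[s] = counts.get(s, 0) + 1
print(counts)   # each tuple chosen roughly 1000 times
```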
A very similar approach works when the sample size is more than 1. Let’s say that we need a sample of size n. We initially set the first n tuples of the database as the sample. The next step is a bit different. In the previous case there was only one slot, so we replaced the sample with the selected tuple. When the sample size is more than 1, this step splits into two parts:
1. Acceptance: For any new tuple t_i, we need to decide if this tuple enters the sample. This occurs with probability n/i.
2. Replacement : Once we decided to accept the new tuple into the sample, some tuple already in the sample needs to make way. This is usually done randomly. We randomly pick a tuple in the sample and replace it with tuple t_i.
The pseudo code looks like :
Store first n elements into S
for i = n+1 to N
Accept tuple t_i with probability n/i
If accepted, replace a random tuple in S with tuple t_i
A coding trick avoids the "coin tossing" by generating a random index and accepting the tuple if the index is at most the sample size. The pseudo code looks like:
Store first n elements into S
for i = n+1 to N
randIndex = random number between 1 and i
if randIndex <= n
replace tuple at index "randIndex" in the sample with tuple t_i
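A Python sketch of the random-index trick; the stream of 1000 integers and the reservoir size of 10 are illustrative:

```python
import random

def reservoir_sample(stream, n):
    """Classical reservoir sampling via the random-index trick."""
    sample = []
    for i, t in enumerate(stream, start=1):
        if i <= n:
            sample.append(t)          # first n tuples fill the reservoir
        else:
            j = random.randint(1, i)  # random index between 1 and i
            if j <= n:                # accepted with probability n/i
                sample[j - 1] = t     # replace the tuple at that slot
    return sample

random.seed(5)
res = reservoir_sample(range(1, 1001), 10)
print(res)
```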
A similar analysis shows that classical reservoir sampling also provides a uniform random sample. Please refer to the paper "Random Sampling with a Reservoir" by Jeffrey Vitter for additional details.
As discussed above, our main aim is to see how sampling techniques are used in AQP. Let us assume that we have a sample S of size n. The usual strategy is to apply the aggregate query to the sample S instead of the database D, and then use the result of the query on S to estimate the result for D.
One thing to note is that we will only use aggregate queries for approximate processing; specifically we will look at COUNT, SUM and AVERAGE. The formulas for estimating the value of an aggregate over the entire database from the sample are well studied for these 3 operators. For additional details refer to the paper "Random Sampling from Databases" by Frank Olken.
Uniform Random Sample
1. Count: Count ≈ (N/n) · Σ_{t_i ∈ S} X_i, where X_i is an indicator random variable that is 1 when tuple t_i satisfies our clause, and n/N is the probability that a tuple is selected into the sample. Intuitively, the formula finds the fraction of tuples in the sample which satisfy the query and applies the same fraction to the entire database.
2. Sum: Sum ≈ (N/n) · Σ_{t_i ∈ S} X_i · v_i, where v_i is the value of the aggregated attribute of tuple t_i.
3. Average: Average ≈ Sum / Count, i.e., the average of the attribute over the sample tuples that satisfy the clause.
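A quick sketch of these estimators in Python, using a made-up table of ages and the predicate age > 50; the numbers are illustrative:

```python
import random

random.seed(11)
db = [random.randint(20, 80) for _ in range(10000)]   # e.g. ages
N, n = len(db), 500
sample = random.sample(db, n)

satisfies = [v for v in sample if v > 50]             # WHERE age > 50
est_count = (N / n) * len(satisfies)                  # scale up by N/n
est_sum = (N / n) * sum(satisfies)
est_avg = est_sum / est_count if est_count else 0.0

true_count = sum(1 for v in db if v > 50)
print(est_count, true_count)   # the estimate lands near the true count
```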
Weighted Random Sample
1. Count: Count ≈ Σ_{t_i ∈ S} X_i / p_i, where X_i is an indicator random variable that is 1 when tuple t_i satisfies our clause, and p_i is the probability that tuple t_i is selected into the sample. Intuitively, the formula reweighs each tuple according to its selection probability.
2. Sum: Sum ≈ Σ_{t_i ∈ S} X_i · v_i / p_i, where v_i is the value of the aggregated attribute of tuple t_i.
3. Average: Average ≈ Sum / Count.
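A minimal sketch of the reweighting, with made-up values and inclusion probabilities:

```python
# A toy weighted sample: each tuple was included with a known probability.
sample = [(30, 0.1), (70, 0.5), (55, 0.5)]   # (value, inclusion probability)

# Reweight each tuple by the inverse of its inclusion probability.
est_count = sum(1 / p for _, p in sample)
est_sum = sum(v / p for v, p in sample)
est_avg = est_sum / est_count

print(est_count, est_sum)   # -> 14.0 550.0
```

The rarely-included tuple (probability 0.1) stands in for roughly ten tuples like it in the database, which is exactly what the 1/p weight expresses.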
Here are a few commonly used equations related to expectation and variance.
1. E[X] = Σ x · P(X = x), summing over all values x that X can take
2. E[aX+b] = aE[X] + b
3. E[X+Y] = E[X] + E[Y] (also called Linearity of Expectations)
4. Var[X] = E[X^2] - (E[X])^2
5. Var[X+a] = Var[X]
6. Var[aX] = a^2 Var[X]
7. Var[X+Y] = Var[X] + Var[Y] if X and Y are independent.
Law of Large Numbers: This is one of the fundamental laws in probability. Let X_1, X_2, ..., X_n be random variables drawn iid. Very informally, as n increases, the average of the variables approaches the expected value of the distribution from which they are drawn. For example, if we have a coin which produces heads with probability 0.5 and toss it 10000 times, the number of heads will be very close to 5000.
Binomial Distribution: Suppose you repeat a Bernoulli trial with success probability p a total of n times. The distribution of the number of successes in the n trials is given by the binomial distribution B(n, p). This is a very important distribution for modeling sampling with replacement. For example, if we perform Bernoulli sampling, the final size of the sample follows a binomial distribution. Also, if we randomly pick n tuples from a database of size N, the number of times any given tuple is repeated in the sample can be modeled by a binomial distribution. The expected value is np and the variance is np(1-p).
Normal Distribution: The normal distribution, aka the Gaussian distribution, is one of the most important probability distributions. It is usually represented with parameters (μ, σ^2) and has the familiar bell curve shape. μ determines the center of the normal curve and σ determines its width: a smaller σ results in a tighter curve while a larger value results in a flatter/wider curve.
Equations (3) and (7) above show how the expected value and variance of the sum of two random variables can be computed from the expected values/variances of the constituent random variables. By induction, these rules extend to sums of arbitrarily many random variables. This brings us to the Central Limit Theorem.
Central Limit Theorem : This is one of the most fundamental results in probability. Assume that we have n random variables X_1, ..., X_n, each with mean μ and variance σ². For large n, the distribution of the sum S_n = X_1 + ... + X_n is approximately normal with parameters (nμ, nσ²). Similarly, the distribution of the average S_n / n is approximately normal with parameters (μ, σ²/n).
The law of large numbers and the central limit theorem are two important results that allow analysis of sampling results. The law of large numbers says that if we pick a large enough sample, then the average of the sample will be close to the true average of the population. The central limit theorem states that if you repeat the sampling experiment multiple times and plot the distribution of the average value of the samples, the averages follow a normal distribution. Jointly, they allow you to derive the expression for the standard error.
The essential idea behind sampling is to perform the experiment on a randomly chosen subset of the population instead of the original population. So far we discussed how to perform the sampling and how to extrapolate the value estimated from the sample to the larger population. In this section, let us discuss the error in our estimate. The concept of standard error is usually used for this purpose.
Let's say we want to estimate the mean μ of the population. We performed the sampling and found the sample mean to be x̄. Since the sample mean is an unbiased estimator of the population mean, we can announce that they are the same. But it need not always be the case that these two values are equal. There are two common ways to quantify the error.
1. Absolute Error : |x̄ − μ|
2. Relative Error : |x̄ − μ| / μ
The primary problem we have in estimating these error metrics is that we do not know the value of μ. We will use an indirect way to estimate the error.
Consider the sampling process again. We picked a sample S of n random tuples from the database and computed the sample mean as x̄. But since the tuples in S are picked randomly, the sample mean changes based on the sample. This means that the sample mean is itself a random variable.
Let the population be X = {x_1, x_2, ..., x_N}, with population mean μ = (1/N) Σ_i x_i and variance σ². Let the sample be S = {s_1, s_2, ..., s_n} with sample mean x̄ = (1/n) Σ_i s_i.
Consider the case where the sample S consists of only one element. Our aim is to find the population mean. As per our approach, we pick a random element X_1 and announce it as the sample mean, i.e. x̄ = X_1. The error in this case is the difference between x̄ and μ. Since the sample is randomly picked, the sample mean is a random variable, and hence the error is also a random variable.
We can derive the expected squared error as follows :
E[(x̄ − μ)²] = E[(X_1 − μ)²] = Var[X] = σ²
Taking the square root, the expected (root-mean-square) error is σ. We can see that the expected error is nothing but the standard deviation of the population !
Consider the case where the sample S consists of exactly two elements. Our aim is to find the population mean. As per our approach, we pick two random elements X_1 and X_2 and announce the sample mean as x̄ = (X_1 + X_2)/2. The error in this case is the difference between x̄ and μ. Since the samples are randomly picked, the sample mean is again a random variable, and hence the error is also a random variable.
We can derive the expected squared error as E[(x̄ − μ)²] = Var[x̄]. Note that the sample can be any one of the N² possible pairs of elements (sampling with replacement). The calculation of this variance might be tricky if not for the independence of the samples. Since the two elements were picked randomly, they are independent and we can use that to compute the variance easily:
Var[x̄] = Var[(X_1 + X_2)/2] = (1/4)(Var[X_1] + Var[X_2]) = σ²/2
Taking the square root, the standard error for a sample of two elements is σ/√2.
Note that we used rules 3, 6 and 7 from above.
We will not derive the formula here, but one can show in the same way that when the sample contains n different elements the standard error is given by σ/√n.
There are some interesting things to note here :
1. The expected error when using 2 samples is smaller (by a factor of √2) than the expected error when using only one sample.
2. As the sample size increases, the error keeps decreasing, but slowly: the standard error is inversely proportional to the square root of the sample size, so to halve the error you must quadruple the sample.
3. This also formalizes the intuition that if the variance of the population is small, we need fewer samples to provide estimates with small error.
4. Usually our task will dictate the tolerable error ε, and we can use the formula to find the smallest n that makes the standard error less than our tolerance: n ≥ (σ/ε)².
Hope this post was useful !
In my tutorial on Unity, I mentioned that if no one else writes a Lens tutorial, I will write one, and here it is. Lenses in Ubuntu are a really neat idea. One simplistic way is to view them as specialized search tools: you open the appropriate lens, type something and you get a list of results. But this explanation does not really do justice to the incredible amount of flexibility that they provide. Lenses can group their results, have Unity render the results in different styles, have multiple sections that influence the way search is done, and so on. Unity also provides a very powerful API that allows you to write literally anything that you can think of. This post will discuss the architecture, the components and their potential capabilities in detail.
If you are using Unity, then you must be familiar with the two default lenses that come with it : the Application lens and the Files & Folders lens. Let me use them to explain the various parts of a Lens.
Places/Lenses : Both these words represent the same idea. Places is the terminology that was primarily used when Unity was introduced. Its usage is now slowly being deprecated and the word Lenses is preferred. In the rest of this blog post, I will refer to them as Lenses for uniformity. Let us take a look at the various parts of a lens, taking the Application lens as an example.
If you click on the Application lens, you can see that it is split into two important pieces. At the top is a search-textbox-like entity that allows you to enter the query. At the bottom is the results pane that shows the results of your query. At the left end of the search textbox is a search result indicator. While the search is still going on, the indicator shows a spinning circle. Once the search is done, it reverts to its default icon of a cross.
Sections : The simplest way to describe sections is that they behave like non-overlapping categories. If a given search query might return different results when searched under different categories, each of them qualifies as a section.
Consider a simple example – you are writing a lens for StackOverflow. You may want to search questions, tags, users and so on. The results of searching query q as a question are not the same as when searching q as a tag. Hence questions, tags and users become sections. Unity allows you to easily have different sections and switch between them.
Sections are the entries in the dropdown list at the right end of the search textbox. They are optional and many places need not have them. You can see all sections of a lens by clicking on the dropdown. Taking the example of the Application lens, some of the sections are “All Applications”, “Accessories”, “Games”, “Graphics” etc.
Groups : Groups are a mechanism by which you can club certain search results under a meaningful entity. Each search result might belong to one or more groups that allow you to organize the results. For example, when you search in the Application lens, the groups are “Most frequently used”, “Installed”, “Available for download” etc. Groups need not be non-overlapping like sections : given a search query, the same result can occur in different groups. In addition, Unity allows you to have different icons for different groups. Also, each group can be rendered in a different style. I will discuss further details later in the blog post.
Difference between Sections and Groups : There are a lot of differences between groups and sections. Sections are mostly non-overlapping sets of entries. Each section provides a different perspective on the same search query. In contrast, groups allow you to classify the search results from a query (and a section). Groups can be considered a way to organize the results rather than a way to influence how the results are obtained (which is what sections do).
Results : These are the output of a search query in a lens. Given a query and a section, you get a list of result entries. Each of them can optionally be assigned to a group. Each result entry can display information about itself in a concise way. In addition, Unity allows each result to have any URI and a mime type. So the results of one lens might be files, and clicking on them executes them. Or the results might come from the web, and clicking on them opens the link in a browser, and so on. Unity also allows results to have custom icons.
In the following section, we will discuss some terms used in Lens development.
DBUS : DBUS is one of the most important components in the development of a lens. One way to view a lens is as a daemon that accepts a search query from Unity, does some processing and returns results along with information on how to display them. DBUS is the conduit through which the entire communication between the daemon and Unity happens. For the purpose of the lens, you need to own a bus which will be used by Unity for communication.
Lens/Place Daemon : A lens/place daemon is a program that exposes lens functionality. A single daemon can expose multiple lenses. Each lens is identified by an entry. The daemon accepts a search string from Unity and sends out search results that are annotated with group and other additional information.
Place File : A place file is how you inform Unity about the various lenses that your daemon is going to expose. It is a text file that ends with .place extension and usually placed in /usr/share/unity/places . It contains different groups of entries that describe the different lenses that are exposed.
Lens Entry : A lens, or an entry in the place file, is a distinct entity that can accept a search query and provide appropriate results. Every place file contains at least one lens entry. Every entry has a distinct location for itself in the Unity places bar (the vertical strip on the left). A lens entry has 3 broad components : a model each for sections, groups and results. Results contain information about the groups they fall in and how to display themselves.
Results : As discussed above, a result is an entry that is the response to a query. A result can be assigned to one or more groups. In the most general sense, a result consists of 5 subparts :
(a) URI : This is a resource identifier to uniquely denote the location of search result. It can be a local item (file://) , web item (http://) and so on.
(b) Display Name : This is a name for the result that the user will see in the results pane.
(c) Comment : This is additional information about the result entry.
(d) MimeType : Since the URI can point to anything, it is important to inform Unity what action to perform when the user clicks on the result. The mimetype allows Unity to invoke the appropriate launcher.
(e) Icon : Each result can have its own icon to represent it visually. It can be an icon file or the generic name of the icon (eg video).
Renderer : Unity provides some flexibility in how the results are displayed. Each result falls into some group, and we can request Unity to display each group in a specific way. This is accomplished by the use of renderers. When we set up the model for the lens's groups, we also specify how each group will be rendered. Some common renderers are :
(a) UnityDefaultRenderer : This is the default way to display the results. In this renderer, each result shows the icon and the display name beneath it.
(b) UnityHorizontalTileRenderer : In this renderer, the icon and display name for each result is shown. In addition, it also displays the comment associated with the result.
(c) UnityEmptySearchRenderer : This renderer is used when the current search query returned no results. The result item construction is very similar to the results above. But the display name is displayed across the results pane.
(d) UnityShowcaseRenderer : I tried using it and the result was not much different from UnityDefaultRenderer. But according to the documentation, it should have a bigger icon.
(e) UnityEmptySectionRenderer : I did not try this as it did not make sense in the application I developed. Supposedly, it is used to imply that the current section is empty. If my interpretation is correct, then this is used when the current section returned no results for the query but a different section might produce results. UnityEmptySearchRenderer, on the other hand, is used to imply that the current search query itself (regardless of the section) produces zero results.
(f) UnityFileInfoRenderer : Copying from the reference, this renderer is used in files place to show search results. Expects a valid URI, and will show extra information such as the parent folder and whatever data is in the comment column of the results.
There are three major steps in writing a Lens.
(1) Writing a .place file
(2) Writing a Lens daemon
(3) Writing a service file, if you want to distribute the lens
As an example, let us develop a Lens that does the following :
(1) It has two sections. In the “Movie Names” section, only the names of the movies are shown. In the “Genre” section, all movies are clubbed according to their genres, which act as groups. So if movie X is in genres Y and Z, X will be displayed in both groups Y and Z.
(2) Clicking on the result will open a browser and redirect to the IMDB page for the movie.
(3) The lens accepts queries from two places : when entered directly in the IMDB lens and also from the dash. When searched from the dash, it will display movie names as a separate group.
At the initial state, the lens will look like this :
For the sake of completeness, here is how it looks when you just search for Movie Names.
For the sake of completeness, here is how it looks when you search for Genres. Basically, the results are grouped using the genre. If a movie falls under multiple genres, it also falls into multiple groups.
Let us take a look at how to write the IMDB lens as per the major steps above.
As discussed above, a place file contains the information that allows Unity to display the entries in the places bar. Information like which bus to use for communication, what shortcut to use etc. is also included. For the purpose of this tutorial, I will put up the place file I used for the IMDB lens and then explain the significance of each line. The file is named imdbSearch.place and is placed in /usr/share/unity/places/ .
[Place]
DBusName=net.launchpad.IMDBSearchLens
DBusObjectPath=/net/launchpad/imdbsearchlens

[Entry:IMDBSearch]
DBusObjectPath=/net/launchpad/imdbsearchlens/mainentry
Icon=/usr/share/app-install/icons/totem.png
Name=IMDB Search
Description=Search Movies using IMDB
SearchHint=Search Movies using IMDB
Shortcut=i
ShowGlobal=true
ShowEntry=true

[Desktop Entry]
X-Ubuntu-Gettext-Domain=imdbSearch-place
As you can see, the file has three sections. Each section contains a set of key-value pairs.
Place Section :
This is the simplest section and contains two keys : DBusName and DBusObjectPath. DBusName gives the name of the bus under which this entry can be found, while DBusObjectPath provides the actual DBus object path. The names can be any valid DBus names.
Entry Section :
A place file can contain one or more entries. Each entry must contain a section that corresponds to it. Each entry will occupy a distinct location in the places bar. If there are multiple entries, each of them must have a unique name and object path. In the daemon code, all those entries must be added individually.
The various keys are :
DBusObjectPath : This is the DBus path used to communicate with this specific entry. Note that this path must be a child of the main DBusObjectPath (the one in the Place section). In our case there is only one entry, and hence we put it under the path “mainentry” of the original path.
Icon : Link to the icon file
Name : Name of the entry. When you hover over the entry in the places bar, this will be displayed.
Description : A long description of the lens. I am not sure where this is used.
SearchHint : This is the text that is displayed by default in the lens when the user selects it. In the image above, the search hint “Search Movies using IMDB” is displayed when the user selects the IMDB lens.
Shortcut : This gives the key which is used to trigger the lens. Of course, you can always use the mouse to select it. If this key is specified, then pressing Super+shortcut opens the lens. For the IMDB lens, pressing Super+i opens it.
ShowGlobal : This is an optional key and defaults to true. If it is set to false, then searches from the main dash are not sent to the lens. This seems to override the specification inside the lens daemon, i.e. if the place file specifies ShowGlobal as false and the daemon adds a listener to the event where the user enters a query in the main dash, it is not invoked. Most of the time, I think it makes more sense for the lens to set this as false. For example, it certainly does not make sense for the IMDB lens to offer its service in the dash. Most of the time, when the user is searching in the dash, he is looking for some file or application; polluting those results with IMDB results may not be wise. This is even more true if your lens takes a while to provide all the results.
ShowEntry : This is an optional key and defaults to true. If it is set to false, then the entry will not be displayed in the places bar. If this is set to false, then the Shortcut key becomes useless as well : pressing the shortcut key does not invoke the lens. But if ShowGlobal is set to true, then the lens will still be searchable via the dash. For example, if for some reason I decide that the IMDB lens must only provide results to queries from the dash, I will set ShowGlobal to true and ShowEntry to false.
This section is mostly optional. The most used key is X-Ubuntu-Gettext-Domain, which is used for internationalization purposes. If you want the lens to support internationalization, provide the appropriate entry name. If you are not familiar with internationalization in Ubuntu, there are two broad ways of achieving it : either put all the translations statically in the file, or put them in the appropriate .mo file which will then be put in some language-pack file. I included this line because Unity whines with “** Message: PlaceEntry Entry:IMDBSearch does not contain a translation gettext name: Key file does not have group ‘Desktop Entry'”.
The next step is to write a daemon that communicates with Unity over the DBus path, does the search and returns annotated search results so that Unity can render them meaningfully. The daemon can be written in any language that supports GObject introspection; the most common languages are C, Vala and Python. Very informally, GObject introspection is a mechanism by which bindings for other languages can be generated relatively easily. The actual process needs multiple components. Based on reference [2], this means that you must verify that the following packages are installed in the system – gir1.2-unity-3.0, gir1.2-dee-0.5 and gir1.2-dbusmenu-glib-0.4. Of course, the actual version numbers may change in the future. If installing them does not make it work, look at reference [2] for additional instructions.
I will use the IMDB Search lens as an example to explain writing a lens daemon in Python. The full source code is in [1]; I will use snippets of the code to make the appropriate points. The first step is of course importing the appropriate libraries. If you look at other Python lens files, they used to have a hack that probes Dee before importing Unity. In my tests I found it unnecessary. If you get some error importing Unity, then take a look at the other sample files [3] and adapt accordingly.
from gi.repository import GLib, GObject, Gio, Dee, Unity
The next step is to define the name of the bus that we will use to communicate with Unity. Note that this must exactly match the “DBusName” key in the place file.
BUS_NAME = "net.launchpad.IMDBSearchLens"
The next step is to define some constants for the sections in your lens. If your lens contains only one section, feel free to ignore the initial lines; else define section constants appropriately. The only place where you will use these constants is to figure out which section the lens is currently in. The section ids are integers ordered from 0. Note that the order given here is the *exact* order in which the sections must be added to our lens in Unity.
Similarly, we can have constants for groups. Instead of creating a lot of constants, I create the group ids dynamically : first, I create a list of all IMDB genres and use a hash to map each name to an integer (group id). Also note that I have constants for additional groups – GROUP_EMPTY, GROUP_MOVIE_NAMES_ONLY and GROUP_OTHER.
One thing to notice is that if different sections have non overlapping groups, all of them must be defined here. Then based on current section, use the appropriate group. As an example, GROUP_MOVIE_NAMES_ONLY is used only with SECTION_NAME_ONLY. But we define it alongside other groups. Also note the group for empty, GROUP_EMPTY. This will be used if the search query returns no results. As with sections, the order of groups defined here must exactly match the order in which they are added to group model. If you have reasonably small number of sections and groups, you can avoid the elaborate setup of writing constants.
SECTION_NAME_ONLY = 0
SECTION_GENRE_INFO = 1

allGenreGroupNames = [ "Action", "Adventure", "Animation", "Biography", "Comedy",
    "Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "Game-Show",
    "History", "Horror", "Music", "Musical", "Mystery", "News", "Reality-TV",
    "Romance", "Sci-Fi", "Sport", "Talk-Show", "Thriller", "War", "Western"]
numGenreGroups = len(allGenreGroupNames)
GROUP_EMPTY = numGenreGroups
GROUP_MOVIE_NAMES_ONLY = numGenreGroups + 1
GROUP_OTHER = numGenreGroups + 2

# We create a hash which allows us to find the group id from the genre name.
groupNameTogroupId = {}
for i in range(len(allGenreGroupNames)):
    groupName = allGenreGroupNames[i]
    groupID = i
    groupNameTogroupId[groupName] = groupID
The next step is to define the skeleton of the lens daemon. It is a simple Python class. In the constructor, you define the entire model corresponding to each entry in the lens. If your place file contains n entries, you will be defining n different PlaceEntryInfo objects. In our case we have only one; hence we create a variable that points to the corresponding DBusObjectPath. It is imperative that the path exactly match the DBusObjectPath of the entry.
Each PlaceEntryInfo consists of different models : sections_model, groups_model and results_model. If you want the lens to respond to searches in the dash, you will also need to set up global_groups_model and global_results_model. The code below contains information about the schema of the different models. You can consider it more or less code that can be blindly copied. If you want additional information about things that you can tweak in PlaceEntryInfo, take a look at this url .
Note on Property based access :
One important thing to notice is that all the access is done using properties. There are two ways to do that :
(1) x.props.y
(2) x.get_property(“y”)
Choosing between them is mostly a matter of coding style. Choose one and use it consistently.
class Daemon:

    def __init__ (self):
        self._entry = Unity.PlaceEntryInfo.new ("/net/launchpad/imdbsearchlens/mainentry")

        # set_schema("s","s") corresponds to: display name for section, the icon used to display it
        sections_model = Dee.SharedModel.new (BUS_NAME + ".SectionsModel");
        sections_model.set_schema ("s", "s");
        self._entry.props.sections_model = sections_model

        # set_schema("s","s","s") corresponds to: renderer used to display group,
        # display name for group, the icon used to display it
        groups_model = Dee.SharedModel.new (BUS_NAME + ".GroupsModel");
        groups_model.set_schema ("s", "s", "s");
        self._entry.props.entry_renderer_info.props.groups_model = groups_model

        # Same as above
        global_groups_model = Dee.SharedModel.new (BUS_NAME + ".GlobalGroupsModel");
        global_groups_model.set_schema ("s", "s", "s");
        self._entry.props.global_renderer_info.props.groups_model = global_groups_model

        # set_schema("s","s","u","s","s","s") corresponds to: URI, icon name, group id,
        # MIME type, display name for entry, comment
        results_model = Dee.SharedModel.new (BUS_NAME + ".ResultsModel");
        results_model.set_schema ("s", "s", "u", "s", "s", "s");
        self._entry.props.entry_renderer_info.props.results_model = results_model

        # Same as above
        global_results_model = Dee.SharedModel.new (BUS_NAME + ".GlobalResultsModel");
        global_results_model.set_schema ("s", "s", "u", "s", "s", "s");
        self._entry.props.global_renderer_info.props.results_model = global_results_model
Once you have set up the models, you need to add listeners to the events that you want to catch. The most important one is “notify::synchronized”. This is called when Unity sets up your lens and wants to know the various groups and sections in your entry. In the code below, we attach three different functions to that event : one gives the sections in the lens, another gives the groups, and the last gives the groups for the dash.
Next we catch two events that are at the core of a lens – notify::active-search and notify::active-global-search. The first is triggered when the user searches something in the search textbox of the place, while the second is triggered when the user searches something in the dash. It is important to notice that the same search can trigger these events multiple times. By default, Unity does a decent job of batching the calls providing search strings, but it is an important thing to consider if your lens does expensive operations. Take a look at the section on caching for additional details.
The event notify::active-section is triggered when the user changes the section of the lens by using the dropdown. You can then use your section constants to decide which section is currently selected.
notify::active is triggered when the user selects and leaves your place. Hence it's an indirect way for your daemon to know whether the lens is visible to the user.
        # Populate the sections and groups once we are in sync with Unity
        sections_model.connect ("notify::synchronized", self._on_sections_synchronized)
        groups_model.connect ("notify::synchronized", self._on_groups_synchronized)
        # Comment the next line if you do not want your lens to be searched in the dash
        global_groups_model.connect ("notify::synchronized", self._on_global_groups_synchronized)

        # Set up the signals we'll receive when Unity starts to talk to us

        # The 'active-search' property is changed when the user searches within this particular place
        self._entry.connect ("notify::active-search", self._on_search_changed)

        # The 'active-global-search' property is changed when the user searches from the Dash aka Home Screen.
        # Every place can provide results for the search query.
        # Comment the next line if you do not want your lens to be searched in the dash
        self._entry.connect ("notify::active-global-search", self._on_global_search_changed)

        # Listen for changes to the section
        self._entry.connect ("notify::active-section", self._on_section_change)

        # Listen for changes to the status - is our place active or hidden?
        self._entry.connect ("notify::active", self._on_active_change)
Once you have set up the different models and the functions to populate them, you need to add the entry to a PlaceController. If your lens has multiple entries, repeat the above process for each entry and, once constructed, call add_entry on the PlaceController for each of them. Note that the argument to PlaceController's constructor is the same as the DBusObjectPath in the Place section of your .place file.
        self._ctrl = Unity.PlaceController.new ("/net/launchpad/imdbsearchlens")
        self._ctrl.add_entry (self._entry)
        self._ctrl.export ()
Once we have set up all the models, the next step is to define the functions that are used. The first is the function that populates the entry's section model. Note that we call append on the sections_model with two parameters : the name of the section and an icon for it. You can also observe that when we defined the “schema” of the sections_model, it accepted two strings. One other thing to note is the lack of use of any section constants – you must be careful to define the sections in the exact same order as the constants. In our case, this means SECTION_NAME_ONLY followed by SECTION_GENRE_INFO.
    def _on_sections_synchronized (self, sections_model, *args):
        # Column0: display name
        # Column1: GIcon in string format. Or you can pass the entire path (or use GIcon).
        sections_model.clear ()
        sections_model.append ("Movie Names", Gio.ThemedIcon.new ("video").to_string())
        sections_model.append ("Movie Genre", Gio.ThemedIcon.new ("video").to_string())
The next step is to define a set of groups. Due to the nature of this lens, we reuse the list to form it. Notice the use of UnityEmptySearchRenderer for the GROUP_EMPTY. For this example, I have used UnityHorizontalTileRenderer. Your lens might need some other renderer – Take a look at the renderer section above for a discussion of the different renderers.
        # Column0: group renderer
        # Column1: display name
        # Column2: GIcon in string format. Or you can pass the entire path (or use GIcon).
        groups_model.clear ()
        # Remember to add groups in the order you defined above (ie when defining constants)
        for groupName in allGenreGroupNames:
            groups_model.append ("UnityHorizontalTileRenderer", groupName,
                                 Gio.ThemedIcon.new ("sound").to_string())
        # GROUP_EMPTY
        groups_model.append ("UnityEmptySearchRenderer", "No results found from IMDB",
                             Gio.ThemedIcon.new ("sound").to_string())
        # GROUP_MOVIE_NAMES_ONLY
        groups_model.append ("UnityHorizontalTileRenderer", "Movie Names",
                             Gio.ThemedIcon.new ("sound").to_string())
        # GROUP_OTHER
        groups_model.append ("UnityHorizontalTileRenderer", "Other",
                             Gio.ThemedIcon.new ("sound").to_string())
The next step is to define the function that is called when the search changes in the place search box. Notice how we use the entry object to get the current section, the current results, current search query and so on. Finally, we invoke search_finished function which calls search.finished() which signals to Unity that we are done processing the query.
    def _on_search_changed (self, *args):
        entry = self._entry
        self.active_section = entry.get_property ("active-section")
        search = self.get_search_string ()
        results = self._entry.props.entry_renderer_info.props.results_model

        print "Search changed to: '%s'" % search

        self._update_results_model (search, results)

        # Signal completion of search
        self.search_finished ()
There are other functions which we could detail, but mostly they are very similar to the code shown here. You can take a look at the Python file linked below [1] for the full glory.
So far, all we have done is use some boilerplate code. From here on, though, the code starts to differ based on the application, so I have not included the code that actually produces the results here. Take a look at the Python file for details.
Once you have written the Python file, you want to test your baby. Here are the steps :
(1) Include the following code in your daemon. This code was literally copied from the Unity Sample Place [3] code.
import sys  # needed for the error message below

if __name__ == "__main__":
    session_bus_connection = Gio.bus_get_sync (Gio.BusType.SESSION, None)
    session_bus = Gio.DBusProxy.new_sync (session_bus_connection, 0, None,
                                          'org.freedesktop.DBus',
                                          '/org/freedesktop/DBus',
                                          'org.freedesktop.DBus', None)
    result = session_bus.call_sync ('RequestName',
                                    GLib.Variant ("(su)", (BUS_NAME, 0x4)),
                                    0, -1, None)

    # Unpack variant response with signature "(u)". 1 means we got it.
    result = result.unpack ()[0]
    if result != 1:
        print >> sys.stderr, "Failed to own name %s. Bailing out." % BUS_NAME
        raise SystemExit (1)

    daemon = Daemon ()
    GObject.MainLoop ().run ()
(2) Copy your place file to /usr/share/unity/places/. If you have any custom icons, copy them to the location specified in your .place file.
(3) Open up a new terminal and invoke your daemon file. Also remember to set its executable bit.
(4) Open up another terminal and type “setsid unity”. Using setsid is a very neat trick I learnt from the sample place file. Basically, it starts Unity in a new session, which forces it to read all the place files again. Also, if you make any change to the daemon code, you can kill the daemon and restart it, and Unity will start communicating with your new daemon. Additionally, setsid ensures that Unity does not have a controlling terminal but still dumps all debug information into the terminal from which it was started.
(5) Open the lens, type your query and watch the results appear!
If your lens is not working as expected, here are some things to verify:
(1) Make sure that your place file is copied to /usr/share/unity/places.
(2) Make sure that the bus name and object path match exactly between your place file and your daemon file.
(3) Make sure that the Python file has its executable bit set. This is important if you want to run it as a service/daemon that starts automatically. More on that below.
(4) If you are simply testing it, is your daemon file running? If you are running it as a service, does your service file contain information that matches your place and daemon files?
(5) Did you restart Unity – preferably as “setsid unity”?
(6) Did you make any recent changes to the place file? If so, make sure it is copied and restart Unity.
(7) Check for any exception information in the terminal from which you started your daemon. If it is a service, enable logging and examine the log files.
(8) Check if your lens actually started. If you used “setsid unity” in a terminal, it will log the various steps it performs. A sample invocation showing lenses starting successfully looks like this:
** (<unknown>:6114): DEBUG: PlaceEntry: Applications
** (<unknown>:6114): DEBUG: PlaceEntry: Commands
** (<unknown>:6114): DEBUG: PlaceEntry: Files & Folders
** (<unknown>:6114): DEBUG: PlaceEntry: IMDB Search
** (<unknown>:6114): DEBUG: /com/canonical/unity/ applicationsplace
** (<unknown>:6114): DEBUG: /com/canonical/unity/filesplace
** (<unknown>:6114): DEBUG: /net/launchpad/imdbsearchlens
(9) If your lens works but puts results in strange groups, make sure that the order of groups in the synchronize call and the group constants are the same.
(10) Common errors and fixes:
** Message: PlaceEntry Entry:IMDBSearch does not contain a translation gettext name: Key file does not have group ‘Desktop Entry’ :
This is not an error. If it annoys you, add a [Desktop Entry] section. See above for details.
** (<unknown>:6114): WARNING **: Unable to connect to PlaceEntryRemote net.launchpad.IMDBSearchLens: No name owner
This again is not an error, but it indicates that either your daemon program is not running or it is not able to own the bus name specified in the .place file.
Since I primarily wrote the IMDB lens for learning Unity, I do not know much about the actual installation process. Here are some generic tips:
(1) If you want a simple method of installation, use distutils. It accepts a list of files and the locations where they must be copied. This is the cleanest way.
(2) If you install the lens and expect it to start automatically, then you should write a .service file that specifies which bus name to own and which file to invoke to make that happen. The service file is usually put in “/usr/share/dbus-1/services/”. In this case, also make sure that the daemon file is actually executable!
[D-BUS Service]
Name=net.launchpad.IMDBSearchLens
Exec=/path/to/daemon/file
(1) Most of the simple daemons use constants for sections and groups. If you want to set them up dynamically, take a look at how it is done in the IMDB lens.
(2) Avoiding repeated searches:
When the user enters a query in the lens, the search function is not necessarily called only once. Remember that there is no return key to indicate that the query is complete. In fact, if the user presses the return key, it selects (and opens) the first result. The way Unity handles this scenario is by batching the calls. Let's say the user wants to search for “blah blah2”. If the user types the query continuously with minimal pauses, then the entire query will be sent to the lens in one shot. On the other hand, if the user is a slow typist and pauses for a second or so after entering “blah b”, Unity will call the search function with the partial query. When the user completes the query, the lens will be called with the entire query.
This behavior complicates your lens implementation, especially if the processing is time consuming. There are different techniques to avoid running calculations repeatedly (the IMDB lens uses the first two):
(a) Have a cutoff: Your code might decide not to search if the length of the search string is less than, say, 4 characters.
(b) Have a cache with timeout: You can use a local cache that stores your previous results. When you get a new query, check your cache first and return the results from it if present. For added benefit, give the cache a timeout that removes elements added long ago.
(c) Memoize using functools: A very cool technique used in other lenses is to memoize the function call. This works if the function is reasonably simple – the key is the function's arguments and the value is its return value. One simple example is shown below. In the example, we define a decorator called memoize_infinitely and apply it to the function foo, which accepts an input and returns an output. The decorator automagically creates a cache for each function. When the function is called with some argument, the cache is checked first. If the argument is found there, the cached result is returned without invoking the function. Else, the function is called and the result is stored in the cache. This technique is widely used. For a specific example, take a look at the AskUbuntu lens [3].
import functools

def memoize_infinitely(function):
    cache = {}
    @functools.wraps(function)
    def wrapped(*args):
        if not args in cache:
            cache[args] = function(*args)
        return cache[args]
    return wrapped

@memoize_infinitely
def foo(input):
    return input + 1
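The cache-with-timeout idea from (b) can be sketched just as briefly. This is my own minimal illustration, not code from the IMDB lens; the class name and the 30-second default lifetime are arbitrary choices:

```python
import time

class TimeoutCache(object):
    """Stores query -> results pairs and forgets entries after a lifetime."""
    def __init__(self, lifetime_seconds=30):
        self.lifetime = lifetime_seconds
        self._store = {}   # query -> (timestamp, results)

    def get(self, query):
        entry = self._store.get(query)
        if entry is None:
            return None
        timestamp, results = entry
        if time.time() - timestamp > self.lifetime:
            # Entry is too old - evict it and report a miss
            del self._store[query]
            return None
        return results

    def put(self, query, results):
        self._store[query] = (time.time(), results)
```

In the search callback, you would check cache.get(search) before running the expensive query and call cache.put(search, results) afterwards.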
(3) Handling long computations:
If your lens returns a lot of results or the processing for each result entry takes a lot of time, then it is a good idea to display the information as it arrives. The results model has a function called flush_revision_queue which sends the data accumulated so far to Unity. For example, in the case of the IMDB lens, fetching the genre information is expensive, so the code flushes the results every time five movies are processed. This allows the user to see some results without waiting for eternity for the complete set.
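To make the periodic flushing concrete, here is a sketch of the pattern. This is my own illustration – the real lens works against a Dee results model whose flush_revision_queue() pushes pending rows to Unity; here model is assumed to be any object with append() and flush_revision_queue() methods:

```python
FLUSH_EVERY = 5   # illustrative batch size, mirroring the five-movie batches

def add_results_incrementally(model, items):
    count = 0
    for item in items:
        model.append(item)
        count += 1
        if count % FLUSH_EVERY == 0:
            # Push whatever we have so far to Unity so the user sees
            # partial results instead of waiting for the full set
            model.flush_revision_queue()
    # Push any leftover results at the end
    model.flush_revision_queue()
```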
[1] The code for the IMDB lens, which I have tried to comment as much as possible, is on my GitHub page. For convenience, I have also linked the individual files: Readme, imdb-search.py, imdbSearch.place.
[2] Unity Places – now with 100% More Python: This blog post contains a brief discussion on developing Unity lenses in Python. It has some info on installation and links to a sample lens in Python. I would advise you to start with the sample lens code given in this blog post before making drastic changes. This way you can make sure that the initial lens setup is fine and the issue is really in your code.
[3] Links to some good Unity Python lens code: As of now there are a few other good Unity lenses written in Python. Here are some of my favorites – AskUbuntu lens, Launchpad lens, Evernote lens, Book-Search lens, Music search lens, Spotify lens. I have tried to incorporate all their cool features into the IMDB Search lens, so that it becomes a one-stop place for the most important lens features.
[4] Ubuntu Unity Architecture: This contains a language-independent discussion of the Unity architecture and additional information on the data structures it uses.
[5] IMDbPy: The IMDB search lens was developed using the excellent IMDbPy library. Its very convenient API makes developing IMDB-based applications a snap.
[6] Ubuntu Unity Lens Python API links: Here are some documentation links that I found – Dee API, Unity API. The Unity documentation also encompasses the documentation for Places.
1. Validate regular expressions when they are entered in blockset.
2. More robust error handling inside the extension.
3. Always allow access to Chrome extension links even if you block *.*
4. Disallow *.* in the whitelist.
5. Do not strip www/http from the blocked site. So a block URL of www.ft.com will no longer block www.microsoft.com unless the block URL is just ft.com.
6. Multiple internal changes that will allow implementation of other Chrome Nanny features faster.
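As an illustration of why stripping www caused false positives: with raw substring matching, “ft.com” occurs inside “www.microsoft.com”. Below is a sketch of a stricter rule that matches whole hostname labels – my own Python illustration, since the actual extension is written in JavaScript and may use a different rule (under this sketch even a bare ft.com block would not catch www.microsoft.com):

```python
def is_blocked(hostname, blocked_site):
    # Block only exact matches or true subdomains of the blocked site,
    # never accidental substring hits like "ft.com" inside "microsoft.com"
    return hostname == blocked_site or hostname.endswith("." + blocked_site)
```

With this rule, is_blocked("www.microsoft.com", "ft.com") is False even though "ft.com" is a substring of the hostname.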
Let me know how this version works !
One interesting way of looking at the inequalities is from an adversarial perspective. The adversary has given you some limited information and you are expected to come up with some bound on the probability of an event. For example, in the case of Markov inequality, all you know is that the random variable is non negative and its (finite) expected value. Based on this information alone, Markov inequality allows you to provide a bound on the tail probability. Similarly, in the case of Chebyshev inequality, you know that the random variable has a finite expected value and variance. Armed with this information, Chebyshev inequality allows you to provide a bound on the tail probability. The most fascinating thing about these inequalities is that you do not have to know the probability mass function (pmf). For any arbitrary pmf satisfying some mild conditions, Markov and Chebyshev inequalities allow you to make intelligent guesses about the tail probability.
Another way of looking at these inequalities is this. Suppose we do not know anything about the pmf of a random variable and we are forced to make some prediction about the value it takes. If the expected value is known, a reasonable strategy is to use it. But then the actual value might deviate from our prediction. Markov and Chebyshev inequalities are very useful tools that allow us to estimate how likely it is that the actual value deviates from our prediction. For example, we can use Markov inequality to bound the probability that the actual value exceeds some multiple of the expected value. Similarly, using Chebyshev we can bound the probability that the deviation from the mean exceeds some multiple of the standard deviation.
One thing to notice is that you really do not need the pmf of the random variable to bound the probability of these deviations. Both inequalities allow you to make deterministic statements about probabilistic bounds without knowing much about the pmf.
Let us first take a look at the Markov inequality. Even though the statement looks very simple, clever application of the inequality is at the heart of more powerful inequalities like Chebyshev or Chernoff. Initially, we will see the simplest version of the inequality and then discuss the more general version. The basic Markov inequality states that given a random variable X that can only take non negative values, then for any constant k >= 1,
P(X >= k E[X]) <= 1/k
There are some basic things to note here. First, the term P(X >= k E[X]) is the probability that the random variable takes a value at least k times the expected value. A term of the form P(X >= a) is related to the cumulative distribution function as 1 – P(X < a). Since the variable is non negative, this bounds the deviation on one side of the mean only.
Intuitively, what this means is that, given a non negative random variable and its expected value E[X]:
(1) The probability that X takes a value greater than twice the expected value is at most half. In other words, if you consider the pmf curve, the area under the curve for values beyond 2*E[X] is at most half.
(2) The probability that X takes a value greater than thrice the expected value is at most one third.
and so on.
Let us see why that makes sense. Let X be a random variable corresponding to the scores of 100 students in an exam. The variable is clearly non negative as the lowest score is 0. Tentatively, let's assume the highest value is 100 (even though we will not need it). Let us see how we can derive the bounds given by Markov inequality in this scenario. Let us also assume that the average score is 20 (must be a lousy class!). By definition, we know that the combined score of all students is 2000 (20*100).
Let us take the first claim – the probability that X takes a value greater than twice the expected value is at most half. In this example, it means the fraction of students with a score greater than 40 (2*20) is at most 0.5. In other words, at most 50 students could have scored 40 or more. It is clear that this must be the case. If 50 students got exactly 40 and the remaining students all got 0, then the average of the whole class is 20. Now, if even one additional student got a score of 40 instead of 0, the total score of the 100 students becomes 2040 and the average becomes 20.4, which contradicts our original information. Note that assuming the other students scored 0 is an over simplification and we can do without it. For example, we can argue that if 50 students got 40, then the total score is at least 2000 and hence the mean is at least 20 – so any additional high scorer pushes the mean above 20.
We can also see how the second claim is true – the probability that X takes a value greater than thrice the expected value is at most one third. If a third of the students (33.3) got 60 and the others got 0, then the total score is around 2000 and the average remains the same. Similarly, regardless of the scores of the remaining 66.6 students, we know that the mean is at least 20.
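The arithmetic behind these two claims can be checked in a couple of lines. This is my own illustration, using the same numbers as the text:

```python
n_students = 100
mean_score = 20
total = n_students * mean_score      # combined score of the class: 2000

# First claim: if s students score >= 2*mean = 40, then 40*s <= 2000,
# so s <= 50 -- half the class, matching the Markov bound 1/2.
max_at_twice = total // (2 * mean_score)

# Second claim: if s students score >= 3*mean = 60, then 60*s <= 2000,
# so s <= 33 -- a third of the class, matching the Markov bound 1/3.
max_at_thrice = total // (3 * mean_score)
```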
This should also make it clear why the variable must be non negative. If some of the values are negative, then we cannot claim that the mean is at least some constant C. The values that do not exceed the threshold may well be negative and can pull the mean below the estimated value.
Let us look at it from the other perspective. Let p be the fraction of students who have a score of at least a. Then it is clear that the mean is at least a*p. What Markov inequality does is turn this around: if the mean is a*p, then the fraction of students with a score of at least a is at most p. That is, we know the mean and use the threshold to estimate the fraction.
The probability that the random variable takes a value greater than k*E[X] is at most 1/k; the fraction 1/k acts as some kind of a limit. Taking this further, you can observe that given an arbitrary constant a, the probability that the random variable X takes a value of at least a, i.e. P(X >= a), is at most 1/a times the expected value. This gives the general version of Markov inequality:
P(X >= a) <= (1/a) E[X]
In the equation above, I separated the fraction 1/a because that is the only varying part. We will later see that for Chebyshev we get a similar fraction. The proof of this inequality is straightforward. There are multiple proofs, though we will use the following one as it allows us to show Markov inequality graphically. This proof is partly taken from Mitzenmacher and Upfal’s exceptional book on Randomized Algorithms.
Consider a constant a >= 0. Then define an indicator random variable I which takes the value 1 if X >= a, i.e.
I = 1 if X >= a, and I = 0 otherwise.
Now we make a clever observation. We know that X is non negative, i.e. X >= 0. This means the fraction X/a is at least 0. Also, if X < a, then X/a < 1; and when X >= a, X/a >= 1. Using these facts, I <= X/a.
If we take expectations on both sides, we get E[I] <= E[X]/a.
But we also know that the expectation of an indicator random variable is the probability that it takes the value 1. This means E[I] = P(X >= a). Putting it all together, we get the Markov inequality: P(X >= a) <= E[X]/a.
Sometimes, it might happen that the random variable is not non negative. In cases like this, a clever hack helps. Design a function f(x) such that f(x) is non negative. Then we can apply Markov inequality to the modified random variable f(X). The Markov inequality for this case is:
P(f(X) >= a) <= E[f(X)]/a
This is a very powerful technique. Careful selection of f(X) allows you to derive more powerful bounds.
(1) One of the simplest examples is f(X) = |X|, which guarantees f(X) is non negative.
(2) Later we will show that Chebyshev inequality is nothing but Markov inequality with f(X) = (X – E[X])^2.
(3) Under some additional constraints, Chernoff inequality uses f(X) = e^(tX).
Let us consider a simple example where it provides a decent bound and one where it does not. A typical example where Markov inequality works well is when the expected value is small but the threshold to test is very large.
Example 1:
Consider a coin that comes up heads with probability 0.2. Let us toss it n times. We can use Markov inequality to bound the probability that we got at least 80% heads.
Let X be the random variable indicating the number of heads we got in n tosses. Clearly, X is non negative. Using linearity of expectation, we know that E[X] is 0.2n. We want to bound the probability P(X >= 0.8n). Using Markov inequality, we get
P(X >= 0.8n) <= E[X]/(0.8n) = 0.2n/0.8n = 1/4
Of course we can estimate a much finer value using the binomial distribution, but the core idea here is that we do not need to know it!
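We can check how loose the bound is for a concrete n. This is my own numeric illustration; n = 20 is an arbitrary choice:

```python
from math import comb

n, p = 20, 0.2
threshold = 16                # 0.8 * n heads

# Markov bound: E[X]/(0.8n) = 0.2n/0.8n = 1/4, independent of n
markov_bound = (p * n) / (0.8 * n)

# Exact tail probability from the binomial distribution
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(threshold, n + 1))
```

Here exact comes out on the order of 1e-8, so the Markov bound of 0.25 is valid but extremely loose.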
Example 2:
For an example where Markov inequality gives a bad result, let us take a die. Let X be the face that shows up when we roll it. We know that E[X] = 7/2 = 3.5. Now let's say we want to bound the probability that X >= 5. By Markov inequality,
P(X >= 5) <= E[X]/5 = 3.5/5 = 0.7
The actual answer of course is 2/6, so the bound is quite off. This becomes even more bizarre if, for example, we bound P(X >= 3). By Markov inequality,
P(X >= 3) <= E[X]/3 = 3.5/3 ≈ 1.17
The upper bound is greater than 1! Of course, using the axioms of probability, we can cap it at 1, while the actual probability is closer to 0.66. You can play around with the coin example or the score example to find cases where Markov inequality provides really weak results.
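A two-line numeric check of the die numbers (my own illustration):

```python
faces = [1, 2, 3, 4, 5, 6]
mean = sum(faces) / 6.0                               # E[X] = 3.5

markov_bound = mean / 5                               # 0.7
actual = sum(1 for f in faces if f >= 5) / 6.0        # 2/6, far below the bound
```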
The last example might have made you think that the Markov inequality is useless. On the contrary, it provided a weak bound because the amount of information we gave it was limited. All we provided were that the variable is non negative and that the expected value is known and finite. In this section, we will show that the inequality is in fact tight – that is, Markov inequality is already doing as much as it can with this information.
From the previous example, we can see a case where Markov inequality is tight. If the mean of 100 students is 20 and 50 students got a score of exactly 0, then Markov implies that at most 50 students can have a score of at least 40.
Note: I am not 100% sure if the following argument is fully valid – but at least it seems reasonable to me.
Consider a random variable X such that
X = k with probability 1/k, and X = 0 with probability 1 – 1/k.
We can estimate its expected value as
E[X] = k * (1/k) + 0 * (1 – 1/k) = 1
We can see that
P(X >= k E[X]) = P(X >= k) = 1/k
This implies that the bound is actually tight! Of course, one of the reasons why it is tight is that the other value is 0 and the non zero value of the random variable is exactly k. This is consistent with the score example we saw above.
Chebyshev inequality is another powerful tool that we can use. In this inequality, we remove the restriction that the random variable has to be non negative. As a price, we now need additional information about the variable – its (finite) expected value and (finite) variance. In contrast to Markov, Chebyshev allows you to estimate the deviation of the random variable from its mean. A common use is to estimate the probability of deviation from the mean in terms of the standard deviation.
Similar to Markov inequality, we can state two variants of Chebyshev. Let us first take a look at the simplest version. Given a random variable X with finite mean and variance, and standard deviation sigma, we can bound the deviation as
P(|X – E[X]| >= k sigma) <= 1/k^2
There are a few interesting things to observe here:
(1) In contrast to Markov inequality, Chebyshev inequality allows you to bound the deviation on both sides of the mean.
(2) The deviation of length k sigma is counted on both sides of the mean, which usually (but not always) makes for a stronger statement than Markov's bound at the threshold k E[X]. Similarly, the fraction 1/k^2 is much tighter than the 1/k we got from Markov inequality.
(3) Intuitively, if the variance of X is small, then Chebyshev inequality tells us that X is close to its expected value with high probability.
(4) Using Chebyshev inequality, we can claim that at most one fourth of the values that X can take are beyond 2 standard deviations of the mean.
A more general Chebyshev inequality bounds the deviation from the mean by any constant a. Given a positive constant a,
P(|X – E[X]| >= a) <= Var(X)/a^2
The proof of this inequality is straightforward and comes from a clever application of Markov inequality. As discussed above, we select f(X) = (X – E[X])^2. Using it we get,
P(|X – E[X]| >= a) = P((X – E[X])^2 >= a^2) <= E[(X – E[X])^2]/a^2 = Var(X)/a^2
We used the Markov inequality in the second step and the fact that E[(X – E[X])^2] = Var(X).
It is important to notice that Chebyshev provides a bound on both sides of the mean. One common mistake when applying Chebyshev is to divide the resulting probabilistic bound by 2 to get a one sided bound. This is valid only if the distribution is symmetric; otherwise it gives incorrect results. You can refer to Wikipedia for one sided Chebyshev inequalities.
One of the neat applications of Chebyshev inequality is to use it for higher moments. As you would have observed, Markov inequality uses only the first moment, while Chebyshev inequality uses the second moment (and the first). We can adapt the proof above to get a Chebyshev inequality for higher moments. In this post, I will give a simple argument for even moments only. For the general argument (odd and even), look at this Math Overflow post.
The proof of Chebyshev for higher moments is almost exactly the same as the one above. The only observation we need is that (X – E[X])^(2k) is always non negative for any k. The next observation is that E[(X – E[X])^(2k)] gives the (2k)-th central moment. Using the statement from Mitzenmacher and Upfal’s book we get,
P(|X – E[X]| >= a) <= E[(X – E[X])^(2k)]/a^(2k)
It should be intuitive that the more information we have, the tighter the bound. For Markov we got 1/a as the fraction; it was 1/a^2 for second order Chebyshev and 1/a^(2k) for the (2k)-th order Chebyshev inequality.
Using Chebyshev inequality, we previously claimed that at most one fourth of the values that X can take are beyond 2 standard deviations of the mean. It is possible to turn this statement around to get a confidence interval.
If at most 25% of the population is beyond 2 standard deviations away from the mean, then we can be confident that at least 75% of the population lies in the interval [E[X] – 2 sigma, E[X] + 2 sigma]. More generally, at least a (1 – 1/k^2) fraction of the population lies in the interval [E[X] – k sigma, E[X] + k sigma]. We can similarly derive that about 94% (exactly 93.75%) of the population lies within 4 standard deviations of the mean.
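The confidence levels quoted above follow directly from the 1 – 1/k^2 formula; a tiny helper makes this concrete (the function name is my own):

```python
def chebyshev_confidence(k):
    # Fraction of any population guaranteed by Chebyshev inequality
    # to lie within k standard deviations of the mean
    return 1.0 - 1.0 / (k * k)
```

chebyshev_confidence(2) gives 0.75 and chebyshev_confidence(4) gives 0.9375, matching the 75% and roughly 94% figures in the text.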
We previously saw two applications of Chebyshev inequality – one to get tighter bounds using higher moments without using complex inequalities, the other to estimate confidence intervals. There are some other cool applications that we will state without proof. For proofs, refer to the Wikipedia entry on Chebyshev inequality.
(1) Using Chebyshev inequality, we can prove that the median is at most one standard deviation away from the mean.
(2) Chebyshev inequality also provides the simplest proof of the weak law of large numbers.
Similar to Markov inequality, we can prove the tightness of Chebyshev inequality. I had fun deriving this proof and hopefully someone will find it useful. Define a random variable X as,
[Note: I could not make the case statement work in WordPress LaTeX and hence the crude work around]
X = { +C with probability p
    { –C with probability p
    {  0 with probability 1 – 2p
Its mean is 0 and its variance is Var(X) = 2pC^2. If we want the probability that the variable deviates from its mean by the constant C, the bound provided by Chebyshev is,
P(|X – E[X]| >= C) <= Var(X)/C^2 = 2pC^2/C^2 = 2p
which equals the true probability P(X = +C) + P(X = –C) = 2p, so the bound is tight!
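We can also verify the tightness numerically for concrete values of p and C (the particular numbers are my own choice):

```python
p, C = 0.1, 5.0

# E[X] = C*p + (-C)*p + 0*(1 - 2p) = 0
mean = C * p + (-C) * p

# Var(X) = E[X^2] since the mean is 0
variance = p * C**2 + p * C**2                # 2*p*C^2

chebyshev_bound = variance / C**2             # 2p
actual = p + p                                # P(X = +C) + P(X = -C)
```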
Markov and Chebyshev inequalities are two of the simplest, yet very powerful, inequalities. Clever applications of them provide very useful bounds without knowing anything about the distribution of the random variable. Markov inequality bounds the probability that a non negative random variable exceeds any multiple of its expected value (or any constant). Chebyshev inequality, on the other hand, bounds the probability that a random variable deviates from its expected value by any multiple of its standard deviation. Chebyshev does not require the variable to be non negative, but needs additional information to provide a tighter bound. Both Markov and Chebyshev inequalities are tight – meaning that, with the information provided, they give the best bounds possible.
Hope this post was useful! Let me know if there is any insight I missed!
(1) Probability and Computing by Mitzenmacher and Upfal.
(2) An interactive lesson plan on Markov’s inequality – An extremely good discussion on how to teach Markov inequality to students.
(3) This lecture note from Stanford – Treats the inequalities from a prediction perspective.
(4) Found this interesting link from Berkeley recently.