Feeds:
Posts

## Tutorial on Sampling Theory for Approximate Query Processing

### Prelude

This blog post is based on the lecture notes I prepared for one of the courses in which I am the teaching assistant. Hopefully, this notes will be useful for anyone interested in either Approximate Query Processing (AQP) or basic Sampling theory as applied in Databases.

### Approximate Query Processing

Consider a database D which contains billions of tuples. Such a large database is not uncommon in data warehousing. When we want to get some answers from the database, we construct a SQL query to get the results. Due to the large size of the database, any query should take quite a bit of time. This is regardless of the use of techniques like indexing which can speed up the processing time but does not really reduce the time asymptotically.

One observation is that the queries that are posed to the database returns very precise and accurate answers – probably after looking at each and every tuple. For a lot of use cases of OLAP we may not need such a precise answer. Consider some sample queries like – what is the ratio of male to female in UTA ? What percentage of US people live in Texas? What is the average salary of all employees in the company? and so on.

Notice that for such queries we really do not need answers that are correct to the last decimal point. Also notice that each of those query is an aggregate over some column. Approximate query processing is a viable technique to use in these cases. A slightly less accurate result but which is computed instantly is desirable in these cases. This is because most analysts are performing exploratory operation on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most of the use cases.

There are multiple techniques to perform approximate query processing. The two most popular involve histograms and sampling. Histograms store some statistic about the database and then use it to answer queries approximately. Sampling creates a very small subset of the database and then uses this smaller database to answer the query. In this course we will focus on AQP techniques that use sampling. Sampling is a very well studied problem backed up by a rich theory that can guide us in selecting the subset so that we can provide reasonably accurate results and also provide statistical error bounds.

### Introduction to Sampling

The idea behind sampling is very simple. We want to estimate some specific characteristic of a population. For eg this might be the fraction of people who support some presidential candidate or the fraction of people who work in a specific field or fraction of people infected with a disease and so on. The naive strategy is to evaluate the entire population. Most of the time , this is infeasible due to constraints on time, cost or other factors.

An alternate approach that is usually used is to pick a subset of people . This subset is usually called a sample. The size of the sample is usually an order of magnitude smaller than the population. We then use the sample to perform our desired evaluation. Once we get some result, we can use this to estimate the characteristic for the entire population. Sampling theory helps, among other things, on how to select the subset ,  what is the size of population, how to extrapolate the result from sample to original population , how to estimate the confidence interval of our prediction etc.

The process of randomly picking a subset of the population and using it to perform some estimation should appear very strange. On a first glance it might look that we might get wildly inaccurate results. We will later see how to give statistical guarantees over our prediction. Sampling is a very powerful and popular technique. More and more problems in the real world are being solved using sampling. Lots of recent problems in data mining and machine learning essentially use sampling and randomization to approximately solve very complex problems which are not at all solvable otherwise. This is for this reason that most DBMS provide sampling as a fundamental operator. (Eg Oracle provides a SAMPLE operator in select and also dbms_stats package. SQL Server provides TABLESAMPLE operator and so on).

We represent the population with P and the sample with S. N represents the size of population and n represents the size of the sample. We will use these letters to denote statistics on the population and sample. For eg, $\mu_P$ represents the mean of the population and $\mu_S$ represents the mean of the sample. Similarly, $\sigma_P$ and $\sigma_S$ represent the standard deviation of the population and sample respectively.

### Types of Sampling

There are different ways to perform sampling. The ones that we will use most in AQP are :

### Sampling With/Without Replacement

The aim of sampling is to get a subset (sample) from a larger population. There are two ways to go about it. In the first approach, we randomly pick some entity, perform measurements if any and add it to the sample. We then replace the entity back to the original population and repeat the experiment. This means that same entity can come in the sample multiple times. This approach is called Sampling with replacement. This is the simplest approach to sampling. There is no additional overhead to check if an entity is already in sample or not. Typically, sampling with replacement is modeled with binomial distribution.

In the second approach, we explicitly make sure that an entity does not appear in the sample more than once. So we randomly pick an entity from the population, verify it is not already in the sample, perform measurement and so on. Alternatively, we can remove entities that were added to sample from the population. This approach is called Sampling without replacement and is usually modeled with an hypergeometric distribution.

### Bernoulli Sampling

Bernoulli Trial : A Bernoulli Trial is an experiment which has exactly two outcomes – Success or failure. Each of these outcomes has an associated probability that completely determines the trial. For eg consider a coin which produces head with probability 0.6 and tail with probability 0.4 . This constitutes a bernoulli trial as there are exactly two outcomes.

In Bernoulli Sampling, we perform a Bernoulli trial on each tuple in the database. Each tuple can be selected into a sample with uniform probability. If the trial is a success, the tuple is added to the sample. Else it is not. The most important thing to notice is that all the tuples have exactly the sample probability of getting into the sample. Alternatively, the success probability for Bernoulli trial remains the same for each tuple. Pseudo code for Bernoulli Sampling is given below :

success probability SP = $\frac{n}{N}$
for i = 1 to N
Perform a Bernoulli trial with success probability SP
If outcome is success add i-th tuple to sample

It is important to note that Bernoulli sampling falls under Sampling without replacement. Size of the sample follows a binomial distribution with parameters $B(N,\frac{n}{N})$. ie it can vary between 0 and N-1 with the expected size of sample as n.

### Uniform Random Sampling

In Uniform Random Sampling, we pick each tuple in the database with a constant probability. This means that the probability of any tuple entering the sample is constant. Typically, this is implemented as sampling with replacement. Pseudo code for Uniform Random Sampling is given below :

1. Generate a set of n random numbers S between 1 and N.
2. Select tuples with index in S and add it to sample.

Note that in this case we have exactly n tuples in the sample. We can also notice that sample tuple might appear multiple times. The number of times a tuple appears in the sample forms a binomial distribution with parameters $B(n,\frac{1}{n})$.

### Weighted Random Sampling

In Weighted Random Sampling, we perform a Bernoulli trial on each tuple in the database. The difference with Uniform random sampling that the success probability for each Bernoulli trial varies. In other words, each tuple has a different probability of getting into the sample.

### Stratified Sampling

If the population can be subdivided into sub population that are distinct and non overlapping , stratified sampling can be used. In Stratified sampling, we split the population into a bunch of strata and then form sampling over each strata independently. For eg a political poll can be stratified on gender, race , state etc.

There are multiple advantages in using stratified sampling. For one, this allows the convenience to use different sampling techniques over each strata. If there is some specific strata that might be under represented in a general random sampling, we can easily provide additional weights for the samples taken from that. It is also possible to vary the number of samples from a strata to minimize the error. For eg, we can take less number of samples from a strata with low variance while preserving them for strata with high variance.

Stratified sampling is not a panacea because getting a good stratification strategy may not be obvious. In lot of population, the best feature to stratify is not obvious. Even worse, the population may not contain subgroups that are homogeneous and non overlapping.

Given a population , sample size and strata there are many ways to allocate the sample across different strata. The two commonly used strategies are :

1. Proportional Allocation : In this approach the contribution of each strata to the sample is proportional to the size of the strata. So a strata that accounts to 10\% of the population will use 10\% of the sample. Note that this approach does not use any information about a sub population other than its size.
2. Variance based Allocation : This strategy allocates samples in proportion to their variance. So a strata with high variance will have higher representation than one with smaller variance. This is logical as we need few samples to accurately estimate the parameters of a sub population when its variance is low. Any additional samples do not add much additional information or reduce the final estimation dramatically.

### Reservoir Sampling

Reservoir Sampling is an algorithm that is widely used to take n random samples from a large database of size N. The real utility of reservoir sampling is realized when N is a large number or is not really known at sample time. This scenario is quite common when the input is a streaming data or when the database is frequently updated . Running the simple uniform random sampling algorithm (say Bernoulli sampling) is inefficient as N is large or the old tuples may be purged (or goes out of Sliding Window). Reservoir sampling allows you to get the random sample in a linear pass such that you only inspect any tuple at most once.

### Reservoir Sampling with Single Sample

Consider the following contrived example. We have a database which is constantly updated and we want to have a single random tuple from it.

The base case occurs when there is only one tuple in the database. In this case our random sample is the first tuple. Hence the sample  S = t1.

Lets say the database is updated and a new tuple  t2 is added. The naive approach is to restart the entire sampling process. Instead, in reservoir sampling, we accept the new tuple as the random sample with probability $\frac{1}{2}$. ie toss a coin which returns head with probability 0.5 and if it returns head , then replace t1 with t2.

We can see why S is a uniform sample. The probability that S contains t1 or t2 remains the same.

1. $Pr(S=t1) = 1 * \frac{1}{2} = \frac{1}{2}$ . The random sample is t1 when it was selected first into S (with probability 1) and then not rejected by t2 with probability $1-\frac{1}{2}$.
2. $Pr(S=t2) = \frac{1}{2}$ . The random sample is t2 when it replaces  t1 in the second step. This occurs with probability $\frac{1}{2}$

The database is updated and lets assume a new tuple t3 is added. We accept the new tuple as the random sample with probability $\frac{1}{3}$. ie toss a coin which returns head with probability 0.33 and if it returns head , then replace the previous value of S (t1 or t2) with t3. More generally when inspecting the i-th tuple, accept it with probability $\frac{1}{i}$.

It might look as if we are treating t3 unfairly because we only accept it with probability 0.33. But we can show probabilistically that the sample is still uniform. The probability that S is t1 or t2 or t3 remains the same.

1. $Pr(S=t1) = 1 \times \frac{1}{2} \times \frac{2}{3}=\frac{1}{3}$ . The only scenario when random sample is still t1 occurs when it was selected first into S (with probability 1) ,not rejected by t2 with probability $1-\frac{1}{2}$ and not rejected by t3 with probability $1-\frac{1}{3} = \frac{2}{3}$.
2. $Pr(S=t2) = \frac{1}{2} \times \frac{2}{3} = \frac{1}{3}$ . The random sample is 2 when it replaces t1 in the second step. This occurs with probability $\frac{1}{2}$. Then in the next step is not replaced by t3. This occurs with probability $1-\frac{1}{3} = \frac{2}{3}$.
3. $Pr(S=t3) = \frac{1}{3}$ . The random sample is 3 when S contains either t1 or t2 and it is replaced by t3. This occurs with probability $\frac{1}{3}$.

The pseudo code looks like :

S = t1
for i = 2 to N
Replace S with tuple t_i with probability $\frac{1}{i}$

### Reservoir Sampling with Multiple Samples

A very similar approach works when the sample size is more than 1. Lets say that we need a sample of size n. Then we initially set the first n tuples of the database as the sample. The next steps is a bit different. In the previous case, there was only one sample so we replaced the sample with the selected tuple. When sample size is more than 1, then this steps splits to two parts :

1. Acceptance : For any new tuple t_i, we need to decide if this tuple enters the sample. This occurs with probability $\frac{n}{i}$.
2. Replacement : Once we decided to accept the new tuple into the sample, some tuple already in the sample needs to make way. This is usually done randomly. We randomly pick a tuple in the sample and replace it with tuple t_i.

The pseudo code looks like :

Store first n elements into S
for i = n+1 to N
Accept tuple t_i with probability $\frac{n}{i}$
If accepted, replace a random tuple in S with tuple t_i

A coding trick that avoids the "coin tossing" by generating a random index and then accepts it if it is less than our sample size. The pseudo code looks like :

Store first n elements into S
for i = n+1 to N
randIndex = random number between 1 and i
if randIndex <= n
replace tuple at index "randIndex" in the sample with tuple t_i

We can similarly analyze that the classical reservoir sampling does provide a uniform random sample. Please refer to the paper  Random Sampling with a Reservoir by Jeffrey Vitter for additional details.

### Sampling in AQP

As discussed above, our main aim is to discuss how sampling techniques is used in AQP. Let us assume that we have a sample S of size n. The usual strategy that we will follow is to apply any aggregate query on the sample S instead of database D. We then use the result of the query from S to estimate the result for D.

One thing to note is that we will only use aggregate queries for approximate processing. Specifically we will look at COUNT, SUM and AVERAGE. The formulas for estimating the values of the aggregate query for the entire database from the sample for these 3 operators is well studied. For additional details refer to the paper “Random Sampling from Databases" by Frank Olken.

Uniform Random Sample

1. Count : $\frac{\sum_{i=1}^{n} T_i p_i}{n} = \frac{\sum_{i=1}^{n} T_i \frac{1}{N}}{n}$ where $T_i$ is an indicator random variable that is 1 when tuple t_i satisfies our clause. $p_i$ is the probability that tuple will be selected into the sample. Intuitively, the formula finds the fraction of tuples in Sample which satisfied the query and applies the same fraction to the entire database.
2. Sum : $\frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n} = \frac{\sum_{i=1}^{n} x_i N}{n}$
3. Average : $\frac{Sum}{Count}$
\end{itemize}

Weighted Random Sample

1. Count : $\frac{\sum_{i=1}^{n} T_i p_i}{n}$ where $T_i$ is an indicator random variable that is 1 when tuple t_i satisfies our clause. $p_i$ is the probability that tuple will be selected into the sample. Intuitively, the formula reweighs each tuple according to the selection probability of the tuple.
2. Sum : $\frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n}$
3. Average : $\frac{Sum}{Count}$

### Probability/Statistics Refresher

Here are few commonly used equations related to Expectation and variance.

1. E[X] = $\sum_{i=1}^{n} x_i p_i$
2. E[aX+b] = aE[X] + b
3. E[X+Y] = E[X] + E[Y] (also called Linearity of Expectations)
4. Var[X] = $E[ (X - E[X] ) ^2]$
5. Var[X+a] = Var[X]
6. Var[aX] = $a^2$ Var[X]
7. Var[X+Y] = Var[X] + Var[Y] if X and Y are independent.

Law of Large Numbers : This law is one of the fundamental laws in probability. Let $X_1,X_2 \ldots , X_n$ be random variables drawn iid. Very informally, as n increases, the average of the variables approaches the expected value of the distribution from which the variables are drawn. For eg, if we have a coin which provides head with probability 0.5 and toss it 10000 times, the number of heads will be very close to 5000.

Binomial Distribution: Suppose you repeat a Bernoulli trial with success probability of p , n times. The distribution of the number of successes in the n trials is provided by binomial distribution B(n,p). This is a very important distribution for modeling sampling with replacement. For eg if we perform Bernoulli sampling, the final size of the sample is a Binomial distribution. Also if we randomly pick n tuples from database of size N , the number of times any tuple is repeated in the sample can also modeled by Binomial distribution. The expected value is given by $E[X]=np$ and variance is given by $np(1-p)$.

Normal Distribution : Normal distribution , aka Gaussian distribution, is one of the most important probability distributions. It is usually represented with parameters $N(\mu,\sigma^2)$. It has the familiar bell curve shape. $\mu$ determines the center of the normal curve and $\sigma^2$ determines the width of it. A smaller value results in a tighter curve while a larger value results in a more flat/wide curve.

Equations (2) and (7) above gives us some detail about how expected value and variance of sum of two random variables can be computed from the expected value/variance of the constituent random variables . We can extend by induction for the rules to hold for the sum of arbitrarily large number of random variables. This introduces us to the concept of Central Limit Theorem.

Central Limit Theorem :  This is one of the most fundamental rules in probability. Assume that we have n different random variables $X_1,X_2 \ldots , X_n$. Each of these random variables have mean $\mu$ and variance $\sigma^2$. For a large n, the distribution of sums $X = X_1 + X_2 + \ldots + X_n$ is normal with parameters $N(n \mu, n \sigma^2)$. Similarly, the distribution of the averages , $X = \frac{X_1 + X_2 + \ldots + X_n}{n}$ is normal with parameters $N(\mu, \frac{\sigma^2}{n})$.

Law of large numbers and Central limit theorem are two important results that allow analysis of the results. law of large number says that if we pick a large enough sample then the average of the sample will be closer to the true average of population. Central limit theorem states that if you repeat the sampling experiment multiple times and plot the distribution of the average value of the samples, they follow a normal distribution. Jointly, they allow you to derive the expression for Standard error.

### Standard Error

The essential idea behind sampling is to perform the experiment on a randomly chosen subset of the population instead of the original population. So far we discussed how to perform the sampling and how to estimate the value from sample to the larger population. In this section let us discuss about the error in our prediction. The concept of standard error is usually used for the same.

Lets say we want to estimate the mean of the population $\mu_P$ . We performed the sampling and found the sample mean as $\mu_S$. Since the sample mean is an unbiased estimator of the population mean we can announce that they are the same. But it need not always be the case that these two values are same. There are two common ways to analyze the error.

1. Absolute Error : $|\mu_P - \mu_S|^2$
2. Relative Error : $\frac{|\mu_P - \mu_S|^2}{\mu_P}$

The primary problem we have in estimating these error metrics is that we do not know the value of $\mu_P$. We will use an indirect way to estimate the error.

Consider the sampling process again. We picked a sample $S={X_1,X_2,\ldots,X_n}$ of random tuples from the database and computed the sample mean as $\mu_S=\frac{\sum_{i=1}^{n} X_i}{n}$. But since the tuples in S are picked randomly, the sample mean changes based on the sample. This means that the sample mean $\mu_S$ is itself a random variable.

Let the population $P={X_1,X_2,\ldots,X_N}$. Then the population mean $\mu_P = \overline{X} = \frac{\sum_{i=1}^{N} X_i}{N}$.  Let the sample be $S={Y_1,Y_2,\ldots,Y_n}$ and sample mean is $\mu_S = \frac{\sum_{i=1}^{n} Y_i}{n}$.

### Standard Error when Sample Size = 1

Consider the case where the sample S consists of only one element. Our aim is to find population mean. As per our approach, we pick a random element $Y_1$ and then announce it as the sample mean ie ${\mu}_S=Y_1$ . The error in this case is the difference between $Y_1$ and $\mu_P$. Since the sample is randomly picked , the sample mean is a random variable. Then it also implies that the error is also a random variable.

We can derive the expected value for the error as follows :
$E[err^2] = \sum_{i=1}^{N}\frac{(\overline{X}-X_i)^2}{N} = \sigma_P^2$.

We can see that the expected value of the error is nothing but the standard deviation of the population !

### Standard Error when Sample Size = 2

Consider the case where the sample S consists of exactly two elements. Our aim is to find population mean. As per our approach, we pick two random elements $Y_1,Y_2$ and then announce the sample mean as ${\mu}_S = \frac{Y_1+Y_2}{2}$. The error in this case is the difference between $\mu_S$ and $\mu_P$. Since the samples are randomly picked , the sample mean is again a random variable. Then it also implies that the error is also a random variable.

We can derive the expected value for the error as $E[err^2] = E[(\overline{X} - \overline{Y})^2] = Var[\overline{Y}]$. We can see that the sample can be any one of $N^2$ different combination of elements . The calculation for variance might be tricky if not for the independence of samples. Since the two elements were picked randomly , they two are independent and we can use that to estimate the variance easily.

$Var[\overline{Y}] = Var [\frac{Y_1 + Y_2}{2}] = Var [\frac{Y_1}{2} + \frac{Y_2}{2}] = \frac{1}{4} Var[Y_1] + \frac{1}{4} Var[Y_2] = \frac{\sigma_P^2}{4} + \frac{\sigma_P^2}{4} = \frac{\sigma_P^2}{2}$

Note that we used rules 3, 6 and 7 from above.

### Standard Error when Sample Size = n

We will not derive the formula here but we can easily show that when the sample contains n different elements the standard error is given by $E[err^2] = \frac{\sigma_P^2}{n} = \frac{Variance\;of\;population}{Sample\;size}$.

There are some interesting things to note here :

1. The expected error when using 2 samples is less than that of the expected error when we used only one sample.
2. As the size of sample increases the error decreases even faster. The rate of decrease is inversely proportional to square of size of the sample size.
3. This also formalizes the intuition that if the variance of the population is less, we need less number of samples to provide estimates with small error.
4. Usually our work will dictate the tolerable error and we can use the formula to find the appropriate n that will make standard error less than our tolerance factor.

Hope this post was useful 🙂

## Detailed Tutorial on Markov and Chebyshev Inequalities

In this post, I plan to discuss about two very simple inequalities – Markov and Chebyshev. These are topics that are covered in any elementary probability course. In this post, I plan to give some intuitive explanation about them and also try to show them in different perspectives. Also, the following discussion is closer to discrete random variables even though most of them can be extended to continuous ones.

### Inequalities from an Adversarial Perspective

One interesting way of looking at the inequalities is from an adversarial perspective. The adversary has given you some limited information and you are expected to come up with some bound on the probability of an event. For eg, in the case of Markov inequality, all you know is that the random variable is non negative and its (finite) expected value. Based on this information, Markov inequality allows you to provide some bound on the tail inequalities. Similarly, in the case of Chebyshev inequality, you know that the random variable has a finite expected value and variance. Armed with this information Chebyshev inequality allows you to provide some bound on the tail inequalities. The most fascinating this about these inequalities is that you do not have to know the probabilistic mass function(pmf). For any arbitrary pmf satisfying some mild conditions, Markov and Chebyshev inequalities allow you to make intelligent guesses about the tail probability.

### A Tail Inequality perspective

Another way of looking at these inequalities is this. Supposed we do not know anything about the pmf of a random variable and we are forced to make some prediction about the value it takes. If the expected value is known, a reasonable strategy is to use it. But then the actual value might deviate from our prediction. Markov and Chebyshev inequalities are very useful tools that allow us to estimate how likely or unlikely that the actual value varies from our prediction. For eg, we can use Markov inequality to bound the probability that the actual varies by some multiple of the expected value from the mean. Similarly, using Chebyshev we can bound the probability that the difference from mean is some multiple of its standard deviation.

One thing to notice is that you really do not need the pmf of the random variable to bound the probability of the deviations. Both these inequalities allow you to make deterministic statements of probabilistic bounds without knowing much about the pmf.

### Markov Inequality

Let us first take a look at the Markov Inequality. Even though the statement looks very simple, clever application of the inequality is at the heart of more powerful inequalities like Chebyshev or Chernoff. Initially, we will see the simplest version of the inequality and then we will discuss the more general version. The basic Markov inequality states that given a random variable X that can only take non negative values, then

$Pr(X \geq k E[X]) \leq \frac{1}{k}$

There are some basic things to note here. First the term P(X >= k E(X)) estimates the probability that the random variable will take the value that exceeds k times the expected value. The term P(X >= E(X)) is related to the cumulative density function as 1 – P(X < E(X)). Since the variable is non negative, this estimates the deviation on one side of the error.

### Intuitive Explanation of Markov Inequality

Intuitively, what this means is that , given a non negative random variable and its expected value E(X)
(1) The probability that X takes a value that is greater than twice the expected value is atmost half. In other words, if you consider the pmf curve, the area under the curve for values that are beyond 2*E(X) is atmost half.
(2) The probability that X takes a value that is greater than thrice the expected value is atmost one third.
and so on.

Let us see why that makes sense. Let X be a random variable corresponding to the scores of 100 students in an exam. The variable is clearly non negative as the lowest score is 0. Tentatively lets assume the highest value is 100 (even though we will not need it). Let us see how we can derive the bounds given by Markov inequality in this scenario. Let us also assume that the average score is 20 (must be a lousy class!). By definition, we know that the combined score of all students is 2000 (20*100).

Let us take the first claim – The probability that X takes a value that is greater than twice the expected value is atmost half. In this example, it means the fraction of students who have score greater than 40 (2*20) is atmost 0.5. In other words atmost 50 students could have scored 40 or more. It is very clear that it must be the case. If 50 students got exactly 40 and the remaining students all got 0, then the average of the whole class is 20. Now , if even one additional student got a score greater than 40, then the total score of 100 students become 2040 and the average becomes 20.4 which is a contradiction to our original information. Note that the scores of other students that we assumed to be 0 is an over simplification and we can do without that. For eg, we can argue that if 50 students got 40 then the total score is atleast 2000 and hence the mean is atleast 20.

We can also see how the second claim is true. The probability that X takes a value that is greater than thrice the expected value is atmost one third. If 33.3 students got 60 and others got 0 , then we get the total score as around 2000 and the average remains the same. Similarly, regardless of the scores of other 66.6 students, we know that the mean is atleast 20 now.

This also must have made clear why the variable must be non negative. If some of the values are negative, then we cannot claim that mean is atleast some constant C. The values that do not exceed the threshold may well be negative and hence can pull the mean below the estimated value.

Let us look at it from the other perspective : Let p be the fraction of students who have a score of atleast a . Then it is very clear to us that the mean is atleast a*p. What Markov inequality does is to turn this around. It says, if the mean is a*p  then the fraction of students with a score greater than a is atmost p. That is, we know the mean here and hence use the threshold to estimate the fraction .

### Generalized Markov Inequality

The probability that the random variable takes a value thats greater than k*E(X) is at most 1/k. The fraction 1/k act as some kind of a limit. Taking this further, you can observe that given an arbitrary constant a, the probability that the random variable X takes a value >= a ie P(X >= a) is atmost 1/a times the expected value. This gives the general version of Markov inequality.

$Pr(X \geq a) \leq \frac{1}{a} E[X]$

In the equation above, I seperated the fraction 1/a because that is the only varying part. We will later see that for Chebyshev we get a similar fraction. The proof of this inequality is straightforward. There are multiple proofs even though we will use the follow proof as it allows us to show Markov inequality graphically.This proof is partly taken from Mitzenmacher and Upfal’s exceptional book on Randomized Algorithms.

Consider a constant a >= 0. Then define an indicator random variable I which takes value of 1 is X >=a . ie

$\displaystyle I = \begin{cases} 1, & \mbox{if } \mbox{ X} \geq \mbox{a} \\ 0, & \mbox{otherwise } \end{cases}$

Now we make a clever observation. We know that X is non negative. ie X >= 0. This means that the fraction X/a is atleast 0 and atmost can be infinty. Also, if X < a, then X/a < 1. When X > a, X/a > 1. Using these facts,

$I \leq \frac{X}{a}$

If we take expectation on both sides, we get

$E[I] \leq \frac{1}{a} E[X]$

But we also know that the expectation of indicator random variable is also the probability that it takes the value 1. This means E[I] = Pr(X>=a). Putting it all together, we get the Markov inequality.

$Pr(X \geq a) \leq \frac{1}{a} E[X]$

### Even more generalized Markov Inequality

Sometimes, it might happen that the random variable is not non-negative. In cases like this, a clever hack helps. Design a function f(x) such that f(x) is non negative. Then we can apply Markov inequality on the modified random variable f(X). The Markov inequality for this special case is :

$Pr(f(X) \geq a) \leq \frac{1}{a} E[f(X)]$

This is a very powerful technique. Careful selection of f(X) allows you to derive more powerful bounds.
(1) One of the simplest examples is f(X) = |X| which guarantees f(X) to be non negative.
(2) Later we will show that Chebyshev inequality is nothing but Markov inequality that uses $f(X) = |X-E(X)|^2$
(3) Under some additional constraints, Chernoff inequality uses $f(X) = e^{tX}$ .

### Simple Examples

Let us consider a simple example where it provides a decent bound and one where it does not. A typical example where Markov inequality works well is when the expected value is small but the threshold to test is very large.

Example 1:
Consider a coin that comes up with head with probability 0.2 . Let us toss it n times. Now we can use Markov inequality to bound the probability that we got atleast 80% of heads.

Let X be the random variable indicating the number of heads we got in n tosses. Clearly, X is non negative. Using linearity of expectation, we know that E[X] is 0.2n.We want to bound the probability P(X >= 0.8n). Using Markov inequality , we get

$P(X \geq 0.8n) \leq \frac{0.2n}{0.8n} = 0.25$

Of course we can estimate a finer value using the Binomial distribution, but the core idea here is that we do not need to know it !

Example 2:
For an example where Markov inequality gives a bad result, let us the example of a dice. Let X be the face that shows up when we toss it. We know that E[X] is 7/2 = 3.5. Now lets say we want to find the probability that X >= 5. By Markov inequality,

$P(X \geq 5) \leq \frac{3.5}{5} = 0.7$

The actual answer of course is 2/6 and the answer is quite off. This becomes even more bizarre , for example, if we find P(X >= 3) . By Markov inequality,

$P(X \geq 3) \leq \frac{3.5}{3} = \frac{7}{6}$

The upper bound is greater than 1 ! Of course using axioms of probability, we can set it to 1 while the actual probability is closer to 0.66 . You can play around with the coin example or the score example to find cases where Markov inequality provides really weak results.

### Tightness of Markov

The last example might have made you think that the Markov inequality is useless. On the contrary, it provided a weak bound because the amount of information we provided to it is limited. All we provided to it were that the variable is non negative and that the expected value is known and finite. In this section, we will show that it is indeed tight – that is Markov inequality is already doing as much as it can.

From the previous example, we can see an example where Markov inequality is tight. If the mean of 100 students is 20 and if 50 students got a score of exactly 0, then Markov implies that atmost 50 students can get a score of atleast 40.

Note : I am not 100% sure if the following argument is fully valid – But atleast it seems to me 🙂

Consider a random variable X such that

$X = \displaystyle \begin{cases} k & \mbox{with probability } \frac{1}{k} \\ 0 & \mbox{else} \end{cases}$

We can estimate its expected value as

$E[X] = \frac{1}{k} \times k + \frac{k-1}{k} \times 0 = 1$

We can see that , $Pr(X \geq k E[X]) = Pr(X \geq k) = \frac{1}{k}$

This implies that the bound is actually tight ! Of course one of the reasons why it was tight is that the other value is 0 and the value of the random variable is exactly k. This is consistent with the score example we saw above.

### Chebyshev Inequality

Chebyshev inequality is another powerful tool that we can use. In this inequality, we remove the restriction that the random variable has to be non negative. As a price, we now need to know additional information about the variable – (finite) expected value and (finite) variance. In contrast to Markov, Chebyshev allows you to estimate the deviation of the random variable from its mean. A common use of it estimates the probability of the deviation from its mean in terms of its standard deviation.

Similar to Markov inequality, we can state two variants of Chebyshev. Let us first take a look at the simplest version. Given a random variable X and its finite mean and variance, we can bound the deviation as

$P(|X-E[X]| \geq k \sigma ) \leq \frac{1}{k^2}$

There are few interesting things to observe here :
(1) In contrast to Markov inequality, Chebyshev inequality allows you to bound the deviation on both sides of the mean.
(2) The length of the deviation is $k \sigma$ on both sides which is usually (but not always) tighter than the bound k E[X]. Similarly, the fraction 1/k^2 is much more tighter than 1/k that we got from Markov inequality.
(3) Intuitively, if the variance of X is small, then Chebyshev inequality tells us that X is close to its expected value with high probability.
(4) Using Chebyshev inequality, we can claim that atmost one fourth of the values that X can take is beyond 2 standard deviation of the mean.

### Generalized Chebyshev Inequality

A more general Chebyshev inequality bounds the deviation from mean to any constant a . Given a positive constant a ,

$Pr(|X-E[X]| \geq a) \leq \frac{1}{a^2}\;Var[X]$

### Proof of Chebyshev Inequality

The proof of this inequality is straightforward and comes from a clever application of Markov inequality. As discussed above we select $f(x) = |X-E[X]|^2$. Using it we get ,

$Pr(|X-E[X]| \geq a) = Pr( (X-E[X])^2 \geq a^2)$
$Pr( (X-E[X])^2 \geq a^2) \leq \frac{1}{a^2} E[(X-E[X])^2]$

We used the Markov inequality in the second line and used the fact that $Var[X] = E[(X-E[X])^2]$.

### Common Pitfalls

It is important to notice that Chebyshev provides bound on both sides of the error. One common mistake to do when applying Chebyshev is to divide the resulting probabilistic bound by 2 to get one sided error. This is valid only if the distribution is symmetric. Else it will give incorrect results. You can refer Wikipedia to see one sided Chebyshev inequalities.

### Chebyshev Inequality for higher moments

One of the neat applications of Chebyshev inequality is to use it for higher moments. As you would have observed, in Markov inequality, we used only the first moment. In the Chebyshev inequality, we use the second moment (and first). We can use the proof above to adapt Chebyshev inequality for higher moments. In this post, I will give a simple argument for even moments only. For general argument (odd and even) look at this Math Overflow post.

The proof of Chebyshev for higher moments is almost exactly the same as the one above. The only observation we make is that $(X-E[X])^{2k}$ is always non negative for any k. Of course the next observation is $E[(X-E[X])^{2k}$ gives the 2k^th central moment . Using the statement from Mitzenmacher and Upfal’s book we get ,

$Pr(|X-E[X]| > t \sqrt[2k] {E[(X-E[X])^{2k}]}) \leq \frac{1}{t^{2k}}$

It should be intuitive to note that the more information we get the tighter the bound is. For Markov we got 1/t as the fraction. It was 1/a^2 for second order Chebyshev and 1/a^k for k^th order Chebyshev inequality.

### Chebyshev Inequality and Confidence Interval

Using Chebyshev inequality, we previously claimed that atmost one fourth of the values that X can take is beyond 2 standard deviation of the mean. It is possible to turn this statement around to get a confidence interval.

If atmost 25% of the population are beyond 2 standard deviations away from mean, then we can be confident that atleast 75% of the population lie in the interval $(E[X]-2 \sigma, E[X]+2 \sigma)$. More generally, we can claim that, $100 * (1-\frac{1}{k})$ percentage of the population lies in the interval $(E[X]-k. \sigma, E[X]+k \sigma)$ . We can similarly derive that 94% of the population lie within 4 standard deviations away from mean.

### Applications of Chebyshev Inequality

We previously saw two applications of Chebyshev inequality – One to get tighter bounds using higher moments without using complex inequalities. The other is to estimate confidence interval. There are some other cool applications that we will state without providing the proof. For proofs refer the Wikipedia entry on Chebyshev inequality.

(1) Using Chebyshev inequality, we can prove that the median is atmost one standard deviation away from the mean.
(2) Chebyshev inequality also provides the simplest proof for weak law of large numbers.

### Tightness of Chebyshev Inequality

Similar to Markov inequality, we can prove the tightness of Chebyshev inequality. I had fun deriving this proof and hopefully some one will find it useful. Define a random variable X as ,

[Note: I could not make the case statement work in WordPress Latex and hence the crude work around]

X = { $\mu$ + C  with probability p

{ $\mu$ – C  with probability p

{ $\mu$ with probability 1-2p

$E[X] = p(\mu +C) + p(\mu -C) + (1-2p) \mu = \mu$
$Var[X] = E[(X-\mu)^2]$
$= p (\mu+C-\mu)^2 + p (\mu-C-\mu)^2 + (1-2p)(\mu-\mu)^2$
$\Rightarrow Var[X] = 2pC^2$

If we want to find the probability that the variable deviates from mean by constant C, the bound provided by Chebyshev is ,

$Pr(|X-\mu| \geq C) \leq \frac{Var[X]}{C^2} = \frac{2pC^2}{C^2}=2p$

which is tight !

### Conclusion

Markov and Chebyshev inequalities are two of the simplest , yet very powerful inequalities. Clever application of them provide very useful bounds without knowing anything about the distribution of the random variable. Markov inequality bounds the probability that a nonnegative random variable exceeds any multiple of its expected value (or any constant). Chebyshev’s inequality , on the other hand, bounds the probability that a random variable deviates from its expected value by any multiple of its standard deviation. Chebyshev does not expect the variable to non negative but needs additional information to provide a tighter bound. Both Markov and Chebyshev inequalities are tight – This means with the information provided, the inequalities provide the most information they can provide.

Hope this post was useful ! Let me know if there is any insight I had missed !

### References

(1) Probability and Computing by Mitzenmacher and Upfal.
(2) An interactive lesson plan on Markov’s inequality – An extremely good discussion on how to teach Markov inequality to students.
(3) This lecture note from Stanford – Treats the inequalities from a prediction perspective.
(4) Found this interesting link from Berkeley recently.

## Introduction to Bayesian Decision Theory – Part 1

In this series of articles , I intend to discuss Bayesian Decision Theory and its most important basic ideas. The articles are mostly based on the classic book "Pattern Classification" by Duda,Hart and Stork. If you want the ideas in all its glory, go get the book !

As I was reading the book, I realized that in its heart , this field is a set of mostly common sense ideas validated by rigorous mathematics. So I will try to discuss the basic ideas in plain english without much mathematical rigor. Since I am still learning how to best explain these complex ideas, any comments on how to improve will be welcome !

You may ask what happened to PCA . Well, I still intend to write more on it but have not found enough time to sit and write it all. I have written a draft version of it but felt it was too technical and not very intuitive.So I am hoping to rewrite it again.

### Background

You would need to know basics of probability to understand the following. In particular the ideas of Prior , Posterior  and the idea of Likelihood  . Of course if you know all these , you will know the big idea of Bayes Theorem . I will try to explain them lightly but if you have any doubts check out Wikipedia or some old books.

We will take the example used in Duda et al. There are two types of fish : Sea Bass and Salmon. We catch a lot of fish of these two types but which are mixed together and our aim is to separate them by automation. We have a conveyor belt in which  the fishes come one by one and we need to decide if the current fish is sea bass or salmon. Of course, we want to make it as accurate as possible but also don’t want to spend lot of money on this project. This is in its heart a classification project. We will be given a few examples of both sea bass and salmon and based on it we need to infer the general characteristics using which we can distinguish them.

### Basic Probability Ideas

Now let us be slightly more formal. We say that there are two "classes" of fish – sea bass and salmon. According to our system, there is no other type of fish. If we treat it as a state machine , then our system has two states. The book uses a notation of $\omega_1 \; and \; \omega_2$ to represent them. We will use the names seabass and salmon as it is more intuitive.

The first basic idea is that of prior probability  . This is represented as $P(seabass) \; and \; P(salmon)$ which basically give the probability that the next fish in the conveyor is seabass or salmon. Of course, both of them have to sum to one. From the Bayesian perspective, this probability is usually obtained from prior (domain) knowledge. We will not talk about Frequentist interpretation as we will focus on Bayesian decision theory.

Let us assume that we use length of the fish to differentiate the fishes. So whenever the fish comes to the conveyor belt, we calculate its length (how, we don’t really care here ) . So we have transformed the fish into a simple representation using a single number, its length. So the length is a feature  that we use to classify and the step of converting  the fish into length is called feature extraction .

In a real life scenario, we will have multiple features and the input will converted to a vector. For eg we may use length , lightness of skin , fin length etc as feature. In this case , the fish will be transformed into a triplet. Converting the input to a feature vector makes further processing easy and more robust. We will usually use the letter x to represent the feature. So you can consider $P(x)$ is the probability of evidence. Eg lets say we got a fish (we dont know what it is yet) of length 5 inches. Now $P(x)$ gives the probability that some fish (either seabass or salmon) has the length 5 inches.

The next idea is that of likelihood . It is also called class conditional probability. It is represented as either $P(x|seabass) \; or \; P(x|salmon)$ . The interpretation is simple. This answers the question that if the fish is seabass what is the probability that it will have length $x$ inches (ditto salmon). Alternatively , what is the probability that a 5 inch seabass exists and so on. Or even how "likely" is a 5 inch seabass ?

The posterior probability is the other side of the story. This is represented by $P(seabass|x) \; or \; P(salmon|x)$ . Intuitively, given that we have a fish of length $x$ inches , what is the probability that it is a seabass (or salmon).  The interesting thing is that knowing prior probability and likelihood we can calculate posterior probability using the famous "Bayes Theorem". We can represent it in words as ,

$posterior = \frac{likelihood \times prior}{evidence}$

This gives another rationale for the word "likelihood". Among all other things being equal , the item with higher likelihood is more "likely" to final result. For eg if the likelihood of a 10 inch seabass is more than that of salmon then when we observe an unknown fish of length 10 inches , it is most likely a seabass.

PS : There is an excellent (but long) tutorial on Bayes Theorem at "An Intuitive Explanation of Bayes’ Theorem" . True to its title, it does try to explain the bizarre (atleast initially) result of Bayes Theorem using multiple examples. I highly recommend reading it.

### Bayesian Decision Theory

Let us enter into the decision theory at last. In a very high level definition, you can consider decision theory as a field which studies about "decisions" (to classify as seabass or not to be) – more exactly, it considers these decisions in terms of cost or loss functions. (More on that later). In essence , you can think of decision theory as providing a decision rule which tells us what action to taken when we make a particular observation. Decision theory can be thought of as all about evaluating decision rules. (Of course, I am grossly simplifying things, but I think I have conveyed the essence).

### Informal Discussion of Decision Theory for Two Class System with Single Feature

Let us take a look at the simplest application of decision theory to our problem. We have a two class system (seabass,salmon) and we are using a single feature (length) to make a decision. Be aware that length is not an ideal feature because many a time you will be having both seabass and salmon of same length (say 5 inches). So when we come across a fish with length 5 inches, we are stuck. We don’t know what decision to take as we know both seabass and salmon can be 5 inches. Decision theory to the rescue !

Instead of providing the theoretical ideas, I will discuss various scenarios and what is the best decision theoretic action to do. In all the scenarios let us assume that we want to be as accurate as possible.

### Case I : We don’t know anything and we are not allowed to see the fish

This is the worst case to be in. We have no idea about seabass and salmon (a vegetarian , perhaps ? 🙂 ). You are also not allowed to see the fish. But you are asked is the next fish in conveyor a seabass or salmon ? All is not lost – The best thing to do is to randomize. So the decision rule is with probability 50% say the next fish is seabass and  with probability 50% say it is salmon.

Convince yourself that this is the best thing to do – Not only when the seabass and salmon are in 50:50 , even when they are in 90:10 ratio.

### Case II : You know the prior probability but still you don’t see the fish

We are in a slightly better position here. We don’t get to see the fish yet , but we know the prior probability that the next fish is a seabass or salmon. ie We are give $P(seabass) \; and \; P(salmon)$ Remember, we want to be as accurate as possible and we want to as reliable about accuracy rate as possible.

A common mistake to do is to randomize again. ie with $P(seabass)$ say that the next fish is seabass and salmon otherwise. For eg let us say, $P(seabass) = .70 \; and \; P(salmon) = 0.3$ . Let me attempt an informal reasoning – In the (sample) worst case, you will get first 40 as seabass, next 30 as salmon and next 30 as seabass. But you say first 30 as salmon and next 70 as seabass. In this hypothetical example you are only at the most 40% accurate even though you can theoretically do better.

What does decision theory say here ? If $P(seabass) > P(salmon)$ then ALWAYS say seabass. Else ALWAYS say salmon. In this case the accuracy rate is $max(P(seabass),P(salmon))$ . Conversely, the error rate is the minimum of both the prior probabilities. Convince yourself that this is the best you can do . It sure is counterintuitive to always say seabass when you know you will get salmon too. But we can easily prove that this is the best you can do "reliably".

Mathematically, decision rule is decide seabass if $P(seabass) > P(salmon)$  else decide salmon .

Error is $min(P(seabass),P(salmon)$ .

### Case III : You know the likelihood function and the length but not the prior probability

This case is really hypothetical. ie we can see the fish and hence find its length. Let say x inches. We have $P(x inches|seabass) \; and \; P(x inches | salmon)$ but we don’t know the prior probability. The decision rule here is : For each fish , find the appropriate likelihood values. If the likelihood of seabass is higher than that of salmon , say the fish is seabass and salmon otherwise.

Note that we are making a decision based on our "observation" in contrast to previous cases. Unless, you are really unlucky and the prior probabilities are really skewed you can do well with this decision rule.

Mathematically, decision rule is decide seabass if $P(x|seabass) > P(x|salmon)$  else  decide  salmon .

### Case IV : You know the length, prior probability and the likelihood function

This is the scenario we are mostly in. We know the length of the fish (say 5 inches). We know the prior probability (say 60% salmon and 40% seabass). We also know the likelihood of them. (say P(5 inches|seabass) is 60% and P(5 inches|salmon) is 10% )

Now we can apply our favorite Bayes rule to get posterior probability. If the length of the fish is 5 inches then what is the probability that it a seabass ? A salmon ? Once you can calculate the posterior , the decision rule becomes simpler. If the posterior probability of seabass is higher than say the fish is seabass else say it is salmon.

Mathematically, decision rule is decide seabass if $P(seabass|x) > P(salmon|x)$ else decide salmon . This rule is very important and is called as Bayes Decision Rule.

For this decision rule, the error is $min(P(seabass|x),P(salmon|x)$

We can expand Bayes decision rule using Bayes theorem.

Decide seabass if  $p(x|seabass)P(seabass) > p(x|salmon)P(salmon)$   else decide salmon.

There are two special cases.
1. If likelihood are equal then our decision depends on prior probabilities.
2. If prior probabilities are equal then our decisions depend on likelihoods.

We have only scratched the surface of decision theory. In particular we did not focus much on bounding the error today. Also we did not discuss the cases where there are multiple classes or features. Hopefully, I will discuss them in a future post.

### Reference

Pattern Classification by Duda,Hart and Stork. Chapter 2.

Article talks about how the research “output” from India is increasing. I can only hope that the contents are at least half true. IMHO , lack of incentives to do cool research is a bigger impediment that corrupt politicians , anachronistic policies or lagging infrastructure. It is quite heartening to see India’s profile increasing even with these restrictions. Another interesting thing to note was that the full article mentioned that University Of Texas is one of the top collaborators. Nice !!.

This was an interesting news that I guess, most missed in CES. USB 3.0 rocks. Probably, I have to qualify that one of the candidates for USB 3.0 rocks. Time will tell if this standard prevails , but the speeds are fascinating.

An article about China’s high speed rails. It beats other competitors like Japan,Germany and France. Its top speed is 312 Kmph. Wow !! (Where the hell is India and US ? )

I think the idea is cool !!

Probably the article that shook me most. I got it from one of Pals. I think it has made me reevaluate storing passwords in apps. The most shocking revelation was that Pidgin was storing the account passwords in plaintext. I agree with their rationale but I would have preferred a message box saying the passwords will be stored in plaintext when I opted to store my passwords. Atleast it would have alerted me to the possibility without me digging this information from the Net from a (reasonably) obscure article.

The best link of today – Terry Tao’s notes on Probability. I think, to call Terry Tao as a genius  is by itself an understatement. I am still digesting the info in the notes. There were some interesting new information like Borel-Cantelli lemma , so I look forward to taking a deeper look. After taking a course on Random Graphs, I also look forward to his course on random matrices . Although , I have to admit, I don’t fancy my chances of understanding most of the topics in the course 🙂 .