Feeds:
Posts

## Detailed Tutorial on Markov and Chebyshev Inequalities

In this post, I plan to discuss about two very simple inequalities – Markov and Chebyshev. These are topics that are covered in any elementary probability course. In this post, I plan to give some intuitive explanation about them and also try to show them in different perspectives. Also, the following discussion is closer to discrete random variables even though most of them can be extended to continuous ones.

### Inequalities from an Adversarial Perspective

One interesting way of looking at the inequalities is from an adversarial perspective. The adversary has given you some limited information and you are expected to come up with some bound on the probability of an event. For eg, in the case of Markov inequality, all you know is that the random variable is non negative and its (finite) expected value. Based on this information, Markov inequality allows you to provide some bound on the tail inequalities. Similarly, in the case of Chebyshev inequality, you know that the random variable has a finite expected value and variance. Armed with this information Chebyshev inequality allows you to provide some bound on the tail inequalities. The most fascinating this about these inequalities is that you do not have to know the probabilistic mass function(pmf). For any arbitrary pmf satisfying some mild conditions, Markov and Chebyshev inequalities allow you to make intelligent guesses about the tail probability.

### A Tail Inequality perspective

Another way of looking at these inequalities is this. Supposed we do not know anything about the pmf of a random variable and we are forced to make some prediction about the value it takes. If the expected value is known, a reasonable strategy is to use it. But then the actual value might deviate from our prediction. Markov and Chebyshev inequalities are very useful tools that allow us to estimate how likely or unlikely that the actual value varies from our prediction. For eg, we can use Markov inequality to bound the probability that the actual varies by some multiple of the expected value from the mean. Similarly, using Chebyshev we can bound the probability that the difference from mean is some multiple of its standard deviation.

One thing to notice is that you really do not need the pmf of the random variable to bound the probability of the deviations. Both these inequalities allow you to make deterministic statements of probabilistic bounds without knowing much about the pmf.

### Markov Inequality

Let us first take a look at the Markov Inequality. Even though the statement looks very simple, clever application of the inequality is at the heart of more powerful inequalities like Chebyshev or Chernoff. Initially, we will see the simplest version of the inequality and then we will discuss the more general version. The basic Markov inequality states that given a random variable X that can only take non negative values, then

$Pr(X \geq k E[X]) \leq \frac{1}{k}$

There are some basic things to note here. First the term P(X >= k E(X)) estimates the probability that the random variable will take the value that exceeds k times the expected value. The term P(X >= E(X)) is related to the cumulative density function as 1 – P(X < E(X)). Since the variable is non negative, this estimates the deviation on one side of the error.

### Intuitive Explanation of Markov Inequality

Intuitively, what this means is that , given a non negative random variable and its expected value E(X)
(1) The probability that X takes a value that is greater than twice the expected value is atmost half. In other words, if you consider the pmf curve, the area under the curve for values that are beyond 2*E(X) is atmost half.
(2) The probability that X takes a value that is greater than thrice the expected value is atmost one third.
and so on.

Let us see why that makes sense. Let X be a random variable corresponding to the scores of 100 students in an exam. The variable is clearly non negative as the lowest score is 0. Tentatively lets assume the highest value is 100 (even though we will not need it). Let us see how we can derive the bounds given by Markov inequality in this scenario. Let us also assume that the average score is 20 (must be a lousy class!). By definition, we know that the combined score of all students is 2000 (20*100).

Let us take the first claim – The probability that X takes a value that is greater than twice the expected value is atmost half. In this example, it means the fraction of students who have score greater than 40 (2*20) is atmost 0.5. In other words atmost 50 students could have scored 40 or more. It is very clear that it must be the case. If 50 students got exactly 40 and the remaining students all got 0, then the average of the whole class is 20. Now , if even one additional student got a score greater than 40, then the total score of 100 students become 2040 and the average becomes 20.4 which is a contradiction to our original information. Note that the scores of other students that we assumed to be 0 is an over simplification and we can do without that. For eg, we can argue that if 50 students got 40 then the total score is atleast 2000 and hence the mean is atleast 20.

We can also see how the second claim is true. The probability that X takes a value that is greater than thrice the expected value is atmost one third. If 33.3 students got 60 and others got 0 , then we get the total score as around 2000 and the average remains the same. Similarly, regardless of the scores of other 66.6 students, we know that the mean is atleast 20 now.

This also must have made clear why the variable must be non negative. If some of the values are negative, then we cannot claim that mean is atleast some constant C. The values that do not exceed the threshold may well be negative and hence can pull the mean below the estimated value.

Let us look at it from the other perspective : Let p be the fraction of students who have a score of atleast a . Then it is very clear to us that the mean is atleast a*p. What Markov inequality does is to turn this around. It says, if the mean is a*p  then the fraction of students with a score greater than a is atmost p. That is, we know the mean here and hence use the threshold to estimate the fraction .

### Generalized Markov Inequality

The probability that the random variable takes a value thats greater than k*E(X) is at most 1/k. The fraction 1/k act as some kind of a limit. Taking this further, you can observe that given an arbitrary constant a, the probability that the random variable X takes a value >= a ie P(X >= a) is atmost 1/a times the expected value. This gives the general version of Markov inequality.

$Pr(X \geq a) \leq \frac{1}{a} E[X]$

In the equation above, I seperated the fraction 1/a because that is the only varying part. We will later see that for Chebyshev we get a similar fraction. The proof of this inequality is straightforward. There are multiple proofs even though we will use the follow proof as it allows us to show Markov inequality graphically.This proof is partly taken from Mitzenmacher and Upfal’s exceptional book on Randomized Algorithms.

Consider a constant a >= 0. Then define an indicator random variable I which takes value of 1 is X >=a . ie

$\displaystyle I = \begin{cases} 1, & \mbox{if } \mbox{ X} \geq \mbox{a} \\ 0, & \mbox{otherwise } \end{cases}$

Now we make a clever observation. We know that X is non negative. ie X >= 0. This means that the fraction X/a is atleast 0 and atmost can be infinty. Also, if X < a, then X/a < 1. When X > a, X/a > 1. Using these facts,

$I \leq \frac{X}{a}$

If we take expectation on both sides, we get

$E[I] \leq \frac{1}{a} E[X]$

But we also know that the expectation of indicator random variable is also the probability that it takes the value 1. This means E[I] = Pr(X>=a). Putting it all together, we get the Markov inequality.

$Pr(X \geq a) \leq \frac{1}{a} E[X]$

### Even more generalized Markov Inequality

Sometimes, it might happen that the random variable is not non-negative. In cases like this, a clever hack helps. Design a function f(x) such that f(x) is non negative. Then we can apply Markov inequality on the modified random variable f(X). The Markov inequality for this special case is :

$Pr(f(X) \geq a) \leq \frac{1}{a} E[f(X)]$

This is a very powerful technique. Careful selection of f(X) allows you to derive more powerful bounds.
(1) One of the simplest examples is f(X) = |X| which guarantees f(X) to be non negative.
(2) Later we will show that Chebyshev inequality is nothing but Markov inequality that uses $f(X) = |X-E(X)|^2$
(3) Under some additional constraints, Chernoff inequality uses $f(X) = e^{tX}$ .

### Simple Examples

Let us consider a simple example where it provides a decent bound and one where it does not. A typical example where Markov inequality works well is when the expected value is small but the threshold to test is very large.

Example 1:
Consider a coin that comes up with head with probability 0.2 . Let us toss it n times. Now we can use Markov inequality to bound the probability that we got atleast 80% of heads.

Let X be the random variable indicating the number of heads we got in n tosses. Clearly, X is non negative. Using linearity of expectation, we know that E[X] is 0.2n.We want to bound the probability P(X >= 0.8n). Using Markov inequality , we get

$P(X \geq 0.8n) \leq \frac{0.2n}{0.8n} = 0.25$

Of course we can estimate a finer value using the Binomial distribution, but the core idea here is that we do not need to know it !

Example 2:
For an example where Markov inequality gives a bad result, let us the example of a dice. Let X be the face that shows up when we toss it. We know that E[X] is 7/2 = 3.5. Now lets say we want to find the probability that X >= 5. By Markov inequality,

$P(X \geq 5) \leq \frac{3.5}{5} = 0.7$

The actual answer of course is 2/6 and the answer is quite off. This becomes even more bizarre , for example, if we find P(X >= 3) . By Markov inequality,

$P(X \geq 3) \leq \frac{3.5}{3} = \frac{7}{6}$

The upper bound is greater than 1 ! Of course using axioms of probability, we can set it to 1 while the actual probability is closer to 0.66 . You can play around with the coin example or the score example to find cases where Markov inequality provides really weak results.

### Tightness of Markov

The last example might have made you think that the Markov inequality is useless. On the contrary, it provided a weak bound because the amount of information we provided to it is limited. All we provided to it were that the variable is non negative and that the expected value is known and finite. In this section, we will show that it is indeed tight – that is Markov inequality is already doing as much as it can.

From the previous example, we can see an example where Markov inequality is tight. If the mean of 100 students is 20 and if 50 students got a score of exactly 0, then Markov implies that atmost 50 students can get a score of atleast 40.

Note : I am not 100% sure if the following argument is fully valid – But atleast it seems to me 🙂

Consider a random variable X such that

$X = \displaystyle \begin{cases} k & \mbox{with probability } \frac{1}{k} \\ 0 & \mbox{else} \end{cases}$

We can estimate its expected value as

$E[X] = \frac{1}{k} \times k + \frac{k-1}{k} \times 0 = 1$

We can see that , $Pr(X \geq k E[X]) = Pr(X \geq k) = \frac{1}{k}$

This implies that the bound is actually tight ! Of course one of the reasons why it was tight is that the other value is 0 and the value of the random variable is exactly k. This is consistent with the score example we saw above.

### Chebyshev Inequality

Chebyshev inequality is another powerful tool that we can use. In this inequality, we remove the restriction that the random variable has to be non negative. As a price, we now need to know additional information about the variable – (finite) expected value and (finite) variance. In contrast to Markov, Chebyshev allows you to estimate the deviation of the random variable from its mean. A common use of it estimates the probability of the deviation from its mean in terms of its standard deviation.

Similar to Markov inequality, we can state two variants of Chebyshev. Let us first take a look at the simplest version. Given a random variable X and its finite mean and variance, we can bound the deviation as

$P(|X-E[X]| \geq k \sigma ) \leq \frac{1}{k^2}$

There are few interesting things to observe here :
(1) In contrast to Markov inequality, Chebyshev inequality allows you to bound the deviation on both sides of the mean.
(2) The length of the deviation is $k \sigma$ on both sides which is usually (but not always) tighter than the bound k E[X]. Similarly, the fraction 1/k^2 is much more tighter than 1/k that we got from Markov inequality.
(3) Intuitively, if the variance of X is small, then Chebyshev inequality tells us that X is close to its expected value with high probability.
(4) Using Chebyshev inequality, we can claim that atmost one fourth of the values that X can take is beyond 2 standard deviation of the mean.

### Generalized Chebyshev Inequality

A more general Chebyshev inequality bounds the deviation from mean to any constant a . Given a positive constant a ,

$Pr(|X-E[X]| \geq a) \leq \frac{1}{a^2}\;Var[X]$

### Proof of Chebyshev Inequality

The proof of this inequality is straightforward and comes from a clever application of Markov inequality. As discussed above we select $f(x) = |X-E[X]|^2$. Using it we get ,

$Pr(|X-E[X]| \geq a) = Pr( (X-E[X])^2 \geq a^2)$
$Pr( (X-E[X])^2 \geq a^2) \leq \frac{1}{a^2} E[(X-E[X])^2]$

We used the Markov inequality in the second line and used the fact that $Var[X] = E[(X-E[X])^2]$.

### Common Pitfalls

It is important to notice that Chebyshev provides bound on both sides of the error. One common mistake to do when applying Chebyshev is to divide the resulting probabilistic bound by 2 to get one sided error. This is valid only if the distribution is symmetric. Else it will give incorrect results. You can refer Wikipedia to see one sided Chebyshev inequalities.

### Chebyshev Inequality for higher moments

One of the neat applications of Chebyshev inequality is to use it for higher moments. As you would have observed, in Markov inequality, we used only the first moment. In the Chebyshev inequality, we use the second moment (and first). We can use the proof above to adapt Chebyshev inequality for higher moments. In this post, I will give a simple argument for even moments only. For general argument (odd and even) look at this Math Overflow post.

The proof of Chebyshev for higher moments is almost exactly the same as the one above. The only observation we make is that $(X-E[X])^{2k}$ is always non negative for any k. Of course the next observation is $E[(X-E[X])^{2k}$ gives the 2k^th central moment . Using the statement from Mitzenmacher and Upfal’s book we get ,

$Pr(|X-E[X]| > t \sqrt[2k] {E[(X-E[X])^{2k}]}) \leq \frac{1}{t^{2k}}$

It should be intuitive to note that the more information we get the tighter the bound is. For Markov we got 1/t as the fraction. It was 1/a^2 for second order Chebyshev and 1/a^k for k^th order Chebyshev inequality.

### Chebyshev Inequality and Confidence Interval

Using Chebyshev inequality, we previously claimed that atmost one fourth of the values that X can take is beyond 2 standard deviation of the mean. It is possible to turn this statement around to get a confidence interval.

If atmost 25% of the population are beyond 2 standard deviations away from mean, then we can be confident that atleast 75% of the population lie in the interval $(E[X]-2 \sigma, E[X]+2 \sigma)$. More generally, we can claim that, $100 * (1-\frac{1}{k})$ percentage of the population lies in the interval $(E[X]-k. \sigma, E[X]+k \sigma)$ . We can similarly derive that 94% of the population lie within 4 standard deviations away from mean.

### Applications of Chebyshev Inequality

We previously saw two applications of Chebyshev inequality – One to get tighter bounds using higher moments without using complex inequalities. The other is to estimate confidence interval. There are some other cool applications that we will state without providing the proof. For proofs refer the Wikipedia entry on Chebyshev inequality.

(1) Using Chebyshev inequality, we can prove that the median is atmost one standard deviation away from the mean.
(2) Chebyshev inequality also provides the simplest proof for weak law of large numbers.

### Tightness of Chebyshev Inequality

Similar to Markov inequality, we can prove the tightness of Chebyshev inequality. I had fun deriving this proof and hopefully some one will find it useful. Define a random variable X as ,

[Note: I could not make the case statement work in WordPress Latex and hence the crude work around]

X = { $\mu$ + C  with probability p

{ $\mu$ – C  with probability p

{ $\mu$ with probability 1-2p

$E[X] = p(\mu +C) + p(\mu -C) + (1-2p) \mu = \mu$
$Var[X] = E[(X-\mu)^2]$
$= p (\mu+C-\mu)^2 + p (\mu-C-\mu)^2 + (1-2p)(\mu-\mu)^2$
$\Rightarrow Var[X] = 2pC^2$

If we want to find the probability that the variable deviates from mean by constant C, the bound provided by Chebyshev is ,

$Pr(|X-\mu| \geq C) \leq \frac{Var[X]}{C^2} = \frac{2pC^2}{C^2}=2p$

which is tight !

### Conclusion

Markov and Chebyshev inequalities are two of the simplest , yet very powerful inequalities. Clever application of them provide very useful bounds without knowing anything about the distribution of the random variable. Markov inequality bounds the probability that a nonnegative random variable exceeds any multiple of its expected value (or any constant). Chebyshev’s inequality , on the other hand, bounds the probability that a random variable deviates from its expected value by any multiple of its standard deviation. Chebyshev does not expect the variable to non negative but needs additional information to provide a tighter bound. Both Markov and Chebyshev inequalities are tight – This means with the information provided, the inequalities provide the most information they can provide.

Hope this post was useful ! Let me know if there is any insight I had missed !

### References

(1) Probability and Computing by Mitzenmacher and Upfal.
(2) An interactive lesson plan on Markov’s inequality – An extremely good discussion on how to teach Markov inequality to students.
(3) This lecture note from Stanford – Treats the inequalities from a prediction perspective.
(4) Found this interesting link from Berkeley recently.

## How to add CRAN Ubuntu Repository to your system and fixing the GPG error

R is one of the coolest language designed and I am having lot of fun using it. It has become my preferred language of programming next only to Python. If you are also using Ubuntu, the rate of update of R in Ubuntu’s official repositories is slightly slow. If you want to get the latest packages as soon as possible, then the best option is to add some CRAN mirror to your Ubuntu repository. This by itself is straightforward. I decided to write this post on how to solve the GPG error if you get it.

### Steps

(1) Decide on which CRAN repository you want to use. Finding the nearest one usually gives the best speed. Lets say it is http://cran.cnr.berkeley.edu/ . Append "bin/linux/ubuntu". Typically this works. You can confirm this by going to this url in the browser too.
(2) Add this to your Ubuntu repository. There are multiple ways. In the steps, below, replace http://cran.cnr.berkeley.edu/bin/linux/ubuntu with your mirror

(a) Synaptic -> Settings -> Repositories -> Other Software -> Add . In the apt line enter "deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu natty/".
(b) sudo vim /etc/apt/sources.list and add "deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu natty/" at the end. If you are not comfortable with vim, use gedit but instead of sudo , used gksudo.

(3) Refresh the source repository by using refresh in Synaptic or using  "sudo apt-get update "
(4) Install R or any other package you want. If you are installing R , I suggest you install r-base-dev instead of r-base. If you are installing some R package , check if it exists with the name r-cran-* . Else, install it using install.packages command inside R.
(5) Enjoy 🙂

### Fixing GPG Errors

When I did these steps, I got an error like the following (This occurred when I updated last month, this might be fixed now !):

GPG error: http://cran.cnr.berkeley.edu natty/ Release: The following signatures couldn’t be verified because the public key is not available: NO_PUBKEY 51716619E084DAB9

If you get the error, enter the following commands in the terminal.

gpg –keyserver keyserver.ubuntu.com –recv-key E084DAB9
gpg -a –export E084DAB9 | sudo apt-key add –

Repeat the steps above and this should fix the key error.

## Detailed discussion on NP-Completeness of Subset Sum

I recently spent some time developing notes on Subset sum – specifically the NP-Completeness part of it. I thought I will share it with the blog readers.

### Introduction

Subset sum is one of the very few arithmetic/numeric problems that we will discuss in this class. It has lot of interesting properties and is closely related to other NP-complete problems like Knapsack . Even though Knapsack was one of the 21 problems proved to be NP-Complete by Richard Karp in his seminal paper, the formal definition he used was closer to subset sum rather than Knapsack.

Informally, given a set of numbers S and a target number t, the aim is to find a subset S’ of S such that the elements in it add up to t. Even though the problem appears deceptively simple, solving it is exceeding hard if we are not given any additional information. We will later show that it is an NP-Complete problem and probably an efficient algorithm may not exist at all.

### Problem Definition

The decision version of the problem is :  Given a set S and a target t does there exist a subset $S^{'} \subseteq S$ such that $t = \sum_{s \in S'} s$ .

### Exponential time algorithm approaches

One thing to note is that this problem becomes polynomial if the size of  S’ is given. For eg,a typical interview question might look like : given an array find two elements that add up to t. This problem is perfectly polynomial and we can come up with a straight forward $O(n^2)$ algorithm using nested for loops to solve it. (what is the running time of best approach ?).

A slightly more complex problem asks for ,say, 3 elements that add up to t. Again, we can come up with a naive approach of complexity $O(n^3)$. (what is the best running time?). The catch in the general case of subset sum is that we do not know $|S^{'}|$. At the worst case $|S^{'}|$ is $O(n)$ and hence the running time of brute force approach is approximately $n^{O(n)}$.

A slightly more efficient algorithm checks out all possible $2^n$ subsets. One typical way to do this is to express all numbers from 0 to $2^{n}-1$ in binary notation and form a subset of elements whose indexes are equal to the bit positions that correspond to 1. For eg, if n is 4 and the current number, in decimal, is say $10$ which in binary is 1010. Then we check the subset that consists of $1^{st}$ and $3^{rd}$ elements of S. One advantage of this approach is that it uses constant space. At each iteration, you examine a single number. But this approach will lead to a slower solution if $|S^{'}|$ is small. Consider the case where $t=S[\frac{n}{2}]$. We will have to examine around $O(2^{\frac{n}{2}})$ different subsets to reach this solution.

A slightly different approach finds all possible sums of subsets and checks if t has occurred in the subset.

EXPONENTIAL-SUBSET-SUM(S,t):
n =  |S|
$L_{0}$ = {0}
for i in 1 to n  :
$L_{i}$ = merge-lists($L_{i-1}, L_{i-1} + S[i]$)
if $L_{i}$ has t, return true.
remove all elements greater than t from $L_i$
if $L_{n}$ has t, return true else return false

This algorithm uses the notation S+x to mean ${s+x :s \in S}$ . Refer CLRS 35.5 for a discussion of a similar algorithm for a variant of subset sum problem.

### NP-Completeness of Subset Sum Decimal

In this section we will prove that a specific variant of Subset sum is NP-Complete. Subset sum decimal is defined very similar to standard Subset sum but each number in S and also t is encoded in decimal digits.

We can show that Subset sum decimal is in class NP by providing the subset S’ as the certificate. Clearly, we can check if elements in S’ adds up to t in polynomial time.

The next step is to select another NP-Complete problem which can be reduced to Subset sum decimal. So far we have not discussed any arithmetic NP complete problems. The only non graph theoretic problem that we have discussed in 3SAT and we will use it for the proof. Of course there are multitude of other reductions including Vertex cover, 3 dimensional matching, partition etc.

We are now given a 3SAT formula $\phi$ with n variables – $x_1, x_2,\ldots,x_n$ and m clauses – $C_1, C_2,\ldots, C_m$. Each clause $C_i$ contains exactly 3 literals. Our aim is to construct an instance of subset sum problem $$ such that $\phi$ is satisfiable if and only if a solution to our instance of Subset sum decimal exists. The outline of the proof is as follows :

1. Construct a set S of unique large decimal numbers that somehow encode the constraints of $\phi$. Additionally this operation must take polynomial time.
2. Construct an appropriate target t such that this instance of Subset sum decimal is solvable if and only if a solution to 3SAT instance exists. Handle complications like carries in addition.
3. Devise a way to find the satisfying assignment from subset solution and vice versa.

To simplify the proof, we make the following assumptions :

1. All the literals $x_1$ to $x_n$ is used in some clause of $\phi$ .
2. No clause can contain both a literal and its complement.

As a consequence of these assumptions, we do not have any variables that are superfluous. Also we do not have any clauses that get satisfied trivially.

We will not duplicate the proof in the lecture notes as a detailed sketch of the reduction is given in CLRS section 34.5.5. Instead we will focus on certain observations.

Observation 1 : Construction of S and t takes polynomial time

This is easy to see. For each variable $x_i$ we create 2 variables. Similarly we create two variables for each clause $C_j$. The total number of variables in S is 2(m+n). Each number in set S and t contains exactly n+m digits. Hence the total construction takes time polynomial in n+m .

Observation 2 : There are no carries when elements in subset are added to form t.

We can see that the only allowed integers in number construction are 0,1 and 2. The columns corresponding to variables (the leading n digits) can add up to at the most 2. The columns corresponding to clauses (trailing m digits) cannot have a sum of more than 6. This is because of two facts :

(a) 3SAT has at most 3 literals in each clause

(b) A clause cannot contain a literal and its complement.

So, each variable can add at most 1 to that clause column and there at most 3 variables in a clause. Additionally, we have 1 and 2 from the slack variables. Concisely, we get at most 3 from $v_i$ or $v_i^{'}$ and 3 from $s_i$ and $s_i^{'}$.

Hence we can conclude that carries does not occur at each column(digit) as the base we use is 10.

Observation 3 : All variables in S corresponding to $x_i$s are unique.

Each variable $x_i$ creates two variables $v_i$ and $v_i^{'}$. The proof is in two parts :

(a) First we show that if $i \neq j$ , $v_i$ and $v_j$ does not match in the leading n digits. Similar argument holds for $v_i^{'}$ and $v_j^{'}$.
(b) Next, we can show that $v_i$ does not equal to $v_i^{'}$. This is because our assumption that a literal and its complement does not occur in the same clause. This means that the trailing m digits will not be equal.

In conclusion, no pair of variables in S corresponding to $x_i$ are equal.

Observation 4 : All variables in S corresponding to $C_i$s are unique.

Each clause $C_i$ creates two variables $s_i$ and $s_i^{'}$. If  $i \neq j$, $s_i (s_i^{'})$ and $s_j (s_j^{'})$ does not match in the trailing m digits. Additionally, by construction, $s_i \neq s_i^{'}$ as the digit position corresponding to $C_i$ has 1 for $s_i$ and 2 for $s_i^{'}$.

Observation 5 : All variables in S is unique. i.e. S forms a set.

This can observed from Observation 3 and 4. By construction $v_i$ and $s_i$ do not match. Similar argument hold for $v_i^{'}$ and $s_i^{'}$.

Observation 6 : New variables corresponding $x_i$ and $C_j$ are both needed for proof.

A detailed sketch is given in CLRS. The variables $v_i$ and $v_i^{'}$ created from $x_i$ makes sure that each variable has a unique boolean assignment of 0 or 1. Else the sum for that column in target will be 2. This is due to the assumption that all variables $x_i$ HAS to be used in some clause $C_j$ and hence has a unique assignment. Of course, it is possible that $\phi$ has multiple satisfying assignment but the target digit forces only one of them to be selected when you select the elements of subset $S^{'}$.

The digits corresponding to clauses makes sure that each clause has at least one variable that evaluates to true. This is because each digit of slack variable corresponding to $C_i$ (ie $s_i,s_i^{'}$) contribute at most 3 towards t and hence the remaining (at least) 1 has to come from $v_j$ or $v_j^{'}$s.

So variables $v_i$ ensure that each $x_i$ has a unique assignment. Variables $s_i$ ensure that each clause $C_j$ of $\phi$ is satisfied.

Observation 7 : Subset sum is NP complete if the numbers are expressed in base $b \geq 7$.

From observation 2 , we know that the maximum possible digit due to summation of elements in S is 6. This means we can reuse the proof of Subset sum decimal to prove that Subset sum is NP-Complete for any base b that is greater that 6.

Observation 8 : Given S’ we can find a satisfying assignment for $\phi$.

We know that any satisfying subset $S^{'}$ must include either $v_i$ or $v_i^{'}$ for $\forall i , 1 \leq i \leq n$. If $S^{'}$ includes $v_i$ then set $x_i$ to 1. Else set it to 0.

Observation 9 Given a satisfying assignment for $\phi$ , we can find S’

This is a bit tricky and is done in two steps. More details can be found in CLRS proof.

1. If the satisfying assignment had $x_i$ , then select $v_i$. Else select $v_i^{'}$.
2. For each clause $C_j$ find how many variables in it evaluated to true due to the boolean assignment. At least one variable has to be true and at most 3 variables are true.

a. If $C_j$ has only one variable that evaluates to true, then select $s_j$ and $s_j^{'}$.
b. If $C_j$ has two variables that evaluate to true, then select $s_j^{'}$.
c. If $C_j$ has three variables that evaluate to true, then select $s_j$.

Observation 10 : If $\phi$ is not satisfied, then S’ cannot be found.

If $\phi$ is not satisfied, then there exist at least one clause $C_j$ that is not satisfied. This means that for ${n+j}^{th}$ digit, the slack variables $s_j,s_j^{'}$ contribute only 3 but the corresponding digit in t has 4. Hence no S’ exists.

### NP-Completeness of Subset Sum Binary

The formal definition of Subset sum binary is similar to Subset sum decimal . The only difference is that all numbers are encoded in bits.

We can notice that the above proof for Subset sum decimal holds only for numbers expressed in base of at least 7 (from observation 7). For bases from 1-6, the previous proof does not apply – partly due to the fact that there will be carries during addition. We need an alternate proof approach. Since we have proved Subset sum decimal as NP-Complete , we can use the result to prove Subset sum binary as NP-Complete.

The certificate is the subset S’ given in binary. We can see that it can be done in polynomial time and hence Subset sum binary is in NP.

The next step is to reduce Subset sum decimal to Subset sum binary. First we observe that any number encoded in decimal can be encoded to binary in polynomial time and vice versa. When given S and t in decimal as input, we encode them in binary and pass it to our Subset sum binary routine. The decision version of Subset sum binary returns true or false which can be fed directly as result of Subset sum decimal. In the optimization version , we just convert the $S’$ returned by the Subset sum binary subroutine to decimal.

Observation 11 : A decimal number can be converted to binary in polynomial time.

Assume some number n is encoded in both binary and decimal. This means $n = 10^k = 2^{k1}$ where k is the number of digits in the decimal representation and k1 is the number of bits needed to encode it.

Taking log to the base 2 on both sides,

$k * log_{2} {10} = {k1} \implies {3.3} {k} = {k1}$

So to express a decimal number with k digits, we need between 3k – 4k bits.

Observation 12 : Subset sum is NP complete for any base $b\geq 2$.

The logarithms of the same number in two different bases differ by at most a constant. ie,

$log_{b1}^{b2} = \frac{log_{b1}^{n}}{log_{b2}^{n}}$.

$log_{b1}^{b2}$ is a constant irrespective of n. So if n needs k digits in base b1, then it needs at most $\frac{k}{log_{b1}b2}$ to be represented in base b2. (Verify observation 11 using this equation !).

### NP-Completeness of Subset Sum Unary

From observation 12, the only base left is 1 and this section handles the special case where all numbers are expressed in base 1. Subset sum unary is similar to Subset sum decimal where all numbers are expressed in unary notation. Numbers in base 1 are called as being represented in unary. Any number k is represented as $1^k$  which is a string of k 1’s. Let us check if Subset sum unary is NP-Complete .

The certificate is the subset where all elements are expressed in unary. If we are given numbers in unary, then  verification takes time that is polynomial in the length of individual unary numbers. Hence Subset sum unary is in unary.

To prove Subset sum unary is in NP-Complete , we have to reduce either Subset sum decimal/binary to unary. Superficially, it looks straightforward and hence  it seems as though Subset sum unary is in NP-Complete. But the catch is that expressing a number n in base b to unary needs time exponential when computed wrt the size of n’s representation in base b. For eg, representing a binary number n that needs k bits needs around $2^{k's}$ unary digits. We can see that $2^k$ is exponential when viewed from k.

In summary, converting a number from any base to unary takes exponential time. So we cannot use our reduction technique as there the reduction is not polynomial.

### Dynamic Programming solution for Subset Sum Unary

What we showed above was that Subset sum unary is in NP but not NP-Complete. Here we show that there exists a dynamic programming formulation for this problem. We represent the problem as a matrix A of size n*t. A is a boolean matrix where the interpretation of cell A[i,j]=True is that there exists a subset of ${x_1,x_2,\ldots,x_i}$ that sum up to j. ie $\exists S^{'} \subseteq \{x_1,x_2,\ldots,x_i\}$ such that $j=\sum_{s \in S'} s$.

The algorithm goes as follows :

SUBSET-SUM-UNARY(S,t):
Form matrix A
Set A[1,0] = True
Set A[1,j] = False unless j==S[1] in which case set A[1,j] to True
for i=2 to t
for j=2 to n
if A[i-1,j] == True
A[i,j] = True
else if A[i-1,j-x_i] == True
A[i,j] = True
else
A[i,j] = False

Consider the set $S=\{2,3,4,5\}$ and let t=8. The worked out DP is given below :

 0 1 2 3 4 5 6 7 8 2 T F T F F F F F F 3 T F T T F T F F F 4 T F T T T T T T F 5 T F T T T T T T T

Since A[5,8]=True , we conclude that there exists a subset of S that sum up to t(8).

### Strong and Weak NP-Complete Problems

Subset sum is interesting in the sense that its binary/decimal can be proved as NP-Complete but its unary version seems to allow a polynomial looking dynamic programming solution.

Looking at the dynamic programming solution carefully, the time (and space) complexity of the approach is $O(n*t)$ where n=|S| and t is the target. By itself, the DP solution looks feasible and ‘somehow’ polynomial. But one of the reasons that Subset sum is NP-Complete is due to the fact that it allows "large" numbers. If t is large, then the table A is huge and the DP approach takes a lot of time to complete.

Given S and t , there are two ways to define an polynomial algorithm. One uses the length of S ie n to measure algorithm complexity. From this angle, $O(n*t)$ is not polynomial. This is because t can be huge irrespective of n. For eg, we have have a small set with 4 elements but the individual elements (and t) are of the order , say, $O(10^{10})$ . But from the perspective of magnitude of t, this dynamic programming approach is clearly polynomial.

In other words, we have two ways to anchor our polynomial – $Length[S]$ and $Magnitude[t]$. An algorithm is called pseudo polynomial, if its time complexity is bounded above by a polynomial function of two variables – $Length[S]$ and $Magnitude[t]$ .

Problems that admit pseudo polynomial algorithms are called weak NP-Complete problems and those that do not admit are called Strong NP-Complete problems. For example, Subset sum is a weak NP-Complete problem but Clique is a strong NP-Complete problem.

There are lot of interesting discussion about the strong/weak NP-Complete problems in both Garey and Johnson and in Kleinberg/Tardos. See references for more details.

Observation 13 : Only number theoretic problems admit pseudo polynomial algorithms.

Observation 14 : Strong NP-Complete problems do not admit a pseudo polynomial time algorithm unless P=NP.

### References

1. CLRS 34.5.5 – Proof of NP-Completeness of Subset sum.
2. CLRS 35.5 – An exponential algorithm to solve a variant of subset sum problem.
3. Garey and Johnson 4.2 – Discussion of pseudo polynomial time algorithms along with strong and weak NP-Complete problems.
4. Kleinberg and Tardos 6.4 – Discusses a variant of the DP algorithm given in the lecture notes and the concept of pseudo polynomial time algorithms. Section 8.8 has an alternate NP-Completeness proof of Subset sum using vertex cover which you can skim through if interested.

Hope you enjoyed the discussion on various facets of Subset sum problem !

## Impressions on MIT OCW Calculus Revisited

Sometime early this week , I finished listening to the excellent video lectures from MIT OCW’s Calculus Revisited course . I had been meaning to listen to MIT OCW’s Single Variable Calculus course for quite sometime as my background in Calculus is a bit flaky. My interests are in machine learning, data mining and AI where Calculus has a nasty habit of making surprise entries 🙂

I somehow finished my Master’s using my old Calculus knowledge. I took a course on Numerical Methods which kind of exposed my weaknesses. I kept getting confused with the error approximations which used ideas from  infinite series. Other advanced ideas like multivariable optimization were also problematic to me. Once that course was over, I swore myself to refresh my Calculus stuff and also learn multivariable calculus.

I started listening to MIT OCW’s Single Variable Calculus lecture videos and felt two things – The course was a bit slow for my pace and the course jumped right away into the mechanics without spending much time on the intuitive explanations of the Calculus. In other words, I felt 18.01 was more focused on the analytic part which emphasized proofs and derivations whereas for my purposes an intuitive explanation of the concept would have sufficed. In fact, I remembered almost all of the Calculus formulas from undergrad – My only problem was the lack of “sense” in how to apply it to the problem I faced (say in machine learning or some optimization).

Then I found the Calculus Revisited course from MIT OCW. It consists of a series of lectures on Calculus but also assumes that students have had prior exposure to it. This assumption had some interesting consequences and I fit the bill perfectly. I downloaded the set of videos and started listening to them. Interestingly, all the lectures were between 20-40 minutes which allowed for maximum focus and also allowed you to listen to multiple lectures in the same day. In fact, Arlington had a heavy snow this week and my university had to be closed for the entire week. I completed around 16 lectures in 3 days and was able to finish it ahead of my target date of Feb 15.

The course starts with the absolute basic ideas of sets, functions, induction and other stuff. If you are from CS and had taken discrete math, you can feel free to skip the first section. But I would suggest you to still take a look as it , in a sense, sets the stage for the entire course. Do take some time to listen to the lecture on limits. (Part 1 , lecture 4). Here, the discussion of limits effortlessly leads to the derivation of the formula for instantaneous speed and hence differentiation.

Part 2 forms the crux of the course and covers differentiation. Professor Herbert Gross had a beautiful way of teaching stuff about derivatives. In particular, he extensively used the idea of geometric proofs or visualizations to expound basic ideas. The way he brought out the tight relation between analysis (as in Math) and geometry was enthralling. He had a huge emphasis on geometric intuition which helped me to “grasp” the key concepts.

Part 3 had some nice discussion on Circular functions. He joked about how teachers don’t provide good motivation for learning trigonometry which I felt very true to me. He also explained some concepts that were new to me – that you do not really need triangles to define cosine and sine. Previously, I was aware of the radian concept but never put it all together. He also explained how sine and cosine tend to come up in unexpected places – like as the solution of the differential equation for harmonic motion 🙂 He also masterfully showed the close relation between circular and hyperbolic functions with a playful title of ‘What a difference a sign makes’  (in Part 5).

Part 4 discussed about integration and how it can be used to calculate 2 and 3 dimensional areas (and volumes). This part also had a great discussion on how differential and integral calculus are related. That being said, I was a bit dissatisfied with the discussion on the two fundamental theorems of Calculus. The discussion on Mean Value Theorem also felt a bit rushed. I got a bit lost on the discussion on 1 dimensional arc length calculations. May be I should revisit the lecture notes for the same when I get some free time.

Part 6 was my favorite part for two reasons – This had a discussion of infinite series and my favorite quip of the course. When discussing about the non intuitiveness and the intellectual challenges posed by infinity , professor Herbert Gross playfully quips (which goes something like this)– ‘ of course, one thing to do is to not study it. I can call it as the right wing conservative educational philosophy’ – Ouch 🙂 I think I mostly understood the idea of infinite series even though there was not much explanation of “why” it works that way. I also felt the topic of Uniform Convergence to be way beyond my comprehension level.

Overall, it is a great course and acted as a fast paced refresher for those who had already taken Calculus. The course slowly starts from basic pre-calculus ideas and rapidly gains speed and covers a huge list of calculus topics. I felt few of important Calculus topics were not covered or rushed up – First and second fundamental theorem of Calculus, Mean Value theorem, Taylor series, L’Hospital rule, discussion of exponents and logarithms etc.

But that being said, I feel the course more than makes it up for the way the basic ideas were covered. I had fun learning the ideas of limits, infinitesimals , intuitive ideas of differentiation/integration, geometric explanation of differentiation/integration, how the concept of inverse functions pervades Calculus etc. Prof. Herbert Gross had a jovial air around him and occasionally delved into philosophical discussions which made listening to the lectures more interesting.

He also had an extensive set of supplementary notes and huge amount of problems with solutions. I had to skip the problems part to conserve time. But if you have some time do spend some on it.

Lastly, I found that one of the lectures in the series was missing. Lecture 5 in Part 2 on Implicit Differentiation was the same as the one on Lecture 4. I had sent a mail to MIT OCW about this and got an reply saying they will fix it soon. Hopefully, it will be fixed soon.

In conclusion, this is a great , fast paced course on Calculus that emphasizes geometric intuition of the major ideas in Calculus.  Listen to it if you already know Calculus and want a fast refresher ! I am currently listening to the lectures on Multi Variable calculus. I do intend to listen to Single variable Calculus again , may be in Summer. I will put out another post on how it went 🙂

## Learning Math using YouTube Videos

Last week, I was searching for tutorials on using Lagrange multipliers. I was most interested in the case where there are multiple constraints. I found some good youtube videos in the process. So, I spent some time looking at good Youtube channels where good math lessons are taught. To my delight , I found some good channels.

One thing I need to mention is that in the channels I mention here , the Math level is not very advanced – at the most up to the undergraduate level. If you want really advanced stuff check out MIT OCW or similar places. Also, since most of the efforts were voluntary , there was also quite a bit of overlap in the lessons. Most of them are around 10 minutes which is excellent as they allow me to listen to a lesson when I feel bored and in the process refresh my basic math 🙂

Some of my favorite channels (not in any order) are :

1. Math2b Prof’s channel
Has some interesting stuff on Partial fractions, calculus and some geometry ish topics.

2. Partick JMT’s channel
Has some nice and organized stuff about trigonometry and calculus.

3. MathTV’s channel
Has a series of videos of algebra, calculus and other stuff

This is probably the most popular education Youtube channel. It contains basic tutorial videos on lot of subjects like physics , biology and math. It also has some nice videos on contemporary economic issues. Most of the videos are well packages using playlists that will help you listen in a organized fashion. You can also check Khan academy’s website .

### Other Resources

1. Steven Strogatz’s NYTimes Math article series
Steven Strogatz writes a weekly article series on Math in NYTimes. He explains lot of interesting stuff in Math in a simple manner. You can check out the Strogatz’s Opinionator blog page for more details.

### Some Tips

Most of the channels may not be very useful for grad students in their studies. But they can act as a refresher.

The easiest way to follow the channels is by Subscribing to it. In each of the web page , there is a subscribe button which allows you to be notified when  new video are uploaded. Once you subscribe , you either visit your My Subscriptions page to get the videos uploaded per user. You can also add the subscription widget to your Youtube homepage.

Just to bring the topic to closure, I finally found a good tutorial on using Lagrange Multipliers with multiple constraints at An Introduction to Lagrange Multipliers .

Have fun with all the Math videos 🙂

## Big Picture of Calculus

Calculus is one of the important fields to master if you want to do research in data mining or machine learning. There is a very good set of video lectures on Single Variable Calculus at MIT OCW. The video lectures are here . I had listened to first 5-6 lectures. Since I had some grounding in calculus already, I did not have any trouble understanding it. But I felt professor David Jerison went a bit too fast but without giving a deep intuition of calculus. I was a bit dissatisfied and quit watching the lectures.

When I was looking for alternate video lectures on calculus, I came across a set of 5 lectures on calculus titled "Big Picture Of Calculus". It consists of recordings of Professor Gilbert Strang and focuses explicitly on giving an intuitive feel on calculus. From the lectures, it looks like it might grow to a full series of lectures on calculus although for quite some time the lecture count has stayed constant at 5. The lectures span the most important topics in differential and integral calculus.

I have talked about Gilbert Strang and his linear algebra course lectures here . The calculus lectures are also excellent. He focuses on the main topics and gives a geometric intuition. The lectures are short (around 30 minutes) and hence are quite convenient to watch also. I hope that the series will be expanded to cover other important topics too. Once you get the basic intuition , the OCW course on calculus should be easy to follow.

The website for the videos is at Big Picture of Calculus . The lectures can be watched online. If you want to download them , you need to follow a convoluted procedure.

a. Goto the html page for the individual lecture. Eg  Big Picture of Calculus at http://www-math.mit.edu/~gs/video1.html .
c. Search for this line <param name="FlashVars" value="configxml=somexmlfile.xml" />
d. You can download the actual file at http://www-math.mit.edu/~gs/somexmlfile.flv . In case it did not work , open the xml file at http://www-math.mit.edu/~gs/somexmlfile.xml . View the xml file’s source and get the flv name at this line <param name="flv" value="flvfilename.flv" /> . Then the file name will be http://www-math.mit.edu/~gs/flvfilename.flv

Have fun with Calculus !

## Impressions On MIT OCW Linear Algebra Course

Linear Algebra is one of the coolest and most useful math courses you can take. Basically , it deals with vectors , matrices all the cool stuff you can do with them. Unfortunately, I did not really have a dedicated course on Linear Algebra in my undergrad. From what I hear , most of the CS people I meet (from India) also don’t have this course in their undergrad. Sure we have had some of the topics (like vectors, basic matrices, determinants, Eigenvalues) split across in multiple courses or in our high school ; but not a single,unified course on it.

Linear algebra is useful on its own but it becomes indispensable when your area of interest is AI , Data Mining or Machine Learning. When I took a machine learning course , I spent most of the time learning things in Linear Algebra, adv Calculus or Linear Optimization. In hindsight , machine learning would have been an easy course if I had previously taken courses on Linear Algebra or Linear Optimization.

As a concrete example, I had a hard time understanding the proof of optimality of PCA or the equivalence of different techniques for calculating PCA (eg Eigen space decomposition or SVD etc) . But once I learnt all about basis, dimension , ,Eigen space and Eigen space decomposition, QR decomposition,SVD etc (which are btw taught in any intro course of Linear Algebra) the whole PCA concept looked really really simple and the proofs looked like straight forward algebraic derivations. Oh well, the benefits of hindsight 🙂

Ok, enough on my rant on lack of Linear Algebra in undergrad. After I struggled mightily in my machine learning course, I decided that I had to master Linear Algebra before taking any more advanced courses. I spent the entire winter holidays learning Linear Algebra as I was taking an advanced data mining course this spring. So this blog post is a discussion of my experience.

### Video Resources

Arguably the best resource to learn Linear Algebra is MIT’s OCW course taught by Professor Gilbert Strang . This course are is one the most popular OCW course and so far had more than 1 Million visits . I also searched for alternate courses, but this course wins hands down both for its excellent teaching style and its depth.

The course website is here. It contains around 35 video lectures on various topics. The lecture are available for download both from ITunes and from Internet Archive. If you prefer YouTube, then the playlist for this course is here.

### Books

The recommended book for this course is Introduction to Linear Algebra. 4th ed. by Gilbert Strang. I found the book to be quite costly , even used books for old versions ! I don’t mind buying expensive books (I shell out a lot of money for data mining books , but a rant on it later ) but since I was interested in Linear Algebra primarily to help me master data mining, I preferred the equivalent book Linear Algebra and Its Applications , also by Gilbert Strang. This book had a very similar content to the recommended book but I felt was more fast paced which suited me fine. Also I was able to get an old copy from Amazon for 10 bucks. Sweet ! My only complaint of the book is that the examples and exercises felt a bit disconnected (or should I say, I wasn’t clear of the motivation ? ) from the topics.

If you don’t want to purchase these expensive books , then there is an EXCELLENT free e-book by Professor Jim Hefferon .The book’s website is here , from where you can download the e-book. I have to say, this book really blew me away. It was really intuitive, has excellent (mostly plausible) examples, was slightly more theoretical than Strang’s book with more proofs. It also had a very helpful solution manual , and a LaTeX version of the book. Too good to be true 🙂 I felt this book had a much limited set of topics than Strang’s course/book, (hence this a truly intro book) , but whatever topic it took, it gave it a thorough treatment. Another thing I liked in the book are the  exercises – Most of them were excellent. And having a solution manual helped clarify a lot of things given that I was doing essentially a self-study. Thanks Jim !

### Impressions on the Lectures

I felt, overall, the lectures were excellent. They were short (40-50 minutes). So my usual daily schedule was to listen to a lecture, and read the relevant sections in the book , solve the exercises for which the answers are available at the end of book. All these steps took at most 2-3 hours a day. I was also taking notes in LaTeX using Lyx. I have talked about using Lyx previously in this blog post.

I really liked Strang’s teaching style. He often emphasizes intuition , especially geometric intuition rather than proofs. I felt that is how intro courses must be structured. Proofs are important but not before I have a solid understanding of the topics. But I also have to say that the lectures were varying in quality. Some of the lectures were exceptional while some were not so enlightening. But on the whole, I was really glad that he has made the lectures available online. It has certainly helped me learn Linear Algebra.

### Topics

If possible see all the lectures as almost all of them cover important topics. I did and have to say all of them were excellent and useful. But if you are mostly interested in applied Linear Algebra and planning to use it in Data Mining/ Machine learning, then my suggestion will be Lectures 1-11 , 14-22,25,27-29,33. If you are interested watch lectures 30,31 too. Again a better way to learn is to take notes during lectures and solving at least a few exercises in the book. If you have Matlab or Octave then you can verify answers to some other exercises for which solutions are not given.

### Notes

I have taken LaTeX notes for this course but they are a bit scattered and unorganized. Hopefully, I will organize them together and create a single PDF soon. I kind of put it on a lower priority after I noticed that Peteris Krumins’s blog has a partial set of lecture notes for this course. His lecture notes can accessed here . As of now (Jan 30 , 2010) , he has put notes for first 5 lectures although the frequency seems to be a bit slow.

Have fun with Vectors and Matrices !!