Theory | God, Your Book Is Great !!

Archive for the ‘Theory’ Category

Tutorial on Sampling Theory for Approximate Query Processing

Posted in Data Mining, Statistics, Theory, tagged approximate query processing, aqp, probability, random, reservoir sampling, sampling, statistics on November 15, 2011| Leave a Comment »

Prelude

This blog post is based on the lecture notes I prepared for one of the courses in which I am the teaching assistant. Hopefully, this notes will be useful for anyone interested in either Approximate Query Processing (AQP) or basic Sampling theory as applied in Databases.

Approximate Query Processing

Consider a database D which contains billions of tuples. Such a large database is not uncommon in data warehousing. When we want to get some answers from the database, we construct a SQL query to get the results. Due to the large size of the database, any query should take quite a bit of time. This is regardless of the use of techniques like indexing which can speed up the processing time but does not really reduce the time asymptotically.

One observation is that the queries that are posed to the database returns very precise and accurate answers – probably after looking at each and every tuple. For a lot of use cases of OLAP we may not need such a precise answer. Consider some sample queries like – what is the ratio of male to female in UTA ? What percentage of US people live in Texas? What is the average salary of all employees in the company? and so on.

Notice that for such queries we really do not need answers that are correct to the last decimal point. Also notice that each of those query is an aggregate over some column. Approximate query processing is a viable technique to use in these cases. A slightly less accurate result but which is computed instantly is desirable in these cases. This is because most analysts are performing exploratory operation on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most of the use cases.

There are multiple techniques to perform approximate query processing. The two most popular involve histograms and sampling. Histograms store some statistic about the database and then use it to answer queries approximately. Sampling creates a very small subset of the database and then uses this smaller database to answer the query. In this course we will focus on AQP techniques that use sampling. Sampling is a very well studied problem backed up by a rich theory that can guide us in selecting the subset so that we can provide reasonably accurate results and also provide statistical error bounds.

Introduction to Sampling

The idea behind sampling is very simple. We want to estimate some specific characteristic of a population. For eg this might be the fraction of people who support some presidential candidate or the fraction of people who work in a specific field or fraction of people infected with a disease and so on. The naive strategy is to evaluate the entire population. Most of the time , this is infeasible due to constraints on time, cost or other factors.

An alternate approach that is usually used is to pick a subset of people . This subset is usually called a sample. The size of the sample is usually an order of magnitude smaller than the population. We then use the sample to perform our desired evaluation. Once we get some result, we can use this to estimate the characteristic for the entire population. Sampling theory helps, among other things, on how to select the subset , what is the size of population, how to extrapolate the result from sample to original population , how to estimate the confidence interval of our prediction etc.

The process of randomly picking a subset of the population and using it to perform some estimation should appear very strange. On a first glance it might look that we might get wildly inaccurate results. We will later see how to give statistical guarantees over our prediction. Sampling is a very powerful and popular technique. More and more problems in the real world are being solved using sampling. Lots of recent problems in data mining and machine learning essentially use sampling and randomization to approximately solve very complex problems which are not at all solvable otherwise. This is for this reason that most DBMS provide sampling as a fundamental operator. (Eg Oracle provides a SAMPLE operator in select and also dbms_stats package. SQL Server provides TABLESAMPLE operator and so on).

We represent the population with P and the sample with S. N represents the size of population and n represents the size of the sample. We will use these letters to denote statistics on the population and sample. For eg, $\mu_P$ represents the mean of the population and $\mu_S$ represents the mean of the sample. Similarly, $\sigma_P$ and $\sigma_S$ represent the standard deviation of the population and sample respectively.

Types of Sampling

There are different ways to perform sampling. The ones that we will use most in AQP are :

Sampling With/Without Replacement

The aim of sampling is to get a subset (sample) from a larger population. There are two ways to go about it. In the first approach, we randomly pick some entity, perform measurements if any and add it to the sample. We then replace the entity back to the original population and repeat the experiment. This means that same entity can come in the sample multiple times. This approach is called Sampling with replacement. This is the simplest approach to sampling. There is no additional overhead to check if an entity is already in sample or not. Typically, sampling with replacement is modeled with binomial distribution.

In the second approach, we explicitly make sure that an entity does not appear in the sample more than once. So we randomly pick an entity from the population, verify it is not already in the sample, perform measurement and so on. Alternatively, we can remove entities that were added to sample from the population. This approach is called Sampling without replacement and is usually modeled with an hypergeometric distribution.

Bernoulli Sampling

Bernoulli Trial : A Bernoulli Trial is an experiment which has exactly two outcomes – Success or failure. Each of these outcomes has an associated probability that completely determines the trial. For eg consider a coin which produces head with probability 0.6 and tail with probability 0.4 . This constitutes a bernoulli trial as there are exactly two outcomes.

In Bernoulli Sampling, we perform a Bernoulli trial on each tuple in the database. Each tuple can be selected into a sample with uniform probability. If the trial is a success, the tuple is added to the sample. Else it is not. The most important thing to notice is that all the tuples have exactly the sample probability of getting into the sample. Alternatively, the success probability for Bernoulli trial remains the same for each tuple. Pseudo code for Bernoulli Sampling is given below :

success probability SP = $\frac{n}{N}$
for i = 1 to N
Perform a Bernoulli trial with success probability SP
If outcome is success add i-th tuple to sample

It is important to note that Bernoulli sampling falls under Sampling without replacement. Size of the sample follows a binomial distribution with parameters $B(N,\frac{n}{N})$ . ie it can vary between 0 and N-1 with the expected size of sample as n.

Uniform Random Sampling

In Uniform Random Sampling, we pick each tuple in the database with a constant probability. This means that the probability of any tuple entering the sample is constant. Typically, this is implemented as sampling with replacement. Pseudo code for Uniform Random Sampling is given below :

1. Generate a set of n random numbers S between 1 and N.
2. Select tuples with index in S and add it to sample.

Note that in this case we have exactly n tuples in the sample. We can also notice that sample tuple might appear multiple times. The number of times a tuple appears in the sample forms a binomial distribution with parameters $B(n,\frac{1}{n})$ .

Weighted Random Sampling

In Weighted Random Sampling, we perform a Bernoulli trial on each tuple in the database. The difference with Uniform random sampling that the success probability for each Bernoulli trial varies. In other words, each tuple has a different probability of getting into the sample.

Stratified Sampling

If the population can be subdivided into sub population that are distinct and non overlapping , stratified sampling can be used. In Stratified sampling, we split the population into a bunch of strata and then form sampling over each strata independently. For eg a political poll can be stratified on gender, race , state etc.

There are multiple advantages in using stratified sampling. For one, this allows the convenience to use different sampling techniques over each strata. If there is some specific strata that might be under represented in a general random sampling, we can easily provide additional weights for the samples taken from that. It is also possible to vary the number of samples from a strata to minimize the error. For eg, we can take less number of samples from a strata with low variance while preserving them for strata with high variance.

Stratified sampling is not a panacea because getting a good stratification strategy may not be obvious. In lot of population, the best feature to stratify is not obvious. Even worse, the population may not contain subgroups that are homogeneous and non overlapping.

Given a population , sample size and strata there are many ways to allocate the sample across different strata. The two commonly used strategies are :

1. Proportional Allocation : In this approach the contribution of each strata to the sample is proportional to the size of the strata. So a strata that accounts to 10\% of the population will use 10\% of the sample. Note that this approach does not use any information about a sub population other than its size.
2. Variance based Allocation : This strategy allocates samples in proportion to their variance. So a strata with high variance will have higher representation than one with smaller variance. This is logical as we need few samples to accurately estimate the parameters of a sub population when its variance is low. Any additional samples do not add much additional information or reduce the final estimation dramatically.

Reservoir Sampling

Reservoir Sampling is an algorithm that is widely used to take n random samples from a large database of size N. The real utility of reservoir sampling is realized when N is a large number or is not really known at sample time. This scenario is quite common when the input is a streaming data or when the database is frequently updated . Running the simple uniform random sampling algorithm (say Bernoulli sampling) is inefficient as N is large or the old tuples may be purged (or goes out of Sliding Window). Reservoir sampling allows you to get the random sample in a linear pass such that you only inspect any tuple at most once.

Reservoir Sampling with Single Sample

Consider the following contrived example. We have a database which is constantly updated and we want to have a single random tuple from it.

The base case occurs when there is only one tuple in the database. In this case our random sample is the first tuple. Hence the sample S = t1.

Lets say the database is updated and a new tuple t2 is added. The naive approach is to restart the entire sampling process. Instead, in reservoir sampling, we accept the new tuple as the random sample with probability $\frac{1}{2}$ . ie toss a coin which returns head with probability 0.5 and if it returns head , then replace t1 with t2.

We can see why S is a uniform sample. The probability that S contains t1 or t2 remains the same.

1. $Pr(S=t1) = 1 * \frac{1}{2} = \frac{1}{2}$ . The random sample is t1 when it was selected first into S (with probability 1) and then not rejected by t2 with probability $1-\frac{1}{2}$ .
2. $Pr(S=t2) = \frac{1}{2}$ . The random sample is t2 when it replaces t1 in the second step. This occurs with probability $\frac{1}{2}$

The database is updated and lets assume a new tuple t3 is added. We accept the new tuple as the random sample with probability $\frac{1}{3}$ . ie toss a coin which returns head with probability 0.33 and if it returns head , then replace the previous value of S (t1 or t2) with t3. More generally when inspecting the i-th tuple, accept it with probability $\frac{1}{i}$ .

It might look as if we are treating t3 unfairly because we only accept it with probability 0.33. But we can show probabilistically that the sample is still uniform. The probability that S is t1 or t2 or t3 remains the same.

1. $Pr(S=t1) = 1 \times \frac{1}{2} \times \frac{2}{3}=\frac{1}{3}$ . The only scenario when random sample is still t1 occurs when it was selected first into S (with probability 1) ,not rejected by t2 with probability $1-\frac{1}{2}$ and not rejected by t3 with probability $1-\frac{1}{3} = \frac{2}{3}$ .
2. $Pr(S=t2) = \frac{1}{2} \times \frac{2}{3} = \frac{1}{3}$ . The random sample is 2 when it replaces t1 in the second step. This occurs with probability $\frac{1}{2}$ . Then in the next step is not replaced by t3. This occurs with probability $1-\frac{1}{3} = \frac{2}{3}$ .
3. $Pr(S=t3) = \frac{1}{3}$ . The random sample is 3 when S contains either t1 or t2 and it is replaced by t3. This occurs with probability $\frac{1}{3}$ .

The pseudo code looks like :

S = t1
for i = 2 to N
Replace S with tuple t_i with probability $\frac{1}{i}$

Reservoir Sampling with Multiple Samples

A very similar approach works when the sample size is more than 1. Lets say that we need a sample of size n. Then we initially set the first n tuples of the database as the sample. The next steps is a bit different. In the previous case, there was only one sample so we replaced the sample with the selected tuple. When sample size is more than 1, then this steps splits to two parts :

1. Acceptance : For any new tuple t_i, we need to decide if this tuple enters the sample. This occurs with probability $\frac{n}{i}$ .
2. Replacement : Once we decided to accept the new tuple into the sample, some tuple already in the sample needs to make way. This is usually done randomly. We randomly pick a tuple in the sample and replace it with tuple t_i.

The pseudo code looks like :

Store first n elements into S
for i = n+1 to N
Accept tuple t_i with probability $\frac{n}{i}$
If accepted, replace a random tuple in S with tuple t_i

A coding trick that avoids the "coin tossing" by generating a random index and then accepts it if it is less than our sample size. The pseudo code looks like :

Store first n elements into S
for i = n+1 to N
    randIndex = random number between 1 and i
    if randIndex <= n
        replace tuple at index "randIndex" in the sample with tuple t_i

We can similarly analyze that the classical reservoir sampling does provide a uniform random sample. Please refer to the paper Random Sampling with a Reservoir by Jeffrey Vitter for additional details.

Sampling in AQP

As discussed above, our main aim is to discuss how sampling techniques is used in AQP. Let us assume that we have a sample S of size n. The usual strategy that we will follow is to apply any aggregate query on the sample S instead of database D. We then use the result of the query from S to estimate the result for D.

One thing to note is that we will only use aggregate queries for approximate processing. Specifically we will look at COUNT, SUM and AVERAGE. The formulas for estimating the values of the aggregate query for the entire database from the sample for these 3 operators is well studied. For additional details refer to the paper “Random Sampling from Databases" by Frank Olken.

Uniform Random Sample

1. Count : $\frac{\sum_{i=1}^{n} T_i p_i}{n} = \frac{\sum_{i=1}^{n} T_i \frac{1}{N}}{n}$ where $T_i$ is an indicator random variable that is 1 when tuple t_i satisfies our clause. $p_i$ is the probability that tuple will be selected into the sample. Intuitively, the formula finds the fraction of tuples in Sample which satisfied the query and applies the same fraction to the entire database.
2. Sum : $\frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n} = \frac{\sum_{i=1}^{n} x_i N}{n}$
3. Average : $\frac{Sum}{Count}$
\end{itemize}

Weighted Random Sample

1. Count : $\frac{\sum_{i=1}^{n} T_i p_i}{n}$ where $T_i$ is an indicator random variable that is 1 when tuple t_i satisfies our clause. $p_i$ is the probability that tuple will be selected into the sample. Intuitively, the formula reweighs each tuple according to the selection probability of the tuple.
2. Sum : $\frac{\sum_{i=1}^{n} x_i \frac{1}{p_i}}{n}$
3. Average : $\frac{Sum}{Count}$

Probability/Statistics Refresher

Here are few commonly used equations related to Expectation and variance.

1. E[X] = $\sum_{i=1}^{n} x_i p_i$
2. E[aX+b] = aE[X] + b
3. E[X+Y] = E[X] + E[Y] (also called Linearity of Expectations)
4. Var[X] = $E[ (X - E[X] ) ^2]$
5. Var[X+a] = Var[X]
6. Var[aX] = $a^2$ Var[X]
7. Var[X+Y] = Var[X] + Var[Y] if X and Y are independent.

Law of Large Numbers : This law is one of the fundamental laws in probability. Let $X_1,X_2 \ldots , X_n$ be random variables drawn iid. Very informally, as n increases, the average of the variables approaches the expected value of the distribution from which the variables are drawn. For eg, if we have a coin which provides head with probability 0.5 and toss it 10000 times, the number of heads will be very close to 5000.

Binomial Distribution: Suppose you repeat a Bernoulli trial with success probability of p , n times. The distribution of the number of successes in the n trials is provided by binomial distribution B(n,p). This is a very important distribution for modeling sampling with replacement. For eg if we perform Bernoulli sampling, the final size of the sample is a Binomial distribution. Also if we randomly pick n tuples from database of size N , the number of times any tuple is repeated in the sample can also modeled by Binomial distribution. The expected value is given by $E[X]=np$ and variance is given by $np(1-p)$ .

Normal Distribution : Normal distribution , aka Gaussian distribution, is one of the most important probability distributions. It is usually represented with parameters $N(\mu,\sigma^2)$ . It has the familiar bell curve shape. $\mu$ determines the center of the normal curve and $\sigma^2$ determines the width of it. A smaller value results in a tighter curve while a larger value results in a more flat/wide curve.

Equations (2) and (7) above gives us some detail about how expected value and variance of sum of two random variables can be computed from the expected value/variance of the constituent random variables . We can extend by induction for the rules to hold for the sum of arbitrarily large number of random variables. This introduces us to the concept of Central Limit Theorem.

Central Limit Theorem : This is one of the most fundamental rules in probability. Assume that we have n different random variables $X_1,X_2 \ldots , X_n$ . Each of these random variables have mean $\mu$ and variance $\sigma^2$ . For a large n, the distribution of sums $X = X_1 + X_2 + \ldots + X_n$ is normal with parameters $N(n \mu, n \sigma^2)$ . Similarly, the distribution of the averages , $X = \frac{X_1 + X_2 + \ldots + X_n}{n}$ is normal with parameters $N(\mu, \frac{\sigma^2}{n})$ .

Law of large numbers and Central limit theorem are two important results that allow analysis of the results. law of large number says that if we pick a large enough sample then the average of the sample will be closer to the true average of population. Central limit theorem states that if you repeat the sampling experiment multiple times and plot the distribution of the average value of the samples, they follow a normal distribution. Jointly, they allow you to derive the expression for Standard error.

Standard Error

The essential idea behind sampling is to perform the experiment on a randomly chosen subset of the population instead of the original population. So far we discussed how to perform the sampling and how to estimate the value from sample to the larger population. In this section let us discuss about the error in our prediction. The concept of standard error is usually used for the same.

Lets say we want to estimate the mean of the population $\mu_P$ . We performed the sampling and found the sample mean as $\mu_S$ . Since the sample mean is an unbiased estimator of the population mean we can announce that they are the same. But it need not always be the case that these two values are same. There are two common ways to analyze the error.

1. Absolute Error : $|\mu_P - \mu_S|^2$
2. Relative Error : $\frac{|\mu_P - \mu_S|^2}{\mu_P}$

The primary problem we have in estimating these error metrics is that we do not know the value of $\mu_P$ . We will use an indirect way to estimate the error.

Consider the sampling process again. We picked a sample $S={X_1,X_2,\ldots,X_n}$ of random tuples from the database and computed the sample mean as $\mu_S=\frac{\sum_{i=1}^{n} X_i}{n}$ . But since the tuples in S are picked randomly, the sample mean changes based on the sample. This means that the sample mean $\mu_S$ is itself a random variable.

Let the population $P={X_1,X_2,\ldots,X_N}$ . Then the population mean $\mu_P = \overline{X} = \frac{\sum_{i=1}^{N} X_i}{N}$ . Let the sample be $S={Y_1,Y_2,\ldots,Y_n}$ and sample mean is $\mu_S = \frac{\sum_{i=1}^{n} Y_i}{n}$ .

Standard Error when Sample Size = 1

Consider the case where the sample S consists of only one element. Our aim is to find population mean. As per our approach, we pick a random element $Y_1$ and then announce it as the sample mean ie ${\mu}_S=Y_1$ . The error in this case is the difference between $Y_1$ and $\mu_P$ . Since the sample is randomly picked , the sample mean is a random variable. Then it also implies that the error is also a random variable.

We can derive the expected value for the error as follows :
$E[err^2] = \sum_{i=1}^{N}\frac{(\overline{X}-X_i)^2}{N} = \sigma_P^2$ .

We can see that the expected value of the error is nothing but the standard deviation of the population !

Standard Error when Sample Size = 2

Consider the case where the sample S consists of exactly two elements. Our aim is to find population mean. As per our approach, we pick two random elements $Y_1,Y_2$ and then announce the sample mean as ${\mu}_S = \frac{Y_1+Y_2}{2}$ . The error in this case is the difference between $\mu_S$ and $\mu_P$ . Since the samples are randomly picked , the sample mean is again a random variable. Then it also implies that the error is also a random variable.

We can derive the expected value for the error as $E[err^2] = E[(\overline{X} - \overline{Y})^2] = Var[\overline{Y}]$ . We can see that the sample can be any one of $N^2$ different combination of elements . The calculation for variance might be tricky if not for the independence of samples. Since the two elements were picked randomly , they two are independent and we can use that to estimate the variance easily.

$Var[\overline{Y}] = Var [\frac{Y_1 + Y_2}{2}] = Var [\frac{Y_1}{2} + \frac{Y_2}{2}] = \frac{1}{4} Var[Y_1] + \frac{1}{4} Var[Y_2] = \frac{\sigma_P^2}{4} + \frac{\sigma_P^2}{4} = \frac{\sigma_P^2}{2}$

Note that we used rules 3, 6 and 7 from above.

Standard Error when Sample Size = n

We will not derive the formula here but we can easily show that when the sample contains n different elements the standard error is given by $E[err^2] = \frac{\sigma_P^2}{n} = \frac{Variance\;of\;population}{Sample\;size}$ .

There are some interesting things to note here :

1. The expected error when using 2 samples is less than that of the expected error when we used only one sample.
2. As the size of sample increases the error decreases even faster. The rate of decrease is inversely proportional to square of size of the sample size.
3. This also formalizes the intuition that if the variance of the population is less, we need less number of samples to provide estimates with small error.
4. Usually our work will dictate the tolerable error and we can use the formula to find the appropriate n that will make standard error less than our tolerance factor.

Hope this post was useful 🙂

Read Full Post »

Detailed discussion on NP-Completeness of Subset Sum

Posted in math, Theory, tagged algorithms, dynamic programming, knapsack, np-complete, polynomial, subset sum on February 7, 2011| 3 Comments »

I recently spent some time developing notes on Subset sum – specifically the NP-Completeness part of it. I thought I will share it with the blog readers.

Introduction

Subset sum is one of the very few arithmetic/numeric problems that we will discuss in this class. It has lot of interesting properties and is closely related to other NP-complete problems like Knapsack . Even though Knapsack was one of the 21 problems proved to be NP-Complete by Richard Karp in his seminal paper, the formal definition he used was closer to subset sum rather than Knapsack.

Informally, given a set of numbers S and a target number t, the aim is to find a subset S’ of S such that the elements in it add up to t. Even though the problem appears deceptively simple, solving it is exceeding hard if we are not given any additional information. We will later show that it is an NP-Complete problem and probably an efficient algorithm may not exist at all.

Problem Definition

The decision version of the problem is : Given a set S and a target t does there exist a subset $S^{'} \subseteq S$ such that $t = \sum_{s \in S'} s$ .

Exponential time algorithm approaches

One thing to note is that this problem becomes polynomial if the size of S’ is given. For eg,a typical interview question might look like : given an array find two elements that add up to t. This problem is perfectly polynomial and we can come up with a straight forward $O(n^2)$ algorithm using nested for loops to solve it. (what is the running time of best approach ?).

A slightly more complex problem asks for ,say, 3 elements that add up to t. Again, we can come up with a naive approach of complexity $O(n^3)$ . (what is the best running time?). The catch in the general case of subset sum is that we do not know $|S^{'}|$ . At the worst case $|S^{'}|$ is $O(n)$ and hence the running time of brute force approach is approximately $n^{O(n)}$ .

A slightly more efficient algorithm checks out all possible $2^n$ subsets. One typical way to do this is to express all numbers from 0 to $2^{n}-1$ in binary notation and form a subset of elements whose indexes are equal to the bit positions that correspond to 1. For eg, if n is 4 and the current number, in decimal, is say $10$ which in binary is 1010. Then we check the subset that consists of $1^{st}$ and $3^{rd}$ elements of S. One advantage of this approach is that it uses constant space. At each iteration, you examine a single number. But this approach will lead to a slower solution if $|S^{'}|$ is small. Consider the case where $t=S[\frac{n}{2}]$ . We will have to examine around $O(2^{\frac{n}{2}})$ different subsets to reach this solution.

A slightly different approach finds all possible sums of subsets and checks if t has occurred in the subset.

EXPONENTIAL-SUBSET-SUM(S,t):
n = |S|
$L_{0}$ = {0}
for i in 1 to n :
             $L_{i}$ = merge-lists( $L_{i-1}, L_{i-1} + S[i]$ )
             if $L_{i}$ has t, return true.
             remove all elements greater than t from $L_i$
if $L_{n}$ has t, return true else return false

This algorithm uses the notation S+x to mean ${s+x :s \in S}$ . Refer CLRS 35.5 for a discussion of a similar algorithm for a variant of subset sum problem.

NP-Completeness of Subset Sum Decimal

In this section we will prove that a specific variant of Subset sum is NP-Complete. Subset sum decimal is defined very similar to standard Subset sum but each number in S and also t is encoded in decimal digits.

We can show that Subset sum decimal is in class NP by providing the subset S’ as the certificate. Clearly, we can check if elements in S’ adds up to t in polynomial time.

The next step is to select another NP-Complete problem which can be reduced to Subset sum decimal. So far we have not discussed any arithmetic NP complete problems. The only non graph theoretic problem that we have discussed in 3SAT and we will use it for the proof. Of course there are multitude of other reductions including Vertex cover, 3 dimensional matching, partition etc.

We are now given a 3SAT formula $\phi$ with n variables – $x_1, x_2,\ldots,x_n$ and m clauses – $C_1, C_2,\ldots, C_m$ . Each clause $C_i$ contains exactly 3 literals. Our aim is to construct an instance of subset sum problem $<S,t>$ such that $\phi$ is satisfiable if and only if a solution to our instance of Subset sum decimal exists. The outline of the proof is as follows :

1. Construct a set S of unique large decimal numbers that somehow encode the constraints of $\phi$ . Additionally this operation must take polynomial time.
2. Construct an appropriate target t such that this instance of Subset sum decimal is solvable if and only if a solution to 3SAT instance exists. Handle complications like carries in addition.
3. Devise a way to find the satisfying assignment from subset solution and vice versa.

To simplify the proof, we make the following assumptions :

1. All the literals $x_1$ to $x_n$ is used in some clause of $\phi$ .
2. No clause can contain both a literal and its complement.

As a consequence of these assumptions, we do not have any variables that are superfluous. Also we do not have any clauses that get satisfied trivially.

We will not duplicate the proof in the lecture notes as a detailed sketch of the reduction is given in CLRS section 34.5.5. Instead we will focus on certain observations.

Observation 1 : Construction of S and t takes polynomial time

This is easy to see. For each variable $x_i$ we create 2 variables. Similarly we create two variables for each clause $C_j$ . The total number of variables in S is 2(m+n). Each number in set S and t contains exactly n+m digits. Hence the total construction takes time polynomial in n+m .

Observation 2 : There are no carries when elements in subset are added to form t.

We can see that the only allowed integers in number construction are 0,1 and 2. The columns corresponding to variables (the leading n digits) can add up to at the most 2. The columns corresponding to clauses (trailing m digits) cannot have a sum of more than 6. This is because of two facts :

(a) 3SAT has at most 3 literals in each clause

(b) A clause cannot contain a literal and its complement.

So, each variable can add at most 1 to that clause column and there at most 3 variables in a clause. Additionally, we have 1 and 2 from the slack variables. Concisely, we get at most 3 from $v_i$ or $v_i^{'}$ and 3 from $s_i$ and $s_i^{'}$ .

Hence we can conclude that carries does not occur at each column(digit) as the base we use is 10.

Observation 3 : All variables in S corresponding to $x_i$s are unique.

Each variable $x_i$ creates two variables $v_i$ and $v_i^{'}$ . The proof is in two parts :

(a) First we show that if $i \neq j$ , $v_i$ and $v_j$ does not match in the leading n digits. Similar argument holds for $v_i^{'}$ and $v_j^{'}$ .
(b) Next, we can show that $v_i$ does not equal to $v_i^{'}$ . This is because our assumption that a literal and its complement does not occur in the same clause. This means that the trailing m digits will not be equal.

In conclusion, no pair of variables in S corresponding to $x_i$ are equal.

Observation 4 : All variables in S corresponding to $C_i$ s are unique.

Each clause $C_i$ creates two variables $s_i$ and $s_i^{'}$ . If $i \neq j$ , $s_i (s_i^{'})$ and $s_j (s_j^{'})$ does not match in the trailing m digits. Additionally, by construction, $s_i \neq s_i^{'}$ as the digit position corresponding to $C_i$ has 1 for $s_i$ and 2 for $s_i^{'}$ .

Observation 5 : All variables in S is unique. i.e. S forms a set.

This can observed from Observation 3 and 4. By construction $v_i$ and $s_i$ do not match. Similar argument hold for $v_i^{'}$ and $s_i^{'}$ .

Observation 6 : New variables corresponding $x_i$ and $C_j$ are both needed for proof.

A detailed sketch is given in CLRS. The variables $v_i$ and $v_i^{'}$ created from $x_i$ makes sure that each variable has a unique boolean assignment of 0 or 1. Else the sum for that column in target will be 2. This is due to the assumption that all variables $x_i$ HAS to be used in some clause $C_j$ and hence has a unique assignment. Of course, it is possible that $\phi$ has multiple satisfying assignment but the target digit forces only one of them to be selected when you select the elements of subset $S^{'}$ .

The digits corresponding to clauses makes sure that each clause has at least one variable that evaluates to true. This is because each digit of slack variable corresponding to $C_i$ (ie $s_i,s_i^{'}$ ) contribute at most 3 towards t and hence the remaining (at least) 1 has to come from $v_j$ or $v_j^{'}$ s.

So variables $v_i$ ensure that each $x_i$ has a unique assignment. Variables $s_i$ ensure that each clause $C_j$ of $\phi$ is satisfied.

Observation 7 : Subset sum is NP complete if the numbers are expressed in base $b \geq 7$ .

From observation 2 , we know that the maximum possible digit due to summation of elements in S is 6. This means we can reuse the proof of Subset sum decimal to prove that Subset sum is NP-Complete for any base b that is greater that 6.

Observation 8 : Given S’ we can find a satisfying assignment for $\phi$ .

We know that any satisfying subset $S^{'}$ must include either $v_i$ or $v_i^{'}$ for $\forall i , 1 \leq i \leq n$ . If $S^{'}$ includes $v_i$ then set $x_i$ to 1. Else set it to 0.

Observation 9 Given a satisfying assignment for $\phi$ , we can find S’

This is a bit tricky and is done in two steps. More details can be found in CLRS proof.

1. If the satisfying assignment had $x_i$ , then select $v_i$ . Else select $v_i^{'}$ .
2. For each clause $C_j$ find how many variables in it evaluated to true due to the boolean assignment. At least one variable has to be true and at most 3 variables are true.

a. If $C_j$ has only one variable that evaluates to true, then select $s_j$ and $s_j^{'}$ .
b. If $C_j$ has two variables that evaluate to true, then select $s_j^{'}$ .
c. If $C_j$ has three variables that evaluate to true, then select $s_j$ .

Observation 10 : If $\phi$ is not satisfied, then S’ cannot be found.

If $\phi$ is not satisfied, then there exist at least one clause $C_j$ that is not satisfied. This means that for ${n+j}^{th}$ digit, the slack variables $s_j,s_j^{'}$ contribute only 3 but the corresponding digit in t has 4. Hence no S’ exists.

NP-Completeness of Subset Sum Binary

The formal definition of Subset sum binary is similar to Subset sum decimal . The only difference is that all numbers are encoded in bits.

We can notice that the above proof for Subset sum decimal holds only for numbers expressed in base of at least 7 (from observation 7). For bases from 1-6, the previous proof does not apply – partly due to the fact that there will be carries during addition. We need an alternate proof approach. Since we have proved Subset sum decimal as NP-Complete , we can use the result to prove Subset sum binary as NP-Complete.

The certificate is the subset S’ given in binary. We can see that it can be done in polynomial time and hence Subset sum binary is in NP.

The next step is to reduce Subset sum decimal to Subset sum binary. First we observe that any number encoded in decimal can be encoded to binary in polynomial time and vice versa. When given S and t in decimal as input, we encode them in binary and pass it to our Subset sum binary routine. The decision version of Subset sum binary returns true or false which can be fed directly as result of Subset sum decimal. In the optimization version , we just convert the $S’$ returned by the Subset sum binary subroutine to decimal.

Observation 11 : A decimal number can be converted to binary in polynomial time.

Assume some number n is encoded in both binary and decimal. This means $n = 10^k = 2^{k1}$ where k is the number of digits in the decimal representation and k1 is the number of bits needed to encode it.

Taking log to the base 2 on both sides,

$k * log_{2} {10} = {k1} \implies {3.3} {k} = {k1}$

So to express a decimal number with k digits, we need between 3k – 4k bits.

Observation 12 : Subset sum is NP complete for any base $b\geq 2$ .

The logarithms of the same number in two different bases differ by at most a constant. ie,

$log_{b1}^{b2} = \frac{log_{b1}^{n}}{log_{b2}^{n}}$ .

$log_{b1}^{b2}$ is a constant irrespective of n. So if n needs k digits in base b1, then it needs at most $\frac{k}{log_{b1}b2}$ to be represented in base b2. (Verify observation 11 using this equation !).

NP-Completeness of Subset Sum Unary

From observation 12, the only base left is 1 and this section handles the special case where all numbers are expressed in base 1. Subset sum unary is similar to Subset sum decimal where all numbers are expressed in unary notation. Numbers in base 1 are called as being represented in unary. Any number k is represented as $1^k$ which is a string of k 1’s. Let us check if Subset sum unary is NP-Complete .

The certificate is the subset where all elements are expressed in unary. If we are given numbers in unary, then verification takes time that is polynomial in the length of individual unary numbers. Hence Subset sum unary is in unary.

To prove Subset sum unary is in NP-Complete , we have to reduce either Subset sum decimal/binary to unary. Superficially, it looks straightforward and hence it seems as though Subset sum unary is in NP-Complete. But the catch is that expressing a number n in base b to unary needs time exponential when computed wrt the size of n’s representation in base b. For eg, representing a binary number n that needs k bits needs around $2^{k's}$ unary digits. We can see that $2^k$ is exponential when viewed from k.

In summary, converting a number from any base to unary takes exponential time. So we cannot use our reduction technique as there the reduction is not polynomial.

Dynamic Programming solution for Subset Sum Unary

What we showed above was that Subset sum unary is in NP but not NP-Complete. Here we show that there exists a dynamic programming formulation for this problem. We represent the problem as a matrix A of size n*t. A is a boolean matrix where the interpretation of cell A[i,j]=True is that there exists a subset of ${x_1,x_2,\ldots,x_i}$ that sum up to j. ie $\exists S^{'} \subseteq \{x_1,x_2,\ldots,x_i\}$ such that $j=\sum_{s \in S'} s$ .

The algorithm goes as follows :

SUBSET-SUM-UNARY(S,t):
Form matrix A
Set A[1,0] = True
Set A[1,j] = False unless j==S[1] in which case set A[1,j] to True
for i=2 to t
    for j=2 to n
        if A[i-1,j] == True
             A[i,j] = True
        else if A[i-1,j-x_i] == True
            A[i,j] = True
        else
            A[i,j] = False

Consider the set $S=\{2,3,4,5\}$ and let t=8. The worked out DP is given below :

	0	1	2	3	4	5	6	7	8
2	T	F	T	F	F	F	F	F	F
3	T	F	T	T	F	T	F	F	F
4	T	F	T	T	T	T	T	T	F
5	T	F	T	T	T	T	T	T	T

Since A[5,8]=True , we conclude that there exists a subset of S that sum up to t(8).

Strong and Weak NP-Complete Problems

Subset sum is interesting in the sense that its binary/decimal can be proved as NP-Complete but its unary version seems to allow a polynomial looking dynamic programming solution.

Looking at the dynamic programming solution carefully, the time (and space) complexity of the approach is $O(n*t)$ where n=|S| and t is the target. By itself, the DP solution looks feasible and ‘somehow’ polynomial. But one of the reasons that Subset sum is NP-Complete is due to the fact that it allows "large" numbers. If t is large, then the table A is huge and the DP approach takes a lot of time to complete.

Given S and t , there are two ways to define an polynomial algorithm. One uses the length of S ie n to measure algorithm complexity. From this angle, $O(n*t)$ is not polynomial. This is because t can be huge irrespective of n. For eg, we have have a small set with 4 elements but the individual elements (and t) are of the order , say, $O(10^{10})$ . But from the perspective of magnitude of t, this dynamic programming approach is clearly polynomial.

In other words, we have two ways to anchor our polynomial – $Length[S]$ and $Magnitude[t]$ . An algorithm is called pseudo polynomial, if its time complexity is bounded above by a polynomial function of two variables – $Length[S]$ and $Magnitude[t]$ .

Problems that admit pseudo polynomial algorithms are called weak NP-Complete problems and those that do not admit are called Strong NP-Complete problems. For example, Subset sum is a weak NP-Complete problem but Clique is a strong NP-Complete problem.

There are lot of interesting discussion about the strong/weak NP-Complete problems in both Garey and Johnson and in Kleinberg/Tardos. See references for more details.

Observation 13 : Only number theoretic problems admit pseudo polynomial algorithms.

Observation 14 : Strong NP-Complete problems do not admit a pseudo polynomial time algorithm unless P=NP.

References

1. CLRS 34.5.5 – Proof of NP-Completeness of Subset sum.
2. CLRS 35.5 – An exponential algorithm to solve a variant of subset sum problem.
3. Garey and Johnson 4.2 – Discussion of pseudo polynomial time algorithms along with strong and weak NP-Complete problems.
4. Kleinberg and Tardos 6.4 – Discusses a variant of the DP algorithm given in the lecture notes and the concept of pseudo polynomial time algorithms. Section 8.8 has an alternate NP-Completeness proof of Subset sum using vertex cover which you can skim through if interested.

Hope you enjoyed the discussion on various facets of Subset sum problem !