Feeds:
Posts

## Detailed discussion on NP-Completeness of Subset Sum

I recently spent some time developing notes on Subset sum – specifically the NP-Completeness part of it. I thought I will share it with the blog readers.

### Introduction

Subset sum is one of the very few arithmetic/numeric problems that we will discuss in this class. It has lot of interesting properties and is closely related to other NP-complete problems like Knapsack . Even though Knapsack was one of the 21 problems proved to be NP-Complete by Richard Karp in his seminal paper, the formal definition he used was closer to subset sum rather than Knapsack.

Informally, given a set of numbers S and a target number t, the aim is to find a subset S’ of S such that the elements in it add up to t. Even though the problem appears deceptively simple, solving it is exceeding hard if we are not given any additional information. We will later show that it is an NP-Complete problem and probably an efficient algorithm may not exist at all.

### Problem Definition

The decision version of the problem is :  Given a set S and a target t does there exist a subset $S^{'} \subseteq S$ such that $t = \sum_{s \in S'} s$ .

### Exponential time algorithm approaches

One thing to note is that this problem becomes polynomial if the size of  S’ is given. For eg,a typical interview question might look like : given an array find two elements that add up to t. This problem is perfectly polynomial and we can come up with a straight forward $O(n^2)$ algorithm using nested for loops to solve it. (what is the running time of best approach ?).

A slightly more complex problem asks for ,say, 3 elements that add up to t. Again, we can come up with a naive approach of complexity $O(n^3)$. (what is the best running time?). The catch in the general case of subset sum is that we do not know $|S^{'}|$. At the worst case $|S^{'}|$ is $O(n)$ and hence the running time of brute force approach is approximately $n^{O(n)}$.

A slightly more efficient algorithm checks out all possible $2^n$ subsets. One typical way to do this is to express all numbers from 0 to $2^{n}-1$ in binary notation and form a subset of elements whose indexes are equal to the bit positions that correspond to 1. For eg, if n is 4 and the current number, in decimal, is say $10$ which in binary is 1010. Then we check the subset that consists of $1^{st}$ and $3^{rd}$ elements of S. One advantage of this approach is that it uses constant space. At each iteration, you examine a single number. But this approach will lead to a slower solution if $|S^{'}|$ is small. Consider the case where $t=S[\frac{n}{2}]$. We will have to examine around $O(2^{\frac{n}{2}})$ different subsets to reach this solution.

A slightly different approach finds all possible sums of subsets and checks if t has occurred in the subset.

EXPONENTIAL-SUBSET-SUM(S,t):
n =  |S|
$L_{0}$ = {0}
for i in 1 to n  :
$L_{i}$ = merge-lists($L_{i-1}, L_{i-1} + S[i]$)
if $L_{i}$ has t, return true.
remove all elements greater than t from $L_i$
if $L_{n}$ has t, return true else return false

This algorithm uses the notation S+x to mean ${s+x :s \in S}$ . Refer CLRS 35.5 for a discussion of a similar algorithm for a variant of subset sum problem.

### NP-Completeness of Subset Sum Decimal

In this section we will prove that a specific variant of Subset sum is NP-Complete. Subset sum decimal is defined very similar to standard Subset sum but each number in S and also t is encoded in decimal digits.

We can show that Subset sum decimal is in class NP by providing the subset S’ as the certificate. Clearly, we can check if elements in S’ adds up to t in polynomial time.

The next step is to select another NP-Complete problem which can be reduced to Subset sum decimal. So far we have not discussed any arithmetic NP complete problems. The only non graph theoretic problem that we have discussed in 3SAT and we will use it for the proof. Of course there are multitude of other reductions including Vertex cover, 3 dimensional matching, partition etc.

We are now given a 3SAT formula $\phi$ with n variables – $x_1, x_2,\ldots,x_n$ and m clauses – $C_1, C_2,\ldots, C_m$. Each clause $C_i$ contains exactly 3 literals. Our aim is to construct an instance of subset sum problem $$ such that $\phi$ is satisfiable if and only if a solution to our instance of Subset sum decimal exists. The outline of the proof is as follows :

1. Construct a set S of unique large decimal numbers that somehow encode the constraints of $\phi$. Additionally this operation must take polynomial time.
2. Construct an appropriate target t such that this instance of Subset sum decimal is solvable if and only if a solution to 3SAT instance exists. Handle complications like carries in addition.
3. Devise a way to find the satisfying assignment from subset solution and vice versa.

To simplify the proof, we make the following assumptions :

1. All the literals $x_1$ to $x_n$ is used in some clause of $\phi$ .
2. No clause can contain both a literal and its complement.

As a consequence of these assumptions, we do not have any variables that are superfluous. Also we do not have any clauses that get satisfied trivially.

We will not duplicate the proof in the lecture notes as a detailed sketch of the reduction is given in CLRS section 34.5.5. Instead we will focus on certain observations.

Observation 1 : Construction of S and t takes polynomial time

This is easy to see. For each variable $x_i$ we create 2 variables. Similarly we create two variables for each clause $C_j$. The total number of variables in S is 2(m+n). Each number in set S and t contains exactly n+m digits. Hence the total construction takes time polynomial in n+m .

Observation 2 : There are no carries when elements in subset are added to form t.

We can see that the only allowed integers in number construction are 0,1 and 2. The columns corresponding to variables (the leading n digits) can add up to at the most 2. The columns corresponding to clauses (trailing m digits) cannot have a sum of more than 6. This is because of two facts :

(a) 3SAT has at most 3 literals in each clause

(b) A clause cannot contain a literal and its complement.

So, each variable can add at most 1 to that clause column and there at most 3 variables in a clause. Additionally, we have 1 and 2 from the slack variables. Concisely, we get at most 3 from $v_i$ or $v_i^{'}$ and 3 from $s_i$ and $s_i^{'}$.

Hence we can conclude that carries does not occur at each column(digit) as the base we use is 10.

Observation 3 : All variables in S corresponding to $x_i$s are unique.

Each variable $x_i$ creates two variables $v_i$ and $v_i^{'}$. The proof is in two parts :

(a) First we show that if $i \neq j$ , $v_i$ and $v_j$ does not match in the leading n digits. Similar argument holds for $v_i^{'}$ and $v_j^{'}$.
(b) Next, we can show that $v_i$ does not equal to $v_i^{'}$. This is because our assumption that a literal and its complement does not occur in the same clause. This means that the trailing m digits will not be equal.

In conclusion, no pair of variables in S corresponding to $x_i$ are equal.

Observation 4 : All variables in S corresponding to $C_i$s are unique.

Each clause $C_i$ creates two variables $s_i$ and $s_i^{'}$. If  $i \neq j$, $s_i (s_i^{'})$ and $s_j (s_j^{'})$ does not match in the trailing m digits. Additionally, by construction, $s_i \neq s_i^{'}$ as the digit position corresponding to $C_i$ has 1 for $s_i$ and 2 for $s_i^{'}$.

Observation 5 : All variables in S is unique. i.e. S forms a set.

This can observed from Observation 3 and 4. By construction $v_i$ and $s_i$ do not match. Similar argument hold for $v_i^{'}$ and $s_i^{'}$.

Observation 6 : New variables corresponding $x_i$ and $C_j$ are both needed for proof.

A detailed sketch is given in CLRS. The variables $v_i$ and $v_i^{'}$ created from $x_i$ makes sure that each variable has a unique boolean assignment of 0 or 1. Else the sum for that column in target will be 2. This is due to the assumption that all variables $x_i$ HAS to be used in some clause $C_j$ and hence has a unique assignment. Of course, it is possible that $\phi$ has multiple satisfying assignment but the target digit forces only one of them to be selected when you select the elements of subset $S^{'}$.

The digits corresponding to clauses makes sure that each clause has at least one variable that evaluates to true. This is because each digit of slack variable corresponding to $C_i$ (ie $s_i,s_i^{'}$) contribute at most 3 towards t and hence the remaining (at least) 1 has to come from $v_j$ or $v_j^{'}$s.

So variables $v_i$ ensure that each $x_i$ has a unique assignment. Variables $s_i$ ensure that each clause $C_j$ of $\phi$ is satisfied.

Observation 7 : Subset sum is NP complete if the numbers are expressed in base $b \geq 7$.

From observation 2 , we know that the maximum possible digit due to summation of elements in S is 6. This means we can reuse the proof of Subset sum decimal to prove that Subset sum is NP-Complete for any base b that is greater that 6.

Observation 8 : Given S’ we can find a satisfying assignment for $\phi$.

We know that any satisfying subset $S^{'}$ must include either $v_i$ or $v_i^{'}$ for $\forall i , 1 \leq i \leq n$. If $S^{'}$ includes $v_i$ then set $x_i$ to 1. Else set it to 0.

Observation 9 Given a satisfying assignment for $\phi$ , we can find S’

This is a bit tricky and is done in two steps. More details can be found in CLRS proof.

1. If the satisfying assignment had $x_i$ , then select $v_i$. Else select $v_i^{'}$.
2. For each clause $C_j$ find how many variables in it evaluated to true due to the boolean assignment. At least one variable has to be true and at most 3 variables are true.

a. If $C_j$ has only one variable that evaluates to true, then select $s_j$ and $s_j^{'}$.
b. If $C_j$ has two variables that evaluate to true, then select $s_j^{'}$.
c. If $C_j$ has three variables that evaluate to true, then select $s_j$.

Observation 10 : If $\phi$ is not satisfied, then S’ cannot be found.

If $\phi$ is not satisfied, then there exist at least one clause $C_j$ that is not satisfied. This means that for ${n+j}^{th}$ digit, the slack variables $s_j,s_j^{'}$ contribute only 3 but the corresponding digit in t has 4. Hence no S’ exists.

### NP-Completeness of Subset Sum Binary

The formal definition of Subset sum binary is similar to Subset sum decimal . The only difference is that all numbers are encoded in bits.

We can notice that the above proof for Subset sum decimal holds only for numbers expressed in base of at least 7 (from observation 7). For bases from 1-6, the previous proof does not apply – partly due to the fact that there will be carries during addition. We need an alternate proof approach. Since we have proved Subset sum decimal as NP-Complete , we can use the result to prove Subset sum binary as NP-Complete.

The certificate is the subset S’ given in binary. We can see that it can be done in polynomial time and hence Subset sum binary is in NP.

The next step is to reduce Subset sum decimal to Subset sum binary. First we observe that any number encoded in decimal can be encoded to binary in polynomial time and vice versa. When given S and t in decimal as input, we encode them in binary and pass it to our Subset sum binary routine. The decision version of Subset sum binary returns true or false which can be fed directly as result of Subset sum decimal. In the optimization version , we just convert the $S’$ returned by the Subset sum binary subroutine to decimal.

Observation 11 : A decimal number can be converted to binary in polynomial time.

Assume some number n is encoded in both binary and decimal. This means $n = 10^k = 2^{k1}$ where k is the number of digits in the decimal representation and k1 is the number of bits needed to encode it.

Taking log to the base 2 on both sides,

$k * log_{2} {10} = {k1} \implies {3.3} {k} = {k1}$

So to express a decimal number with k digits, we need between 3k – 4k bits.

Observation 12 : Subset sum is NP complete for any base $b\geq 2$.

The logarithms of the same number in two different bases differ by at most a constant. ie,

$log_{b1}^{b2} = \frac{log_{b1}^{n}}{log_{b2}^{n}}$.

$log_{b1}^{b2}$ is a constant irrespective of n. So if n needs k digits in base b1, then it needs at most $\frac{k}{log_{b1}b2}$ to be represented in base b2. (Verify observation 11 using this equation !).

### NP-Completeness of Subset Sum Unary

From observation 12, the only base left is 1 and this section handles the special case where all numbers are expressed in base 1. Subset sum unary is similar to Subset sum decimal where all numbers are expressed in unary notation. Numbers in base 1 are called as being represented in unary. Any number k is represented as $1^k$  which is a string of k 1’s. Let us check if Subset sum unary is NP-Complete .

The certificate is the subset where all elements are expressed in unary. If we are given numbers in unary, then  verification takes time that is polynomial in the length of individual unary numbers. Hence Subset sum unary is in unary.

To prove Subset sum unary is in NP-Complete , we have to reduce either Subset sum decimal/binary to unary. Superficially, it looks straightforward and hence  it seems as though Subset sum unary is in NP-Complete. But the catch is that expressing a number n in base b to unary needs time exponential when computed wrt the size of n’s representation in base b. For eg, representing a binary number n that needs k bits needs around $2^{k's}$ unary digits. We can see that $2^k$ is exponential when viewed from k.

In summary, converting a number from any base to unary takes exponential time. So we cannot use our reduction technique as there the reduction is not polynomial.

### Dynamic Programming solution for Subset Sum Unary

What we showed above was that Subset sum unary is in NP but not NP-Complete. Here we show that there exists a dynamic programming formulation for this problem. We represent the problem as a matrix A of size n*t. A is a boolean matrix where the interpretation of cell A[i,j]=True is that there exists a subset of ${x_1,x_2,\ldots,x_i}$ that sum up to j. ie $\exists S^{'} \subseteq \{x_1,x_2,\ldots,x_i\}$ such that $j=\sum_{s \in S'} s$.

The algorithm goes as follows :

SUBSET-SUM-UNARY(S,t):
Form matrix A
Set A[1,0] = True
Set A[1,j] = False unless j==S[1] in which case set A[1,j] to True
for i=2 to t
for j=2 to n
if A[i-1,j] == True
A[i,j] = True
else if A[i-1,j-x_i] == True
A[i,j] = True
else
A[i,j] = False

Consider the set $S=\{2,3,4,5\}$ and let t=8. The worked out DP is given below :

 0 1 2 3 4 5 6 7 8 2 T F T F F F F F F 3 T F T T F T F F F 4 T F T T T T T T F 5 T F T T T T T T T

Since A[5,8]=True , we conclude that there exists a subset of S that sum up to t(8).

### Strong and Weak NP-Complete Problems

Subset sum is interesting in the sense that its binary/decimal can be proved as NP-Complete but its unary version seems to allow a polynomial looking dynamic programming solution.

Looking at the dynamic programming solution carefully, the time (and space) complexity of the approach is $O(n*t)$ where n=|S| and t is the target. By itself, the DP solution looks feasible and ‘somehow’ polynomial. But one of the reasons that Subset sum is NP-Complete is due to the fact that it allows "large" numbers. If t is large, then the table A is huge and the DP approach takes a lot of time to complete.

Given S and t , there are two ways to define an polynomial algorithm. One uses the length of S ie n to measure algorithm complexity. From this angle, $O(n*t)$ is not polynomial. This is because t can be huge irrespective of n. For eg, we have have a small set with 4 elements but the individual elements (and t) are of the order , say, $O(10^{10})$ . But from the perspective of magnitude of t, this dynamic programming approach is clearly polynomial.

In other words, we have two ways to anchor our polynomial – $Length[S]$ and $Magnitude[t]$. An algorithm is called pseudo polynomial, if its time complexity is bounded above by a polynomial function of two variables – $Length[S]$ and $Magnitude[t]$ .

Problems that admit pseudo polynomial algorithms are called weak NP-Complete problems and those that do not admit are called Strong NP-Complete problems. For example, Subset sum is a weak NP-Complete problem but Clique is a strong NP-Complete problem.

There are lot of interesting discussion about the strong/weak NP-Complete problems in both Garey and Johnson and in Kleinberg/Tardos. See references for more details.

Observation 13 : Only number theoretic problems admit pseudo polynomial algorithms.

Observation 14 : Strong NP-Complete problems do not admit a pseudo polynomial time algorithm unless P=NP.

### References

1. CLRS 34.5.5 – Proof of NP-Completeness of Subset sum.
2. CLRS 35.5 – An exponential algorithm to solve a variant of subset sum problem.
3. Garey and Johnson 4.2 – Discussion of pseudo polynomial time algorithms along with strong and weak NP-Complete problems.
4. Kleinberg and Tardos 6.4 – Discusses a variant of the DP algorithm given in the lecture notes and the concept of pseudo polynomial time algorithms. Section 8.8 has an alternate NP-Completeness proof of Subset sum using vertex cover which you can skim through if interested.

Hope you enjoyed the discussion on various facets of Subset sum problem !

1. Quantum Entanglement Holds DNA Together, Say Physicists
TR summarizes it best – Speculative but potentially explosive work. I learned that Quantum theory is used for things other than Physics when I learnt about Quantum Mind.  Where is this field heading ?

2. Windows 8 Plans Leaked: Numerous Details Revealed
An old story by now but has lot of neat ideas. My favorite idea is to bring appstore inside Windows. I am not sure it will click but it is still pretty clever.

3. Python internals: adding a new statement to Python
A neat article that shows how to add Ruby’s until statement to Python. It is quite striking how easy it is.

4. A Math Problem Solver Declines a \$1 Million Prize
As expected , Grigory Perelman has declinded the Clay foundation’s award.

5. Does Cantor’s Diagonalization Proof Cheat?
Another very neat article from Prof.Lipton. I liked the way he explained Diagonalization idea using games. Very cool !

6. Drone Alone: How Airliners May Lose Their Pilots
In an interesting experiment , FAA has initiated research for civil aircraft to share airspace with remotely piloted UAVs – If successful, I am sure people will use it for Cargo flights. In another interesting idea in airspace , Ryanair to sell £5 tickets for standing-room only flights .

7. iTunes accounts hacking more widespread than initially thought. The facts, and what you should do
The big news in the weekend. Looks like some of my friends too were affected.

1. Visions for Theoretical Computer Science
The basic aim of the this page is to highlight the broad research directions in TCS in such a way that even lay persons can understand. I think they have done an admirable job.

2. Express social objects in Atom format
I learnt about Activity Streams from Diaspora’s page on technology they are evaluating. I also played around with Google Buzz’s activity stream data. The idea looks very cool even though I feel the standard is less expressive.

3. How to Make an Artificial Cell
A detailed discussion of the recent break through by Craig Venter and his team.

4. Bringing improved PDF support to Google Chrome
Good move. I would be glad if they bring Flash support too 😉 In the other Google news, they introduced a command line tool to control many of their services. It is limited by gdata APIs but still there are lot of them. See more details at Introducing the Google Command Line Tool.

5. Tech News
Lot of interesting stuff happening in Tech World . There is a new price war between Amazon and Barnes & Noble – See E-Reader Prices Are Slashed . Also check out Swype and SeaMicro .

1. Improving Brand Recognition in TV Ads
A HBS research on how to improve brand recognition and reduce viewers from switching channels. The idea of pulsing looks pretty interesting.

2. HP’s ePrint: Print From Devices With No Printing Support
Idea looks very promising. I have some concerns about security and spamming but pretty cool idea.

3. Disentangling Gaussians
The post talks about a paper in STOC by Valiant. Basically, given a normal distribution which is internally a mixture of two other normal distributions – how to figure out the parameters of individual distributions given the parameters of the combined distribution. There are two interesting things :  it is solvable in polynomial time and it uses moments . I remember learning moments and wondering how the higher moments will be used – This looks like a great example.

4. 3-D Without the Glasses
Looks like a cool idea. The fact that it uses eye tracking may mean that it may not scale very well. But it will still find application in our living rooms as 3D TVs if not in conference rooms as projectors. I hope this product does not go the way as courier went 🙂

5. Chris Lattner gets first SIGPLAN award for LLVM work
LLVM is a fantastic idea and I am happy that Lattner is being recognized with a SIGPLAN award.

6. Linux Stuff
Lot of interesting things are happening in the Linux world . Demystifying Grilo discusses the promising library of grilo for media discovery.  It’s a matter of plugins…  discusses libpeas architecture for new gedit plugins.  I had talked about gEdit plugins at How to convert gEdit to gEdit++ .

1. The Shadow Case Model of Complexity
An interesting post by Prof.Lipton. I dont claim that understand BBG model (probably I should put some effort there) but I have to say their lemma looks very attractive to me.

2. WiFi data collection: An update
Looks like all big companies are getting tripped up with privacy.

3. Google Voice invites for students
I had applied and have got a Google Voice invite and have activated the service. I am reading the tutorial to know how best to use this tool. Thanks to Suresh for the info.

4. Speedy New Traders Make Waves Far From Wall St.
Pretty interesting. Average holding time for a stock is 11 seconds and no loss day in 4 years – Wow impressive  !

5. A New Clue to Explain Existence
Some interesting news from Tevatron which refuses to fade away in the face of LHC 🙂 .

If you liked this post , please subscribe to the RSS feed.

1. Breaking Provably Secure Systems
A very thoughtful post from Prof.Lipton. One (very) interesting point is that most of the security proofs don’t fully capture all possible attacks and that some of the assumptions the proofs make are not reasonable. Overall , an excellent read.

2. Google Responds To Privacy Concerns With Unsettlingly Specific Apology
A hilarious article from Onion. Found the article from Greg Linden.

I am not sure how exactly this whole thing works but the news is pretty exciting ! Identifying films from MRI scans – sounds like science fiction comes true !

4. Health Reform Myths
A combative Op-Ed from Paul Krugman on Health Care myths.

5. I*L and The Gamechangers
Transcript of online chat with Fake IPL Player organized by Cricket Next. Had lot of fun reading it. Link from Lokesh.

6. New Site Unmasks Chatroulette Players
A interesting mashup that uses Google Maps and Chatroulette and gives screenshots of people using Chatroulette and their location. Pretty interesting !

: : : : :

Wired has an article on Google titled How Google’s Algorithm Rules the Web . Nothing special but worth a read. Greg’s commentary on the article is here . Greg also makes a couple of valid points – point of diminishing returns for Google  and search quality differences becoming small between browsers in the long term .

An interesting post on "How Google Docs Leaks Your Identity" . (It is fixed now !)

Google’s official post on the recent bizarre (bone headed) ruling in Italy is here.

2. FOCS/STOC : What’s the big deal ?
A provocative article from Mitzenmacher. An interesting reply by David Karger is here.

3. Expanding Our Commitment to Facebook Credits
Facebook last week announced a partnership with PayPal to make purchasing virtual credits easier. Some more info on Facebook’s efforts on it is here. Frankly, I don’t really understand why the idea Of virtual economy credits makes sense. It really baffles me that some person is willing to spend real bucks for some thing virtual ! From the net it seems that it is on its way to become a billion dollar industry . Crazy !

4. What Do People Ask Their Social Networks?
A post about MIT/Microsoft Research on the type of questions asked in Social Networks. There are no real surprises except that rhetorical questions form a larger chunk than one expect. The post had some interesting points – People ask questions to their social networks because “I trust my friends more than I trust strangers,” while the second-largest response was “A search engine can provide data but not an opinion.” Good point !

5. Is the process of evolution finding the Nash Equilibrium between genomes? and Dating, Sex, and Game Theory
Certainly Game Theory is being applied to more and more exotic fields !

6. Choosing Consistency
Amazon made an announcement on SimpleDB consistency enhancements in their AWS blog here . A more detailed discussion is at Werner Vogel’s post Choosing Consistency .