Feeds:
Posts

## Detecting Mixtures of Genres in Movie Dialogues

Author Note : My blog is getting the first guest post ! This post is written by dear friend and fellow geek Kripa who just graduated from Columbia University. He specializes in Machine Learning, NLP and Data Mining. Over the last few months,  he has been doing some amazing exploratory work in Data Mining. This blog post describes one such work. Here he describes how he applied LDA on movie scripts to determine the mixture of genres. Aside from being a cool application of LDA, finding such genre mixture has multiple applications. For eg, this can provide a mechanism to improve movie recommendations. You can even use it to deconstruct the work of your favorite artist !

Every movie predominantly falls into a particular genre that can be one of war, crime, adventure, thriller, comedy, romance, etc. Sometimes there is more than one genre a movie can belong to. For example a movie like Minority Report though predominantly belongs to sci-fi has very strong elements of crime in the storyline. Given a movie, I wanted to look at some approximate percentages of genres contained in it.

If I throw a movie like Pearl Harbor at my system, I expect it to return an output like:

war: 35%
crime: 13%
comedy: 13%
horror: 9%
romance: 19%
sci-fi: 11%

I decided to consider only these six genres as others like Adventure, Action, Thriller, etc are mostly overlaps. For example, action is a broad category – war and crime could also be considered action; and a thriller could be a medley of action and crime. For my experiment I considered only those genres which I felt were distinct from each other.

### Data

The first question is what kind of data to go after? Movie plots are available from the IMDB dataset. They do express the major genre well but they don’t contain a lot of information to get the minor genres out. Wikipedia usually contains a detailed plot, but I found something better: Internet Movie Script Database (imsdb here forth). This is an online repository of movie scripts which contain dialogues and also contextual scene information.

For example consider this scene from The Patriot:

Marion leaves the workshop with Susan at his side. Nathan and Samuel walk past, exhausted from their day in the field.
NATHAN: Father, I saw a post rider at the house.
MARION: Thank you. Did you finish the upper field?
SAMUEL: We got it all cut and we bundled half of it.
MARION: Those swimming breaks cut into the day, don’t they?

This contains both the dialogues and the description of the scene. It struck me as the right kind of data I should be looking at. Also this had a good amount of scripts (~1000 movies) for running a quick experiment. I did a quick and dirty wget of all the webpages followed by a bunch of shell scripts to weed out html tags and expressions such as "CAMERA PANS OUT" and "SCENE SHIFTS TO" that add no information to the script. I also removed the stop words and proper nouns. After pre-processing I had just the text of dialogues and the scene description.

Now that I have data, how did I go about finding the composition of genres in a movie?

### Supervised vs Unsupervised

Let us define the problem elaborately. I have the pre-processed imsdb content which is the data and its genre is its label. A simple classification task will involve extracting features from the labeled data and training it using a classifier to build a model. This model can then be used to predict the genre of a movie that is out of the training data. This is a supervised approach where we already have the labeled data – genre of a movie. For my task of detecting a mixture of genres, all I would need to do is to tweak the classifier output so that it gives the individual probabilities of each genre a movie can possibly belong to. But is this the right approach? Will the output probabilities be close to the actual representation of genres in a movie?

There are couple of issues with this supervised approach.

• For training using a classifier, I need to assign a label to a script. For this problem I should be assigning more than one label to an instance. This is not possible. If during traning, I were to assign a single label (genre) to a whole movie, I’m making a strong (and wrong!) assumption that each movie script represents one genre.
• Consider an example like Minority Report . Here imdb does not list ‘Crime’ as a category at all! We know very well that minority report is a ‘Sci-Fi Crime’ movie. The validity of labels itself poses an issue here.

My expectation from an unsupervised approach is simple. I have a huge collection of unlabeled scripts. I’m assuming that each script contains a mixture of genres.

I now need to extract features (words) that are relevant to each genre from this corpus. In better words, given a corpus of scripts, the unsupervised system needs to discover from the corpus a bunch of topics and list the most relevant features (words) that belong to each topic. I would then be able to use the features to infer the topic mixture of each individual script. This is known as Topic Modelling . Intuitively I’m trying to generate a bag of words model out of a corpus of scripts.

### Latent Dirichlet Allocation

LDA is a topic model and one which is most widely used today. I will not spend much effort in giving a rigorous treatment on LDA since there are excellent resources available online. I’m listing those that I found extremely useful for understanding LDA.

1. Prof. David Blei gave a talk on LDA and its applications at NYU during the NYC Machine Learning Week sometime in February 2011. There is a nice video recording of it: Part 1 and Part 2
2. You can find Prof.Blei’s implementation of LDA in C on his webpage.
3. Edwin Chen’s blog article is one of the best sources for grasping the intuition behind LDA even if you are mathematically less inclined.

A word of warning here! I will write a brief note about my understanding of LDA. Honestly, there are better ways of explaining LDA. At this point all you need know is that LDA is a statistical black box – throw a bunch of documents at it; specify ‘k’ – the number of topics you think are represented by the documents and LDA will output ‘k’ topic vectors. Each topic vector contains words arranged in decreasing probabilities of being associated with that particular topic. You need to know ONLY THIS! So feel free to skip to the next section.

In the case of Maximum Likelihood given n data points, we assume the underlying distribution that generated this data is a Gaussian and fit the data to the best Gaussian possible. LDA also makes a similar assumption that there is a hidden structure to the data. And that hidden structure is a multinomial whose parameter $\theta$ comes from a Dirichlet Prior.

Let us say that I want to generate a random document; I don’t care if its meaningful or not. I first fix the number of words I would want to generate in that document. I can on the other hand draw a random number from say a Poisson. Once I have the number of words (N) to be generated, I go ahead to generate those many words from the corpus.

Each word is generated thus: draw a $\theta$ from a Dirichlet. (Dirichlet is a distribution over the simplex.) Consider $\alpha$ as the parameter that decides the shape of the Dirichlet similar to how mean and variance decide the shape of the Gaussian bell curve. In a 3-D space, for some choice of $\alpha$, consider the probability to be more near (1,0,0), (0,1,0) and (0,0,1). For some other choice of $\alpha$ all points in the 3-D simplex might get the same probability! This represents what kind of topic mixtures I can generally expect! (If my initial guess is that each document has only one topic, mostly I will choose an $\alpha$ such that I get more probability on the (1,0,0) points. This is just a prior which could be wrong! And in this way it is not strictly analogous to Maximum Likelihood).

So I have an $\alpha$ now and I draw a sample from the Dirichlet. What I actually get is a vector that sums up to 1. I call this $\theta$.

Remember that I’m trying to generate a random document and I haven’t generated a single word yet! The $\theta$ I have is my guess on the topic vector! I have obtained this $\theta$ by sampling from a k-dimensional vector (here k=3 in the above example.)

Now that $\theta$ represents a topic vector which can also be re-imagined as a probability distribution and because any draw is guaranteed to be from the simplex, I can use this drawn vector ($\theta$) as the weights of a loaded ‘k’ faced die. And I throw this die! Lets say it shows up 5 (a number between 1 and k). I will now say that the word I’m going to generate belongs to Topic-5.

I have not yet generated a word! To generate a word that belongs to a topic, I need a |V| faced die. |V| is the size of the vocabulary of the corpus. How do I get such a huge die?!

I will get that in a similar way as for the topic vector. I will sample again from a Dirichlet – but a different Dirichlet – one that is over a v-dimensional simplex. Any draw from this Dirichlet will give a v-faced die. Call this the Dirichlet $\beta$. For each topic ($\theta$) you need a different v-faced die ($\beta$). Thus I end up drawing ‘k’ such ‘v’-faced dice.

So for topic-5, I throw the 5th v-faced die. Let us say it shows 42; I then go to the vocabulary and pick the 42nd word!  I will do this whole process ‘N’ times (N was the number of words to be generated for the random document.)

The crux of this discussion is this: for every document , the dice (i.e samples from the dirichlet($\alpha$) and dirichelt($\beta$) ) are generated only once. It is just that to generate each word, the dice are rolled multiple times. Once to get a topic and once to get a word given this topic!

• ### Running LDA on Movie Scripts

I converted the imsdb data to a format accepted by Prof. Blei’s code. For more information you should look at his README. I initially ran the code with k = 10 (guessing that the movie scripts could represent a mixture of 10 genres). I will jump from here to give a sneak peek at the end results. The end results have been very encouraging. A sample of words clustered under different topics is here:

At this point eyeballing the clusters it is apparent that the starting from the red cloud, clockwise, the clusters represent romance, sci-fi, war, horror, comedy and crime respectively.

Just running LDA on the data with k=10 did not turn out such relevant results. So what was that extra mile which had to be crossed?

LDA does not always guarantee relevant features. Nor is every topic discovered meaningful. After the first run, by eyeballing through the 10 topics, I was able to distinguish 3 topics very easily – war, romance and sci-fi. Playing around with different ‘k’ values did not yield more discernible topics. There are reasons to this:

• ‘war’ and ‘sci-fi’ are distinctly different from other genres – they are pretty exclusive.
• words belonging to ‘romance’ were getting repeated in other topics. This clearly indicated a skew in the data.
• so if I remove the scripts of genres that have been already detected from the dataset, maybe LDA will detect more latent topics due to a change in the underlying data distribution caused by removal of data.

I removed the scripts corresponding to war and sci-fi from the dataset. This was achieved by comparing each script in the dataset against the top features from these topics. Each script was scored based on the occurance of features. I removed the scripts that scored greater than a threshold. The new dataset D* contained scripts other than war and science fiction movies. Then I ran LDA on this new dataset D*. Now I found that the topics pertaining to crime and comedy were getting discovered but their features were mangled with romance. This feature set was improved by removing words which were common to most categories and were not exclusive to one particular category.

This paper deals with using human judgments to examine topics. This is one way of picking the odd feature out of a topic vector.  The final feature set looked something like this.

### Results

I wrote a simple python script that does soft-clustering based on occurance and count of features. A sample of the results is here:

The results were quite convincing to look at. I wanted to obtain some hard metrics. IMSDB already categorises movies based on its genre. So I consider the category as the actual label. A movie is considered to be rightly classified if the label matches one of the top two genres detected by my system. The following are some encouraging results!

These proportions cannot be considered accurate, but they do give an idea of what to expect from a movie! The complete list of results is here .

## Detailed Tutorial on Linear Programming

### Introduction

Linear and Integer programming play an important role in algorithm design. Lot of combinatorial optimization problems can be formulated as Integer Programming problems. But as we will see later, Integer Programming is NP-Complete. One potential solution to solve the Integer Programming problems in polynomial time using Linear Programming with the technique of LP relaxation. Thus, Linear Programming and the concepts of rounding and duality are very useful tools in the design of approximation algorithms for many NP-Complete problems. In this lecture, we will discuss the theory of Linear and Integer Programming

### Background

Before discussing Linear Programming let us discuss some mathematical background .

### Optimization

In an optimization problem, we are given a real valued function f whose domain is some set A and we want to find the parameter which maximizes or minimizes the value of f. This function is also called as objective function. An optimization problem can either be a maximization or a minimization problem. In a maximization (minimization) problem, the aim is to find the parameter which maximizes (minimizes) the value of f. In this lecture, we will primarily discuss about maximization problems because maximizing f(x) is same as minimizing -f(x).

Optimization problems can be broadly classified into two types : Unconstrained and constrained optimization problems. In an unconstrained optimization problem , we are given a function f(x) and our aim is to find the value of x which maximizes/minimizes f(x). There are no constraints on the parameter x. In constrained optimization, we are given a function f(x) which we want to maximize/minimize subject to the constraints g(x). We will focus more on constrained optimization because most real world problems can be formulated as constrained optimization problems.

The value x that optimizes f(x) is called the optima of the function. So, a maxima maximizes the function while the minima minimizes the function. A function can potentially have many optima. A local optimum x of a function is a value that is optimal within the neighborhood of x. Alternatively, a global optimum is the optimal value for the entire domain of the function. In general, the problem of finding the global optima is intractable. However, there are certain subclasses of optimization problems which are tractable.

### Convex Optimization

One such special subclass is that convex optimization where there are lot of good algorithms exist to find the optima efficiently. Linear programming happens to be a special case of convex optimization.

Convex sets :
A set C is convex if for $x,y \in C$ and any $\alpha \in [0,1]$,

$\alpha x + (1-\alpha) y \in C$

Informally, in a geometric sense, for any two points x,y in a convex region C, all the points in the line joining x and y are also in C. For eg the set of all real numbers , $\mathbb{R}^n$ is a convex set. One of the interesting properties of such sets is that the intersection of two convex sets is again a convex set. We will exploit this feature when discussing the geometric interpretation of Linear Programming.

Convex Functions :
A real valued function f(x) is convex if for any two points $x,y \in domain(f)$, and $\alpha \in [0,1]$ ,

$f(\alpha x + (1-\alpha) y) \leq \alpha f(x) + (1-\alpha) f(y)$

Linear/affine functions are, for example, convex. One very neat property of convex functions is that any local optimum is also a global optimum. Additionally, the maximum of a convex function on a compact convex set is attained only on the boundary . We will be exploiting these properties when discussing algorithms that solve linear programming.

### Linear Programming

Linear Programming is one of the simplest examples of constrained optimization where both the objective function and constraints are linear. Our aim is to find an assignment that maximizes or minimizes the objective while also satisfying the constraints. Typically, the constraints are given in the form of inequalities. For eg, the standard (or canonical) form of a maximization linear programming instance looks like :

$\begin{array}{ll} {\rm maximize} & \mbox{} c_1 x_1 + c_2 x_2 + \ldots c_n x_n \\ {\rm subject\ to}\\ \end{array}$

$a_{11}x_1 + a_{12} x_2 \ldots a_{1n} x_n \leq b_1 \\ a_{21}x_1 + a_{22} x_2 \ldots a_{2n} x_n \leq b_2 \\ \ldots \\ a_{m1}x_1 + a_{m2} x_2 \ldots a_{mn} x_n \leq b_m \\ \forall i , x_i \geq 0$

A minimization linear programming instance looks like :

$\begin{array}{ll} {\rm minimize} & \mbox{} c_1 x_1 + c_2 x_2 + \ldots c_n x_n \\ {\rm subject\ to}\\ \end{array}$

$a_{11}x_1 + a_{12} x_2 \ldots a_{1n} x_n \geq b_1 \\ a_{21}x_1 + a_{22} x_2 \ldots a_{2n} x_n \geq b_2 \\ \ldots \\ a_{m1}x_1 + a_{m2} x_2 \ldots a_{mn} x_n \geq b_m \\ \forall i , x_i \geq 0$

Any assignment of values for variables $x_1 \ldots x_n$ which also satisfies the constraints is called as a feasible solution. The challenge in Linear Programming is to select a feasible solution that maximizes the objective function. Of course, real world linear programming instances might have different inequalities – for eg $\geq$ instead of $\leq$ or the value of $x_i$ is bound between $[p_i , q_i]$ . But in almost all such cases, we can convert that linear programming instance to the standard form in a finite time. For the rest of the lecture, we will focus on linear programming problems in standard form as this simplifies the exposition.

The standard form of linear programming also enables us to concisely represent the linear programming instance using linear algebra. For eg, the maximization problem can be written as ,

$\begin{array}{ll} {\rm maximize} & \mbox{} {\bf c}^T {\bf x}\\ {\rm subject\ to}\\ & A{\bf x} \leq {\bf b} \\ & \mbox{} {\bf x} \geq {\bf {0}} \end{array}$

In this representation, c is a column vector which represents the co-efficient of the objective function . x is also a column vector of the variables involved in the Linear Programming . Hence the equation $c^Tx$ represents the objective function. $x \geq 0$ represent the non negativity constraints. A is matrix of dimensions $n \times m$ and each of its row represents an inequality. Each of the column represent a variable. So the inequality $a_{k1} x_1 + a_{k2} x_2 + \ldots + a_{kn} x_n \leq b_{k}$  becomes the $k^{th}$ row of whose elements are $a_{k1},a_{k2}, \ldots , a_{kn}$. The column vector b represents the RHS of each inequality and the $k^{th}$ entry has the value $b_k$ .

The linear algebra representation of minimization problem is :

$\begin{array}{ll} {\rm minimize} & \mbox{} {\bf c}^T {\bf x}\\ {\rm subject\ to}\\ & A{\bf x} \geq {\bf b} \\ & \mbox{} {\bf x} \geq {\bf {0}} \end{array}$

### Geometric Interpretation of Linear Programming

For the ease of visualization, let us take a Linear Programming instance with just 2 variables. Any assignment for the variables can be easily visualized as a 2D point in a plane.

Consider a simple Linear Programming instance with 2 variables and 4 constraints :

$\begin{array}{ll} {\rm maximize} & x_1 + x_2\\ {\rm subject\ to}\\ & x_1 + 2x_2 \leq 1\\ & 2x_1 + x_2\leq 1\\ & x_1 \geq 0\\ & x_2 \geq 0 \end{array}$

Each of the 4 inequalities divide the plane into two regions – one where the inequality is satisfied and one where it does not. For eg, the inequality $x_1 + 2 x_2 \leq 1$ divides the plane into two regions: the set of points(x1,x2) such that $x_1 + 2 x_2 > 1$ which are not feasible for our problem  and the points such that $x_1 + 2 x_2 \leq 1$ where the inequality is satisfied. The first set does not contain any feasible solution but the second set could contain points which are feasible solution of the problem provided they also satisfy the other inequalities. The line equation $x_1 + 2x_2 =1$ is the boundary which separates these regions. In the figure below, the shaded region contains the set of points which satisfy the inequality.

Each of the four inequalities individually form a convex region where the feasible solution for the respective inequality is satisfied. The feasible region of the problem is found by intersecting all the individual feasible regions. By the property of convex sets, the problem’s feasible region is also a convex region. For this example, the feasible region geometrically looks like :

Since the feasible region is not empty, we say that this linear programming instance is feasible. Sometimes, the intersection might result in an empty feasible region and the problem becomes infeasible. Yet another possibility occurs when the feasible region extends to infinity and the problem becomes unbounded.

The discussion above extends naturally to higher dimensions. If the linear programming instance consists of $n$ variables, then each of the inequalities divide the space $\mathbb{R}^n$ into two regions where in one of them the inequality is satisfied and in another it is not. The hyperplane $a_{k1}x_1 + a_{k2}x_2 + \ldots + a_{kn}x_n = b_k$ forms the boundary for constraint $C_k$. The feasible region formed by intersection of all the constraint hyperplane is a polytope.

### Approaches to solve Linear Programming

Once we have found the feasible region, we have to explore the region to find the point which maximises the objective function. Due to feasible region being convex and the objective function being a convex function (linear function), we know that the optimum lies only in the boundary. Once we find a local optima we are done as convex function on a compact convex set implies that this is also the global optima of the function.

One thing that simplifies the search is the fact that the optima can only exist in the vertices of the feasible region. This is due to two facts : (1) Optima of convex function on a convex set occurs on the boundary . (2) Due to convexity property any point on the line xy between x and y can be expressed as a linear combination of x and y which implies that (x , y) and (f(x),f(y)) are indeed the extremal points in their respective spaces for the line xy.

In the 2D example, the objective function is a line that sweeps the 2D space. At any point in the sweep, all the points in the line have the same value of the objective function. In other words, at each step the line becomes a contour line. The figure below shows the contour values for few instances of the sweep.

Each of the vertices of the feasible region are points that is the solution of the linear equations formed when two constraints are changed to equalities. In other words, each of the vertices are formed by the intersection of the boundary line of two constraints. For eg, the point (0.5,0) is obtained by converting the inequalities $x_1 \geq 0$ and $2x_1+x_2 \leq 1$ to equalities and solving for them. So in a two variable linear programming instance with m constraints, there can atmost be ${m \choose 2}$ vertices. This results in a simple brute force algorithm (assuming the problem is feasible):

Brute force algorithm for 2D :

1. For every unique pair of constraints (inequalities) $C_i,C_j$
1. Convert them to equalities and find the point that satisfies them.
2. Check if the point satisfies other m-2 inequalities. If not discard the point.
3. Else, find the cost of this point in the goal function
2. Return the point with the largest cost for goal function.

There are atmost ${m \choose 2}$ vertices and it takes O(m) to check if the point is feasible and hence this algorithm takes O(m^3). This naive approach works for higher dimensions also. In a linear programming problem with n variables and m constraints, each vertex in the feasible region are formed by converting some n inequalities out of m to equalities and solving this system. There are now ${m \choose n}$ potential vertices and the naive algorithm takes around $O(m^{n+1})$ and results in an exponential time algorithm.

Clearly, we can do better. One approach is to prune the feasible region more aggressively. That is , if our current maximum value of the goal function for some vertex is C then do not even bother to check the vertices that have a lower cost. This makes sense practically and is the intuition behind the popular Simplex algorithm developed by George Dantzig.

As we discussed above, each of the vertices of the feasible region are points are the solution the linear system where $n$ constraints are changed to equalities. For eg, let vertex V was obtained by solving some n equations $e_1,e_2 \ldots e_n$. There are potentially, n neighbors for V. To find each of them , we can set equation $e_i$ back to inequality (other equations are unaltered) and find the set of feasible points for the new system (with n-1 equations and rest inequalities).

So the informal outline of the simplex algorithm (assuming the problem is feasible) is :

Simplex Algorithm  :

1. Find an arbitrary starting vertex V.
2. Find the neighbors of V.
1. Find the cost of each neighbour
2. If no neighbour has a cost greater that V, declare it as optimal assignment.
3. If some neighbour $V_{new}$ exists whose cost is greater than V, set $V= V_{new}$ and repeat step 2.
3. Return the point with the largest cost for goal function.

Unfortunately, the worst case cost of simplex algorithm is still exponential and takes $O(m^{n+1})$ . This is because of the fact that it has to check all the ${m \choose n}$ vertices in the worst case. Even though this does not appear better than the naive algorithm, in practice, it runs very fast.

The simplex algorithm is really a family of algorithms with different set of heuristics for selecting initial vertex, breaking ties when there are two vertices with same cost and avoiding cycles. However, presence of heuristics make the precise analysis of simplex algorithm hard. For eg, the outline above uses a random starting point and the simple greedy heuristic to pick the next vertex.

Let us see how simplex fares in our example. Lets say the initial vertex was (0,0). This vertex was obtained from the equations $x_1=0,x_2=0$. The cost of this point is 0. So we do not need to check any points with cost of below 0. The figure below shows the feasible region we must still explore and the line $x_1+x_2 = 0$ which is our initial stab at pruning the region.

Now we check the neighbours of (0,0). We can obtain the point (0.5,0) by discarding the equation $x_1=0$ and including the equation $2x_1+x_2=1$. Similarly, we can obtain the point (0,0.5) by discarding the equation $x_2=0$ and including the equation $x_1+2x_2=1$. Both of the vertices have same cost of 0.5 and lets arbitrarily pick the vertex (0.5,0). So the new lower bound of the objective function becomes 0.5 and we do not need to check any more points in the region where $x_1 + x_2 \leq 0.5$. The figure below shows the new region to explore.

The only neighbour of (0.5,0) in the new feasible region is (0.33,0.33) and the cost of this point is 0.66. So the new lower bound of the objective function becomes 0.66 and we do not need to check any more points in the region where $x_1 + x_2 \leq 0.66$. When we try to find the feasible region where $x_1+x_2 > 0.66$ we do not find any and we output this point as the optimum. The figure below shows the final result.

### Ellipsoid Algorithm

Even though simplex algorithm is exponential, there exist other polynomial time algorithm for solving linear programming. One of the earliest is the Ellipsoid algorithm by Leonid Khachiyan. This algorithm applies not just for linear programming but for convex optimization in general. In contrast to simplex algorithm which checks the vertices of the polytope both ellipsoid and interior point algorithm follow a different outline. Ellipsoid algorithm in particular uses an approach very close to $bisection$ method used in unconstrained optimization. Assume that the problem is feasible ie has a nonempty feasible region and the optimal solution for Linear Programming is $x^{*}$.

Ellipsoid Algorithm  :

1. Find an ellipsoid which contains the entire feasible region (including the point $x^{*}$).
2. Construct a cutting plane that splits the ellipsoid to two half-ellipsoids.
3. Using an (cutting plane) oracle, find which half-ellipsoid contains $x^{*}$.
4. Construct the minimal volume ellipsoid that contain the half-ellipsoid with the optimal solution.
5. Repeat steps (2)-(4) till the ellipsoid is "small" enough . Output the point within the ellipsoid as optimal.

Even though the ellipsoid algorithm is polynomial, it is quite expensive. Usually simplex or interior point algorithms are preferred to solve Linear Programming . The complexity of ellipsoid algorithm is approximately $O(L m^4)$ where L informally, is the number of bits of input to the algorithm. The high cost of ellipsoid algorithm is attributable to the cutting plane oracle which says how the ellipsoid must be cut to form half-ellipsoids and also constructing a minimal volume ellipsoid over the half-ellipsoid that contains the optimal point.

### Interior Point Algorithms

Interior point methods are a family of algorithms which can solve Linear Programming in polynomial time. This is also a generic algorithm applicable for convex optimization in general. This algorithm starts with an initial point inside the polytope and slowly generates a sequence of points that are successively closer to the optimal point. Barrier methods, which penalize the point for violating constraints (and hence go out of polytope) are used to steer the point towards optimal point. This process is repeated till the current point is arbitrarily close to the optimal vertex.

### Principle of Duality

The concept of duality plays an important role in Linear Programming and approximation algorithm design but we discuss it only very briefly. For additional details refer to resources section.

Consider a maximization linear program instance.

$\begin{array}{ll} {\rm maximize} & \mbox{} {\bf c}^T {\bf x}\\ {\rm subject\ to}\\ & A{\bf x} \leq {\bf b} \\ & \mbox{} {\bf x} \geq {\bf {0}} \end{array}$

and let us call it as the primal Linear Programming instance. Then, the dual of this instance is a minimization Linear Programming instance,

$\begin{array}{ll} {\rm minimize} & \mbox{} {\bf b}^T {\bf y}\\ {\rm subject\ to}\\ & A^T{\bf y} \geq {\bf c} \\ & \mbox{} {\bf y} \geq {\bf {0}} \end{array}$

Informally, the role of variables and constraints are swapped between primal and dual problem – ie for each constraint in primal there exist one variable in dual . Similarly, for each variable in primal , there exist one constraint in dual. The upper bounds of each of the constraints became the coefficient of the dual problem. Each of the variables in primal became constraints and the coefficients of primal become their respective lower bounds.  If the primal is a maximization problem, dual is a minimization problem and vice versa.

### Facts :

1. Dual of a linear programming problem in standard form is another linear programming problem in standard form. If the primal problem is a maximization, the dual is a minimization and vice versa.
2. Dual of dual is primal itself.
3. Feasible solutions of dual provide a way to verify if the optimal value of the primal is $\leq C$ for some constant in C. Since we can verify the certificate in polynomial time, Linear programming is in $NP \bigcap Co-NP$.
4. If both primal and dual are feasible, then their optimal values are same.  (aka Strong duality).
5. Every feasible solution of dual problem gives a lower bound on the optimal value of the primal problem. In other words, the optimum of the dual is an upper bound to the optimum of the primal (aka Weak duality).
6. Weak duality is usually used to derive approximation factors in algorithm design.

### References and Resources

1. Luca Trevisan’s lecture notes on Linear Programming at http://cs.stanford.edu/people/trevisan//cs261/lecture05.pdf . The example and 2D geometric interpretation were inspired from his notes. http://cs.stanford.edu/people/trevisan//cs261/lecture06.pdf contains a discussion of principle of duality.
2. http://www.stanford.edu/class/msande310/lecture07.pdf contains a worst case example of Simplex algorithm and also a discussion of the Ellipsoid algorithm.