
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Dynamic programming is a general-purpose AlgorithmDesignTechnique that is most often used to solve CombinatorialOptimization problems. There are two parts to dynamic programming. The first part is a programming technique: dynamic programming is essentially DivideAndConquer run in reverse: as in DivideAndConquer, we solve a big instance of a problem by breaking it up recursively into smaller instances; but instead of carrying out the computation recursively from the top down, we start from the bottom with the smallest instances of the problem, solving each increasingly large instance in turn and storing the result in a table. The second part is a design principle: in building up our table, we are careful always to preserve alternative solutions we may need later, by delaying commitment to particular choices to the extent that we can.

The bottom-up aspect of dynamic programming is most useful when a straightforward recursion would produce many duplicate subproblems. It is most efficient when we can enumerate a class of subproblems that doesn't include too many extraneous cases that we don't need for our original problem.

To take a simple example, suppose that we want to compute the n-th Fibonacci number using the defining recurrence

F(n) = F(n-1) + F(n-2), with F(0) = F(1) = 1.

A naive approach would simply code the recurrence up directly:

int
fib(int n)
{
    if(n < 2) {
        return 1;
    } else {
        return fib(n-1) + fib(n-2);
    }
}

The running time of this procedure is easy to compute. The recurrence is

T(n) = T(n-1) + T(n-2) + Θ(1),

whose solution is Θ(a^n), where a is the golden ratio 1.6180339887498948482... . This is badly exponential.[1]

1. Memoization

The problem is that we keep recomputing values of fib that we've already computed. We can avoid this by memoization, where we wrap our recursive solution in a memoizer that stores previously-computed solutions in a HashTable. Sensible[2] programming languages will let you write a memoizer once and apply it to arbitrary recursive functions. In less sensible programming languages it is usually easier just to embed the memoization in the function definition itself:

int
memoFib(int n)
{
    int ret;

    if(hashContains(FibHash, n)) {
        return hashGet(FibHash, n);
    } else {
        ret = memoFib(n-1) + memoFib(n-2);
        hashPut(FibHash, n, ret);
        return ret;
    }
}

The assumption here is that FibHash is a global hash table that we have initialized to map 0 and 1 to 1. The total cost of running this procedure is O(n), because memoFib is called at most twice for each value k in 0..n.

Memoization is a very useful technique in practice, but it is not popular with algorithm designers because computing the running time of a complex memoized procedure is often much more difficult than computing the time to fill a nice clean table. The use of a hash table instead of an array may also add overhead (and code complexity) that comes out in the constant factors. But it is always the case that a memoized recursive procedure considers no more subproblems than a table-based solution, and it may consider many fewer if we are sloppy about what we put in our table (perhaps because we can't easily predict what subproblems will be useful).

2. Dynamic programming

Dynamic programming comes to the rescue. Because we know what smaller cases we have to reduce F(n) to, instead of computing F(n) top-down, we compute it bottom-up, hitting all possible smaller cases and storing the results in an array:

int
fib2(int n)
{
    int *a;
    int i;
    int ret;

    if(n < 2) {
        return 1;
    } else {
        a = malloc(sizeof(*a) * (n+1));
        assert(a);

        a[0] = a[1] = 1;

        for(i = 2; i <= n; i++) {
            a[i] = a[i-1] + a[i-2];
        }
    }

    ret = a[n];
    free(a);
    return ret;
}

Notice the recurrence is exactly the same in this version as in our original recursive version, except that instead of computing F(n-1) and F(n-2) recursively, we just pull them out of the array. This is typical of dynamic-programming solutions: often the most tedious editing step in converting a recursive algorithm to dynamic programming is changing parentheses to square brackets. As with memoization, the effect of this conversion is dramatic; what used to be an exponential-time algorithm is now linear-time.

2.1. More examples

2.1.1. Longest increasing subsequence

Suppose that we want to compute the longest increasing subsequence of an array. This is a sequence, not necessarily contiguous, of elements from the array such that each is strictly larger than the one before it. Since there are 2^n different subsequences of an n-element array, it will take a while to try all of them by BruteForce.

What makes this problem suitable for dynamic programming is that any prefix of a longest increasing subsequence is a longest increasing subsequence of the part of the array that ends where the prefix ends; if it weren't, we could make the big sequence longer by choosing a longer prefix. So to find the longest increasing subsequence of the whole array, we build up a table of longest increasing subsequences for each initial prefix of the array. At each step, when finding the longest increasing subsequence of elements 0..i, we can just scan through all the possible values for the second-to-last element and read the length of the best possible subsequence ending there out of the table. When the table is complete, we can scan for the best last element and then work backwards to reconstruct the actual subsequence.

This last step requires some explanation. We don't really want to store in table[i] the full longest increasing subsequence ending at position i, because it may be very big. Instead, we store the index of the second-to-last element of this sequence. Since that second-to-last element also has a table entry that stores the index of its predecessor, by following the indices we can generate a subsequence of length O(n), even though we only stored O(1) pieces of information in each table entry.

Here's what the code looks like:

/* compute a longest strictly increasing subsequence of an array of ints */
/* input is array a with given length n */
/* returns length of LIS */
/* If the output pointer is non-null, writes LIS to output pointer. */
/* Caller should provide at least sizeof(int)*n space for output */
/* If there are multiple LIS's, which one is returned is arbitrary. */
unsigned long
longest_increasing_subsequence(const int a[], unsigned long n, int *output);
lis.h

#include <stdlib.h>
#include <assert.h>

#include "lis.h"

unsigned long
longest_increasing_subsequence(const int a[], unsigned long n, int *output)
{
    struct lis_data {
        unsigned long length;             /* length of LIS ending at this point */
        unsigned long prev;               /* previous entry in the LIS ending at this point */
    } *table;

    unsigned long best;      /* best entry in table */
    unsigned long scan;      /* used to generate output */

    unsigned long i;
    unsigned long j;
    unsigned long best_length;

    /* special case for empty table */
    if(n == 0) return 0;

    table = malloc(sizeof(*table) * n);

    for(i = 0; i < n; i++) {
        /* default best is just this element by itself */
        table[i].length = 1;
        table[i].prev = n;              /* default end-of-list value */

        /* but try all other possibilities */
        for(j = 0; j < i; j++) {
            if(a[j] < a[i] && table[j].length + 1 > table[i].length) {
                /* we have a winner */
                table[i].length = table[j].length + 1;
                table[i].prev = j;
            }
        }
    }

    /* now find the best of the lot */
    best = 0;

    for(i = 1; i < n; i++) {
        if(table[i].length > table[best].length) {
            best = i;
        }
    }

    /* table[best].length is now our return value */
    /* save it so that we don't lose it when we free table */
    best_length = table[best].length;

    /* do we really have to compute the output? */
    if(output) {
        /* yes :-( */
        scan = best;
        for(i = 0; i < best_length; i++) {
            assert(scan >= 0);
            assert(scan < n);

            output[best_length - i - 1] = a[scan];

            scan = table[scan].prev;
        }
    }

    free(table);

    return best_length;
}
lis.c

A sample program that runs longest_increasing_subsequence on a list of numbers passed in by stdin is given in test_lis.c. Here is a Makefile.
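
Since those links may not work in this static copy, here is a rough sketch of what such a driver might look like. It is an illustration only, not the original test_lis.c, and it assumes whitespace-separated integers on stdin:

/* hypothetical driver, not the original test_lis.c */
#include <stdio.h>
#include <stdlib.h>

#include "lis.h"

int
main(void)
{
    int *a;
    int *output;
    int value;
    unsigned long n = 0;
    unsigned long size = 16;
    unsigned long len;
    unsigned long i;

    a = malloc(sizeof(int) * size);
    if(a == 0) return 1;

    /* read integers from stdin until EOF, growing the array as needed */
    while(scanf("%d", &value) == 1) {
        if(n >= size) {
            size *= 2;
            a = realloc(a, sizeof(int) * size);
            if(a == 0) return 1;
        }
        a[n++] = value;
    }

    output = malloc(sizeof(int) * (n > 0 ? n : 1));
    if(output == 0) return 1;

    len = longest_increasing_subsequence(a, n, output);

    /* print the length, then the subsequence itself */
    printf("%lu\n", len);
    for(i = 0; i < len; i++) {
        printf("%d\n", output[i]);
    }

    free(output);
    free(a);
    return 0;
}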

2.1.2. All-pairs shortest paths

Suppose we want to compute the distance between any two points in a graph, where each edge uv has a length l_uv (+∞ for edges not in the graph) and the distance between two vertices s and t is the minimum over all s-t paths of the total length of the edges. There are various algorithms for doing this for a particular s and t, but there is also a very simple dynamic programming algorithm known as Floyd-Warshall that computes the distance between all n^2 pairs of vertices in Θ(n^3) time. This algorithm will not be described here; see ShortestPath.
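
That said, the core loop is short enough to sketch here without spoiling ShortestPath. This sketch assumes the distances are kept in a flattened n-by-n array of doubles, with INFINITY (from <math.h>) marking missing edges and 0 on the diagonal:

#include <stddef.h>
#include <math.h>     /* for INFINITY */

/* d is an n-by-n matrix flattened as d[i*n + j]; on entry it holds the */
/* edge lengths, on exit the shortest-path distances between all pairs */
void
floyd_warshall(double *d, size_t n)
{
    size_t i, j, k;

    /* after iteration k, d[i*n + j] is the length of the shortest i-j path */
    /* whose intermediate vertices all lie in 0..k */
    for(k = 0; k < n; k++) {
        for(i = 0; i < n; i++) {
            for(j = 0; j < n; j++) {
                if(d[i*n + k] + d[k*n + j] < d[i*n + j]) {
                    d[i*n + j] = d[i*n + k] + d[k*n + j];
                }
            }
        }
    }
}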

2.1.3. Longest common subsequence

Given sequences of characters v and w, v is a subsequence of w if every character in v appears in w in the same order. For example, aaaaa, brac, and badar are all subsequences of abracadabra, but badcar is not. A longest common subsequence (LCS for short) of two sequences x and y is the longest sequence that is a subsequence of both: two longest common subsequences of abracadabra and badcar are badar and bacar.

One can find the LCS of two sequences by BruteForce, but it will take a while; there are 2^n subsequences of a sequence of length n, and for each of these subsequences of the first sequence it will take some additional time to check whether it is a subsequence of the second. It is better to solve the problem using dynamic programming. Having sequences gives an obvious linear structure to exploit: the basic strategy will be to compute LCSs for increasingly long prefixes of the inputs. But with two sequences we will have to consider prefixes of both, which will give us a two-dimensional table where rows correspond to prefixes of sequence x and columns correspond to prefixes of sequence y.

The recursive decomposition that makes this technique work looks like this: the LCS of x[1..i] and y[1..j] is the longest of

  1. the LCS of x[1..i-1] and y[1..j-1] followed by x[i], if x[i] = y[j];
  2. the LCS of x[1..i-1] and y[1..j]; or
  3. the LCS of x[1..i] and y[1..j-1].

The idea is that we either have a new matching character we couldn't use before (the first case), or we have an LCS that doesn't use one of x[i] or y[j] (the remaining cases). In each case the recursive call to LCS involves a shorter prefix of x or y. So we can fill in these values in a table, as long as we are careful to make sure that the shorter prefixes are always filled first. Here's a short C program that does this:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include <limits.h>

/* compute longest common subsequence of argv[1] and argv[2] */

/* computes longest common subsequence of x and y, writes result to lcs */
/* lcs should be pre-allocated by caller to 1 + minimum length of x or y */
void
longestCommonSubsequence(const char *x, const char *y, char *lcs)
{
    int xLen;
    int yLen;
    int i;             /* position in x */
    int j;             /* position in y */

    xLen = strlen(x);
    yLen = strlen(y);

    /* if either string is empty, the LCS is empty */
    if(xLen == 0 || yLen == 0) {
        lcs[0] = '\0';
        return;
    }

    /* best choice at each position */
    /* length gives length of LCS for these prefixes */
    /* prev points to previous substring */
    /* newChar if non-null is new character */
    struct choice {
        int length;
        struct choice *prev;
        char newChar;
    } best[xLen][yLen];

    for(i = 0; i < xLen; i++) {
        for(j = 0; j < yLen; j++) {
            /* we can always do no common substring */
            best[i][j].length = 0;
            best[i][j].prev = 0;
            best[i][j].newChar = 0;

            /* if we have a match, try adding new character */
            /* this is always better than the nothing we started with */
            if(x[i] == y[j]) {
                best[i][j].newChar = x[i];
                if(i > 0 && j > 0) {
                    best[i][j].length = best[i-1][j-1].length + 1;
                    best[i][j].prev = &best[i-1][j-1];
                } else {
                    best[i][j].length = 1;
                }
            }

            /* maybe we can do even better by ignoring a new character */
            if(i > 0 && best[i-1][j].length > best[i][j].length) {
                /* throw away a character from x */
                best[i][j].length = best[i-1][j].length;
                best[i][j].prev = &best[i-1][j];
                best[i][j].newChar = 0;
            }

            if(j > 0 && best[i][j-1].length > best[i][j].length) {
                /* throw away a character from y */
                best[i][j].length = best[i][j-1].length;
                best[i][j].prev = &best[i][j-1];
                best[i][j].newChar = 0;
            }

        }
    }

    /* reconstruct string working backwards from best[xLen-1][yLen-1] */
    int outPos;        /* position in output string */
    struct choice *p;  /* for chasing linked list */

    outPos = best[xLen-1][yLen-1].length;
    lcs[outPos--] = '\0';

    for(p = &best[xLen-1][yLen-1]; p; p = p->prev) {
        if(p->newChar) {
            assert(outPos >= 0);
            lcs[outPos--] = p->newChar;
        }
    }
}

int
main(int argc, char **argv)
{
    if(argc != 3) {
        fprintf(stderr, "Usage: %s string1 string2\n", argv[0]);
        return 1;
    }

    char output[strlen(argv[1]) + 1];

    longestCommonSubsequence(argv[1], argv[2], output);

    puts(output);

    return 0;
}
lcs.c

The whole thing takes O(nm) time, where n and m are the lengths of x and y.

3. Dynamic programming: algorithmic perspective

These are mostly old notes from CS365; CS223 students should feel free to ignore this part.

3.1. Preserving alternatives

Like any DivideAndConquer algorithm, a dynamic programming algorithm needs to have some notion of what problems are big and what are small, so that we know that our recursive decomposition is making progress. Often a dynamic programming algorithm works by walking up this problem ordering. The simplest case (as in fib2 above) is when the ordering is linear: every problem has an index 1, 2, 3, ..., and we simply solve problems in increasing order. Dynamic programming naturally lends itself to any problem we can put in a line, and a common application in combinatorial optimization is solving problems with an underlying temporal structure, where we want to maximize our profit after n steps by first figuring out how to maximize our profit after n-1 steps. However, in figuring out the n-1 step solution, we may have to preserve multiple potential solutions, because the choices we make during these steps may limit our options later. Preserving these alternatives is what distinguishes dynamic programming from the GreedyMethod, where we just grab what we can now and hope for the best.

To take a simple example, consider the following problem: a large chemical plant can be in any of states a, b, c, d, etc. The plant starts at the beginning of the day in state a (all vats empty and clean, all workers lined up at the entry gate) and ends the day n steps later in the same state. In between, the chemical plant makes its owners money by moving from state to state (for example, moving from a to b might involve filling vat number 126 with toluene and moving from state r to state q might involve dumping the contents of vat 12 into vat 17 and stirring the resulting mixture vigorously). We don't care about the details of the particular operations, except that we assume that there is a profit function p(i,j) that tells us how much money we make (or lose, if p(i,j) is negative) when we move from state i to state j. Our goal is to find a sequence of n+1 states with state a at the start and end, that maximizes our total profit.

This is a good candidate for solution using dynamic programming, and our approach will be to keep around increasingly large partial sequences that tell us what to do for the first k steps. The only tricky part is that certain profitable partial sequences might lead us to final states from which it is very expensive to get back to state a (e.g., chemical plant is on fire and venting toxic fumes over most of New England while board of directors flees to Brazil carrying suitcases full of $100 bills). But on the other hand if we know what the final state of a partial sequence is, we don't particularly care how we got there. So we can recursively decompose the problem of finding the best sequence of n steps into finding many best sequences of n-1 steps:

Plan(n, final state):
  for each state s:
    BestPlan[n-1, s] = Plan(n-1, s)
  Find s that maximizes profit of BestPlan[n-1, s] + p(s, final state)
  return BestPlan[n-1,s] with final state appended

This is easily turned upside-down to get a dynamic programming algorithm:

Plan(n, final state):
  BestPlan[0,a] = "a"
  BestPlan[0,s] = impossible, for every other state s
  for i = 1 to n:
    for each s:
      Find s1 that maximizes profit of BestPlan[i-1,s1] + p(s1,s)
      BestPlan[i,s] = BestPlan[i-1,s1] with s appended
  return BestPlan[n, final state]

The running time of this algorithm is easily seen to be Θ(nm^2), where n is the number of steps in the schedule and m is the number of states. Note that we can optimize the storage requirements by keeping around only two rows of the BestPlan array at a time.
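
As a concrete illustration, here is a sketch in C of the profit-only version, under assumptions not found in the pseudocode above: the m states are numbered 0..m-1 with state a being state 0, the profits are given as a flattened m-by-m array of doubles, and we keep only the two most recent rows of the table as suggested above.

#include <stdlib.h>
#include <assert.h>

#define UNREACHABLE (-1.0e300)   /* sentinel for "no plan reaches this state" */

/* maximum total profit of an n-step schedule that starts and ends in state 0, */
/* given the m-by-m profit matrix p flattened as p[i*m + j] */
double
best_plan_value(const double *p, size_t m, size_t n)
{
    double *prev;    /* best profit ending in each state after i-1 steps */
    double *cur;     /* best profit ending in each state after i steps */
    double *tmp;
    double best;
    size_t i, s, s1;

    prev = malloc(sizeof(double) * m);
    cur = malloc(sizeof(double) * m);
    assert(prev && cur);

    /* after 0 steps we can only be in state 0, with profit 0 */
    for(s = 0; s < m; s++) prev[s] = UNREACHABLE;
    prev[0] = 0.0;

    for(i = 1; i <= n; i++) {
        for(s = 0; s < m; s++) {
            /* try every possible previous state s1 */
            cur[s] = UNREACHABLE;
            for(s1 = 0; s1 < m; s1++) {
                if(prev[s1] != UNREACHABLE && prev[s1] + p[s1*m + s] > cur[s]) {
                    cur[s] = prev[s1] + p[s1*m + s];
                }
            }
        }
        tmp = prev; prev = cur; cur = tmp;   /* recycle the old row */
    }

    best = prev[0];   /* we must end back in state 0 */

    free(prev);
    free(cur);
    return best;
}

Recovering the actual best schedule would require keeping the predecessor choices (or the full table), as in the pseudocode above.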

The pattern here generalizes to other combinatorial optimization problems: the subproblems we consider are finding optimal assignments to increasingly large subsets of the variables, and we have to keep around an optimal representative of each set of assignments that interacts differently with the rest of the variables. Typically this works best when the partial assignment has a small "frontier" (like the last state of the chemical plant) that is all that is visible to later operations.

3.2. Knapsack

In the knapsack problem, we are offered the opportunity to carry away items 1..n, where item i has value p_i and weight w_i. We'd like to maximize the total value of the items we take, but the total weight of the items we take is limited by the capacity K of our knapsack. How can we pick the best subset?

Greedy algorithms will let us down here: suppose item 1 is worth $1000 and weighs K, while items 2..1001 are worth $100 each and weigh K/10000 each. An algorithm that grabs the most valuable item first will fill the knapsack and get a profit of $1000, versus $100,000 for the optimal solution. It is tempting to try instead to grab first the items with the highest profit/weight ratio, but this too fails in many cases: consider two items, one of which weighs K/100000 and has value $1, and the other of which weighs K but has value $10,000. We can only take one, and the first item is worth ten times as much per pound as the second.

There are two ways to solve knapsack using dynamic programming. One works best when the weights (or the size of the knapsack) are small integers, and the other works best when the profits are small integers. Both process the items one at a time, maintaining a list of the best choices of the first k items yielding a particular total weight or a particular total profit.

Here's the version that tracks total weight:

KnapsackByWeight(p, w, K):
  // MaxProfit[i][j] is the maximum profit on subsets of items 1..i
  // with total weight exactly j
  MaxProfit = array[0..n][0..K] initialized to -infinity

  MaxProfit[0][0] = 0

  for i = 1 to n:
    for j = 0..K:
      MaxProfit[i][j] = MaxProfit[i-1][j]
      if j >= w[i]:
        MaxProfit[i][j] = max(MaxProfit[i][j], MaxProfit[i-1][j-w[i]] + p[i])

  return max over all j of MaxProfit[n][j]

This runs in time Θ(nK). Note that even though this time bound is polynomial in the numerical value of K, it is exponential in the size of the input, since the size of K represented in binary is only ⌈lg K⌉ bits.
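
A C rendering of this idea might look like the following sketch (not part of the original notes). It assumes integer profits and nonnegative integer weights in 1-indexed arrays p[1..n] and w[1..n], and it collapses the table to a single row by scanning the weights downward, so that each item is added to any subset at most once:

#include <stdlib.h>
#include <limits.h>
#include <assert.h>

/* maximum profit of a subset of items 1..n with total weight at most K */
/* p[i] and w[i] are the profit and nonnegative integer weight of item i */
int
knapsack_by_weight(const int p[], const int w[], int n, int K)
{
    int *maxProfit;   /* maxProfit[j] = best profit with total weight exactly j, */
                      /* or INT_MIN if no subset seen so far has weight exactly j */
    int best;
    int i, j;

    maxProfit = malloc(sizeof(int) * (K + 1));
    assert(maxProfit);

    for(j = 0; j <= K; j++) maxProfit[j] = INT_MIN;
    maxProfit[0] = 0;

    for(i = 1; i <= n; i++) {
        /* scan weights downward so item i is used at most once */
        for(j = K; j >= w[i]; j--) {
            if(maxProfit[j - w[i]] != INT_MIN
               && maxProfit[j - w[i]] + p[i] > maxProfit[j]) {
                maxProfit[j] = maxProfit[j - w[i]] + p[i];
            }
        }
    }

    /* the answer is the best profit over all feasible total weights */
    best = 0;
    for(j = 0; j <= K; j++) {
        if(maxProfit[j] > best) best = maxProfit[j];
    }

    free(maxProfit);
    return best;
}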

We can also solve the problem by tracking the minimum weight needed to obtain a certain profit. This variant is used less often than the previous one, because in practice it is less likely that the total profit of the items is small than that the size of the knapsack is small.

KnapsackByProfit(p, w, K):
  // MinWeight[i][j] is the minimum weight of subsets of items 1..i
  // with total profit exactly j
  MinWeight = array[0..n][0..sum p[i]] initialized to +infinity

  MinWeight[0][0] = 0

  for i = 1 to n:
    for j = 0..sum p[i]:
      MinWeight[i][j] = MinWeight[i-1][j]
      if j >= p[i]:
        MinWeight[i][j] = min(MinWeight[i][j], MinWeight[i-1][j-p[i]] + w[i])

  return maximum j for which MinWeight[n][j] <= K

The running time of this algorithm is Θ(n ∑ p_i).

3.3. Non-linear structures

The main thing we need in order to do dynamic programming is some sort of structure that is ordered (so that we can build solutions bottom-up) and that doesn't require keeping track of too many alternatives (so that we don't just end up reimplementing BruteForce search).

3.3.1. Trees

One way to get both of these properties is to do dynamic programming on a tree: a tree is naturally ordered by depth (or height), and for many problems the visible part of the solution to a subproblem is confined to a small frontier consisting only of the root of a subtree.

For example, suppose we want to solve minimum vertex cover on a tree. A vertex cover is a set of marked vertices in a graph that covers every edge; in a tree this means that every node or its parent is marked. A minimum vertex cover is a vertex cover that marks the fewest nodes.

Here's a recursive algorithm for finding the size of a minimum vertex cover in a tree, based on the simple observation that the root is either put in the cover or it isn't:

VC(root, mustIncludeRoot):
  if mustIncludeRoot:
    return 1 + sum over all children of VC(child, false)
  else:
    withRoot = 1 + sum over all children of VC(child, false)
    withoutRoot = sum over all children of VC(child, true)
    return min(withRoot, withoutRoot)

The running time of this algorithm depends on the structure of the tree in a complicated way, but we can easily see that it will grow at least exponentially in the depth. This is a job for dynamic programming.

The dynamic programming version computes both VC(root, false) and VC(root, true) simultaneously, avoiding the double call for each child. The structure of the resulting algorithm does not look much like the table structures we typically see elsewhere, but the pairs of values passed up through the tree are in fact acting as very small vestigial tables.

DynamicVC(root):

  for each child c:
    Best[c][0], Best[c][1] = DynamicVC(c)

  withoutRoot = sum over all c of Best[c][1]
  withRoot = 1 + sum over all c of min(Best[c][0], Best[c][1])

  return (withoutRoot, withRoot)

The running time of this algorithm is easily seen to be Θ(n).
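
In C, with an explicit child-pointer representation of the tree, the same computation might look like the sketch below. The struct node type here is hypothetical (nothing earlier in these notes defines a tree type):

/* hypothetical tree representation, just for this sketch */
struct node {
    int numChildren;
    struct node **child;    /* array of numChildren pointers to children */
};

/* computes the size of the smallest vertex cover of the subtree at root */
/* *without gets the best cover that excludes root */
/* *with gets the best cover that includes root */
static void
dynamicVC(const struct node *root, int *without, int *with)
{
    int withoutRoot = 0;
    int withRoot = 1;       /* cost of marking root itself */
    int cWithout;
    int cWith;
    int i;

    for(i = 0; i < root->numChildren; i++) {
        dynamicVC(root->child[i], &cWithout, &cWith);

        /* if root is unmarked, every child must be marked to cover its edge */
        withoutRoot += cWith;

        /* if root is marked, each child can go either way */
        withRoot += (cWithout < cWith) ? cWithout : cWith;
    }

    *without = withoutRoot;
    *with = withRoot;
}

/* size of a minimum vertex cover of the whole tree */
int
minVertexCover(const struct node *root)
{
    int without, with;

    dynamicVC(root, &without, &with);

    return (without < with) ? without : with;
}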

3.3.2. Treewidth

The important property of trees that makes them suitable for dynamic programming is that they have small separators: if we cut out one vertex, the tree disintegrates into subtrees that only interact (for certain problems) through the missing vertex. We can generalize this property to arbitrary graphs through the notion of treewidth.

Given a graph G = (V,E), a tree decomposition of G is a set of bags X = {X_1, X_2, ..., X_n} of vertices in G, together with a tree structure T on the bags satisfying the following axioms:

  1. The union of the bags equals V.
  2. For every edge (u,v) in E, there is a bag containing both u and v.
  3. If there is a path from X_i through X_j to X_k in T, then X_i ∩ X_k ⊆ X_j. (Equivalently: the set of all bags that contain any given vertex v forms a connected subtree of T.)

The width w(X,T) of a tree decomposition is the size of the largest bag. The treewidth tw(G) of a graph G is the minimum of w(X,T) - 1 over all tree decompositions (X,T) of G. If we limit ourselves to tree decompositions in which T is a path, we get pathwidth instead.

3.3.2.1. Examples

3.3.2.2. Treewidth and dynamic programming

Idea: Reconstruct the tree decomposition as a rooted binary tree, then use separation properties to argue that we can build up solutions to e.g. independent set, graph coloring, Hamiltonian circuit, etc. recursively through the tree.

Problem: This whole process is a pain in the neck, and it's not clear that treewidth is any more intuitive than just having small separators. So I suspect we will not be covering this in lecture.


CategoryAlgorithmNotes CategoryProgrammingNotes

  1. But it's linear in the numerical value of the output, which means that fib(n) will actually terminate in a reasonable amount of time on a typical modern computer when run on any n small enough that F(n) fits in 32 bits. Running it using 64-bit (or larger) integer representations will be slower.

  2. I.e., functional.

