Dynamic Programming

1-D Dynamic Programming Algorithm

The 1-D dynamic programming problem is: Given vertices v_lo, v_{lo+1}, ..., v_{hi-1}, v_hi, directed edges (v_i, v_j) for all i and j where lo ≤ i < j ≤ hi, and an edge cost function, edgeCost i j, find an optimal path, [v_{i_0}, v_{i_1}, ..., v_{i_m}], from v_lo to v_hi.

This is a special case of the shortest-path problem; here the graph is a complete directed acyclic graph, as determined by the ordering on the vertices. It can therefore be solved by a variation on the dynamic programming strategy that is also the basis of Prim's minimum spanning tree algorithm and Dijkstra's single-source shortest-paths algorithm.

The algorithm maintains paths from v_lo to a set of "undone" vertices, initially containing v_{lo+1}, ..., v_hi; it also keeps a set of "done" vertices, initially empty. The initial path from v_lo to v_i consists of the single edge (v_lo, v_i) and is unlikely to be optimal. The algorithm repeatedly takes the undone vertex, v_b, that is closest to v_lo and moves it to the "done" set; moving v_b may reveal shorter paths from v_lo via v_b to the remaining undone vertices, whose tentative paths are updated accordingly. Enough information is kept to trace an optimal path backwards from v_hi to v_lo when v_hi becomes done.


module DPA1D (dpa1D) where

dpa1D lo hi edgeCost =
-- Given vertices  lo, lo+1, ..., hi -- and directed,
-- +vely weighted edges (i,j,edgeCost i j) for all i < j,
-- return an optimal path from lo to hi with its cost,
-- i.e. ([lo, i, j, ..., hi], c) where lo < i < j < ... < hi.
 let
  dpa undone done =
    let (b@(_,v,_)) = best undone -- closest (to lo) of the undone
    in if v == hi then b:done
       else dpa (removeAndUpdate b undone) (b:done)

  undone = map (\j -> (lo, j, edgeCost lo j)) [lo+1..hi]  -- initially

  removeAndUpdate (edge@(i,j,c)) edges =
   let
    rmvEdge ((edge'@(_,j',_)):es) =
      if j==j' then map update es  -- drop edge'; seek shortcuts
      else edge' : (rmvEdge es)    -- else j > j'
    -- assumes i < j for every edge (i,j), and that the undone list is sorted on j.

    update edge'@(_,j',c') =       -- does j help j' ?
      let viaJ = c + edgeCost j j' -- ...(_,j,c)(j,j',c') ?
      in if c' > viaJ then (j,j',viaJ) -- improvement or...
         else edge'                    -- ...not

   in rmvEdge edges

  better (e@(_,_,c)) (e'@(_,_,c')) = if c' >= c then e else e'
  best = foldl1 better

  trace k edges path =         -- trace back optimal soln...
    if k <= lo then k:path     -- ...from k down to lo
    else let ((i,j,c):edges') = edges
         in if j == k then trace i edges' (k:path)
         else trace k edges' path

  (edges @ ((_,_,cost):_)) = dpa undone []  -- head edge reaches hi; cost = total path cost

 in (trace hi edges [], cost)
-- ---------------------------------------LA--UWA-&-U.York--2004--
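
A quick sanity check (this test is not from the original text): with squared edge lengths, many short hops are cheaper than one long jump, so an optimal path must visit every vertex.

module TestDPA1D where
import DPA1D

-- four unit edges cost 1+1+1+1 = 4; the single edge (0,4) alone costs 16
test :: ([Int], Double)
test = dpa1D 0 4 (\i j -> fromIntegral ((j-i)^2))
-- test == ([0,1,2,3,4], 4.0)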

This problem appears in various disguises, for example in inserting line breaks for paragraph layout and in polygon fitting. It also appears in the inference and data-compression problem (Farr & Wallace 2002) described in the next section.
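
The paragraph-layout disguise can be sketched directly (a hypothetical example, not from the original text; the squared-slack cost is one common choice): vertices are word boundaries 0..n, and edge (i,j) means that words i+1..j form one line.

module Layout where
import DPA1D

breaks lineWidth wordLens =  -- choose line breaks for a list of word lengths
 let
  n = length wordLens
  lineLen i j = sum (take (j-i) (drop i wordLens))  -- words i+1..j, plus...
                + fromIntegral (j-i-1)              -- ...single spaces
  cost i j = let slack = lineWidth - lineLen i j
             in if slack < 0 then 1/0 else slack*slack  -- forbid overfull lines
 in dpa1D 0 n cost  -- ([0, ..., n], total cost)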

SMML Code-Book for the 2-State (~Binomial) Distribution

A transmitter and a receiver want to share the results of n Bernoulli trials, i.e. n throws of a coin, carried out by the transmitter, by sending them over a data link. They could use a fixed code, e.g. based on pr(H)=0.5, in which case the data would be sent at 1 bit per throw. But they might be able to do better than this if the coin is biased.

The transmitter, who is the one making the trials, is in a position to estimate θ=pr(H) and to send the results using a code based on θ, but the receiver cannot decode such data unless told the estimate in a header to the message. The transmitter and receiver need a "code-book" for (the parameter of) the 2-state distribution.

Considering just the number of heads, #H, that could turn up, this quantity lies in [0..n]; there are n+1 possible values for #H. It is clearly impossible to infer pr(H) more precisely than something like ±1/(2×(n+1)), and it turns out that a sensible precision is much coarser than this.

The code-book contains a partition of [0..n] (which implies a partition of [0.0..1.0]). The transmitter conducts the trials and selects the entry for the part of the partition into which #H falls. The message header is the code-word for that part. The transmitter then transmits the trial results using the code for {H,T} based on the estimate corresponding to the part. The receiver now has the estimate and can decode the results of the trials.
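
For example, with the n=10 code-book computed below, the partition is [0..0], [1..4], [5..9], [10..10]. If 3 heads turn up, the transmitter sends the code-word for the part [1..4] and then encodes the ten throws using that part's estimate, (1+4)/(2×10) = 0.25; the receiver decodes the header, recovers the estimate 0.25, and then decodes the throws.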

The 1-D DPA can be used to partition [0..n] optimally (Farr & Wallace 2002). It is convenient to set lo = -1 and hi = n. This example illustrates the correspondence between a path and a partition:

path  [-1, 1, 2, 5]
edges [(-1,1), (1,2), (2,5)]
           ----->
 ---------->    --------------->
-1    0    1    2    3    4    5
partition [[0..1], [2..2], [3..5]]

The cost of an edge (i,j), i.e. part [i+1..j], is the expected message length over those outcomes containing more than i and at most j heads. A prior probability distribution on θ is needed to complete the calculation; the uniform distribution over [0.0..1.0] is used below for simplicity.
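
In symbols, writing cases = hi-lo+1 for a part [lo..hi] and est = (lo+hi)/(2n) for its estimate, the code below charges the part

  ( cases×log((n+1)/cases) - H×log(est) - (N-H)×log(1-est) ) / (n+1)

where H = cases×(lo+hi)/2 is the total number of heads, and N = cases×n the total number of throws, over the part's outcomes; the first term pays for the headers (uniform prior) and the remainder for the trial results.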


module Binomial (module Binomial) where
import DPA1D

binomial n =  -- compute the SMML code-book for n Bernoulli trials
 let
  n' = fromIntegral n  -- often need n in real ratios etc.
  n1'= n'+1            -- ditto, |[0..n]|

  codeTrials p h n =   -- p = pr(h), h = #head, n = #trials
    let t = n-h
    in -(if h > 0 then (fromIntegral h)*log p     else 0.0)
       -(if t > 0 then (fromIntegral t)*log (1-p) else 0.0) -- no 0*inf 

  interval lo hi =  -- expected msg len due to single interval [lo..hi]
    let cases  = hi-lo+1           -- |[lo..hi]|
        cases' = fromIntegral cases
        est    = fromIntegral(lo+hi) / (2.0 * n')
    in (cases' * log(n1'/cases')                   -- uniform prior
        + codeTrials est (cases*(hi+lo) `div` 2) (cases*n)  -- total #H, #throws; m.l. est.
       ) / n1'                                    -- in expectation

  edgeCost i j = interval (i+1) j  -- (i,j) covers [i+1..j]

  (path, c) = dpa1D (-1) n edgeCost
  parts = zip (map (+1) path) (tail path)  -- partition [0..n]
 in (parts, c)
-- --------------------------------------------LA--UWA-&-U.York--2004--
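
For instance, binomial 3 evaluates to ([(0,0),(1,3)], c) with c ≈ 2.0 nits, i.e. c/log 2 ≈ 2.88 bits, matching the n=3 row of the sample output below.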

A simple main program runs the algorithm through its paces:

 
module Main where
import DPA1D
import Binomial

main =
  putStrLn ("SMML code-book for 2-state (~binomial) distribution\n" ++
         "key: n, partition, Expected msg length (bits)")
 >> putStrLn( foldr (\n -> \theRest ->
      let (p, msgl) = binomial n
      in (show n) ++ ", " ++ (show p) ++ ", "
         ++ (show (msgl/log 2)) ++ "\n" ++ theRest
      ) ""
     [1..30] )  -- a range of n
-- ---------------------------------------LA--UWA-&-U.York--2004--

E.g.

SMML code-book for 2-state (~binomial) distribution
key: n, partition, Expected msg length (bits)
1, [(0,1)], 1.0
2, [(0,2)], 2.0
3, [(0,0),(1,3)], 2.88
4, [(0,0),(1,3),(4,4)], 3.77
5, [(0,0),(1,4),(5,5)], 4.58
6, [(0,0),(1,5),(6,6)], 5.43
7, [(0,3),(4,7)], 6.25
8, [(0,0),(1,5),(6,8)], 7.04
9, [(0,0),(1,5),(6,9)], 7.83
10, [(0,0),(1,4),(5,9),(10,10)], 8.63
...

Farr and Wallace (2002) show that the SMML code-book problem is feasible, that is, computable in polynomial time, for probability distributions having 1-D discrete parameter spaces. "The problem in general is [...] NP-hard." "The complexity of the trinomial [3-state] case remains open" but they give a good heuristic for it; the trinomial problem has two parameters, θ0 = pr(0) and θ1 = pr(1), with pr(2) = 1-θ0-θ1.

Approximate Code-Book for the Normal (Gaussian) Distribution

The normal distribution causes us two difficulties: its parameter space, (μ,σ), is two-dimensional, and both parameters are continuous. We can approximate its code-book by considering μ given σ and by quantising the range of σ. This amounts to two simplifying assumptions: an approximate code-book entry corresponds to a rectangle in parameter space, which is suboptimal in general, and σ is quantised.

μ is an "origin" parameter: there is no nett effect on the likelihood if μ and all data are increased by the same value x. σ is a "scale" parameter: if μ=0 (see above), there is no nett effect on the likelihood if σ and all data are multiplied by the same value x>0. Taking advantage of these facts, the σ quanta are chosen so that their log values are equally spaced and, for a given σ interval, the range of μ is divided into some number of equal-sized intervals, as sketched below.
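
The geometric σ grid can be sketched in isolation (a hypothetical helper; the names match sVal in the module below):

module SigmaGrid where

sigmaGrid sLwb sUpb nS =  -- nS+1 points from sLwb to sUpb, equally spaced logs
 let dLogS = (log sUpb - log sLwb) / fromIntegral nS
 in [ exp(log sLwb + (fromIntegral k)*dLogS) | k <- [0..nS] ]
-- e.g. sigmaGrid 0.1 1.0 4  ~=  [0.1, 0.178, 0.316, 0.562, 1.0]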

The expected "cost" of using one estimate NmEst,sEst per "patch" (mLo,mHi)×(sLo,sHi) is the cost of sending the estimate plus the expected KL-distance from the (untransmittable) maximum likelihood estimate to it. This expected KL-distance be calculated by integration of the KL-distance from NmaxLH to NmEst,sEst over the patch. Making the patches smaller reduces this cost to the data but it also increases the cost of stating the estimates, giving a trade-off to be optimised.


module Normal (module Normal) where
import DPA1D

normal nData mLwb mUpb sLwb sUpb nS =
 let      -- m mean,   s sigma
  nS' = fromIntegral nS
  sRange  = sUpb - sLwb
  logSlwb = log sLwb
  dLogS   = (log sUpb - logSlwb) / nS'
  sVal n  = exp(logSlwb + (fromIntegral n)*dLogS)

  sInterval i j =
   let si = sVal i
       sj = sVal j
   in ( log(sRange/(sj-si))   -- code sEst, uniform prior
      + snd( doMean si sj )
      ) * (sj - si) / sRange  -- expectation due to [i..j]

  doMean sLo sHi =  -- given [sLo...sHi], optimally divide...
   let              -- ...[mLwb...mUpb] into equal parts
    sEst = (sLo + sHi) / 2

    divide n =
      let n' = fromIntegral n
          dM = (mUpb - mLwb) / n'
          mEst = dM / 2
          eMsgL =
           log n'   -- code for mEst, uniform prior
           + nData * (integrateKL 0 dM sLo sHi mEst sEst)
                   / ((sHi - sLo) * dM)
           -- i.e. + nData * avg data "error"
      in (n, eMsgL) -- divide

    best (xc:xcs) =  -- linear search; eMsgL is unimodal in n, so the first local min is the min
      let b (xc@(x,c)) ((xc'@(x',c')):xcs) =
            if c' < c then b xc' xcs else xc
      in b xc xcs

   in best (map divide [1.. ])  -- doMean

  (sNums, c) = dpa1D 0 nS sInterval
  sVals = map sVal sNums
  sIntvls = zip sVals (tail sVals)
  mNs = map (\(sLo,sHi) -> fst(doMean sLo sHi)) sIntvls

 in ((zip sIntvls mNs),  c)   -- normal


integrateKL m1Lo m1Hi s1Lo s1Hi m2 s2 =  -- integrate KL over m1 & s1
 let f m1 = ((m1-m2)**3) / (6*sqr s2) + m1*(log s2 - 0.5)
     g s1 = (s1**3) / (6*sqr s2) - s1*(log s1 - 1)
     -- NB d/dx {x.log x - x} = x/x + log x - 1 = log x
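     -- this KL-distance = log(s2/s1) + (sqr s1 + sqr(m1-m2))/(2*sqr s2) - 1/2
     --                  = f'(m1) + g'(s1),
     -- so the rectangle integral has the closed form below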
 in (f m1Hi - f m1Lo)*(s1Hi - s1Lo) + (m1Hi-m1Lo)*(g s1Hi - g s1Lo)

sqr x = x*x  -- maybe faster than x**2 ?
-- --------------------------------------------LA--UWA-&-U.York--2004--

module Main where
import DPA1D
import Normal

main =
  putStrLn ("Approximate code-book for Normal (Gaussian) distribution\n"
   ++ "key: ((sigmaLo,sigmHi),nMuIntervals)... E |H|+n*KL_dist (nits)" )
 >> putStrLn(
      let (is,c) = (normal 30  0.0 2.0  0.1 1.0  100)
      in (foldr (\x -> \rest -> (show x) ++ "\n" ++ rest) "" is)
         ++ show c
    )
-- ---------------------------------------LA--UWA-&-U.York--2004--

E.g. n=30; μ 0.0..2.0; σ 0.1..1.0 (100 quanta):

Approximate code-book for Normal (Gaussian) distribution
key: ((sigmaLo,sigmaHi),nMuIntervals)... E |H|+n*KL_dist (nits)
((0.10,0.16),24)
((0.16,0.25),15)
((0.25,0.40),10)
((0.40,0.63),6)
((0.63,1.0),4)
4.32656332488434

References

G. E. Farr and C. S. Wallace. The complexity of strict minimum message length inference. The Computer Journal, 45(3), pp. 285-292, 2002.
Note that they give expected message lengths for transmitting #H only, excluding the trial results.