2-State and Binomial Distributions
e.g. HTTHTTTHTT: n=10, h=3, t=7.
'A' is an event for which P(A)=p and hence P(not A)=1-p. Consider n independent trials (also known as Bernoulli trials). The random variable X is defined to be the number of times that A occurs. Then X has a binomial distribution with parameters n and p.
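As a concrete check of the definition, the probability of exactly h occurrences is C(n,h).p^h.(1-p)^(n-h); a minimal Python sketch, assuming a fair coin for the example sequence above:

```python
from math import comb

# Example sequence HTTHTTTHTT: n=10 trials, h=3 heads; assume p=0.5 (fair coin).
n, h, p = 10, 3, 0.5
# Binomial probability of exactly h successes in n Bernoulli trials:
pr = comb(n, h) * p**h * (1 - p)**(n - h)
print(pr)  # 0.1171875
```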
The following is taken from Wallace & Boulton (1968) to follow the historical development of MML. The general MML form, explicitly based on the [Fisher] information, is given later.
Coding
We want to transmit a sequence of n coin-tosses containing h heads; this is a multi-state distribution with m=2.
A uniform prior is assumed for P(H). The value of 'n' is either common knowledge or is transmitted by standard methods. The receiver does not know h or t.
h = #heads, t = #tails, n = h+t
r(H) = h/n = frequency of heads, r(T) = t/n
P(H) = probability of a head, P(T) = 1-P(H)
P'(H) = estimate of P(H), P'(T) = estimate of P(T)
We describe three methods to transmit the sequence of coin-tosses, from Wallace & Boulton (1968):
Method 1
- (i) Transmit h, 0≤h≤n; requires log(n+1) bits (log base 2 ⇒ bits, base e ⇒ nits).
- (ii) Transmit which combination out of the C(n,h) possibilities appeared; requires log(n!/(h!(n-h)!)) bits.
Total msg len = log(n+1) + log(n!/(h!(n-h)!)) = log( (n+1)!/(h!(n-h)!) ).
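A minimal sketch of method 1 in Python (the function name is our own; log base 2, so the answer is in bits):

```python
from math import comb, log2

def method1_bits(n: int, h: int) -> float:
    """Message length of method 1: transmit h, then which combination occurred."""
    part_i = log2(n + 1)            # (i) h is one of n+1 equally likely values
    part_ii = log2(comb(n, h))      # (ii) one of C(n,h) arrangements of the heads
    return part_i + part_ii

print(method1_bits(10, 3))          # == log2((n+1)!/(h!(n-h)!)) ≈ 10.37 bits
```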
Method 2
- Use an 'adaptive code'.
- H and T running counts are maintained, initialised to one so that there are no zero probabilities.
- The code for each H or T is based on the counts at that point.
e.g., with the counts in force when each symbol is coded:

sequence:   H    T    T    H    T    T    T    H    T    T
H count:    1    2    2    2    3    3    3    3    4    4
T count:    1    1    2    3    3    4    5    6    6    7
sum:        2    3    4    5    6    7    8    9   10   11
p = count/sum:
           1/2  1/3  2/4  2/5  3/6  4/7  5/8  3/9  6/10 7/11

Over the whole sequence, p = h!.(n-h)!/(n+1)!, so
msg len = -log(p) = log( (n+1)!/(h!(n-h)!) ).
So method 1 and method 2 require the same message length.
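This equality is easy to verify numerically; a sketch of the adaptive code in Python (our own code, lengths in bits):

```python
from math import comb, log2

def adaptive_bits(seq: str) -> float:
    """Method 2: code each symbol with p = count/sum, counts initialised to 1."""
    counts = {'H': 1, 'T': 1}
    bits = 0.0
    for c in seq:
        bits += -log2(counts[c] / (counts['H'] + counts['T']))
        counts[c] += 1              # update the running count after each symbol
    return bits

seq = "HTTHTTTHTT"
n, h = len(seq), seq.count('H')
print(adaptive_bits(seq))                  # adaptive code (method 2)
print(log2((n + 1) * comb(n, h)))          # method 1: log2((n+1)!/(h!(n-h)!))
```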
Method 3
- From the appendix of W&B (1968), for 2 states, m=2:
- (i) Transmit an estimate P'(H) of P(H), to some accuracy; P'(T) = 1-P'(H).
- (ii) Transmit the sequence of Hs and Ts in a code based on that estimate; requires -h.log(P'(H)) - t.log(P'(T)).
Total msg len = msglen(P'(H)) - h.log(P'(H)) - t.log(P'(T))
Part (ii) is minimised if P'(H)=r(H) but then part (i) is large.
[Figure: the length of part (ii) plotted against P'(H); it is minimised at P'(H) = r(H), and D is the excess length incurred when P'(H) differs from r(H).]
- Let r(H) = P'(H)+a(H) and r(T) = P'(T)+a(T), where a(H)+a(T)=0.
D = -n.{ Σ[m:HT] (P'(m)+a(m)).log(P'(m))            -- using estimates P'()
        - Σ[m:HT] (P'(m)+a(m)).log(P'(m)+a(m)) }    -- using the r()'s
  = n.Σ[m:HT] (P'(m)+a(m)).log(1 + a(m)/P'(m))
Assuming a(m)/P'(m) is small, expand log(1 + a(m)/P'(m)) to the 2nd term,
NB. log(1+x) = x - x²/2 + x³/3 - ...
  ≈ n.Σ[m:HT] (P'(m)+a(m)).( a(m)/P'(m) - a(m)²/(2.P'(m)²) )
  = n.Σ[m:HT] { a(m) - a(m)²/(2.P'(m)) + a(m)²/P'(m) - a(m)³/(2.P'(m)²) }
NB. Σ[m:HT] a(m) = 0, and the a(m)³ term is of higher order, so
  ≈ (n/2).Σ[m:HT] a(m)²/P'(m)
The values of P(H) range uniformly over [0,1], and a particular estimate <P'(H), P'(T)=1-P'(H)> will be used over some range of values of r(H). This range will depend on n and on P'(H).
[Figure: the constraint Σ[m:HT] r(m) = 1, r(m) ≥ 0, i.e. the line from (0,1) to (1,0) in the (r(H), r(T)) plane.]

For 2 states, {H,T}, since a(H) = -a(T) and P'(T) = 1-P'(H),
D = (n/2).{ a(H)²/P'(H) + a(T)²/P'(T) }
  = (n/2).{ a(H)²/P'(H) + a(H)²/P'(T) }
  = (n/2).a(H)² / (P'(H).(1-P'(H)))
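A quick numerical sanity check of this quadratic approximation (a sketch, with assumed values for n, P'(H) and a(H)):

```python
from math import log

n, pH, aH = 100, 0.4, 0.02    # assumed: n tosses, estimate P'(H), offset a(H)
rH = pH + aH                   # true frequency r(H) = P'(H) + a(H)
# Exact excess length (in nits) of coding n symbols with P'() instead of r():
exact = n * (rH * log(rH / pH) + (1 - rH) * log((1 - rH) / (1 - pH)))
# The quadratic approximation (n/2).a(H)²/(P'(H).(1-P'(H))):
approx = (n / 2) * aH**2 / (pH * (1 - pH))
print(exact, approx)           # close while a(m)/P'(m) is small
```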
Note that for a given a(H), D is large for P'(H) close to 0 or to 1; also, D is symmetric in the sign of a(H), so P'(H) should be at the centre of the range over which it is used.
The average value of a(H)² over the range [P'(H)-x/2, P'(H)+x/2], of length x, is
= (1/x).∫[-x/2, x/2] y² dy = (1/x).[y³/3] evaluated from -x/2 to x/2 = (1/x).(x³/12) = x²/12
avgD = n.x² / (24.P'(H).(1-P'(H)))
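The x²/12 average can be sanity-checked numerically (a small sketch using a midpoint-rule sum):

```python
# Check that the average of y² over [-x/2, x/2] is x²/12 (midpoint rule):
x, steps = 0.4, 100_000
avg = sum((-x / 2 + (i + 0.5) * x / steps) ** 2 for i in range(steps)) / steps
print(avg, x**2 / 12)   # both ≈ 0.013333
```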
The message length to state the interval [P'(H)-x/2, P'(H)+x/2] and thus P'(H) is
-log(x)   -- because of the uniform prior on P(H)
The total average cost to transmit the interval, plus the excess D due to the mismatch of P'(H) to r(H), is
-log(x) + n.x² / (24.P'(H).(1-P'(H)))
To minimise the average increase, differentiate w.r.t. x and set the derivative to zero:
-1/x + n.x / (12.P'(H).(1-P'(H))) = 0,  giving  x² = 12.P'(H).(1-P'(H)) / n
So the minimum increase is
1/2 log(n/12) - 1/2 {log(P'(H)) + log(1-P'(H))} + n.(12.P'(H).(1-P'(H))/n) / (24.P'(H).(1-P'(H)))
= 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}
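A sketch of the optimum in Python (function names are our own; lengths in nits):

```python
from math import log, sqrt

def optimal_width(n: int, pH: float) -> float:
    """Optimal interval width for stating P'(H): x² = 12.P'(H).(1-P'(H))/n."""
    return sqrt(12 * pH * (1 - pH) / n)

def min_increase(n: int, pH: float) -> float:
    """Minimum average increase: 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}."""
    return 0.5 * (log(n / 12) - log(pH) - log(1 - pH) + 1)

n, pH = 100, 0.3                                    # assumed example values
x = optimal_width(n, pH)
cost = -log(x) + n * x**2 / (24 * pH * (1 - pH))    # cost at the optimal width ...
print(cost, min_increase(n, pH))                    # ... agrees with the closed form
```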
If we take P'(H) ≈ r(H) and average around r(H), the total message length to transmit the data by method 3 is approximately:
1/2 {log(n/12) + 1} - Σ[m:HT] (n.r(m) + 1/2).log(r(m))
= 1/2 {log(n/12) + 1 - Σ[m:HT] log(r(m))} - h.log(h/n) - t.log(t/n)
where the final two terms, -h.log(h/n) - t.log(t/n), are part (ii) in the "optimal" code.
To compare Method 3 with methods 1 and 2:
Stirling's approximation: log(x!) = x.log(x) - x + 1/2 log(x) + 1/2 log(2 pi) + ...

Method 1 or 2:
log( (n+1)!/(h! t!) )
= log(n+1) + log(n!/(h! t!))
= log(n+1) + n.log(n) - n + 1/2 log(n) + 1/2 log(2 pi)
            - h.log(h) + h - 1/2 log(h) - 1/2 log(2 pi)
            - t.log(t) + t - 1/2 log(t) - 1/2 log(2 pi)
= log(n+1) - h.log(h/n) - t.log(t/n) + 1/2 (log(n) - log(h) - log(t) - log(2 pi))
= 1/2 {log((n+1)².n/(h.t)) - log(2 pi)} - h.log(h/n) - t.log(t/n)
= 1/2 {log((n+1)²/n) - log(r(H)) - log(r(T)) - log(2 pi)} - h.log(h/n) - t.log(t/n)

Method 3 - method 1:
= 1/2 {log(n/12) + 1 - Σ[m:HT] log(r(m))} - h.log(h/n) - t.log(t/n)
  - [ 1/2 {log((n+1)²/n) - log(2 pi) - log(r(H)) - log(r(T))} - h.log(h/n) - t.log(t/n) ]
= 1/2 {-2.log(1+1/n) - log(12) + 1 + log(2 pi)}
~ 1/2 {log(pi/6) + 1}       -- for large n
= 1/2 log(pi.e/6)
= 0.176 nits
So method 3 is just marginally(!) worse than methods 1 and 2 - but then it states an estimate P'( ) of P( ) which methods 1 and 2 do not.
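The 0.176 nit gap can be confirmed numerically; a sketch using exact factorials via log-gamma (our own code, lengths in nits):

```python
from math import lgamma, log, pi, e

def method1_nits(n: int, h: int) -> float:
    """Exact log((n+1)!/(h!(n-h)!)) in nits, via log-gamma."""
    return lgamma(n + 2) - lgamma(h + 1) - lgamma(n - h + 1)

def method3_nits(n: int, h: int) -> float:
    """The method 3 approximation derived above, taking P'(H) = r(H) = h/n."""
    r = h / n
    return (0.5 * (log(n / 12) + 1 - log(r) - log(1 - r))
            - h * log(r) - (n - h) * log(1 - r))

n, h = 10_000, 3_000                              # a large-n example
print(method3_nits(n, h) - method1_nits(n, h))    # ≈ 0.176 nits
print(0.5 * log(pi * e / 6))                      # the large-n limit
```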
Demonstration
To demonstrate, generate a sequence of coin tosses for a specified pr(H) and length n, and calculate the code length under a fixed code (nominating an estimate p' of P(H)), the combinatorial code, the adaptive code, and a code which states P'(H) to the optimal level of accuracy. NB. the approximations used may break down for very small values of n.
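A sketch of such a demonstration in Python (our own code; all lengths in nits):

```python
import random
from math import comb, log

def demo(n: int, pH: float, p_nominated: float, seed: int = 0) -> None:
    """Generate n tosses with pr(H)=pH and print the four code lengths (nits)."""
    rng = random.Random(seed)
    seq = ''.join('H' if rng.random() < pH else 'T' for _ in range(n))
    h = seq.count('H'); t = n - h; r = h / n   # assumes 0 < h < n

    fixed = -h * log(p_nominated) - t * log(1 - p_nominated)   # fixed code, nominated p'
    combinatorial = log(n + 1) + log(comb(n, h))               # method 1
    counts = {'H': 1, 'T': 1}; adaptive = 0.0                  # method 2, summed directly
    for c in seq:
        adaptive += -log(counts[c] / (counts['H'] + counts['T']))
        counts[c] += 1
    # Method 3: state P'(H)=r(H) to optimal accuracy, then code the sequence:
    mml = 0.5 * (log(n / 12) + 1 - log(r) - log(1 - r)) - h * log(r) - t * log(1 - r)

    print(f"fixed={fixed:.2f} combinatorial={combinatorial:.2f} "
          f"adaptive={adaptive:.2f} mml={mml:.2f} nits")

demo(n=100, pH=0.3, p_nominated=0.5)
```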
Method 4, sparse list
There is a fourth method, not mentioned by W&B, of representing sequences of binary data, which may seem tempting for "sparse" (skewed) sequences -- mostly tails (or mostly heads) -- namely a list of the positions of the heads (tails). It takes log(n) bits (logs base 2) to specify an arbitrary integer in [0 .. n-1], i.e., a position. It takes
log(n) + log(n-1) + ... + log(n-h+1) - log(h!)
= log(n!/(h!(n-h)!)) bits,
where h = #heads, to specify h different integers. As before, h ∈ [0 .. n] must also be transmitted, which takes log(n+1) bits under the uniform prior that we have been using. The total message length is therefore
log((n+1)!/(h!(n-h)!)),
the same as methods 1 and 2, as one would hope.
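And a numerical check of method 4 (a sketch; the function name is our own):

```python
from math import comb, factorial, log2

def method4_bits(n: int, h: int) -> float:
    """Sparse list: transmit h, then the set of h head positions."""
    send_h = log2(n + 1)                              # h in [0 .. n], uniform prior
    positions = sum(log2(n - i) for i in range(h))    # log(n) + ... + log(n-h+1)
    unordered = -log2(factorial(h))                   # positions form a set, not a list
    return send_h + positions + unordered

n, h = 10, 3
print(method4_bits(n, h))              # equals ...
print(log2((n + 1) * comb(n, h)))      # ... log2((n+1)!/(h!(n-h)!)), as in methods 1 and 2
```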
This result is relevant when considering the adjacency-matrix v. adjacency-lists representations of graphs (networks).
Notes
This page is directly based on the appendix of the paper Wallace & Boulton (1968); any errors are L.A.'s.
- C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11(2), pp.185-194, Aug. 1968. (See the appendix.)
- D. M. Boulton & C. S. Wallace. The Information Content of a Multistate Distribution. J. Theor. Biol., 23(2), pp.269-278, 1969.
The multi-state distribution generalises the 2-state distribution.