2-State and Binomial Distributions
e.g. HTTHTTTHTT: n=10, h=3, t=7
'A' is an event for which P(A)=p and hence P(not A)=1-p. Consider 'n' independent trials (also known as Bernoulli trials). The random variable X is defined to be the number of times that A occurred. Then X has a binomial distribution with parameters n and p.
The following is taken from Wallace & Boulton (1968), so as to follow the historical development of MML. The general MML form, explicitly based on the [Fisher] information, is given later.
Coding
Consider transmitting a sequence of 'n' coin-tosses containing 'h' heads; this is a multistate distribution with m=2.
A uniform prior is assumed for P(H). The value of 'n' is either common knowledge or is transmitted by standard methods. The receiver does not know h or t.
h = #heads, t = #tails, n = h+t
r(H) = h/n = frequency of heads, r(T) = t/n = frequency of tails
P(H) = probability of a head, P(T) = probability of a tail
P'(H) = estimate of P(H), P'(T) = estimate of P(T)
We describe three methods, from Wallace & Boulton (1968), to transmit the sequence of coin-tosses:
Method 1
 (i) Transmit h, 0≤h≤n, requires log(n+1) bits (base 2 ⇒ bits, base e ⇒ nits).
 (ii) Transmit which combination out of C(n,h) appeared, requires log(n!/(h!(n-h)!)) bits.
Total msg len = log(n+1) + log(n!/(h!(n-h)!)) = log( (n+1)!/(h!(n-h)!) ).
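As a concrete check, here is a minimal Python sketch of the method 1 message length (not from W&B; the function names are my own). Natural logs are used, so lengths come out in nits.

  from math import lgamma, log

  def log_factorial(k):
      # log(k!) via the log-gamma function; natural logs, so lengths are in nits
      return lgamma(k + 1)

  def method1_length(n, h):
      # part (i): h is uniform on 0..n, costing log(n+1) nits
      # part (ii): which of the C(n,h) combinations occurred
      return log(n + 1) + log_factorial(n) - log_factorial(h) - log_factorial(n - h)

  print(method1_length(10, 3))  # the HTTHTTTHTT example: n=10, h=3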
Method 2
 Use an 'adaptive code'.
 H and T running counts are maintained, initialised to one so that there are no zero probabilities.
 The code for each H or T is based on the counts at that point.
e.g. the sequence H T T H T T T H T T:

            H    T    T    H    T    T    T    H    T    T
  H count:  1    2    2    2    3    3    3    3    4    4
  T count:  1    1    2    3    3    4    5    6    6    7
  sum:      2    3    4    5    6    7    8    9    10   11
  p = count/sum:
            1/2  1/3  2/4  2/5  3/6  4/7  5/8  3/9  6/10 7/11

Overall p = h!.(n-h)!/(n+1)!, so
msg len = -log(p) = log( (n+1)!/(h!(n-h)!) ).
So method 1 and method 2 require the same message length.
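A corresponding sketch of the adaptive code (again with my own function names); its total should equal method 1's length exactly:

  from math import log

  def adaptive_length(seq):
      counts = {'H': 1, 'T': 1}        # counts start at one: no zero probabilities
      msglen = 0.0
      for c in seq:
          total = counts['H'] + counts['T']
          msglen += -log(counts[c] / total)  # cost of this toss at the current counts
          counts[c] += 1                     # update the running count
      return msglen

  print(adaptive_length("HTTHTTTHTT"))  # equals method1_length(10, 3)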
Method 3
 From the appendix of W&B (1968), for 2 states, m=2:
 (i) Transmit an estimate P'(H) of P(H), to some accuracy. P'(T) = 1-P'(H).
 (ii) Transmit the sequence of Hs and Ts in a code based on that estimate, requires -h.log(P'(H)) - t.log(P'(T)).
Total msg len = msglen(P'(H)) - h.log(P'(H)) - t.log(P'(T))
Part (ii) is minimised if P'(H)=r(H) but then part (i) is large.
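A quick numerical illustration of part (ii)'s cost for a nominated estimate (a sketch; part (i)'s cost, stating P'(H) itself, depends on the accuracy chosen and is derived below):

  from math import log

  def part_ii_length(h, t, p_est):
      # -h.log(P'(H)) - t.log(P'(T)) in nits, for the nominated estimate p_est
      return -h * log(p_est) - t * log(1 - p_est)

  print(part_ii_length(3, 7, 0.3))  # minimised when p_est = r(H) = h/n = 0.3
  print(part_ii_length(3, 7, 0.5))  # any other estimate costs more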

[Figure: part (ii)'s message length as a function of P'(H) over [0,1]; it is minimised at P'(H) = r(H), and D is the excess length when P'(H) differs from r(H).]
 Let r(H) = P'(H)+a(H) etc. where a(H)+a(T)=0.
D = n.{ -Σ_{[m:HT]} (P'(m)+a(m)).log(P'(m))          ... using estimates P'()
        +Σ_{[m:HT]} (P'(m)+a(m)).log(P'(m)+a(m)) }   ... using the r()'s
  = n.Σ_{[m:HT]} (P'(m)+a(m)).log(1 + a(m)/P'(m))
Assume a(m)/P'(m) is small and expand log(1+a(m)/P'(m)) to the 2nd term;
NB. log(1+x) = x - x^{2}/2 + x^{3}/3 - ...
  ~ n.Σ_{[m:HT]} (P'(m)+a(m)).(a(m)/P'(m) - a(m)^{2}/(2.P'(m)^{2}))
  = n.Σ_{[m:HT]} { a(m) - a(m)^{2}/(2.P'(m)) + a(m)^{2}/P'(m) - a(m)^{3}/(2.P'(m)^{2}) }
NB. Σ_{[m:HT]} a(m) = 0 and the a(m)^{3} term is negligible, so
  ~ (n/2).Σ_{[m:HT]} a(m)^{2}/P'(m)
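The quadratic approximation for D can be checked numerically; a small sketch under the assumption that a(H) is small relative to P'(H):

  from math import log

  def excess_D(n, p_est, a):
      # exact D = n.sum_m r(m).log(r(m)/P'(m)), with r(H) = P'(H)+a, r(T) = P'(T)-a
      r_h, r_t = p_est + a, (1 - p_est) - a
      exact = n * (r_h * log(r_h / p_est) + r_t * log(r_t / (1 - p_est)))
      approx = (n / 2) * (a**2 / p_est + a**2 / (1 - p_est))  # (n/2).sum a(m)^2/P'(m)
      return exact, approx

  print(excess_D(100, 0.3, 0.02))  # the two values should agree closely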
The values of P(H) range over G=[0,1] uniformly and a particular estimate <P'(H), P'(T)=1-P'(H)> will be used over some range of r(H). This range will depend on n and on P'(H).
[Figure: valid <r(H),r(T)> pairs satisfy Σ_{[m:HT]} r(m) = 1 and r(m) ≥ 0, i.e. the line from <0,1> to <1,0> in the r(H)-r(T) plane.]
For 2 states, {H,T}, since a(H) = -a(T) and P'(H) = 1-P'(T),
D = (n/2).{ a(H)^{2}/P'(H) + a(T)^{2}/P'(T) }
  = (n/2).{ a(H)^{2}/P'(H) + a(H)^{2}/P'(T) }
  = (n/2).a(H)^{2} / (P'(H).(1-P'(H)))
Note that for a given a(H), D is large for P'(H) close to 0 or to 1, and also that D is symmetric in a(H) being +ve or -ve, so P'(H) should be at the centre of the range over which it is used.
The average value of a(H)^{2} over the range [P'(H)-x/2, P'(H)+x/2] of length x is
= (1/x).∫_{[-x/2, x/2]} y^{2} dy = (1/x).[y^{3}/3]_{-x/2}^{x/2} = (1/x).(x^{3}/12) = x^{2}/12
avgD = n.x^{2} / (24.P'(H).(1-P'(H)))
The message length to state the interval [P'(H)-x/2, P'(H)+x/2], and thus P'(H), is
-log(x)   ... because of the uniform prior on P(H)
Total avg cost to transmit interval and also D due to mismatch of P'(H) to r(H) is
-log(x) + n.x^{2} / (24.P'(H).(1-P'(H)))
To minimise the average increase, differentiate w.r.t. x and set to zero:
-1/x + n.x / (12.P'(H).(1-P'(H))) = 0
⇒ x^{2} = 12.P'(H).(1-P'(H)) / n
So the minimum increase is
1/2 log(n/12) - 1/2 {log(P'(H)) + log(1-P'(H))}
  + n.(12.P'(H).(1-P'(H))/n) / (24.P'(H).(1-P'(H)))
= 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}
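As a sketch, the optimal width and the resulting minimum increase (function names are my own):

  from math import log, sqrt

  def optimal_interval(n, p_est):
      # x^2 = 12.P'(H).(1-P'(H))/n minimises -log(x) + n.x^2/(24.P'(H).(1-P'(H)))
      x = sqrt(12 * p_est * (1 - p_est) / n)
      min_increase = 0.5 * (log(n / 12) - log(p_est) - log(1 - p_est) + 1)
      return x, min_increase

  x, cost = optimal_interval(100, 0.3)
  print(x, cost)  # interval width, and minimum (part (i) + D) cost in nits
  print(-log(x) + 100 * x**2 / (24 * 0.3 * 0.7))  # should equal 'cost'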
If we take P'(H) ~ r(H) and do averaging around r(H), total message length to transmit data by method 3 is approximately:
1/2 {log(n/12) + 1} - Σ_{[m:HT]} (n.r(m) + 1/2).log(r(m))
= 1/2 {log(n/12) + 1 - Σ_{[m:HT]} log(r(m))} - h.log(h/n) - t.log(t/n)
where -h.log(h/n) - t.log(t/n) is part (ii) in the "optimal" code.
To compare Method 3 with methods 1 and 2:
Stirling's approx: log(x!) = x.log(x) - x + 1/2 log(x) + 1/2 log(2 pi) + ...
method 1 or 2:
log( (n+1)!/(h! t!) )
= log(n+1) + log(n!/(h! t!))
= log(n+1) + n.log(n) - n + 1/2 log(n) + 1/2 log(2 pi)
            - h.log(h) + h - 1/2 log(h) - 1/2 log(2 pi)
            - t.log(t) + t - 1/2 log(t) - 1/2 log(2 pi)
= log(n+1) - h.log(h/n) - t.log(t/n) + 1/2 (log(n) - log(h) - log(t) - log(2 pi))
= 1/2 ( log((n+1)^{2}.n/(h.t)) - log(2 pi) ) - h.log(h/n) - t.log(t/n)
= 1/2 { log((n+1)^{2}/n) - log(h/n) - log(t/n) - log(2 pi) } - h.log(h/n) - t.log(t/n)
= 1/2 { log((n+1)^{2}/n) - log(2 pi) - log(r(H)) - log(r(T)) } - h.log(h/n) - t.log(t/n)

method 3 - method 1
= 1/2 { log(n/12) + 1 - Σ_{[m:HT]} log(r(m)) } - h.log(h/n) - t.log(t/n)
  - { 1/2 { log((n+1)^{2}/n) - log(2 pi) - log(r(H)) - log(r(T)) } - h.log(h/n) - t.log(t/n) }
= 1/2 { -2.log(1+1/n) - log(12) + 1 + log(2 pi) }
~ 1/2 { log(pi/6) + 1 }    ... for large n
= 1/2 log(pi.e/6)
= 0.176 nits
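This gap is easy to check numerically; a sketch (the functions are my own, not from the paper):

  from math import lgamma, log, pi, e

  def method1(n, h):
      # log((n+1)!/(h!(n-h)!)) in nits
      return log(n + 1) + lgamma(n + 1) - lgamma(h + 1) - lgamma(n - h + 1)

  def method3(n, h):
      # 1/2{log(n/12) + 1 - sum_m log(r(m))} - h.log(h/n) - t.log(t/n)
      t = n - h
      rh, rt = h / n, t / n
      return 0.5 * (log(n / 12) + 1 - log(rh) - log(rt)) - h * log(rh) - t * log(rt)

  n, h = 1000, 300
  print(method3(n, h) - method1(n, h))  # ~0.177 for n=1000; -> 0.176 as n grows
  print(0.5 * log(pi * e / 6))          # = 0.176 nits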
So method 3 is just marginally(!) worse than methods 1 and 2, but then it states an estimate P'() of P(), which methods 1 and 2 do not.
Demonstration
Use the HTML FORM below to generate a sequence of coin tosses for a specified pr(H) and length N. The 'code' button calculates the code length for a fixed code (nominate an estimate p' for P(H)), the combinatorial code, the adaptive code, and a code which states pr(H) to an optimal level of accuracy. NB. the approximations used may break down for very small values of N.
Don't worry if JavaScript gives a ridiculous number of decimal places, it's just its way.
Method 4, sparse list
 There is a fourth method, not mentioned by W&B, of representing sequences of binary data which may seem tempting for "sparse" (skewed) sequences, i.e. mostly tails (or mostly heads): list the positions of the heads (tails). It takes log(n) bits (logs base 2) to specify an arbitrary integer in [0 .. n-1] (i.e., a position). It takes
 log(n) + log(n-1) + ... + log(n-h+1) - log(h!), where h = #heads,
 = log(n!/(h!(n-h)!)) bits
 to specify h different integers. As before, h ∈ [0 .. n] must also be transmitted, which takes log(n+1) bits under the uniform prior that we have been using. The total message length is therefore
 log( (n+1)!/(h!(n-h)!) ),
 the same as methods 1 and 2, as one would hope.
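A sketch of the sparse-list cost (hypothetical name; natural logs, i.e. nits, as in the earlier sketches); it matches methods 1 and 2:

  from math import lgamma, log

  def sparse_list_length(n, h):
      msglen = log(n + 1)            # transmit h under the uniform prior
      for i in range(h):
          msglen += log(n - i)       # log(n) + log(n-1) + ... + log(n-h+1)
      msglen -= lgamma(h + 1)        # order of the h positions is irrelevant: -log(h!)
      return msglen

  print(sparse_list_length(10, 3))   # equals log((n+1)!/(h!(n-h)!))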
This result is relevant when considering the adjacency-matrix v. adjacency-list representations of graphs (networks).
Notes
This page is directly based on the appendix of the paper Wallace & Boulton (1968); any errors by L.A.
C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11(2), pp.185-194, Aug. 1968. (See the appendix.)
D. M. Boulton & C. S. Wallace. The Information Content of a Multistate Distribution. J. Theor. Biol., 23(2), pp.269-278, 1969.
The multistate distribution generalises the 2-state distribution.