Information
This page: Information, (un)certainty, entropy, Kullback-Leibler distance.
Chris Wallace proposed this (slightly) simplified taxonomy of science to place ‘information’ in perspective.
Science:
- matter: chemistry
- energy: physics
- life: biology
- money: economics
- information: [*] (1920's on)
Examples
I toss a coin and tell you that it came down ‘heads’. I have told you some information. How much? A computer scientist immediately says ‘one bit’ (I hope).
I roll a four-sided "dice" with faces labelled {a,c,g,t} and tell you that it landed on ‘c’: two bits of information.
Suppose we have a trick coin with two heads and this is common knowledge. I toss the coin and tell you that it came down ... heads. How much information? Nothing, zero, of course. You knew it would come down heads and I have told you nothing that you did not already know. But if you had not known that it was a trick coin, you would have learned one bit, so the information learned depends on your prior knowledge.
We have a biased coin; it has a head and a tail but comes down heads about 75 in 100 times and tails about 25 in 100 times, and this is common knowledge. I toss the coin and tell you ... tails, two bits of information. A second toss lands ... heads, rather less than one bit, wouldn't you say? (Only 0.42 bits in fact.)
I pull a coin out of my pocket and tell you that it is a trick coin with two heads. How much information have you gained? Well "quite a lot", maybe 20 or more bits, because trick coins are very rare, and you may never even have seen one before.
So if something is certain then there is no information in learning that it has occurred. The less probable an event is, the more information is learned by being told of its happening.
Information: Definition
The amount of information in learning of an event ‘A’ which has probability P(A) is
- I(A) = -log2(P(A)) bits
- I(A) = -loge(P(A)) nits (aka nats)
- P(A) = 1 => I(A) = -log2(1) = 0
- P(B) = 1/2 => I(B) = -log2(1/2) = log2(2) = 1 bit
- P(C) = 1/2^n => I(C) = n bits
- P(D) = 0 => I(D) = ∞ ... think about it
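The definition is easy to check numerically. Here is a minimal Python sketch (the function name 'information' is just an illustrative choice made here) for I(A) = -log2(P(A)):

from math import log2

def information(p):
    # information, in bits, gained on learning of an event of probability p
    if p <= 0:
        raise ValueError('an impossible event would carry infinite information')
    return -log2(p)

print(information(1/2))   # fair coin lands heads: 1.0 bit
print(information(1.0))   # known two-headed coin lands heads: 0.0 bits
print(information(1/4))   # biased coin (3/4 heads) lands tails: 2.0 bits
print(information(3/4))   # biased coin lands heads: approx 0.415 bits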
Entropy
Entropy tells us the average information in a probability distribution over a sample space S. It is defined to be
- H = - ∑v in S {P(v) log2 P(v)}
Examples
The fair coin
- H = -1/2 log2(1/2) - 1/2 log2(1/2)
- = 1/2 + 1/2
- = 1 bit
That biased coin, P(head)=0.75, P(tail)=0.25
- H = - 3/4 log2(3/4) - 1/4 log2(1/4)
- = 3/4×0.42 + 2/4 = 0.31 + 1/2
- = 0.81 bits, approx.
A biased four-sided dice, p(a)=1/2, p(c)=1/4, p(g)=p(t)=1/8
- H = - 1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8)
- = 1/2 + 1/2 + 3/8 + 3/8
- = 1 3/4 bits
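The three worked examples can be checked with a similar Python sketch (the name 'entropy' is again just a choice made here); zero-probability values contribute nothing, by the usual convention:

from math import log2

def entropy(dist):
    # entropy, in bits, of a probability distribution given as a list of probabilities
    return -sum(p * log2(p) for p in dist if p > 0)

print(entropy([1/2, 1/2]))            # fair coin: 1.0 bit
print(entropy([3/4, 1/4]))            # biased coin: approx 0.81 bits
print(entropy([1/2, 1/4, 1/8, 1/8]))  # biased four-sided dice: 1.75 bits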
Theorem H1
(The result is "classic" but this is from notes taken during talks by Chris Wallace (1988).)
If (pi)i=1..N and (qi)i=1..N are probability distributions, i.e. each is non-negative and sums to one, then the expression
- ∑i=1..N { - pi log2(qi) }
is minimised when qi = pi, for all i = 1..N.
Proof:
First note that to minimise f(a,b,c) subject to g(a,b,c)=0, we consider f(a,b,c) + λ.g(a,b,c). We have to do this because a, b & c are not independent; they are constrained by g(a,b,c)=0. If we were just to set d/da{f(a,b,c)} to zero we would miss any effects that 'a' has on b & c through g( ). We don't know how important these effects are in advance, but λ will tell us. We differentiate and set to zero the following: d/da{f(a,b,c) + λ.g(a,b,c)} = 0, and similarly for d/db, d/dc and d/dλ.
For theorem H1 we minimise over (wj) the quantity ∑j=1..N { - pj log2(wj) } subject to the constraint ∑j=1..N wj = 1, so we set
- d/dwi { - ∑j=1..N pj log2(wj) + λ((∑j=1..N wj) - 1) }
- = - pi/wi + λ = 0 (the constant factor 1/ln(2) from differentiating log2 is absorbed into λ)
- hence wi = pi / λ
- now ∑ pi = 1 and ∑ wi = 1, so λ = 1
- hence wi = pi, i.e. the expression is minimised when the (qi) of the theorem equal the (pi).
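Theorem H1 can also be seen at work numerically. The sketch below (illustrative; the helper name 'cross_entropy' is an assumption made here) evaluates ∑i { - pi log2(qi) } for the biased four-sided dice (pi) and several choices of (qi); the smallest value, 1.75 bits, occurs at qi = pi, where the expression equals the entropy of (pi):

from math import log2

def cross_entropy(p, q):
    # expected cost, in bits, of describing events with probabilities p
    # using a code that is optimal for the distribution q
    return -sum(pi * log2(qi) for pi, qi in zip(p, q))

p = [1/2, 1/4, 1/8, 1/8]                  # the biased four-sided dice
for q in ([1/4, 1/4, 1/4, 1/4],           # the fair dice: 2.0 bits
          [0.4, 0.3, 0.2, 0.1],           # some other distribution: approx 1.8 bits
          p):                             # q = p: 1.75 bits, the minimum
    print(q, cross_entropy(p, q))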
Corollary (Information Inequality)
- ∑i=1..N { pi log2( pi / qi) } ≥ 0
with equality only when pi = qi for all i. This follows from theorem H1, since the left-hand side equals ∑i { - pi log2(qi) } - ∑i { - pi log2(pi) }, which is non-negative because the first sum is minimised at qi = pi.
Kullback-Leibler Distance
The left-hand side of the information inequality,
- ∑i=1..N { pi log2( pi / qi) }
is known as the Kullback-Leibler distance (also called the relative entropy) of the distribution (qi) from (pi). It is zero only when the two distributions are identical, and note that it is not symmetric in (pi) and (qi) in general.
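A minimal Python sketch of this quantity (the name 'kl_distance' is just a choice made here), applied to the fair and biased coins from the examples above; the two directions give different values:

from math import log2

def kl_distance(p, q):
    # Kullback-Leibler distance, in bits, of distribution q from distribution p
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [1/2, 1/2]
biased = [3/4, 1/4]
print(kl_distance(fair, biased))   # approx 0.208 bits
print(kl_distance(biased, fair))   # approx 0.189 bits -- not symmetric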
Exercise
- Calculate the Kullback-Leibler distances between the fair and biased (above) probability distributions for the four-sided dice. [Ans.]
Notes
- S. Kullback & R. A. Leibler. On Information and Sufficiency. Annals of Mathematical Statistics, 22, pp.79-86, 1951.
- S. Kullback. Information Theory and Statistics. Wiley, 1959.
- C. S. Wallace. Information Theory. Dept. Computer Science, Monash University, Victoria, Australia, 1988. Notes from an honours course on I.T., c1988 and later.
- G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, Monash University, Victoria, Australia, 1999. Good source of research material into coding, machine learning and inference.
- [log2 tables]
-- L. Allison 1999
Thanks to Dean McKenzie for the K-L refs.