Information

This page:
Information, (un)certainty, entropy, Kullback-Leibler distance.

Chris Wallace proposed this (slightly) simplified taxonomy of science to place ‘information’ in perspective.

Science
  matter        chemistry
  energy        physics
  life          biology
  money         economics
  information   [*] 1920's on
The term ‘Information Technology’ is much used. The concept of information is worth a bit of study.

Examples

I toss a coin and tell you that it came down ‘heads’. I have told you some information. How much? A computer scientist immediately says ‘one bit’ (I hope).

I roll a four-sided "dice" with faces labelled {a,c,g,t} and tell you that it landed on ‘c’: two bits of information.

Suppose we have a trick coin with two heads and this is common knowledge. I toss the coin and tell you that it came down ... heads. How much information? Nothing, zero, of course. You knew it would come down heads and I have told you nothing that you did not already know. But if you had not known that it was a trick coin, you would have learned one bit, so the information learned depends on your prior knowledge.

We have a biased coin; it has a head and a tail but comes down heads about 75 in 100 times and tails about 25 in 100 times, and this is common knowledge. I toss the coin and tell you ... tails, two bits of information. A second toss lands ... heads, rather less than one bit, wouldn't you say? (Only 0.42 bits in fact.)

I pull a coin out of my pocket and tell you that it is a trick coin with two heads. How much information have you gained? Well "quite a lot", maybe 20 or more bits, because trick coins are very rare, and you may never even have seen one before.

So if something is certain, then no information is gained by learning that it has occurred. The less probable an event is, the more information is learned by being told of its happening.

Information: Definition

The amount of information in learning of an event ‘A’ which has probability P(A) is

I(A) = -log2(P(A)) bits
I(A) = -loge(P(A)) nits (aka nats)
Note that
P(A) = 1 => I(A) = -log2(1) = 0
P(B) = 1/2 => I(B) = -log2(1/2) = log2(2) = 1 bit
P(C) = 1/2^n => I(C) = n bits
P(D) = 0 => I(D) = ∞ ... think about it
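
As a quick check of the numbers quoted in the examples above, here is a minimal Python sketch of this definition. (The one-in-a-million probability used for pulling out a trick coin is just an illustrative assumption, not a figure from the text.)

import math

def information_bits(p):
    # Information, in bits, gained on learning of an event of probability p.
    return -math.log2(p)

print(information_bits(1/2))    # fair coin lands heads: 1.0 bit
print(information_bits(1/4))    # four-sided dice lands 'c': 2.0 bits
print(information_bits(0.75))   # biased coin lands heads: ~0.42 bits
print(information_bits(0.25))   # biased coin lands tails: 2.0 bits
print(information_bits(1e-6))   # assumed rarity of a trick coin: ~19.9 bits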

Entropy

Entropy tells us the average information in a probability distribution over a sample space S. It is defined to be

H = - ∑v in S {P(v) log2 P(v)}
This is for a discrete sample space but can be extended to a continuous one by the use of an integral.

Examples

The fair coin

H = -1/2 log2(1/2) - 1/2 log2(1/2)
= 1/2 + 1/2
= 1 bit

That biased coin, P(head)=0.75, P(tail)=0.25

H = - 3/4 log2(3/4) - 1/4 log2(1/4)
= 3/4×0.42 + 2/4 = 0.31 + 1/2
= 0.81 bits, approx.

A biased four-sided dice, p(a)=1/2, p(c)=1/4, p(g)=p(t)=1/8

H = - 1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8)
= 1 3/4 bits
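
The worked entropy values above can be checked with a small Python sketch of the definition:

import math

def entropy_bits(dist):
    # Entropy, in bits, of a discrete distribution given as a list of probabilities.
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy_bits([1/2, 1/2]))              # fair coin: 1.0 bit
print(entropy_bits([3/4, 1/4]))              # biased coin: ~0.81 bits
print(entropy_bits([1/2, 1/4, 1/8, 1/8]))    # biased four-sided dice: 1.75 bits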

Theorem H1

(The result is "classic" but this is from notes taken during talks by Chris Wallace (1988).)

If (pi)i=1..N and (qi)i=1..N are probability distributions, i.e. the values are non-negative and each distribution sums to one, then the expression

∑i=1..N { - pi log2(qi) }
is minimised when qi = pi.

Proof: First note that to minimise f(a,b,c) subject to g(a,b,c)=0, we consider f(a,b,c)+λ.g(a,b,c). We have to do this because a, b & c are not independent; they are constrained by g(a,b,c)=0. If we were just to set d/da{f(a,b,c)} to zero we would miss any effects that ‘a’ has on b & c through g( ). We don't know how important these effects are in advance, but λ will tell us.
We differentiate and set to zero the following: d/da{f(a,b,c)+λ.g(a,b,c)}=0 and similarly for d/db, d/dc and d/d λ.

d/dwi { - ∑j=1..N pj log2 wj + λ((∑j=1..N wj) - 1) }
= - pi/wi + λ = 0   (dropping the constant factor 1/loge2 from differentiating log2; it only rescales λ and does not affect the minimising wi)
hence wi = pi / λ
∑ pi = 1, and ∑ wi = 1, so λ = 1
hence wi = pi
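
The theorem can also be illustrated numerically (a check only, not part of the proof): for a fixed distribution (pi), evaluate -∑ pi log2(qi) at a few candidate distributions (qi); q = p gives the smallest value, namely the entropy of p. The candidate distributions below are arbitrary choices for illustration.

import math

def neg_log_score_bits(p, q):
    # The expression of Theorem H1: - sum_i p_i log2(q_i).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [1/2, 1/4, 1/8, 1/8]                      # the biased four-sided dice
for q in ([1/4, 1/4, 1/4, 1/4], [0.4, 0.3, 0.2, 0.1], p):
    print(q, round(neg_log_score_bits(p, q), 3))
# 2.0 bits, ~1.8 bits and 1.75 bits respectively; the minimum is at q = p.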

Corollary (Information Inequality)

∑i=1..N { pi log2( pi / qi) } ≥ 0
with equality iff pi=qi, e.g., see Farr (1999).

Kullback-Leibler Distance

The left-hand side of the information inequality

∑i=1..N { pi log2( pi / qi) }
is called the Kullback-Leibler distance (also relative entropy) of the probability distribution (qi) from the probability distribution (pi). It is always non-negative. Note that it is not symmetric in general. (The Kullback-Leibler distance is defined on continuous distributions through the use of an integral in place of the sum.)
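
A minimal Python sketch of the Kullback-Leibler distance, illustrating that it is non-negative, zero only when the two distributions are equal, and not symmetric:

import math

def kl_bits(p, q):
    # Kullback-Leibler distance, in bits, of distribution q from distribution p.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [3/4, 1/4]           # the biased coin
q = [1/2, 1/2]           # the fair coin
print(kl_bits(p, q))     # ~0.19 bits
print(kl_bits(q, p))     # ~0.21 bits; note the asymmetry
print(kl_bits(p, p))     # 0.0; equality iff p = q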

Exercise

Notes

-- L. Allison 1999

Thanks to Dean McKenzie for the K-L ref's.