Information

This page:
Information, (un)certainty, entropy, Kullback-Leibler distance.

Chris Wallace proposed this (slightly) simplified taxonomy of science to place ‘information’ in perspective.

Science
  matter        chemistry
  energy        physics
  life          biology
  money         economics
  information   [*] 1920's on
The term ‘Information Technology’ is much used. The concept of information is worth a bit of study.

Examples

I toss a coin and tell you that it came down ‘heads’. I have told you some information. How much? A computer scientist immediately says ‘one bit’ (I hope).

I roll a four-sided "dice" with faces labelled {a,c,g,t} and tell you that it landed on ‘c’: two bits of information.

Suppose we have a trick coin with two heads and this is common knowledge. I toss the coin and tell you that it came down ... heads. How much information? Nothing, zero, of course. You knew it would come down heads and I have told you nothing that you did not already know. But if you had not known that it was a trick coin, you would have learned one bit, so the information learned depends on your prior knowledge.

We have a biased coin; it has a head and a tail but comes down heads about 75 in 100 times and tails about 25 in 100 times, and this is common knowledge. I toss the coin and tell you ... tails, two bits of information. A second toss lands ... heads, rather less than one bit, wouldn't you say? (Only 0.42 bits in fact.)

I pull a coin out of my pocket and tell you that it is a trick coin with two heads. How much information have you gained? Well "quite a lot", maybe 20 or more bits, because trick coins are very rare, and you may never even have seen one before.

So if something is certain, then no information is gained by learning that it has occurred. The less probable an event is, the more information is learned by being told of its happening.

Information: Definition

The amount of information in learning of an event ‘A’ which has probability P(A) is

I(A) = -log2(P(A)) bits
I(A) = -loge(P(A)) nits (aka nats)
Note that
P(A) = 1 => I(A) = -log2(1) = 0
P(B) = 1/2 => I(B) = -log2(1/2) = log2(2) = 1 bit
P(C) = 1/2^n => I(C) = n bits
P(D) = 0 => I(D) = ∞ ... think about it
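
As a quick check of the numbers quoted in the examples above, here is a minimal Python sketch of this definition. (The one-in-a-million probability used for pulling out a trick coin is just an illustrative assumption, not a figure from the text.)

import math

def information_bits(p):
    # Information, in bits, gained on learning of an event of probability p.
    return -math.log2(p)

print(information_bits(1/2))    # fair coin lands heads: 1.0 bit
print(information_bits(1/4))    # four-sided dice lands 'c': 2.0 bits
print(information_bits(0.75))   # biased coin lands heads: ~0.42 bits
print(information_bits(0.25))   # biased coin lands tails: 2.0 bits
print(information_bits(1e-6))   # assumed rarity of a trick coin: ~19.9 bits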

Entropy

Entropy tells us the average information in a probability distribution over a sample space S. It is defined to be

H = - ∑v in S {P(v) log2 P(v)}
This is for a discrete sample space but can be extended to a continuous one by the use of an integral.

Examples

The fair coin

H = -1/2 log2(1/2) - 1/2 log2(1/2)
= 1/2 + 1/2
= 1 bit

That biased coin, P(head)=0.75, P(tail)=0.25

H = - 3/4 log2(3/4) - 1/4 log2(1/4)
= 3/4×0.42 + 2/4 = 0.31 + 1/2
= 0.81 bits, approx.

A biased four-sided dice, p(a)=1/2, p(c)=1/4, p(g)=p(t)=1/8

H = - 1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8)
= 1 3/4 bits
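
The worked entropy values above can be checked with a small Python sketch of the definition:

import math

def entropy_bits(dist):
    # Entropy, in bits, of a discrete distribution given as a list of probabilities.
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy_bits([1/2, 1/2]))              # fair coin: 1.0 bit
print(entropy_bits([3/4, 1/4]))              # biased coin: ~0.81 bits
print(entropy_bits([1/2, 1/4, 1/8, 1/8]))    # biased four-sided dice: 1.75 bits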

Theorem H1

(The result is "classic" but this is from notes taken during talks by Chris Wallace (1988).)

If (pi)i=1..N and (qi)i=1..N are probability distributions, i.e. the values are non-negative and each distribution sums to one, then the expression

∑i=1..N { - pi log2(qi) }
is minimised when qi = pi.

Proof: First note that to minimise f(a,b,c) subject to g(a,b,c)=0, we consider f(a,b,c)+λ.g(a,b,c). We have to do this because a, b & c are not independent; they are constrained by g(a,b,c)=0. If we were just to set d/da{f(a,b,c)} to zero we would miss any effects that ‘a’ has on b & c through g( ). We don't know how important these effects are in advance, but λ will tell us.
We differentiate and set to zero the following: d/da{f(a,b,c)+λ.g(a,b,c)}=0 and similarly for d/db, d/dc and d/d λ.

d/dwi { - ∑j=1..N pj log2 wj + λ((∑j=1..N wj) - 1) }
= - pi/wi + λ = 0   (dropping the constant factor 1/loge2 from differentiating log2; it only rescales λ and does not affect the minimising wi)
hence wi = pi / λ
∑ pi = 1, and ∑ wi = 1, so λ = 1
hence wi = pi
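
The theorem can also be illustrated numerically (a check only, not part of the proof): for a fixed distribution (pi), evaluate -∑ pi log2(qi) at a few candidate distributions (qi); q = p gives the smallest value, namely the entropy of p. The candidate distributions below are arbitrary choices for illustration.

import math

def neg_log_score_bits(p, q):
    # The expression of Theorem H1: - sum_i p_i log2(q_i).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [1/2, 1/4, 1/8, 1/8]                      # the biased four-sided dice
for q in ([1/4, 1/4, 1/4, 1/4], [0.4, 0.3, 0.2, 0.1], p):
    print(q, round(neg_log_score_bits(p, q), 3))
# 2.0 bits, ~1.8 bits and 1.75 bits respectively; the minimum is at q = p.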

Corollary (Information Inequality)

∑i=1..N { pi log2( pi / qi) } ≥ 0
with equality iff pi=qi, e.g., see Farr (1999).

Kullback-Leibler Distance

The left-hand side of the information inequality

∑i=1..N { pi log2( pi / qi) }
is called the Kullback-Leibler distance (also relative entropy) of the probability distribution (qi) from the probability distribution (pi). It is always non-negative. Note that it is not symmetric in general. (The Kullback-Leibler distance is defined on continuous distributions through the use of an integral in place of the sum.)
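
A minimal Python sketch of the Kullback-Leibler distance, illustrating that it is non-negative, zero only when the two distributions are equal, and not symmetric:

import math

def kl_bits(p, q):
    # Kullback-Leibler distance, in bits, of distribution q from distribution p.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [3/4, 1/4]           # the biased coin
q = [1/2, 1/2]           # the fair coin
print(kl_bits(p, q))     # ~0.19 bits
print(kl_bits(q, p))     # ~0.21 bits; note the asymmetry
print(kl_bits(p, p))     # 0.0; equality iff p = q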

Exercise

Notes

-- L. Allison 1999

Thanks to Dean McKenzie for the K-L ref's.