Compression

The information gained on learning that an event E of probability pr(E) has occurred is -log2(pr(E)) bits. For example, if the four DNA bases {A,C,G,T} each occurred 1/4 of the time, an optimal code would be

A 00,
C 01,
G 10,
T 11

Note that -log2(0.25)=2: each base would be worth 2 bits of information. However, if the probabilities of the bases were, say, A 1/2, C 1/4, G 1/8, T 1/8, then

A 0,
C 10,
G 110,
T 111

would be optimal; note that -log2(1/8)=3, and so on. In this case the average code length would be

0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 7/4 bits/base

which is less than before. A G or a T carries more information than a C, which carries more information than an A. There is more information on information under [MML].
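
To make the arithmetic concrete, here is a small Python sketch (illustrative only; the names are chosen for this example) that checks each base's information content, -log2(p), against the code lengths above and confirms the 7/4 bits/base average:

    from math import log2

    # Probabilities of the four bases and the prefix code given above.
    probs = {'A': 0.5, 'C': 0.25, 'G': 0.125, 'T': 0.125}
    code  = {'A': '0', 'C': '10', 'G': '110', 'T': '111'}

    # Each base's information content matches its code length exactly.
    for base, p in probs.items():
        print(base, -log2(p), "bits, code length", len(code[base]))

    # Expected code length: sum of p * len(code) over the four bases.
    avg = sum(p * len(code[b]) for b, p in probs.items())
    print("average:", avg, "bits/base")   # prints 1.75, i.e. 7/4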

In general, the probability of the next symbol, S[i], of a sequence, S, may depend on the previous symbols, and then we deal with conditional probabilities pr(S[i] | S[1..i-1]). Modelling these dependencies gives much better compression of DNA and protein sequences [Cao et al] [Allison et al].
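
As a minimal sketch of the idea (not the models of [Cao et al] or [Allison et al], which are far more sophisticated), a first-order model sums -log2 pr(S[i] | S[i-1]) along the sequence; the conditional probabilities below are made up for illustration:

    from math import log2

    def info_order1(s, cond):
        # Total information (bits) of s under first-order conditional
        # probabilities cond[prev][next]; the first symbol is charged a
        # flat 2 bits here purely as a simplifying assumption.
        bits = 2.0
        for prev, nxt in zip(s, s[1:]):
            bits += -log2(cond[prev][nxt])
        return bits

    # Made-up conditional probabilities: an 'A' tends to follow an 'A'.
    uniform = {b: 0.25 for b in "ACGT"}
    cond = {b: dict(uniform) for b in "ACGT"}
    cond['A'] = {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1}

    # A run of As costs well under the 2 bits/base of the uniform model.
    print(info_order1("AAAAAC", cond), "bits for 6 bases")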

Information content can be used to discover patterns, repeats, gene duplications and the like in sequences. It can also give a distance between DNA sequences or protein sequences, for classification or for inference of phylogenetic (evolutionary) trees, without aligning the sequences. And "costing" the symbols in an [alignment] according to the symbols' information content gives an alignment algorithm, M-Align, that is more accurate in detecting genuine relatedness in populations of non-random (repetitive, low information content, compressible) sequences [Allison] [Powell et al].
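
The sequence-distance methods of [Allison et al] and [Powell et al] rest on explicit statistical models; as a rough, self-contained illustration of the compression-gives-a-distance idea only, a general-purpose compressor (zlib below, a stand-in assumed for this sketch, not the authors' method) can play the role of the model, in the style of a normalised compression distance:

    import zlib

    def csize(s):
        # Compressed size in bytes under a general-purpose compressor.
        return len(zlib.compress(s.encode()))

    def ncd(x, y):
        # Normalised compression distance: near 0 when x and y share
        # much structure, near 1 when they are unrelated.
        return (csize(x + y) - min(csize(x), csize(y))) / max(csize(x), csize(y))

    a = "ACGT" * 50                    # repetitive, low information content
    b = "ACGT" * 49 + "ACGA"           # nearly identical to a
    print(ncd(a, b))                   # small: the sequences are related
    print(ncd(a, "TTGACCGTAAGC" * 16)) # larger: different structure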