Compression
The information in learning of an event E of probability pr(E) is -log2(pr(E)) bits. For example, if the four DNA bases {A,C,G,T} each occurred 1/4 of the time, an optimal code would be
  A=00, C=01, G=10, T=11.
Note that -log2(0.25)=2: each base would be worth 2 bits of information.
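The relation between probability and information can be sketched in a few lines (a minimal illustration; the function name `information` is not from the page):

```python
import math

def information(p):
    # Information (in bits) of an event with probability p: -log2(p).
    return -math.log2(p)

# Uniform DNA bases: each of {A,C,G,T} has probability 1/4,
# so learning any one base yields exactly 2 bits.
print(information(0.25))  # 2.0
```

A rarer event carries more information: information(1/8) is 3 bits, information(1/2) only 1 bit.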
However, if the probabilities of the bases were
  pr(A)=1/2, pr(C)=1/4, pr(G)=1/8, pr(T)=1/8,
then the code
  A=0, C=10, G=110, T=111
would be optimal; note -log2(1/2)=1, -log2(1/4)=2, -log2(1/8)=3, etc.. In this case the average code length would be
  1/2×1 + 1/4×2 + 1/8×3 + 1/8×3 = 1.75 bits per base,
which is less than before.
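As a sketch (assuming, for illustration, the skewed distribution pr(A)=1/2, pr(C)=1/4, pr(G)=pr(T)=1/8 and a matching prefix code), the average code length is the probability-weighted sum of the code-word lengths:

```python
# Illustrative skewed distribution and an optimal prefix code for it.
probs   = {'A': 1/2, 'C': 1/4, 'G': 1/8, 'T': 1/8}
lengths = {'A': 1,   'C': 2,   'G': 3,   'T': 3}   # e.g. A=0, C=10, G=110, T=111

# Average code length = sum over symbols of pr(s) * |code(s)|.
avg = sum(probs[s] * lengths[s] for s in probs)
print(avg)  # 1.75 bits per base, less than the 2 bits of the uniform case
```

Note that each code-word length equals -log2 of the symbol's probability, which is exactly what makes this code optimal for this distribution.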
In general the probability of the next symbol, S[i], of a sequence, S, may depend on previous symbols, and then we deal with conditional probabilities, pr(S[i] | S[1..i-1]). Information content can be used to discover patterns, repeats, gene duplications and the like in sequences. It can also give a distance between DNA sequences or protein sequences, for classification or for inference of phylogenetic (evolutionary) trees, without aligning the sequences.
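The conditional case can be sketched with a first-order (Markov) model, where each symbol is coded using pr(S[i] | S[i-1]); the function and the toy probabilities below are illustrative, not taken from the page:

```python
import math

def info_markov1(s, cond, first):
    # Information content (bits) of sequence s under a first-order model:
    # the first symbol is coded with the marginal distribution first[.],
    # each later symbol with cond[prev][cur] = pr(cur | prev).
    bits = -math.log2(first[s[0]])
    for prev, cur in zip(s, s[1:]):
        bits += -math.log2(cond[prev][cur])
    return bits

# Toy check: with uniform conditionals every base still costs 2 bits.
uniform = {b: 0.25 for b in 'ACGT'}
cond = {b: dict(uniform) for b in 'ACGT'}
print(info_markov1('ACGT', cond, uniform))  # 8.0 bits for 4 bases
```

A model that captures real dependencies assigns higher probabilities, hence fewer bits, to typical sequences; regions that compress unusually well (or badly) under such a model point at repeats and other patterns.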
And "costing" the symbols in an [alignment] according to the symbols' information content gives an alignment algorithm.
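One hedged sketch of such costing (the probabilities and the two-outcome match/mismatch model below are hypothetical, chosen only to illustrate the -log2 idea, not the page's actual scheme):

```python
import math

# Illustrative probabilities: a match is common, each of the three
# possible mismatching bases shares the remaining probability.
pr_match = 0.8
pr_mismatch = 0.2 / 3

def pair_cost(x, y):
    # Cost an aligned pair by the information content of the event.
    return -math.log2(pr_match if x == y else pr_mismatch)

# Total cost of a (gap-free) alignment is the sum of its pair costs;
# a cheaper alignment is a more probable, hence better, explanation.
cost = sum(pair_cost(x, y) for x, y in zip('ACGT', 'ACGA'))
print(cost)
```

Under this view the usual "edit costs" of an alignment algorithm are just information contents, so the best alignment is the one that would compress the pair of sequences the most.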
© L. Allison, www.allisons.org/ll/ (or as otherwise indicated).