Compression

LA home
Computing
Bioinformatics
 Glossary
 Algorithms
 Alignment ||
 Compression
 Mapping
 Multiple ||
 Protein
 Three ||
 Trees
 Notes
 Compression

  AEMB11
  BMCbio10
  PAKDD09
  DCC07
  BMCbio07
  AI04
  MBP01
  CC00
  CompJ99
  ISMB98

also see
MML

The information in learning of an event E of probability pr(E) is -log2(pr(E)) bits. For example, if the four DNA bases {A,C,G,T} each occured 1/4 of the time, an optimal code would be

A 00,
C 01,
G 10,
T 11

Note that -log2(0.25)=2: Each base would be worth 2-bits of information. However, if the probabilities of the bases were A 1/2, C 1/4, G 1/8, T 1/8, say, then

A 0,
C 10,
G 110,
T 111

would be optimal; note -log2(1/8)=3, etc.. In this case the average code length would be

0.5×1 + 0.25×2 + 0.125×3 + 0.125×3 = 13/4-bits/base

which is less than before. A base G or a T contains more information than a C which contains more information than an A. There is more information on information under [MML].

In general the probability of the next symbol, S[i], of a sequence, S, may depend on previous symbols, and then we deal with conditional probabilities pr(S[i] | S[1..i-1]). This gives much better compression of DNA and protein sequences [Cao et al] [Allison et al].

Information content can be used to discover patterns, repeats, gene duplications and the like in sequences. It can also give a distance between DNA sequences or protein sequences, for classification or for inference of phylogenetic (evolutionary) trees, without aligning the sequences. And "costing" the symbols in an [alignment] according to the symbols' information content gives an alignment algorithm, M-Align, that is more accurate in detecting genuine relatedness in populations of non-random (repetitive, low information content, compressible) sequences [Allison] [Powell et al].

www #ad:

↑ © L. Allison, www.allisons.org/ll/   (or as otherwise indicated).
Created with "vi (Linux)",  charset=iso-8859-1,   fetched Friday, 29-Mar-2024 10:44:40 UTC.

Free: Linux, Ubuntu operating-sys, OpenOffice office-suite, The GIMP ~photoshop, Firefox web-browser, FlashBlock flash on/off.