A Simple Statistical Algorithm for Biological Sequence Compression
Minh Duc Cao, Trevor I. Dix, Lloyd Allison, Chris Mears
Abstract:
This paper introduces a novel algorithm for biological sequence compression
that makes use of both statistical properties and repetition within
sequences.
A panel of experts is maintained to estimate the probability
distribution of the next symbol in the sequence to be encoded.
Expert probabilities are combined to obtain the final distribution.
Each symbol is then encoded by arithmetic coding.
The resulting information sequence provides insight for further study of
the biological sequence.
Experiments show that our algorithm outperforms existing compressors on
typical DNA and protein sequence datasets while maintaining
a practical running time.
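
To make the expert-panel idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a DNA alphabet, uses hypothetical adaptive order-k context models as the experts (the class OrderKExpert and the function information_sequence are names invented for this illustration), and combines their predictions by Bayesian averaging, which is one plausible combination rule that the abstract does not specify. Instead of running an arithmetic coder, it reports the information content of each symbol, -log2 p, which is the code length an ideal arithmetic coder would approach.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: DNA; a protein alphabet would work the same way


class OrderKExpert:
    """Hypothetical expert: adaptive order-k context model, add-one smoothing."""

    def __init__(self, k):
        self.k = k
        # Every context starts with a count of 1 per symbol, so no
        # prediction is ever exactly zero.
        self.counts = defaultdict(lambda: {s: 1 for s in ALPHABET})

    def _context(self, history):
        # history[-0:] would return the whole string, so handle k == 0 explicitly.
        return history[-self.k:] if self.k else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def information_sequence(seq, experts):
    """Return the per-symbol information content (in bits) of seq."""
    weights = [1.0 / len(experts)] * len(experts)  # uniform prior over experts
    info = []
    for i, symbol in enumerate(seq):
        history = seq[:i]  # quadratic overall; acceptable for a sketch
        preds = [e.predict(history) for e in experts]
        # Combine the expert distributions into the final distribution.
        mixed = {s: sum(w * p[s] for w, p in zip(weights, preds))
                 for s in ALPHABET}
        info.append(-math.log2(mixed[symbol]))
        # Bayesian reweighting (assumed rule): experts that predicted the
        # observed symbol well gain weight.
        weights = [w * p[symbol] for w, p in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for e in experts:
            e.update(history, symbol)
    return info


if __name__ == "__main__":
    bits = information_sequence("ACGT" * 50, [OrderKExpert(0), OrderKExpert(2)])
    print(f"{sum(bits) / len(bits):.3f} bits per symbol")
```

On a highly repetitive input such as the one above, the average cost falls well below the 2 bits per symbol of a uniform code, and peaks in the returned list mark positions the experts found hard to predict.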