Discovering simple DNA sequences by compression
David R. Powell, David L. Dowe, Lloyd Allison and Trevor I. Dix, Pacific Symposium on Biocomputing (PSB), pp.597-608, 1998
Abstract: An information-theoretic DNA compression scheme devised by Milosavljevic and Jurka (1993) has been used in many places in the literature for both the discovery of new genes and the compression of DNA. Their compression method applies an encoding of previously occurring runs. They use 5 different code-words: four being the DNA bases, A, C, G and T, and the other being a pointer to a previously occurring run. They advocate a code-word of length log2(5) for each of these and then encoding a run by a code-word of length 2 × log2(n), where n is the length of the sequence. This scheme encodes the start of the sequence with a code-word of length log2(n) and likewise encodes the end of the sequence with a code-word of length log2(n). In this paper, we show the above coding scheme to be inefficient in various ways and improve upon it so that it can compress DNA. We discuss our implementation of various schemes some of which run in linear time."