A Genome Alignment Algorithm Based on Compression

Minh Duc Cao, Trevor I. Dix, and Lloyd Allison

Technical Report 233, January 2009, Faculty of Information Technology (Clayton Campus), Monash University

Abstract: Traditional genome alignment methods based on dynamic programming are often a. computationally expensive, b. unable to compare the genomes of distant species and c. unable to deal with low information regions. We are presenting an information-theoretic approach for pairwise genome local alignment. Our method, the expert model aligner, the XMAligner, relies on the expert model compression algorithm. To align two sequences, XMAligner first compresses one sequence to measure the information content at each position in the sequence. Then the sequence is compressed again but this time with the background knowledge from the other sequence to obtain the conditional information content. The information content and the conditional information content from the two compressions are examined. Similar regions in the compressed sequence should have the conditional information content lower than the individual information content. The method is applied to align the genomes of Plasmodium falciparum and Plasmodium knowlesi against other three Plasmodium genomes with different levels of diversity. Despite the differences in nucleotide composition of the reference sequences, the conserved regions found by XMAligner in three alignments are relatively consistent. A strong correlation was found between the similar regions detected by the XMAligner and the hypothetical annotation of Plasmodium species. The alignment results can be integrated into the DNAPlatform for visualisation.

(Technical Reports can be obtained from the Faculty of Information Tech. (Clayton), Monash University, Australia 3800.)