Information-Theoretic Sequence Alignment

Technical Report 98/14, © L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168, 15 June 1998

NB. A related, and much more detailed paper on this topic appears in:
L.Allison, D.Powell & T.I.Dix, Compression and Approximate Matching, the Computer Journal, OUP, 42(1), pp.1-10, 1999.

Abstract: The sequence alignment problem is reexamined from the point of view of information theory. The information content of an alignment is split into two parts, that due to the differences between the two sequences and that due to the characters of the sequences. This allows the appropriate treatment of the non-randomness of sequences that have low information content. Efficient algorithms to align sequences in this way are given. There is a natural significance test for the acceptability of alignments. Under the new method, the test is unlikely to be fooled into believing that unrelated sequences are related simply because they share statistical biases.

Keywords: data compression, edit distance, information content, sequence alignment, sequence analysis.

Introduction

An alignment of two (or more) sequences is here treated as a hypothesis; it is just a possible way in which characters of one sequence may be matched with characters of another sequence. It shows one way, out of many, to 'edit' one sequence into the other. All other things being equal, an alignment corresponding to fewer mutations, i.e. insertions, deletions and changes, is more probable. Here it is shown how to include the probabilities of the characters themselves in the alignment process. These probabilities can vary from character to character in low information content sequences. Using this knowledge in alignment allows a more reliable assessment of whether or not two sequences are (probably) related.

It is a well known difficulty that alignments of unrelated but low information content sequences can give unreasonably low costs or, equivalently, high scores. This can result in 'false positive' matches when searching against a sequence database, for example. It can also lead to poor alignments even in cases of true positives. A partial solution is to mask-out low information content regions altogether before processing, as described by Wootton (1997). This is drastic and cannot be used if one is interested in the low information content regions. Wootton uses 'compositional complexity' defined as a moving average of the complexity under the multi-state distribution (Boulton and Wallace 1969).

An alignment is formed by padding out each sequence with zero or more null characters '-', until they have the same length. The sequences are then written out, one above the other.

e.g. ACGTACGTA-GT
     ||  | ||| ||
     AC--ATGTACGT
Each column or 'pair' contains two characters. The pair <-,-> is not allowed. A pair <x,x> represents a match or copy, <x,y> represents a change, <x,-> a deletion and <-,x> an insertion. The pairs represent point-mutations of the sequences or equivalently 'operations' on characters to edit one sequence into the other. Given a cost or score function on alignments one can search for an optimal alignment. The cost (score) usually takes the form of costs (scores) for the individual operations.
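
For illustration, the following short Python sketch (names chosen here, not taken from the paper) classifies the columns of the example alignment above into copies, changes, insertions and deletions:

# Classify the columns of an alignment given as two padded strings.
def columns(a, b):
    assert len(a) == len(b)
    for x, y in zip(a, b):
        assert (x, y) != ('-', '-')       # the pair <-,-> is not allowed
        if x == y:
            kind = 'copy'
        elif y == '-':
            kind = 'deletion'
        elif x == '-':
            kind = 'insertion'
        else:
            kind = 'change'
        yield x, y, kind

for col in columns("ACGTACGTA-GT", "AC--ATGTACGT"):
    print(col)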

There are a great many sequence alignment methods. In general terms they attempt to maximise the number of matches, or to minimise the number of mutations, or to do both in some combination (Needleman and Wunsch 1970). The longest common subsequence (LCS) problem (Hirschberg 1975) is to maximise the number of matching characters in an alignment. The edit distance problem (Levenshtein 1965, Sellers 1974) is to minimise the number of mutations - insertions, deletions and changes - in an alignment. If probabilities are assigned to mutations (Allison 1993) then one can talk of, and search for, a most probable alignment.

Most existing alignment algorithms make the tacit assumption that all character values are equally likely in all positions of a sequence and are independent of each other. Equivalently the sequences are assumed to be 'random', that is, generated by a 'uniform' zero-order Markov model. For DNA this amounts to giving each base a 2-bit code in data-compression terms. It is difficult to compress typical DNA much below 1.9 bits/base or 1.8 bits/base, if the word 'typical' has any meaning for DNA. Sophisticated models can yield compression to below 1.75 bits/base for the human Haemoglobin region, HUMHBB, for example (Loewenstern and Yianilos 1997, Allison et al 1998). However some subsequences are highly compressible, e.g. poly-A, (AT)*. Common subsequences, such as the Alu's (Bains 1986), could also be given special codes to make them compressible.

This paper argues that any compressibility, i.e. non-randomness, in sequences should be used in their alignment. First, there is a natural null hypothesis that two given sequences, S1 and S2, are unrelated. Its information content is the sum of the information content of each sequence:

P(S1&S2 |unrelated) = P(S1).P(S2),  so
-log2(P(S1&S2 |unrelated)) = -log2(P(S1)) + -log2(P(S2))
If the sequences are compressible then the null hypothesis is simpler in consequence. The null's converse is the hypothesis that the sequences are related. An alignment is a sub-hypothesis of the latter: it shows one particular way in which the sequences could be related by mutation. In the extreme case that S1=S2, the optimal alignment has an information content only slightly greater than half that of the null hypothesis, assuming that the sequences are reasonably long. As S1 and S2 grow further apart, the information content of even an optimal alignment increases until it exceeds the null's, at which point the alignment is not 'acceptable'. Thus an alignment can only compete fairly with the null hypothesis if it can take advantage of any compressibility of the sequences. Later sections show how to realise this.
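
Expressed as a calculation, the test is simply a comparison of message lengths. The following Python fragment is a sketch only, with names chosen here; the message lengths (-log2 probabilities, in bits) come from whatever sequence model and alignment method are in use:

# Sketch of the significance test: an alignment ('related' hypothesis) is only
# acceptable if its message length beats that of the null ('unrelated') hypothesis.
def log_odds(msglen_alignment, msglen_S1, msglen_S2):
    msglen_null = msglen_S1 + msglen_S2        # -log2(P(S1).P(S2)), in bits
    return msglen_null - msglen_alignment      # > 0 favours 'related'

def acceptable(msglen_alignment, msglen_S1, msglen_S2):
    return log_odds(msglen_alignment, msglen_S1, msglen_S2) > 0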

If sequences S1 and S2 are drawn from a (statistical) family of compressible sequences, e.g. they are AT-rich, then a simple algorithm may find a 'good' alignment and conclude that S1 and S2 are 'related' when the only sense in which this is true is that they share certain statistical biases. Algorithms described below use the compressibility of such sequences and give more reliable inferences of the true picture. The examples are of DNA sequences but the method and algorithms are obviously applicable to other alphabets and sequences.

Alignment of Random Sequences

This section recalls the information-theoretic alignment of random, incompressible sequences. The following section shows how the compressibility of non-random sequences can be treated.

Figure 1 shows a variation of the well-known dynamic programming algorithm (DPA) to calculate the probability of a most probable alignment of sequences S1 and S2. Illustrated is the algorithm for simple point-mutations but linear gap-costs and other gap-costs (Gotoh 1982) can be given a similar treatment. The alignment itself is recovered by a trace-back of the choices made by the 'max' function. In practice, algorithms work with the -log2s of probabilities, i.e. information content, and therefore '+' replaces '*' and 'min' replaces 'max'. This gives values in a better range for computers to handle.

M[0,0] = 1

M[i,0] = M[i-1,0]*P(<S1[i], ->) i = 1..|S1|

M[0,j] = M[0,j-1]*P(<-, S2[j]>) j = 1..|S2|

M[i,j] = max(M[i-1, j-1]*P(<S1[i], S2[j]>)
             M[i-1, j  ]*P(<S1[i], -    >)
             M[i,   j-1]*P(< -,    S2[j]>)) i=1..|S1|, j=1..|S2|

where P(<x,x>) = probability of match / copy
      P(<x,y>) = probability of mismatch / change
      P(<x,->) = probability of deletion
      P(<-,x>) = probability of insertion

Figure 1: Dynamic Programming Algorithm (DPA), most probable alignment.
A 'pair' from an alignment, e.g. <x,x>, is given a probability, say 0.8*1/4, reflecting the chance of a copy, 0.8 say, and the probability of the character 'x', 1/4 say. In previous work (Allison et al 1992) the explicit assumption was made that all DNA bases were equally likely, i.e. a character 'x' had probability 1/4, and that two differing characters had probability 1/12. The probabilities of copy, change, insertion and deletion were estimated by an expectation maximisation process (Baum and Eagon 1967, Baum et al 1970).
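
Putting figure 1 into -log2 form gives the following Python sketch for the uniform model of this section; the mutation probabilities used here are illustrative values only, not those fitted in the cited work:

from math import log2

# DPA of figure 1 on -log2 probabilities (bits): '+' replaces '*', 'min' replaces 'max'.
P_COPY, P_CHANGE, P_INSERT, P_DELETE = 0.8, 0.1, 0.05, 0.05   # illustrative values

def pair_bits(x, y):                    # cost of one aligned pair, in bits
    if x == '-':                        # insertion <-,y>: character y costs 2 bits (1/4)
        return -log2(P_INSERT) + 2.0
    if y == '-':                        # deletion <x,->
        return -log2(P_DELETE) + 2.0
    if x == y:                          # copy <x,x>
        return -log2(P_COPY) + 2.0
    return -log2(P_CHANGE) + log2(12.0) # change <x,y>: 1/12 per ordered differing pair

def align_bits(S1, S2):
    n, m = len(S1), len(S2)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i-1][0] + pair_bits(S1[i-1], '-')
    for j in range(1, m + 1):
        M[0][j] = M[0][j-1] + pair_bits('-', S2[j-1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = min(M[i-1][j-1] + pair_bits(S1[i-1], S2[j-1]),
                          M[i-1][j]   + pair_bits(S1[i-1], '-'),
                          M[i][j-1]   + pair_bits('-', S2[j-1]))
    return M[n][m]          # information content of an optimal alignment, in bits

print(align_bits("ACGTACGTAGT", "ACATGTACGT"))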

Alignment of Non-Random Sequences

A family of sequences is called 'non-random', or equivalently of 'low information content', if there is some statistical model and an associated compression algorithm that, on average, compresses members of the family more than does the uniform model. It is being realised that compression is important in sequence analysis (e.g. Grumbach and Tahi 1994), not so much to save on data communication costs or on disc space as to advance understanding of the nature of molecular sequences. The following terms are all aspects of the same notion: repetition, structure, pattern, signal, motif, non-randomness, compressibility, low information content. Informally, there are two kinds of 'boring' sequence: (i) random sequences with no structure, as 'typed by a monkey', which, counter to intuition, have high information content, and (ii) very regular sequences, such as AAAAAA....., which have almost zero information content. 'Interesting' sequences have some structure, i.e. have moderate information content, perhaps containing a mixture of low and high information content subsequences.

Assuming that runs of repeated characters are common, figure 2 shows local alignments of low (i) and high (ii) information content subsequences.

 ...AAAA...      ....ACGT....
    ||||             ||||
 ...AAAA...      ....ACGT....

    (i)              (ii)

Figure 2: Alignment of Low and High Information Content Regions.
Both partial alignments are good, but it is intuitively obvious that (ii) is better than (i) because ACGT is less likely to occur twice by chance than AAAA, under the model. If an alignment of long sequences containing the fragments could contain (i) or (ii) but not both, it would be better for it to contain (ii). However, if alphabetically ordered runs were common and repeated runs were not, then the roles would be reversed! In examples below, it is often assumed that runs of repeated characters are common. This model is used only because it is the simplest possible way to produce low information content sequences. The alignment method being described is not limited to this case, and the use of 'repeated-runs' is not meant to imply that it is a good model of DNA, say.

Alignments give an indication of how much information one sequence gives about another:

P(S1&S2 |related) = P(S1).P(S2 |S1 & related)
                  = P(S2).P(S1 |S2 & related)
If S1 and S2 are compressible, we could encode them both by first encoding S1, compressing it of course, and then encoding S2 given S1, e.g. by encoding the editing operations. However, there are now two sources of information about S2: first, the 'alignment' information from its presumed relative S1 and, second, information from the compressibility of S2 itself. It is not obvious how to combine these two sources. If we used only the alignment information on the second sequence, the resulting calculated values for P(S1).P(S2|S1&related) and P(S2).P(S1|S2&related) would be likely to differ, which is clearly unsatisfactory. We therefore seek a more symmetrical method.

Algorithmic Considerations

It is important that any new cost function on alignments should lead to an efficient search algorithm for optimal alignments. The line of attack is to average S1 and S2 in a certain way; an alignment can be thought of as a 'fuzzy' sequence representing all possible intermediate sequences 'between' S1 and S2. A good strategy is to avoid committing (strongly) to S1 or to S2 at a step (i,j) because this would introduce nuisance parameters. We also avoid introducing hidden variables because we should (ideally) integrate over all their possible values. Doing only a bounded amount of computation for each entry M[i,j] is desirable to maintain the O(n**2) time complexity of the DPA. These objectives are met if the calculation of character probabilities for S1[i] and S2[j] is not conditional on the choice of alignment from M[0,0] to M[i,j] but depends only on S1[1..i-1] and on S2[1..j-1].

Insertions and deletions are the easiest operations to deal with: At some point (i,j) in the DPA the probability of deleting S1[i] is evaluated. This is defined to be the product of P(delete) and the probability of the character S1[i] at position i in S1, i.e. P(S1[i]|S1[1..i-1]). What form the last term takes depends entirely on the compressibility of S1. Insertions are treated in a similar way:

deletion:  P(<S1[i],->) = P(delete).P(S1[i] |S1[1..i-1])
insertion: P(<-,S2[j]>) = P(insert).P(S2[j] |S2[1..j-1])

A consequence of these definitions is that insertions (and deletions) are no longer all equal. For example, in figure 3 and assuming that repeated runs are common, deletion (i) costs less than deletion (ii) because the base G is more surprising in its context. Hence, there would be a greater benefit if the G in S1 could be matched with a G in S2 than if the unmatched A at (i) could be matched.

S1: ....AAAAAAAAAGAAAA....
        |||| |||| ||||
S2: ....AAAA-AAAA-AAAA....
            ^    ^
           (i)  (ii)

Figure 3: Example Deletions.
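
In code, and continuing the -log2 (bits) form of the earlier sketch, the deletion and insertion costs might look as follows; pr1[i] and pr2[j] are assumed to be dictionaries holding P(S1[i]=x|S1[1..i-1]) and P(S2[j]=y|S2[1..j-1]) for each base (the names and 0-based indexing are illustrative, not from the paper):

from math import log2

P_INSERT, P_DELETE = 0.05, 0.05                 # illustrative mutation probabilities

def delete_bits(pr1, i, x):                     # pair <x,->, x = S1[i]
    return -log2(P_DELETE) - log2(pr1[i][x])    # P(delete).P(S1[i] | S1[1..i-1])

def insert_bits(pr2, j, y):                     # pair <-,y>, y = S2[j]
    return -log2(P_INSERT) - log2(pr2[j][y])    # P(insert).P(S2[j] | S2[1..j-1])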

Copies and changes are a little more complex to deal with than insertions and deletions because the former involve a character from each sequence, S1[i] and S2[j], and this gives three sources of information to reconcile: (i) the fact that the characters are the same, or not, (ii) the compressibility of S1 and (iii) the compressibility of S2. The approach taken with copies is to average the predictions from S1 and from S2 of the character involved, x=S1[i]=S2[j].

copy: P(<S1[i],S2[j]> & S1[i]=S2[j]=x)
      = P(copy)
        .(P(S1[i]=x|S1[1..i-1])+P(S2[j]=x|S2[1..j-1]))/2
For example in figure 4, the effect is to place more emphasis on the prediction from S2 for copy (i), and more emphasis on S1 for copy (ii), assuming that runs of repeats are common.
S1: ....AAAAGAAAA....AAAAAAAAA....
            |            |
S2: ....GGGGGGGGG....GGGGAGGGG....
            ^            ^
           (i)          (ii)

Figure 4: Example Copies/ Matches.
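
A sketch of the copy cost, in the same style and with the same assumed pr1/pr2 arrays as in the earlier indel sketch, simply averages the two predictions of the shared character x:

from math import log2

P_COPY = 0.8                                    # illustrative mutation probability

def copy_bits(pr1, i, pr2, j, x):               # pair <x,x>, x = S1[i] = S2[j]
    p_x = (pr1[i][x] + pr2[j][x]) / 2.0         # average of the two predictions of x
    return -log2(P_COPY) - log2(p_x)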

Changes are treated in a similar way to copies, complicated by the fact that S1[i] and S2[j] differ. Briefly, the prediction of x=S1[i] from S1 is multiplied by that of y=S2[j] from S2 after the latter has been renormalised, because y cannot equal x, and this is averaged with the mirror calculation that swaps S1 and S2:

change:
   P(<S1[i],S2[j]> & S1[i]=x & S2[j]=y)  where x~=y

   = P(change)
     .( P(S1[i]=x|S1[1..i-1]).P(S2[j]=y|S2[1..j-1])
        / (1-P(S2[j]=x|S2[1..j-1]))
      + P(S2[j]=y|S2[1..j-1]).P(S1[i]=x|S1[1..i-1])
        / (1-P(S1[i]=y|S1[1..i-1])) )/2

   = P(change).P(S1[i]=x|S1[1..i-1]).P(S2[j]=y|S2[1..j-1])
     .( 1/(1-P(S1[i]=y|S1[1..i-1]))
      + 1/(1-P(S2[j]=x|S2[1..j-1])) )/2
A result of this treatment is, for example, that the change in figure 5 is mainly, but not totally, treated as a case of S2 having an 'anomaly' (C) (as always, assuming that repeated runs are common).
S1: ....AAAAA...
        || ||
S2: ....AACAA...
          ^

Figure 5: Example Change/ Mismatch.
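
The corresponding sketch of the change cost implements the renormalised-and-averaged formula above, again using the assumed pr1/pr2 arrays:

from math import log2

P_CHANGE = 0.1                                  # illustrative mutation probability

def change_bits(pr1, i, pr2, j, x, y):          # pair <x,y>, x = S1[i] != y = S2[j]
    p = (P_CHANGE * pr1[i][x] * pr2[j][y]
         * (1.0 / (1.0 - pr1[i][y]) + 1.0 / (1.0 - pr2[j][x])) / 2.0)
    return -log2(p)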

There may be other reasonable ways of combining the predictions of character values from S1 and from S2 but the above method is symmetrical with respect to S1 and S2 and it meets the requirement of M[i,j] only depending on S1[1..i-1] and S2[1..j-1], not on a choice of alignment to M[i,j], and thus leads to an O(n**2) time DPA. The basic algorithm uses O(n**2) space for M[,] but Hirschberg's (1975) technique can be used to reduce this to O(n) space while maintaining the O(n**2) time complexity. Linear gap costs and other gap costs can clearly be given the same treatment as the point mutations above.

Flexibility

The modified DPA described above only requires a method of obtaining the probability of each character value occurring at each position of sequences S1 and S2:

P(S1[i]=x|S1[1..i-1]), i=1..|S1| and
P(S2[j]=y|S2[1..j-1]), j=1..|S2|
where x and y range over the alphabet.

Many statistical models of sequences and most data compression algorithms, for example order-k Markov Models and the well-known Lempel-Ziv (1976) model, can easily be adapted to deliver such probabilities. The probabilities can be obtained in a preliminary pass over S1 and one over S2 and stored for later use by the DPA. The passes also give the information content of S1&S2 under the null hypothesis. Since the DPA has O(n**2) time complexity, the passes can take up to a similar amount of time without worsening the overall complexity.
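
As one example of such a model, the following Python sketch makes a preliminary pass over a DNA sequence with an adaptive first-order Markov model using simple +1 (Laplace) counts; the smoothing and the treatment of the first character are choices made here for illustration, not prescriptions of the paper:

ALPHABET = "ACGT"

def predictive_probs(S):                        # S is assumed to be over ACGT
    # counts[a][b] ~ number of times b has followed a so far, with +1 smoothing
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    first = {b: 1 for b in ALPHABET}            # order-0 context for the first character
    probs = []
    prev = None
    for ch in S:
        row = first if prev is None else counts[prev]
        total = sum(row.values())
        probs.append({b: row[b] / total for b in ALPHABET})
        row[ch] += 1                            # adapt only after predicting
        prev = ch
    return probs                                # probs[i][x] ~ P(S[i]=x | S[1..i-1])

The information content of S under this model, as needed for the null hypothesis, is then the sum over i of -log2(probs[i][S[i]]).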

Discussion

Figure 6 shows an example first-order Markov model, MMg, that was used to generate AT-rich artificial DNA sequences; this is not meant to imply that it is a model of real DNA.

MMg:        A    C    G    T
          +------------------- P(S[i]|S[i-1])
        A |1/12 1/12 1/12 9/12
          |
        C |9/20 1/20 1/20 9/20
S[i-1]    |
        G |9/20 1/20 1/20 9/20
          |
        T |9/12 1/12 1/12 1/12

Figure 6: AT-rich Generating Model MMg.
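
For concreteness, sequences can be drawn from MMg as in the following sketch (the rows of figure 6 are used as conditional distributions); the uniform distribution for the first character is an assumption made here, as figure 6 gives only the conditional probabilities:

import random

MMg = {'A': dict(zip("ACGT", (1/12, 1/12, 1/12, 9/12))),
       'C': dict(zip("ACGT", (9/20, 1/20, 1/20, 9/20))),
       'G': dict(zip("ACGT", (9/20, 1/20, 1/20, 9/20))),
       'T': dict(zip("ACGT", (9/12, 1/12, 1/12, 1/12)))}

def generate(n, rng=random):
    seq = [rng.choice("ACGT")]                  # assumed uniform first base
    while len(seq) < n:
        row = MMg[seq[-1]]
        seq.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(seq)
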
100 pairs of sequences were generated from the model. Each sequence was of length 100 and was unrelated to, and independent of, the other 199 sequences, except in sharing their general statistical biases. Each pair of sequences was aligned under three different assumptions: that the sequences were drawn from (i) a uniform model, (ii) the known, fixed model MMg, and (iii) an adaptive order-1 Markov model whose parameter values were not known in advance. In each case, the information content of the null hypothesis and of an optimal alignment were calculated under the appropriate model. The difference of these quantities gives the -log-odds ratio of the hypotheses. These values, and the number of pairs that were inferred to be related '+' or unrelated '-', are shown in table 1.
                      -log-odds (bits)  inferred
method                null:alignment    related?
  (i) uniform/uniform  13.3 +/- 13.3,   15-, 85+
 (ii) MMg/MMg         -44.1 +/-  7.7,  100-,  0+
(iii) MM1/MM2         -30.4 +/-  7.3,  100-,  0+

Table 1: Unrelated Sequences, Alignment with 3 Models.

Under assumption (i), 85 out of 100 pairs are inferred to be related under an optimal alignment, because it is easy to find alignments of unrelated sequences having a high number of matches just because of the AT-richness. Alignment with knowledge of the true model (ii) reveals the true picture; the null hypothesis has low information content and alignments cannot better it. Assumption (iii) also implies that the pairs are unrelated, but is 14 bits less sure than (ii) on average because the model's parameter values must be inferred from the data.

Similar tests were done using other generating models and with related and unrelated pairs of sequences. The unsurprising conclusion is that alignment using the best available model of the sequences gives the most reliable results.

There is yet another approach to aligning low information content sequences: It is assumed that the given sequences are different noisy observations of one 'true' sequence which is treated as a hidden variable. This is close to the situation in the sequence assembly problem also known as the shortest common supersequence (SCSS) problem. An O(n**2) time alignment algorithm is possible for 'finite-state' models of sequences (Powell et al 1998), e.g. order-k Markov models, although k is limited to 0, 1 or 2 in practice.

There is a common technique that is used to correct partially the alignment costs (equivalently scores) of repetitive sequences in a non-information-theory setting: The sequences are aligned and get a certain cost (score). Each sequence is then permuted at random and the jumbled sequences are aligned and get a different cost (score). The original alignment is not considered to be significant unless its cost (score) is significantly better than that of the second. A jumbled sequence has the same zero-order Markov model statistics as the original but has different high-order statistics. It is hard to imagine how to jumble a sequence while preserving its statistics under an arbitrary statistical model. In contrast, the information theoretic alignment method described here compares the information content of the unaltered sequences under two hypotheses: null (unrelated) and alignment (related as shown). Absolutely any statistical model of sequences can be used with the new method, provided only that it can give the information content of a sequence character by character.

Finally, this paper raises the question of what is a good model of sequences to use in the new alignment algorithm. The only possible answer is that 'it just depends' on the nature of what is being aligned. Provided that there is enough data, i.e. pairs of related sequences from the same source, two or more models of sequences can be compared: alignment with the better model will give the greater compression on average. Statistical modelling of DNA sequences, in particular, is still a research area and a detailed discussion of it is beyond the scope of this paper. As noted, the alignment algorithm can be used with a wide range of sequence models. For example, an approximation to Wootton's compositional complexity is a model that bases the prediction of the next symbol on the frequencies of symbols in the immediately preceding 'window'; such models are common in data compression and are sometimes described as 'forgetful' or 'local'. Whatever model is used, the new DPA does not mask-out low information content subsequences but rather gives them their appropriate (low) weight. If something is known about the source of the data then this knowledge should be built into the models. If proteins are being aligned, the model should at least include the bias in the use of the alphabet. If little is known about the source of the data then a rather 'bland' model can be used. It can have parameters that are fitted to the data, but the number of parameters should be small compared to the lengths of the sequences.
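
Returning to the window-based ('forgetful') model mentioned above, a minimal sketch, with an arbitrary window length and +1 smoothing chosen here purely for illustration, is:

ALPHABET = "ACGT"

def window_probs(S, W=20):                      # W is an illustrative window length
    probs = []
    for i in range(len(S)):
        window = S[max(0, i - W):i]
        total = len(window) + len(ALPHABET)     # +1 smoothing over the alphabet
        probs.append({b: (window.count(b) + 1) / total for b in ALPHABET})
    return probs                                # probs[i][x] ~ P(S[i]=x | preceding window)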

References


[All93] L. Allison (1993). Normalization of affine gap costs used in optimal sequence alignment. Jrnl. Theor. Biol. 161 263-269.

[All92] L. Allison, C. S. Wallace and C. N. Yee (1992). Finite-state models in the alignment of macromolecules. Jrnl. Molec. Evol. 35 77-89.

[All98] L. Allison, T. Edgoose and T. I. Dix (1998). Compression of strings with approximate repeats. Proc. Intelligent Systems in Molecular Biology, ISMB98, AAAI Press, 8-16.

[Bai86] W. Bains (1986). The multiple origins of the human Alu sequences. J. Mol. Evol. 23 189-199.

[Bau67] L. E. Baum and J. E. Eagon (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model of ecology. Bulletin of AMS 73 360-363.

[Bau70] L. E. Baum, T. Petrie, G. Soules and N. Weiss (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals Math. Stat. 41 164-171.

[Bou69] D. M. Boulton and C. S. Wallace (1969). The information content of a multistate distribution. J. Theor. Biol. 23 269-278.

[Got82] O. Gotoh (1982). An improved algorithm for matching biological sequences. Jrnl. Molec. Biol. 162 705-708.

[Gru94] S. Grumbach and F. Tahi (1994). A new challenge for compression algorithms: genetic sequences. Inf. Proc. and Management 30(6) 875-886.

[Hir75] D. S. Hirschberg (1975). A linear space algorithm for computing maximal common subsequences. Comm. Assoc. Comp. Mach. 18(6) 341-343.

[Lem76] A. Lempel and J. Ziv (1976). On the complexity of finite sequences. IEEE Trans. Info. Theory IT-22 783-795.

[Lev65] V. I. Levenshtein (1965). Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4) 845-848 (trans. Soviet Physics Doklady 10(8) 707-710, 1966).

[Loe97] D. Loewenstern and P. N. Yianilos (1997). Significantly lower entropy estimates for natural DNA sequences. Data Compression Conference DCC '97, 151-160.

[Nee70] S. B. Needleman and C. D. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Jrnl. Mol. Biol. 48 443-453.

[Pow98] D. Powell, L. Allison, T. I. Dix and D. L. Dowe (1998). Alignment of low information sequences. Proc. 4th Australasian Theory Symposium, (CATS '98), Perth, 215-229, Springer Verlag.

[Sel74] P. H. Sellers (1974). An algorithm for the distance between two finite sequences. Jrnl. Combinatorial Theory 16 253-258.

[Woo97] J. C. Wootton (1997). Simple sequences of protein and DNA. In DNA and protein Sequences Analysis, M. J. Bishop and C. J. Rawlings (eds), IRL Press, 169-183.

