## An Information Measure for the String to String Correction Problem with Applications.

#### isbn:0-473-02313-X, ACSC-17, pp.659-668, 19-21 Jan. 1994.

#### L. Allison & C. S. Wallace,

Department of Computer Science,
Monash University,
Victoria 3168,
Australia.

July 1993; revised Nov 1993; published Jan 1994

NB.
A related paper appeared as,
L. Allison & C. S. Wallace,
The Posterior Probability Distribution of Alignments and
Its Application to Parameter Estimation of Evolutionary Trees and
to Optimization of Multiple Alignments,
J. Mol. Evol. 39 pp.418-430, 1994
[HTML].

**Abstract**: Algorithms for
approximate string matching are widely used to infer a
relation between two strings in situations
where strings mutate, evolve or contain noise,
e.g. DNA or protein strings.
Such a relation is (just) a hypothesis.
It has been shown how to calculate a probability for such hypotheses.
Unfortunately the matching algorithms do not directly
extend to more than two strings
because of rapidly increasing complexity.
However, a stochastic method is one way around this difficulty.
The probabilities of competing hypotheses are used
to generate *random* alignments of two or more strings.
The alignments are sampled from the correct probability distribution.
Averaging over many such alignments gives good estimates
of how closely the strings are related and in what way.
(Mediocrity is useful.)
In addition, sampling in an increasingly selective way gives
a simulated annealing search for an optimal multiple-alignment.

This work was partly supported by Australian Research Council grant A49030439.

**Keywords and Phrases:**
approximate string matching,
edit distance,
evolutionary tree,
hidden Markov model,
inductive inference,
Gibbs sampling,
Monte Carlo method,
multiple alignment,
simulated annealing,
stochastic,
string to string correction problem.

### Introduction

Algorithms on strings are much studied in computer science.
Exact matching algorithms are ruled out in many real-world problems
by the properties of the data:
DNA strings evolve and mutate,
speech is uttered at different rates,
computer images often contain noise,
there may be experimental error
and so on.
Approximate matching algorithms based on
Levenshtein's (1966) metric,
Sellers' (1974) edit-distance,
Wagner and Fischer's (1974) string to string correction problem,
the longest common subsequence
and other related notions
can cope with this reality (Sankoff & Kruskal 1983).
The idea is that characters can be changed, inserted or deleted
and that a "good" relation between two or more strings will involve
a "small" number of such character *mutations*.
Note that other models of string mutation are possible,
for example Tichy's (1984) block-moves model,
and may be more appropriate in certain circumstances.

```
string A: ACGTAATACGT
string B: ACGTATAACCT

Some possible alignments:

(i)  ACGTAATA-CGT   (ii) ACGTAATACGT   (iii) ACGTAAT-ACGT
     ||||| || | |        |||||  || |         |||| || || |
     ACGTA-TAACCT        ACGTATAACCT         ACGT-ATAACCT

Generation instruction sequence equivalent to alignment (i):
  match(A); match(C); match(G); match(T); match(A); ins_A(A);
  match(T); match(A); ins_B(A); match(C); mismatch(G,C); match(T)

Mutation instruction sequence equivalent to alignment (i):
  copy; copy; copy; copy; copy; del; copy; copy; ins(A); copy;
  change(C); copy
```

Figure 1: Alignment Examples.

Given K>=2 strings, we are interested in the following kinds of question:
are the strings related,
and if so are they related by a *given* evolutionary tree,
and if so which parts of the strings are related to each other?
Many programs exist for comparing multiple strings
but most use more or less *ad hoc* weights or penalties
to score relations,
and employ heuristics to search for one good relation.
The aim here is to treat the problems as inductive inference problems:
an answer is just a hypothesis and
an estimate of a parameter of a model is just that,
an estimate.
Weights or parameters should be inferred from the data.
Probabilities, error bounds or other figures of merit
should be given for hypotheses.
There are efficient inference algorithms for K=2 strings.
Bishop *et al* (1987) first cast the two string problem as
a maximum likelihood problem.
Thorne *et al* (1991, 1992) extended it to models that were
more realistic for DNA sequences.
Allison and Yee (1990) used a minimum message length (MML) approach
and Allison *et al* (1992a) examined finite-state models
or hidden Markov models of string evolution.
Wallace and Freeman (1987) gave the statistical foundations
of the MML approach to inductive inference.

The next section summarises the inductive inference approach
to two strings as a basis for the remainder of the paper.
The naive extensions of algorithms for the K=2 case
to several strings are infeasible.
Instead, they are used as the basis of a stochastic method
for comparing K>=2 strings.
Note that if the "correct" evolutionary tree to use is not known,
various methods, such as Hein's (1989) or Thorne and Kishino's (1992),
could be used to suggest one or more candidate trees to be evaluated in turn.
The paper is framed in terms of DNA strings over the alphabet
of four *bases*, {A, C, G, T}
but the general stochastic technique is applicable to other strings
and alphabets.

### Two Strings

An *alignment* is a way of displaying a relation
between the parts of two or more strings.
Some or all of the strings are padded out
with a special *null* character `-'
so that all strings have the same length.
The resulting strings are written out one above another.
The columns of the alignment are *tuples* of proper characters and `-',
one component coming from each padded string.
The only restriction is that the all-null tuple is not allowed.
Some examples of alignments are given in figure 1.
The number of possible alignments grows exponentially with string length
and one often wants a "good" alignment.
Each tuple can be given a cost (or a score) in various ways and
the cost (score) of an alignment is
the sum of the costs (scores) of its tuples.
Tuples with matching characters generally have a low cost (high score).
Attention is restricted to K=2 in this section.

If we are given two strings, A and B,
it is sometimes convenient to think of A as the ancestor and B as
the descendant.
In that case we relate A and B by a *mutation machine*
which reads A
and executes a sequence of *mutation instructions* to output B:

```
copy:      copy a character from A to B
change(y): read a character x but output y
del:       read a character from A but output nothing
ins(x):    output x (reads no character of A)
```

We can just as well consider B to be the ancestor and A the descendant by
interchanging insert and delete and reversing changes.
Insertions and deletions are called *indels*.

It is sometimes convenient to think of A and B as having equal standing,
particularly if they are both descended from an unknown ancestor.
They can then be related by a *generation machine*
which outputs A and B by executing a sequence of *generation instructions*:

```
match(x):      output x both in A and in B
mismatch(x,y): output x in A and y in B
ins_A(x):      output x in A only
ins_B(y):      output y in B only
```
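The generation machine can be sketched as a small interpreter; this is an illustrative Python version (not the authors' code), with instructions encoded as tuples:

```python
# Sketch of the generation machine: executing a sequence of generation
# instructions yields the two strings of an alignment.
def generate(instructions):
    """Instructions: ('match', x), ('mismatch', x, y),
    ('insA', x) or ('insB', y)."""
    A, B = [], []
    for ins in instructions:
        op = ins[0]
        if op == 'match':
            A.append(ins[1]); B.append(ins[1])
        elif op == 'mismatch':
            A.append(ins[1]); B.append(ins[2])
        elif op == 'insA':                   # ins_A: character in A only
            A.append(ins[1])
        elif op == 'insB':                   # ins_B: character in B only
            B.append(ins[1])
        else:
            raise ValueError('unknown instruction: %r' % (op,))
    return ''.join(A), ''.join(B)
```

Running the instruction sequence equivalent to alignment (i) of figure 1 reproduces strings A and B of that figure.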

It is clear that there is a correspondence between an alignment, a mutation sequence and a generation sequence, and they are used interchangeably as convenient. For example, the pair <x,y> corresponds to the character plus the mutation instruction {x, change(y)} and to the generation instruction mismatch(x,y); see also figure 1.

The longest common subsequence (LCS) problem is to find a sequence of generation instructions (or an alignment etc.) which maximises the number of matches. That number can be found by the dynamic programming algorithm (DPA), figure 2(i), in O(|A|*|B|) time and in O(|B|) space if only two rows of the matrix D[,] are kept. An alignment can be recovered in O(|A|*|B|) space by a trace-back through D[,] if all of D[,] is kept, or in just O(|B|) space if Hirschberg's (1975) divide and conquer technique is used.

The edit-distance problem is to minimise the number of mutations and can be solved by a second version of the DPA, figure 2(ii). There are faster algorithms for the LCS and edit-distance problems in the important special case of similar strings (e.g. Ukkonen 1983) but they are not suitable for what follows.

```
Boundary Conditions:
  D[0, 0] = v
  D[i, 0] = f( D[i-1, 0], g(A[i], `-') ),  i = 1..|A|
  D[0, j] = f( D[0, j-1], g(`-', B[j]) ),  j = 1..|B|

General Step, i = 1..|A|, j = 1..|B|:
  D[i, j] = h( f(D[i-1, j-1], g(A[i], B[j])),   -- match/mismatch
               f(D[i-1, j  ], g(A[i], `-' )),   -- delete A[i]
               f(D[i,   j-1], g(`-' , B[j])) )  -- insert B[j]

Return: D[|A|, |B|]

(i)   longest common subsequence (LCS):
        v = 0,  f = +,  h = max
        g(x,x)=1,  g(x,y)=g(x,`-')=g(`-',y)=0
(ii)  edit-distance:
        v = 0,  f = +,  h = min
        g(x,x)=0,  g(x,y)=g(x,`-')=g(`-',y)=1
(iii) most probable alignment:
        v = 1,  f = *,  h = max
        g(x,x)=P(<x,x>),  g(x,y)=P(<x,y>),
        g(x,`-')=P(<x,`-'>),  g(`-',y)=P(<`-',y>)
(iv)  r-theory, P(A&B):
        v, f and g as for (iii),  h = +
```

Figure 2: Versions of the Dynamic Programming Algorithm.
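The generic DPA of figure 2 can be sketched with pluggable v, f, g and h; this Python version (0-indexed, not the authors' Pascal) instantiates versions (i) and (ii):

```python
import operator

# Generic DPA of figure 2: v (seed value), f (combine cell with tuple
# cost), g (tuple cost), h (choose among the three predecessors).
def dpa(A, B, v, f, g, h):
    n, m = len(A), len(B)
    D = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = v                                   # boundary conditions
    for i in range(1, n + 1):
        D[i][0] = f(D[i-1][0], g(A[i-1], '-'))
    for j in range(1, m + 1):
        D[0][j] = f(D[0][j-1], g('-', B[j-1]))
    for i in range(1, n + 1):                     # general step
        for j in range(1, m + 1):
            D[i][j] = h(f(D[i-1][j-1], g(A[i-1], B[j-1])),  # match/mismatch
                        f(D[i-1][j],   g(A[i-1], '-')),     # delete A[i]
                        f(D[i][j-1],   g('-',   B[j-1])))   # insert B[j]
    return D[n][m]

# (i) longest common subsequence: v=0, f=+, h=max, g counts matches
def lcs(A, B):
    return dpa(A, B, 0, operator.add,
               lambda x, y: 1 if x == y != '-' else 0, max)

# (ii) edit-distance: v=0, f=+, h=min, g counts mutations
def edit(A, B):
    return dpa(A, B, 0, operator.add,
               lambda x, y: 0 if x == y else 1, min)
```

Versions (iii) and (iv) follow by passing v=1, f=*, probabilities for g, and max or + for h.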

It turns out that we cannot easily answer the question,
are A and B related?
We can however answer the nearly equivalent question,
are A and B related by a generation machine M?
The machine is defined by the probabilities of its instructions.
Given M, the probability of A and B being generated
by the sequence of instructions
S_{1}; S_{2}; ...; S_{L} is

P(L)*P(S_{1})*P(S_{2})*...*P(S_{L})

The probability of the length, P(L), presents a nasty technical problem.
If any good prior for lengths is known then it should certainly be used.
Failing that, Rissanen's (1983) universal prior for integers
or similar can be used.
In any case, most reasonable priors on length behave
as ~1/L^{1+delta} (delta>=0)
and, as the lengths of all *plausible* alignments are similar,
the probabilities of those lengths are very similar.
Now, one instruction sequence represents
one hypothetical relation between A and B.
If the question of its length is ignored,
a most-probable alignment, conditional on M, can
be found by a third version of the
dynamic programming algorithm, figure 2(iii),
which deals in probabilities.

M's instruction probabilities are called its *parameters*.
They can be estimated from an alignment
if they are not known *a priori*.
This leads to an iterative method of estimating the parameters:
first, initial estimates are guessed and an initial alignment is found,
then the parameters are reestimated from the alignment,
a new alignment is found and the process repeated
until convergence.
Convergence is guaranteed, and four to eight iterations are usually
sufficient, but
it is possible to get stuck in local optima for distantly related strings.
Machine instructions are drawn from a multi-state distribution
and Wallace and Boulton (1968) show with what accuracy
such probabilities can be legitimately inferred.
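One re-estimation step of this iteration can be sketched in Python; the grouping of instructions into just three classes, and the +1/2 adjustment per class (in the spirit of the multi-state estimates just cited), are simplifications for the sketch:

```python
# Re-estimate machine-instruction probabilities from the observed
# frequencies in the current alignment, given as a list of <x,y>
# columns.  The +0.5 per class keeps every estimate strictly positive.
def reestimate(columns):
    counts = {'match': 0, 'mismatch': 0, 'indel': 0}
    for x, y in columns:
        if x == '-' or y == '-':
            counts['indel'] += 1
        elif x == y:
            counts['match'] += 1
        else:
            counts['mismatch'] += 1
    total = sum(counts.values()) + 1.5        # +0.5 for each of 3 classes
    return {k: (v + 0.5) / total for k, v in counts.items()}
```

The new probabilities drive the next alignment, and the loop repeats until the estimates stop changing.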

We do sometimes want a most-probable alignment,
but it is too detailed an answer for the question,
are A and B related by M, i.e. in some *unspecified* way?
This is called the *r-theory*.
To calculate P(A&B|M) we need to sum the probability over *all*
of M's instruction sequences that give A and B.
Fortunately this can be done by yet another version of the DPA, figure 2(iv).
If the parameters are to be estimated from the strings,
the previous iterative method can be used and
a weighted-average of the observed frequencies of machine-instructions
is calculated as the DPA proceeds.
Simulation results from Yee and Allison (1993) confirm
that a single optimal alignment gives *biased* estimates
of machine parameters
and that the r-theory gives unbiased estimates.
The reason is clear: mutation is random and does not follow an optimal path.
Mutations can compensate for each other
so estimates based on a single optimal alignment
are literally over-optimistic and tend to underestimate
*systematically* the amount of change that occurred.
(We do not yet know how to model any pressure of selection.)

It is convenient for the last two DPAs to deal
with -log_{2} of probabilities, also known as message lengths,
which are measured in *bits*,
rather than with probabilities directly,
as the final values of the latter can be very small.
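For instance, the h = + of figure 2(iv), applied to message lengths rather than raw probabilities, becomes a "logplus" that never forms the tiny probabilities explicitly (a sketch, not the authors' Pascal):

```python
import math

# Given ml_i = -log2(p_i), return -log2(p1 + p2) without computing
# p1 or p2 directly, so very small probabilities do not underflow.
def logplus(ml1, ml2):
    if ml1 > ml2:
        ml1, ml2 = ml2, ml1           # ensure ml1 is the shorter length
    return ml1 - math.log2(1.0 + 2.0 ** (ml1 - ml2))
```

For example, two alignments of one bit each (probability 1/2 each) sum to probability one, i.e. a message length of zero bits.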

Lastly, there is a natural *null-theory*, that the strings
are not related.
When the -log_{2} of its probability is
taken, this amounts to approximately two bits
per character for the case of DNA.
Comparison with the null-theory gives a hypothesis test.
Note it can easily be the case that no single alignment is
an acceptable hypothesis yet the r-theory,
that the strings are related by M, *is* acceptable.

### Several Strings

An alignment of K>=2 strings is formed as above by padding out the strings with `-' so that they have the same length and by writing the resulting strings one above another. This gives a formal language of multiple-alignments but its semantics must also be specified. A very natural interpretation of a multiple-alignment is conditional upon an evolutionary tree over the strings: each tuple of the alignment represents the descendants, over all possible histories consistent with the tree, of an ancestral character or an inserted character.

We consider unrooted binary trees here, the position of the ancestral string being arbitrary. Each edge of the tree is modelled by a mutation-machine. Unrooted trees are natural unless a nearly constant rate of evolution can be assumed, and binary trees are sufficient for most purposes, although the general methods could easily be applied to rooted trees and to n-ary trees. The main problems considered here are to find an optimal multiple-alignment given a tree and to estimate the edges (machines) of the tree.

An unrooted tree over K strings has 2K-3 edges or machines.
If all their parameters are known,
the machines can be combined into an equivalent generation machine for
the K strings.
The probability of a K-tuple of characters can be calculated during an
O(K)-time traversal of the tree; the details are given by
Felsenstein (1981, 1983)
who considered inferring an evolutionary tree from a fixed alignment
which is the complementary problem to the multiple-alignment problem
considered here.
For example, given the tree ((1 2) (3 4)) and assuming that all edges
are similar and that P(change) is small,
the 4-tuple <ACAC> is less probable than <AACC>.
Conversely, if <ACAC> is more common than <AACC> and <ACCA>
in alignments,
this provides evidence to support the tree ((1 3) (2 4)).
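This tuple-probability calculation, and the comparison just made, can be sketched with Felsenstein's pruning recursion. The Python below is illustrative only: it roots the tree ((1 2)(3 4)) arbitrarily, uses the same toy change model on every edge (a shared P(copy), the remainder split equally over the other three bases), and ignores indels.

```python
BASES = 'ACGT'

def edge_prob(x, y, p_copy):
    """P(character y at the far end of an edge | x at the near end)."""
    return p_copy if x == y else (1.0 - p_copy) / 3.0

def prune(tree, p_copy):
    """tree is a leaf character or a (left, right) pair of subtrees.
    Returns {b: P(observed leaves below | character b at this node)}."""
    if isinstance(tree, str):                          # a leaf
        return {b: (1.0 if b == tree else 0.0) for b in BASES}
    left, right = (prune(t, p_copy) for t in tree)
    like = {}
    for b in BASES:                                    # O(K) per edge
        l = sum(edge_prob(b, c, p_copy) * left[c] for c in BASES)
        r = sum(edge_prob(b, c, p_copy) * right[c] for c in BASES)
        like[b] = l * r
    return like

def tuple_prob(tree, p_copy):
    # uniform prior on the root character
    return sum(prune(tree, p_copy).values()) / len(BASES)
```

With this model, `tuple_prob((('A','C'),('A','C')), 0.9)` is indeed smaller than `tuple_prob((('A','A'),('C','C')), 0.9)`, matching the point made above for the tree ((1 2)(3 4)).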
Unfortunately, if the strings' lengths are about `n',
the naive extensions of the DPA of the previous section
would take O(K*n^{K}) time which is infeasible
for typical values of K and n.
(The practical limit for the simple edit-distance and using various
additional short-cuts seems to be about six *similar*, short sequences -
see Carrillo and Lipman (1988), Altschul and Lipman (1989), and Allison (1993).)
However, just as two strings can be aligned,
so two alignments of disjoint *sets* of strings
can be aligned in O(K*n^{2}) time
to give an alignment of all of the strings.
This has been used as a *deterministic heuristic*
to improve a K-way alignment:
projecting the K-way alignment onto two subsets of strings
formed by breaking the tree at an edge
and then realigning the subalignments.
The process is iterated until it converges (Allison *et al* 1992b).
The results may not be optimal, particularly if the strings
are distantly related.
In what follows, a simulated annealing search is described for
an optimal multiple-alignment which can jump out of troublesome local optima.
It may also avoid the problem identified by Lake (1991) where
combining sequences deterministically can bias the search for an (unknown)
tree topology.

An optimal multiple-alignment can be useful in its own right
but it is likely to give biased estimates of the tree edges
as does optimal pairwise-alignment.
The solution used for K=2 of averaging over *all* alignments
is infeasible for larger K,
but a Gibbs sampling or Monte Carlo method (Hastings 1970)
is described which averages over many, if not all, multiple-alignments
to give a good approximation.

The key to the stochastic methods to be described is the generation of random alignments sampled from the correct probability distribution. The general step of the DPA (figure 2(iv)) calculates P(A[1..i]&B[1..j]|M) by adding the probabilities of its three immediate predecessors:

```
diagonal:   P(A[1..i-1] & B[1..j-1] ; <A[i], B[j]> | M)
vertical:   P(A[1..i-1] & B[1..j]   ; <A[i], -   > | M)
horizontal: P(A[1..i]   & B[1..j-1] ; <-   , B[j]> | M)
```

although A and B now represent alignments of sets of strings.
If at the same time a *random* choice is made of one of these directions,
in proportion to their probabilities,
then a traceback of these choices will give a random alignment sampled
from the probability distribution of alignments.
This requires O(|A|*|B|) space
which can be reduced to O(|B|) by using Hirschberg's (1975) technique.
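This sampling step can be sketched for two plain strings (the paper applies it to alignments of sets of strings): run the probability DPA of figure 2(iv) forward, then trace back, choosing each direction at random in proportion to its contribution to the current cell. The machine probabilities below are invented for the illustration.

```python
import random

# Illustrative generation-machine parameters: P(match)=0.9 spread over
# the 4 tuples <x,x>, P(mismatch)=0.05 over the 12 tuples <x,y> with
# x /= y, and P(ins_A)=P(ins_B)=0.025 over 4 tuples each.
P_MATCH, P_MISMATCH, P_INDEL = 0.9 / 4, 0.05 / 12, 0.025 / 4

def g(x, y):
    if x == '-' or y == '-':
        return P_INDEL
    return P_MATCH if x == y else P_MISMATCH

def sample_alignment(A, B, rng=random):
    """Sample one alignment of A and B in proportion to its probability."""
    n, m = len(A), len(B)
    # forward pass: D[i][j] = P(A[1..i] & B[1..j]), summed over alignments
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j: D[i][j] += D[i-1][j-1] * g(A[i-1], B[j-1])
            if i:       D[i][j] += D[i-1][j]   * g(A[i-1], '-')
            if j:       D[i][j] += D[i][j-1]   * g('-', B[j-1])
    # stochastic traceback: pick each predecessor with probability
    # proportional to its contribution to the current cell
    i, j, cols = n, m, []
    while i or j:
        opts = []
        if i and j: opts.append((D[i-1][j-1] * g(A[i-1], B[j-1]), (A[i-1], B[j-1]), 1, 1))
        if i:       opts.append((D[i-1][j]   * g(A[i-1], '-'),    (A[i-1], '-'),    1, 0))
        if j:       opts.append((D[i][j-1]   * g('-', B[j-1]),    ('-', B[j-1]),    0, 1))
        r = rng.random() * sum(w for w, _, _, _ in opts)
        for w, col, di, dj in opts:
            if r < w:
                break
            r -= w
        cols.append(col)
        i -= di
        j -= dj
    return list(reversed(cols))
```

Every traceback yields a legal alignment, so stripping the nulls from its columns always recovers A and B exactly.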

The process to sample random multiple-alignments
repeatedly chooses a random edge to break the tree on.
The current multiple-alignment is projected onto
two subalignments for the resulting partition of the strings.
A new random K-way realignment is then generated as above.
The initial multiple-alignment is formed by a suboptimal process.
The search space has many degrees of freedom.
The use of subalignments amounts to keeping most of them fixed at each step,
but the realignment is chosen from the conditional
posterior probability distribution
which is sufficient for a *Gibbs sampling* or
Monte Carlo method.
Parameters are estimated from each multiple-alignment
and averaged to give final estimates.
Standard deviations are calculated and give an indication of accuracy.
A working estimate from "recent" alignments
is used during the sampling process.

```
  s8    s9                s14    s15
    \   /                   \   /
     s4                      s7
       \                    /
        s2----------s1=s3
       /                    \
     s5                      s6
    /   \                   /   \
 s10    s11              s12    s13

During evolution: ancestor s1=s3, length(s1)=500, and for each edge
P(copy)=0.8; P(change)=0.1; P(insert)=P(delete)=0.05
```

Figure 3: Full Tree.

The Gibbs sampling process generates random alignments
by the repeated choice of directions in proportion to their probabilities.
The choice can be influenced in a useful way
if the probabilities are raised to a power, q>=0.
When q=0 the choice is uniform.
When q=1 the choice is in proportion to probability.
When q>1 the choice is biased towards more probable alignments
and when q is large the bias is to optimal alignments.
Increasing q can therefore be used to implement the cooling
of a *simulated annealing* search for an optimal multiple-alignment.
Recall that the programs actually deal with -log_{2} probabilities,
so multiplying these quantities by q is equivalent to
raising probabilities to the power q.
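A sketch of such a biased choice: `choose` below is a hypothetical helper, not from the paper, which takes the options' message lengths, scales them by q, and samples an index. q=0 gives a uniform choice and a large q effectively picks a most probable option.

```python
import random

# Choose index i with probability proportional to p_i^q, where the
# options are given as message lengths ml_i = -log2(p_i); multiplying
# the message lengths by q is the same as raising the p_i to power q.
def choose(msg_lens, q, rng=random):
    scaled = [q * ml for ml in msg_lens]
    base = min(scaled)                         # guard against underflow
    weights = [2.0 ** (base - s) for s in scaled]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights):
        if r < w:
            return i
        r -= w
    return len(weights) - 1                    # floating-point fallback
```

Annealing then amounts to running the Gibbs sampler while slowly increasing q.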
This strategy is very different from that of Ishikawa *et al* (1992)
which repeatedly makes a small perturbation to the multiple-alignment
and either accepts the result or rejects it.

### Implementation

A Pascal program has been written to analyse DNA strings using the methods described above. A hash-table is used to store the probabilities of tuples when they are met for the first time in the DPA so that later meetings do not need recalculation. The program also uses Hirschberg's technique. Two alignments of four strings of length 500 characters can be aligned to give an alignment of eight strings in 15 to 30 seconds on a Sparc-station. 1000 random alignments of eight strings can be sampled in a few hours. The running time is affected by the similarity of the strings which influences both the lengths of alignments and how many different kinds of tuple are evaluated. The present implementation embodies the simplest one-state mutation-machines and assumes that all characters are equally likely, and that for each edge all insertions are equally likely, all changes are equally likely and P(ins) = P(del). These assumptions are not unreasonable for DNA and they could be relaxed. Each edge of the tree has a different mutation machine.

```
edge     actual mutations                      Simulated Ann'       Gibbs Sampling 1000x
                                               P(cp)  P(ch)  P(in)  P(cp)(SD)  P(ch)(SD)  P(in)(SD)
------   ------------------------------------  -------------------  --------------------------------
s2-s3 :  40chng 28ins 27del (55indel) ED= 85   .816   .169   .015   .799(.023) .107(.020) .094(.024)
s2-s4 :  51chng 29ins 25del (54indel) ED= 96   .825   .119   .056   .748(.018) .092(.015) .160(.026)
s2-s5 :  50chng 22ins 18del (40indel) ED= 85   .845   .119   .036   .817(.019) .083(.017) .100(.026)
s3-s6 :  60chng 29ins 23del (52indel) ED=103   .828   .164   .008   .816(.014) .141(.015) .043(.013)
s3-s7 :  50chng 26ins 31del (57indel) ED= 97   .860   .080   .060   .759(.021) .109(.018) .132(.032)
s4-s8 :  48chng 25ins 24del (49indel) ED= 90   .842   .116   .042   .810(.018) .099(.009) .092(.019)
s4-s9 :  37chng 22ins 26del (48indel) ED= 80   .852   .084   .064   .837(.020) .079(.010) .085(.020)
s5-s10:  51chng 29ins 35del (64indel) ED=108   .827   .062   .111   .763(.023) .068(.013) .169(.032)
s5-s11:  53chng 23ins 17del (40indel) ED= 92   .830   .139   .031   .817(.015) .111(.013) .071(.012)
s6-s12:  60chng 23ins 29del (52indel) ED= 99   .795   .145   .060   .773(.014) .138(.012) .089(.019)
s6-s13:  53chng 21ins 20del (41indel) ED= 92   .861   .078   .061   .835(.011) .096(.014) .069(.017)
s7-s14:  54chng 18ins 23del (41indel) ED= 89   .825   .122   .053   .798(.012) .109(.009) .093(.013)
s7-s15:  58chng 30ins 28del (58indel) ED=108   .826   .097   .077   .785(.013) .100(.010) .115(.015)
------                                         -------------------  --------------------------------
Mean:                                          .833   .115   .052   .797       .102       .101

-log2 probabilities or message lengths (bits): 7705 (optimal); 9336(+-274) mean(SD); Null=8088

Key:  edge: actual #changes #inserts #deletes (#indels), ED = edit-distance
      P(cp)=P(copy), P(ch)=P(change), P(in)=P(indel), (SD)=standard deviation
Evolution, each edge: P(copy)=0.8; P(chng)=0.1; P(ins)=P(del)=0.05; length(s1=s3)=500
```

Table 1: Inferred Parameters.

### Results

The current program has been used on both artificial and real DNA data, the evolutionary history of artificial strings being known.

Figure 3 shows a tree used in one test.
The ancestral string is s1,
the first generation descendants s2 and s3, and so on.
Strings s1 and s3 are identical because we are dealing with unrooted trees.
The machines on all edges were similar during evolution,
causing 20% mutation on average:
P(copy) = 0.8, P(change) = 0.1 and P(ins) = P(del) = 0.05.
The process was random so there was variation around these figures.
These parameter values were not known to the analysis program
which was given only the tree topology and
the third generation descendants s8 to s15.
20% mutation per edge may seem a small amount
but brothers, such as s8 and s9 say,
are separated by *two* edges,
cousins such as s8 and s10 by four edges, and
s8 and s12 by five edges.
This data is at quite a high level of mutation for DNA
where the small character set causes many spurious matches.

Table 1 summarises the results of the test. It shows the edge parameters estimated from the putative optimal alignment found by simulated annealing and those estimated by averaging over 1000 alignments from Gibbs sampling. The following values from the evolution of the strings are also given for each edge: the actual numbers of changes, inserts and deletes, and the simple edit-distance of the two strings at the ends of the edge. Note that every one of these edit-distances is less than the sum of the corresponding mutations, because a "better" explanation than the true events can be found. The optimal(?) multiple-alignment underestimates P(indel) in particular; this is a general tendency which is greatest for internal edges of the tree and increases with the mutation level.

```
  CATQ          CATB
      .        .
       .      .        (.81,.19,.00)   <--- (P(copy),P(change),P(indel))
        .    .
 CCOLI--------------
        +    .
       +      .        (.67,.32,.00)
      +        .
  CATP          CATD

-log2 probabilities or message lengths (bits):

  Tree                                     optimal   Gibbs mean(SD)
  ----                                     -------   --------------
  (((CATB CATQ)(CATD CATP))CCOLI)  --- :     4369      4555(+-40)
  ((CATB CATQ)(CATD(CATP CCOLI)))  +++ :     4376      4502(+-26)
  Null:                                      6449

Data: Genbank-id  organism                 designation
      m55620      Clostridium perfringens  CATQ   chloramphenicol acetyltransf...
      m28717      Clostridium perfringens  CATP   chloramphenicol-resistance p...
      x15100      Clostridium difficile    CATD   gene for chloramphenicol ace...
      m93113      Clostridium butyricum    CATB   chloramphenicol acetyltransf...
      m35190      Campylobacter coli       CCOLI  chloramphenicol acetyltransf...
```

Figure 4: Two Alternative Trees for Five Bacterial Sequences.

The mean machine from simulated annealing further illustrates this bias. The estimates from Gibbs sampling do not show bias, and the discrepancy between the "true" and estimated values of P(indel) is less than two standard deviations, except for the edge s3-s6, and is often much less. The standard deviations of the estimates are large, as is to be expected at this level of mutation, but a single (optimal) alignment gives no indication of this inherent uncertainty. The optimal alignment beats the null-theory on message length and so is an acceptable hypothesis, whereas an "average" alignment fails to be acceptable by a wide margin. However, such average alignments collectively dominate the search space and the total probability, which gives an indication of how many of them there are; only a tiny fraction is sampled by Gibbs sampling.

Many similar tests were done, with balanced and unbalanced trees and with different levels of mutation, some being recorded by Allison and Wallace (1993). At low levels of mutation, <10% per edge, the deterministic heuristic is rarely beaten by simulated annealing in the search for an optimal alignment and such an alignment can give reasonable edge estimates. At greater levels of mutation, >=15% per edge, simulated annealing invariably beats the deterministic heuristic and optimal alignments give increasingly biased edge estimates. These effects would be modified for a different alphabet. Much experimentation remains to be done; in particular the tuning of simulated annealing algorithms is something of a black art.

Five bacterial DNA sequences for
chloramphenicol acetyltransferase (CAT)-encoding
resistance determinant (Huggins *et al* 1992) provide an interesting case
summarised in figure 4.
The coding regions or open reading frames were compared;
these are about 640 characters long.
It is clear from comparisons of four sequences
that the (unrooted) tree topology ((CATB CATQ) (CATD CATP))
is strongly preferred over the two other possible topologies.
Where CCOLI fits in is less clear.
The tree topology (((CATB CATQ) (CATD CATP)) CCOLI) is slightly preferred
on the basis of an optimal alignment,
but the tree topology ((CATB CATQ) (CATD (CATP CCOLI))) is preferred
on the basis of average alignment probabilities.
This suggests some genuine uncertainty.
It is resolved with more information:
Huggins *et al* performed an analysis using protein sequences,
including sequences from more organisms,
and bringing experts' knowledge to the problem,
and (((CATB CATQ) (CATD CATP)) CCOLI)
is a subtree of their preferred tree.

### Conclusions

Deterministic algorithms to estimate the edges of an evolutionary tree and to find an optimal alignment of more than two strings are infeasible in general. Stochastic processes provide one way around this problem. Averaging over many random alignments, sampled from the correct probability distribution, gives good estimates of the tree's edges. Secondly, sampling in an increasingly selective way gives a simulated annealing search for an optimal alignment. Although valuable for some purposes, an optimal alignment gives biased estimates of tree edges, the effect increasing with the level of mutation and being worst for internal edges of the tree. A program based on these ideas has been implemented for DNA strings using a simple 1-state model or mutation machine. Current work includes generalising the model of mutation that is used.

The stochastic techniques could be applied to other kinds of strings
and to related problems.
It is not essential for probabilities to be calculated
exactly for the simulated annealing method (only) to work;
it is enough that the rank-ordering of the better alignments
be correct at low temperature.
This would allow simulated annealing to be applied
to *star* and *all-pairs* costs (Altschul & Lipman 1989)
and to multiple LCS and edit-distance problems in a direct way.

The processes described are good candidates for both fine-grain
and coarse-grain parallel implementations.
Collins and Coulson (1984) and Dautricourt *et al* (1991)
have implemented fine-grain parallel versions of DPAs for two strings
and this could be done for the central step of our stochastic processes.
Dautricourt *et al* report that, for some algorithms,
1 hour's processing on a sparc-station could be done in 27 seconds on a Maspar.
Independent copies of the stochastic processes could also be run
as a coarse-grain parallel program.

### References

Allison L. (1993) A fast algorithm for the optimal alignment of three strings. J. Theor. Biol. **164**(2) 261-269.

Allison L. & C. S. Wallace. (1993) The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimisation of multiple alignments. TR 93/188 July '93, Dept. Comp. Sci., Monash University.

Allison L., C. S. Wallace & C. N. Yee. (1992a) Finite-state models in the alignment of macro-molecules. J. Mol. Evol. **35**(1) 77-89. *Also see [TR 90/148 (HTML)].*

Allison L., C. S. Wallace & C. N. Yee. (1992b) Minimum message length encoding, evolutionary trees and multiple alignment. 25^{th} Hawaii Int. Conf. Sys. Sci. **1** 663-674.

Allison L. & C. N. Yee. (1990) Minimum message length encoding and the comparison of macromolecules. Bull. Math. Biol. **52**(3) 431-453.

Altschul S. F. & D. J. Lipman. (1989) Trees, stars and multiple biological sequence alignment. SIAM J. Appl. Math. **49**(1) 197-209.

Bishop M. J., A. E. Friday & E. A. Thompson. (1987) Inference of evolutionary relationships. In *Nucleic Acid and Protein Sequence Analysis*, M. J. Bishop & C. J. Rawlings (eds), IRL Press, Oxford, 359-385.

Carrillo H. & D. J. Lipman. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math. **48** 1073-1082.

Collins J. F. & A. F. W. Coulson. (1984) Applications of parallel processing algorithms for DNA sequence analysis. Nucl. Acids Res. **12**(1) 181-192.

Dautricourt J-P., D. Brutlag, B. Moxon & J. Fier. (1991) Similarity searching algorithm comparisons using massively parallel computing hardware. Genome Sequencing Conf. III.

Felsenstein J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. **17** 368-376.

Felsenstein J. (1983) Inferring evolutionary trees from DNA sequences. In *Statistical Analysis of DNA Sequence Data*, B. S. Weir (ed), Marcel Dekker, 133-150.

Hastings W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika **57** 97-109.

Hein J. (1989) A tree reconstruction method that is economical in the number of pairwise comparisons used. Mol. Biol. & Evol. **6** 669-684.

Hirschberg D. S. (1975) A linear space algorithm for computing maximal common subsequences. CACM **18**(6) 341-343.

Huggins A. S., T. L. Bannam & J. I. Rood. (1992) Comparative sequence analysis of the catB gene from Clostridium butyricum. Antimicrobial Agents & Chemotherapy **36**(11) 2548-2551.

Ishikawa M., T. Toya, M. Hoshida, K. Nitta, A. Ogiwara & M. Kanehisa. (1992) Multiple sequence alignment by parallel simulated annealing. Institute for New Generation Computing (ICOT) TR-730.

Lake J. A. (1991) The order of sequence alignment can bias the selection of tree topology. Mol. Biol. Evol. **8**(3) 378-385.

Levenshtein V. I. (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady **10**(8) 707-710.

Rissanen J. (1983) A universal prior for integers and estimation by minimum description length. Ann. Stat. **11**(2) 416-431.

Sankoff D. & J. B. Kruskal (eds). (1983) *Time Warps, String Edits and Macro-Molecules: The Theory and Practice of Sequence Comparison.* Addison Wesley.

Sellers P. H. (1974) On the theory and computation of evolutionary distances. SIAM J. Appl. Math. **26**(4) 787-793.

Thorne J. L. & H. Kishino. (1992) Freeing phylogenies from artifacts of alignment. Molec. Biol. & Evol. **9**(6) 1148-1162.

Thorne J. L., H. Kishino & J. Felsenstein. (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. **33** 114-124.

Thorne J. L., H. Kishino & J. Felsenstein. (1992) Inching towards reality: an improved likelihood model of sequence evolution. J. Mol. Evol. **34** 3-16.

Tichy W. F. (1984) The string to string correction problem with block moves. ACM Trans. Comp. Sys. **2**(4) 309-321.

Ukkonen E. (1983) On approximate string matching. In *Proc. Int. Conf. on Foundations of Computation Theory*, M. Karpinski (ed), Springer Verlag LNCS **158** 487-495.

Wagner R. A. & M. J. Fischer. (1974) The string to string correction problem. JACM **21**(1) 168-173.

Wallace C. S. & D. M. Boulton. (1968) An information measure for classification. Comp. J. **11**(2) 185-194.

Wallace C. S. & P. R. Freeman. (1987) Estimation and inference by compact coding. J. Royal Statist. Soc. B **49**(3) 240-265.

Yee C. N. & L. Allison. (1993) Reconstruction of strings past. Comp. Appl. BioSci. **9**(1) 1-7.

**Note,
17th Annual Computer Science Conference, ACSC-17, Christchurch, New Zealand,
Proceedings are published in
Australian Computer Science Communications
Vol 16 (!) No 1, 1994,
isbn:0-473-02313-X,
ed. G. Gupta.**