Multiple Alignment
Given k>2 sequences, the multiple alignment problem is to pad them out with null characters, '-', so that they all have the same length and to do this in such a way as to optimise some criterion -- minimise a cost or maximise a score.
The most popular choices, in no particular order, for specifying the cost (score) function of mutiple alignment are:
- All-pairs costs: The multiple alignment is "projected" onto all k×(k-1)/2 possible pairs out of the k sequences, the cost (score) of every such two-way alignment is calculated, and they are all summed.
- Star costs: A single hypothetical ancestor is assumed connected to each one of the given k sequences by an edge. The minimum possible sum of costs (max score) from the ancestor to each leaf is calculated.
- Tree-costs: The leaf sequences are related by a phylogenetic tree (evolutionary tree, family tree). The cost is the sum of the costs corresponding to each edge in the tree. The tree can be unrooted or rooted and, especially in the latter case, a "molecular clock" may be assumed (or not). This models the evolutionary process most closely but, the problem is, where to get the phylogenetic tree?
The simple multiple alignment algorithm for two sequences of length ~n takes O(n2)-time. The corresponding algorithm for k≥3 sequences takes O(nk)-time. This is infeasible if n is "large" and k is more than a "few" in which case either heuristics or a stochastic search must be used. For example, each internal hypothetical node in an unrooted phylogenetic tree has three neighbours so it is possible to do repeated three-way alignment in either a greedy or a stochastic way. Alternatively a k-way alignment may be "projected" onto any k'<k of those sequences. Therefore, under tree-costs the phylogenetic tree can be "broken" on any edge and the two corresponding sub-alignments of leaves can be realigned in about O(n2)-time. One can either (Gibbs-) sample random alignments or "cool" the process towards an optimal multiple alignment [Allison & Wallace].