Hirschberg's Algorithm
Edit-Distance Revisited
An earlier section described the dynamic programming algorithm (DPA) which calculates the edit-distance of two strings s1 and s2. This algorithm takes O(|s1|*|s2|) time. An optimal alignment which displays an actual sequence of operations editing s1 into s2 can be recovered from the distance matrix `m' using O(|s1|*|s2|) space. Hirschberg (CACM 18(6) pp.341-343 1975) showed that an optimal alignment can be recovered in O(|s1|*|s2|) time and only O(|s2|) space using binary-recursion. If A contains zero characters or one character, an alignment is easily found. Otherwise the approach is to cut string s1 in the middle to form s1a and s1b and to find the corresponding place to cut string s2 into s2a and s2b. Note, s2a and s2b may have quite different lengths. The alignment problem is then solved recursively on s1a and s2a and on s1b and s2b. This is an example of the divide and conquer paradigm.
Recall that one row of `m' can be calculated from just the previous row. This allows the last row of `m' to be calculated in O(|s2|) space. By adding a boundary row m(|s1|+1, ) and a boundary column m( ,|s2|+1), the DPA can also be run in reverse from m(|s1|+1,|s2|+1) to m(1,1). Note that D(reverse(s1), reverse(s2)) = D(s1,s2). This allows the DPA to be run both forwards and backwards.
The DPA is run forward on s1a and on all of s2 with a step of +1 and in reverse on s1b and all of s2 with a step of -1. An optimal alignment of s1 and s2 could be partitioned between s1a and s1b. This would give s2a and s2b. The edit-distances of s1a and all possible s2a's are contained in the result of comparing s1a and s2. Similarly, the edit-distances of s1b and all possible s2b's are contained in the result of comparing s1b and s2. The last or bottom row for the forward calculation and last or top row for the reverse calculation are found in O(|s2|) space. Adding corresponding elements of these two rows together gives the total cost of editing s1 into s2 for each possible way of dividing s2 into s2a and s2b. Choosing a minimum sum gives an optimal s2a and s2b.
1 1 0 1 2 3 4 5 6 7 8 9 0 1 A C G T A C G T A C G T * . . . . . . . . . . . . 0 A . * . . . . . . . . . . . 1 C . . * * . . . . . . . . . forward DPA 2 T . . . . * . . . . . . . . down and right 3 A . . . . . * . . . . . . . 4 C . . . . . . * . . . . . . 5 C . . . . . . . * . . . . . | an optimal | path & split . . . . . . . * . . . . . 6 T . . . . . . . . * . . . . 7 A . . . . . . . . . * . . . 8 C . . . . . . . . . . * . . reverse DPA 9 A . . . . . . . . . . * . . up and left 10 G . . . . . . . . . . . * . 11 T . . . . . . . . . . . . * A C G T A C G T A C G T |
The sub-problems of aligning s1a with s2a and s1b with s2b are solved recursively. The base cases when the length of s1 is zero or one are easily dealt with.
Complexity
The space used is dominated by the rows of the edit matrix and is O(|s2|). The time used to calculate a value for D(s1,s2) is O(|s1|*|s2|) but the recursion must be considered. The two, top-level recursive calls each solve a smaller problem. Together these add up to half of the size of the original problem, and take O(|s1|*|s2|/2) time in total (think about the area of matrix). At the next recursive level, four problems in total one quarter of the original size are solved, and so on. Therefore the total time taken is O(|s1|*|s2|*(1+1/2+1/4+...)) which is O(2*|s1|*|s2|) which is O(|s1|*|s2|). This use of recursion reduces the space used from O(|s1|*|s2|) to O(|s2|) and roughly doubles the time taken. A decision to use the technique can only be made with regard to the length of typical strings and the memory size and speed of the computer. Strings from the molecular-biology application are typically a few hundred to a few thousand characters long which can make its use worthwhile.
- Try `go', change the strings in the HTML FORM below and experiment:
- The trace above shows the sequence of recursive calls with indentation used to indicate the depth of recursion.
References
- D. S. Hirschberg.
A linear space algorithm for computing maximal common subsequences.
Comm. A.C.M., 18(6), pp.341-343, 1975.
The original algorithm was given in terms of the longest common subsequence (LCS) problem, but it is easily adapted for the edit-distance problem, as above. - D. R. Powell, L. Allison, T. I. Dix.
A versatile divide and conquer technique for optimal string alignment.
Inf. Proc. Lett., 70, pp.127-139, 1999.
A variation on the theme allows divide and conquer to be applied to more difficult problems, involving complex gap-costs, and even to Ukkonen's greedy, fast algorithm. - L. Allison, Dynamic programming algorithm (dpa) for edit-distance, AlgDS/Dynamic/Edit/, 1999.