Strings
String Searching
The problem is to search for a pattern string,
pat[1..m]
,
in a text string
txt[1..n]
.
Usually n>>m
,
and txt
might be very long indeed,
although this is not necessarily so.
This problem occurs in text-editors and many other computer applications.
Naive Search
The naive string searching algorithm
is to examine each position, i>=1, in txt
,
trying for equality of
pat[1..m]
with txt[i..i+m-1]
.
If there is inequality, position i+1
is tried, and so on.
The worst-case time-complexity of the naive algorithm is seen to be
O(m*n),
e.g.,
pat=am-1b
and
txt=an-1b
Mercifully there are faster algorithms!
Demonstration
The HTML FORM below allows you to search for a pattern string
pat
in a text string txt
.
It uses three algorithms: the naive algorithm (above),
Rabin's algorithm and the Boyer-Moore algorithm (below).
Change the values of pat
and txt
,
run the algorithms and experiment:
Rabin's Algorithm
© L . A l l i s o n |
Rabin's string searching algorithm uses a "rolling hash". Hashing is an idea borrowed from hash- [tables] but used here without the "table". In Rabin's algorithm, the hash function is defined to compute a number in [0..p-1] for a large prime number p. The integers modulo p, Zp, behave as a mathematical field. That is, they behave just as the usual unbounded integers do (see [intro']) with respect to addition and multiplication, and in particular there are no "divisors of zero", that is no non-zero values j & k such that j*k=0.
At each stage of the algorithm, hash(txt[i..i+|pat|-1]) is compared with hash(pat), e.g.,
patHash = (((ord(pat[1])*3+ord(pat[2]))*3+...+ord(pat[m]) ) mod p txtHash1= (((ord(txt[1])*3+ord(txt[2]))*3+...+ord(txt[m]) ) mod p txtHashi= ((txtHashi-1 - ord(txt[i-1])*power)*3 + ord(txt[i+m-1]) ) mod p, where i>1 where power = 3m-1 and ord(ch) is the character code of ch
If the values are equal then a match has almost certainly been found. A character by character check must be made to be absolutely sure.
Time Complexity
Rabin's algorithm is (almost always) fast, i.e. O(m+n) average-case time-complexity, because hash(txt[i..i+m-1]) can be computed in O(1) time - i.e. by two multiplications, a subtraction, an addition and a `mod' - given its predecessor hash(txt[i-1..i-1+m-1]).
The worst-case time-complexity does however remain at O(m*n) because of the possibility of false-positive matches on the basis of the hash numbers, although these are very rare indeed.
Space Complexity
Space complexity is O(1) for a few scalar variables.
Boyer-Moore Algorithm
The Boyer Moore algorithm is always fast having worst-case time-complexity O(m+n), but on natural-language text it actually gets faster as m increases to a certain extent, e.g., Boyer and Moore suggesting O(n/4)-time on average for m=5.
The first key idea, and it is a good one,
is that if the mth
character ch=txt[m]
(numbered from `1' upwards)
does not occur in pat
at all
then any instance of pat
in txt
must start at position m+1 or later.
E.g., if searching for pat=`freddy'
,
in txt=`I floated lonely as a cloud'
,
the 6th character of txt
is `a
'
which is not in {d,e,f,r,y}
and there is no need to consider positions 1 to 6 of txt
any more!
In this way, we can move along txt
in steps
of m
positions, i.e. k=m, 2m, 3m, ...,
provided we are lucky.
The second idea is that if the last occurrence of
the character txt[k]
in pat
is delta1[ch]
positions from the right hand end of
pat
, then
pat
can be slid that many positions to the right
before we might get a match in txt
.
If ch=txt[k]
does not occur in pat
,
set delta1[ch]
equal to m.
E.g.,
delta1['e']=3,
delta1['f']=5,
delta1['d']=1,
delta1['r']=4,
delta1['y']=0.
In general if the last j
characters match,
i.e. txt[k-j..k]=pat[m-j..m]
, but
txt[k-j-1]~=pat[m-j-1]
,
then pat
can be "slid"
a certain distance along txt
depending on where, if at all,
there is an earlier instance of pat[m-j..m]
within pat
;
e.g., consider pat=abracadabra
or, e.g., pat=fababab
.
An array, delta2[1..m]
,
is used to hold these precomputed distances.
Time Complexity
The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).
The algorithm often runs in O(n/m)-time on natural-language text
for small values of m.
Note that if txt
is in slow, block-access, backing store,
it is generally not possible to bring just every
m
th character
into main memory, so the time-complexity is then O(n).
Space Complexity
The space-complexity is O(m + |alphabet|), for the arrays delta1[ ] and delta2[ ].
Preprocessing txt
.
It is intuitively obvious that the worst-case for searching for an arbitrary pattern, pat, in a text, txt, must take at least O(n)-time, where n=|txt|. However, if some time is spent pre-processing txt, then individual searches can be made more quickly, e.g., in O(m)-time using a suffix-tree, where m=|pat|. It takes O(n)-time to build a suffix tree for txt.
Notes
- R. S. Boyer and J. S. Moore.
A Fast String Searching Algorithm.
Comm. A.C.M. 20(10) p762-772 Oct 1977.
The paper has a rather nice section on how the algorithm was developed, over quite some time, from the initial idea, with contributions from various people. - D. E. Knuth, J. H. Morris and V. R. Pratt. Fast Pattern Matching in Strings. SIAM Jrnl. Comput. 6(2) p323-350 Jun 1997.
- R. M. Karp and M. O. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Jrnl. Res. and Dev. 31(2) p249-260 Mar 1987.
- G. de v. Smit.
A Comparison of Three String Matching Algorithms.
Software Practice and Experience 12 p57-66 Jan 1982.
Compared (i) a straightforward algorithm, (ii) Knuth Morris Pratt algorithm, and (iii) the Boyer Moore algorithm. The last of these was usually best, by far. - There are slight programming variations depending on whether strings are indexed from zero or from one.
- Beware: Some implementations of `m mod n' in some languages and systems are incorrect when m is negative which can cause a nasty surprise when implementing Rabin's algorithm!
- The [Suffix Tree] data structure for txt can be built in O(|txt|)-time and can then be used to find a substring, pat, in O(|pat|)-time.
- Also see the [edit-distance] problem, a.k.a. approximate string matching.