Strings

String Searching

The problem is to search for a pattern string, pat[1..m], in a text string txt[1..n]. Usually n>>m, and txt might be very long indeed, although this is not necessarily so. This problem occurs in text-editors and many other computer applications.

Naive Search

The naive string searching algorithm is to examine each position i, 1<=i<=n-m+1, in txt, testing for equality of pat[1..m] with txt[i..i+m-1]. If there is inequality, position i+1 is tried, and so on.
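
A minimal sketch of the naive algorithm in Python (0-based indexing, unlike the 1-based convention above; the name naive_search is illustrative):

def naive_search(pat, txt):
    """Return the 0-based start of the first occurrence of pat in txt, or -1."""
    m, n = len(pat), len(txt)
    for i in range(n - m + 1):        # each candidate position in txt
        if txt[i:i+m] == pat:         # up to m character comparisons per position
            return i
    return -1

print(naive_search("aab", "aaaab"))   # -> 2: fails at i=0 and i=1 before matching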

The worst-case time-complexity of the naive algorithm is seen to be O(m*n),
e.g., pat=a^(m-1)b and txt=a^(n-1)b.

Mercifully there are faster algorithms!

Rabin's Algorithm

Rabin's string searching algorithm uses a "rolling hash". Hashing is an idea borrowed from hash tables, but used here without the table. In Rabin's algorithm, the hash function computes a number in [0..p-1] for some large prime number p. The integers modulo p, Z_p, form a mathematical field; that is, they behave just as the usual unbounded integers do with respect to addition and multiplication, and in particular there are no "divisors of zero": no non-zero values j and k such that j*k = 0 (mod p).

At each stage of the algorithm, hash(txt[i..i+m-1]) is compared with hash(pat), e.g.,

patHash   = ( (...(ord(pat[1])*3 + ord(pat[2]))*3 + ...)*3 + ord(pat[m])
            ) mod p


txtHash_1 = ( (...(ord(txt[1])*3 + ord(txt[2]))*3 + ...)*3 + ord(txt[m])
            ) mod p


txtHash_i = ( (txtHash_{i-1} - ord(txt[i-1])*power)*3
                             + ord(txt[i+m-1])
            ) mod p,                              where i > 1


where power = 3^(m-1) mod p
and   ord(ch) is the character code of ch

If the two hash values are equal then a match has almost certainly been found, but a character-by-character check must still be made to be absolutely sure.
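
As a hedged sketch of the whole algorithm in Python (0-based indexing; the name rabin_search and the particular base and prime are illustrative, chosen to mirror the formulas above):

def rabin_search(pat, txt, base=3, p=1000003):
    """Rolling-hash search; returns the 0-based index of the first match, or -1.
    base=3 and this prime p mirror the example above; real uses pick larger values."""
    m, n = len(pat), len(txt)
    if m > n:
        return -1
    power = pow(base, m - 1, p)               # 3^(m-1) mod p
    pat_hash = txt_hash = 0
    for j in range(m):                        # Horner evaluation of both hashes
        pat_hash = (pat_hash * base + ord(pat[j])) % p
        txt_hash = (txt_hash * base + ord(txt[j])) % p
    for i in range(n - m + 1):
        # equal hashes: almost certainly a match, but verify character by character
        if pat_hash == txt_hash and txt[i:i+m] == pat:
            return i
        if i + m < n:                         # roll the window one position right
            txt_hash = ((txt_hash - ord(txt[i]) * power) * base
                        + ord(txt[i + m])) % p
    return -1

print(rabin_search("dy", "freddy"))           # -> 4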

Time Complexity

Rabin's algorithm is (almost always) fast, i.e. O(m+n) average-case time-complexity, because hash(txt[i..i+m-1]) can be computed in O(1) time - by two multiplications, a subtraction, an addition and a `mod' - given its predecessor hash(txt[i-1..i+m-2]).

The worst-case time-complexity does however remain at O(m*n) because of the possibility of false-positive matches on the basis of the hash numbers, although these are very rare indeed.

Space Complexity

Space complexity is O(1) for a few scalar variables.

Boyer-Moore Algorithm

The Boyer-Moore algorithm is always fast, having worst-case time-complexity O(m+n), and on natural-language text it actually gets faster, up to a point, as m increases; e.g., Boyer and Moore suggest about n/4 character comparisons on average for m=5.

The first key idea, and it is a good one, is that if the mth character ch=txt[m] (numbered from `1' upwards) does not occur in pat at all then any instance of pat in txt must start at position m+1 or later.
E.g., if searching for pat=`freddy' in txt=`I floated lonely as a cloud', the 6th character of txt is `a', which is not in {d,e,f,r,y}, so there is no need to consider positions 1 to 6 of txt at all! In this way we can move along txt in steps of m positions, i.e. k=m, 2m, 3m, ..., provided we are lucky.

The second idea is that if the last occurrence of the character ch=txt[k] in pat is delta1[ch] positions from the right-hand end of pat, then pat can be slid that many positions to the right before we might get a match in txt. If ch does not occur in pat at all, set delta1[ch] equal to m.
E.g., for pat=`freddy': delta1['e']=3, delta1['f']=5, delta1['d']=1, delta1['r']=4, delta1['y']=0.
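
A minimal sketch of the delta1 computation in Python (a dictionary stands in for an array indexed by character; make_delta1 is an illustrative name):

def make_delta1(pat):
    """delta1[ch] = distance of ch's last occurrence from the right-hand end
    of pat; characters absent from pat default to m (via .get below)."""
    m = len(pat)
    # later occurrences overwrite earlier ones, leaving the last occurrence
    return {ch: m - 1 - j for j, ch in enumerate(pat)}

d1 = make_delta1("freddy")
for ch in "efdry":
    print(ch, d1.get(ch, 6))      # -> e 3, f 5, d 1, r 4, y 0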

In general, if the last j characters match, i.e. txt[k-j+1..k]=pat[m-j+1..m], but txt[k-j]~=pat[m-j], then pat can be "slid" a certain distance along txt depending on where, if at all, there is an earlier instance of pat[m-j+1..m] within pat; e.g., consider pat=abracadabra or, e.g., pat=fababab. An array, delta2[1..m], is used to hold these precomputed distances.
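
Computing delta2 is the fiddly part; as a hedged illustration of the sliding idea alone, here is the Boyer-Moore-Horspool simplification in Python, which shifts using only a delta1-style table built over pat[1..m-1] and so is not the full Boyer-Moore algorithm (full Boyer-Moore takes the larger of the delta1 and delta2 shifts):

def horspool_search(pat, txt):
    """Boyer-Moore-Horspool: slide pat using only a delta1-style table.
    Returns the 0-based index of the first match, or -1."""
    m, n = len(pat), len(txt)
    # shift on the character aligned with pat's last position; the table is
    # built over pat[0..m-2] so a shift can never be zero
    shift = {ch: m - 1 - j for j, ch in enumerate(pat[:-1])}
    k = m - 1                          # txt index aligned with pat's last character
    while k < n:
        if txt[k-m+1 : k+1] == pat:    # the original compares right to left
            return k - m + 1
        k += shift.get(txt[k], m)      # characters not in pat shift by m
    return -1

print(horspool_search("freddy", "I floated lonely as a freddy"))  # -> 22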

Time Complexity

The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).

The algorithm often runs in O(n/m)-time on natural-language text for small values of m. Note that if txt is in slow, block-access backing store, it is generally not possible to bring just every mth character into main memory, so the time-complexity is then O(n).

Space Complexity

The space-complexity is O(m + |alphabet|), for the arrays delta1[ ] and delta2[ ].

Preprocessing txt

It is intuitively obvious that searching for an arbitrary pattern, pat, in a text, txt, must take time at least proportional to n=|txt| in the worst case. However, if some time is spent preprocessing txt, then individual searches can be made more quickly, e.g., in O(m)-time using a suffix tree, where m=|pat|. It takes O(n)-time to build a suffix tree for txt.
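
A hedged sketch of the preprocessing idea in Python, substituting a naive sorted suffix list (O(n log n) comparisons to build and O(n^2) space as written) for a true O(n)-time, O(n)-space suffix tree; each individual search is then a binary search:

from bisect import bisect_left

def build_suffix_list(txt):
    # every occurrence of a pattern in txt is a prefix of some suffix;
    # storing suffixes explicitly is O(n^2) space - a real suffix tree avoids this
    return sorted((txt[i:], i) for i in range(len(txt)))

def occurrences(pat, suffixes):
    hits = []
    k = bisect_left(suffixes, (pat,))  # first suffix >= pat in sorted order
    while k < len(suffixes) and suffixes[k][0].startswith(pat):
        hits.append(suffixes[k][1])
        k += 1
    return sorted(hits)

suffixes = build_suffix_list("abracadabra")
print(occurrences("abra", suffixes))   # -> [0, 7]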
