## Strings

### String Searching

The problem is to search for a pattern string, `pat[1..m]`, in a text string `txt[1..n]`.
Usually `n >> m`, and `txt` might be very long indeed,
although this is not necessarily so.
This problem occurs in text editors and many other computer applications.

### Naive Search

The naive string searching algorithm is to examine each position, `i >= 1`, in `txt`,
trying for equality of `pat[1..m]` with `txt[i..i+m-1]`.
If there is inequality, position `i+1` is tried, and so on.
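
For concreteness, here is a minimal Python sketch of the naive algorithm; the function name `naive_search` and the 0-based indexing are choices made for this illustration (the description above numbers strings from 1).

```python
def naive_search(pat, txt):
    """Return the 0-based index of the first occurrence of pat in txt, or -1."""
    m, n = len(pat), len(txt)
    for i in range(n - m + 1):          # candidate start position in txt
        if txt[i:i + m] == pat:         # compare pat with txt[i..i+m-1]
            return i
    return -1

print(naive_search("lone", "I floated lonely as a cloud"))   # 10
```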

The worst-case time-complexity of the naive algorithm is seen to be O(m*n),
e.g., `pat = a^{m-1}b` and `txt = a^{n-1}b`.

Mercifully there are faster algorithms!


### Rabin's Algorithm


Rabin's string searching algorithm uses a "rolling hash".
Hashing is an idea borrowed from hash-tables but used here without the "table".
In Rabin's algorithm,
the hash function is defined to compute a number in [0..p-1] for a large
prime number p.
The integers modulo p, Z_{p}, behave as a mathematical *field*.
That is, they behave just as the usual unbounded integers do
with respect to addition and multiplication, and in particular
there are no "divisors of zero",
that is, no non-zero values j and k such that j*k = 0 (mod p).

At each stage of the algorithm, hash(txt[i..i+m-1]) is compared with hash(pat), e.g.,

    patHash   = ( (...(ord(pat[1])*3 + ord(pat[2]))*3 + ... )*3 + ord(pat[m]) ) mod p
    txtHash_1 = ( (...(ord(txt[1])*3 + ord(txt[2]))*3 + ... )*3 + ord(txt[m]) ) mod p
    txtHash_i = ( (txtHash_{i-1} - ord(txt[i-1])*power)*3 + ord(txt[i+m-1]) ) mod p,  for i > 1

where power = 3^{m-1} and ord(ch) is the character code of ch.

If the values are
equal then a match has *almost certainly* been found.
A character by character check must be made to be absolutely sure.
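
A minimal Python sketch of the idea follows, assuming character codes from `ord` and the base 3 used in the formulas above; the particular prime `p = 1000003` and the 0-based indexing are illustrative choices only.

```python
def rabin_search(pat, txt, p=1000003, base=3):
    """Rabin-style search: compare rolling hashes, then verify character by character."""
    m, n = len(pat), len(txt)
    if m == 0 or m > n:
        return -1
    power = pow(base, m - 1, p)                     # base^(m-1) mod p
    pat_hash = txt_hash = 0
    for j in range(m):                              # initial hashes of pat and txt[0..m-1]
        pat_hash = (pat_hash * base + ord(pat[j])) % p
        txt_hash = (txt_hash * base + ord(txt[j])) % p
    for i in range(n - m + 1):
        if txt_hash == pat_hash and txt[i:i + m] == pat:   # verify, ruling out false positives
            return i
        if i + m < n:                               # roll the hash one position to the right;
            txt_hash = ((txt_hash - ord(txt[i]) * power) * base + ord(txt[i + m])) % p
            # note: Python's % is non-negative here, avoiding the negative-mod pitfall
            # mentioned in the Notes below
    return -1

print(rabin_search("abra", "abracadabra"))          # 0
```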

#### Time Complexity

Rabin's algorithm is (almost always) fast,
i.e. O(m+n) *average-case* time-complexity, because
hash(txt[i..i+m-1]) can be computed in O(1) time -
by two multiplications, a subtraction,
an addition and a `mod` -
given its predecessor hash(txt[i-1..i-1+m-1]).

The *worst-case* time-complexity does however remain at O(m*n)
because of the possibility of false-positive matches on
the basis of the hash numbers,
although these are very rare indeed.

#### Space Complexity

Space complexity is O(1) for a few scalar variables.

### Boyer-Moore Algorithm

The Boyer-Moore algorithm
is *always* fast, having worst-case time-complexity O(m+n),
and on natural-language text it actually gets faster, up to a point, as m increases;
e.g., Boyer and Moore suggest O(n/4)-time on *average* for m=5.

The first key idea, and it is a good one,
is that if the m^{th} character `ch = txt[m]`
(numbered from `1` upwards)
does not occur in `pat` at all,
then any instance of `pat` in `txt`
must start at position m+1 or later.

E.g., if searching for `pat='freddy'`
in `txt='I floated lonely as a cloud'`,
the 6^{th} character of `txt` is `'a'`,
which is not in {d,e,f,r,y},
and there is no need to consider positions 1 to 6 of `txt` any more!
In this way, we can move along `txt` in steps
of `m` positions, i.e. `k = m, 2m, 3m, ...`,
provided we are lucky.

The second idea is that if the *last* occurrence of
the character `ch = txt[k]` in `pat`
is `delta_{1}[ch]` positions from the right-hand end of `pat`,
then `pat` can be slid that many positions to the right
before we *might* get a match in `txt`.
If `ch = txt[k]` does not occur in `pat`,
set `delta_{1}[ch]` equal to m.

E.g., for `pat='freddy'`:
delta_{1}['e']=3,
delta_{1}['f']=5,
delta_{1}['d']=1,
delta_{1}['r']=4,
delta_{1}['y']=0.
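
These values can be checked with a short Python sketch; the helper name `delta1` and the dictionary representation are illustrative choices, not part of the original description.

```python
def delta1(pat):
    """Distance of each character's last occurrence in pat from the right-hand end.
    Characters not in pat should default to m when looked up (use .get(ch, m))."""
    m = len(pat)
    table = {}
    for j, ch in enumerate(pat):       # later occurrences overwrite earlier ones
        table[ch] = m - 1 - j
    return table

print(delta1("freddy"))   # {'f': 5, 'r': 4, 'e': 3, 'd': 1, 'y': 0}
```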

In general, if the last `j` characters match,
i.e. `txt[k-j..k] = pat[m-j..m]`,
but `txt[k-j-1] ~= pat[m-j-1]`,
then `pat` can be "slid"
a certain distance along `txt`
depending on where, if at all,
there is an earlier instance of `pat[m-j..m]` within `pat`;
e.g., consider `pat=`**abra**cad**abra** or, e.g., `pat=fababab`.
An array, `delta_{2}[1..m]`, is used to hold these precomputed distances.
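
As a rough illustration, here is a hedged Python sketch of the search loop using only the `delta_1` (bad-character) shift; the full Boyer-Moore algorithm also takes the maximum with the `delta_2` shift just described, which is omitted here for brevity.

```python
def bm_search_delta1_only(pat, txt):
    """Boyer-Moore-style search using only the delta_1 heuristic (simplified sketch)."""
    m, n = len(pat), len(txt)
    if m == 0 or m > n:
        return -1
    d1 = {ch: m - 1 - j for j, ch in enumerate(pat)}   # delta_1 table, as above
    k = m - 1                            # 0-based txt index aligned with the last char of pat
    while k < n:
        j = 0                            # characters matched, counting from the right
        while j < m and txt[k - j] == pat[m - 1 - j]:
            j += 1
        if j == m:
            return k - m + 1             # full match; return its start position
        shift = d1.get(txt[k], m)        # slide pat so its last txt[k] lines up with txt[k]
        k += max(shift, 1)               # always advance at least one position
    return -1

print(bm_search_delta1_only("freddy", "I floated lonely as a freddy"))   # 22
```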

#### Time Complexity

The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).

The algorithm often runs in O(n/m)-time on natural-language text
for small values of m.
Note that if `txt` is in slow, block-access, backing store,
it is generally not possible to bring just every
`m`^{th} character into main memory,
so the time-complexity is then O(n).

#### Space Complexity

The space-complexity is O(m + |alphabet|), for the arrays delta_{1}[] and delta_{2}[].

### Preprocessing `txt`

It is intuitively obvious that the worst-case for
searching for an arbitrary pattern, pat, in a text, txt,
must take at least O(n)-time, where n=|txt|.
However, if some time is spent *pre-processing* txt,
then individual searches can be made more quickly,
e.g., in O(m)-time using a
suffix-tree,
where m=|pat|.
It takes O(n)-time to build a suffix tree for txt.
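
As an illustration of the preprocess-then-search idea, the sketch below substitutes a simple sorted suffix array for the suffix tree: the naive construction used here is not O(n)-time (it sorts all suffixes), but once built, each search costs only about O(m log n) string comparisons.

```python
def build_suffix_array(txt):
    """Sort all suffix start positions of txt (naive construction, not linear-time)."""
    return sorted(range(len(txt)), key=lambda i: txt[i:])

def sa_search(pat, txt, sa):
    """Binary-search the suffix array for a suffix that begins with pat."""
    m = len(pat)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # leftmost suffix whose m-prefix >= pat
        mid = (lo + hi) // 2
        if txt[sa[mid]:sa[mid] + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and txt[sa[lo]:sa[lo] + m] == pat:
        return sa[lo]                                # start of one occurrence in txt
    return -1

txt = "abracadabra"
sa = build_suffix_array(txt)                         # preprocess txt once ...
print(sa_search("cad", txt, sa))                     # ... then answer queries quickly: 4
```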

### Notes

- R. S. Boyer and J. S. Moore.
  *A Fast String Searching Algorithm.* Comm. A.C.M. **20**(10), pp. 762-772, Oct 1977.
  The paper has a rather nice section on how the algorithm was developed, over quite some time, from the initial idea, with contributions from various people.
- D. E. Knuth, J. H. Morris and V. R. Pratt.
  *Fast Pattern Matching in Strings.* SIAM Jrnl. Comput. **6**(2), pp. 323-350, Jun 1977.
- R. M. Karp and M. O. Rabin.
  *Efficient Randomized Pattern-Matching Algorithms.* IBM Jrnl. Res. and Dev. **31**(2), pp. 249-260, Mar 1987.
- G. de V. Smit.
  *A Comparison of Three String Matching Algorithms.* Software Practice and Experience **12**, pp. 57-66, Jan 1982.
  Compared (i) a straightforward algorithm, (ii) the Knuth-Morris-Pratt algorithm, and (iii) the Boyer-Moore algorithm. The last of these was usually best, by far.
- There are slight programming variations depending on whether strings are indexed from zero or from one.
  *Beware:* some implementations of `m mod n` in some languages and systems are incorrect when m is negative, which can cause a nasty surprise when implementing Rabin's algorithm!
- The **suffix tree** data structure for txt can be built in O(|txt|)-time and can then be used to find a substring, pat, in O(|pat|)-time.
- Also see the edit-distance problem, a.k.a. *approximate* string matching.