## Lookup Tables

A *table* is used to store *elements*.
An element may be *inserted* in the table,
*searched* for (looked up) in the table,
and sometimes *deleted* from the table.
The idea of a table is very general and important in information
processing systems.
For example, a university data base might contain tables of
departments, staff, courses, lectures, students, exam results.
A compiler might use tables of
keywords, variables, types, constants, procedures and functions.
A table is a mutable data structure,
it has state,
it changes with time.
Its
properties
are therefore dynamic and are
okup
expressed by pre and postconditions (assertions):

- Type:
table of Element_Type

- Operations:
clear :table insert :Element_Type × table delete :Element_Type × table lookup :Element_Type × table -> boolean empty :table -> boolean

- Rules, for all x:Element_Type and T:table :
1. clear(T) assert empty(T) assert not lookup(x,T) 2. assert T=T' and not lookup(x, T) insert(x, T) assert lookup(y,T) = x=y or lookup(y,T') 3. assert T=T' and not lookup(x,T') insert(x,T) delete(x,T) assert T=T'

**The Table Abstract Data Type.**

Clear, insert and delete change the state of the table.

When the table has just been cleared, lookup returns false for any element. When an element is added to the table, lookup returns true for that element and the contents are otherwise unaffected. If a new element is added to the table and then deleted there is no net effect on the table. Two further situations need to be considered. If an element is in the table and an attempt is made to insert it again, the result could either be an error (4a) or no change (4b). If an element is not in the table and an attempt is made to delete it, the result could either be an error (5a) or no change (5b):

4a assert lookup(x,T) 4b assert T=T' and lookup(x,T) insert(x, T) % error insert(x, T) assert T=T' % no change 5a assert not lookup(x,T) 5b assert T=T' and not lookup(x,T) delete(X, T) % error delete(x, T) assert T=T' % no change

**Application Dependent Cases.**

Sometimes the element type
consists of a *key*
plus some other attributes or data.
The key might be a number or a name (string);
it is something *unique* to identify an element.
In this case the *lookup* operation usually returns the associated
attributes rather than just true or false.
If there is no extra data we call the element itself the key.

It is important for a program to be efficient: to use little space and to run quickly. How these aims can be achieved depends on the properties of the data and particularly the key, on the operations on the table and on the frequency and pattern of use of the different operations.

We say that the key type is *sparse* if the key values in use
form a very small fraction of the number of possible values in the type.
We call it a *dense* type if most of the possible values come into play.
These are only loose definitions and each case must be judged on its merits.
For example,
if keys are alpha-numeric strings they are necessarily sparse because
there are infinitely many possible strings.
Even if the string length is limited but reasonably large there
are a very great many strings.
On the other hand,
the days of the week form a dense enumerated type.
The sparseness of (some) key-types
has a large influence on the method of choice for implementing a table.
Indeed many of the data structures described in this book
have been devised precisely to deal with sparse types efficiently.

#### Sets by Tables

A table where the element is all key
can also be thought of as a set.
If the element type is a dense index type then
the table can be implemented as a bit mask.
Lookup is done by the membership test, **in**,
addition and deletion by set union and difference.
There is a limit of the range of element values in
some programming languages the provide a built-in `set' data-type.
A large, dense set can be implemented by an array of boolean
so this can also be used to implement the corresponding table.
Conversely, a sparse set can be implemented by the methods
for tables described below.

### Table by Unsorted Array

A table to hold a limited number of elements with a sparse key
can be implemented using an array and a counter.
Initially the table is empty and the counter is set to zero.
When a new element is inserted
the counter is incremented and the new element stored at that index.
The elements are in no particular order other than that of time
of insertion.
Lookup is by the *linear-search* algorithm
which steps sequentially through the array.

If there are n elements in the table, the search takes
O(1) time in the best case and
O(n) time in the average and worst cases.
Deletion first requires a search to find the element to delete;
*after* that it takes O(1) time to actually remove it:
it is over-written with the last element in the array and
the counter is decremented.
On the other hand, insertion is fast and takes a constant amount of time,
provided that the element is known to be absent in which case
a search is unnecessary.

#### Notes.

- See the [String Searching] problem which is related to linear search.

### Table by Sorted Array

A person does not
use linear search to find a name in a telephone directory,
except perhaps for Dustin Hoffman's character in `The Rain Man'.
Rather the directory is opened at a likely place,
and the key compared with names on that page.
If the key is alphabetically before the names on the page,
the search is continued in the front part of the directory
otherwise the back part.
The fact that the directory is alphabetically ordered
or *sorted* is essential to
this process.
The *binary-search* algorithm makes use of an ordering of
the key type in a similar way.
The table is implemented by a sorted array.
Two "pointers" are used to delimit the search area,
initially the occupied part of array.
The middle element is compared with the key and either the
lower pointer moved up or the upper pointer moved down as appropriate.

eg. Search for 3 in array containing 1 2 3 4 5 6 7 8 lo mid hi ? ? ? ? 5 ? ? ? select lower half lo mid hi ? ? 3 ? - - - - select upper half lo mid hi - - ? 4 - - - - select lower half lo hi - - 3 - - - - - ^ 3 is found

**Example Binary Search.**

*if*it is present.

{Search for key in A[1..N] (Language=Pascal)} lo := 1; hi := N+1; while lo < hi-1 do {i.e. there are at least 2 elements} begin {INVARIANT: search key in A[1..N] iff key in A[lo..hi-1]} mid := (lo+hi) div 2; {assert: lo < mid < hi} if key >= A[mid] then {must be in right part} lo := mid else {i.e. key < A[mid], search left} hi := mid end; found := A[lo]=key {?is it there?}

**Binary Search**

Change the data in the HTML FORM below, click `go', and experiment. The region of the array from `lo' to `hi-1' is [bracketted] and a[mid] is starred `*' in the trace:

© L . A l l i s o n |

#### Correctness

The algorithm is short but there are some subtle points to observe.
Firstly, if the key is present it lies between `lo' inclusive and `hi'
*exclusive* - this is the loop invariant.
The invariant is true at the start of the first iteration because
lo=1 and hi=n+1.
The loop body ensures that the invariant remains true
at the start of each subsequent iteration.
This is so because of the choice of test in the central **if** statement.
If the mid element is less than or equal to the key and lo is moved up to mid
the invariant is still true.
The converse is that the mid element is greater than the key
and when hi is moved down the invariant is still true.
Note the connection between the hi-1 in the invariant and the `>=' test.
When hi=lo+1 the search has narrowed to just one element,
A[lo..hi-1] = A[lo..lo] = A[lo],
which can be compared to the search key for equality.

#### Termination

The algorithm terminates because the two integer-valued pointers are moving closer together and the exit test must eventually succeed. If hi>=lo+1 then hi>lo+2 and so lo<mid<hi. Therefore regardless of whether `lo' is moved up to mid or hi is moved down to mid, lo and hi move closer together at each iteration and closer to satisfying the termination test.

#### Testing

It is very easy to get the details of binary search wrong when coding it so it is a good idea to test it on at least the following cases:

- an empty array!
- an array of one element, (a) equal to the key or (b) less than it or (c) greater than it
- several elements with the key less than the 1st or
- the key greater than the last element
- the key equal to the 1st, a middle, or the last element
- the key not in the array and falling between various positions

#### Complexity

Finally, the algorithm takes O(log(n)) time
as each iteration halves the search area.
Note that it is *not* worth testing for equality within the loop
and exiting.
To do so would add an extra test and time to every iteration.
The *average* number of iterations saved would be limited by

1 × 1/2 + 2 × 1/4 + 3 × 1/8 + ...

*iterations*saved would be bounded by two but the (averge and worst-case) loss in

*time*due to the extra test in each iteration executed would be unbounded [Dijkstra].

Binary-search takes O(log(n)) time in all cases which is much faster than linear search. If the table does not change once created, ie. there is no insertion or deletion, it can be sorted once in O(n*log(n)) time; see the chapter on sorting. Insertion and deletion must keep the array sorted and they take O(n) time at worst and on average which is relatively slow. This is because insertion and deletion to the left of position n require elements to be moved up or down to make space or to fill the hole. Using a sorted array may have an extra benefit if the elements are also processed sequentially, for example to print alphabetical lists.

See also the section on binary search trees in the chapter on trees, and solving equations f(x)=0.

### Hash Tables

For an apparently crazy idea hashing works remarkably well.
A *hash* function maps key values onto the index
type of an array which implements the hash table.
Elements are accessed directly using the hash index
as an array index.
The surprising part is that the best hash functions are
essentially pseudo-random functions.
The ideal hash function maps the actual key values uniformly onto
the index type;
unfortunately the ideal is not always achieved.

For many purposes, a "reasonable" hash function on strings is to take the binary value of the 1st character, multiply by three, add the binary value of the 2nd character, multiply by three, and so on, finally adding the binary value of the last character. The hash value is this total modulo the size of the hash table.

The multiplication by three in the hash function ensures that every character contributes to the least significant bit of the hash index. Changing any one character in the string by one will give a different hash index if the size of the hash table is even, and it is often a power of two. Multiplication by an even number would cause the effect of the leading characters of a long string to be lost (exercise: why?).

It is impossible to design a good hash function for all circumstances.
In general it is necessary to analyse the statistical properties of the key
type and design a hash function to suit.
For example,
if a hash table is used to manage a table of identifiers in a compiler,
(some) programmers are fond of names like
x1, x2, x3, y1, ..., z1, ...
some of which are liable to hash to the same index.
Such an occurrence is called a *collision*.
Collisions are in fact inevitable because the key type is infinite
and cannot otherwise be squeezed into a finite index range.
The solution is to hash the key and examine
the table at that index.
If a different key is already at that position
then a collision has occurred.
A *probing* strategy then looks at a sequence of positions
until a blank is found.

Initial Hash Table: ----- carol ----- bill anne ----- ----- insert("dan") ------>-| |collision, hash("dan")=hash("bill") | | | |probe | | |probe v v v ----- carol ----- bill anne ----- ----- Final Hash table: ----- carol ----- bill anne dan -----

**Collisions and Linear Probing.**

Linear probing is the simplest probing strategy. It looks at successive elements, modulo the table size, until a blank is found. The new element is inserted there.

The same probing strategy is used in the search.

Entries may cluster together in *blocks* as a result of collisions
as the table fills up.
Linear probing amounts to using a linear search on the collision blocks.
Its average time complexity is linear in the size of the average block.
A good hash function will minimise the size of blocks.

Quadratic probing is a more sophisticated probing strategy.

The step-size increases *linearly* so the
indices generated,
i, i+1, i+3, i+6, ...,
are described by a quadratic in the step number
modulo the table size.
Quadratic probing is not guaranteed to probe every location in the table -
an insert could fail while there is still an empty location
however hash tables are rarely allowed to get full - see later.
The same probing strategy is used in the associated search routine.

The virtue of quadratic probing is that the probe locations for two near-miss keys, as opposed to two colliding keys, are not closely related and this reduces the likelihood of large collision blocks forming. With linear probing on the other hand, collision blocks following from similar indices tend to join together slowing the search process. For example, if one set of keys all hash to `6' and another set hash to `7' then the collision blocks of these sets will over-run and merge with each other under linear-probing. With quadratic probing on the other hand, the probes after 6 are at positions 7, 9, 12, 16, ... and the probes after 7 are at positions 8, 10, 13, 17, .... This tends to give small disjoint blocks.

Simulations can give an estimate of the performance of a hash table.
The time taken by insertion and lookup
is dependent on the number of collisions.
The results given here are based on randomly generated keys
with a uniform distribution.
This is the best possible case;
*typical* results may therefore be somewhat *worse*.

fill collisions/look 10% 0.04 20% 0.09 30% 0.15 40% 0.27 50% 0.39 60% 0.68 70% 0.92 80% 1.72 90% 2.29 100% 4.92

**Average Collisions per Lookup (Linear Probing)**fill collisions/look 10% 0.04 20% 0.10 30% 0.15 40% 0.25 50% 0.38 60% 0.53 70% 0.67 80% 1.09 90% 1.40

**Average Collisions per Lookup (Quadratic Probing)**

If the number of entries in the table is liable to grow, a common strategy is to increase the table size when the number of entries reaches a certain percentage of the table's current capacity or when the collision rate reaches some critical level. Elements may have to be re-hashed when moved into the enlarged table. The table may also be shrunk if the number of entries falls below some limit.

Another way of dealing with collisions is known as *chaining*.
Here, each hash table entry contains a pointer to a list of colliding
elements.
The lists are created in a collection.
Deletion is more easily implemented under chaining.
A linear search is used within a list.
Provided the hash function is good, each list
is kept short and the operations are fast.
See the chapter on lists for the necessary techniques.

The hash function randomises the order in which elements are stored in the hash table. This makes it difficult to process the elements sequentially and may rule against the use of a hash table in some applications.

#### Notes

- See Rabin's [String Searching] algorithm for an application of hashing.

### Tables by Tree Structures

Various tree-based data-structures can be used to implement tables efficiently, e.g.

- [AVL-trees]
- [Tries]
- [PATRICIA trees]

### Exercises

- A table is implemented by an array of 1000 elements;
initially there are 600 members.
What are the approximate relative efficiencies of using
- linear search and an unsorted array
- binary search and a sorted array
- hashing

- 1000 insertions, 1000 deletions and 1000 lookups (interleaved)
- 100 insertions, 100 deletions and 2800 lookups (interleaved)

- Implement a delete operator for a hash table
under linear-probing.
Think carefully about what happens if x and y collide,
and the sequence
insert(x,T); insert(y,T); delete(x,T); lookup(y,T)

**Long:**Implement a small data base of telephone numbers. The key type is a name, assumed unique. The associated data is a telephone number of up to 10 digits.- The data base commands are:
- I name number -- insert a new entry
- D name -- delete an entry
- P name -- print the number associated with name

**class**or**module**(depending on your programming language) which uses a hash-table internally and exports the table operations.

(ii) Implement an alternative table class/module which uses a sorted array and binary-search internally and exports the table operations.

(iii) Implement a command processor which makes use of a table class/module. Ideally each part should be carried out by a different programmer. Consider the issue of errors, such as trying to print a missing entry, and fully specify the module interface*before*starting to write code. Test the components in isolation*before*putting them together into a complete system.