Lists
The list is a flexible abstract data type. It is particularly useful when the number of elements to be stored is not known before running a program or can change during running. It is well suited to sequential processing of elements because the next element of a list is readily accessible from the current element. It is less suitable for random access to elements as this requires a slow linear search. Two forms of list are considered, flat lists of elements in this chapter and hierarchical lists in a later chapter.
Types: type list e = nil | cons(e × (list e)) Constant: nil Operations: null : list e -> boolean head : list e -> e tail : list e -> list e cons : e × (list e) -> list e Rules, for all x:e and L:list e : null(nil) = true null(cons(x,L)) = false head(cons(x,L)) = x head(nil) = error tail(cons(x,L)) = L tail(nil) = error Shorthand: [x1, x2, ..., xn] = cons(x1,cons(x2, ...cons(xn,nil)...)) The List Abstract Data Type.
The names of the operations are traditional and are derived from versions of the Lisp programming language.
The definition of list can be read as follows: a list of e is (=) either `nil' or (|) is `cons' applied to an element of type e and a list. Nil represents the empty list. Null tests for the empty list. The head of a list is the first element in the list. The tail of a list is everything but the first element and is a list not an element. Cons constructs a list given an element and a list.
- ( The version of cons above is "uncurried". It is also common to use "curried" versions of functions especially in functional languages...
- cons : e -> list e -> list e
- and the pattern
- cons x L
- and so on -- saves on parentheses.)
Given the basic operations on lists, a number of useful functions can be defined.
length(nil) = 0 | length(cons(e,L)) = 1+length(L)
Note that the length function is recursively defined, as is the list type. While there are infinitely many lists, there are only two cases that length must consider. One case is the empty list and the other is the non-empty list. These two cases correspond to the two cases in the definition of the list type - nil and cons(e,L) - and they are separated by `|' to match the type definition. Most functions on lists have the same general structure as length. If one of the cases is missing it probably means the function is incomplete.
append(nil, L) = L | append(cons(e,L), L') = cons(e, append(L,L'))
Append joins two lists together; for example, append([1,2], [3,4]) = [1,2,3,4]. If the first list is empty the result is the second list, otherwise the result is the first element of the first list followed by the result of appending the tail of the first list to the second list. Note that if the length of the first list is n, append takes O(n) time. It makes a copy of this list which finally points to the second; it does not change the first one. In this way append does not cause any side-effect.
L . A l l i s o n |
L ----> 1--->2nil L := [1,2] L'----------->----------> 3--->4nil L':= [3,4] ^ | L"----> 1--->2--->| after L":=append(L,L') = [1, 2, 3, 4]
Note however that if the contents of L' are now changed, by some means, then so also are the contents of L". This kind of side-effect is a common cause of program bugs.
The merge function produces one sorted list when given two sorted lists.
merge(nil, nil) = nil | merge(nil, L) = L | merge(L, nil) = L | merge(L, L') = % ~null(L) & ~null(L') if head L < head L' then cons(head L , merge(tail L, L')) else {head L >= head L'} cons(head L', merge(L, tail L')) %this version allows duplicates
The original lists are unchanged. Note the similarities with the array and file merge operations.
The map function applies a function f to every element of a list and produces a new list. In this way map f processes a single data structure as a whole.
map(f, nil) = nil | map(f, (cons(e,L))) = cons(f(e), map(f, L) ) eg. map(factorial, [1,2,3,4]) = [1,2,6,24]
Map is known as mapcar in some works on Lisp.
The reduce function inserts a binary function, or operator, between every element in a list. If the list is empty, reduce returns an identity or zero element z.
reduce(f, z, nil) = z | reduce(f, z, cons(e,L)) = f(e, reduce(f, z, L))
Reduce(plus, 0, L) adds up the elements in a list of integers.
eg. reduce(plus, 0, [1,2,3,4]) where plus(a,b)=a+b = 1 + reduce(plus, 0, [2,3,4]) = 1 +(2 + reduce(plus, 0, [3,4])) = 1 +(2 +(3 + reduce(plus, 0, [4]))) = 1 +(2 +(3 +(4 + reduce(plus, 0, nil)))) = 1 +(2 +(3 +(4 + 0))) = 10
Reduce(times, 1, L) multiplies the elements in a list of integers together.
eg. reduce(times, 1, [1,2,3,4]) = 24 where times(a,b)=a*b = 24
These functions for manipulating lists can be combined to give many short but powerful programs. For example, the program
lookup(x, L) = reduce(or, false, map(equals_x, L)) where equals_x(y) = x=y
is a search program and returns true only when x is in the list L. Map(equals_x, L) returns a list of boolean values, at least one value is true only when x is in L. Reduce or's all these values together. Processing lists and other linked data structures is important in functional programming languages such as Haskell, Lisp and ML. This style of combining simple functions on these data structures to achieve complex objectives is a powerful programming technique.
As always, the issue of efficiency must be attended to. For example, there are various ways to reverse a list. The simplest way but not the best is:
reverse(nil) = nil | reverse(cons(e,L)) = append( reverse(L), [e] )
To reverse the empty list do nothing. To reverse a non-empty list, reverse the tail of the list and append the first element of the original list. Given a list of length n, an append operation takes O(n) time on average. There are n calls to append so this reverse takes O(n2) time. Note that the element `e' and the list containing it `[e]' have different types.
A fast list-reversal algorithm takes O(n) time and uses the accumulating parameter technique.
reverse(L) = reverse1(L, nil) where reverse1(nil,L) = L | reverse1(cons(e,L),L') = reverse1(L,cons(e,L'))
The local function reverse1 does the real work. Its first parameter is the input list which shrinks at each step. Its second parameter is the accumulating parameter which grows at each step as the head of the first is attached to it. When the first list has shrunk to nothing, the second contains all the original elements in reverse order and this is the final result. There are n calls to reverse1 each of which does a constant amount of work so the algorithm takes O(n) time overall.
Note that although there are infinitely many possible lists there are really only two types of list - empty ones and non-empty ones. Almost every function that manipulates a list must deal with these two cases.
f(nil) = ..... | f(cons(e,L)) = ....
If one of these cases is missing then there is probably an error. Processing the empty list is usually easy. Many functions return a constant - 0, 1, true, false or nil - in this case. Processing the non-empty list, cons(e,L), varies from application to application but often involves e or f(L) or both.
f(nil) = some constant | -- a very f(cons(e,L)) = ...e...f(L)... -- common pattern
Look back at the various functions defined to identify this pattern in each case.
Implementation of Flat Lists
The most natural way to implement lists is by means of records/structs and pointers. They can also be implemented by arrays if necessary.
Flat Lists by Records and Pointers
The list type is defined as a pointer to a record (structure).
- [C/List/List.h]
The basic operations manipulate the pointers:
- [C/List/Ops.c]
Note that cons is implemented by a procedure in Turing and not by a function because it manipulates a `collection' variable and functions are not allowed to change variables in that language. There is no such problem in Algol68, C, Java or Pascal, say.
The extended operations on lists that were described previously are easily defined.
- [C/List/LengthRec.c]
If the implementation of lists by pointers can be assumed,
it is more efficient, if marginally less clear,
to use in-line code such as
L=nil(List)
in place of null(L) and similarly for the other basic operations.
Beware: do not use `length(L)>0' when `not null(L)' is adequate;
the former takes O(|L|) time, the latter O(1).
- [C/List/Append.c]
The list reversal functions can be coded as follows:
- [C/List/RevS.c]
Some languages such as C do not support nested, local subroutines so the fast, O(|L|) list reversal routine must be coded by "lifting" the nested routines out, as follows:
- [C/List/Rev.c]
If the element type of a list can be printed by a library routine, a list can be printed by repeated calls. There are three cases to consider. The empty list should appear as `[ ]', a list of a single element should appear as `[e]' and a list of two or more elements should have the elements separated by commas `[a,b,...]'. The empty list can be treated as a special case by an if statement. The last two cases can be dealt with by using a loop and placing the printing of the current element before the exit and the printing of the comma after the exit. In this way one fewer commas than elements are printed.
- [C/List/Write.c]
Flat Lists by Arrays
A list can be implemented by an array of records. Rather than using a pointer to indicate the next element, an integer is used to hold the index of the next element. Some convention, such as the use of zero, indicates the null list. This implementation has the disadvantage that an array of the maximum size of the list must be declared even if the list(s) grow and shrink. This method is useful in a language without dynamic storage.
The Sieve of Eratosthenes and Sets by Lists
The Sieve of Eratosthenes algorithm is based on a set of integers. Some programming languages that provide a built in `set' data-type restrict the range of elements of a set, e.g. to 32 or 64, so that they can be implemented by bit-masks and bit-wise operations. However, a set can be implemented by a list of members.
In the sieve, composite numbers are removed from the set until only primes remain. To calculate all primes less than or equal to n, the sieve is initialised to the set 2..n or the list [2, 3, ...,n]. The routine `range' produces the list [lo, lo+1,..., hi].
- [C/List/Range.c]
A filter routine can be written to remove all multiples of a number P from a set. A sieve routine can then be written to remove all composites from a set. Sieve requires the first member of the set to be a prime (2 is prime). It calls filter to remove all multiples of this prime. At this stage the second number in the set must also be prime and it calls itself recursively to sieve the remaining (tail) elements of the set.
- [C/List/Sieve.c]
When the process terminates, only prime numbers remain in the set.
Note that lists are fast for sequential access to elements but slow for random access. Since Eratosthenes' algorithm takes large strides through the set, the above implementation is slower than that given in the introductory chapter.
Recursion and Iteration
Many of the operations on lists and other recursive data types are naturally expressed as recursive routines. Recursive routines have fewer state variables than iterative routines and are frequently easier to prove correct. However recursion does require system-stack space to operate. Iterative versions of many routines, particularly for the simpler operations, are straightforward. Note that we saw an iterative routine to print a list.
- [C/List/LengthItr.c]
This is not the place to discuss the pros and cons of recursion. Suffice it to say that recursion and iteration are both powerful and useful techniques.
Side Effects
The list operations defined so far are free of side-effects except to the extent that they perform input and output or reserve dynamic space. With pointers it is easy to create two or more aliases for one variable or area of storage. Assignment via one alias will affect the other implicitly. This can be very useful indeed on efficiency grounds - when done intentionally. It is also easy to cause this sort of side-effect unintentionally which is a frequent source of program bugs. These bugs can be very hard to track down because the symptoms often show up far from the original cause.
Despite the dire warnings, the use of side-effects can sometimes lead to more efficient programs. In some circumstances it is also relatively easy to be assured that they can be used safely. For example, there may only be two list variables in use and it may be obvious that they cannot share cells.
The cons operation adds an element to the front of a list. (If elements are being added to and deleted from a single list, that can be done, but as a side-effecty.)
These operations can also be used to insert and delete internal cells of a list.
before: P E | | | | v v L --->1---->2---->3----->4---->5---->6nil
Del(List(L).tl) Cons(List(P).tl, 9) Cons(List(E).tl, 7) after: P E | | | | v v L --->1---------->3-->9->4---->5---->6---->7nil
A good example of a side-effect involves list reversal. The previous versions made reversed copies of a list. It is also possible to reverse a list in-situ by changing the pointers in its cells.
The pointer in each of the input list must be made to point to the previous cell rather than the next cell. The pointer in the first cell should become nil. At least three pointers are needed to step along the list: L, Next and Next2.
L:| initial state | | v 1--->2--->3--->4--->5nil
L:| |:Next an intermediate state | | | | v v 1nil<---2 3--->4--->5nil ^ | |Next2
|:L | | v 1nil<---2<---3<---4<---5 final state An Example of Reversal.
The cell pointed to by Next is set to point back to L. This would loose the cell after Next so the third pointer, Next2, remembers it in the algorithm:
- [C/List/RevSE.c]
This routine takes O(n) time and uses O(1) storage. Of course, if the list should be partially shared with another then this will probably cause a major program error that is difficult to find and correct. Note particularly the use of the three list variables, L, Next and Next2 and the order of assignment. At an intermediate stage the list is partially reversed; L points to the start of the reversed section. Next points to the start of the unreversed section. The tail pointer of the next cell is made to point to L. Next2 is necessary to avoid loosing our place in the unreversed list. Finally L is moved to Next and then Next to Next2. The process terminates when there are no more cells to reverse and L points to the entire reversed list.
Garbage
Repeated allocation of dynamic space by new may cause all available storage to be used up. It is often the case that as new dynamic structures are created old structures are discarded or become garbage. If they are not recycled the program may come to a halt prematurely. Some languages provide a system garbage collector to do this recycling as necessary but it is an expensive operation. Instead, the free statement is used to return dynamic storage for reuse. This is of course a special kind of side-effect and unless a language processor carries out checks for aliases, at some cost, its use may cause program bugs unless care is taken. In particular part of a structure must not be freed if it is shared with a second structure - consider append.
Free only returns one unit of dynamic storage. To free a complete structure it must be traversed and each cell freed.
- [C/List/Dispose.c]
Testing and Debugging
If all goes well this section will be unnecessary but the manipulation of pointers is a fruitful area for program bugs. The use of pre- and post-conditions and other assertions which can be tested when the program is being developed, and commented out or turned-off in production, is a good idea.
It is useful to write a list output routine as one of your first list routines. This enables tracing statements to be inserted to print out the values of list variables if and when a program fails.
Routines on lists can be easily tested or debugged in isolation especially if they are written in a functional style, free of side-effects. Recall that almost all routines must contain this pattern:
f(nil) = some constant | f(cons(e,L)) = ...e...f(L)...
It is usually sufficient to test list parameter values for the cases of an empty list, a list of several elements and sometimes also a list of one element. If there are two list parameters this gives up to nine cases. If there are other parameters there may be other cases and combinations. Keeping routines small, simple and free of side-effects and global variables reduces the number of cases and interactions to be considered.
If routines are free of side-effects and are individually correct they can usually be combined with great confidence. It is the presence of side-effects, especially the manipulation of global variables, that causes surprising interactions and bugs.
- The most common errors when using pointers are:
- unterminated lists (should terminate with nil),
- lost or tangled pointers often because pointers assigned in wrong order,
- isuse of shared cells, overwriting, premature disposal.
- lost or tangled pointers often because pointers assigned in wrong order,
It is necessary to assign some value, such as nil, to a pointer. An uninitialised pointer variable contains an undefined value. If it is used and you are lucky the program will stop with an immediate error otherwise it may behave incorrectly but take a long time for this to become evident. A list structure may get linked into a circle. This can be useful if it is deliberate - see the chapter on queues. However, many functions will loop forever when given a circular list. If two lists share cells, changing one list may change the other surprisingly.
Exercises
- Write a routine to sort a list without side-effects.
- Write a routine to sort a list in-situ as a side-effect.
- For each sorting method in the chapter on sorts, determine if it can be applied to lists.
- Write an iterative routine to join one list to the end of another as a side-effect.
- Implementing sets as lists, give set intersection, union, difference and membership routines, without side-effects.
- Implement very large integers as lists of digits in some base. Give routines for integer equality and the basic operations such as addition, subtraction, multiplication and (harder) division.