Binary Trees
Trees are natural structures for representing certain kinds of hierarchical
data.
A (rooted) tree consists of a set of nodes (or vertices)
and a set of arcs (or edges).
Each arc links a parent node to one of the parent's children.
A special root node has no parent.
Every other node has exactly one parent.
It is possible to reach any node by following a
unique path of arcs from the root.
If arcs are considered bidirectional,
there is a unique path between any two nodes.
The simplest kind of tree is a binary tree
where each parent has at most two children.
(We consider unrooted trees elsewhere, e.g. in
[minimum spanning trees].)
A way to think of a binary tree is that it is either empty (emptyTree) or it is a fork which contains an element and two subtrees which are themselves binary trees.
Types: type tree e = emptyTree | fork e × (tree e) × (tree e) Operations: empty: tree e -> boolean leaf : tree e -> boolean fork : e × tree e × tree e -> tree e left : tree e -> tree e right: tree e -> tree e contents: tree e -> e Rules: empty(emptyTree) = true empty(fork(x, T, T'))= false leaf(fork(x, emptyTree, emptyTree)) = true leaf(fork(x, T, T')) = false, if not empty(T) or not empty(T') leaf(emptyTree) = error left(fork(x, T, T')) = T left(emptyTree) = error right(fork(x, T, T')) = T' right(emptyTree) = error contents(fork(x, T, T')) = x contents(emptyTree) = error
- The Tree Abstract Data Type.
(It is also common to use curried versions of constructors and functions, such as fork, especially in functional languages.)
Many operations can be defined on trees. For example:
height(emptyTree) = 0 | height(fork(e,T,T')) = 1+max(height(T), height(T')) weight(emptyTree) = 0 | weight(fork(e,T,T')) = 1+weight(T)+weight(T')
Note that these functions are binary-recursive as is the definition of the tree type. (Also note that a tree consisting of a single leaf is sometimes defined to be of height zero, rather than 1, as here.)
Trees have many uses in computing. For example a parse-tree can represent the structure of an expression:
input: a+b*c ----------> + . . . . a * . . . . b c
Multiplication has a higher priority then addition and binds more tightly. This tree shows that a+b*c is interpreted as a+(b*c) rather than as (a+b)*c. Such trees are used in compilers and other programs.
Implementation of Binary Trees by Pointers and Records
A tree data type can be implemented as a collection of records and pointers.
-
- [C/Tree/Tree.h]
- [C/Tree/TreeElement.h]
- [C/Tree/Tree.h]
The basic operations can create new records and manipulate pointers.
- [C/Tree/Ops.c]
Not surprisingly this tree implementation is similar to the implementation of hierarchical lists.
An output routine for trees is a virtual necessity. A tree can be printed in a style like that used for lists but a graphical two-dimensional layout is more informative. It is difficult to print trees down the page, because they quickly grow too wide, but it is relatively easy to print them across the page so that the root is at the left and the tree grows to the right.
- [C/Tree/Write.c]
For an example of tree output, see later sections.
Example: Parse Trees
Parsing an input language according to a grammar is frequently necessary in computing. Here we consider a language of simple arithmetic expressions. A grammar written in Backus Naur form (BNF) is given below.
<exp> ::= <exp> + <term> | <term> <term ::= <term> * <operand> | <operand> <operand> ::= <identifier> | ( <exp> ) <identifier> ::= <letter> <identifier> | <letter> <letter> ::= "a" | "b" | ... | "z"
- Grammar for Simple Expression.
The grammar can be read as a definition of the syntactic structure of expressions, <exp>. The symbol `::=' can be read as `is' or `can be replaced by'. The symbol `|' can be read as `or'. The names in angle brackets, such as '<exp>', are variables for parts of the language. The string <exp>+1 is not a legal expression but it stands for many legal expressions such as x+1, y+1 and (x+y*z)+1. The first line of the grammar can be read as: an <exp> is an <exp> plus <term> or a <term>. An expression is either an expression plus a term or just a term. Note that the syntax rules are recursive. A little thought shows that an expression is a sequence of terms separated by plus signs. Similarly, a term is a sequence of operands separated by multiplication signs. An operand is either an identifier or a bracketed subexpression. An identifier is a sequence of letters.
A parser can be written using the recursive-descent technique. A procedure is written to recognise each class of object in the grammar - exp, term, operand. These routines are recursive, as is the grammar, and call each other to implement the grammar rules. For example, the routine for expression calls the routine for term. The repetitive nature of expressions and terms is coded by the use of a loops. A bracketed subexpression is inherently recursive and so is the parser at that point. The complete parser is given below. It is moderately long but not complex, especially if read with the grammar in mind. A lexical routine, insymbol, skips white space and packages input letters into identifiers and other kinds of symbol. It is followed by the parser proper consisting of the routines Operand, Term and Exp.
- [C/Tree/Parser.h]
- [C/Tree/Parser.c]
- [Java/Exp/]
- [SML97/Exp/]
- [C/Tree/Parser.c]
Note the use of mutual recursion. Various errors may be detected during parsing. If the expression is syntactically correct, the tree representing its structure is finally printed.
- input: a + b*(c+d)
Parse Tree: | | |d---------| | |+---------| | | |c---------| |*---------| | |b---------| +---------| |a---------|
Recursive Tree Traversal
There are three classic ways of recursively traversing a tree or of visiting every one of its nodes once. In each of these, the left and right subtrees are visited recursively and the distinguishing feature is when the element in the root is visited or processed.
In a preorder or prefix traversal the root is visited first (pre) and then the left and right subtrees are traversed.
- [C/Tree/Prefix.c]
In an infix traversal, the left subtree is traversed and then the root is visited and finally the right subtree is traversed.
- [C/Tree/Infix.c]
In a postorder or postfix traversal the left and right subtrees are traversed and then the root is visited afterwards (post).
- [C/Tree/Postfix.c]
This method can be used to generate postfix or reverse Polish code for a stack machine.
Given the following tree,
formed by starting with the empty binary search tree
and inserting, in turn, the words,
|type------| the-------| | |rover-----| | | |of--------| | | | |motor-----| |land------| | |is--------| | | | |car-------| | | |a---------|
- Example Tree.
the three traversals give the following results. The results of yet another method, breadth-first traversal, are included for comparison; see the next section.
- infix:
a car is land motor of rover the type
- prefix: the land is a car rover of motor type
- postfix: car a is motor of rover land type the
- b-first: the land type is rover a of car motor
- Traversals.
- prefix: the land is a car rover of motor type
Note that the method given for printing a tree is a reversed infix traversal.
Breadth-First Traversal
A breadth-first traversal of a tree starts at the root of the tree. It next visits the children, then the grand-children and so on.
1 . . . . 2 3 . . . . . . . . 4 5 6 7 . . . . . . . . . . 8 9 10 11 12 . . . . . . 13 14 15
- Example: Breadth-First Order.
The numbers indicate the order in which the nodes are visited, not the contents of the nodes. Because children are only accessible from a parent, they must be stored while the parent's siblings and cousins are visited. A [queue] is used to do this.
- [C/Queue/Queue.h]
- [C/Tree/QueueElement.h]
- [C/Tree/BFirst.c]
- [C/Tree/QueueElement.h]
Note that the queue is a queue of pointer to node or a queue of tree type. Initially the queue contains just the root of the tree. At each iteration of the algorithm, the first element is removed from the queue. Its children, if any, are pushed onto the end of the queue and the element is processed. The algorithm terminates when the queue is empty. See also the chapter on queues.
Recursion
Most routines on trees are recursive. This is natural because the tree is a recursive data type. It is possible to write iterative versions of these operations but it is harder to do so than is the case for flat lists because the tree type is binary recursive. The flat list and hence most of its operations are linear recursive and a linear recursive routine usually has a simple iterative version. It is often necessary to introduce an explicit stack into a program when writing a non-recursive tree routine. This is often not worth the effort as the language implementors can usually do a better job with the system stack.
The object-oriented programming language
Java
has the notion of an Enumeration
,
i.e. an iterator, which steps through the elements of a set of values.
An enumerator must implement the predicate hasMoreElements
and the function nextElement
.
To program an enumerator which will emit the contents of a tree
in one of the standard orders it is useful to a employ
a Stack
:
- [Java/Tree/egPrefixEnum.txt]
Side-Effects
The tree operations described so far have no side-effects except for input, output and manipulation of dynamic storage; they are pure tree operations. As is the case with lists, it is often necessary or desirable to use operations having side-effects on efficiency grounds. This is particularly natural if a program uses a single tree as a dictionary or database structure. As before, should multiple trees share components, changing one tree may change another and if this is not anticipated it will cause program bugs.
Search Trees
Demonstration: There is a demonstration of Binary Search (& AVL) Trees [here]. |
A binary search tree can be created so that the elements in it satisfy an ordering property. This allows elements to be searched for quickly. All of the elements in the left subtree are less than the element at the root which is less than all of the elements in the right subtree and this property applies recursively to all the subtrees. The great advantage of this is that when searching for an element, a comparison with the root will either find the element or indicate which one subtree to search. The ordering is an invariant property of the search tree. All routines that operate on the tree can make use of it provided that they also keep it holding true.
It takes O(h) time to search a search tree of height h. Since a tree of height `h' can hold n=2h-1 elements, the search takes O(log(n)) time under favourable circumstances. The search-tree and its search routine should be compared with the use of the binary search algorithm on sorted arrays in the chapter on tables.
The use of the tree speeds up the insertion and deletion operations at the price of the space needed to hold the pointers. The tree has the speed advantage when the data in the structure changes rapidly.
The routine given here to insert an element does so as a side-effect by changing the tree.
- [C/Tree/Insert.c]
If elements are inserted in the order d, b, a, c, e, f, g the following tree is created:
----------->d . . . . . . b e . . . . . . a c f . . g
Note that an insertion takes O(h) time. Under favourable circumstances, a balanced tree is created, as for b, a, c, giving O(log(n)) search time. If the input is sorted however an unbalanced tree approximating a list is created, as for e, f, g, and the search degenerates to an O(n) linear search. This problem is addressed later.
- input: the
land
rover
is a type of motor car
Search Tree: |type------| the-------| | |rover-----| | | |of--------| | | | |motor-----| |land------| | |is--------| | | | |car-------| | | |a---------|
- Example Search Tree.
A new element is added to the tree as a new peripheral, leaf node. However if an element can also be deleted it is possible for it to be internal. This makes deletion rather more difficult than insertion.
A leaf element is easily deleted by setting the pointer to it to emptyTree.
T:-----| T:emptyTree | | v x before after
- Deletion of a Leaf.
The node becomes garbage and can be freed.
An element with one child can be deleted by by-passing it.
T:------| T:| | | | | v | x | . emptyTree | . v e e . . . . . . . . before after
- Deletion of a Single-Child Parent.
An internal element x with two children cannot easily be bypassed without loosing one of its subtrees. The solution is to overwrite x with some other element y of the tree and then to delete the original copy of y. There are two obvious elements to choose from - either the largest element `A' less than x or the smallest element `B' greater than x. Each of these has at most one child! The sortedness of the tree is maintained if x is overwritten with either of these.
T:----------| T:-----| | | | | v v x B . . . . . . . . L R L R . . . . . . . . . . . . . . . . A B A C . . . . . . . . . . C . . . . before after
- Deletion of a Two-Child Parent.
Both of A and B fall into one of the cases previously dealt with. A can be found from x by taking one step left and then as many steps right as possible. B can be found by taking one step right and then as many steps left as possible.
Deletion takes O(h) time.
Notes
- [Tries] and [PATRICIA trees] are an alternatives to binary search trees.
- The [Suffix Tree] is a data-structure for solving the string searching problem.
Height-Balanced Trees
Demonstration: There is a demonstration of AVL (& bin. s.) Trees [here]. |
As mentioned above, if elements are inserted into a search tree in sorted order, a tree is created that is equivalent to a list. This will lead to inefficient searches. A height-balanced tree or an AVL-tree (after G. M. Adel'son-Velskii and E. M. Landis) is a search tree in which the height of the right subtree minus the height of the left subtree equals 1, 0, or -1. This property also applies recursively to all subtrees. It can be shown that a height-balanced tree of `n' elements has height O(log(n)) and this guarantees efficient search. Fortunately fast O(log(n)) insertion and deletion is still possible.
A flag indicating the balance of each subtree is added to the node record.
There are four crucial cases during insertion. In the first case, the left subtree L grows too tall on its left:
T L . . . . . . . . L . -----> . T . . | | . . . . | | . . . . | | . . | | * | | | | * | | | | * | | * new elt' rebalanced * disturbs * balance
- Left-2 Rotation.
By rotating about L and T the height balance is restored and the ordering in the tree is maintained. In the second case, L grows too tall on its right:
T LR . . . . . . . . L . L T . . | . . . . . . | ----> . . . . . LR | . . . . | .. | | | | | | . . | | | | | | . . | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * | | | | | * | | | | | * | * * *
- Left-3 Rotation.
The rotation restores the balance while maintaining the tree ordering. In the above example left(LR) grew; an alternative is that right(LR) grew but the same operations still restore the balance. There are mirror-image right-2 and right-3 rotations.
Maintaining the balance significantly complicates the tree insertion routine. However a fixed amount of work is done at each node that is visited and so it takes O(h) time.
Note that if a tree has just grown in height, it can only be perfectly balanced if it is a single (new) leaf. Otherwise one subtree must be higher than the other.
- input: the
land
rover
is a type of motor car
| |type------| |the-------| rover-----| | | |of--------| | |motor-----| | | |land------| |is--------| | | |car-------| | |a---------|
- Height-Balanced Tree.
Compare this with the tree given earlier and created by the non-balancing insert.
Notes
- G. M. Adelson-Velskii and E. M. Landis [AVL]. An algorithm for the organization of information. Soviet Math. Dokl., 3, pp.1259-1263, 1962.
- N. Wirth.
Algorithms and Data Structures.
pp.218, Prentice-Hall, 1986.
AVL-tree insertion and deletion. - M. A. Weiss. Data Structures and Algorithm Analysis in C. s4.4, pp.110, Addison Wesley, 1997.
Implementation of Binary Trees by Arrays
A binary tree can be implemented as an array of records.
type TreeT :0..nmax var Tree: array 1..nmax of record elt :Element_Type left, right :TreeT end record
The empty tree is represented by zero. Left and right are indexes to left and right subtrees.
This implementation has no advantage in a language supporting dynamic storage unless random access to nodes is also needed. It is useful in a language without dynamic storage.
Full Trees by Arrays
It is possible to define a full or weight balanced tree in contrast to a height balanced tree. The empty tree is a full tree of zero levels. If T and T' are full binary trees of n levels then fork(e,T,T') is a full binary tree of n+1 levels.
1 . . . . . . . . 2 3 . . . . . . . . 4 5 6 7
- Full Binary Tree of 3 Levels.
The numbering of the nodes corresponds to a breadth-first traversal. This suggests that such a tree can be stored in an array:
const n = 2**Levels - 1 array 1..n of Element_Type
type tree = 0..n empty(T) = T=0 left(T) = 2*T right(T) = 2*T+1 parent(T) = T div 2 leaf(T) = T >= 2**(Levels-1)
Such an implementation is very efficient indeed,
if the tree is full, because
no space at all is used for pointers.
See the chapter on sorting for an application of this technique in
Testing and Debugging
Programming techniques for trees share a great deal in common with those for lists. Pre and post conditions and assertions should be included where possible. Minimise the use of side-effects in general and the manipulation of global variables in particular.
Note that there are really just two kinds of tree - emptyTree and non-emptyTree. Most operations on a tree follow this pattern:
f(emptyTree) = ... | f(fork(e,T,T')) = ...often f(T)...f(T')...
If a case is missing it probably indicates an error. The empty case is usually very simple, often returning a constant. The main case often operates on one or both subtrees recursively.
When testing or debugging a tree routine, test cases should cover the above options. For non-emptyTree trees, it may be appropriate to try a tree of a "few" nodes and a tree of a single node. As usual it is invaluable to write the output routine first. The most common problems are: . unassigned pointers, particularly those that should be emptyTree, . lost pointer values through a wrong ordering of operations, . knotted pointers maybe introducing a cycle, . side-effects through shared pointers, intentional or otherwise.
The coding of reasonably complex routines such as those on AVL trees is prone to small typographical errors. Nothing can beat careful proof reading but one good testing trick is to use a set of ascending data and then its reversal in which case the mirror image tree should be produced. This test is not sufficient on its own! It is important to exercise all the paths through a complex routine and this requires a great deal of thought.
Exercises
- Classify tree operations (height, weight, min_element, max_element, flatten, insert, lookup, delete, pre,- in- and post-order traversal) on unordered binary trees and on search trees as either linear or binary recursive. Note that a linear-recursive operation may contain two or more calls on itself but only ever execute at most one of them under the control of an if statement. Write iterative versions of one or more of the linear operations.
- Add a full set of operators +, -, *, / to the expression parser.
- Write a routine to traverse the expression parse tree to generate reverse Polish code. eg. a*b+c*d ---> a b * c d * +
- Write a routine to do symbolic differentiation.
Input is the parse tree of an arithmetic expression
which may contain numerals, variables {x, y, ...}, + and *.
Output is the tree of the expression
that is the result of differentiation with respect
to a given variable.
- Use the rules:
- diff(x, x) = 1
- diff(y, x) = 0 if y is a constant or a variable different from x
- diff(A+B, x) = diff(A,x)+diff(B,x)
- diff(A*B, x) = A*diff(B,x)+diff(A,x)*B
- Write a routine to free all of the nodes in a binary tree. Count the number of new and free operations and check that they are equal.
- Write a routine to collect only the elements in all the leaf nodes of a binary tree into a list.
- The path of a node in a tree is the sequence of elements contained in nodes from the root, down to and including the node itself. Give a procedure to print the path of every leaf node in a tree.
- A height-balanced tree is not necessarily full or weight-balanced. How un-weight-balanced can one be? ie. What is the minimum possible number `n' of nodes in a height-balanced tree for height h=1, 2, ..., 8. Is there a pattern? How does n compare with the value 2h-1 for a full tree?
- (Hard) Implement a delete operation for height-balanced trees.