Decision Trees / Classification Trees
A Decision Tree, more properly a classification tree, is used to learn a classification function which predicts the value of a dependent attribute (variable) given the values of the independent (input) attributes (variables). This solves a problem known as supervised classification because the dependent attribute and the number of classes (values) that it may have are given.
We could try to learn a complex tree that best fits the training data. However, such trees do not generalise well, i.e. they over-fit the training data; they are fitting noise. By calculating a message length for a tree, we will provide a natural stopping criterion that can trade off the complexity of a tree against its fit to the training data. It may be that the data only justifies a tree with a simple structure.
Binary Attributes
First consider the case of decision trees over binary attributes (true/false, yes/no, head/tail ...), e.g. my car won't work:
          @1         @2           @3            @4
          petrol     headlights   engine        electrical
          in tank?   work?        turns over?   problem?
 case 1:     y          y            n             y
 case 2:     n          y            y             n
 case 3:     y          n            n             n
 ...        ...        ...          ...           etc.

— Cases —
A number of cases or training examples may have been produced from historical records and/or by quizzing a (human) expert.
A decision tree, more properly a classification tree, can predict one attribute (variable) from the other attributes.
[@1]
  y: [@2]
       y: [@3]
            y: [2:6]
            n: [6:4]
       n: [8:2]
  n: [3:17]

— Decision Tree —
The tree consists of fork nodes, each labelled with an attribute number (to test), and leaf nodes representing sets of cases, i.e. 2-state distributions over the dependent variable (here @4).
To calculate the message length of a decision tree, we must encode (i) its structure, (ii) the test (fork) attributes, and (iii) the leaf-node distributions.
      F
     / \
    F   L
   / \
  F   L
 / \
L   L

— Structure —
Structure
Label each node of the tree (above) 'F' for fork or 'L' for leaf:
Write out the prefix traversal of the tree:
F F F L L L L
Note that there is exactly one more 'L' than 'F' in the string.
Consider all strings of 'F's and 'L's that contain exactly one
more 'L' than 'F', and where no prefix of the string has this property.
Each such string corresponds to exactly one binary tree and vice versa.
We see that setting P(F)=P(L)=0.5, and
encoding 'F' by a '1' and 'L' by a '0'
gives an efficient code for the structure of the tree.
This is the Wallace tree code, already introduced as a code for
[integers].
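As a concrete illustration (a sketch only, not code from the sources in the notes; the names Fork, Leaf, prefix_string, etc. are invented here), the following Python writes out the F/L prefix string of a binary tree, checks the "one more 'L' than 'F'" property, and costs the structure at one bit per node:

class Leaf:
    pass

class Fork:
    def __init__(self, left, right):
        self.left, self.right = left, right

def prefix_string(tree):
    # Prefix traversal: 'F' for a fork, 'L' for a leaf.
    if isinstance(tree, Leaf):
        return "L"
    return "F" + prefix_string(tree.left) + prefix_string(tree.right)

def is_tree_code(s):
    # True iff s has one more 'L' than 'F' and no proper prefix does.
    excess = 0
    for i, c in enumerate(s):
        excess += 1 if c == 'L' else -1
        if excess == 1:
            return i == len(s) - 1
    return False

def structure_bits(tree):
    # With P(F) = P(L) = 1/2, the structure costs one bit per node.
    return len(prefix_string(tree))

# The example tree: forks on @1, @2, @3 and four leaves.
example = Fork(Fork(Fork(Leaf(), Leaf()), Leaf()), Leaf())
print(prefix_string(example))                # FFFLLLL
print(is_tree_code(prefix_string(example)))  # True
print(structure_bits(example))               # 7 bits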
Attributes
Given the tree's structure, any one of the 'n' available input attributes can be encoded in log2(n) bits. However, once a discrete attribute has been selected, it will not be selected again in a sub-tree of that node, so the effective value of n reduces by one in the sub-trees, e.g. log2(3) bits for the root fork, log2(2) = 1 bit for the next level, and log2(1) = 0 bits for the level below that.
The tree description now becomes
F @1 F @2 F @3 L L L L
Note that coding @1 takes log2(3) bits, @2 takes 1 bit, and @3 takes zero bits.
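A small sketch of this attribute cost (invented names and tree representation, not code from the sources): each fork pays log2 of the number of attributes still available, and the chosen attribute is removed for its sub-trees.

from math import log2

def attribute_bits(tree, available):
    # tree is ('fork', attr, left, right) or ('leaf',);
    # 'available' is the set of input attributes still usable at this node.
    if tree[0] == 'leaf':
        return 0.0
    _, attr, left, right = tree
    bits = log2(len(available))        # name one of the remaining attributes
    remaining = available - {attr}     # attr is not re-used below this fork
    return bits + attribute_bits(left, remaining) + attribute_bits(right, remaining)

# The example tree, forks on @1, @2, @3:
tree = ('fork', 1,
        ('fork', 2,
         ('fork', 3, ('leaf',), ('leaf',)),
         ('leaf',)),
        ('leaf',))
print(attribute_bits(tree, {1, 2, 3}))   # log2(3) + 1 + 0 ≈ 2.585 bits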
Leaves
Recall that the MML estimator for a [2-state] distribution is

  py = (#y + 1/2) / (#cases + 1)
  pn = (#n + 1/2) / (#cases + 1) = 1 - py
The "input" attribute (here @1, @2, @3) values are common knowledge – known to transmitter and receiver. Transmitter and receiver can both identify the leaf to which a case belongs by carrying out the tests specified in the fork nodes.
The value of the dependent attribute (here @4)
must be coded efficiently for each case,
using the decision tree.
This requires good estimates of
p = P(@4="y"),
q = (1-p) = P(@4="n"),
in each leaf.
Each leaf node represents a
[2-state]
probability distribution, inferred from the training data.
We know how to estimate p and how to calculate
the message length (complexity) of such a distribution,
i.e. how accurately to
state P(@4="y") in each leaf.
Done.
F@1
  F@2
    F@3
      L 2state[2:6]
      L 2state[6:4]
    L 2state[8:2]
  L 2state[3:17]
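A sketch of the leaf calculation (function names invented; the cost of stating each leaf's distribution itself follows the [2-state] calculation referred to above and is not reproduced here): the MML estimate for a leaf, and the bits needed to transmit the dependent values of the cases that fall in it.

from math import log2

def leaf_estimate(n_y, n_n):
    # MML estimator: p_y = (#y + 1/2) / (#cases + 1).
    p_y = (n_y + 0.5) / (n_y + n_n + 1)
    return p_y, 1.0 - p_y

def leaf_data_bits(n_y, n_n):
    # Bits to transmit #y 'y's and #n 'n's using the MML estimate.
    p_y, p_n = leaf_estimate(n_y, n_n)
    return -(n_y * log2(p_y) + n_n * log2(p_n))

# The four leaves of the example tree:
for n_y, n_n in [(2, 6), (6, 4), (8, 2), (3, 17)]:
    p_y, _ = leaf_estimate(n_y, n_n)
    print(f"[{n_y}:{n_n}]  p_y = {p_y:.3f},  data = {leaf_data_bits(n_y, n_n):.1f} bits")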
Tree Message Length
The message length of a complete tree is the sum of the message lengths of (i) the structure, (ii) the fork attributes, (iii) the leaf distributions.
This lets us compute the length of a two-part message:
msgLen(tree) + msgLen(dependent attribute | tree)
This gives a criterion to optimise when searching for the best decision tree.
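Combining the sketches above (same simplifications and invented names; the cost of stating each leaf's distribution, from the [2-state] calculation, would also be added in a full account), the two-part message length of the example tree can be totalled as follows:

from math import log2

def msg_len(tree, n_avail):
    # Structure: 1 bit per node; forks also name an attribute; leaves also
    # transmit their cases' dependent values using the MML estimate.
    if tree[0] == 'leaf':                   # ('leaf', n_y, n_n)
        _, n_y, n_n = tree
        p_y = (n_y + 0.5) / (n_y + n_n + 1)
        return 1.0 - (n_y * log2(p_y) + n_n * log2(1 - p_y))
    _, attr, left, right = tree             # ('fork', attr, left, right)
    return (1.0 + log2(n_avail)
            + msg_len(left, n_avail - 1) + msg_len(right, n_avail - 1))

# The example tree with its leaf counts:
tree = ('fork', 1,
        ('fork', 2,
         ('fork', 3, ('leaf', 2, 6), ('leaf', 6, 4)),
         ('leaf', 8, 2)),
        ('leaf', 3, 17))
print(f"total ≈ {msg_len(tree, 3):.1f} bits")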
Code Efficiency
It might be objected that the code for the structure of the tree allows arbitrarily many trees to be transmitted, but only finitely many trees are possible for a given number of attributes. It is possible in principle to renormalise the distribution (that corresponds to the code), but a moment's thought shows that the amount of probability "lost" is tiny and quite insignificant.
m-ary Attributes
If all forks in a tree have 'm' sub-trees, then
#leaves = #forks × (m-1) + 1.
Asymptotically, the fraction of nodes that are forks is 1/m, and
the fraction of nodes that are leaves is (m-1)/m.
Therefore, one would code a fork, F, in log2(m) bits
and a leaf, L, in log2(m/(m-1)) bits.
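A quick numeric check of those code lengths (illustration only):

from math import log2

for m in (2, 3, 4):
    print(f"m = {m}:  fork {log2(m):.3f} bits,  leaf {log2(m / (m - 1)):.3f} bits")
# m = 2 recovers the binary case: 1 bit each for 'F' and 'L'.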
Mixed Arity
Wallace and Patrick (1993) use the arity of the parent of a node to predict the probability that the node is a fork or a leaf. Note that the receiver knows the arity of the parent node if the parent's switching attribute is transmitted before the child trees are transmitted.
Continuous Valued Attributes
In the case of an input attribute, c, being continuous valued, one of its values from the training data is used as a cut-value.
   [@c < cut?]
    /       \
   <         >=
Sort the values of the attribute in the training data, e.g.
v1 <= v2 <= v3 <= ... <= v15
We need to pick one of these values as the cut value. The training data will tell us which value gives the best fit to the dependent attribute values, but we still need a prior over c's values.
A uniform prior over the index numbers (here 1..15) is one possibility. However, it is more appealing to bias the choice towards the median, then the quartiles, and so on. Note that 1/2 + 2·(1/8) + 4·(1/32) + 8·(1/128) + ... = 1.
 position:      1        2        3        4        5        6        7
                      quartile          median          quartile
 Pr:          1/32     1/8      1/32     1/2      1/32     1/8      1/32
 MsgLen:        5       3         5       1         5       3         5     bits
Renormalise for finite numbers of training values.
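A sketch of this prior (invented names): give 1/2 to the median of the candidate values, 1/8 to each quartile, 1/32 to each value at the next level, and so on, then renormalise for the finite number of candidates (so the message lengths come out slightly below the un-normalised 1, 3, 5, ... bits of the figure above).

from math import log2

def cut_prior(n):
    # Recursively give the median of each remaining range the next weight
    # in the series 1/2, 1/8, 1/32, ..., then renormalise over the n values.
    weights = [0.0] * n

    def assign(lo, hi, depth):                   # half-open range [lo, hi)
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        weights[mid] = 2.0 ** -(2 * depth - 1)   # 1/2, 1/8, 1/32, ...
        assign(lo, mid, depth + 1)
        assign(mid + 1, hi, depth + 1)

    assign(0, n, 1)
    total = sum(weights)
    return [w / total for w in weights]

# Seven candidate cut values, as in the small figure above:
for i, p in enumerate(cut_prior(7), start=1):
    print(f"value {i}:  Pr = {p:.4f},  msgLen = {-log2(p):.2f} bits")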
Search
The MML criterion gives an objective function to optimise when searching for the best tree, but it does not tell us how to carry out the search. If the number of attributes is small, it may be possible to enumerate all trees, evaluate each one, and choose the best, but this is infeasible in general.
Greedy Search
Given a "current" decision tree, it could be expanded by replacing a leaf node by a fork node on some available attribute. The pay-off, in terms of the change in total message length, can be evaluated for each such possibility and the best one chosen. The process starts with a tree consisting of one leaf node. It stops when no further split leads to an improvement.
This greedy strategy is reasonably efficient, but can get trapped in local optima.
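A sketch of such a greedy grower for binary attributes (invented names, and the same simplified message-length pieces as earlier; not the Wallace-Patrick algorithm itself): a leaf is replaced by a fork only when the split reduces the total message length.

from math import log2

def leaf_bits(cases, target):
    # 1 bit for 'L' plus the dependent values coded with the MML estimate.
    n_y = sum(1 for c in cases if c[target] == 'y')
    n_n = len(cases) - n_y
    p_y = (n_y + 0.5) / (len(cases) + 1)
    return 1.0 - (n_y * log2(p_y) + n_n * log2(1 - p_y))

def split_bits(cases, attr, target, n_avail):
    # 1 bit for 'F', log2(n_avail) to name the attribute, plus two new leaves.
    yes = [c for c in cases if c[attr] == 'y']
    no = [c for c in cases if c[attr] == 'n']
    if not yes or not no:
        return float('inf')                  # degenerate split, never chosen
    return 1.0 + log2(n_avail) + leaf_bits(yes, target) + leaf_bits(no, target)

def greedy_grow(cases, attrs, target):
    # Returns ('leaf', cases) or ('fork', attr, yes_subtree, no_subtree).
    current = leaf_bits(cases, target)
    best = min(((split_bits(cases, a, target, len(attrs)), a) for a in attrs),
               default=(float('inf'), None))
    if best[0] >= current:                   # no split pays for itself
        return ('leaf', cases)
    attr = best[1]
    rest = [a for a in attrs if a != attr]
    yes = [c for c in cases if c[attr] == 'y']
    no = [c for c in cases if c[attr] == 'n']
    return ('fork', attr,
            greedy_grow(yes, rest, target), greedy_grow(no, rest, target))

# The three "my car won't work" cases from the start of the notes; with so
# little data no split pays for itself and a single leaf is returned:
cases = [{'@1': 'y', '@2': 'y', '@3': 'n', '@4': 'y'},
         {'@1': 'n', '@2': 'y', '@3': 'y', '@4': 'n'},
         {'@1': 'y', '@2': 'n', '@3': 'n', '@4': 'n'}]
print(greedy_grow(cases, ['@1', '@2', '@3'], '@4'))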
Lookahead
The greedy search can be improved in terms of results (and slowed down) by looking ahead 'k' levels of possible splits. Such an algorithm is less likely to get stuck in local optima. Running time increases rapidly with 'k'.
Notes
- C. S. Wallace & J. D. Patrick, Coding Decision Trees, Machine Learning, 11, pp.7-22, 1993.
  Introduced the tree code and message-length calculations for decision trees; tests demonstrate good generalisation and resistance to over-fitting.
- J. W. Comley, L. Allison & L. J. Fitzgibbon, Flexible Decision Trees in a General Data-Mining Environment, IDEAL 2003.
- L. Allison, Models for machine learning and data mining in functional programming, J. Functional Programming, 15(1), pp.15-32, January 2005.
  Includes the source code of a generalised functional-programming algorithm to learn MML classification- (decision-) and regression-trees over arbitrary attribute types.
- L. Allison, Function-Models 2, Ch.8 in Coding Ockham's Razor, Springer 2018 [doi:10.1007/978-3-319-76433-7_8].