Decision Graphs

Jon Oliver devised decision graphs (classification graphs) [Oli92, OW92] to remove the difficulty that decision trees (classification trees) have with modelling "disjunctive" functions, for example,

independent variables x1, x2, x3, x4,
dependent variable y,
function to be learned
y = (x1 ∧ x2) ∨ (x3 ∧ x4)

A decision tree can certainly describe such a function,

  x1
  ├─(x1=T)─ x2
  │         ├─(x2=T)─ y=T
  │         └─(x2=F)─ x3
  │                   ├─(x3=T)─ x4
  │                   │         ├─(x4=T)─ y=T
  │                   │         └─(x4=F)─ y=F
  │                   └─(x3=F)─ y=F
  └─(x1=F)─ x3
            ├─(x3=T)─ x4
            │         ├─(x4=T)─ y=T
            │         └─(x4=F)─ y=F
            └─(x3=F)─ y=F

but it requires two copies of the subtree that tests (here) x3 and x4.
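
To make the duplication concrete, here is one way the tree might be written as a data structure (a minimal Python sketch; the Fork and Leaf classes are illustrative only, not the representation used in [Oli92]):

  from dataclasses import dataclass
  from typing import Union

  @dataclass
  class Leaf:
      value: bool          # the value of y predicted at this leaf

  @dataclass
  class Fork:
      attr: str            # the attribute tested at this fork
      if_true: 'Node'      # subtree followed when the test succeeds
      if_false: 'Node'     # subtree followed when the test fails

  Node = Union[Leaf, Fork]

  def x3_x4_subtree() -> Node:
      # the subtree computing (x3 and x4); the tree needs two separate copies
      return Fork('x3', Fork('x4', Leaf(True), Leaf(False)), Leaf(False))

  tree = Fork('x1',
              Fork('x2', Leaf(True), x3_x4_subtree()),  # x1=T: test x2
              x3_x4_subtree())                          # x1=F: second copy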

If the learning problem is made slightly harder, for example,

independent variables xi, i=1..9,
dependent variable y,
function to be learned
y = (x1∧x2∧x3) ∨ (x4∧x5∧x6) ∨ (x7∧x8∧x9)

the decision tree needed to describe it becomes much larger [exercise]. Such trees can be learned, but doing so generally requires a great deal of data compared to the intuitive complexity of the function to be learned.

The solution is to "factor out" any common subtrees and replace the tree by an equivalent directed acyclic graph (DAG). Such a DAG may be much smaller than the corresponding tree, have a smaller message length, and therefore be justified by much less data (evidence). Decision trees and decision graphs describe exactly the same set of functions, but decision graphs model disjunctive functions more efficiently. (There is a small penalty for describing a DAG that is in fact a tree.)

The tree data structure is extended, to allow DAGs, by adding a new kind of node, a join-node, in addition to a tree's forks (tests) and leaves.

  x1
  ├─(x1=T)─ x2
  │         ├─(x2=T)─ y=T
  │         └─(x2=F)─ Join1
  └─(x1=F)─ Join1

  Join1:
  x3
  ├─(x3=T)─ x4
  │         ├─(x4=T)─ y=T
  │         └─(x4=F)─ y=F
  └─(x3=F)─ y=F

(The examples above use binary variables but decision graphs and trees apply to arbitrary discrete and continuous variables in the obvious way.)
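
In code, a join can be represented either as an explicit third kind of node or, more simply for illustration, by letting two branches reference the same shared object (a sketch in the same style as before; again this is not Oliver's own representation):

  from dataclasses import dataclass
  from typing import Union

  @dataclass
  class Leaf:
      value: bool

  @dataclass
  class Fork:
      attr: str
      if_true: 'Node'
      if_false: 'Node'

  Node = Union[Leaf, Fork]

  # The x3/x4 sub-structure is built once and referenced twice; the shared
  # reference plays the role of the join node Join1 in the diagram above.
  join1 = Fork('x3', Fork('x4', Leaf(True), Leaf(False)), Leaf(False))

  graph = Fork('x1',
               Fork('x2', Leaf(True), join1),   # x1=T, x2=F -> Join1
               join1)                           # x1=F       -> Join1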

Message Length of a Decision Graph

In learning with MML, the two-part message length consists of (i) the message length of the hypothesis (here a decision graph), plus (ii) the message length of the data given the hypothesis. The second part is calculated, as for trees, from the distribution of the leaf into which each datum falls.
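
For example, the second part might be costed per leaf roughly as follows (a Python sketch showing only the shape of the calculation; it assumes two classes and a uniform prior on each leaf's class probability, whereas [OW92] give the coding actually used):

  from math import lgamma

  def leaf_cost_nits(a: int, b: int) -> float:
      # Cost, in nits, of transmitting the a+b class labels that fall into a
      # two-class leaf (a of one class, b of the other), integrating over a
      # uniform prior on the leaf's class probability: log((a+b+1)!/(a!b!)).
      return lgamma(a + b + 2) - lgamma(a + 1) - lgamma(b + 1)

  def data_cost_nits(leaf_counts) -> float:
      # Second part of the message: sum over all the leaves of the graph,
      # where leaf_counts is a list of (a, b) pairs, one per leaf.
      return sum(leaf_cost_nits(a, b) for a, b in leaf_counts)

The calculation of the first part, the message length of the decision graph itself, is generalised from that for decision trees: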

  1. Perform a prefix traversal of the decision graph, encoding forks and leaves (as for decision trees) and join nodes, but not traversing below any join nodes.
  2. The traversal may have output j≥0 join nodes. Note that, if j>0, at least two, but not necessarily all, of the join nodes output must join with each other [Oli92]. Encode the actual pattern of joins.
  3. If some join nodes were joined together in step 2, continue traversing and encoding the graph from them, as in step 1. Otherwise stop.

In step 1, compared with trees, there are now three kinds of node. Oliver suggests using some fixed probability pr(join_node); a fork or a leaf is then coded by the tree method, given that the node is not a join.

In step 2, it is necessary to calculate the number of possible ways, w, in which at least two of the dangling joins can be linked together, and to encode the actual way used, costing log w nits.
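
The following sketch puts these two steps into code (Python; the probabilities p_join and p_fork are illustrative parameters, and w is taken here to be the number of set partitions of the j dangling joins other than the all-singleton partition, i.e. Bell(j)−1 — an assumption made for illustration, since [Oli92] spells out the exact counting):

  from math import log

  def bell(n: int) -> int:
      # Bell number B(n): the number of ways to partition a set of n items,
      # computed with the Bell-triangle recurrence.
      if n <= 0:
          return 1
      row = [1]                        # row 1 of the Bell triangle
      for _ in range(n - 1):
          nxt = [row[-1]]
          for v in row:
              nxt.append(nxt[-1] + v)
          row = nxt
      return row[-1]                   # B(n) is the last entry of row n

  def node_cost_nits(kind: str, p_join: float, p_fork: float) -> float:
      # Step 1: each node output by the traversal costs
      #   -log pr(join)                          for a join node,
      #   -log(1 - pr(join)) + tree-style cost   for a fork or a leaf,
      # where the tree-style cost is simplified here to a single
      # probability p_fork of a fork (versus a leaf).
      if kind == 'join':
          return -log(p_join)
      return -log(1.0 - p_join) + (-log(p_fork) if kind == 'fork'
                                   else -log(1.0 - p_fork))

  def join_pattern_cost_nits(j: int) -> float:
      # Step 2: state which of the j dangling joins link together.
      # Assumed here: w = Bell(j) - 1 (any partition except "no joining").
      if j < 2:
          return 0.0                   # fewer than two joins: nothing to say
      return log(bell(j) - 1)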

Search

The search for an optimal decision graph is harder than that for an optimal decision tree. Wallace and Oliver suggest a greedy search with some amount of lookahead.
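
As a rough schematic, such a search can be organised as hill-climbing over candidate moves, where each move either grows a leaf into a fork or joins two leaves, and moves are scored by the total two-part message length (a Python sketch; the graph type, the move generator and the message-length function are left as parameters, and this is not claimed to be the exact procedure of [OW92]):

  from typing import Callable, Iterable, TypeVar

  G = TypeVar('G')          # whatever type represents a decision graph

  def greedy_search(start: G,
                    moves: Callable[[G], Iterable[G]],
                    msglen: Callable[[G], float]) -> G:
      # Repeatedly apply the candidate move that most reduces the two-part
      # message length; stop at a local optimum.  Lookahead can be added by
      # letting 'moves' expand candidates more than one step deep before
      # they are scored.
      graph, cost = start, msglen(start)
      while True:
          best, best_cost = None, cost
          for candidate in moves(graph):
              c = msglen(candidate)
              if c < best_cost:
                  best, best_cost = candidate, c
          if best is None:
              return graph            # no move improves the message length
          graph, cost = best, best_cost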

References

[OW92] J. Oliver & C. S. Wallace. Inferring Decision Graphs. TR 92/170, Dept. of Computer Science [*], Monash University, Australia 3800.
Also in, Proc. of the 1992 Aust. Joint Conf. on Artificial Intelligence, pp.361-367, 1992.
 
[Oli92] J. Oliver. Decision Graphs - An Extension of Decision Trees. TR 92/173, Dept. of Computer Science [*], Monash University, Australia 3800.
Also in, 4th Int. Conf. Artificial Intelligence and Statistics, Florida, pp.343-350, 1993.
 
[Hicss93] D. Dow, J. Oliver, T. I. Dix, L. Allison & C. S. Wallace, A decision graph explanation of protein secondary structure prediction. 26th Hawaii Int. Conf. Sys. Sci., 1, pp.669-678, 1993.
Also see Bioinformatics.
 
[*] Became Faculty of Information Technology, Monash.