[LA/ MML, IP, 200309/ TO-DO, JFP'05, Bayes Nets ]

Inductive Inference 1.

Lloyd Allison, School of Computer Science and Software Engineering, Monash University, Clayton, Victoria 3800, Australia.

Report 2003/148

Abstract: This report describes the current state of software, version 2003 09 (tested with ghc 6.0.1) in the project Inductive Inference using Haskell. The main aim of the project is to understand and to model(!), from a programming point of view, "statistical models" as used in artificial intelligence, data mining, inductive inference, machine learning, and statistical inference. Haskell's type-system is used as the analytical tool.

Types and type-classes are given for (basic) models including probability distributions, function models including regressions, time-series models including Markov models, mixture models (unsupervised classification, clustering), classification trees (supervised classification), and operators and conversion functions on these.

In addition, some important and well-known instances of statistical models and of inference algorithms for them are also given, first to make a general claim for being practical, and also to study their structures and parts. A simple algorithm is sometimes given as a proof of concept; any more sophisticated algorithm(s) can obviously be programmed in Haskell, in principle, but this is not the main aim just now.

Keywords: Class, inductive inference, machine learning, model, probability, statistics, type.

1. Introduction

The project was initially motivated by the observation of several postgraduate students and other researchers in machine learning over more than a decade: Often a student devises a new (or extended) "model" for some kind of problem, does some hard mathematics on estimators for the model (i.e. is not just a hacker), implements the ideas in a computer program (usually in C or C++, i.e. can also do practical stuff), compares the results of the program with competitive methods (hopefully beating enough of them often enough), writes a thesis, gets a PhD, and leaves. The programs left behind, not always but often, have rudimentary I/O, use annoyingly different I/O formats from each other, are "delicate" on extreme kinds of data, come with little if any documentation, are hard to move to another platform, and are not as general as they could be. (All this is quite understandable because a postgraduate student is under a lot of pressure to finish quickly and a researcher tends to concentrates on her or his immediate problem; naturally I am guilty too.) Consequently it can be difficult for others to use these programs, or use two or more in combination, or modify or extend them. Many basic components are reinvented over and over again, sometimes with small variations and new bugs etc., e.g. I/O or basic distributions. Opportunities for generalization are lost.

It seemed that if there was a usable "theory" (model, semantics, library, prelude) of the "statistical models" that are produced in inductive inference, it might help to reduce the reinvention of basic components, and to increase sharing of components, which would lead to more reliable, more general and lighter code.

It is therefore necessary to understand just what is a "statistical model", for want of a name, from a programming point of view in artificial intelligence, data mining, inductive inference, machine learning, and/or statistical inference, call the area what you will. It is very desirable to have a theory that can be used -- type-checked, compiled, and run -- and the functional programming language Haskell is the best piece of "programming technology" for this kind of exercise. It was also obvious that polymorphic types and high-order functions would be very useful for programming with statistical models, just as they are for programming with functions. Such ideas, in less general forms, found their ways into application programs (in C, Java and Pascal) for particular inference problems[APD99,ASED00].

On examination, "what is a statistical model from a programming point of view" turns out to be an interesting question. Looking for analogies to the question and possible answers, one thinks of list processing and the body of operators that has grown up for it over 40+ years; it seems likely that programming with statistical models should be at least as rich. "Function" is to "functional programming" as "statistical model" is to what?

This first step of the project is very much a proof of concept, and deliberately makes conservative choices for some options, e.g. sticking with the standard Haskell-98 language, unless there is a strong reason to do otherwise. Note that since the first publications[All03a,All03b], the names of some functions (methods) have been revised, and the straightforward, but important, ability of models to show (print) themselves has been implemented.

The Haskell code itself might or might not grow into a useful package in its own right. It will be nice if it does. Even if it does not, the lessons learned can be incorporated into software written in other languages.

2. Haskell

Haskell is a functional programming language development of which began in the 1990s. It features lazy evaluation, parametric polymorphic types with a type-inference algorithm, and type classes -- which allow overloading of operators. Haskell is a pure functional language, i.e. there are no side-effects within a Haskell program. Nevertheless a program can interact fully with the operating system (using monadic I/O) and can create, read and write files etc..

Most code in this report uses the Haskell-98 language [PJ03] in the interests of familiarity and standardisation. However the Haskell community is very active in research of type-systems and many extensions have been proposed, some of them being available in the "Glasgow Haskell Compiler", ghc, under the control of command-line flags. It seems possible that a new Haskell-X will be standardised in the not too distant future. Haskell-98 is powerful enough for most of the requirements of the inductive inference project at this stage, but some experiments, not described here, have investigated the use of type extensions.

2.1 Types

Haskell has a Hindley - Milner parametric polymorphic type system with a type inference algorithm (similar to that of Standard ML (SML)), e.g. the programmer writes:: map f [] = []; map f (x:xs) = (f x):(map f xs)
and the compiler infers: map :: (t -> u) -> [t] -> [u]
where "[]" is the empty list, ":" is the list constructor, "::" can be read as "has the type", "->" is a function type, "u" and "t" are type variables, and "[u]" is a list type where the elements are of type "u".
(In fact, map is defined in the standard prelude.)

2.2 Classes

Haskell has type-classes to allow ad-hoc polymorphism, i.e. where a function (operator) name is overloaded by different sequences of code to be executed depending on the types of its parameter(s). (Note in contrast that map is uniform and executes the same code regardless of the types t and u. Although note that there is also a class Functor with fmap, and that list ([]) is an instance with map.)

A class is defined by specifying the values (functions and other constants) that must be defined for an instance of the class, e.g.: class Splits t where; splits :: [t] -> [Splitter t]
A type t is in the class Splits if there is a function splits from lists of t to lists of Splitter t. (A Splitter t is, roughly, a function for partitioning the space (type) t.

A type, e.g. the pair (t1,t2) is made in instance of a class by defining the necessary values, e.g.: instance (Splits t1, Splits t2) => Splits (t1, t2) where; splits xys =; let (xs, ys) = unzip xys; in interleave (splitsAttr 0 fst xs) (splitsAttr 1 snd ys)
In this example there is a constraint, (Splits t1, Splits t2), i.e. t1 and t2 must be in Splits already before (=>) the pair (t1,t2) can be in Splits.

Much active research in functional programming centres around the best form of type classes, their generalisations and uses.

Experience with classes on imperative object-oriented languages such as C++, Java and C# may be some help to the new Haskell programmer, but perhaps less than he or she might think. The notion of an "object" as an instance of a class tends to focus the mind on a "state", but the emphasis is different in functional languages. There is also the fortunate absence of the schizophrenia of some imperative O.O. languages over "basic types" and "classes": In Haskell a value has a type and a type can belong to one or more classes; values, types and classes are in clearly separated worlds.

It is clear that computer science has not yet agreed on the perfect class system. It is also interesting to note for example that Haskell has no built-in class Function (overloading juxtaposition "f x" and "$"), nor class Tuple (overloading "fst", "snd", etc.) nor Pair, nor many other "natural" classes. Some of the reasons must be historic and some technical.

3. MML

Any system for inductive inference must address the problem of overfitting: In general a more complex model will likely fit data better than a simpler model. For example, a quadratic fitted to points {(x_i,y_i)}, is likely to give a lower squared error than a straight line. The problem is to decide if the extra complexity of the quadratic over the line worth it. To solve the problem this work uses minimum message length (MML) inference[WB68,WF87,WF02], a Bayesian method based on information theory.

Bayes: Pr(H&D) = Pr(H) . Pr(D|H) = Pr(D) . Pr(H|D)
Shannon: msglen(E) = -log_e(Pr(E)) nits
so: msglen(H&D) = msglen(H) + msglen(D|H) = msglen(D) + msglen(H|D); msglen(H) : 1st part of a two-part message; msglen(D|H) : 2nd part of a two-part message

It is important to note that msglen(H) for a hypothesis (model, theory), H, includes the statement of any continuous valued parameters to (finite) optimum precision. Without this, Bayesianism often does not make sense.

It should be clear that MML could, in principle, be replaced by some other "compositional" method of assessing the complexity of models, provided that it can work on combinations of sub-models used to build larger models. (A replacement might not work as well as MML.) It is even conceivable that the method could be "abstracted out" of the algorithms. In any case, the code described works in the MML framework.

4. Code

The Haskell code is very much a prototype and is organised in the following files and modules:: 01Utilities.hs -- some basic utility code; 02Classes.hs -- most types and classes; 03Models.hs -- basic models (distributions); 04FnModels.hs -- function models; 05TSModels.hs -- time series models; 06MixtureModels.hs -- mixture models; 07CTrees.hs -- classification trees; 091all.hs -- trivial; 099.hs -- Main module, test harness
It is not yet organised in a "nice" package or library. Each file contains some tests, generally not exhaustive, and these are invoked in the test harness.
All files can be compiled together by e.g. ghc 0*.hs which is nice and easy, if crude.
Terminology: base -- the base of logarithms, e --> nits the default, 2 --> bits, 10 --> bans, etc.; freq -- sometimes short for frequency; msg -- total length of a 2-part message; msg1 -- 1st part of a 2-part message, i.e. model; msg2 -- 2nd part of a 2-part message, i.e. data|model; nl -- negative log, as in in nlPr; pr or Pr -- probability

4.1 Utilities

01Utilities.hs contains miscellaneous utility code, e.g.: data Throw = H | T deriving (Enum, Bounded, Eq, Ord, Show) -- Mr. Bernoulli; normalise xs = ... -- scale xs to make sum to 1.0; -- logPlusBase b (-logBase b p1) (-logBase b p2) = -logBase b (p1+p2); logPlusBase b nlPr1 nlPr2 = ...
etc.
Of more interest are class Splits and data Splitter t,: class Splits t where; splits :: [t] -> [Splitter t]
which are for data-sets over type t, i.e. [t], that can be split (partitioned) in various ways; they are used in the current version of classification trees. The present code is conservative, but laziness has future potential for use with Hoare's "Find" algorithm for partitioning real-valued attributes in classification trees.
The standard class Enum (enumerated) is extended to new instances (,) (i.e. pair): instance (Enum t1,Bounded t1, Enum t2,Bounded t2) => Enum (t1,t2) where; fromEnum (v1,v2) = ...
and also triple, quad, etc.. (There is some potential for generic or for template Haskell here.) These instances ensure that the multistate (multinomial) distribution which is defined on Enum, Bounded data-spaces automatically applies to tuples of these, even nested tuples.

4.2 Classes

02Classes.hs defines the more important types and classes for "statistical models".

Apart from providing opportunities to make bad puns and to boost web ratings, SuperModel is the super-class of statistical models and defines their common properties. Note that msg1 refers to the 1st part, i.e. complexity of the (super-) model itself, in a two-part message, i.e. model plus data. A mixture of two or more statistical models is a weighted average of them; mixtures can be formed on (basic) models, function-models and time-series. (Class Mixture is an auxiliary class defined in this module.) Note that all models must be able to Show (print) themselves, e.g. including the values of their parameters, if any. -- SuperModel: class (Show sMdl) => SuperModel sMdl where; prior :: sMdl -> Probability -- prior of blah-Model; msg1 :: sMdl -> MessageLength -- 1st part of any 2-part message; msg1Base :: Double -> sMdl -> MessageLength; mixture :: (Mixture mx, SuperModel (mx sMdl)) =>; mx sMdl -> sMdl -- Mixture of blah_Model -> blah_Model; prior sm = exp (-msg1 sm); msg1 sm = - log (prior sm); msg1Base b sm = (msg1 sm) / (log b); ...
It is interesting that there seem to be no useful SuperModels apart from instances of the subclasses.
(Basic) Model is the class of probability distributions. A model must, at the very least, be able to give a probability, pr, to a datum from its data-space. It is usually better to work with the negative log probability, nlPr. In addition, a model must give a (2nd part) message length, msg2 (negative log likelihood) to a data-set, of type [dataSpace], and similarly a total message length, msg, for model plus data. -- Model: class Model mdl where; pr :: (mdl dataSpace) -> dataSpace -> Probability; nlPr :: (mdl dataSpace) -> dataSpace -> MessageLength -- neg log prob; nlPrBase :: Double -> (mdl dataSpace) -> dataSpace -> MessageLength; msg :: SuperModel (mdl dataSpace) =>; (mdl dataSpace) -> [dataSpace] -> MessageLength; msgBase :: SuperModel (mdl dataSpace) =>; Double -> (mdl dataSpace) -> [dataSpace] -> MessageLength; msg2 :: (mdl dataSpace) -> [dataSpace] -> MessageLength; msg2Base :: Double -> (mdl dataSpace) -> [dataSpace] -> MessageLength; ...
A FunctionModel has an input-space (exogenous variables, attributes) and an output-space (endogenous variables, attributes). Given a datum from the input-space, a function-model gives a conditional, cond, model of the output-space. The conditional model can give a conditional probability to a datum from the output-space. -- FunctionModel: class FunctionModel fm where; condModel :: (fm inSpace opSpace) -> inSpace -> ModelType opSpace; condPr :: (fm inSpace opSpace) -> inSpace -> opSpace -> Probability; condNlPr :: (fm inSpace opSpace) -> inSpace -> opSpace -> MessageLength; condNlPrBase :: Double ->; (fm inSpace opSpace) -> inSpace -> opSpace -> MessageLength; ...
A TimeSeries is a model of sequential data. Its defining property is to make a sequence of predictions, i.e. Models, one for each position in a data sequence plus one for the "next" position after the end, each prediction being conditional on the preceding values. -- TimeSeries: class TimeSeries tsm where; predictors :: (tsm dataSpace) -> [dataSpace] -> [ModelType dataSpace]; prs :: (tsm dataSpace) -> [dataSpace] -> [Probability]; nlPrs :: (tsm dataSpace) -> [dataSpace] -> [MessageLength]; nlPrsBase :: Double -> (tsm dataSpace) -> [dataSpace] -> [MessageLength]; ...
The data types ModelType, FunctionModelType, and TimeSeriesType are declared and made instances of the appropriate classes to get the type & class ball rolling. Parameters are the model's complexity, a function to do its core calculation, and a function to show (print) it. It is expected that further types and instances of the classes will be defined as necessary (also see classification trees). -- ModelType: data ModelType dataSpace =; MPr MessageLength (dataSpace -> Probability) (() -> String) |; MnlPr MessageLength (dataSpace -> MessageLength) (() -> String); -- FunctionModelType; data FunctionModelType inSpace opSpace =; FM MessageLength (inSpace -> ModelType opSpace) (() -> String); -- TimeSeriesType; data TimeSeriesType dataSpace =; TSM MessageLength ([dataSpace] -> ModelType dataSpace) (() -> String)

Note that wallaceIntModel, a universal model over integers >=0, is defined in this module, rather than the next, for convenience: It is used as the default model for "lengths" when a model of a data-space is converted into a model of sequences over that data-space.

4.3 Models

03Models.hs defines some (basic) models, distributions and estimators. As noted above, wallaceIntModel should, logically, be defined in this module but is in the previous one for convenience.

uniform is one of the simplest models, a model of the data-space {lo..hi}: -- Model of Enum Bounded dataSpace: uniform lo hi =; let (lwb, upb) =; if (fromEnum hi) >= (fromEnum lo); then (lo, hi) else (hi, lo) -- ?switch?; in MPr 0 (\_ -> 1 / fromIntegral((fromEnum upb) - (fromEnum lwb) + 1)); (\() -> "uniform[" ++ show lo ++ ".." ++ show hi ++ "]")
The probability of a datum is obviously the inverse of the number of values in the data-space. The model has zero complexity because it has no parameters (hi and lo are taken to be "common knowledge").: fifty50 = uniform 0 1; freqs2model fs = ... -- :: frequencies, [Num] -> Model of bounded Int (i.e. MultiState); etc.
A multistate distribution, as produced by freqs2model, does have parameters and hence has a non-zero message length[WB68,BW69] (complexity), msg1. The current calculation is indirect, relying on the uninformative combinatorial and adaptive codes having equal lengths, and the informative MML code requiring a fraction of a nit more per parameter.
Operators on models are also defined here, e.g. -- (Model of d1, Model of d2) -> Model of (d1, d2): bivariate (m1, m2) =; let nlp (d1, d2) = (nlPr m1 d1) + (nlPr m2 d2) -- d1, d2 assumed independent; in MnlPr ((msg1 m1) + (msg1 m2)) nlp; (\() -> "(" ++ showsPrec 0 m1 ("," ++ show m2 ++ ")") )
This bivariate model assumes that the two components are independent. The assumption could be removed with the use of a function model, or by implementing a factor model, say.

Note that mixture modelling "requires" weighted estimators (the use of unweighted estimators gives biased mixtures in general). A weighted estimator accepts data and also weights, e.g. a datum can be 0.6 present; this is needed for fractional membership of a datum in a component of a mixture model. For this reason weighted and unweighted estimators are defined. (Of course an unweighted estimator could invoke the weighted one with weights of 1.0 -- at a performance penalty. The "best" organisation of estimators, weighted and unweighted, is not yet settled.)

4.4 Function Models

04FnModels.hs defines function-models, including regressions.

Notably: estFiniteFunction ipSeries opSeries = ...
is an estimator for finite function-models (finite input and output spaces), and: estFiniteListFunction k inputs outputs = ...
is an estimator for finite function-models where the input-space is lists of length "k".

4.5 Time Series Models

05TSModels.hs is for time-series models.

The estimator for Markov models of order k uses estFiniteListFunction from the previous module and the conversion function from (appropriate) function-models to time-series models.: estMarkov k dataSeries =; -- scan makes list of input contexts for the predictor function; let scan (d:ds) context = context : (scan ds (d:context)); scan [] _ = []; -- e.g. scan "abcdef" [] -> ["","a","ba","cba","dcba","edcba"]; contexts = scan dataSeries [] -- NB. each context is "backwards"; in functionModel2timeSeries (estFiniteListFunction k contexts dataSeries)

4.6 Mixture Models

06MixtureModels.hs defines mixture modelling, i.e. clustering, unsupervised classification, numerical taxonomy.

The main business of the module is the definition of estMixture, a simple estimator for a mixture model. It is given two parameters -- a list of weighted estimators and a data-set. In this simple example there is one estimator per component of the mixture. The estimator is therefore told in advance how many components there will be; it is not hard to remove this restriction, in principle, but our primary concern is with types and classes rather than algorithms. (The simple estimator could also be run for 1, 2, 3, ..., up to some reasonable limit number of components, and the message lengths compared, to select the best.) The estimators can be heterogeneous but they must all have the same type; they must all be models of the given data-space.

A gradient descent (~expectation maximization (EM)) algorithm is used. Initially data are allocated (fractional) memberships of components at random. Parameters of the components are (re-)estimated from the memberships. The data are re-allocated memberships of the revised components. This process is iterated "for a while". (A similar algorithm could easily be written for unweighted estimators but it would give biased estimates if it made "deterministic" allocations of memberships; stochastic allocation performs well given enough data.)

estMixture ests dataSet =

-- [estimator] -> [dataSpace] -> model of dataSpace

...

A simple test is based on a game played with a silver coin and a gold coin.

twoUp = [(H,H),(H,H),(H,H),(H,H),

(T,T),(T,T),(T,T),(T,T),

(H,T),(T,H), ...] -- data-set

est2coins = estBivariateWeighted( estMultiStateWeighted, estMultiStateWeighted);

mix2 = estMixture [est2coins, est2coins] twoUp;

If the two-component hypothesis, mix2, has a shorter total message length than a one-component hypothesis, something fishy is going on in the game.

4.7 Classification Trees

07CTrees.hs defines classification-trees, sometimes called decision-trees. These are also generalised to function-model trees, i.e. regression-trees.

A classification- (decision-) tree consists of leaf-nodes, CTleaf, and fork-nodes, CTfork. A leaf-node contains a model of some target output attribute(s), and a fork-node contains a test, a Splitter, on one (or potentially more) input attribute. A classification-tree is therefore an implementation of a function-model.

data CTreeType inSpace opSpace =

CTleaf (ModelType opSpace) |

CTfork MessageLength (Splitter inSpace) [CTreeType inSpace opSpace] |

CTforkFM (FunctionModelType inSpace Int) [CTreeType inSpace opSpace]

-- The last option, CTFork, is rather speculative as of 2002, 2003.

instance SuperModel (CTreeType inSpace opSpace) where

...

instance FunctionModel CTreeType where

condModel (CTleaf leafModel) i = leafModel

condModel (CTfork fnLen f dts) i

= condModel (dts !! (applySplitter f i)) i

...

There is an opportunity for a class Function above. (The CTforkFM option of CTreeType is experimental and little exploited. Its purpose is to allow function-models to control mixtures over the sub-trees.)

The estimator for a classification-tree takes four parameters: an estimator for the leaf models, a function that produces lists of possible Splitters given an input data-set, an input data-set, and a corresponding output data-set. (There are cases where it would be more convenient for input and output pairs to come zipped and other cases where it is better for them to come unzipped.)

estCTree estLeafMdl splits ipSet opSet =

...

Note that the splits parameter could be removed, that is made implicit, thanks to class Splits -- see 01Utilities.hs.

Presently, the simplest zero-lookahead search algorithm is implemented: A one leaf tree is compared with each possible one-fork (+leaves) tree, and the best tree, in MML terms, is kept. If the one-leaf tree wins then that is the end of the search. Otherwise the search is continued recursively for each sub-tree with its corresponding input and output data sets. (Zero-lookahead search can form the basis of 1-lookahead search, and hence of k-lookahead search, in the obvious way.) The present MML calculations are "acceptable" but simplified from the original method[WP93] and they should be "tightened up". (The present algorithm can easily be speeded up in some obvious ways but is fast enough for our purposes, particularly on categorical data.)

Note that a leaf model can be absolutely any model of any ouput-space -- be it discrete, continuous, multivariate -- for which there is an estimator, or even a function-model converted to a (basic) model (see 02Classes.hs) which leads to a function-model tree, i.e. a regression-tree. This is more general than classification-trees and classification-graphs produced in CSSE to date, and more general than C4.5 and C5[Qui92].

5. Summary

The code is an experimental prototype; there are obvious gaps to be filled, and algorithmic improvements to be made etc. but this is justified considering the primary aim. Some extensions, e.g.[FAC03], are under consideration. The greatest potential, and the biggest challenge, is to use Haskell's type-classes on this application in the best possible way.

6. References

[All03a] L. Allison. Types and classes of machine learning and data mining. 26th Australasian Computer Science Conference (ACSC) pp.207-215, Adelaide, February 2003.
[All03b] L. Allison. The types of models. 2nd Hawaii Int. Conf. on Statistics and Related Fields (HICS), June 2003.
[APD99] L. Allison, D. Powell & T. I. Dix. Compression and approximate matching. Computer Journal, 42(1), pp.1-10, 1999.
[ASED00] L. Allison, L. Stern, T. Edgoose & T. I. Dix. Sequence complexity for biological sequence analysis. Computers and Chemistry, 24(1), pp.43-55, 2000.
[BW69] D. M. Boulton & C. S. Wallace. The information content of a multistate distribution. J. Theor. Biol., 23, pp.269-278, 1969.
[FAC03] L. J. Fitzgibbon, L. Allison & J. W. Comley. Probability model type sufficiency. 4th Int. Conf. on Intelligent Data Engineering and Automated Learning (IDEAL-2003), Hong Kong, March 2003.
[PJ03] S. Peyton Jones et al. Haskell 98 Language and Libraries, the Revised Report. Cambridge U.P, April 2003. Also see [www.haskell.org]
[Qui92] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
[WB68] C. S. Wallace & D. M. Boulton. An information measure for classification. Computer Journal, 11(2), pp.185-194, August 1968.
[WF02] G. E. Farr & C. S. Wallace. The complexity of strict minimum message length inference. Computer Journal 45(3), pp.285-292, 2002.
[WF87] C. S. Wallace & P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society series B., 49(3), pp.240-265, 1987.
[WP93] C. S. Wallace & J. D. Patrick. Coding decision trees. Machine Learning, 11, pp.7-22, 1993.

Also see: L. Allison. Models for machine learning and data mining in functional programming. [doi:10.1017/S0956796804005301], J. Functional Programming, 15(1), pp.15-32, Jan. 2005.

Inductive Inference 1.

Lloyd Allison, School of Computer Science and Software Engineering, Monash University, Clayton, Victoria 3800, Australia.

Report 2003/148

1. Introduction

2. Haskell

2.1 Types

2.2 Classes

3. MML

4. Code

4.1 Utilities

4.2 Classes

4.3 Models

4.4 Function Models

4.5 Time Series Models

4.6 Mixture Models

4.7 Classification Trees

5. Summary

7. Appendix