## A Brief History of MML

### Chris Wallace

#### A seminar to Computer Science and Software Engineering (CSSE), Monash University, 4pm, Thursday, 20 November 2003

Well I'm not exactly sure why I've been asked to give a talk on the history of Minimum Message Length, perhaps to get a little bit of it on some sort of record before I drop dead which won't be all that long. And it is a fairly muddled history which involves a lot of people other than myself and indeed in some respects you would have to say that the earliest steps in the field were taken before I had ever done anything about it, and indeed I didn't even know about them until some years later -- but I will come to that.

This is an edited transcript of the seminar. The following conventions are used: [...] indicates something has been deleted (usually something very minor due to spoken v. written English). [.../xyz] indicates something has been changed to xyz. And [xyz] indicates xyz has been inserted. The school does have a video tape of the seminar -- requests should be made to the [...]

For me the whole business I suppose started to present itself to me as a bit of a problem when I was an honours student studying physics. An experimental group in the school had got very excited by a recent result that they had got. They had been measuring the intensity of a certain type of cosmic ray and discovered that the intensity seemed to go up and down according to the sidereal time. That is to say according to the attitude of the earth with respect to the distant stars. Now if that was so, it would have indicated a preferred direction for the cosmic rays to be coming from, and they were of a type which, it was thought, would not have a preferred direction. So this was quite an exciting result. And the statistical analysis of their intensity records showed that the results they had were, according to the standard statistical tests, significant to about the 1% level. That is to say, you would have expected to get such results by chance only 1% of the time. Now that is a reasonably good significance level. Most studies in medicine and psychology and the more difficult sciences jump up and down if they get a result which is even 5% significant. But it occurred to me listening to their presentation, at a seminar rather similar to this one, that the fit which their presumed sidereal-time variation gave to their data required them to say where the peak of this wave was coming to within about one hour, plus or minus one hour, say two hours. But if you fitted a curve with its peak outside this range the significance level dropped like a stone. So what?

Well then I said to myself, well they would have got just as excited had they found the peak at a different time of the day. And if the peak was sort of about two hours wide, well that would mean anywhere in twelve different positions round the clock (24-hour clock) would have been just as exciting. [...]

So the chances of their getting a result by pure chance that would have made them excited wasn't one in a hundred, it was twelve times one in a hundred, which is only about one in eight. And then it further occurred to me, well they would have got just as excited had they discovered something which showed two peaks every day, or a peak every year, or whatever. So there were really rather a lot of hypotheses that they would have embraced if they found that they gave a fit to the data with a significance of 1%. And when you throw that in, the real significance of the result wasn't one in a hundred, or even one in eight, but something probably more like one in five. In other words there would be at least a one in five chance that purely random data would have thrown up something that would have got them excited.

Well, while accidents of the order of one in a hundred don't happen often, accidents of the order of chances one in five happen at least once a week. So I didn't think they had a decent result. I raised my little voice in the seminar and tried to explain my reservations but, as usual, was howled down; remember I was only an honours student and that's the way you've got to treat honours students. And they went on collecting data. But over the next year the effect disappeared. I didn't say I told you so because by that time I had become a research student and they have got to be even more careful about what they say.

But it got me thinking, [...] if you are talking about the fit of a hypothesis to some data, and the fit looks good, you really have to discount this by how many hypotheses you would have contemplated before choosing the one that fitted the data best. Put it another way, the more precisely you have to specify your hypothesis out of a range of possibilities in order to get a good fit to the data, the less plausible that hypothesis really becomes given the data.
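The discount Wallace is describing is, in modern terms, a multiple-comparisons correction. A minimal sketch in Python (the factor of twelve and the 1% level are the numbers from the story; the Bonferroni bound used here is a standard, slightly conservative way to combine them):

```python
def effective_significance(p_value, n_hypotheses):
    """Bonferroni-style bound: if any of n_hypotheses hypotheses would
    have been accepted on a fit this good, the chance that pure noise
    produces *some* such fit is at most n_hypotheses * p_value."""
    return min(1.0, p_value * n_hypotheses)

# A nominal 1% fit, but roughly twelve peak positions around the clock
# would have been equally exciting:
print(effective_significance(0.01, 12))  # about 0.12 -- roughly one in eight
```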

So anyway, I put that thought to the back of my mind and forgot all about it. Later on as a research student I got involved in some rather complex statistical analyses of data from a different sort of cosmic ray and got introduced to Bayesian statistics because that seemed to be necessary to get more or less unbiased results. So I did that and put that on the side. Well OK, by the time I had assumed the full and unquestionable authority of a junior lecturer I had had the odd thought about how one should go about weighing evidence and had had a little bit of experience with statistics. [...]

It happened that I got a research student coming to do a masters. He was an electrical engineer named David Boulton who came from New Zealand where all good things come from -- like my parents -- and for some reason he decided he would do his masters under my supervision. (No he wasn't my first, he was my second student.) And at that time, which is quite a while back, it was about 1966, one of the very promising techniques for building computer logic was a thing called a tunnel-diode which could switch far faster than the then available transistors. So his masters was to be on making use of tunnel diodes in logic -- logic circuits. [...] This went on and we got some results but after about six months it became clear to both of us that this was going to be a dead-end, as it indeed proved to be. The things were just too difficult to handle; the results were difficult to make reliable and by the time you had made them reliable they weren't all that fast anyway. So here he was, poor student, come over from New Zealand to do a masters and half way through it was obvious that it just wasn't going anywhere. So what to do?

[...] A year or two before I had come across a problem which
was presented to me by a geologist friend of how to classify
samples of data from a large population.
They were in fact samples of sediments from oil wells and
as one put down oil wells in different places one encounters
different sorts of sediments. The origins and nature
of these wasn't entirely understood and he thought it would be a nice
idea to get some sort of classification as to what sort of sediments were what.
[...] At the time I had done nothing much more than
to keep him happy more or less with something which
today would be called a k-means description of the data.
That worked, more or less.
But then I forgot about it.
Anyway, it occurred to me when David Boulton ran out of things to do
that maybe it would be a good idea to go back to that classification problem
and see if we could put it in a proper statistical framework:
Given data from some large population
is it reasonable to regard the samples that you've got
(there may be a thousand different samples) as representing
different *kinds* of thing?
Is the population, in other words, made up of
several different classes of thing?
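The "k-means description" mentioned above can be sketched in a few lines. This is a generic modern k-means, not the actual program written for the geologist (assumed details: Euclidean distance, a fixed number of classes k, random starting centres):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternately assign each point to its nearest
    centre, then move each centre to the mean of its points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centres[i])))
            clusters[nearest].append(p)
        for i, c in enumerate(clusters):
            if c:  # leave an empty cluster's centre where it was
                centres[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centres, clusters
```

On two well-separated blobs of points this recovers the two groups; what it cannot do, and what the statistical framework was meant to answer, is say whether two groups is the right number.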

Now David was an engineer with a good decent science background and he also knew something about information theory. And he had the bright idea that he had always thought of theories as a way of concisely representing data and his classic example was the periodic table of the elements, something which some of you may know about. [...] It is a way of arranging the 92 elements (or as it was originally done [when] only about 70 elements [were] known), in a table in a way which reveals a lot of relationships and similarities among the chemical elements. Ones in the same column of the table have very similar chemistries. Ones in the same row of the table have rather similar masses. And so if you know the table, you already know a lot about these elements, just represented in a simple geometric arrangement of where you have listed the element names.

So *his* cut on doing classification was that if you
do a decent classification and describe what each class looks like and then
what class each thing in your sample belongs to, you have already said
a lot about the thing because if you've said it belongs to a
particular class then you know it's got to look pretty much like the
description of the things in the class.
And so his take on it was that it was all a matter of getting the
data and using some classification in some way to encode the data
more concisely -- data compression.
Now I thought well yeah, well maybe, but look mate
this is statistics, what one really wants to do is
to do a proper Bayesian statistical analysis here,
but taking account of how precisely you have to specify the properties of
each class and thing.
And the more classes you bung in,
the more properties you'll have to describe, so
the more precise, or the more complex, the whole description of the
population becomes.
Let's go and do this properly with Bayesian statistics,
putting proper prior distributions on numbers of classes and
prior probabilities on parameters and so forth.

[...] We argued about this for some weeks, waving arms and shouting at each other, and then we decided well look this was no good, we will have to prove our points by going away and doing the maths and writing the algorithms and seeing what happens. So peace fell on the subterranean rooms where we were working for a week or two, then we came together again and I looked at his maths and he looked at my maths and we discovered that we had ended up at the same place. There really wasn't any difference between doing Bayesian analyses the way I thought one ought to do Bayesian analyses and doing data compression the way he thought you ought to do data compression. Great light dawned -- wrote a program, got some data from somebody who had been looking at seal skulls from around the world, whacked it in, cranked the handle, and said oh there's six species of seal here. So we went to an expert and said how many species of seal are there and which ones were supposed to be which, and it was a pretty good match except that we had confused two seal species, which are now incidentally regarded as being the same species, they are just two different populations, and unfortunately split up some of the species into two classes which on inspection turned out to be male and female of the species. So we were really chuffed, wrote a paper, got it published and David got his masters.
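The identity the two of them converged on can be seen in miniature. In this toy sketch (a hypothetical uniform prior over a fixed grid of candidate parameter values; nothing like the real Snob program), stating the model costs -log2(prior) bits and encoding the data under it costs -log2(likelihood) bits, so the total message length is -log2(prior × likelihood): the shortest message names the hypothesis of highest posterior probability.

```python
from math import log2

def message_length_bits(data, p, grid_size):
    """Two-part message for 0/1 data: first state which of grid_size
    candidate values of p is being used (log2(grid_size) bits, i.e. a
    uniform prior over candidates), then encode the data with an ideal
    code under that p (-log2 likelihood bits)."""
    assert 0.0 < p < 1.0
    model_bits = log2(grid_size)
    data_bits = sum(-log2(p) if x else -log2(1.0 - p) for x in data)
    return model_bits + data_bits

def best_hypothesis(data, grid_size):
    """The candidate p minimising total message length -- equivalently,
    the candidate of highest posterior probability."""
    candidates = [(i + 0.5) / grid_size for i in range(grid_size)]
    return min(candidates,
               key=lambda p: message_length_bits(data, p, grid_size))

data = [1, 1, 1, 0, 1, 1, 0, 1]  # 6 heads in 8 tosses
print(best_hypothesis(data, 16))  # a candidate near the sample frequency 6/8
```

In the full MML treatment the grid spacing is itself optimised against the data rather than fixed in advance; fixing it, as here, is the crude version.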

Then in '68 I came down here to Monash
(I've been here a long, long time)
and David, for his sins, decided to follow me and do a PhD --
on elaborating the classification algorithm we had
so that it
could build a hierarchic tree of classes, not just one class,
two class, three class, four class, ...
but super-classes and these dividing into
smaller classes and whatever --
the kind of things that taxonomists do in the biological arena.
And so he slaved away at it and we both knew
what was going on by then, so I was able to give him some help,
and [he] finally got a decent PhD out of it.
And there the matter rested for a while.
Except that towards the end of this exercise
it had begun to dawn on me that
this approach to working out models, or forming hypotheses from data,
while it did seem to work
in the problem of forming a classification of a population,
was not necessarily restricted to that arena and could have
wider applications.

So David went off and did analogue computers and such but I, in spare time, and even heads of department occasionally had spare time in those days, pursued this idea of seeing if I could generalise this approach of data compression, or Bayesian analysis done right, to attack in principle more or less any problem of inductive inference. By inductive inference here I mean specifically the kind of thing that scientists try to do, which is, by taking a whole lot of data, to try to come up with general propositions about the source of this data which are true of lots of the data, even data that you haven't seen yet -- the discovery of at least approximations to general laws, general propositions, by looking at specific instances. And I did come up with a formalism which has now entered the literature as what is called the Strict Minimum Message Length formalism which at least gives a formal answer to the question of how to do it, at least in arenas where you have a fairly well defined body of hypotheses from which you wish to select. It might be quite broad, like the set of all possible ways of classifying a population or, as Mike Georgeff later suggested, the set of all possible finite-state automata which could represent the grammar of a language, but the formalism was there. I could prove a few things about it, but unfortunately the mathematical complexity of it made it almost unusable. So after a while I worked, using the work that Boulton and I had done in Snob, at getting tractable approximations to this general idea and ended up with things which are pretty usable.

Had very little success in getting any of this work published because the statisticians did not want to know on the whole and indeed one paper, which I still consider to be a pretty good paper, was rejected by a notable statistical journal on the basis of a referee's comment that he found it objectionable that you could somehow improve maximum likelihood by adding obscure terms to it. He didn't say it was wrong, he didn't say it didn't work, it was just objectionable. So that was that. Right. I went on to other things.

But then out of the blue a mad Irishman descended on us armed with survey data from piles of rubble which he had found in Ireland, and he was persuaded that these were in fact megalithic stone circles, like Stonehenge only being Irish not quite as big. And an equally mad engineer in Britain called Thom had achieved considerable fame by proposing that the shapes of these megalithic circles were in fact carefully constructed geometric shapes. Although roughly circular, they weren't truly circular and they were actually constructed from Pythagorean triangles (like 3, 4, 5, and 5, 12, 13, whatever, the right angled triangles with integer numbered sides) together with a whole lot of other stuff. And our mad Irishman didn't really think this was very likely and he wanted to do a proper statistical analysis. Now statistical analyses of Thom's results had been attempted but were inconclusive -- basically conventional statistics just is not equipped to resolve questions of that complexity. So John Patrick came with all of his survey data. He'd read something of the work on classification and thought this was maybe a promising approach. Anyway he did a PhD on it, in the course of which we developed and refined some of the approximations to minimum message length and out of it came some pretty reasonable results, which actually did lend evidence to at least one of the geometric shapes that Thom had proposed but could not really find any evidence for any of the others. So this was duly published and excited the attention of a couple of the statisticians who had themselves attempted to analyse Thom's data. One of them, Peter Freeman, then Professor of Statistics at Leicester, got sufficiently interested in it to come out here for a sabbatical year and work with me to see if there really was anything in it. With his assistance, we together ended up writing a paper which actually made sense to the statisticians, at least enough for it to get published. Phew! 
After quite a number of years this was. So I got interested in it again, did some more work with Peter Freeman on some rather more complex models, again got published, and this was great.

Now if we can stop there and just go back and think
about what was really happening in the rest of the world.
David Boulton and I had got started in '67, 1967.
But we had been well gazumped by a chap in the United States, Ray Solomonoff,
in I think [.../1964].
And he had come up with the notion that he wasn't really interested
so much in getting theories out of data.
His emphasis was, if you get some data from some source, can we predict
what *future* data might come from the same source?
So his emphasis was on prediction.
And his take on the subject was, well,
let us suppose that we can represent the data that we've got in binary
with a very long binary string.
And now let us take a universal Turing machine,
and I'm simplifying here a little bit
but that is to say an honest to God computer
which will accept a binary input.
And feed just random binary bits into that computer
and look at what comes out.
Well what comes out will also be a binary string
and you throw away all of the inputs which made it
produce garbage which wasn't very interesting, or failed,
or halt and catch fire, or loop forever, or all of the other
things that computers can do.
And you look only at those random strings which, when input,
made the computer output a binary string which was *precisely*
the data that you had.
And then you said OK,
these random input binary strings are somehow encoding the data
because when we bung them in this computer it makes
the computer output the data that we've got.
So let us put those strings in again but tack some
more random binary digits on the
end and see what we can get.
Well we'll get our original data in each case
plus some further binary strings which look like
maybe future data from the same source, and
by noticing how often this process
gives us future data of a particular kind,
we get a probability distribution over what the future data might be.
Now this sounds completely bizarre
but Solomonoff managed to prove that if the data comes from a source
which produces data from *any* computable probability distribution
then his prediction technique, given enough data to work on,
will converge to making predictions based on precisely
that computable probability distribution.
In other words, he proves it works.
And it doesn't matter how complicated the source,
what we're talking about,
whether it's the incidence of rabies in French foxes or
the, I don't know, the motion of the planets, whatever,
if this data displays any computable regularity
this process will detect it, pick it up,
and use it to make as accurate a prediction as it is possible to make
of what is going to come next.
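Solomonoff's procedure can't literally be run, but a toy version captures its flavour. Here the "machine" is an assumed toy, not Solomonoff's universal one: a program is just a bit pattern repeated forever, and a program of length L gets weight 2^-L, as if its bits had been fed in at random. Programs whose output reproduces the observed data are kept, and their predictions for the next bit are averaged:

```python
def predict_next(observed, max_pattern_len=8):
    """Toy Solomonoff-style prediction over a hypothetical machine
    whose programs are bit patterns repeated forever.  Each program of
    length L has prior weight 2**-L; only programs that reproduce the
    observed data exactly contribute to the prediction."""
    weight = {0: 0.0, 1: 0.0}
    for length in range(1, max_pattern_len + 1):
        for code in range(2 ** length):
            pattern = [(code >> i) & 1 for i in range(length)]
            output = [pattern[i % length] for i in range(len(observed) + 1)]
            if output[:len(observed)] == list(observed):
                weight[output[-1]] += 2.0 ** -length
    return weight[1] / (weight[0] + weight[1])  # P(next bit is 1)

# The shortest consistent program, "01" repeated, dominates and says
# the next bit is a 0, so this probability comes out small:
print(predict_next([0, 1, 0, 1, 0, 1]))
```

The shorter a program, the more weight it carries -- which is exactly the point of contact with message lengths noted below.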

What has this to do with minimum message length? Well it turns out that when you look at it, the strings which went into the computer which have most effect on making the predictions of future data are the strings which encoded the available data most concisely. In other words data compression. And the better the compression the more weight was given to that string in making the predictions as to what's happening next. So although he was not using this technique to pick on a model of the source of the data, he was using it to average over all possible models of the data and in effect assigning to each possible model a posterior probability which turned out to be precisely the posterior probability that you end up giving to a model found by minimum message length.

So there is a very, very close correspondence between the work. The mathematical pursuit was quite different: Turing machines are quite awful things to work with on a theoretical basis, whereas I'd stuck to, [...] if not finite, at least countable sets of hypotheses which you can enumerate. The hypotheses which can be somehow used by a Turing machine are not countable unfortunately; well they are countable but they cannot be enumerated because you can't tell by looking at it whether a computer program makes any sort of sense at all -- whether it will ever give any output, in general.

OK, so in terms of practical application our development had
immediate applications, and has been applied, and
quite a lot of people around the world are using some
of the techniques that we've developed.
Solomonoff's work gives a very strong theoretical respectability
to the whole approach because he can prove very powerful
theorems about what is going to happen of a quite different nature
to the kind of thing that we can prove about MML from statistical theory.
But it is much harder to get anything practically useful
out of Solomonoff's approach.
Now we didn't become aware of Solomonoff's work until we
were fairly well down the track, it would have been about '72 or '73,
something like that although he had published, I think, originally
in ['64].
I mean he'd published in an
IEEE journal on information theory [*Information and Control*],
I think, which was not the kind of thing that
your average statistician would read. I certainly hadn't read it.
And I only became aware of it through reading a Scientific American
article written by [yet] another bloke. [...]

Solomonoff's work had been read by a [...]
very famous Russian statistician, Kolmogorov, who'd
seen in it a basis for coming up with a definition of what is randomness.
I mean statisticians and probability people all *talk* about randomness
but nobody knew really what it meant.
Well Kolmogorov latched onto an idea in Solomonoff's work: if,
for a given binary string of data,
you could find some input to a Turing machine
which made it output this string,
and the input that you had to put in was shorter than the string
which came out, then the only way this could happen would be
if the string wasn't really random.
So he latched onto the idea that the real definition of randomness is,
at least for binary strings and by extension to any sort of data,
if you can compress it, it ain't random.
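Kolmogorov's criterion is easy to illustrate with an everyday compressor standing in for the ideal Turing-machine one (an assumption to be clear about: zlib only gives a crude upper bound on the true shortest description, so it can show structure but never prove randomness):

```python
import os
import zlib

def looks_random(data: bytes) -> bool:
    """Crude stand-in for Kolmogorov's test: if a general-purpose
    compressor cannot shorten the data, we have found no evidence of
    structure in it."""
    return len(zlib.compress(data, 9)) >= len(data)

patterned = b"0123456789" * 1000  # highly regular: compresses enormously
noise = os.urandom(10000)         # incompressible, with overwhelming probability
print(looks_random(patterned), looks_random(noise))  # False True
```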

Unfortunately, there was a bit of a defect in the way he developed this theory. It turns out to be quite trivial, but he never spotted it and he never really succeeded in getting his project to work. He thought it would lead to a theory of statistical inference but he couldn't make it go because of this defect in his theory. The defect had the horrible result that if you took any random string of digits and applied his criterion of randomness to it then, whatever the string was, you would find infinitely many prefixes of the string (initial sequences of the string) which were non-random according to his measure. Which is bizarre.

[...] Another chap, an American this time, Chaitin, who gave a talk here roughly a couple of years ago, spotted what was wrong and set it right. What was wrong was Kolmogorov hadn't thought of the point that if you have got a finite binary string and you want to show that it's compressible, you have to get an input for a Turing machine which will make it output that string and then stop! In other words the input has got to know how long the output is meant to be. Add that little bit in and the theory now works -- well, Chaitin made it work.
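The fix -- the input must carry its own length, so the machine knows where to stop -- is the same trick used in self-delimiting (prefix) codes. A small illustration using the standard Elias gamma code, offered as an analogy rather than Chaitin's actual construction:

```python
def elias_gamma(n: int) -> str:
    """Self-delimiting code for a positive integer: the length of n's
    binary form in unary (as leading zeros), then the binary form.
    A decoder reading the stream knows exactly where the number ends."""
    assert n >= 1
    bits = bin(n)[2:]
    return "0" * (len(bits) - 1) + bits

def elias_gamma_decode(stream: str) -> tuple[int, str]:
    """Read one gamma-coded integer off the front of a bit stream,
    returning it and the unread remainder."""
    zeros = 0
    while stream[zeros] == "0":
        zeros += 1
    width = zeros + 1
    return int(stream[zeros:zeros + width], 2), stream[zeros + width:]

code = elias_gamma(13) + "101"  # a self-delimiting header, then payload bits
print(elias_gamma_decode(code))  # (13, '101')
```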

[...] So he wrote the article in Scientific American [1975] mentioning Solomonoff's work. I looked at Chaitin's work and it was all to do with randomness, and frankly I wasn't all that interested in random data. I wanted to know what you could do with something which isn't random, like data from the real world, but it did put me onto Solomonoff's work.

Again round about much the same time I'd been getting IBM
technical bulletins and one of them had an article by a chap
called Rissanen who had come up with a bright idea
for looking at data coming out of, say, a chemical plant --
the inputs and the outputs of the plant,
or any sort of plant --
and just from that data working out a model of what was going on in the plant.
Basically if you like,
looking at a very complicated sort of Markov process and
making an estimate of what the Markov process is.
And lo and behold when I looked at the maths it looked remarkably like
the maths for minimum message length, except that
one crucial term looked to me to be the wrong way round --
he had a minus sign where he should have had a plus sign.
So I wrote to him and suggested he have another look at this
and furnished some of my work. [...]
He went on to invent and popularise
what is called minimum description length which is
basically very much the same idea.
And one has to concede that anybody can mistake a plus for a minus,
I'm always doing it myself;
he had come up with the same idea independently. [...]
Once he got the sign right
it all clicked and he began talking sensibly about it.

So what have we got here? We've got Solomonoff, to some extent Chaitin and Kolmogorov, certainly Chaitin, and David Boulton and myself all coming up with more or less the same idea at more or less the same time, which I guess might mean that the idea's time had come, except that obviously it hadn't because thirty years has gone by and the world still doesn't know about it! But we're working on that.

Anyway back home, we kept beavering away.
Well Mike Georgeff came along and, as I've mentioned,
had this idea that maybe you could use this scheme actually to
estimate the structure of something, namely the structure
of a grammar just given sentences from the grammar,
at least in simple formal languages.
This idea had actually been suggested by a chap called Gaines who had
suggested [LA:1976] that one could use
a model of a grammar which was essentially a finite-state automaton of
a particular sort. And Gaines had given some examples of
strings, simple strings in a simple grammar,
and shown that his approach could recover the finite-state machine
which represented the grammar from these strings
but he didn't know how to "stop".
He kept on making models which were more and more and more complex
until you ended up with a model which could produce
precisely, but only, the given examples.
In other words a grammar which says these are the sentences and there are no others.
And obviously that wasn't what he was looking for.
But he said, well look if you stop it right at the right spot,
you get the grammar which represents the given sentences but
could also generate other sentences which look sort of similar.
So Mike suggested well why don't we have a go at this.
So we did the not very complicated maths and wrote
some fairly horrible programs and lo and behold
it all worked [*see TR32*],
very slowly, but it worked, and
that resulted in a free trip to Pisa where we gave a talk
at an artificial intelligence conference there.
And this was sort of the first introduction of these ideas to the
artificial intelligence community.

Since then of course we have gone a lot further, or a fair bit further, and with Kevin [Korb] here and a bunch of students we have pursued learning other sorts of structures, in particular causal networks and binary networks, from just raw data. And that seems to work pretty well too. Trevor Dix and Lloyd Allison developed some calculations making some use of this theory to answer questions about the relatedness, or otherwise, of DNA strings, and in Trevor's case how fragments of sequences could be put together to make a coherent whole. I think I've got that right. And Ingrid [Zukerman] has been making a bit of use of it in looking at the structure of discourse where one of the parties is a computer trying desperately to understand what the human on the end of a telephone is actually talking about. So there has been a fair bit of progress and there is still an awful lot more progress that could be made, I think. [David Dowe has also pushed the idea along dealing with less usual statistical models.]

The artificial intelligence community, in particular machine learning because really that's what we are trying to do, is beginning to take some notice of these techniques; I think it should be taking a bit more notice. I have had some vehement opposition from a certain quarter but he hasn't actually yet managed to prove us wrong. No doubt he's working on it. But certainly in so far as the basic theory goes, whether you put it in a framework à la Solomonoff, using a simple universal Turing machine to decode encoded descriptions of the data, or whether we put it in the Bayesian statistical framework of choosing, from a possibly large but still not unlimited set of hypotheses which might fit the data, the one that really does, there are lots of good theorems which say this is the right way to go, one of the most powerful actually produced by a theoretical statistician at Stanford who has managed to prove that if you use this technique and the data is indeed coming from a probability distribution which is either computable, or computable given a few uncomputable constants, then (and you go the route that we have of looking for the shortest possible encoding) [...] within a finite time you will get the right answer. You only need a finite amount of data to find the real honest to God probability distribution from which this data is coming. That is a very powerful theorem, it is very general, it's got rates of convergence and all sorts of things coming out of it. I don't understand it myself -- it uses mathematics that I'm not familiar with -- but it's nice to know that apparently we are doing something respectable. And nobody's really been able to show that it doesn't work -- come up with any convincing example where you get the wrong answer by using this technique.

So anyway, that's the end of my history.
It's still open-ended.
There's still a great deal of work to be done.
You may say what the hell has it got to do with
computing but it is indeed very significant, I think,
that the people that got things out of this,
or came up with these ideas,
were
an engineer working with RCA and using computers all the time,
that was Solomonoff,
and who knew a lot about Turing machines (still does),
another engineer, David Boulton,
who knew a lot about coding and information theory,
a lapsed physicist, me, who knew a bit of statistics
and a fair bit about the workings of machines,
and Rissanen who was also an engineer.
Statisticians just didn't come into it.
Learning theorists don't come into it.
I think you had to have the background in dealing with
real numbers crunching through real processes
in a machine and coming up with real answers and thinking along those terms,
rather than elegant mathematical theories, to be led down this path.
It is a very simple-minded path.
What it's saying is that the most convincing story
about some data is the shortest one.
Full stop.
OK, it's not easy to turn that precept
quantitatively into something that you can use,
but that's basically what we've done.
The fundamental idea is trivial, but very strong and I think
it still has a lot in it which remains to be found out.
I have been thoroughly subverted by Kevin Korb and
Charles Twardy -- don't go around talking to philosophers, they muck your
brain around -- but I am now persuaded that, for instance,
we can show that although given present knowledge,
the present state of the universe,
we can make predictions, *deduce* things about the future,
we cannot actually deduce things about the past.
Well you can but you get stupid answers.
When we are reasoning about the past we actually
have to use inductive logic,
which is the basis of statistical and inductive inference.
So OK, this is amusing if true.
If it is true it solves a couple of conundrums which have been puzzling
philosophers of science for a while,
but it remains to be seen whether it is.
So anyway, that's the kind of thing I'm thinking about at the moment,
utterly pointless, but there it is.
I'm retired. I'm allowed to do that.
Any questions?

Q: inaudible.

A: The point that it is the same as the Bayesian approach --
even that hasn't sunk in.
Because until you come to realize the importance of
considering the *precision* of statement of parameters the ordinary
Bayesian approach doesn't get you there.
You can't equate posterior probability density with posterior probability.
Until you do the business about Fisher informations and
precisions of estimation you can't use a Bayesian technique to give you,
to lead you to a hypothesis which you can plausibly say, is the hypothesis
of highest posterior probability,
because you can't assign any posterior probability
to any hypothesis if they've got real parameters --
in the conventional Bayesian sense.
So that message still has got to get across.
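To make the precision point concrete, here is a small numerical sketch (not part of the talk; the binomial model, the sample of 100 trials with 37 successes, and the grid of candidate precisions are all invented for illustration). A two-part message first states an estimate of the success probability p to some precision delta, then encodes the data using that stated estimate. Too fine a precision makes the first part long; too coarse a precision forces a poor estimate and makes the second part long; the total message length is minimised at an intermediate precision.

```python
from math import comb, log2

def message_length(n, k, delta):
    """Two-part message length (bits) for n binomial trials with k
    successes, stating the estimate of p to precision delta."""
    # First part: state p to precision delta, uniform prior on [0,1],
    # so there are 1/delta distinguishable values to choose among.
    part1 = log2(1.0 / delta)
    # Second part: encode the data under the stated (quantized) estimate.
    p = round((k / n) / delta) * delta          # grid point nearest the MLE
    p = min(max(p, delta / 2), 1 - delta / 2)   # keep p away from 0 and 1
    part2 = -log2(comb(n, k) * p**k * (1 - p)**(n - k))
    return part1 + part2

n, k = 100, 37
deltas = [2.0**-b for b in range(1, 16)]        # candidate precisions
lengths = [message_length(n, k, d) for d in deltas]
best = deltas[lengths.index(min(lengths))]
# The shortest total message uses an intermediate precision (here 1/8),
# not the finest one: stating p more exactly than the data warrant
# only lengthens the message.
```

The minimum falls where the cost of stating the estimate more finely just balances the saving in encoding the data, which is the quantity the Fisher information controls in the MML formula.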

What *is* happening is that people who wouldn't touch this
sort of analysis with a barge-pole are in fact ending up
doing things which amount to the same thing.
There's a hot topic in work on
using neural nets to learn supervised classification
which is going for "wide margin" classifiers.
If you look at the mathematics of what they're actually doing,
what the wide margin is doing is precisely taking account of how
precisely you have to specify the parameters of the network.
So they are putting that into their things.

If you look at structural risk minimisation,
another apparently unrelated thing of Vapnik's
which is based on VC dimension, you find that
in certain cases which are pretty general, there are at least
three ways of expressing the VC dimension.
If you look at it closely they correspond
to three different ways of encoding the data.
And the different ways depend on, [...]
well OK one can be better than the others and
which comes best depends on how precisely you
have to specify estimates.

So the actual practice of this, the mathematical consequences, are getting into people's heads, but they're getting there by all sorts of funny routes, which is encouraging -- it means there are all sorts of justifications for doing it.

Q: inaudible.

A: Well, Snob was produced
by David Boulton and myself to do classification,
that's it.
If you use what we [...]
It was a very specific application and it wasn't until after we'd done it
that we realized the principles behind it were capable of being
generalized to lots of other problems.

We just saw it as, if you want to do classification
this is the way you damn well ought to do it.
Well the principle is:
Find that classification of the sample which allows you to encode
the data in the sample, most concisely.
Full stop.
That's what drove it.
That's what the algorithm, [...]
I mean you can look at the algorithm and that's
obviously what it's doing.
Or you can look at the algorithm and also say what it's obviously doing
is picking the model of highest Bayesian posterior probability.
So it's very obvious.

Q: inaudible.

A: Oh my message is I made a *big* mistake by
getting into academic work.

[nervous laughter from the audience]

No, I mean that. And to be a little pessimistic, I don't see a huge future for people wanting to do the kind of academic work which I've been doing all my life and which most of you have been, well the older ones among us, have been doing all their lives. I don't think it's respected, I don't think it's going to be resourced, and one has to begin to wonder what the hell the point was. Well on that unhappy note, I'll stop.

[applause]

### References mentioned in the text, in publication order

(Added later by LA)

- R. Solomonoff. *A formal theory of inductive inference, I and II*. Information and Control **7**, pp.1-22 and pp.224-254, 1964.
- A. N. Kolmogorov. *Three approaches to the quantitative definition of information*. Problems of Information and Transmission **1**(1), pp.1-7, 1965.
- G. J. Chaitin. *On the length of programs for computing finite binary sequences*. J. A.C.M. **13**(4), pp.547-569, October 1966.
- A. Thom. *Megalithic sites in Britain*. Clarendon Press, Oxford, 1967.
- C. S. Wallace & D. M. Boulton. *An information measure for classification*. Computer J. **11**(2), pp.185-194, August 1968 (also [html]).
- D. M. Boulton & C. S. Wallace. *The information content of a multistate distribution*. J. Theor. Biol. **23**, pp.269-278, 1969.
- D. M. Boulton & C. S. Wallace. *A program for numerical classification*. Computer J. **13**(1), pp.63-69, February 1970.
- D. M. Boulton & C. S. Wallace. *An information measure for hierarchic classification*. Computer J. **16**(3), pp.254-261, August 1973.
- D. M. Boulton. *The Information Measure Criterion for Intrinsic Classification*. PhD thesis, Monash University, 1975.
- G. J. Chaitin. *Randomness and mathematical proof*. Scientific American **232**, pp.47-52, May 1975. (And also: *Randomness in arithmetic*. Scientific American, pp.80-85, July 1985.)
- B. R. Gaines. *Behaviour structure transformations under uncertainty*. Int. J. Man-Machine Studies **8**, pp.337-365, 1976.
- J. Rissanen. *Modeling by shortest data description*. Automatica **14**, pp.465-471, 1978.
- J. D. Patrick. *An information measure comparative analysis of megalithic geometries*. PhD thesis, Dept. Computer Science, Monash University, 1979.
- J. Patrick & C. S. Wallace. *Stone circle geometries: An information theory approach*. In "Archaeoastronomy in the New World", D. Heggie (ed), CUP, 1982.
- M. P. Georgeff & C. S. Wallace. *A general selection criterion for inductive inference*. European Conf. on Artificial Intelligence, Pisa, Elsevier Science Publ., pp.473-482, September 1984. (Also [TR32] Dept. Comp. Sci., Monash University, March 1983.)
- C. S. Wallace & P. R. Freeman. *Estimation and inference by compact coding*. J. Royal Stat. Soc. B **49**(3), pp.240-265, 1987.
- C. S. Wallace & P. R. Freeman. *Single-factor analysis by minimum message length estimation*. J. Royal Stat. Soc. B **54**(1), pp.195-209, 1992.
- L. Allison, C. S. Wallace & C. N. Yee. *Finite-state models in the alignment of macro-molecules*. J. Mol. Evol. **35**(1), pp.77-89, July 1992.
- T. MacKenzie, D. Platt & T. Dix. *Modelling errors in restriction mapping*. Hawaii Int. Conf. Sys. Sci. **26**, January 1993.
- C. S. Wallace, K. Korb & H. Dai. *Causal discovery via MML*. Proc. 13th Int. Conf. on Machine Learning, L. Saitta (ed), Morgan Kaufmann, pp.516-524, 1996.

The MML book is an excellent source on MML.

*Transcription (& mistakes) by L. Allison (2004).*