How to do Inductive Inference using Information Theory
C. S. Wallace
L. Allison, ed.
I am not talking about the whole of inductive inference at all; rather I will be talking about a subset of it, which is arguing or reasoning from particular observations, or data, to general propositions. I know there are lots of other things which are called inductive inference but on the whole I will not be talking about them. This particular problem has of course been well worked over.
Before I launch into it I want to make a couple of caveats. Throughout this talk, when I talk about a general proposition such as "all cats have claws," for instance, I will be wanting you to regard the general proposition as being a little bit qualified by common sense and scepticism, in that if I say, or anybody says, "all cats have claws," we must understand this to be a statement which is generally true but there may be the odd cat that does not have claws, and there may be some cats who, from genetic defects, have something on the ends of their feet but we are not sure whether they are properly to be called claws, or not, and so forth.
Now you may say this is begging the question a little bit in that if we say "all men are mortal" then by golly that's what we mean, and we don't admit of any exceptions, and if we did admit of any exceptions then we would not say all men are mortal.
But I want to put to you that even with such an absolute statement as that, when we actually come to use it in deductive reasoning about the real world and combine that assertion or compare it with observable data we are always, if we are sensible, prepared to accept the possibility that what we observe we have observed incorrectly or inaccurately, and that when regarded as a proposition about what we observe, even if the proposition is held in some ideal sense to be absolutely true, we must always regard it as being somewhat approximate and somewhat prone to exception.
OK. Just by way of obvious illustration, supposing we have for instance a very good, very well tested general law that if we put something floating in the water, the buoyancy force upwards equals the weight of the displaced water -- never significantly challenged since Archimedes. But realistically, if we are going to use this general assertion to deal with the buoyancy force that we have actually measured, and the weight of water that we have actually measured, we will be dead lucky if we get agreement to within 0.1%. And even if we are extra careful we will have to realise well, surface tension does affect things a little bit, and there could have been waves on the water, and if that boat is actually moving at even 1mm/sec then you have hydrodynamic forces to worry about as well. And so in practice the law, while entirely usable as stated in an unqualified way like this, is not so interpreted. It is interpreted with a little bit of caution and likewise the business about cats having claws. There is always the odd exception.
I want to make the point further that even before we consider the induction of general propositions, we really have to account for a more primitive form of induction -- the invention of names, common nouns, verbs. I regard the use of a noun like "cat" as the outcome of inductive reasoning, in that we observed a large number of particular animals, or particular small things scurrying around outside, and have come to the conclusion, through a process of induction, that a number of these are sufficiently similar that it is worth regarding them as forming a class and inventing a name for that class and that name is cats. Now that is an act of induction. It leads to a conclusion that there are things called cats that are worth talking about and, like any induction, it is fallible. But before we have got through this step we cannot even think about the inductive inference of general propositions because we cannot form a general proposition without common nouns and common verbs. Before we can have a universally quantified term in a proposition we must have invented the thing which is to be universally quantified. We cannot talk about "all cats" until we have decided there are cats and what we mean by cats.
So much for background. I will in fact attempt to give some account of that primitive sort of inference as well as the more ordinary sort. What do we start off with? I am going to try and simplify it. We have got some data or observations. We have got some background knowledge and, in the light of that background knowledge, we think that [data] D requires some explanation because it is to some extent surprising. It is certainly not implied by our background knowledge. I will suppose however that it does not contradict our background knowledge. So when I say surprising here I mean not that it causes us to jump in our chairs but that we find it unpredictable. Our background knowledge gave us no necessary reason to expect such data. So, having collected enough such data we may infer a theory about it, by some inductive process, which I will call H, a hypothesis. And we will certainly require that our background knowledge -- which for our purposes I am not going to challenge; it is accepted knowledge -- and the data do not contradict the theory.
And we would like that the background knowledge together with the theory, at least roughly, implies the data, or renders it less surprising -- gives us some credible account of what we have observed. So we hope the data will be less surprising if we assume the truth of this hypothesis.
Before I can go further, I have to introduce the idea that data, or the results of observations, or whatever, certainly data that can be communicated between one human being and another must be represented in some sort of observation language. We have a message which tells us what that data might be. It could be a sentence in English. It could be a roll of magnetic tape that a computer could read, that has come out of an instrument -- it could be whatever. But it is a message in some sort of code or language. I will call the language in which the data is presented an observation language. And the way in which we choose to represent the data may make use of such background knowledge as we have to shorten or make the representation of the data reasonably efficient. For instance, such things as the circumstances of the observations, what sort of instrument we have used, any well-known laws of nature held to be applicable, can be thrown in to help us reduce the volume of data without losing any information from it. Let us suppose that having done that, we end up with a message of some length [|D|] which tells us what we have observed.
I will now want to define a particular way of recoding what our data is which I will call an explanation in the hope that it corresponds, at least roughly, to one of the normal English senses of the word.
Suppose that we have a sender, S, who knows the background information and the data, and a receiver, R, who knows the background only. And we suppose that we would like to form a message which S can send to R and which will tell R the whole of the data -- give him every bit of detail about that data. There are many ways you could perhaps form such a message. I will want to concentrate on a particular way of doing it which I will call an explanation message. And an explanation message will have two parts. The first part will be a statement of some hypothesis that we may have inferred from inspecting the data. The second part is just a statement of the data but restated in a new code or language (not the observation language) which we expect to be efficient if H were true. Thus we could omit from the second part of the message anything in D which is in fact implied by the background knowledge and H. It would not be necessary to include that in the second part of the message.
Well that's what I call an explanation message then: A statement of a hypothesis followed by a statement of the data couched in terms which assume the truth of the hypothesis.
Now comes the interesting bit. We will regard an explanation, and the hypothesis on which it is based, as acceptable if the length of the explanation is shorter than the original length of the data as stated in the observation language. Now the length of that explanation is of course the sum of the lengths of the first part, which states the hypothesis, and the second part, which states the data given the hypothesis, or assuming the hypothesis. Now all of this would be great stuff, and we will say that out of all explanations of the data we will prefer the shortest.
So essentially this is just some form of Ockham's razor or the parsimony principle -- nothing very surprising. And nothing very useful unless we can actually reason sensibly about what these lengths actually are. So to proceed any further, other than just giving a sort of motherhood principle that explanations should not be too long winded, we need some way of systematically, accurately, and uncontroversially attaching a length to a message -- measuring the lengths of messages.
Well the groundwork for one way of doing this was, of course, established by Shannon in his theory of information, and he showed that using optimal encoding, or the best possible language, the length of a message announcing some event or occurrence, X, would necessarily be minus the logarithm of the probability of that occurrence. So something which is improbable will require a long message to state it. Something which is very much expected anyway can be announced in a short message.
When we make the appropriate substitution in this we find that the length of an explanation is, first, the length required to state the hypothesis, which depends on the probability of that hypothesis, whatever that may mean, plus, second, a length which depends on the probability of the data given the hypothesis. And using a bit of Bayes's theorem, and whatever, we end up with this being minus the log of the joint probability of hypothesis and data.
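To make the arithmetic concrete, here is a minimal sketch in Python; the probabilities are invented purely for illustration, standing for whatever prior and likelihood the receiver's background knowledge supplies.

    import math

    def shannon_bits(p):
        # Shannon's optimal code length, in bits, for an event of probability p.
        return -math.log2(p)

    # Invented numbers, purely for illustration:
    p_H = 0.01          # prior probability the receiver gives hypothesis H
    p_D_given_H = 0.2   # probability of the data if H were true

    part1 = shannon_bits(p_H)           # first part: states the hypothesis
    part2 = shannon_bits(p_D_given_H)   # second part: states the data assuming H
    total = part1 + part2

    # The total is exactly -log2 of the joint probability P(H, D):
    assert abs(total - shannon_bits(p_H * p_D_given_H)) < 1e-9
    print(f"{part1:.1f} + {part2:.1f} = {total:.1f} bits")

Minimising this total over hypotheses is then the same as maximising the joint probability, which is where Bayes comes in below.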
Well it comes as no surprise then if we say actually we are talking about Bayes's theorem, and the probability of the hypothesis that occurs here we can think of as a Bayesian prior probability for a hypothesis -- what probability the receiver would attach to the hypothesis if he doesn't know anything about the data but only has his background knowledge.
Well, we will say that not only do we require an explanation to be shorter than the original statement of the data for it to be acceptable at all, but out of all such explanations that meet that criterion, we would prefer, and infer from the data, the one that gives the shortest explanation of the data. This is essentially the Bayesian choice rule of choosing the inference which has the highest posterior probability given the data, because that can be written as the joint probability [H&D] divided by the probability of the data, which doesn't change if we contemplate different hypotheses. So you might say, well I've got this far, ten minutes down the track, and so far we have caught up with Bayes. So what? Well there is a difference.
The line I'm plugging differs from Bayesian acceptance theory in that we can cope quite well in this context with something Bayesian theory doesn't cope with very well at all, and that is the situation that the hypothesis we infer from the data might have some free real-valued parameters. We should write it as h(θ) where we don't know the value of θ and θ itself has to be inferred. For example, our hypothesis might be that there are things called electrons, and that these have certain properties, and that this is explaining something that we see going on in a vacuum with a hot wire in it. But to flesh out that hypothesis we would have to give values for the properties of the electron, in particular its mass, its electronic charge, its spin -- whatever happens to be relevant. These are real-valued numbers. Our hypothesis has to advance values for these. We cannot very well have a prior probability for any particular value in a continuum. All we can do -- deal with in Bayesian terms -- is a probability density, and unfortunately when we go back and interpret this in terms of probability densities we end up with a posterior probability density over hypotheses, and that does not lead, in any acceptable way, to a way of choosing a particular hypothesis or a particular estimate for the mass of the electron, or its charge, or whatever. However, when we do it in terms of message lengths, it all comes out in the wash and we do end up with (well the fact that we may have real-valued parameters causes one to scratch one's head over some algebra, but you get) some sensible answers coming out at the end.
OK. So in one sense all we have done is restate the Bayesian view of inductive inference but try to tidy it up and extend its application to the estimation of real-valued parameters, which Bayesian inference does have some problem with.
If that was all we had done it would not be terribly interesting. However there is another way of looking at the question of message lengths, an alternative to Shannon's theory, where, instead of regarding the receiver R of the explanation as a person or agent with background knowledge and some prior probability beliefs about possible hypotheses, we regard it as a universal Turing machine.
Now, a universal Turing machine is basically just a computer. The particular universal Turing machine I want to discuss can read things recorded on an input tape -- a message. It can output onto another tape, and it has a work tape, or many of them for that matter, of essentially infinite length, on which it can record and read back anything it chooses. OK, so it is a computer with a nice big disc or tape attached to it, and the ability to read data in from the outside world. Having read the input tape it cannot rewind it; it cannot go back and have another look at the input. And having output, it cannot recall the output and change it -- that's it.
Now, you will recall that our data is presented in some observation language or code. For any such observation language or code there is a trivial way of rewriting it in the binary language of 0s and 1s, and this is done in telephones and increasingly in TV and whatever. And so we will think of our computer also as just inputting and outputting 0s and 1s -- strings of 0s and 1s. In our terms now, this is going to be the receiver of the message. The message will be fed to it as a string of 0s and 1s on the input tape and, to prove that the computer has actually understood that message and recovered that data from it, we are going to require the computer to output a copy of the data in the original observation language, again just as a string of 0s and 1s. So now an explanation is an input tape which, when fed into a Turing machine, causes it to output a tape which is the original form of the data. And an explanation is acceptable if the input tape that we find is shorter than the output tape which is produced and, of all the possible explanations, we will prefer the shortest.
So what has changed? Well there is no mention in any of this model, on the face of it, of probabilities. And in particular, there appears to be no mention of prior probabilities of hypotheses or whatever. So what's happened? Well the probabilities are there but they are inherent in the design of the universal Turing machine. You can have lots and lots and lots of different universal Turing machines, all looking like this but with different internal works inside the black box in the middle. Now in one sense what's inside the black box does not matter very much -- which is why they are called universal Turing machines -- in that whatever is inside the guts of one Turing machine, you can make it behave like any other universal Turing machine by feeding it an appropriate input. You can reprogram it to imitate any other machine, and this is why they are called universal. But how much input you have to give the computer to make it imitate another one will depend on its internal design and on the internal design of the one you want it to imitate. So in a sense then, if we think of message lengths, a la Shannon, as reflecting the logarithm of probabilities, the length of the tape that we have to shove into a computer to make it behave like another computer, or to make it behave as if it accepted the hypothesis H, is a measure of the probability of the hypothesis H or, if you like, the prior probability of the computer we are imitating. The longer the input we have to stuff in to make the computer accept hypothesis H, the less probable that hypothesis is for this particular computer. So the choice of computer does determine the choice of our prior probabilities. This all sounds a bit weird but let's put it another way.
Consider an explanation. It contains two parts -- I've drawn it out as a long piece of tape here. The first part is in some sense a statement of a hypothesis. Now I will say a little more about just how a hypothesis is stated. The second part is the data recoded in a code or language which will be efficient if the hypothesis was true. Alright, well what does H really amount to here? When the Turing machine, T, has read H but before it has started to read this part, when it has read H it is in effect a new and different Turing machine, one which is prepared to accept input in a code which would be efficient if H were to be true. So what is H if we think of it from a Computer Science point of view? H is a program which will cause the Turing machine to decode the code we would like to use for recoding data if we believe H to be true. To give a simple example, supposing that we have got some data which we have obtained by picking up lots of small objects, getting their masses in a balance, applying forces to them, and seeing how fast these objects accelerate. So our data is a whole lot of triples, each recording the mass, the force applied, and the acceleration produced. And we could record these in some observational language just as a series of numbers.
Now having looked at this, and being at least as clever as Newton, we suddenly realise that, by golly, on the whole the force is very closely equal to the mass times the acceleration. So we advance an explanation which begins with an assertion of Newton's second law of motion, force = mass × acceleration, and then encode the data in a way which would be efficient if that were true. And if it were true, what is an efficient way of encoding a force - mass - acceleration triple? All you have to do is record the mass and acceleration; then the computer itself can multiply those two together and get the force. OK, so we didn't have to include the force here. Well that isn't quite true because the measured force probably won't exactly equal the mass times acceleration -- we have to allow for a little bit of error here, but the errors will be small and won't take up much room.
So what goes on the front here? Well basically a little program to read in triples of numbers, multiply the first in each triple by the second, add on a small correction which is the third, and spit out the result. A pretty elementary form of Newton's second law of motion but that is all we need -- something which decodes a code which would be efficient were the hypothesis true.
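A rough sketch of that accounting in Python; the bit costs are invented for illustration, and a real treatment would have to price the residuals and stated precisions properly.

    BITS_PER_NUMBER = 16    # assumed cost of stating one measured value in full
    BITS_PER_RESIDUAL = 4   # assumed cost of the small correction force - m*a
    HYPOTHESIS_BITS = 50    # assumed cost of the little "f = m*a" program
    n_triples = 20          # number of (mass, force, acceleration) observations

    # Naive encoding: state all three numbers of every triple.
    naive = 3 * BITS_PER_NUMBER * n_triples

    # Explanation: state the law once, then only mass, acceleration and a
    # small residual per triple; the receiver multiplies to recover the force.
    explanation = HYPOTHESIS_BITS + (2 * BITS_PER_NUMBER + BITS_PER_RESIDUAL) * n_triples

    print(naive, explanation)   # 960 vs 770: the hypothesis is paid for once,
                                # and every triple then saves 16 - 4 = 12 bits.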
Well at that point I ran out of transparencies so I'll stop and revert to a blackboard.
Let me try to answer "what can we say about this view of inference to suggest that it is worth serious consideration?" Well there's a few things that we can say which, I think, merit its consideration. The first one is that an H which is accepted as the result of looking at some data is automatically falsifiable in Popperian terms. Any H, which allows the given data to be encoded in a way shorter than its original representation, must be such that there could be some data that could not be so encoded and indeed the encoding assuming H would be longer than the naive representation. This is an immediate consequence of Shannon's theory of information. So any hypothesis accepted as a basis of this process is subject to rejection by data which is at least conceivable, even if we do not observe it. There is observable data which could cause us to reject the hypothesis. That's nice.
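The counting argument behind that falsifiability can be sketched in a few lines of Python (this is just the standard pigeonhole point):

    # No lossless code can shorten every n-bit string: there are 2**n strings
    # but only 2**n - 1 binary strings strictly shorter than n bits.
    n = 20
    assert sum(2 ** i for i in range(n)) < 2 ** n

    # So if assuming H compresses some conceivable data, other conceivable
    # data must come out longer than its naive representation -- and observing
    # such data would reject H.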
The second part is that it does improve, in a certain sense, in two directions, I think, on Bayesian inference: It can handle continua of theories, as I've mentioned before, without any problems. The other point however is that Bayesian inference has a little bit of a problem about it. We have the rule that we would like to say
- writes on the board:
- P(H|D) = P(H) P(D|H) / P(D)
but this needs the prior probability P(H), and the rule itself does not say where that prior is to come from. Kevin Korb will give you one answer as to how to get out of this. I will give you another, and I say then that the thing which embodies our prior probabilities is the universal Turing machine, T, which is the receiver of our message, of our explanation.
Why did we choose that particular Turing machine? Well because our background knowledge and our prior experience had led us to expect that was an appropriate Turing machine to have. It incorporates our prior beliefs about the way the world behaves. But where did these beliefs come from? Well at some previous stage, we must have been more ignorant, and we had an earlier Turing machine representing our more primitive beliefs about the state of the world, but we found that in explaining lots of data to this [earlier] Turing machine, lots of the time we had always to incorporate the same little stuff up the front of our hypotheses. That whenever we were talking about physical quantities, it seemed to be a darned good idea to start the first part of the explanation, the hypothesis-stating part, with something which told the Turing machine how to multiply numbers together. Turing machines don't necessarily know that -- you have to tell them; it's a little subroutine.
And so we would say, after a little while, well by golly, it looks to me in any situation where there are quantitative observations there's a good chance that multiplication will be involved, so I'll build myself a new Turing machine which now has multiplication built in, and instead of having to put in a rather long piece of tape to tell it how to do multiplication every time I wanted to do something like that in decoding the data, I've just got to put in something very short. So I have changed the prior beliefs incorporated into the Turing machine -- it now knows about multiplication. And a little later on, I've probably got a quite well educated Turing machine which knows about division and sines and cosines and Bessel functions and linked lists and . . . oh . . . all sorts of stuff, all buried in the design of the Turing machine, and all making it much easier and shorter to explain new data.
So I have explained how a more primitive Turing machine gets updated when we find that we are constantly having to give it the same little bit of input and all of our hypotheses, or many of them, tend to incorporate the same little bit of digital machinery. Well we just build it into another Turing machine.
Where did this more primitive Turing machine come from? Well it came from the same kind of process, and an even more primitive Turing machine. And we have some sort of regress, but is it infinite? Well the answer is no! We come to a simplest possible universal Turing machine. No matter how you define Turing machines, there is a simplest possible Turing machine. If you try to simplify it any further it ceases to be universal; it loses the capacity to imitate all Turing machines. There is in fact even some little bit known about what the simplest universal Turing machine looks like and it is really remarkably simple. It has something of the order of 100 internal states and that's all, which is extraordinarily simple. And it knows nothing; I mean it really knows nothing. You would have to feed a tape of hundreds and hundreds and hundreds of binary digits into it to tell it how to add two numbers together. So in my account at any rate, one is led rather naturally to an expression of complete ignorance.
Complete ignorance is defined by that state of knowledge represented by the simplest possible universal Turing machine. Notice incidentally that if we start with that simplest possible universal Turing machine, we don't need, strictly, ever to build smarter ones. All we need do is feed this a tape which will cause it to imitate smarter ones. There's no difference, except in how fast it will run, between a very simple Turing machine which has been educated, and a Turing machine which has been built smart. That's the whole idea of universality.
We do have that second advantage then, that I think there is a natural representation of total ignorance [writes on board]. Total ignorance has some slightly surprising features when defined in this way. If we think of a number, we may say oh well, it's a likely sort of number or an unlikely sort of number. If we have a machine which is totally ignorant then a number like, oh I don't know, 31415926 is much more probable than this one, 31415922, because the former can be generated by a shorter input than the latter: it is just the first umpteen digits of π, and that is generated by a very easy little program, whereas the latter hasn't got any pattern to it at all, and so would take a good deal more input to generate. So total ignorance is not quite what we would expect.
It is possible that the world is unkind to us and we go and collect some data which is entirely chaotic. We look at the sky and there's patterns of stars on it with no real sense to it at all. That doesn't stop naive people from seeing patterns in it. But if we adopt this view of inductive inference, we will find that random data is indeed inexplicable: there will be no acceptable explanation of it, no hypothesis about the data which enables the data to be recoded concisely. And that's nice.
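You can get the flavour of this with an ordinary compressor standing in, very crudely, for the universal Turing machine -- zlib only finds repetition, not arbitrary pattern such as the digits of π, so the patterned string below is simply periodic:

    import random
    import zlib

    patterned = ("0123456789" * 100).encode()    # 1000 bytes with an obvious regularity
    random.seed(1)
    chaotic = bytes(random.randrange(256) for _ in range(1000))  # 1000 patternless bytes

    print(len(zlib.compress(patterned)))  # a few dozen bytes: an acceptable "explanation"
    print(len(zlib.compress(chaotic)))    # about 1000 bytes or more: inexplicable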
It means that this account of inference, inductive inference, suggests that what we accept as good inferences are really reflecting something which is really going on in the data -- some real regularity or pattern in the way the world is behaving. And we have the assurance that if the data was indeed random, then with extremely high probability, going to one as the amount of data increases, we would find no explanation of the data. So that's nice. What else is nice?
If the data has indeed any computable regularity then, in the limit, as the amount of data increases, the hypothesis preferred by this rule [MML] will converge to a representation of that regularity. There are even some slightly weaker results that if there is some regularity in the data, but it cannot be represented by a computable function, then the data may still be explicable, at least roughly, and our hypothesis will converge to that computable function which is closest to the uncomputable one from which the data really comes.
OK, so one has to take any asymptotic results like this with a grain of salt because it doesn't say how much data is going to be necessary to achieve this convergence. But at least we have the assurance that the process, as we get more and more data, is going to get us closer and closer to hypotheses that describe genuine regularities in the real world and we will not get stuck with some hypothesis about the real world which is simply wrong.
Finally, I have presented this essentially as a normative view of inductive inference. I am prepared to argue that it is, at least approximately, a descriptive view of the kind of inductive inference we are talking about, although I would not attempt necessarily to say that the inductive inferences of a particular thinker or scientist in a particular situation necessarily follow my prescriptions. What I would say is that those inductive inferences which a scientific culture ends up generally accepting do all satisfy the criteria which I have put up here -- they all do indeed allow the raw data we get from the outside world to be recoded in a considerably more concise fashion. Now obviously I have not got time to argue the point but I have illustrated it, for instance with Newton's laws of motion. You can think of dozens of examples for yourself if you like.
Finally, I have said nothing about how, given data, we go about finding a good hypothesis. What I've talked about is testing whether a hypothesis is acceptable, and whether or not it is better than some other hypothesis, given the data. And indeed there's not an awful lot I can say about finding the hypothesis in general.
In the most general case, where we are including among the set of conceivable hypotheses all computable functions, or all computable rules, one can show that there is no algorithm for finding the best hypothesis. If there were, it could be used to solve the halting problem in Computer Science, which is known to be undecidable.
However, in sufficiently restricted cases one can conduct a systematic, and sometimes even reasonably efficient, search over a constrained set of hypotheses about data, and get a computer to automate this search, and come up with at least a good hypothesis, if not necessarily the best. This has been done in quite a number of problem domains, all of course fairly simple, but in many of them reaching conclusions which were not obvious to human beings and which in some cases have significantly refined the best theories that humans had previously come up with. In the most recent example, David Dowe was using a program which we developed a while back, into which were fed numbers characterising properties of days, like how windy they were, how much rain there'd been, and whatever, and whether or not there had been a bushfire, and the computer was set to work out a hypothesis about this data in the form of a rule predicting whether or not there was likely to be a bushfire given the properties of the day. And the rule it came up with by automating this process indeed works better than the rules currently employed by the CFA.
You can get it to learn simple grammars given sentences in a simple language. The best explanation of those sentences is the grammar from which they are drawn; you can learn that and so on and so forth.
Well I'll stop there. But that is the final argument: It works.
-- Applause --
Questions
- Q: [inaudible]
- A: Indeed, that is where it all started. The simplicity of a hypothesis is well represented . . . we think, rather, the complexity of a hypothesis is represented by the length of the string required to say what the hypothesis is. That seems fair enough. The strength of a hypothesis does seem to be well measured by how much it condenses the representation of the data. A hypothesis which was so strong that it implied the data exactly would reduce the representation to zero. What's left, the second part of the message, is mostly in fact those aspects of the data which are not explained by the hypothesis. So clearly, the less of the data there's left, the stronger the hypothesis. So I think the whole approach is designed to combine, . . . to trade off complexity versus explanatory power of the hypothesis. And the whole idea of reducing things to message lengths is to automatically get quantities which are directly comparable, and additive.
- Q: [inaudible]
- A: There is some connection with maximum entropy inference but I am not entirely sure that it has been sorted out properly. I'm not confident about talking about it.
- Q: [inaudible]
- A: I don't think it is necessary to appeal to possibility theory to make this hang together. If you like, you can practically erase the word "probability" from the whole argument, and just deal entirely with the lengths of binary strings, and how these are encoded and decoded, and all the convergence results still hold. Probability is, if one likes, an artefact that we come along with and look at this and say, well, something which has taken a long message to describe, I think, must be unlikely; I don't really believe it. This story is so long I am not going to believe it. These message lengths reflect subjective probabilities; I'm not saying they reflect objective probabilities in the outside world.
- Q: ?
- A: The language used to describe the hypothesis is determined, in the Turing machine model, by the choice of Turing machine. In the Shannon-based model the optimum language for enunciating a hypothesis is determined by one's prior probabilities -- the prior beliefs about hypotheses. And indeed it is language dependent; the lengths of the statements of the hypotheses do depend on these prior beliefs, certainly. And I'd certainly make the point that there would be some observations of data which to, let us say, a physicist would be explicable. The best explanation a physicist would attach to it would have a hypothesis couched in language which would be totally unintelligible to a hunter-gatherer. And the hunter-gatherer might require an explanation, probably a less powerful one, using a totally different language, or might indeed find the data inexplicable completely. There's nothing you could tell the hunter-gatherer which would encode the data for him, shorten the data. Of course, given enough data, even the hunter-gatherer -- you'd tell him all the laws of physics if necessary. But for a limited amount of data, it may be inexplicable to some person who lacks the prior beliefs -- or prior language, it's the same thing -- and yet be quite easily, and well, explicable to somebody else.
- Q: ?
- A: The different Turing machines will require different inputs in order to absorb the same hypothesis. And as I was suggesting, a very simple Turing machine might not even know how to multiply two quantities together, and so for that machine the statement of a hypothesis that later uses multiplication would have to include a description of how to do multiplication, whereas that would not be required for another Turing machine.
- Q: ?
- A: Although not very different, because you can say how to multiply quite briefly really to any Turing machine. It takes a few hundred binary digits.
- Q: . . . your system to some extent complements the entropy principle . . . looking at the entire data, what is the optimum hypothesis we can infer?
- A: Yes, that's right. But indeed what one is saying is really, if one wants to talk about Huffman encoding, what probabilities should I use in doing my Huffman coding? That's what the hypothesis is. And we choose those probabilities, or it may be a Markov model, or whatever, so that the Huffman coding will work as well as it can. It will cause the maximum reduction in the volume of the data. So we are looking for a hypothesis which implies a data distribution of low entropy. In other words, a strong hypothesis. But then the data has got to actually fit the hypothesis or otherwise we'll end up with a long message.
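A minimal Huffman sketch in Python (the probabilities are invented; a hypothesis is, in effect, the distribution we hand to the code builder, and probable symbols get the short codewords):

    import heapq

    def huffman_lengths(probs):
        # Codeword length per symbol for a Huffman code built from given probabilities.
        # Heap entries: (probability, unique tie-breaker, {symbol: depth so far}).
        heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            p1, _, d1 = heapq.heappop(heap)   # merge the two least probable subtrees;
            p2, _, d2 = heapq.heappop(heap)   # every symbol in them gets one bit longer
            merged = {s: l + 1 for s, l in {**d1, **d2}.items()}
            heapq.heappush(heap, (p1 + p2, count, merged))
            count += 1
        return heap[0][2]

    # Invented distribution implied by some hypothesis about the data:
    print(huffman_lengths({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
    # {'a': 1, 'b': 2, 'c': 3, 'd': 3} -- exactly -log2 p for these probabilities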
- Just finally, I had hoped somebody would say here is an obvious paradox which wrecks it. Come on, philosophers are supposed to be good at this.
- Q: ?
- A: Are you saying, can we have an acceptable explanation but [one that is] useless from the point of view of prediction?
- Q': ?
- A: For a given Turing machine we want it to produce an output D. There is no algorithm for finding the shortest input which will cause it to output that D. There will be such an input but there is no algorithm for finding it.
On the other question, is there any explanation, good by these rules, which does not have predictive implications? The answer is no. Anything which is good as an explanation according to the tests I have suggested can in fact be used to make predictions. Specifically, if we think about the Turing machine model, which is easiest, supposing we have indeed observed some data D. And we have found an input I which is a hypothesis followed by something else (data given the hypothesis) which causes the machine to output D. Alright, what we do is set up the machine, we stick that input in so that the input has been entirely read and the output has therefore been produced. And now we stick in some more binary digits and ask what comes out? Well, what comes out are possible future events according to the hypothesis that we've told the Turing machine about. And the more digits which you have to put in to make it output a prediction that the sky will fall, the less probable it is that the sky will fall, according to the hypothesis which the Turing machine now accepts. And if we find that a very small number of extra digits is required to make it output the pseudo-prediction that Australia has gone broke, we'll say that has a very high probability given the hypothesis. So any hypothesis that is accepted by these rules necessarily makes predictions, which may or may not be true. It does not leave you in the same state, as far as predictions are concerned, as you were before the hypothesis was advanced.
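In Shannon terms (a sketch, with invented numbers): if k extra input digits are needed before the machine will output a given prediction, the absorbed hypothesis is assigning that prediction a probability of about 2^-k.

    # Invented numbers: extra input bits needed to make the educated machine
    # emit each candidate prediction, under the hypothesis it has absorbed.
    extra_bits = {"the sky falls": 33, "Australia goes broke": 3}

    for event, k in extra_bits.items():
        print(f"{event}: about {2.0 ** -k:.3g}")  # implied conditional probability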
- Q: ?
- A: I didn't use the word "truth". I don't intend to. Well, I do say things like, we will encode the data in a language which would be efficient if the hypothesis were true. Now I'm not concluding that the hypothesis is true simply because it gives a good explanation. All I'm saying is we'll accept it for the time being until we find cause to switch to a different hypothesis.
- Q: ?
- A: Well that would be my view, but it is not one accepted by everybody.
- Q: ?
- A: Our Turing machine is doing things like that. We can put into the hypothesis assumptions, and then the Turing machine will make deductions from those assumptions which will enable it to output things about the data. The Turing machine is doing deductive work in this, but whether or not that works has got nothing really to do with truth, I think, but I don't really care one way or the other.
- Q: ?
- A: I know [???] . . . never been able to quantify it, and until you can quantify it you don't know if it is nonsense or not, really. And one of the values of this is it is quantifiable: You end up with numbers.
- Q: ? . . . totally random data.
- A: Well quite a lot of the data which was fed in was of dubious relevance. We did not know if it was relevant or not. Most of it, it said, no, it was not relevant. It did not include it in the prediction function. So in this process, if in fact some of the data has got nothing to do with the rest of the data, then the hypothesis really won't attempt to recode that extraneous data at all. It will just be copied, and will have to appear in full in the shortest input because it is not reducible. In practice, a large amount of extraneous data, or irrelevant data, makes the search a lot harder and can confuse the searching processes to make them less reliable. But it does not really affect, well, unless you have an awful lot of it, it does not affect the outcome.
- Q: ?
- A: Obviously one is limited, as always, by the space of models one is prepared to search and if the model does not include the outcome of the comparison between two variables then you're stuck. And indeed this will be a defect of pretty well all current decision-tree programs, that they just look at one variable at a time and they don't look at combinations of variables. They can come close to it, and given enough data they will approximate it by saying, if variable A is less than three then I should see if variable B is greater than three, and if variable A is less than four then I should see if variable B is greater than four. Obviously inefficient, but given enough data you'll get there.
- Q: . . . genetic algorithm . . .
- A: We'd like to do that and include much bigger search spaces -- include larger possible sets of decision functions. We have gone beyond simple trees. Jon Oliver produced a version where the decision function is an arbitrary directed [acyclic] graph. It has disjunctions in it as well as conjunctions. But what you'd really like the thing to learn is: is sine(X) > log(Y)? And I have no idea how to do that, no idea.
- Q: Are you assuming that nature is simple?
- A: Yes. Oh yes. The whole thing here is that, again taking the Turing machine model, we are assuming that the regularities which exist in nature approximate functions which can be easily approximated by a Turing machine. Well so far we have won. It has been true. No, I shouldn't say that. What I should say is that all of the regularities which have in fact been observed and which are now accepted are readily computable by a Turing machine. There may be others which are not but, if so, I suggest we won't find them. And I cannot think of any which we have not found.
- Q: ?
- A: Yes, you are a Turing machine :-) OK, there are obvious things where we've got no idea whether this is true or not. But it is simple for us to recognise the faces of our children, even if they are standing on their heads. Nobody has the faintest idea how to do this in a computer under bad lighting conditions and whatever. That does not mean it is necessarily difficult for a Turing machine; it just means we don't know the right programs yet. And so I think, yes, I am assuming simplicity, and I'm assuming a particular sort of simplicity. It is very difficult to define any other sort of simplicity we can quantify, I think. Now I take your point that we should be the measure of all things, rather than some Turing machine, but at least, at the moment, there is no strong argument for saying that what we do mentally goes beyond what is conceptually possible for a Turing machine to do, as far as I know.
- Q: . . . to find the hypothesis in the first place.
- A: If they [two hypotheses] gave the same message length, then we would have no basis for preferring one to the other and indeed we should continue to entertain both of them as possibilities until more data came up.
- Q: . . . combine the two hypotheses ...
- A: If there is some meaningful way to do that. Certainly you can combine them for predictive purposes.
- Q: ?
- A: You can compare the message lengths, and indeed this is routinely done, and it seems to give sensible results in real cases. The hypotheses being compared can be formally entirely different, with quite different sets of parameters; it does not matter.
- Q: ?
- A: Assuming you are using the same data.
Kevin Korb then wound up proceedings and discussion moved to the tea room.
This seminar was presented by Chris Wallace (CSW) on 10 August 1993, and was transcribed and edited during November 2011 by L. Allison. (CSW also gave a talk on the History of MML on 20/11/2003.)
Selected Reading
- J. Oliver & C. S. Wallace, Inferring Decision Graphs, TR 92/170, Dept. of Comp. Sci. (later Faculty of IT), Monash University; and in Proc. of the 1992 Aust. Joint Conf. on Artificial Intelligence, pp.361-367, 1992.
- C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer Verlag, 2005. [*]
- C. S. Wallace & D. M. Boulton, An information measure for classification, the Computer Journal, 11(2), pp.185-194, 1968. [*]
- -- 1st practical application of information theory to inductive inference and classification, yielding a useful, working computer program (Snob).
- C. S. Wallace & P. R. Freeman, Estimation and inference by compact coding, J. of the Royal Statistical Soc. series B, 49(3), pp.240-265, 1987. [*]
- -- generalized some of the algebra.
- C. S. Wallace & J. D. Patrick, Coding Decision Trees, Machine Learning, 11, pp.7-22, 1993.
- -- introduced the tree code and message length calculations for decision trees; tests demonstrate good generalization and resistance to over-fitting.
[*] These three make good general references for Minimum Message Length (MML) inference, including as they do, one of the first papers (1968), an important generalization (1987, & it's in statistical terms), and the MML book (2005).
Also see all CSW's [publications], and [MML] -- L.A.