False Oracles and SMML Estimators

C. S. Wallace, Proc. Int. Conf. on Information, Statistics and Induction in Science (ISIS), pp.304-316, 1996

(based on TR 128, Dept. Computer Science, Monash U., 1989)

1. Introduction

Recently there has been interest in a general method of statistical estimation and inference variously known as Minimum Message Length (MML) (Wallace & Freeman, 1987) or Minimum Description Length (MDL) (Rissanen 1987). The method is an application of a general approach to inductive inference first proposed by Solomonoff (1964), and closely related to the notion of Algorithmic Complexity due to Kolmogorov (1965). The method is Bayesian, and is based on the idea of choosing an hypothesis and estimates of its parameters which together allow a concise encoding of the data. The method is of great generality, and in those specific estimation problems which have so far been addressed using MML, gives estimates similar, but often superior in some respect such as bias, to those obtained by the Maximum Likelihood (ML) method. In particular, MML handles more gracefully problems where the number of unknown parameters increases with sample size, and can automatically select an appropriate hypothesis from a family of hypotheses of increasing complexity.

This page was created from what I believe to be the 1996 source file, hopefully without adding too many errors, to make this MML paper more readily accessible -- LA 2008

NB. I have used θ for θhat and φ for φhat

The consistency of MDL for a wide range of problems has been proved by Barron & Cover (1991), who also derive the ultimate rate of convergence of the estimates with increasing sample size for a range of problem types. Unfortunately, little is known in general about the behaviour of MML estimates for modest sample sizes. For instance, there have been no general results on the bias or variance of estimates, or on the error rates of the inbuilt hypothesis selection. Wallace and Freeman (op cit) and Rissanen (op cit) have argued that the estimates are close to sufficient, in the sense of capturing all the relevant information available in the data, but the arguments are outside conventional statistical theory.

In this paper, we attempt to show that MML can be expected to have good behaviour for modest samples. Consistency arguments prove good behaviour in the limit of large samples. We have not entirely avoided the need to appeal to a limit, but rather than considering the behaviour as an unbounded volume of data is used in a single estimation, we consider the behaviour as the method is applied to an unbounded set of (possibly unrelated) estimation problems, each using only limited data. That is, we consider the expected consequences of adopting the strategy of using MML for all desired inferences. Sections 2 and 3 discuss a proposed criterion of estimate quality for a single estimation problem. Section 4 extends the criterion to cover the behaviour of a method in a sequence of problems, and relates it to conventional criteria such as bias. These sections also define an “optimal” method according to this criterion. Sections 6, 7 and 8 then show that the MML method approaches this optimum.

Of the several criteria which are commonly used for judging the quality of statistical estimators, none is entirely satisfactory, at least by itself. Some, such as consistency, are applicable only to families of estimators for an unbounded range of sample sizes. Others, such as bias and expected variance, are tied to particular parameterizations of the problem, and may give different judgements if the parameters to be estimated are subjected to functional transformation. Others again, such as sufficiency, can be satisfied by estimators with severe deficiencies. In this paper, we propose yet another criterion and investigate some of the properties of estimators satisfying it. We give a constructive proof that such estimators generally exist, and also show that Strict Minimum Message Length (SMML) estimators (Wallace & Freeman 1987) in some circumstances closely approach satisfaction of the criterion.

The framework of the discussion is Bayesian.

We wish to estimate the value of an unknown parameter θ within a space Θ, given a prior density h(θ) on Θ, an observed datum x within a space X, and a probability density f(x|θ) (x∈X, θ∈Θ). At times we use the joint density p(θ,x) = h(θ)f(x|θ), the marginal density of x, r(x) = ∫ dθ p(θ, x), and the posterior density g(θ|x) = p(θ, x) / r(x). In some parts of the discussion we prefer to treat x as a discrete datum within a countable set X.

2. Empirical Criteria

Definitions

An estimator T is a rule which, given the datum x, yields an estimate of θ. A deterministic estimator is a function T(x) from X into Θ. A stochastic estimator yields an estimate drawn from a distribution with density G(θ|x) over Θ which may depend on x. An estimator is honest if its estimate depends only on the datum x, and not otherwise on the true value of θ. The oracle Tt is the (dishonest) "estimator" whose estimate is always the true value: Tt(x) = θ.

A criterion c(x, θ, φ) is a real-valued function of a datum and two estimates, used to compare the two estimates in the light of the datum. A criterion is fair if c(x, a, b) = - c(x, b, a) for all a, b ∈ Θ. A trial of estimators T0 and T1 using criterion c is the evaluation of c(x, T0(x), T1(x)) for an observed datum x.

The above definitions are intended to provide a framework for the empirical comparison of estimators, i.e. comparison based on the observable behaviour of estimators rather than on their structure. Of course, the outcome of a single trial using some criterion is a function of the value of x realized in the trial, and so the preference for one estimator over the other will be subject to the accidents of the particular trial. Later, we will discuss a class of estimation problems for which a single trial might be sufficient to determine a preference reliably. However, we now retreat somewhat from the ideal of a purely empirical comparison, and define the expected outcome of a trial.

The expected value of a trial of estimators T0 and T1 using criterion c is the expected value of c(x, T0(x), T1(x)), written c(T0, T1), where the expectation is taken over the joint distribution of x and θ. Where one or both of the estimators are stochastic, the expectation is also taken over their value distributions.

For T0, T1 honest and deterministic, the estimators are independent of θ and the expectation is simply over the marginal distribution of x.
c(T0, T1) = ∫ dx r(x) c(x, T0(x), T1(x))
For T0, T1 honest, T1 stochastic, T0 deterministic,
c(T0, T1) = ∫∫ dx dφ r(x) G1(φ|x) c(x, T0(x), φ)
and for T0, T1 honest and stochastic
c(T0, T1) = ∫∫∫ dx dθ dφ r(x) G0(θ|x) G1(φ|x) c(x, θ, φ)
For the special case that one estimator, T0 say, is the oracle Tt and T1 is deterministic,
c(Tt, T1) = ∫∫ dx dθ h(θ) f(x|θ) c(x, θ, T1(x))
and for T0 the oracle, T1 stochastic
c(Tt, T1) = ∫∫∫ dx dθ dφ h(θ) f(x|θ) G1(φ|x) c(x, θ, φ)
Note that in all cases, for c fair, c(T0, T1) = - c(T1, T0).

3. False Oracles

A false oracle is an estimator Tf such that for all fair criteria c, and for Tt the oracle, c(Tf, Tt) = 0.  That is, no fair criterion will be expected to prefer the oracle to a false oracle, or vice versa.

3.1 Theorem:

For Tf a false oracle, Tt the oracle, T1 honest and c a criterion not necessarily fair, c(Tf, T1) = c(Tt, T1).
 
Proof: Consider the function
c*(x,a,b) = c(x,a,T1(x)) - c(x,b,T1(x)),     a,b ∈ Θ
where, if T1 is stochastic, we have the same realization of T1(x) in both places. Since T1 is honest, c* is not a function of θ. Also, c*(x,a,b) = -c*(x,b,a). Hence c* is a fair criterion, so by definition of Tf,
c*(Tt, Tf) = 0
But
c*(Tt, Tf) = E{c(x, Tt(x), T1(x)) - c(x, Tf(x), T1(x))}
= E{c(x, Tt(x), T1(x))} - E{c(x, Tf(x), T1(x))}
= c(Tt, T1) - c(Tf, T1)    --Q.E.D.

This theorem shows that, while we cannot in practice compare an estimator with the oracle, since the oracle is unavailable, we can instead compare the estimator with a false oracle and get the same result in expectation.

3.2 Corollary: With the same conditions as for the theorem,
c(T1, Tf) = c(T1, Tt)
Also,
c(Tf, Tt) = c(Tf, Tf) = c(Tt, Tf)

3.3 Corollary: With the same conditions as for the theorem, the value on a trial of c(x, Tf(x), T1(x)) is as good an estimate of c(Tt, T1) as is the value on a trial of c(x, Tt(x), T1(x)) = c(x, θ, T1(x)).

Proof: Define ck(x, θ, φ) = (c(x, θ, φ))^k
Then ck is a criterion, so ck(Tf, T1) = ck(Tt, T1)
But ck(T0, T1) is the kth moment of the sampling distribution of c(x, T0(x), T1(x)).
Hence c(x, Tf, T1) and c(x, Tt, T1) have the same sampling distribution.

3.4 Theorem:

False oracles exist in sufficiently regular cases.
 
The proof is constructive:
Let Tp be the stochastic estimator described by the density
Gp(θ|x) = g(θ|x) = h(θ) f(x|θ) / r(x) = p(θ,x) / r(x).
For any criterion c,
c(Tp, Tt) = ∫∫∫ dx dθ dφ h(θ) f(x|θ) g(φ|x) c(x,φ,θ)
= ∫∫∫ da dx db p(a,x) p(b,x) c(x,b,a) / r(x),     (a = θ ∈ Θ, b = φ ∈ Θ)
 
c(Tt, Tp) = ∫∫∫ dx dθ dφ h(θ) f(x|θ) g(φ|x) c(x,θ,φ)
= ∫∫∫ db dx da p(b,x) p(a,x) c(x,b,a) / r(x),     (b = θ, a = φ)
 
The expressions for c(Tp, Tt) and c(Tt, Tp) differ only in the order of integration. Hence, for sufficiently regular h, f and c, they are equal,
c(Tt, Tp) = c(Tp, Tt)
But, if c is fair, c(Tt, Tp) = - c(Tp, Tt)
Hence c(Tp, Tt) = 0 for all fair c, so Tp is a false oracle.

3.5 The Merit of False Oracles

False oracles deserve to be considered good estimators in that no fair criterion is expected to show a preference for the true parameter value over the estimate of a false oracle. Indeed, the above results show that no test based only on the information available to the estimator can reliably distinguish an oracle from a false oracle. With the more conventional criteria applied to estimators (some of which will be further discussed below) it is often the case that an estimator satisfying one criterion (e.g. lack of bias) will not satisfy others (e.g. maximum likelihood). However, for the empirical criteria discussed here, we have shown that there exists a class of estimators which give results as good as the true values by all empirical tests, since c(Tf, Tt) = c(Tt, Tf) for all c, fair and otherwise.

Note that it is not true that all criteria, or even all fair criteria, will prefer a false oracle to an honest estimator which is not a false oracle. Suppose x and θ are commensurable, and that x is distributed as N(θ,1). The fair criterion

c(x, θ, φ) = (x - θ)^2 - (x - φ)^2

will always prefer the estimator θ = x to a false oracle. It will also, of course, always prefer θ = x to the oracle, whose estimate is the true value, so the import of its preference is dubious.

The construction of the false oracle Tp amounts to a rule for characterizing the posterior distribution of θ by a value (estimate) drawn randomly from the posterior. It may seem a strange rule compared with, say, choosing the mean, median or mode of the posterior, but it has the advantage of being invariant under arbitrary measure-preserving transformations of the parameter space.
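As an illustrative sketch (assuming, for concreteness, the model x ~ N(θ, 1) with a N(0, 1) prior on θ, so that the posterior is N(x/2, 1/2) and Tp can be sampled directly), the result of theorem 3.4 can be checked by simulation: the fair criterion c(x, a, b) = f(x|a) - f(x|b) shows no expected preference between the oracle and Tp, but does distinguish the oracle from the maximum-likelihood estimator θ = x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed conjugate model: theta ~ N(0, 1), x | theta ~ N(theta, 1),
# so the posterior is theta | x ~ N(x/2, 1/2).
n = 200_000
theta = rng.normal(0.0, 1.0, n)            # true parameter values (the oracle's "estimates")
x = rng.normal(theta, 1.0)                 # observed data

def f(x, t):
    """Likelihood density f(x | t) of the N(t, 1) model."""
    return np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2.0 * np.pi)

def c_fair(x, a, b):
    """A fair criterion: likelihood difference, antisymmetric in (a, b)."""
    return f(x, a) - f(x, b)

T_p = rng.normal(x / 2.0, np.sqrt(0.5))    # false oracle Tp: one random draw from each posterior
T_ml = x                                   # maximum-likelihood estimator

print("E c(Tp,  Tt) ~", c_fair(x, T_p, theta).mean())   # close to 0: no preference either way
print("E c(Tml, Tt) ~", c_fair(x, T_ml, theta).mean())  # clearly positive: ML is distinguishable
```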

4. Sequence Trials

We now attempt to define a class of estimation problems, and associated trials, which allow a comparison between our ‘empirical’ criteria and some more conventional ones.

A sequence problem is an ordered sequence of independent instances of the same basic problem. For a basic problem described by Θ, X, h(θ), f(x|θ), a sequence problem of length J has parameter space Θ^J and data space X^J. The data are an ordered sequence of values {xj, j=1 ... J} from which an estimator must form an estimate which is an ordered sequence {θj, j=1 ... J}, where θj is an estimate of the true value θj of the parameter in the jth instance of the basic problem. The prior density for the true sequence parameter {θj, j=1 ... J} is ∏j h(θj) and the conditional data density is ∏j f(xj|θj). Note that we assume the instances are wholly independent: data from early instances do not lead to revision of the prior or conditional data density in later instances.

An estimator T for a sequence is a function from X^J into Θ^J. We do not require that the estimator have the form of J instances of the same function applied in turn to {xj}. The ‘black box’ model of an estimator is not required to produce any output until it has the entire data sequence, so θj may depend on all components of {xk, k=1 ... J}, not just xj.

In a sequence trial of estimators T0, T1 using criterion c, c is a function c({xj, θj, φj, j=1...J}) which operates on the data sequence and the two estimate sequences to yield a single value, which need not be expressible as a combination of J applications of a simpler function.

Since a sequence problem and sequence trial are just special cases of the problems and trials defined earlier, all definitions and results carry over, with only the notational difference that now the symbols Θ, X, h( ) and f( ) refer to the component basic problems of the sequence rather than to the sequence problem as a whole. Note that, since the basic problems are independent, the posterior density of {θj, j=1...J} given {xj, j=1...J} is ∏j g(θj|xj). Thus the false oracle Tp of theorem 3.4, which picks an estimate sequence randomly from this posterior density, may be considered to choose randomly and independently a value from each basic posterior g(θj|xj).

4.1 Sequence Criteria:

Using sequence trials, we can define empirical criteria corresponding to some conventional criteria. For D a function from Θ to the real line, v∈Θ, δ a small constant and J large, define

Dv({xj, θj, φj, j=1...J}) = Σj∈S D(θj) / NS

where S is the set {j; |φj - v| < δ} and NS is the cardinality of S. Dv is the average value of D(θj) over those components of the sequence trial in which φj approximately equalled v. Its expected value, Dv(T0, T1), approximates the expected value of D(θ) given φ = v.

Letting Dv be the function Bv(θ) = θ - v, call Bv(T0, T1) the bias of T0 relative to T1. It is, of course, a function of v. For T1 the oracle, Bv(T0, Tt) is simply the expected value of the estimate made by T0 minus the true value, given that the true value is v. Hence Bv(T0, Tt) corresponds to the usual definition of the bias of T0. While Tt is unavailable, we have Bv(T0, Tf) = Bv(T0, Tt) for Tf a false oracle. Hence the bias of T0 can be estimated by measuring the bias of T0 relative to Tf in a sequence trial. Letting T0 be the false oracle Tp,

Bv(Tp, Tt) = Bv(Tp, Tp) = ∫∫ dx dθ f(x|v) g(θ|x) (θ - v)

This is not, in general, zero. However, note that Bv(Tt, Tp) = Bv(Tp, Tt). The ‘bias’ of the oracle relative to a false oracle Tf equals the bias of Tf, so empirical measurement of relative bias does not serve to distinguish false and true oracles. Also, note that Bv(Tt, Tt) ≠ Bv(Tt, Tf), since Tt is not honest. Of course, Bv(Tt, Tt) = 0 for all v.

Similarly, letting Dv be the function Vv(θ) = (θ - v)^2, the usually-defined variance of T0 is Vv(T0, Tt), and can be empirically measured by the value of Vv in a sequence trial of T0 and a false oracle Tf. For the false oracle Tp,

Vv(Tp, Tt) = Vv(Tp, Tp) = ∫∫ dx dθ f(x|v) g(θ|x) (θ - v)^2

This expression shows that the variance of a false oracle is the variance of the convolution over x of f and g. For h(θ) slowly varying, it is about twice the variance of a conventional Bayes estimator with quadratic loss function.
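A short simulation of such a sequence trial (a sketch assuming a normal likelihood with an effectively flat prior over the range of interest, so that g(θ|x) ≈ N(x, σ^2)) reproduces both statements: the relative bias is near zero and the measured variance is about 2σ^2, twice that of the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed basic problem: x ~ N(theta, sigma^2) with an effectively flat prior,
# so the posterior g(theta | x) is approximately N(x, sigma^2).
sigma, v, J = 1.0, 3.0, 500_000

x = rng.normal(v, sigma, J)                 # data from instances whose true parameter is v
theta_p = rng.normal(x, sigma)              # false oracle Tp: one draw from each posterior

print("B_v(Tp, Tt) ~", (theta_p - v).mean())         # relative bias: near 0
print("V_v(Tp, Tt) ~", ((theta_p - v) ** 2).mean())  # near 2*sigma^2, twice the posterior-mean variance
```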

5. SMML Estimators

Strict Minimum Message length (SMML) estimation has been described elsewhere (Wallace & Freeman 1987). We use the same Bayesian framework defined by Θ, X, h(θ), f(x|θ) etc. as before, but initially it is convenient to consider x to be a discrete variable, so X is a countable set, and f(x|θ) and r(x) are probabilities rather than densities.

The SMML estimator θ = m(x) has a range Θ* = {θi, i=1, 2, ...} = {θ : θ = m(x), x∈X} which is a countable subset of Θ. Define Xi = {x : m(x) = θi} and qi = q(θi) = ΣXi r(x) for all i. Then m( ) minimizes

I1 = - ΣX r(x) log[ q(m(x)) f(x|m(x)) ]
= - Σi [ qi log qi + ΣXi r(x) log f(x | θi) ]

The estimator is derived by considering the expected length of a message encoding x and decodeable given knowledge of X, Θ, f(x|θ) and h(θ). By standard coding theory, the expected length cannot be less than I0 = - ΣX r(x) log r(x), where the base of logarithms depends on the coding alphabet, e.g., 2 for binary coding. However, we consider a message having the form of a statement of an estimate θ, followed by an encoding of x using a code which would be optimal (i.e. have least expected length) were the estimate θ the true parameter value. Using optimal coding for the statement of θ, the length of the message encoding x is L(x) = - log q(m(x)) - log f(x|m(x)). (Actually, the length of a real message might exceed L(x) by as much as one symbol, since real messages must have integral lengths. However, we neglect this rounding-up effect.) The first term of L(x) gives the length of the encoding of the estimate θ = m(x). Note that q(θi) is the unconditional probability that θi will be the estimate. The second term is the length of the encoding of x using a code optimal if θ = m(x). I1 is the expected value of L(x).

Since I0 is independent of the estimator, the SMML estimator minimizes

I1 - I0 = - ΣX r(x) log [q(m(x)) f(x|m(x)) / r(x)]

If the discretization of the data is so fine that we can consider X as a continuum, we can revert to the framework of the earlier sections, and the SMML estimator minimizes

I1 - I0 = - ∫X r(x) log[q(m(x)) f(x|m(x)) / r(x)] dx,
where
q(θi) = ∫Xi r(x) dx.

Θ* remains a countable subset of Θ, but I1 and I0 are not individually definable.

Minimization of I1 - I0 requires a compromise in the size of the region Xi within which all datum values lead to the same estimate θi. Large regions have large values of qi, leading to a short statement of θi, but if the region is too large, it will contain values of x for which f(x|θi) is much less than f(x|θm) where θm is the maximum-likelihood estimate for x, so the expected length of the statement of x will be greater than for smaller regions.
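The compromise can be seen numerically in a small sketch (assuming, for illustration, a binomial problem with N = 20 trials and a uniform prior, and comparing a few hand-chosen partitions of X rather than the true SMML partition): a single region and one region per datum value both give a longer expected message than regions of intermediate size.

```python
import numpy as np
from scipy.stats import binom

# Assumed example: x = number of successes in N = 20 trials, uniform prior on theta,
# so the marginal r(x) is uniform over {0, ..., N}.
N = 20
xs = np.arange(N + 1)
r = np.full(N + 1, 1.0 / (N + 1))

def I1(groups):
    """I1 = -sum_x r(x) [log q(m(x)) + log f(x | m(x))] for a given partition of X."""
    total = 0.0
    for g in groups:
        g = np.asarray(g)
        q = r[g].sum()                                 # coding probability of this group's estimate
        est = (r[g] * g).sum() / (N * r[g].sum())      # estimate maximising sum_Xi r(x) log f(x|est)
        est = min(max(est, 1e-9), 1.0 - 1e-9)
        total -= (r[g] * (np.log(q) + binom.logpmf(g, N, est))).sum()
    return total

partitions = {
    "one region":          [list(range(N + 1))],
    "regions of about 4":  [list(range(i, min(i + 4, N + 1))) for i in range(0, N + 1, 4)],
    "one region per x":    [[int(x)] for x in xs],
}
for name, p in partitions.items():
    print(f"{name:20s} I1 = {I1(p):.3f} nats")
```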

We now show that in a sequence of basic estimation problems such as was described in section 4, the SMML estimator approximates a false oracle. Two cases are considered. In the first, which applies to basic problems with approximately quadratic log-likelihoods, a fairly exact analysis is possible. The second presents a less exact treatment of a more general sequence problem. Throughout, j indexes the component problems of a sequence of J basic problems, and i indexes the members of Θ*, the range of an SMML estimator.

6. A Simple Case

Assume that for all j, f(xj|θj) = (1 / (√(2 π) σ)) e^(-(xj - θj)^2 / (2σ^2)), σ known, and h(θj) = a,  a constant over a large range. If SMML estimation were applied separately to each problem of the sequence, we would obtain for each j an estimator θj = m(xj) with the same form for every j. We drop the j suffix and consider one estimator m(x). Its range is simply a set of equally-spaced values of θ, with spacing s say. The SMML estimator maps x into the member of the set most nearly equal to x. Since, over the range of interest, r(x) = h(θ) = a, we have that for every θi ∈ Θ*, Xi is the interval {x: θi - s/2 < x < θi + s/2} and qi = as.

The estimator minimizes the expected value of - log q(m(x)) - log f(x|m(x)) + log r(x). Since all of the regions Xi have the same pattern, the expectation can be taken over any one of them, giving

I1 - I0
= - log(as) + (1/2) log(2 π) + log σ + (1/s) ∫_{θi-s/2}^{θi+s/2} dx (x - θi)^2 / (2 σ^2) + log a
= - log s + (1/2) log(2 π) + log σ + s^2 / (24 σ^2)
Minimization with respect to s gives
s^2 = 12 σ^2,
I1 - I0 = (1/2) log (2 π e / 12).
The result of applying this estimator to all J problems of the sequence is therefore an expected message length increment for the whole sequence given by
I1 - I0 = (1/2) J log(2 π e / 12).
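As a numerical check of the single-problem minimisation (a sketch using natural logarithms and σ = 1), the per-problem excess can be evaluated on a grid of spacings s:

```python
import numpy as np

sigma = 1.0
s = np.linspace(0.5, 10.0, 2000)

# Per-problem excess I1 - I0 as a function of the spacing s (natural logs), from the text above.
excess = -np.log(s) + 0.5 * np.log(2 * np.pi) + np.log(sigma) + s ** 2 / (24 * sigma ** 2)

i = excess.argmin()
print("optimal s  ~", s[i], "  theory: sigma*sqrt(12) =", sigma * np.sqrt(12.0))
print("min excess ~", excess[i], "  theory: 0.5*log(2 pi e / 12) =", 0.5 * np.log(2 * np.pi * np.e / 12.0))
```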

This result is not the best that can be achieved.

An SMML estimator for the whole sequence problem minimizes the expected value of
- log q(m(x)) - Σj log f(xj|mj(x)) + log r(x)
where now the estimator m(x) maps the sequence or vector of data x = (x1, x2 ... xj ...) into a vector
θ = (θ1, θ2 ... θj ...) = (m1(x), m2(x), ..., mj(x), ...),
and r(x) = a^J.
Thus the SMML estimator minimizes the expected value of
L(x) = - log q(m(x)) + (J/2) log (2 π) + J log σ + (m(x) - x)^2 / (2 σ^2) + J log a

Its range Θ* is a countable set of vectors or points in J-space, {θi}. Assume these points form a lattice. Then q(θi) = qi is the same for all i, and for all i the region Xi within which m(x) = θi is the set of x values closer to θi than to any other member of Θ*, i.e., the Voronoi region of θi. For Θ* a lattice, the Voronoi regions of all its members have the same size and shape. Let s be the volume of Xi for all i. Then qi = s a^J for all i. The expectation of L(x) can be taken over any Voronoi region, giving

I1 - I0 = - log (s a^J) + (J/2) log (2 π) + J log σ + E(x - θi)^2 / (2 σ^2) + J log a.

The expectation E(x - θi)^2 over the Voronoi region of θi depends on the volume s and the shape of the region. To separate these effects, write E(x - θi)^2 = J kJ s^(2/J), where kJ depends only on the shape of Xi, i.e. the geometry of the lattice. Then

I1 - I0 = - log s + (J/2) log (2 π) + J log σ + J kJ s^(2/J) / (2 σ^2)
Minimization with respect to s gives
s^(2/J) = σ^2 / kJ,
I1 - I0 = (J/2) log (2 π e kJ)

Minimization with respect to the geometry of the lattice requires choosing a lattice which, for a given density of points, minimizes the mean squared distance of all points from the nearest lattice point. Lattices chosen in this way are called ‘quantizing lattices’.

Use of the single-problem SMML estimator separately for each problem is equivalent to choice of a rectangular lattice, for which kJ = 1/12 for all J. However, for J > 1, lattices exist with kJ < 1/12. For instance, for J = 2, a lattice with hexagonal rather than square Voronoi regions has kJ = 0.080175... < 1/12 and hence leads to a smaller value for I1 - I0. Note that in the resulting SMML estimator, θ1 is a function of both x1 and x2. If x1 is held constant and x2 is changed, θ1 may assume in turn two or more different values, all however close to x1. Similarly, θ2 is a function of both x1 and x2, but is always close to x2.
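A Monte Carlo sketch (assuming the standard hexagonal lattice generated by (1, 0) and (1/2, √3/2)) confirms that its constant k2 is below the rectangular value 1/12:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hexagonal lattice generated by b1 = (1, 0) and b2 = (1/2, sqrt(3)/2); cell area s = sqrt(3)/2.
b = np.array([[1.0, 0.0], [0.5, np.sqrt(3.0) / 2.0]])
s = abs(np.linalg.det(b))

# Lattice points near the fundamental cell (enough to contain every nearest neighbour).
ij = np.array([(i, j) for i in range(-2, 4) for j in range(-2, 4)])
pts = ij @ b

n = 100_000
u = rng.random((n, 2)) @ b                                           # points uniform over one cell
d2 = ((u[:, None, :] - pts[None, :, :]) ** 2).sum(-1).min(axis=1)    # squared distance to nearest lattice point

k2 = d2.mean() / (2.0 * s)       # from E(x - theta_i)^2 = J * k_J * s^(2/J) with J = 2
print("hexagonal k2 ~", k2, "  rectangular 1/12 =", 1.0 / 12.0)
```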

For large J, the optimum quantizing lattice is not known. However, Zador (1982) has shown that
Γ(J/2+1)^(2/J) Γ(2/J+1) / (J π) > kJ > Γ(J/2+1)^(2/J) / ((J+2) π)

The upper bound is the value for a random set of points (not a lattice) and the lower bound would be achieved only if the Voronoi region were a J-sphere.

Using Stirling’s approximation in the form
log Γ(y+1) = (y + 1/2) log y - y + (1/2) log(2 π) + O(1/y),
 
log kJ > (2/J) [(J/2 + 1/2) log(J/2) - J/2 + (1/2) log(2 π)] - log(J+2) - log π + O(1/J^2)
> - log(2 π e) + (log J + log π - 2) / J + O(1/J^2)
Similarly, and using log Γ(1+y) = - γ y + O(y^2), where γ is Euler’s constant,
log kJ < - log(2 π e) + (log J + log π - 2 γ) / J + O(1/J^2)

Hence (1/2) log(J π) - γ > I1 - I0 > (1/2) log(J π) - 1,   to order 1/J.

We see that, whereas use of a rectangular lattice, i.e. repeated use of the one-problem SMML estimator, gives I1 - I0 = (1/2) J log (2 π e / 12) proportional to J, the optimum SMML estimator has at worst I1 - I0 = (1/2) log (J π) - γ.
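The bounds can be tabulated directly (a sketch using SciPy's gammaln and natural logarithms), showing how the optimal-lattice excess grows like (1/2) log(J π) while the rectangular-lattice excess grows in proportion to J:

```python
import numpy as np
from scipy.special import gammaln

# Zador's bounds on k_J, converted to bounds on I1 - I0 = (J/2) log(2 pi e k_J),
# compared with the asymptotic forms above and with the rectangular lattice.
euler_gamma = 0.5772156649015329

for J in (1, 2, 10, 100, 1000):
    log_k_low = (2.0 / J) * gammaln(J / 2.0 + 1.0) - np.log((J + 2.0) * np.pi)                    # sphere bound
    log_k_high = (2.0 / J) * gammaln(J / 2.0 + 1.0) + gammaln(2.0 / J + 1.0) - np.log(J * np.pi)  # random points
    low = (J / 2.0) * (np.log(2.0 * np.pi * np.e) + log_k_low)
    high = (J / 2.0) * (np.log(2.0 * np.pi * np.e) + log_k_high)
    rect = (J / 2.0) * np.log(2.0 * np.pi * np.e / 12.0)
    print(f"J={J:5d}  I1-I0 in ({low:7.3f}, {high:7.3f})"
          f"  asymptotic ({0.5*np.log(J*np.pi)-1.0:6.3f}, {0.5*np.log(J*np.pi)-euler_gamma:6.3f})"
          f"  rectangular {rect:8.3f}")
```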

Although the shape of the optimum Voronoi region for large J is unknown, the close approximation of the optimum lattice to the behaviour of spherical regions strongly suggests that the optimum Voronoi region is very nearly spherical for large J.

Making this assumption, the equation s^(2/J) = σ^2 / kJ for its volume s gives for its radius ρ
ρ^J π^(J/2) / Γ(J/2 + 1) = s = σ^J / kJ^(J/2)

Using the lower bound as an approximation for kJ, ρ = σ √(J+2)

If we consider one component of θ and x, i.e. one estimation problem of the sequence, the estimate θj is a function of all components of x, not just of xj. If the other components are thought of as random variables, all that is known of x is that it lies in the subspace defined by xj, and all that is known of θ is that it lies somewhere in a sphere of radius ρ centred on x. Thus, θj can be regarded as a random variable generated by the projection of a random point in this sphere onto the j axis. The density of θj is therefore proportional to

{ρ^2 - (θj - xj)^2}^((J-1)/2) = {(J+2)σ^2 - (θj - xj)^2}^((J-1)/2)

For any J, this distribution has mean xj and standard deviation σ. For large J, it closely approximates N(xj, σ2), which is the posterior density of θj given xj.
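This can be verified by simulation (a sketch that samples points uniformly within a J-sphere of radius σ√(J+2) and projects them onto one axis):

```python
import numpy as np

rng = np.random.default_rng(3)

sigma, n = 1.0, 200_000
for J in (2, 5, 50):
    rho = sigma * np.sqrt(J + 2.0)
    g = rng.normal(size=(n, J))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform directions on the J-sphere
    radius = rho * rng.random(n) ** (1.0 / J)          # radii with density proportional to r^(J-1)
    proj = (u * radius[:, None])[:, 0]                 # projection onto one axis
    print(f"J = {J:3d}: std of projection ~ {proj.std():.4f}   (sigma = {sigma})")
```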

The SMML estimator thus produces estimates for the individual problems of the sequence which have approximately the behaviour of values chosen at random from the individual posterior densities. That is, it approximates the false oracle Tp.

The results for this simple case are readily generalized to any sequence of estimation problems in which the log likelihood for the sequence of data has an approximately quadratic behaviour near its maximum, and where the prior for the parameter sequence is slowly-varying. It is not necessary that all basic estimation problems in the sequence be instances of the same problem, or even that they require estimation of the same number of parameters. The results depend only on the assumption of slowly-varying priors and an approximately quadratic log-likelihood.

7. A More General Case

We now describe a construction for a near-optimal encoding of a data sequence which does not require smooth or quadratic behaviour in the log prior and log likelihood functions. As before, we consider a sequence of J independent instances of the same basic problem described by Θ, X, h(θ) and f(x|θ). We assume x is a discrete variable.

In this construction, each estimate θj will be selected from a set Θ* of equally-spaced values {θi, i=1, 2 ...} whose spacing δ is small compared with the expected estimation error of each θj. Index i is used for indexing members of Θ*, and index j for indexing problems of the sequence. We assume Σi δ h(θi) = 1.

The message is constructed as J segments. Basically, each segment comprises two parts. The first states an estimate θj of the jth parameter, selecting θj from Θ*, and has length - log(δ h(θj)). The second encodes xj using a code which would be optimal were the estimate θj the true parameter value, and has length - log f(xj|θj). The total length lj of segment j is

lj = - log(δ h(θj) f(xj|θj)) = - log(δ p(θj, xj))

However, the segments are combined into a complete message in such a way that the length of the message is less than Σj lj.

As δ is small compared with the expected estimation error of θj, there will be several members of Θ*, separated by multiples of δ, any of which would give a near-minimal value of lj. The basis of the construction is use of the freedom to choose any one of several more-or-less equally good values for θj so as to convey information about the following segment, j+1.

For definiteness, assume binary coding, so logarithms are to base 2.

The construction proceeds backwards, in decreasing order of j. Let MJ+1 be the empty string, of length LJ+1 = 0. Suppose that a string Mj+1 of length Lj+1 has been constructed encoding estimates and data for problems j+1, ..., J. To form a string Mj covering in addition problem j, proceed as follows:

(a) Using xj, compute r*(xj) = δ Σi h(θi) f(xj|θi) and the distribution uij = δ h(θi) f(xj|θi) / r*(xj) over Θ*.
(b) Construct a Huffman or similar code Cj for Θ* based on the distribution {uij}, so that the word wij encoding θi has length close to - log uij.
(c) Find the word of Cj which occurs as a prefix of Mj+1, appending randomly-chosen digits to Mj+1 if necessary until some word of Cj is a prefix. Let tj be the number of digits appended (tj = 0 if none are needed), and let wkj be the matching word. Select θk as the estimate θj.
(d) Delete the prefix wkj from the (possibly extended) string Mj+1, leaving a string M⁻j+1 of length L⁻j+1.
(e) Construct the segment Sj. Its first part encodes θj = θk using a Huffman or similar code for the distribution δ h(θi) over Θ*, and so has length - log(δ h(θk)). Its second part encodes xj using a code optimal were θk the true parameter value, and so has length - log f(xj|θk).
(f) Prepend Sj to M⁻j+1, giving the string Mj of length Lj = L⁻j+1 + lj.

Steps (a) to (f) are repeated until the final complete message M1 has been formed.

We must show that M1 can be decoded using knowledge only of h and f, and of the construction algorithm. Decoding proceeds in order of increasing j.

The segment S1 for problem 1 appears in full as the prefix of M1. It is easily decoded. Its first part is a Huffman or similar coding of θ1 based on the known distribution δ h(θi) over Θ*. Having decoded θ1, the second part can be decoded, as it is a Huffman or similar coding of x1 based on the now-known distribution f(x|θ1) over X.

Having decoded S1, we can delete it from the front of M1, leaving M⁻2.

Now, knowing x1, we can compute the distribution ui1 over Θ*, and hence reconstruct the code C1. Suppose the decoded estimate θ1 has index k in Θ*. Then we can select wk1 from C1, and prepend it to M⁻2 giving M2.

Since S2 appears in full as the prefix of M2, it too can now be decoded, and so on until the entire sequence of data (and estimates) has been recovered.

At the end of the decoding process, we will be left with a string M⁺J+1 comprising all the Σj tj random digits appended in step (c) of the coding algorithm.

It will be useful in what follows to imagine that the coding process began with these randomly-chosen digits already in place, i.e. with MJ+1 = M⁺J+1, LJ+1 = Σj tj. The result of the coding is then unchanged, but in step (c) it would never be necessary to append additional digits, and at every stage of the coding process we have L⁻j+1 = Lj+1 + log ukj.

For all j,
Lj = L⁻j+1 + lj = Lj+1 + log ukj - log(δ p(θk, xj))
where, of course, k is a function of j.

Now, ukj = δ h(θk) f(xj|θk) / r*(xj) = δ p(θk, xj) / r*(xj)

Hence,
Lj = Lj+1 - log r*(xj)
L1 = LJ+1 - Σj log r*(xj)
We now aim to show that the sequence of estimates yielded by this construction is close to an SMML estimate for the sequence. To do so, we will try to show that I1, the expected value of L1 over the marginal distribution of x, is not much greater than the minimum possible value
I0 = - ΣX r(x) log r(x).
In a long sequence, datum value x may be expected to occur with relative frequency r(x). Hence the expected value of the second term in L1 is
- J ΣX r(x) log r*(x)

Now, r*(x) = δ Σi p(θi, x) = r(x) δ Σi g(θi|x) where the sum is over a set Θ* of points equally spaced in Θ at intervals of δ. For sufficiently regular g( ), δ Σi g(θi|x) approximates the integral ∫ dθ g(θ|x) = 1 with an error which depends on the function g and the size of δ.

Further, r*(x) = δ Σi h(θi) f(x| θi)

For sufficiently regular distributions,
ΣX r*(x) = δ Σi h(θi) ΣX f(x| θi) = δ Σi h(θi) = 1
so r*(x) is a proper distribution. Hence,
- ΣX r(x) log r*(x) ≥ - ΣX r(x) log r(x)
and
I1 - I0 = E(LJ+1) - J ΣX r(x) log (r*(x) / r(x))
= E(LJ+1) - J ΣX r(x) log Σi δ g(θi|x)

The amount by which the sum Σi δ g(θi|x) departs from one may, for regular g( ), be expected to be proportional to some small power of δ. Then we can write

I1 - I0 = E(LJ+1) + J A δ^n

to order δ^n, where A is a constant depending on r( ) and g( ) but not on J.

Now consider E(LJ+1) = E(Σj tj). In the first stage of the message construction, forming MJ, as many random digits will be required in step (c) as there are digits in the word wkJ. In effect, wkJ is selected from CJ by finding a match between a word of CJ and the prefix of a long random binary string. The probability that a word of length l will match is just 2^(-l). The length of wiJ is - log uiJ. Hence the expected value of tJ is the expected value of - Σi uiJ log uiJ.

Now uiJ is δ h(θi) f(xJ|θi) / r*(xJ), or approximately δ g(θi|xJ). Thus the expected value of tJ is approximately - log δ + B, where B is a constant approximating the expected entropy of the posterior distribution g( ).

It is unlikely that tj ≠ 0  (j<J). For tJ-1 ≠ 0, we require that MJ of length LJ = - log(δ p(θJ, xJ)) be so short that we cannot find in it a prefix matching a word of CJ-1. The length of a word wi,J-1 of CJ-1 is approximately - log(δ g(θi|xJ-1)) so we will fail to find a prefix when g(θi|xJ-1) < p(θJ,xJ). Since g(θ|x) = p(θ,x) / r(x), we expect the ratio g(θi|xJ-1) / p(θJ, xJ) to be of order 1/r(x), which is typically much greater than 1. We therefore approximate E(Σj tj) by E(tJ).

Thus, to order δ^n,
I1 - I0 = B - log δ + J A δ^n
Choosing δ to minimise I1 gives δ^n = 1/(nJA),
I1 - I0 = B + (1/n) + (1/n) log (nJA).

Thus we find that this construction, like the more exact but less general construction of section 6, gives a value for I1 - I0 of order log J for sufficiently regular h( ) and f( ).

As this value for I1 - I0 is less than is achieved by the use of a single-problem SMML estimator on each basic problem of the sequence, which gives I1 - I0 of order J, the construction must be close to achieving the shortest possible expected length for a message using estimates, and so its estimate sequence must be similar to that given by the SMML sequence estimator.

It is worth remarking that this construction, like that of section 6, does not require that every problem of the sequence have the same form.

We now show that the construction imitates a false oracle. Mj+1 is a near-optimal binary coding of the data sequence {xj+1, ..., xJ}. When random variates from some distribution are encoded as a binary string using an optimal code for that distribution, the resulting binary string is a realization of a Bernoulli sequence with p=0.5. Thus Mj+1 (extended if necessary by appended random digits) is such a realization. The probability that a given binary string of length l is a prefix of Mj+1 is 2^(-l). So the probability that step (c) will select θj = θi is 2^(-length(wij)) = uij.

But uij = δ p(θi, xj) / r*(xj) and r*(xj) approximates r(xj).

Hence uij approximates δ g(θi|xj). If we consider the data sequence {xj+1, ..., xJ} to be a realization of their joint marginal distribution, the probability that this realization will, for some xj, cause the selection of the value θi as the estimate θj of θj, is just the posterior probability of θi given xj. In this sense, θj is a random selection from the (quantized) posterior distribution of θj given xj, as is made by the false oracle Tp.
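The selection mechanism of step (c) can be illustrated in isolation (a sketch assuming a small dyadic 'posterior' so that the Huffman word lengths are exactly - log2 ui): the code word found as a prefix of a stream of random bits is selected with probability 2^(-length), i.e. with the corresponding posterior probability.

```python
import heapq, itertools
import numpy as np

rng = np.random.default_rng(4)

def huffman(probs):
    """Return a binary prefix code (one bit string per symbol) for the given probabilities."""
    counter = itertools.count()                     # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {i: "0" + w for i, w in c1.items()}
        merged.update({i: "1" + w for i, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return [heap[0][2][i] for i in range(len(probs))]

def select(code, random_bits):
    """Step (c): return the index of the code word found as a prefix of the bit stream."""
    s = ""
    for bit in random_bits:
        s += bit
        for i, w in enumerate(code):
            if s == w:
                return i

u = np.array([0.5, 0.25, 0.125, 0.125])             # assumed (quantized) posterior over four estimates
code = huffman(u)

n = 100_000
picks = [select(code, iter(lambda: str(rng.integers(2)), None)) for _ in range(n)]
freq = np.bincount(picks, minlength=len(u)) / n
print("code words:           ", code)
print("selection frequencies:", freq)
print("posterior u:          ", u)
```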

8. Discussion

We have defined a class of ‘empirical’ criteria for the comparison of estimators based on their performance on data or sequences of data. For sequences, there are empirical criteria which estimate the bias and variance of estimators, and any other conventional measure of performance expressible as the expectation of some function of the true parameter and the estimated parameter. We have shown the existence of estimators called false oracles which no empirical criterion can distinguish from the oracle (whose ‘estimates’ are the true values). A false oracle can be constructed by randomly selecting an estimate from its posterior distribution.

The SMML estimator for a sequence has, in two cases, been shown to imitate a false oracle. In the first case, applicable whenever the parameters have slowly-varying priors and approximately quadratic log likelihoods, the SMML estimator is based on a quantizing lattice. The estimate θj of the j-th parameter is a (discontinuous) function of all the data {x1, x2, ..., xj, ..., xJ} but, if one regards the data values other than xj as being randomly drawn from their respective marginal distributions, θj is effectively sampled from a distribution approximating its posterior given xj.

In the second, more general case, we have shown that a near-optimal coding of a data sequence, (and so a near-SMML estimator) is obtained when each estimate θj is sampled from its posterior distribution given xj by using the coded data xj+1...xJ as a pseudo-random source of binary digits.

In both cases, a pseudo-random selection of θj is made using (some of) the data other than xj as a source of “random” information. Thus the SMML estimator is not truly a false oracle. A criterion based on knowledge of the “pseudo-random number generator” inherent in both coding algorithms could empirically distinguish the SMML estimator from an oracle. However, since in both cases the functional dependence of θj on data other than xj is quite obscure, for all practical purposes one may expect the statistical behaviour of the SMML estimator for a long sequence to be indistinguishable from that of a false oracle, and hence from that of an oracle.

9. Applications

The analysis of the behaviour of minimum-message-length estimators on long sequences of identical basic problems has practical application. There are practical estimation problems where the model includes not only a fixed set α of general parameters but also a set of ‘nuisance’ parameters as numerous as the sample size N. For instance, in a factor analysis model of a multivariate Gaussian population, once the means, specific variances and factor loadings are known or estimated, the estimation of the factor scores becomes a sequence of N identical independent estimation problems. The problem fits the assumptions of section 6, so the analysis allows us to determine the length of the coded message. In particular, the analysis shows that the message length L contains a term (1/2) log(π N) - γ rather than the larger term (N/2) log(2 π e / 12) which is given by an analysis treating the estimation of each factor score separately. The consequent reduction in message length affects the comparison of the factor model against alternative models of the same data, since the minimum-message-length approach prefers the model with the smaller (minimized) value of L (Wallace & Freeman 1992).

A second example is a mixture model, in which each datum xn is thought to be sampled from an unknown member of a set of different distributions {fk(x|αk), k=1...K}. Here, once the distribution parameters αk and relative abundances hk are known or estimated, the estimation of the class (i.e. distribution) k = θn to which each datum belongs is a sequence of identical basic problems each requiring the estimation of a categorical parameter θn.

If each basic problem is considered separately, the shortest encoding is obtained by estimating θn to be that value of k for which hk fk(xn|αk) is a maximum. That is, datum xn would be assigned to that component of the mixture most likely to contain it. The resulting message length L would then have the form

L = Q{αk, hk, k=1...K} - Σn log [ maxk (hk fk(xn|αk)) ]

Estimation of the distribution parameters and abundances by minimization of the above form has been found to give inconsistent estimates. However, the analysis of section 7 shows that a shorter coding is possible, and has a message length for large N of the form

L = Q{αk, hk} - Σn log Σk hk fk(xn|αk) + O(1)

In the mixture problems so far studied, minimization of the latter form yields consistent estimates for {αk} and {hk}. The smaller value of the latter form also affects the comparison between the lengths for models with different numbers of component distributions, and so affects the estimation of the number of components.
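The difference between the two codings is easy to exhibit numerically (a sketch assuming a two-component Gaussian mixture with abundances 0.6 and 0.4 and unit-variance components):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed two-component Gaussian mixture.
h = np.array([0.6, 0.4])                        # relative abundances
mu, sd = np.array([0.0, 2.5]), np.array([1.0, 1.0])

N = 10_000
k = rng.choice(2, size=N, p=h)                  # latent class of each datum
x = rng.normal(mu[k], sd[k])

dens = h * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

len_max = -np.log(dens.max(axis=1)).sum()       # - sum_n log max_k h_k f_k(x_n | a_k)
len_sum = -np.log(dens.sum(axis=1)).sum()       # - sum_n log sum_k h_k f_k(x_n | a_k)
print("assign each datum to its most likely class:", round(len_max, 1), "nats")
print("section 7 style coding:                    ", round(len_sum, 1), "nats (shorter)")
```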

It is perhaps worth mentioning that the coding device described in section 7 has been used successfully in a computer program for file compression.

Acknowledgements: This work was done with the assistance of a study grant from Monash University. The work owes much to the informed comment and assistance of P.R.Freeman.

References

  1. Barron, A.R. & Cover, T.M., Minimum Complexity Density Estimation, IEEE Trans. Inf. Thy., 37(4), pp.1034-1054, (1991).
  2. Rissanen, J., Stochastic Complexity, J. Royal Stat. Soc. B, 49, 3, pp.228-239, (1987).
  3. Wallace, C.S. & Freeman, P.R., Estimation & Inference by Compact Coding, J. Royal Stat. Soc. B, 49, 3, pp.240-252, (1987).
  4. Wallace C.S. & Freeman P.R., Single Factor Analysis by MML Estimation, J. Royal Stat. Soc. B, 54, 1, pp.195-209, (1992).
  5. Zador, P., Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension, IEEE Trans. Inf. Thy., IT-28, pp.139-149, (1982).

© C. S. Wallace.   Also see [MML] & [SMML] — L. Allison, 2008