False Oracles and SMML Estimators
C. S. Wallace, Proc. Int. Conf. on Information, Statistics and Induction in Science (ISIS), pp.304-316, 1996
(based on TR 128, Dept. Computer Science, Monash U., 1989)
1. Introduction
Recently there has been interest in a general method of statistical estimation and inference variously known as Minimum Message Length (MML) (Wallace & Freeman, 1987) or Minimum Description Length (MDL) (Rissanen 1987). The method is an application of a general approach to inductive inference first proposed by Solomonoff (1964), and closely related to the notion of Algorithmic Complexity due to Kolmogorov (1965). The method is Bayesian, and is based on the idea of choosing an hypothesis and estimates of its parameters which together allow a concise encoding of the data. The method is of great generality, and in those specific estimation problems which have so far been addressed using MML, gives estimates similar, but often superior in some respect such as bias, to those obtained by the Maximum Likelihood (ML) method. In particular, MML handles more gracefully problems where the number of unknown parameters increases with sample size, and can automatically select an appropriate hypothesis from a family of hypotheses of increasing complexity.
[Transcriber's note: this page was created from what I believe to be the 1996 source file, hopefully without adding too many errors, to make this MML paper more readily accessible. NB: θ is used for θhat and φ for φhat. -- LA 2008]
The consistency of MDL for a wide range of problems has been proved by Barron & Cover (1991), who also derive the ultimate rate of convergence of the estimates with increasing sample size for a range of problem types. Unfortunately, little is known in general about the behaviour of MML estimates for modest sample sizes. For instance, there have been no general results on the bias or variance of estimates, or on the error rates of the inbuilt hypothesis selection. Wallace and Freeman (op cit) and Rissanen (op cit) have argued that the estimates are close to sufficient, in the sense of capturing all the relevant information available in the data, but the arguments are outside conventional statistical theory.
In this paper, we attempt to show that MML can be expected to have good behaviour for modest samples. Consistency arguments prove good behaviour in the limit of large samples. We have not entirely avoided the need to appeal to a limit, but rather than considering the behaviour as an unbounded volume of data is used in a single estimation, we consider the behaviour as the method is applied to an unbounded set of (possibly unrelated) estimation problems, each using only limited data. That is, we consider the expected consequences of adopting the strategy of using MML for all desired inferences. Sections 2 and 3 discuss a proposed criterion of estimate quality for a single estimation problem. Section 4 extends the criterion to cover the behaviour of a method in a sequence of problems, and relates it to conventional criteria such as bias. These sections also define an “optimal” method according to this criterion. Sections 6, 7 and 8 then show that the MML method approaches this optimum.
Of the several criteria which are commonly used for judging the quality of statistical estimators, none is entirely satisfactory, at least by itself. Some, such as consistency, are applicable only to families of estimators for an unbounded range of sample sizes. Others, such as bias and expected variance, are tied to particular parameterizations of the problem, and may give different judgements if the parameters to be estimated are subjected to functional transformation. Others again, such as sufficiency, can be satisfied by estimators with severe deficiencies. In this paper, we propose yet another criterion and investigate some of the properties of estimators satisfying it. We give a constructive proof that such estimators generally exist, and also show that Strict Minimum Message Length (SMML) estimators (Wallace & Freeman 1987) in some circumstances closely approach satisfaction of the criterion.
The framework of the discussion is Bayesian.
We wish to estimate the value of an unknown parameter θ within a space Θ, given a prior density h(θ) on Θ, an observed datum x within a space X, and a probability density f(x|θ) (x∈X, θ∈Θ). At times we use the joint density p(θ,x) = h(θ)f(x|θ), the marginal density of x, r(x) = ∫ dθ p(θ, x), and the posterior density g(θ|x) = p(θ, x) / r(x). In some parts of the discussion we prefer to treat x as a discrete datum within a countable set X.
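For concreteness, these quantities can be computed numerically on a discretized grid. A minimal Python sketch (the grid, the prior shape, and the observed datum below are illustrative choices, not taken from the paper):

```python
import numpy as np

# Illustrative discretization of the framework; the grid, prior shape and
# observed datum are my choices, not the paper's.
thetas = np.linspace(-5.0, 5.0, 201)        # grid over Theta
dtheta = thetas[1] - thetas[0]
h = np.exp(-thetas**2 / 8.0)                # unnormalized prior h(theta)
h /= h.sum() * dtheta

def f(x, theta, sigma=1.0):
    """Conditional density f(x|theta), here N(theta, sigma^2)."""
    return np.exp(-(x - theta)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

x_obs = 0.7
p = h * f(x_obs, thetas)                    # joint density p(theta, x_obs)
r = p.sum() * dtheta                        # marginal r(x_obs)
g = p / r                                   # posterior g(theta|x_obs)
assert abs(g.sum() * dtheta - 1.0) < 1e-9
```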
2. Empirical Criteria
Definitions
- An estimator is a function from X into Θ, which we shall write as θ = T(x).
- A stochastic estimator is an estimator whose value θ = T_{s}(x) for any datum x is a realization of a random variable with probability density G_{s}(θ|x).
- A deterministic estimator is not stochastic.
- An honest estimator is an estimator whose value for given x is independent of the true parameter value θ.
- The oracle is a distinguished estimator T_{t} such that T_{t}(x) = θ, the true parameter value. It is thus a ‘perfect’ estimator, but is of course neither realizable nor honest. Apart from the oracle, all other estimators considered will be honest.
- A trial is an instance of an estimation problem described by Θ, X, h(θ), f(x|θ) yielding a datum x which is then supplied to two estimators T_{0} and T_{1} to give two estimates θ = T_{0}(x), φ = T_{1}(x). In a trial, the estimators are treated as black boxes. That is, the internal mechanisms of the estimators are invisible, and any comparison of the estimators can be based only on the datum x and the two estimates θ and φ. The comparison may involve use of the prior information Θ, X, h(θ) and f(x|θ). Formalizing the notion of a comparison:
- A criterion c is a function from (X, Θ, Θ) to the real line. The outcome of a trial using c is written c(x, θ, φ). For some criteria, negative values of c represent a preference for T_{0} and positive values a preference for T_{1}. That is, for the observed value of x, c(x, θ, φ) < 0 indicates that the criterion c judges θ to be a better estimate than φ, for instance because θ has the higher likelihood. All criteria are assumed to be independent of the true parameter θ.
- A fair criterion is a criterion such that
for all a, b ∈ Θ and x ∈ X,
c(x,a,b) = - c(x,b,a).
This definition is intended to exclude criteria which have an inbuilt preference for one of the two estimators. As the internal workings of the estimators are invisible, no such preference would be rational.
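As an example of a fair criterion, consider the log-likelihood ratio of the two estimates. The sketch below (with a normal f, my choice for illustration) satisfies c(x,a,b) = -c(x,b,a) by construction and does not depend on the true θ:

```python
def c_loglik(x, a, b, sigma=1.0):
    """Fair criterion: log f(x|b) - log f(x|a) for f normal with known sigma.
    Negative values prefer the first estimate a (it has the higher likelihood)."""
    def loglik(t):
        return -(x - t)**2 / (2 * sigma**2)      # up to an additive constant
    return loglik(b) - loglik(a)

# Fairness: swapping the two estimates flips the sign.
assert c_loglik(0.3, 0.1, 0.9) == -c_loglik(0.3, 0.9, 0.1)
# The estimate closer to x has the higher likelihood, so c < 0 prefers it.
assert c_loglik(0.5, 0.5, 2.0) < 0
```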
The above definitions are intended to provide a framework for the empirical comparison of estimators, i.e. comparison based on the observable behaviour of estimators rather than on their structure. Of course, the outcome of a single trial using some criterion is a function of the value of x realized in the trial, and so the preference for one estimator over the other will be subject to the accidents of the particular trial. Later, we will discuss a class of estimation problems for which a single trial might be sufficient to determine a preference reliably. However, we now retreat somewhat from the ideal of a purely empirical comparison, and define the expected outcome of a trial.
The expected value of a trial of estimators T_{0} and T_{1} using criterion c is the expected value of c(x, T_{0}(x), T_{1}(x)), written c(T_{0}, T_{1}), where the expectation is taken over the joint distribution of x and θ. Where one or both of the estimators are stochastic, the expectation is also taken over their value distributions.
- For T_{0}, T_{1} honest and deterministic, the estimators are independent of θ and the expectation is simply over the marginal distribution of x.
- c(T_{0}, T_{1}) = ∫ dx r(x) c(x, T_{0}(x), T_{1}(x))
- For T_{0}, T_{1} honest, T_{1} stochastic, T_{0} deterministic,
- c(T_{0}, T_{1}) = ∫ dx ∫ dφ r(x) G_{1}(φ|x) c(x, T_{0}(x), φ)
- and for T_{0}, T_{1} honest and stochastic
- c(T_{0}, T_{1}) = ∫ dx ∫ dθ ∫ dφ r(x) G_{0}(θ|x) G_{1}(φ|x) c(x, θ, φ)
- For the special case that one estimator, T_{0} say, is the oracle T_{t} and T_{1} is deterministic,
- c(T_{t}, T_{1}) = ∫ dθ ∫ dx h(θ) f(x|θ) c(x,θ, T_{1}(x))
- and for T_{0} the oracle, T_{1} stochastic
- c(T_{t}, T_{1}) = ∫ dθ ∫ dx ∫ dφ h(θ) f(x|θ) G_{1}(φ|x) c(x,θ, φ)
- Note that in all cases, for c fair, c(T_{0}, T_{1}) = - c(T_{1}, T_{0}).
3. False Oracles
A false oracle is an estimator T_{f} such that for all fair criteria c, and for T_{t} the oracle, c(T_{f}, T_{t}) = 0. That is, no fair criterion will be expected to prefer the oracle to a false oracle, or vice versa.
3.1 Theorem:
- For T_{f} a false oracle, T_{t} the oracle, T_{1} honest and c a criterion not necessarily fair c(T_{f}, T_{1}) = c(T_{t}, T_{1}).
- Proof: Consider the function
- c*(x,a,b) = c(x,a,T_{1}(x)) - c(x,b,T_{1}(x)), a,b ∈ Θ
- where, if T_{1} is stochastic, we have the same realization of T_{1}(x) in both places. Since T_{1} is honest, c* is not a function of θ. Also, c*(x,a,b) = -c*(x,b,a). Hence c* is a fair criterion, so by definition of T_{f},
- c*(T_{t}, T_{f}) = 0
- But
- c*(T_{t}, T_{f}) = E{c(x, T_{t}(x), T_{1}(x)) - c(x, T_{f}(x), T_{1}(x))}
- = E{c(x, T_{t}(x), T_{1}(x))} - E{c(x, T_{f}(x), T_{1}(x))}
- = c(T_{t}, T_{1}) - c(T_{f}, T_{1}) --Q.E.D.
This theorem shows that, while we cannot in practice compare an estimator with the oracle, since the oracle is unavailable, we can instead compare the estimator with a false oracle and get the same result in expectation.
- 3.2 Corollary: With the same conditions as for the theorem,
- c(T_{1}, T_{f}) = c(T_{1}, T_{t})
- Also,
- c(T_{f}, T_{t}) = c(T_{f}, T_{f}) = c(T_{t}, T_{f})
3.3 Corollary: With the same conditions as for the theorem, the value on a trial of c(x, T_{f}(x), T_{1}(x)) is as good an estimate of c(T_{t}, T_{1}) as is the value on a trial of c(x, T_{t}(x), T_{1}(x)) = c(x, θ, T_{1}(x)).
- Proof: Define c_{k}(x, θ, φ) = (c(x, θ, φ))^{k}
- Then c_{k} is a criterion, so c_{k}(T_{f}, T_{1}) = c_{k}(T_{t}, T_{1})
- But c_{k}(T_{0}, T_{1}) is the k^{th} moment of the sampling distribution of c(x, T_{0}(x), T_{1}(x)).
- Hence c(x, T_{f}, T_{1}) and c(x, T_{t}, T_{1}) have the same sampling distribution.
3.4 Theorem:
- False oracles exist in sufficiently regular cases.
- The proof is constructive:
- Let T_{p} be the stochastic estimator described by the density
- G_{p}(θ|x) = g(θ|x) = h(θ) f(x|θ) / r(x) = p(θ,x) / r(x).
- For any criterion c,
- c(T_{p}, T_{t}) = ∫ dθ ∫ dx ∫ dφ h(θ) f(x|θ) g(φ|x) c(x,φ,θ)
- = ∫ da ∫ dx ∫ db p(a,x) p(b,x) c(x,b,a) / r(x), (a = θ ∈ Θ, b = φ ∈ Θ)
- c(T_{t}, T_{p}) = ∫ dθ ∫ dx ∫ dφ h(θ) f(x|θ) g(φ|x) c(x,θ,φ)
- = ∫ db ∫ dx ∫ da p(b,x) p(a,x) c(x,b,a) / r(x), (b = θ, a = φ)
- The expressions for c(T_{p}, T_{t}) and c(T_{t}, T_{p}) differ only in the order of integration. Hence, for sufficiently regular h, f and c, they are equal,
- c(T_{t}, T_{p}) = c(T_{p}, T_{t})
- But, if c is fair, c(T_{t}, T_{p}) = - c(T_{p}, T_{t})
- Hence c(T_{p}, T_{t}) = 0 for all fair c, so T_{p} is a false oracle.
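The defining property c(T_p, T_t) = 0 can be checked by Monte Carlo in a conjugate normal-normal model, where the posterior can be sampled exactly. This is a simulation sketch, not part of the paper; the fair criterion used is c(x,a,b) = (x-a)^2 - (x-b)^2:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma = 2.0, 1.0           # prior N(0, tau^2), likelihood N(theta, sigma^2)
J = 200_000

theta = rng.normal(0.0, tau, J)             # true parameters drawn from the prior
x = rng.normal(theta, sigma)                # data

# Posterior of theta given x is N(m, v) (conjugate normal-normal model).
v = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
m = x * v / sigma**2
theta_p = rng.normal(m, np.sqrt(v))         # false oracle T_p: draw from the posterior

# Averaging the fair criterion c(x, a, b) = (x-a)^2 - (x-b)^2 over trials of
# (T_p, T_t) estimates c(T_p, T_t), which should be near zero.
c_bar = np.mean((x - theta_p)**2 - (x - theta)**2)
assert abs(c_bar) < 0.05
```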
3.5 The Merit of False Oracles
False oracles deserve to be considered good estimators in that no fair criterion is expected to show a preference for the true parameter value over the estimate of a false oracle. Indeed, the above results show that no test based only on the information available to the estimator can reliably distinguish an oracle from a false oracle. With the more conventional criteria applied to estimators (some of which will be further discussed below) it is often the case that an estimator satisfying one criterion (e.g. lack of bias) will not satisfy others (e.g. maximum likelihood). However, for the empirical criteria discussed here, we have shown that there exists a class of estimators which give results as good as the true values by all empirical tests, since c(T_{f}, T_{t}) = c(T_{t}, T_{f}) for all c, fair and otherwise.
Note that it is not true that all criteria, or even all fair criteria, will prefer a false oracle to an honest estimator which is not a false oracle. Suppose x and θ are commensurable, and that x is distributed as N(θ,1). The fair criterion
- c(x, θ, φ) = (x - θ)^{2} - (x - φ)^{2}
will always prefer the estimator θ = x to a false oracle. It will also, of course, always prefer the estimate θ = x to the oracle's estimate, the true value θ, so the import of its preference is dubious.
The construction of the false oracle T_{p} amounts to a rule for characterizing the posterior distribution of θ by a value (estimate) drawn randomly from the posterior. It may seem a strange rule compared with, say, choosing the mean, median or mode of the posterior, but it has the advantage of being invariant under arbitrary measure-preserving transformations of the parameter space.
4. Sequence Trials
We now attempt to define a class of estimation problems, and associated trials, which allow a comparison between our ‘empirical’ criteria and some more conventional ones.
A sequence problem is an ordered sequence of independent instances of the same basic problem. For a basic problem described by Θ, x, h(θ), f(x|θ), a sequence problem of length J has parameter space Θ^{J} and data space X^{J}. The data are an ordered sequence of values {x_{j}, j=1 ... J} from which an estimator must form an estimate which is an ordered sequence {θ_{j}, j=1 ... J}, where θ_{j} is an estimate of the true value θ_{j} of the parameter in the j^{th} instance of the basic problem. The prior density for the true sequence parameter {θ_{j}, j=1 ... J} is ∏_{j} h(θ_{j}) and the conditional data density is ∏_{j} f(x_{j}|θ_{j}). Note that we assume the instances are wholly independent: data from early instances do not lead to revision of the prior or conditional data density in later instances.
An estimator T for a sequence is a function from X^{J} into Θ^{J}. We do not require that the estimator have the form of J instances of the same function applied in turn to {x_{j}}. The ‘black box’ model of an estimator is not required to produce any output until it has the entire data sequence, so θ_{j} may depend on all components of {x_{k}, k=1 ... J}, not just x_{j}.
In a sequence trial of estimators T_{0}, T_{1} using criterion c, c is a function c({x_{j}, θ_{j}, φ_{j}, j=1...J}) which operates on the data sequence and the two estimate sequences to yield a single value, which need not be expressible as a combination of J applications of a simpler function.
Since a sequence problem and sequence trial are just special cases of the problems and trials defined earlier, all definitions and results carry over, with only the notational difference that now the symbols Θ, X, h( ) and f( ) refer to the component basic problems of the sequence rather than to the sequence problem as a whole. Note that, since the basic problems are independent, the posterior density of {θ_{j}, j=1...J} given {x_{j}, j=1...J} is ∏_{j} g(θ_{j}|x_{j}). Thus the false oracle T_{p} of theorem 3.4, which picks an estimate sequence randomly from this posterior density, may be considered to choose randomly and independently a value from each basic posterior g(θ_{j}|x_{j}).
4.1 Sequence Criteria:
Using sequence trials, we can define empirical criteria corresponding to some conventional criteria. For D a function from Θ to the real line, v∈Θ, δ a small constant and J large, define
- D_{v}{x_{j}, θ_{j}, φ_{j}, j=1...J} = Σ_{j∈S} D(θ_{j}) / N_{S}
where S is the set {j; |φ_{j} - v| < δ} and N_{S} is the cardinality of S. D_{v} is the average value of D(θ_{j}) over those components of the sequence trial in which φ_{j} approximately equalled v. Its expected value, D_{v}(T_{0}, T_{1}) approximates the expected value of D(θ) given φ = v.
Letting D_{v} be the function B_{v}(θ) = θ - v, call B_{v}(T_{0}, T_{1}) the bias of T_{0} relative to T_{1}. It is, of course, a function of v. For T_{1} the oracle, B_{v}(T_{0}, T_{t}) is simply the expected value of θ - θ, given θ = v. Hence B_{v}(T_{0}, T_{t}) corresponds to the usual definition of the bias of T_{0}. While T_{t} is unavailable, we have B_{v}(T_{0}, T_{f}) = B_{v}(T_{0}, T_{t}) for T_{f} a false oracle. Hence the bias of T_{0} can be estimated by measuring the bias of T_{0} relative to T_{f} in a sequence trial. Letting T_{0} be the false oracle T_{p},
- B_{v}(T_{p}, T_{t}) = B_{v}(T_{p}, T_{p}) = ∫ dx ∫ dθ f(x|v) g(θ|x) (θ - v)
This is not, in general, zero. However, note that B_{v}(T_{t}, T_{p}) = B_{v}(T_{p}, T_{t}). The ‘bias’ of the oracle relative to a false oracle T_{f} equals the bias of T_{f}, so empirical measurement of relative bias does not serve to distinguish false and true oracles. Also, note that B_{v}(T_{t}, T_{t}) ≠ B_{v}(T_{t}, T_{f}), since T_{t} is not honest. Of course, B_{v}(T_{t}, T_{t}) = 0 for all v.
Similarly, letting D_{v} be the function V_{v}(θ) = (θ - v)^{2}, the usually-defined variance of T_{0} is V_{v}(T_{0}, T_{t}), and can be empirically measured by the value of V_{v} in a sequence trial of T_{0} and a false oracle T_{f}. For the false oracle T_{p},
- V_{v}(T_{p}, T_{t}) = V_{v}(T_{p}, T_{p}) = ∫ dx ∫ dθ f(x|v) g(θ |x) (θ - v)^{2}
This expression shows that the variance of a false oracle is the variance of the convolution over x of f and g. For h(θ) slowly varying, it is about twice the variance of a conventional Bayes estimator with quadratic loss function.
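For the normal case this doubling is easy to check by simulation. A sketch (the slowly-varying prior is idealized as flat, so the posterior given x is taken to be N(x, σ^2), and the conventional Bayes estimator is the posterior mean θ = x):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, v_true = 1.0, 0.5
J = 200_000

# Flat prior idealization: posterior given x is N(x, sigma^2).
x = rng.normal(v_true, sigma, J)           # data from f(x|v), v fixed
theta_p = rng.normal(x, sigma)             # false oracle: posterior draws

var_fp = np.mean((theta_p - v_true)**2)    # V_v(T_p, T_t)
var_bayes = sigma**2                       # variance of the posterior-mean estimator

assert abs(var_fp - 2 * var_bayes) < 0.05  # roughly twice the conventional variance
```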
5. SMML Estimators
Strict Minimum Message length (SMML) estimation has been described elsewhere (Wallace & Freeman 1987). We use the same Bayesian framework defined by Θ, X, h(θ), f(x|θ) etc. as before, but initially it is convenient to consider x to be a discrete variable, so X is a countable set, and f(x|θ) and r(x) are probabilities rather than densities.
The SMML estimator θ = m(x) has a range Θ* = {θ_{i}, i=1, 2, ...} = {θ : θ = m(x), x∈X} which is a countable subset of Θ. Define X_{i} = {x : m(x) = θ_{i}} and q_{i} = q(θ_{i}) = Σ_{Xi} r(x) for all i. Then m( ) minimizes
- I_{1} = - Σ_{X} r(x) log[ q(m(x)) f(x|m(x)) ]
- = - Σ_{i} [ q_{i} log q_{i} + Σ_{Xi} r(x) log f(x | θ_{i}) ]
The estimator is derived by considering the expected length of a message encoding x and decodeable given knowledge of X, Θ, f(x|θ) and h(θ). By standard coding theory, the expected length cannot be less than I_{0} = - Σ_{X} r(x) log r(x), where the base of logarithms depends on the coding alphabet, e.g., 2 for binary coding. However, we consider a message having the form of a statement of an estimate θ, followed by an encoding of x using a code which would be optimal (i.e. have least expected length) were θ = θ. Using optimal coding for the statement of θ, the length of the message encoding x is L(x) = - log q(m(x)) - log f(x|m(x)). (Actually, the length of a real message might exceed L(x) by as much as one symbol, since real messages must have integral lengths. However, we neglect this rounding-up effect.) The first term of L(x) gives the length of the encoding of the estimate θ = m(x). Note that q(θ_{i}) is the unconditional probability that θ_{i} will be the estimate. The second term is the length of the encoding of x using a code optimal if θ = m(x). I_{1} is the expected value of L(x).
Since I_{0} is independent of the estimator, the SMML estimator minimizes
- I_{1} - I_{0} = - Σ_{X} r(x) log [q(m(x)) f(x|m(x)) / r(x)]
If the discretization of the data is so fine that we can consider X as a continuum, we can revert to the framework of the earlier sections, and the SMML estimator minimizes
- I_{1} - I_{0} = - ∫_{X} r(x) log[q(m(x)) f(x|m(x)) / r(x)] dx,
- where
- q(θ_{i}) = ∫_{Xi} r(x)dx.
Θ* remains a countable subset of Θ, but I_{1} and I_{0} are not individually definable.
Minimization of I_{1} - I_{0} requires a compromise in the size of the region X_{i} within which all datum values lead to the same estimate θ_{i}. Large regions have large values of q_{i}, leading to a short statement of θ_{i}, but if the region is too large, it will contain values of x for which f(x|θ_{i}) is much less than f(x|θ_{m}) where θ_{m} is the maximum-likelihood estimate for x, so the expected length of the statement of x will be greater than for smaller regions.
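This compromise can be made concrete by brute force on a tiny discrete problem. The sketch below uses my own example, not the paper's: a binomial with n = 5 trials and a uniform prior on the success probability, searching over contiguous partitions of X only (which I assume suffices here). It enumerates partitions {X_i} and picks the one minimizing I_1:

```python
import itertools, math

n = 5
X = list(range(n + 1))
# Marginal r(x): integral of C(n,x) p^x (1-p)^(n-x) over a uniform prior = 1/(n+1).
r = {x: 1.0 / (n + 1) for x in X}

def f(x, theta):
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def I1(groups):
    """I_1 = -sum_X r(x) log[q(m(x)) f(x|m(x))] for a given partition of X."""
    total = 0.0
    for Xi in groups:
        q = sum(r[x] for x in Xi)
        # Estimate minimizing the r-weighted code length over the group.
        theta = sum(r[x] * x for x in Xi) / (n * q)
        theta = min(max(theta, 1e-9), 1 - 1e-9)     # keep the log finite at 0 and 1
        total -= sum(r[x] * (math.log(q) + math.log(f(x, theta))) for x in Xi)
    return total

def contiguous_partitions(xs):
    """All ways of cutting xs into contiguous groups."""
    for cuts in itertools.product([0, 1], repeat=len(xs) - 1):
        groups, start = [], 0
        for k, cut in enumerate(cuts):
            if cut:
                groups.append(xs[start:k + 1])
                start = k + 1
        groups.append(xs[start:])
        yield groups

best = min(contiguous_partitions(X), key=I1)
```

The optimal partition beats both extremes: one region per datum (maximal q but poor fit) and a single region for all of X.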
We now show that in a sequence of basic estimation problems such as was described in section 4, the SMML estimator approximates a false oracle. Two cases are considered. In the first, which applies to basic problems with approximately quadratic log-likelihoods, a fairly exact analysis is possible. The second presents a less exact treatment of a more general sequence problem. Throughout, j indexes the component problems of a sequence of J basic problems, and i indexes the members of Θ*, the range of an SMML estimator.
6. A Simple Case
Assume that for all j, f(x_{j}|θ_{j}) = (1 / (√(2 π) σ)) e^{-(x_{j} - θ_{j})^{2} / (2 σ^{2})}, σ known, and h(θ_{j}) = a, a constant over a large range. If SMML estimation were applied separately to each problem of the sequence, we would obtain for each j an estimator θ_{j} = m(x_{j}) with the same form for every j. We drop the j suffix and consider one estimator m(x). Its range is simply a set of equally-spaced values of θ, with spacing s say. The SMML estimator maps x into the member of the set most nearly equal to x. Since, over the range of interest, r(x) = h(θ) = a, we have that for every θ_{i} ∈ Θ*, X_{i} is the interval {x: θ_{i} - s/2 < x < θ_{i} + s/2} and q_{i} = as.
The estimator minimizes the expected value of - log q(m(x)) - log f(x|m(x)) + log r(x). Since all of the regions X_{i} have the same pattern, the expectation can be taken over any one of them, giving
- I_{1} - I_{0}
- = - log(as) + (1/2) log(2 π) + log σ + (1/s) ∫_{θ_{i}-s/2}^{θ_{i}+s/2} dx (x - θ_{i})^{2} / (2 σ^{2}) + log a
- = - log s + (1/2) log(2 π) + log σ + s^{2} / (24 σ^{2})
- Minimization with respect to s gives
- s^{2} = 12 σ^{2},
- I_{1} - I_{0} = (1/2) log (2 π e / 12).
- The result of applying this estimator to all J problems of the sequence is therefore an expected message length increment for the whole sequence given by
- I_{1} - I_{0} = (1/2) J log(2 π e / 12).
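The minimization above is easily verified numerically (a quick sketch):

```python
import math

sigma = 1.0

def excess(s):
    """I_1 - I_0 per problem as a function of the spacing s."""
    return (-math.log(s) + 0.5 * math.log(2 * math.pi)
            + math.log(sigma) + s**2 / (24 * sigma**2))

s_opt = math.sqrt(12) * sigma                    # s^2 = 12 sigma^2
# Nearby spacings do worse ...
assert excess(0.99 * s_opt) > excess(s_opt) < excess(1.01 * s_opt)
# ... and the minimum matches the closed form (1/2) log(2 pi e / 12).
assert abs(excess(s_opt) - 0.5 * math.log(2 * math.pi * math.e / 12)) < 1e-12
```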
This result is not the best that can be achieved.
- An SMML estimator for the whole sequence problem minimizes the expected value of
- - log q(m(x)) - Σ_{j} log f(x_{j}|m_{j}(x)) + log r(x)
- where now the estimator m(x) maps the sequence or vector of data x = (x_{1}, x_{2} ... x_{j} ...) into a vector
- θ = (θ_{1}, θ_{2} ... θ_{j} ...) = (m_{1}(x), m_{2}(x), ..., m_{j}(x), ...),
- and r(x) = a^{J}.
- Thus the SMML estimator minimizes the expected value of
- L(x) = - log q(m(x)) + (J/2) log (2 π) + J log σ + (m(x) - x)^{2} / (2 σ^{2}) + J log a
Its range Θ* is a countable set of vectors or points in J-space, {θ_{i}}. Assume these points form a lattice. Then q(θ_{i}) = q_{i} is the same for all i, and for all i the region X_{i} within which m(x) = θ_{i} is the set of x values closer to θ_{i} than to any other member of Θ*, i.e., the Voronoi region of θ_{i}. For Θ* a lattice, the Voronoi regions of all its members have the same size and shape. Let s be the volume of X_{i} for all i. Then q_{i} = sa^{J} for all i. The expectation of L(x) can be taken over any Voronoi region, giving
- I_{1} - I_{0} = - log (sa^{J}) + (J/2) log (2 π) + J log σ + E(x - θ_{i})^{2} / (2 σ^{2}) + J log a.
The expectation E(x - θ_{i})^{2} over the Voronoi region of θ_{i} depends on the volume s and the shape of the region. To separate these effects, write E(x - θ_{i})^{2} = J k_{J} s^{2/J}, where k_{J} depends only on the shape of X_{i}, i.e. the geometry of the lattice. Then
- I_{1} - I_{0} = - log s + (J/2) log (2 π) + J log σ + Jk_{J} s^{2/J} / (2 σ^{2})
- Minimization with respect to s gives
- s^{2/J} = σ^{2} / k_{J},
- I_{1} - I_{0} = (J/2) log (2 π e k_{J})
Minimization with respect to the geometry of the lattice requires choosing a lattice which, for a given density of points, minimizes the mean squared distance of all points from the nearest lattice point. Lattices chosen in this way are called ‘quantizing lattices’.
Use of the single-problem SMML estimator separately for each problem is equivalent to choice of a rectangular lattice, for which k_{J} = 1/12 for all J. However, for J > 1, lattices exist with k_{J} < 1/12. For instance, for J = 2, a lattice with hexagonal rather than square Voronoi regions has k_{J} = 0.080175.. < 1/12 and hence leads to a smaller value for I_{1} - I_{0}. Note that in the resulting SMML estimator, θ_{1}, is a function of both x_{1} and x_{2}. If x_{1} is held constant and x_{2} is changed, θ_{1} may assume in turn two or more different values, all however close to x_{1}. Similarly, θ_{2} is a function of both x_{1} and x_{2}, but is always close to x_{2}.
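The hexagonal constant can be estimated by Monte Carlo: sample uniformly inside a regular hexagon (a single Voronoi region) and use E|x - θ_i|^2 = J k_J s^{2/J} with J = 2 and s the cell area. A sketch; the circumradius-1 normalization is arbitrary:

```python
import math, random

random.seed(42)
ROOT3 = math.sqrt(3)
# Outward unit normals of a regular hexagon with circumradius 1.
NORMALS = [(math.cos(math.pi/6 + k*math.pi/3), math.sin(math.pi/6 + k*math.pi/3))
           for k in range(6)]

def in_hexagon(px, py):
    return all(px*nx + py*ny <= ROOT3/2 for nx, ny in NORMALS)

total, count = 0.0, 0
while count < 200_000:                     # rejection sampling from the bounding box
    px, py = random.uniform(-1, 1), random.uniform(-1, 1)
    if in_hexagon(px, py):
        total += px*px + py*py
        count += 1

area = 3 * ROOT3 / 2
k2 = (total / count) / (2 * area)          # E|x|^2 = 2 * k_2 * area
assert abs(k2 - 0.0802) < 0.002            # below the square-lattice 1/12 = 0.0833
```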
- For large J, the optimum quantizing lattice is not known. However, Zador (1982) has shown that
- Γ(J/2+1)^{2/J} Γ(2/J+1) / (J π) > k_{J} > Γ(J/2+1)^{2/J} / ((J+2) π)
The upper bound is the value for a random set of points (not a lattice) and the lower bound would be achieved only if the Voronoi region were a J-sphere.
- Using Stirling’s approximation in the form
- log Γ(y+1) = (y + 1/2) log y - y + 1/2 log(2 π) + O(1/y),
- log k_{J} > (2/J) [(J/2 + 1/2) log(J/2) - J/2 + 1/2 log(2 π)] - log(J+2) - log π + O(1/J^{2})
- > - log(2 π e) + (log J + log π - 2) / J + O(1 / J^{2})
- Similarly, and using log Γ(1+y) = - γ y + O(y^{2}), where γ is Euler’s constant,
- log k_{J} < - log(2 π e) + (log J + log π - 2 γ) / J + O(1/J^{2})
Hence (1/2) log(J π) - γ > I_{1} - I_{0} > (1/2) log(J π) - 1, to order 1/J.
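The bounds, and the resulting bracket on I_1 - I_0, can be evaluated directly. A sketch using math.lgamma, with J = 100 an arbitrary choice:

```python
import math

J = 100
gamma_euler = 0.5772156649015329

# Zador's bounds on log k_J.
log_upper = (2/J) * math.lgamma(J/2 + 1) + math.lgamma(2/J + 1) - math.log(J * math.pi)
log_lower = (2/J) * math.lgamma(J/2 + 1) - math.log((J + 2) * math.pi)
assert log_lower < log_upper

# I_1 - I_0 = (J/2) log(2 pi e k_J), bracketed via the bounds on k_J.
excess_hi = (J/2) * (math.log(2 * math.pi * math.e) + log_upper)
excess_lo = (J/2) * (math.log(2 * math.pi * math.e) + log_lower)

# Compare with the asymptotic bracket (1/2) log(J pi) - gamma  and  (1/2) log(J pi) - 1.
approx_hi = 0.5 * math.log(J * math.pi) - gamma_euler
approx_lo = 0.5 * math.log(J * math.pi) - 1.0
assert abs(excess_hi - approx_hi) < 0.1
assert abs(excess_lo - approx_lo) < 0.1
```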
We see that, whereas use of a rectangular lattice, i.e. repeated use of the one-problem SMML estimator, gives I_{1} - I_{0} = (1/2) J log (2 π e / 12) proportional to J, the optimum SMML estimator has at worst I_{1} - I_{0} = (1/2) log (J π) - γ.
Although the shape of the optimum Voronoi region for large J is unknown, the close approximation of the optimum lattice to the behaviour of spherical regions strongly suggests that the optimum Voronoi region is very nearly spherical for large J.
- Making this assumption, the equation s^{2/J} = σ^{2} / k_{J} for its volume s gives for its radius ρ
- ρ^{J}π^{J/2} / Γ(J/2 + 1) = s = σ^{J} / k_{J}^{J/2}
Using the lower bound as an approximation for k_{J}, ρ = σ √(J+2)
If we consider one component of θ and x, i.e. one estimation problem of the sequence, the estimate θ_{j} is a function of all components of x, not just of x_{j}. If the other components are thought of as random variables, all that is known of x is that it lies in the subspace defined by x_{j}, and all that is known of θ is that it lies somewhere in a sphere of radius ρ centred on x. Thus, θ_{j} can be regarded as a random variable generated by the projection of a random point in this sphere onto the j axis. The density of θ_{j} is therefore proportional to
- {ρ^{2} - (θ_{j}-x_{j})^{2}}^{(J-1)/2} = {(J+2)σ^{2} - (θ_{j}-x_{j})^{2}}^{(J-1)/2}
For any J, this distribution has mean x_{j} and standard deviation σ. For large J, it closely approximates N(x_{j}, σ^{2}), which is the posterior density of θ_{j} given x_{j}.
The SMML estimator thus produces estimates for the individual problems of the sequence which have approximately the behaviour of values chosen at random from the individual posterior densities. That is, it approximates the false oracle T_{p}.
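The projection argument can be checked by simulation: draw points uniformly in a J-ball of radius σ√(J+2) and project onto one axis (a sketch; J = 50 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
J, sigma, N = 50, 1.0, 200_000
rho = sigma * np.sqrt(J + 2)

# Uniform points in a J-ball: random direction scaled by rho * U^(1/J).
directions = rng.normal(size=(N, J))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = rho * rng.uniform(size=N) ** (1.0 / J)
points = directions * radii[:, None]

proj = points[:, 0]                        # projection onto one axis
assert abs(proj.mean()) < 0.02             # mean x_j (here centred at 0)
assert abs(proj.std() - sigma) < 0.02      # standard deviation sigma, for any J
```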
The results for this simple case are readily generalized to any sequence of estimation problems in which the log likelihood for the sequence of data has an approximately quadratic behaviour near its maximum, and where the prior for the parameter sequence is slowly-varying. It is not necessary that all basic estimation problems in the sequence be instances of the same problem, or even that they require estimation of the same number of parameters. The results depend only on the assumption of a slowly-varying prior and an approximately quadratic log-likelihood.
7. A More General Case
We now describe a construction for a near-optimal encoding of a data sequence which does not require smooth or quadratic behaviour in the log prior and log likelihood functions. As before, we consider a sequence of J independent instances of the same basic problem described by Θ, X, h(θ) and f(x|θ). We assume x is a discrete variable.
In this construction, each estimate θ_{j} will be selected from a set Θ* of equally-spaced values {θ_{i}, i=1, 2 ...} whose spacing δ is small compared with the expected estimation error of each θ_{j}. Index i is used for indexing members of Θ*, and index j for indexing problems of the sequence. We assume Σ_{i} δ h(θ_{i}) = 1.
The message is constructed as J segments. Basically, each segment comprises two parts. The first states an estimate θ_{j} of θ_{j}, selecting θ_{j} from Θ*, and has length - log(δ h(θ_{j})). The second encodes x_{j} using a code optimal if θ_{j} = θ_{j}, and has length - log f(x_{j}| θ_{j}). The total length l_{j} of segment j is
- l_{j} = - log(δ h(θ_{j}) f(x_{j}|θ_{j})) = - log(δ p(θ_{j}, x_{j}))
However, the segments are combined into a complete message in such a way that the length of the message is less than Σ_{j} l_{j}.
As δ is small compared with the expected estimation error of θ_{j}, there will be several members of Θ*, separated by multiples of δ, any of which would give a near-minimal value of l_{j}. The basis of the construction is use of the freedom to choose any one of several more-or-less equally good values for θ_{j} so as to convey information about the following segment, j+1.
For definiteness, assume binary coding, so logarithms are to base 2.
The construction proceeds backwards, in decreasing order of j. Let M_{J+1} be the empty string, of length L_{J+1} = 0. Suppose that a string M_{j+1} of length L_{j+1} has been constructed encoding estimates and data for problems j+1, ..., J. To form a string M_{j} covering in addition problem j, proceed as follows:
- (a) Compute a distribution over Θ* given by u_{ij} = δ h(θ_{i}) f(x_{j}| θ_{i}) / r*(x_{j}), where r*(x) = δ Σ_{i} h(θ_{i}) f(x| θ_{i}).
- (b) Construct an optimum prefix code C_{j} (e.g. a Huffman code) for this distribution. C_{j} is a set of binary strings, or words, {w_{ij}, i=1, 2,...} in one-to-one correspondence with the members of Θ*, such that no word is a prefix of another, any sufficiently long binary string has a word as prefix, and the length of w_{ij} is approximately - log u_{ij}. The approximation, which arises because the lengths of words are necessarily integral, will be neglected.
- (c) Find the value k of index i for which w_{kj} is a prefix of M_{j+1}. If no such k exists, append randomly-chosen binary digits to the end of M_{j+1} until a match occurs. Let t_{j} be the number of digits appended.
- (d) Delete w_{kj} from the front of M_{j+1}, leaving a string M^{-}_{j+1} of length L^{-}_{j+1} = L_{j+1} + log u_{kj} + t_{j}. Note that if t_{j} ≠ 0, L^{-}_{j+1} = 0.
- (e) Construct a string S_{j} encoding datum x_{j} using estimate θ_{j} = θ_{k}. Its length is l_{j} = - log(δ p(θ_{k}, x_{j})).
- (f) Construct M_{j} by prepending S_{j} to the front of M^{-}_{j+1}.
Steps (a) to (f) are repeated until the final complete message M_{1} has been formed.
We must show that M_{1} can be decoded using knowledge only of h and f, and of the construction algorithm. Decoding proceeds in order of increasing j.
The segment S_{1} for problem 1 appears in full as the prefix of M_{1}. It is easily decoded. Its first part is a Huffman or similar coding of θ_{1} based on the known distribution δ h(θ_{i}) over Θ*. Having decoded θ_{1}, the second part can be decoded, as it is a Huffman or similar coding of x_{1} based on the now-known distribution f(x|θ_{1}) over X.
Having decoded S_{1}, we can delete it from the front of M_{1}, leaving M^{-}_{2}.
Now, knowing x_{1}, we can compute the distribution u_{i1} over Θ*, and hence reconstruct the code C_{1}. Suppose the decoded estimate θ_{1} has index k in Θ*. Then we can select w_{k1} from C_{1}, and prepend it to M^{-}_{2} giving M_{2}.
Since S_{2} appears in full as the prefix of M_{2}, it too can now be decoded, and so on until the entire sequence of data (and estimates) has been recovered.
At the end of the decoding process, we will be left with a string M^{+}_{J+1} comprising all the Σ_{j} t_{j} random digits appended in steps (c) of the coding algorithm.
It will be useful in what follows to imagine that the coding process began with these randomly-chosen digits already in place, i.e. with M_{J+1} = M^{+}_{J+1}, L_{J+1} = Σ_{j} t_{j}. The result of the coding is then unchanged, but in step (c) it would never be necessary to append additional digits, and at every stage of the coding process we have L^{-}_{j+1} = L_{j+1} + log u_{kj}.
- For all j,
- L_{j} = L^{-}_{j+1} + l_{j} = L_{j+1} + log u_{kj} - log(δ p(θ_{k}, x_{j}))
- where, of course, k is a function of j.
Now, u_{kj} = δ h(θ_{k}) f(x_{j}|θ_{k}) / r*(x_{j}) = δ p(θ_{k}, x_{j}) / r*(x_{j})
- Hence,
- L_{j} = L_{j+1} - log r*(x_{j})
- L_{1} = L_{J+1} - Σ_{j} log r*(x_{j})
- We now aim to show that the sequence of estimates yielded by this construction is close to an SMML estimate for the sequence. To do so, we will try to show that I_{1}, the expected value of L_{1} over the marginal distribution of x, is not much greater than the minimum possible value
- I_{0} = - J Σ_{X} r(x) log r(x).
- In a long sequence, datum value x may be expected to occur with relative frequency r(x). Hence the expected value of the second term in L_{1} is
- - J Σ_{X} r(x) log r*(x)
Now, r*(x) = δ Σ_{i} p(θ_{i}, x) = r(x) δ Σ_{i} g(θ_{i}|x) where the sum is over a set Θ* of points equally spaced in Θ at intervals of δ. For sufficiently regular g( ), δ Σ_{i} g(θ_{i}|x) approximates the integral ∫ dθ g(θ|x) = 1 with an error which depends on the function g and the size of δ.
Further, r*(x) = δ Σ_{i} h(θ_{i}) f(x| θ_{i})
- For sufficiently regular distributions,
- Σ_{X} r*(x) = δ Σ_{i} h(θ_{i}) Σ_{X} f(x| θ_{i}) = δ Σ_{i} h(θ_{i}) = 1
- so r*(x) is a proper distribution. Hence,
- - Σ_{X} r(x) log r*(x) ≥ - Σ_{X} r(x) log r(x)
- and
- I_{1} - I_{0} = E(L_{J+1}) - J Σ_{X} r(x) log (r*(x) / r(x))
- = E(L_{J+1}) - J Σ_{X} r(x) log Σ_{i} δ g(θ_{i}|x)
The amount by which the sum Σ_{i} δ g(θ_{i}|x) departs from one may, for regular g( ), be expected to be proportional to some small power of δ. Then we can write
- I_{1} - I_{0} = E(L_{J+1}) + J A δ^{n}
to order δ^{n} where A is a constant depending on r( ) and g( ) but not on J.
Now consider E(L_{J+1}) = E(Σ_{j} t_{j}). In the first stage of the message construction, forming M_{J}, as many random digits will be required in step (c) as there are digits in the word w_{kJ}. In effect, w_{kJ} is selected from C_{J} by finding a match between a word of C_{J} and the prefix of a long random binary string. The probability that a word of length l will match is just 2^{-l}. The length of w_{iJ} is - log u_{iJ}. Hence the expected value of t_{J} is the expected value of - Σ_{i} u_{iJ} log u_{iJ}.
Now u_{iJ} is δ h(θ_{i}) f(x_{J}|θ_{i}) / r*(x_{J}), or approximately δ g(θ_{i}|x_{J}). Thus the expected value of t_{J} is approximately - log δ + B, where B is a constant approximating the expected entropy of the posterior distribution g( ).
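The identity E(t_{J}) = - Σ_{i} u_{iJ} log u_{iJ} is easy to verify numerically when the weights are dyadic, so that the code word lengths are exactly - log u_{i}. A small sketch with illustrative weights (not taken from the paper):

```python
import math

# Illustrative dyadic posterior weights: codeword lengths are exactly -log2(u_i)
u = [0.5, 0.25, 0.125, 0.125]
lengths = [-math.log2(p) for p in u]   # 1, 2, 3, 3 digits

# Expected number of random digits consumed in step (c) = expected word length
expected_t = sum(p, ) if False else sum(p * l for p, l in zip(u, lengths))

# ... which equals the entropy of the (quantized) posterior
entropy = -sum(p * math.log2(p) for p in u)
assert abs(expected_t - entropy) < 1e-12  # both are 1.75 bits
```

With u_{iJ} ≈ δ g(θ_{i}|x_{J}), this entropy splits into the - log δ quantization term plus the constant B of the text.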
It is unlikely that t_{j} ≠ 0 (j<J). For t_{J-1} ≠ 0, we require that M_{J}, of length L_{J} = - log(δ p(θ_{J}, x_{J})), be so short that we cannot find in it a prefix matching a word of C_{J-1}. The length of a word w_{i,J-1} of C_{J-1} is approximately - log(δ g(θ_{i}|x_{J-1})), so we will fail to find a prefix only when g(θ_{i}|x_{J-1}) < p(θ_{J}, x_{J}) for every i. Since g(θ|x) = p(θ,x) / r(x), we expect the ratio g(θ_{i}|x_{J-1}) / p(θ_{J}, x_{J}) to be of order 1/r(x), which is typically much greater than 1. We therefore approximate E(Σ_{j} t_{j}) by E(t_{J}).
- Thus, to order δ^{n},
- I_{1} - I_{0} = B - log δ + JA δ^{n}
- Choosing δ to minimise I_{1} gives δ^{n} = 1/(nJA),
- I_{1} - I_{0} = B + (1/n) + (1/n) log (nJA).
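The minimisation over δ can be checked numerically. The constants below are purely illustrative stand-ins for n, J, A and B (natural logarithms are used for convenience):

```python
import math

# Illustrative constants, not from the paper
n, J, A, B = 2, 1000.0, 0.5, 1.0

def excess(delta):
    """I_1 - I_0 = B - log(delta) + J*A*delta**n, to order delta**n."""
    return B - math.log(delta) + J * A * delta**n

# Closed-form optimum from the text: delta**n = 1/(n*J*A)
delta_opt = (1.0 / (n * J * A)) ** (1.0 / n)

# The optimum beats nearby values of delta ...
assert excess(delta_opt) < excess(delta_opt * 1.1)
assert excess(delta_opt) < excess(delta_opt * 0.9)
# ... and the minimum matches B + 1/n + (1/n) log(nJA)
assert abs(excess(delta_opt) - (B + 1/n + (1/n) * math.log(n * J * A))) < 1e-9
```

The excess grows only as (1/n) log J, the order-log-J behaviour noted below.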
Thus we find that this construction, like the more exact but less general construction of section 6, gives a value for I_{1} - I_{0} of order log J for sufficiently regular h( ) and f( ).
As this value for I_{1} - I_{0} is less than is achieved by the use of a single-problem SMML estimator on each basic problem of the sequence, which gives I_{1} - I_{0} of order J, the construction must be close to achieving the shortest possible expected length for a message using estimates, and so its estimate sequence must be similar to that given by the SMML sequence estimator.
It is worth remarking that this construction, like that of section 6, does not require that every problem of the sequence have the same form.
We now show that the construction imitates a false oracle. M_{j+1} is a near-optimal binary coding of the data sequence {x_{j+1}, ..., x_{J}}. When random variates from some distribution are encoded as a binary string using an optimal code for that distribution, the resulting binary string is a realization of a Bernoulli sequence with p=0.5. Thus M_{j+1} (extended if necessary by appended random digits) is such a realization. The probability that a given binary string of length l is a prefix of M_{j+1} is 2^{-l}. So the probability that step (c) will select θ_{j} = θ_{i} is 2^{-length(w_{ij})} = u_{ij}.
But u_{ij} = δ p(θ_{i}, x_{j}) / r*(x_{j}) and r*(x_{j}) approximates r(x_{j}).
Hence u_{ij} approximates δ g(θ_{i}|x_{j}). If we consider the data sequence {x_{j+1}, ..., x_{J}} to be a realization of their joint marginal distribution, the probability that this realization will, for some x_{j}, cause the selection of the value θ_{i} as the estimate for problem j is just the posterior probability of θ_{i} given x_{j}. In this sense, θ_{j} is a random selection from the (quantized) posterior distribution of θ_{j} given x_{j}, as is made by the false oracle T_{p}.
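The claim that prefix-matching against effectively random digits selects θ_{i} with probability 2^{-length(w_{ij})} = u_{ij} can be illustrated by simulation. Here a fixed complete prefix code with dyadic weights stands in for C_{j} (the code and weights are illustrative only):

```python
import random

# A complete prefix code whose word lengths are exactly -log2(u_i),
# for dyadic weights u = (1/2, 1/4, 1/8, 1/8) over four quantized estimates
code = {'0': 0, '10': 1, '110': 2, '111': 3}
u = [0.5, 0.25, 0.125, 0.125]

def select(bits):
    """Step (c): the codeword found as a prefix of the bit string picks the estimate."""
    for w, i in code.items():
        if bits.startswith(w):
            return i
    raise ValueError("string too short")

random.seed(0)
trials = 100_000
counts = [0, 0, 0, 0]
for _ in range(trials):
    bits = ''.join(random.choice('01') for _ in range(3))  # 3 digits always suffice here
    counts[select(bits)] += 1

freqs = [c / trials for c in counts]
# Selection frequencies approximate the posterior weights u_i = 2**(-len(w_i))
assert all(abs(f - p) < 0.01 for f, p in zip(freqs, u))
```

So feeding the coded suffix of the message to step (c) behaves like drawing the estimate from the quantized posterior, exactly as the false oracle does.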
8. Discussion
We have defined a class of ‘empirical’ criteria for the comparison of estimators based on their performance on data or sequences of data. For sequences, there are empirical criteria which estimate the bias and variance of estimators, and any other conventional measure of performance expressible as the expectation of some function of the true parameter and the estimated parameter. We have shown the existence of estimators called false oracles which no empirical criterion can distinguish from the oracle (whose ‘estimates’ are the true values). A false oracle can be constructed by randomly selecting an estimate from its posterior distribution.
The SMML estimator for a sequence has, in two cases, been shown to imitate a false oracle. In the first case, applicable whenever the parameters have slowly-varying priors and approximately quadratic log likelihoods, the SMML estimator is based on a quantizing lattice. The estimate θ_{j} of the j-th parameter is a (discontinuous) function of all the data {x_{1}, x_{2}, ..., x_{j}, ..., x_{J}} but, if one regards the data values other than x_{j} as being randomly drawn from their respective marginal distributions, θ_{j} is effectively sampled from a distribution approximating its posterior given x_{j}.
In the second, more general case, we have shown that a near-optimal coding of a data sequence, (and so a near-SMML estimator) is obtained when each estimate θ_{j} is sampled from its posterior distribution given x_{j} by using the coded data x_{j+1}...x_{J} as a pseudo-random source of binary digits.
In both cases, a pseudo-random selection of θ_{j} is made using (some of) the data other than x_{j} as a source of “random” information. Thus the SMML estimator is not truly a false oracle. A criterion based on knowledge of the “pseudo-random number generator” inherent in both coding algorithms could empirically distinguish the SMML estimator from an oracle. However, since in both cases the functional dependence of θ_{j} on data other than x_{j} is quite obscure, for all practical purposes one may expect the statistical behaviour of the SMML estimator for a long sequence to be indistinguishable from that of a false oracle, and hence from that of an oracle.
9. Applications
The analysis of the behaviour of minimum-message-length estimators on long sequences of identical basic problems has practical application. There are practical estimation problems where the model includes not only a fixed set α of general parameters but also a set of ‘nuisance’ parameters {θ_{n}, n=1...N} as numerous as the sample size N. For instance, in a factor analysis model of a multivariate Gaussian population, once the means, specific variances and factor loadings are known or estimated, the estimation of the factor scores becomes a sequence of N identical independent estimation problems. The problem fits the assumptions of section 6, so the analysis allows us to determine the length of the coded message. In particular, the analysis shows that the message length L contains a term (1/2) log(π N) - γ rather than the larger term (N/2) log(2 π e / 12) which is given by an analysis treating the estimation of each factor score separately. The consequent reduction in message length affects the comparison of the factor model against alternative models of the same data, since the minimum-message-length approach prefers the model with the smaller (minimized) value of L (Wallace & Freeman 1992).
A second example is a mixture model, in which each datum x_{n} is thought to be sampled from an unknown member of a set of different distributions {f_{k}(x|α_{k}), k=1...K}. Here, once the distribution parameters α_{k} and relative abundances h_{k} are known or estimated, the estimation of the class (i.e. distribution) k = θ_{n} to which each datum belongs is a sequence of identical basic problems each requiring the estimation of a categorical parameter θ_{n}.
If each basic problem is considered separately, the shortest encoding is obtained by estimating θ_{n} to be that value of k for which h_{k} f_{k}(x_{n}|α_{k}) is a maximum. That is, datum x_{n} would be assigned to that component of the mixture most likely to contain it. The resulting message length L would then have the form
- L = Q{α_{k}, h_{k}, k=1...K} - Σ_{n} log [ max_{k} (h_{k} f_{k}(x_{n}|α_{k})) ]
Estimation of the distribution parameters and abundances by minimization of the above form has been found to give inconsistent estimates. However, the analysis of section 7 shows that a shorter coding is possible, and has a message length for large N of the form
- L = Q{α_{k}, h_{k}} - Σ_{n} log Σ_{k} h_{k} f_{k}(x_{n}|α_{k}) + O(1)
In the mixture problems so far studied, minimization of the latter form yields consistent estimates for {α_{k}} and {h_{k}}. The smaller value of the latter form also affects the comparison between the lengths for models with different numbers of component distributions, and so affects the estimation of the number of components.
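The relation between the two codings can be illustrated numerically: since Σ_{k} h_{k} f_{k} ≥ max_{k} h_{k} f_{k} term by term, the section-7 form is never longer. A toy two-component Gaussian mixture, with all parameters chosen purely for illustration:

```python
import math
import random

# Illustrative two-component Gaussian mixture (not from the paper)
mu, sigma, h = [-1.0, 2.0], [1.0, 1.0], [0.4, 0.6]

def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Sample 200 data from the mixture
random.seed(1)
data = [random.gauss(mu[k], sigma[k]) for k in random.choices([0, 1], weights=h, k=200)]

# Data-coding terms of the two message-length forms (natural logs; Q omitted)
# Per-datum assignment to the most likely component:
L_assign = -sum(math.log(max(h[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)))
                for x in data)
# The shorter section-7 coding, summing over components:
L_mixed = -sum(math.log(sum(h[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)))
               for x in data)

# The section-7 coding is never longer: sum >= max term by term
assert L_mixed <= L_assign
```

The gap between the two forms is largest for data lying between the components, where no single assignment is confident.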
It is perhaps worth mentioning that the coding device described in section 7 has been used successfully in a computer program for file compression.
Acknowledgements: This work was done with the assistance of a study grant from Monash University. The work owes much to the informed comment and assistance of P. R. Freeman.
References
- Barron, A.R. & Cover, T.M., Minimum Complexity Density Estimation, IEEE Trans. Inf. Thy., 37(4), pp.1034-1054, (1991).
- Rissanen, J., Stochastic Complexity, J. Royal Stat. Soc. B, 49, 3, pp.228-239, (1987).
- Wallace, C.S. & Freeman, P.R., Estimation & Inference by Compact Coding, J. Royal Stat. Soc. B, 49, 3, pp.240-252, (1987).
- Wallace C.S. & Freeman P.R., Single Factor Analysis by MML Estimation, J. Royal Stat. Soc. B, 54, 1, pp.195-209, (1992).
- Zador, P., Asymptotic Quantization Error of Continuous Signals and the Quantization Dimension, IEEE Trans. Inf. Thy., IT-28, pp.139-149, (1982).
© C. S. Wallace.