Approximations to MML

The general form for MML ("MML87") depends on the determinant of the Fisher information matrix.

MML87: For model parameter(s) θ, prior h(θ), data-space X, data x, and likelihood function f(x|θ),
θ = <θ1, ..., θn>,
F(x, θ)ij = d2/(dθi dθj) { - ln f(x|θ) },
F(θ) = ∑x:X{ f(x|θ).F(x,θ) }   -- i.e., expectation,
then
msgLen = mmodel + mdata where
mmodel = - ln h(θ) + (1/2)ln |F(θ)| + (n/2)ln kn  nits,
mdata = - ln f(x|θ) + n/2  nits.
 
Note, k1 = 1/12 = 0.083333, k2 = 0.080188, k3 = 0.078543, k4 = 0.076603, k5 = 0.075625, k6 = 0.074244, k7 = 0.073116, k8 = 0.071682, and kn->1/(2πe) = 0.0585498 as n->∞ [Conway & Sloane '88].
(MML87 requires that f(x|θ) varies little over the data measurement accuracy region and that h(θ) varies little over the parameter uncertainty region.)
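As a concrete illustration, here is a minimal Python sketch of the MML87 message length for one hypothetical case: a Gaussian mean mu with known sigma and a uniform prior on [a, b]. The model, the prior, and all names are assumptions made for the example, not prescribed by MML87 itself.

import math

def mml87_msglen(x, sigma, a, b, mu):
    # total MML87 message length in nits for the mean, mu, of a Gaussian
    # with known sigma, given a uniform prior on [a, b]
    N = len(x)
    n = 1                          # one free parameter, mu
    k1 = 1.0 / 12.0                # lattice constant for n = 1
    h = 1.0 / (b - a)              # uniform prior density h(mu)
    F = N / sigma**2               # expected Fisher information for mu
    neg_log_f = (sum((xi - mu)**2 for xi in x) / (2 * sigma**2)
                 + N * math.log(sigma * math.sqrt(2 * math.pi)))
    m_model = -math.log(h) + 0.5 * math.log(F) + (n / 2) * math.log(k1)
    m_data = neg_log_f + n / 2
    return m_model + m_data

data = [4.1, 3.8, 5.0, 4.4]
print(mml87_msglen(data, sigma=1.0, a=0.0, b=10.0, mu=sum(data) / len(data)))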

Sometimes the mathematics for the Fisher information is not tractable. It may be possible to transform the problem so that it becomes easier (e.g., the use of orthonormal basis functions for polynomial fitting, as in the sketch below); this is acceptable because MML is invariant under such reparameterisation.
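As a hint of why such a transformation helps, a sketch assuming a linear-Gaussian polynomial regression (an illustrative setting; the design points and names are assumptions): there the Fisher matrix is Phi'Phi/sigma^2, so orthonormalising the basis over the design points makes |F| trivial.

import numpy as np

x = np.linspace(-1.0, 1.0, 50)            # design points (assumed)
degree = 3
V = np.vander(x, degree + 1)              # raw monomial basis: ill-conditioned F
Q, _ = np.linalg.qr(V)                    # orthonormal basis spanning the same fits
for Phi, name in [(V, "monomials"), (Q, "orthonormal")]:
    F = Phi.T @ Phi                       # Fisher matrix, up to a 1/sigma^2 factor
    print(name, "log|F| =", np.linalg.slogdet(F)[1])

Failing that, the remaining options include: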

Gradient2

CSW, csse tea room, 22/5/'01: We take the gradient (a vector), Gk, of the log-likelihood function at a single datum xk and form the matrix GkGk', i.e. the outer product. This transforms like the square of a density. Assuming that the data are i.i.d., we can then sum over all the observed data to get  γ = ∑k=1..N GkGk'.  This, again, transforms like the square of a density, so it can be used as an approximation to the expected Fisher information, as γ will be invariant.

The downside is that we need the amount of data, N, to be at least as large as the number of parameters to be estimated. If not, then the matrix γ will be singular.

This approximation has been used in some versions of SNOB.

(Present csw, dld, la, rdp, 22/5/'01.)
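A minimal numerical sketch of the above, assuming an i.i.d. Gaussian with θ = (μ, σ) (an illustrative choice, not from the discussion): sum the outer products of the per-datum gradients of -log f and compare with the expected Fisher information.

import numpy as np

def gamma_outer(x, mu, sigma):
    g_mu = -(x - mu) / sigma**2                       # d/dmu of -log f(x_k|theta)
    g_sigma = 1.0 / sigma - (x - mu)**2 / sigma**3    # d/dsigma of -log f(x_k|theta)
    G = np.stack([g_mu, g_sigma], axis=1)             # one gradient row per datum x_k
    return G.T @ G                                    # gamma = sum_k G_k G_k'

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=1000)
print(gamma_outer(x, x.mean(), x.std()))   # ~ N * diag(1/sigma^2, 2/sigma^2)

Each term GkGk' is rank one, so γ is positive semi-definite by construction.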

Probability, pr(x|θ),   nlpr(x|θ) = - log pr(x|θ).
Given data x1, x2, ..., xn,
negative log likelihood, L,
L = ∑i nlpr(xi|θ) = ∑i{ - log pr(xi|θ) }
1st derivative of L wrt θ:
dL/dθ = ∑i nlpr'(xi|θ) = ∑i{ - (d/dθ pr(xi|θ)) / pr(xi|θ) }
2nd derivative of L wrt θ:
d2L/dθ2 = ∑i{ - (d2/dθ2 pr(xi|θ)) / pr(xi|θ) + {(d/dθ pr(xi|θ)) / pr(xi|θ)}2 }
~ ∑i{ (d/dθ pr(xi|θ)) / pr(xi|θ) }2
= ∑i{nlpr'(xi|θ)}2
assuming that ∑i{ - (d2/dθ2 pr(xi|θ)) / pr(xi|θ) } is small; note that its expected value is
Ex[ - (d2/dθ2 pr(x|θ)) / pr(x|θ) ]
= ∫ { - (d2/dθ2 pr(x|θ)) / pr(x|θ) } . pr(x|θ) dx
= - ∫ d2/dθ2 pr(x|θ) dx
= - d2/dθ2 ∫ pr(x|θ) dx     -- unless pr is pathological
= - d2/dθ2 1     -- !
= 0
If θ = <θ1, ..., θk>, the 2nd derivative becomes the matrix of 2nd derivatives d2L/(dθi dθj), nlpr' becomes the gradient of nlpr (one may also see the Jacobian, J), and the { }2 becomes the outer product.
-- 2007, LA
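A numerical check of the "small term" step above, assuming a Gaussian model for the mean (illustrative only): the dropped term - pr''/pr averages to about zero, while the average of {nlpr'}2 recovers the Fisher term 1/σ2 per datum.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N = 0.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=N)

dropped = 1.0 / sigma**2 - (x - mu)**2 / sigma**4   # - pr''/pr for this model
grad_sq = ((x - mu) / sigma**2) ** 2                # {nlpr'(x_i|mu)}^2

print("mean of dropped term:", dropped.mean())      # close to 0
print("mean of {nlpr'}^2:   ", grad_sq.mean())      # close to 1/sigma^2 = 0.25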

Empirical Fisher

The Fisher information matrix contains expected 2nd derivatives of the -log likelihood function with respect to the model parameters. It is possible to estimate these 2nd derivatives, given the data, by perturbing the parameters, individually and in pairs, by small amounts and calculating the changes in the likelihood. This computation is feasible for quite large numbers of parameters.

Unfortunately the resulting matrix is not guaranteed to be positive definite. The gradient2 method described above does not have this (possible) problem.

The empirical Fisher is also not invariant.

(This has been discussed by csw since well before 1991.)
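A minimal sketch of such a finite-difference estimate, assuming a generic function nll(theta, x) returning the negative log-likelihood (the name and signature are illustrative, not from any particular program):

import numpy as np

def empirical_fisher(nll, theta, x, eps=1e-4):
    # central finite differences of nll(theta, x) = -log likelihood,
    # perturbing parameters singly (diagonal) and in pairs (off-diagonal)
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    H = np.zeros((n, n))
    f = lambda t: nll(t, x)
    for i in range(n):
        ei = np.zeros(n); ei[i] = eps
        H[i, i] = (f(theta + ei) - 2.0 * f(theta) + f(theta - ei)) / eps**2
        for j in range(i + 1, n):
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = H[j, i] = (f(theta + ei + ej) - f(theta + ei - ej)
                                 - f(theta - ei + ej) + f(theta - ei - ej)) / (4.0 * eps**2)
    return H   # not guaranteed to be positive definite, as noted above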

MMLD

MsgLen ~ - log{ ∫R h(θ) dθ }  -  { ∫R h(θ) log f(x|θ) dθ } / { ∫R h(θ) dθ },

where R is an uncertainty region of the parameter space.
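A sketch evaluating this by simple quadrature, assuming a Bernoulli model with s successes in N trials, a uniform prior h on [0, 1], and one given candidate region R = [lo, hi] (all illustrative choices, not from the text):

import numpy as np

def mmld_msglen(s, N, lo, hi, grid=10_000):
    t = np.linspace(lo, hi, grid)
    dt = t[1] - t[0]
    h = np.ones_like(t)                      # uniform prior density h(theta)
    neg_log_f = -(s * np.log(t) + (N - s) * np.log(1.0 - t))
    Z = h.sum() * dt                         # integral over R of h(theta) dtheta
    return -np.log(Z) + (h * neg_log_f).sum() * dt / Z

print(mmld_msglen(s=30, N=100, lo=0.25, hi=0.35))

In MMLD the region R would then be chosen to (approximately) minimise this total; the sketch above only prices a single candidate region.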