The general form for MML ("MML87") depends on
the determinant of the
Fisher information matrix.
- MML87: For model parameter(s) θ, prior h(θ), data-space X, data x, and likelihood function f(x|θ),
- θ = <θ_1, ..., θ_n>,
- F(x, θ)_ij = d²/dθ_i dθ_j { - ln f(x|θ) },
- F(θ) = ∑_{x:X} { f(x|θ) . F(x, θ) } -- i.e., expectation,
- then
- msgLen = m_model + m_data where
- m_model = - ln h(θ) + (1/2) ln |F(θ)| + (n/2) ln k_n nits,
- m_data = - ln f(x|θ) + n/2 nits.
- Note,
k_1 = 1/12 = 0.083333,
k_2 = 0.080188,
k_3 = 0.078543,
k_4 = 0.076603,
k_5 = 0.075625,
k_6 = 0.074244,
k_7 = 0.073116,
k_8 = 0.071682, and
k_n → 1/(2πe) = 0.0585498 as n → ∞
[Conway & Sloane '88].
- (MML87 requires that f(x|θ) varies little
over the data measurement accuracy region and that h(θ) varies
little over the parameter uncertainty region.)
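As a rough illustration of the MML87 formula above, the following sketch adds up the two parts of the message for a single Bernoulli parameter. The function names, the uniform prior, and the use of the maximum-likelihood value of θ (rather than the MML estimate) are assumptions made for this example only; the expected Fisher information N / (θ(1 - θ)) is the standard result for N Bernoulli trials.

```python
import math

# Lattice constants k_n quoted in the text (Conway & Sloane '88), n = 1..8.
LATTICE_K = {1: 1.0 / 12.0, 2: 0.080188, 3: 0.078543, 4: 0.076603,
             5: 0.075625, 6: 0.074244, 7: 0.073116, 8: 0.071682}


def mml87_msglen(neg_log_prior, log_det_fisher, neg_log_lik, n_params):
    """Return (m_model, m_data) in nits for the MML87 formula."""
    k_n = LATTICE_K[n_params]
    m_model = (neg_log_prior + 0.5 * log_det_fisher
               + 0.5 * n_params * math.log(k_n))
    m_data = neg_log_lik + 0.5 * n_params
    return m_model, m_data


def bernoulli_example(successes, trials):
    """Hypothetical example: one Bernoulli parameter, uniform prior on [0, 1].

    theta is set to the maximum-likelihood value for simplicity;
    this is an assumption of the sketch, not the MML estimate.
    """
    theta = successes / trials
    neg_log_prior = 0.0  # h(theta) = 1, so -ln h(theta) = 0
    log_det_fisher = math.log(trials / (theta * (1.0 - theta)))
    neg_log_lik = -(successes * math.log(theta)
                    + (trials - successes) * math.log(1.0 - theta))
    return mml87_msglen(neg_log_prior, log_det_fisher, neg_log_lik, 1)


m_model, m_data = bernoulli_example(successes=30, trials=100)
print(f"model part {m_model:.3f} nits, data part {m_data:.3f} nits, "
      f"total {m_model + m_data:.3f} nits")
```

Passing ln |F(θ)| rather than |F(θ)| itself keeps the helper usable when the determinant is very large or very small.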
Sometimes the mathematics of the Fisher information is not tractable.
It may be possible to transform the problem so that it becomes easier
(e.g., as in
the use of orthonormal basis functions for polynomial fitting),
which is acceptable because MML is invariant under such transformations.
Failing that, the remaining options include:
- simplifying assumptions,
- numerical approximations,
- empirical Fisher.
Gradient²
CSW, csse tea room, 22/5/'01:
We take the gradient (a vector), G,
of the log-likelihood function and form the
matrix GG', i.e. the outer product.
This will transform like the square of a density.
Assuming that the data are i.i.d.,
we can then sum over all the observed data to get
γ = ∑_{k=1..N} (G_k G_k'), where G_k is the gradient evaluated at the k-th datum.
This, again, will transform like the square of a density.
So, it can be used as an approximation to the expected Fisher information,
as γ will be invariant.
The downside is that we need the amount of data, N, to be
at least as large as the number of parameters to be estimated.
If not, then the matrix γ will be singular.
This approximation has been used in some versions of SNOB.
(Present csw, dld, la, rdp, 22/5/'01.)
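The construction of γ can be sketched for a two-parameter Normal(μ, σ) model. The model, the data values, and the helper names below are assumptions chosen for illustration; the code only shows the sum of outer products and the singularity that arises when N is smaller than the number of parameters.

```python
def normal_score(x, mu, sigma):
    """Gradient G of log f(x | mu, sigma) for a Normal density."""
    d_mu = (x - mu) / sigma ** 2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma ** 3
    return [d_mu, d_sigma]


def outer_product_sum(data, mu, sigma):
    """gamma = sum over the data of G G' (a 2 x 2 matrix as nested lists)."""
    gamma = [[0.0, 0.0], [0.0, 0.0]]
    for x in data:
        g = normal_score(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                gamma[i][j] += g[i] * g[j]
    return gamma


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


data = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]  # made-up observations
gamma = outer_product_sum(data, mu=5.0, sigma=0.7)
print("det(gamma), N = 6:", det2(gamma))

# With a single datum (N = 1 < 2 parameters) gamma is a rank-one matrix,
# so its determinant vanishes up to rounding error.
gamma_one = outer_product_sum(data[:1], mu=5.0, sigma=0.7)
print("det(gamma), N = 1:", det2(gamma_one))
```

Each term G G' has rank one, so the sum of N such terms has rank at most N, which is why at least as many data as parameters are needed.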
- Probability, pr(x|θ),
nlpr(x|θ) = - log pr(x|θ).
- Given data
x_1, x_2, ..., x_n,
- negative log likelihood, L,
- L = ∑_i nlpr(x_i|θ)
= ∑_i { - log pr(x_i|θ) }
- 1st derivative of L wrt θ:
- dL/dθ
= ∑_i nlpr'(x_i|θ)
= ∑_i { - (d/dθ pr(x_i|θ)) / pr(x_i|θ) }
- 2nd derivative of L wrt θ:
- d²L/dθ²
= ∑_i { - (d²/dθ² pr(x_i|θ)) / pr(x_i|θ) + {(d/dθ pr(x_i|θ)) / pr(x_i|θ)}² }
- ~ ∑_i {(d/dθ pr(x_i|θ)) / pr(x_i|θ)}²
- = ∑_i {nlpr'(x_i|θ)}²
- assuming that
∑_i { - (d²/dθ² pr(x_i|θ)) / pr(x_i|θ) }
is small; note that the expected value is
- E_x { - (d²/dθ² pr(x|θ)) / pr(x|θ) }
- = ∫ { - (d²/dθ² pr(x|θ)) / pr(x|θ) } . pr(x|θ) dx
- = - ∫ d²/dθ² pr(x|θ) dx
- = - d²/dθ² ∫ pr(x|θ) dx
-- unless pr is pathological
- = - d²/dθ² 1 -- !
- = 0
- If θ = <θ_1, ..., θ_k>,
the 2nd derivative becomes the matrix of 2nd derivatives
d²L/dθ_i dθ_j,
nlpr' becomes grad nlpr (may also see the Jacobian, J),
and
the { }² becomes the outer product.
-- 2007, LA
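The one-parameter approximation above can be checked numerically. The sketch below assumes an Exponential model, pr(x|θ) = θ exp(-θx), so that nlpr'(x|θ) = x - 1/θ and d²L/dθ² = n/θ²; the model, the deterministic quantile "sample", and the function names are all choices made for this illustration.

```python
import math


def exponential_quantile_sample(rate, n):
    """Deterministic, evenly spaced quantiles of an Exponential(rate) density."""
    return [-math.log(1.0 - (i + 0.5) / n) / rate for i in range(n)]


def second_derivative(data, theta):
    """d2L/dtheta2 for nlpr(x|theta) = -log(theta) + theta * x."""
    return len(data) / theta ** 2


def sum_squared_first_derivatives(data, theta):
    """sum_i nlpr'(x_i|theta)^2 with nlpr'(x|theta) = x - 1/theta."""
    return sum((x - 1.0 / theta) ** 2 for x in data)


data = exponential_quantile_sample(rate=2.0, n=1000)
theta_hat = len(data) / sum(data)  # maximum-likelihood rate
exact = second_derivative(data, theta_hat)
approx = sum_squared_first_derivatives(data, theta_hat)
print(f"d2L/dtheta2 = {exact:.2f}, sum of squares = {approx:.2f}, "
      f"ratio = {approx / exact:.4f}")
```

The ratio should be close to, but not exactly, one: the neglected term has zero expectation but is generally non-zero for any particular data set.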
Empirical Fisher
The Fisher information matrix contains expected 2nd derivatives
of the -log likelihood function with respect to the model parameters.
It is possible to estimate these 2nd derivatives,
given the data, by perturbing the parameters, individually and in pairs,
by small amounts and calculating the changes in the -log likelihood.
This computation is feasible for quite large numbers of parameters.
Unfortunately, the resulting matrix is not guaranteed to be positive definite.
The gradient² method described above does not have
this (possible) problem.
The empirical Fisher is also not invariant.
(This has been discussed by csw since well before 1991.)
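The perturbation scheme can be sketched with central differences. The Normal(μ, σ) model, the step size, the data values, and the function names below are assumptions for illustration; the routine itself only needs a function returning the -log likelihood.

```python
import math


def neg_log_lik(params, data):
    """-log likelihood of i.i.d. Normal(mu, sigma) data; params = [mu, sigma]."""
    mu, sigma = params
    return sum(math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
               + (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)


def empirical_fisher(func, params, data, step=1e-4):
    """Central-difference estimate of the matrix of 2nd derivatives of func."""
    n = len(params)

    def shifted(i, di, j, dj):
        p = list(params)
        p[i] += di
        p[j] += dj
        return func(p, data)

    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Perturb parameters i and j (the same one twice when i == j).
            value = (shifted(i, step, j, step) - shifted(i, step, j, -step)
                     - shifted(i, -step, j, step) + shifted(i, -step, j, -step)
                     ) / (4.0 * step * step)
            hessian[i][j] = hessian[j][i] = value
    return hessian


data = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]  # made-up observations
mu_hat = sum(data) / len(data)
var_hat = sum((x - mu_hat) ** 2 for x in data) / len(data)
H = empirical_fisher(neg_log_lik, [mu_hat, math.sqrt(var_hat)], data)
for row in H:
    print(row)
```

At the maximum-likelihood estimate of this particular model the analytic 2nd derivatives are N/σ² for μ, 2N/σ² for σ, and 0 for the cross term, which gives a convenient check. Away from such a point, or with noisier differences, nothing in the routine forces the result to be positive definite.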
MMLD
MsgLen ~ - log{ ∫_R h(θ) dθ } - { ∫_R h(θ) log f(x|θ) dθ } / { ∫_R h(θ) dθ }
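For a single parameter, the MMLD expression can be evaluated by simple numerical integration over a region R. In the sketch below R is an interval picked by hand, the prior is uniform, and the likelihood is that of Bernoulli trials; these choices and all function names are assumptions for illustration, and nothing here says how R should be chosen.

```python
import math


def mmld_msglen(log_lik, prior, lower, upper, steps=2000):
    """Midpoint-rule evaluation of the MMLD expression over R = [lower, upper]."""
    width = (upper - lower) / steps
    prior_mass = 0.0
    weighted_log_lik = 0.0
    for i in range(steps):
        theta = lower + (i + 0.5) * width
        h = prior(theta)
        prior_mass += h * width
        weighted_log_lik += h * log_lik(theta) * width
    return -math.log(prior_mass) - weighted_log_lik / prior_mass


def bernoulli_log_lik(successes, trials):
    """Return log f(x|theta) as a function of theta for Bernoulli trials."""
    def log_lik(theta):
        return (successes * math.log(theta)
                + (trials - successes) * math.log(1.0 - theta))
    return log_lik


def uniform_prior(theta):
    """h(theta) = 1 on [0, 1]."""
    return 1.0


log_lik = bernoulli_log_lik(successes=30, trials=100)
for lower, upper in [(0.29, 0.31), (0.25, 0.35), (0.15, 0.45)]:
    length = mmld_msglen(log_lik, uniform_prior, lower, upper)
    print(f"R = [{lower}, {upper}]: {length:.3f} nits")
```

The first term rewards a region with large prior mass, while the second penalises a region that includes parameter values fitting the data poorly, so the printed lengths show the trade-off as R widens.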