Normal

In this page: maximum likelihood (ML), ML-estimators, MML, Fisher information, MML-estimators, measurement accuracy -- lloyd

Maximum Likelihood

The negative log likelihood, L, for `n' observations assumed to come from a normal distribution, N_mu,sigma, is:

              n         1               (x_i-mu)²
L = -log{ PROD  ----------------.exp( - --------- ) }
            i=1 sqrt(2 pi) sigma        2 sigma²

  n             n                  1       n
= -.log(2 pi) + -.log(sigma²) + -------.SUM (x_i-mu)²
  2             2               2sigma²  i=1
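As a quick check of the expansion above, the following Python sketch evaluates L both from the closed form and by summing -log(density) over the observations one at a time. The function names and the data are made up for illustration, and natural logarithms are assumed (so L is in nits).

```python
import math

def neg_log_likelihood(xs, mu, sigma):
    """Closed form: (n/2).log(2 pi) + (n/2).log(sigma^2) + SUM (x_i-mu)^2 / (2 sigma^2)."""
    n = len(xs)
    v = sigma * sigma
    ss = sum((x - mu) ** 2 for x in xs)
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)

def neg_log_likelihood_direct(xs, mu, sigma):
    """Sum of -log(density) over the observations, term by term."""
    total = 0.0
    for x in xs:
        density = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        total -= math.log(density)
    return total

xs = [4.9, 5.3, 5.1, 4.7, 5.0]   # hypothetical observations
closed = neg_log_likelihood(xs, 5.0, 0.2)
direct = neg_log_likelihood_direct(xs, 5.0, 0.2)
print(closed, direct)            # the two values agree up to rounding
```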

Maximum Likelihood estimator for mu

Differentiating with respect to mu

d L      1       d       n
---- = -------.-----{ SUM  (x_i-mu)² }
d mu   2sigma²  d mu   i=1


= (n.mu - (x₁+ ... + x_n)) / sigma²

Setting this to zero gives the maximum likelihood estimator for mu

mu_ML = (x₁+ ... +x_n)/n

i.e. the (sample-) mean.

Maximum Likelihood estimator for the variance (& sigma)

Differentiating L w.r.t. v = sigma²:

d L      n     1      n
---  =  --- - ----.SUM  (x_i-mu)²
d v     2.v   2.v²   i=1

setting this to zero:

          n
v_ML  =  SUM  (x_i-mu_ML)²/n
         i=1

the maximum likelihood estimate for the variance v = sigma².

Note that if n=1, the estimate is zero, and that if n=2 the estimate effectively assumes that the mean lies between x₁ and x₂, which is clearly not necessarily the case. That is, v_ML is biased and in general underestimates the variance.
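The two maximum likelihood estimators can be sketched in a few lines of Python. The function name ml_estimates is hypothetical; the small inputs show the single-observation case, where the variance estimate collapses to zero, and a two-observation case.

```python
def ml_estimates(xs):
    """Return (mu_ML, v_ML): the sample mean and the divide-by-n variance."""
    n = len(xs)
    mu = sum(xs) / n
    v = sum((x - mu) ** 2 for x in xs) / n
    return mu, v

print(ml_estimates([3.0]))        # (3.0, 0.0): one observation gives zero variance
print(ml_estimates([1.0, 3.0]))   # (2.0, 1.0): the mean is placed midway between the two
```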

Minimum Message Length (MML)

Wallace and Boulton (1968) derived the uncertainty region for the normal distribution from first principles. Later it was seen to be a special case of a general form using the Fisher information.

Fisher Information

The off-diagonal term of the Fisher information is given by the expectation of:

  d²L
-------- =  - (n.mu - (x₁+ ... +x_n)) / v²
d mu d v

and in expectation (i.e. on average), this is zero.

The second derivative of L w.r.t. mu is:

 d²L
-----  =  n/v  =  n/sigma²
d mu²

The second derivative of L w.r.t. v is:

 d²L        n     1     n
----  =  - ---- + --.SUM  (x_i-mu)²
d v²       2.v²   v³  i=1

and in expectation this is

   n     n v
- ---- + ---  =  n/(2.v²)  =  n/(2.sigma⁴)
  2.v²    v³

The Fisher information, i.e. the determinant of the matrix of expected second derivatives, is therefore

(n/v).(n/(2.v²))  =  n²/(2.v³)  =  n²/(2.sigma⁶)

(Note, the above is with respect to mu and v. Now v = sigma², so d v / d sigma = 2.sigma.
To calculate the Fisher information with respect to mu and sigma, the above must be multiplied by (d v / d sigma)² , which gives
2.n²/sigma⁴,
as can also be confirmed by forming d L / d sigma and d² L / d sigma² directly. [--L.A. 1/12/2003])
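The determinant can be checked numerically. The sketch below is an illustration only, not part of the original derivation: it takes the expectation of L when the data really come from N(mu0, v0), using E[SUM (x_i-mu)²] = n.(v0 + (mu-mu0)²), forms its second derivatives by central finite differences, and compares the determinant with n²/(2.v³). It then applies the factor (d v / d sigma)² = 4.sigma². All names and the values of n, mu0 and sigma are assumptions.

```python
import math

def expected_L(mu, v, mu0, v0, n):
    """Expectation of L at (mu, v) when the data come from N(mu0, v0)."""
    return (0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v)
            + n * (v0 + (mu - mu0) ** 2) / (2 * v))

def fisher_mu_v(mu0, v0, n, h=1e-3):
    """Determinant of the finite-difference Hessian of expected_L at (mu0, v0)."""
    def f(m, v):
        return expected_L(m, v, mu0, v0, n)
    d_mm = (f(mu0 + h, v0) - 2 * f(mu0, v0) + f(mu0 - h, v0)) / h ** 2
    d_vv = (f(mu0, v0 + h) - 2 * f(mu0, v0) + f(mu0, v0 - h)) / h ** 2
    d_mv = (f(mu0 + h, v0 + h) - f(mu0 + h, v0 - h)
            - f(mu0 - h, v0 + h) + f(mu0 - h, v0 - h)) / (4 * h ** 2)
    return d_mm * d_vv - d_mv ** 2

n, mu0, sigma = 10, 1.5, 2.0          # hypothetical values
v0 = sigma ** 2
F_numeric = fisher_mu_v(mu0, v0, n)
F_mu_v = n ** 2 / (2 * v0 ** 3)       # closed form w.r.t. (mu, v)
F_mu_sigma = F_mu_v * (2 * sigma) ** 2  # multiply by (d v / d sigma)^2
print(F_numeric, F_mu_v)              # approximately equal
print(F_mu_sigma, 2 * n ** 2 / sigma ** 4)
```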

Minimum Message Length Estimators

msgLen = -log(h(mu,v)) + L + (1/2).log(F) + constant

= -log(h(mu,v))                                                   (from h)
  + (n/2).log(2 pi) + (n/2).log(v) + (1/(2.v)).SUM (x_i-mu)²      (from L)
  + (1/2).log(n²/2) - (3/2).log(v)                                (from F)
  + constant

differentiate w.r.t. mu:

d msgLen        d                  n
--------  =  - ----(log h(mu,v)) + -.(mu-(x₁+...+x_n)/n)
d mu           d mu                v

and w.r.t. v:

d msgLen        d                 n-3    1
--------  =  - ---(log h(mu,v)) + --- - ---SUM (x_i-mu)²
d v            d v                2.v   2v²

If the prior is h(mu,v) ~ 1/v (which is improper), then d h/d mu = 0 and

mu_MML = (x₁+ ... +x_n)/n = mu_ML

With such a prior, log h = -log(v) + constant, so -d(log h)/d v = 1/v and

d msgLen     1   n-3    1
--------  =  - + --- - ---.SUM (x_i-mu)²
d v          v   2.v   2v²


  n-1    1
= --- - ---.SUM (x_i-mu)²
  2.v   2v²

set to zero:

v_MML = {SUM_i=1..n (x_i-mu_MML)²}/(n-1)

The divisor (n-1), rather than n, is also the "well known" correction for the bias in v_ML. There it is usually introduced ad hoc; here it is derived in a justified way from the MML criterion.
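The following Python sketch evaluates the parameter-dependent part of msgLen under the prior h(mu,v) ~ 1/v and confirms, on made-up data, that v_MML gives a shorter message than v_ML or than nearby values of v. The function names are hypothetical, the additive constant and the prior's normalisation are left out, natural logarithms are assumed, and n must be at least 2.

```python
import math

def msg_len_terms(xs, mu, v):
    """Parameter-dependent part of msgLen for the prior h(mu,v) ~ 1/v."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    neg_log_prior = math.log(v)                        # -log(1/v)
    L = 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)
    half_log_F = 0.5 * math.log(n ** 2 / 2) - 1.5 * math.log(v)
    return neg_log_prior + L + half_log_F

def mml_estimates(xs):
    """Return (mu_MML, v_MML); requires at least two observations."""
    n = len(xs)
    mu = sum(xs) / n
    v = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, v

xs = [4.9, 5.3, 5.1, 4.7, 5.0]        # hypothetical observations
mu, v_mml = mml_estimates(xs)
v_ml = v_mml * (len(xs) - 1) / len(xs)  # the divide-by-n estimate
print(mu, v_mml, v_ml)
print(msg_len_terms(xs, mu, v_mml), msg_len_terms(xs, mu, v_ml))
```

On any data with n >= 2 the first printed message length should be the smaller of the two, since (n-1)/(2.v) - SUM/(2.v²) changes sign from negative to positive at v_MML.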

Measurement Accuracy

In the case of continuous distributions, such as N_mu,sigma, the likelihood function is a probability density function. To turn it into a genuine probability, it must be multiplied by the measurement accuracy. For example, if observations are measured to two decimal places, then the probability of an observation x = x₀ . x₁ x₂ +/- 0.005 is N_mu,sigma(x)*0.01. Assuming sigma>>0.01, it can be seen that, if it is included, this measurement accuracy "passes through" the calculations above untouched, not affecting the estimators. It does, however, affect the overall message length.
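A small Python sketch of this point, under the stated assumption sigma >> accuracy: multiplying each density by an accuracy eps adds n.(-log eps) to L. That term does not depend on mu or v, so it shifts the message length by the same amount for every choice of parameters and leaves the estimators where they were. The name neg_log_prob, the data and the parameter values are hypothetical.

```python
import math

def neg_log_prob(xs, mu, v, eps):
    """-log probability of xs when each x_i is measured to accuracy eps:
    the density-based L plus n.(-log eps). Assumes sigma >> eps."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    L = 0.5 * n * math.log(2 * math.pi) + 0.5 * n * math.log(v) + ss / (2 * v)
    return L - n * math.log(eps)

xs = [4.93, 5.31, 5.12, 4.70, 5.04]   # hypothetical data, two decimal places
# eps = 1.0 contributes log(1) = 0, i.e. it recovers the plain L.
shift_a = neg_log_prob(xs, 5.0, 0.04, 0.01) - neg_log_prob(xs, 5.0, 0.04, 1.0)
shift_b = neg_log_prob(xs, 5.2, 0.09, 0.01) - neg_log_prob(xs, 5.2, 0.09, 1.0)
print(shift_a, shift_b)   # the same shift, -n.log(0.01), for both parameter settings
```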

MML v. SMML

MML is an approximation to strict minimum message length (SMML) inference. As cautioned elsewhere, if MML's simplifying assumptions (i.e. h(params) nearly constant over uncertainty region & likelihood function nearly constant over uncertainty region and over measurement accuracy) do not hold then either more accurate approximations should be used or the above equations must only be used with reservations. This is simply a matter of common sense.

Notes

C. S. Wallace & D. M. Boulton. An Information Measure for Classification. The Computer Journal 11(2) pp.185-194, August 1968.
See also the Special Issue on Clustering and Classification, The Computer Journal, F. Murtagh (ed), 41(8), 1998.