Multiple Factor Analysis by MML Estimation
C. S. Wallace
Abstract: Previous work by Wallace and Freeman (1992) on the application of Minimum Message Length (MML) estimation to the Factor Analysis model of a multivariate Gaussian population is extended to allow several common factors. The extension is not trivial, raising problems in the choice of an appropriate prior for the factors, and in the evaluation of the Fisher information of the model. However, the resulting complications appear to cancel out, giving an estimator similar to the single-factor case.
The estimator has been extensively tested on simulated data, and compared with maximum likelihood and Akaike Information Criterion (AIC) estimators. The MML estimator is found to be substantially more accurate, to provide consistent estimates of factor scores, and to recover the number of common factors more reliably than a likelihood-ratio test among maximum likelihood models.
Keywords: Minimum Message Length, MML, factor analysis, multivariate Gaussian.
Also see the MML [book] s.6.9.1, p.300. This page was (roughly) converted from troff to html by LA because the Proc. of the 14th Biennial Statistical Conf., Brisbane (1998) conference paper is v. hard to get. Note, θ hat is represented by θ, etc. There are doubtless mistakes in the conversion; I certainly have doubts as to the use of 'b' v. β, so please read it with that in mind. Also see [TR218].
1. Introduction
1.1 The model
We have developed and tested a Minimum Message Length (MML) estimator for a version of the well-known Factor Analysis model for a multivariate population, here assumed to have a Gaussian density. The work extends an earlier paper (Wallace and Freeman 1992), which addressed the restricted problem with a single factor. Here, we consider models with an unknown number of factors. In some respects the earlier work generalizes in a straightforward way, and we assume familiarity with it. For consistency, we retain and extend the perhaps unusual notation of the single-factor paper. Hereafter, that paper is referred to as WF.
The data is a sample of N data vectors {x_{n}, n=1...N}, each of K components {x_{nk}, k=1...K}, independently drawn from the population. A version of the Factor model is:

(1.1)  x_{nk} = μ_{k} + Σ_{j} v_{nj} a_{kj} + σ_{k} r_{nk}   (j = 1...J < K)
where parameter μ_{k} is the mean in dimension k, parameter σ_{k} is the "specific variation" in dimension k, parameters {a_{j}, j=1...J} are a set of J "Factor load" vectors each of dimension K, parameters {v_{j}, j=1...J} are a set of J "score" vectors each of dimension N, and the NK variates {r_{nk}, n=1...N, k=1...K} are regarded as having independent N(0,1) distributions (Harman 1967).
The conceptual model underlying factor analysis is that there exist some hidden, or latent, variables which have a linear effect on the K variables actually measured. The aim of a factor estimation method is to discover, to the extent revealed by the data, the number J of these variables, their values {v_{nj}} for each member of the sample, and their effects on the measured variables, represented in the model by the factor load vectors {a_{j}}. The model further assumes that each measured variable is subject, independently of all other variables, to some Normallydistributed variation or measurement error, r_{nk}, which is not affected by the hidden variables. The magnitude of this "specific" variance is modelled by the set of parameters {σ_{k}}.
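As a concrete illustration, the generative model (1.1) can be simulated directly (the specific dimensions and parameter values below are our own, chosen for the example, not taken from the paper); the model implies a population covariance of A Aᵀ + diag(σ²):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, J = 20000, 5, 2                        # sample size, dimensions, factors
mu = np.array([1.0, -2.0, 0.5, 0.0, 3.0])    # means mu_k
sigma = np.array([1.0, 0.5, 2.0, 1.0, 1.5])  # specific deviations sigma_k
A = rng.normal(size=(K, J))                  # factor load vectors a_kj
V = rng.normal(size=(N, J))                  # factor scores v_nj ~ N(0,1)
Rn = rng.normal(size=(N, K))                 # specific variates r_nk ~ N(0,1)

# Equation (1.1): x_nk = mu_k + sum_j v_nj a_kj + sigma_k r_nk
X = mu + V @ A.T + sigma * Rn

# Covariance implied by the model: loads plus specific variances
pop_cov = A @ A.T + np.diag(sigma ** 2)
```

The sample covariance of a large simulated sample approaches pop_cov, which is the quantity the likelihood-based treatments below actually work with.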
In the Maximum Likelihood (ML) treatment of the factor model, the model actually estimated is less complete. As ML is incapable of consistent simultaneous estimation of both factor loads and factor scores, the latter are treated not as parameters but as random variables with N(0,1) distributions, and the inferred model contains no estimates of their individual values.
The model (1.1) contains ambiguities. First, the factor load vector a_{j} and factor score vector v_{j} appear in the model only via their product. Thus the data cannot reveal their individual lengths. (By "length", we mean throughout the Euclidean vector length √(a^{T}a) = |a|.) This ambiguity is conventionally resolved by assuming the scores {v_{nj}} to be parameters with independent unit Normal priors. An alternative resolution is to constrain the scores so that Σ_{n} v_{nj}^{2} = N for all j. This alternative is mathematically more difficult and will not be pursued.
A more troublesome ambiguity is that the data cannot reveal the individual factor loads. For each load vector a_{j}, define the scaled load t_{j} given by t_{kj} = a_{kj} / σ_{k} (k=1...K). We will call the vectors t_{j} the "true latent vectors" (TLVs). Assuming Normal distributions of scores, the latent variable loads affect the population density only through the matrix Σ_{j} t_{j} t_{j}^{T}, so the data cannot distinguish among the infinitely many sets of TLV vectors giving the same matrix. It is conventional and convenient to design the estimator to estimate a set of orthogonal latent vector (OLV) loads {β_{j}} such that Σ_{j} β_{j} β_{j}^{T} = Σ_{j} t_{j} t_{j}^{T}, β_{j}^{T} . β_{l≠j} = 0. We similarly require the estimated score vectors to be orthogonal.
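One concrete way to see this indistinguishability (an illustrative construction of ours, not the paper's estimator): given any TLV set, an orthogonal set with the same matrix Σ_{j} t_{j} t_{j}^{T} can be read off an eigendecomposition, so the data distribution cannot separate the two:

```python
import numpy as np

# Two non-orthogonal "true" latent vectors (TLVs) in K = 4 dimensions.
t1 = np.array([1.0, 0.5, 0.0, 0.0])
t2 = np.array([0.3, 1.2, 0.4, 0.0])
M = np.outer(t1, t1) + np.outer(t2, t2)   # only this matrix enters the density

# Orthogonal latent vectors (OLVs): eigenvectors of M scaled by sqrt(eigenvalue).
evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
betas = [np.sqrt(lam) * e for lam, e in zip(evals[-2:], evecs[:, -2:].T)]

# The OLV pair reproduces M exactly, hence the same population density,
# even though the individual vectors differ from t1, t2.
recon = sum(np.outer(b, b) for b in betas)
```

This eigendecomposition is one implementation choice for constructing the {β_j}; any rotation of the TLV set giving the same M is equally compatible with the data.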
1.2 MML
Minimum Message Length (Wallace and Freeman 1987) or MML is a form of Bayesian model selection and estimation principle. Using the result from Information Theory that the most efficient binary encoding of an event or value of probability P has length −log_{2} P binary digits, MML considers a particular efficient encoding of the given data x as a message having two parts. The first part specifies a model probability distribution over data P(x|θ), and specifies estimates θ of the parameters θ. The second part, of length −log_{2} P(x|θ), encodes x using a code which is efficient if the estimates are correct, e.g., a Huffman or Arithmetic code (Witten et al., 1987). The length of the first part depends on the prior parameter density h(θ), but also on the precision, or roundoff quantum, δ, with which the parameter is specified, and is approximately −log_{2} (h(θ) δ). The total message length is minimized by an appropriate choice of δ, which may be a function of the sample size and of θ. In sufficiently regular cases, the optimum δ is close to 1/√F(θ), where F() is the Fisher information. Hereafter, we use natural logs for convenience.
MML chooses the model and parameter estimates which minimize the total message length. Among its virtues are its invariance under nonlinear transformations of the parameter space, and its avoidance of the overfitting to which Maximum Likelihood (ML) is prone. For convergence theorems see Barron and Cover (1991), Wallace (1989).
If the set of models contemplated is a family of models of increasing order and number of parameters, as is the case in the Factor problem, MML chooses the order and parameter values in one step. If regarded simply as an order-selection process, it shares with Schwarz (1978) the property of penalizing the log-likelihood by an amount of order (1/2) log(sample size) for every free parameter, rather than the constant unit penalty of AIC. MML differs from both in that it estimates parameters as well as selecting an order. Note that Bozdogan (1987) argues that AIC should similarly depend logarithmically on sample size.
Both Bozdogan and Schwarz thus argue that the choice of order should be based on maximizing a quantity like (log likelihood) − (D/2) log N, where N is the sample size and D the number of free parameters, but their arguments ignore terms of lower order.
For certain parameters of the factor model, (μ_{k}, a_{kj}, σ_{k}), the roundoff quanta used in MML vary as 1/√N, so the message length contains a contribution of order (1/2)log N for each such parameter. However, for certain other parameters (the factor scores v_{nj}) the effective sample size is only K, the number of components of each data vector. MML automatically accounts for this difference in the nature of the parameters, and also includes relevant terms of order one which are ignored by Bozdogan and Schwartz.
The aim of this paper is not to show how prior information can be included in a Factor analysis (although the work presented here could be extended to do so) but rather to show that even when an uninformative prior is assumed, MML gives results different from, and superior to, the results from maximizing the likelihood.
1.3 Related Work
Press and Shigemasu (1989) surveyed several Bayesian approaches to Factor analysis. A property shared by them all is that point estimates of factor scores and loads are obtained as either modes or means of the posterior density, or, in most methods, modes of a marginal posterior density. Such estimators are not invariant under nonlinear transformations of the parameter space. For instance, they would give different results were the factor load vectors to be specified in Polar rather than Cartesian coordinates.
Press and Shigemasu described a new formulation which allows an explicit expression for the marginal posterior of factor scores. Adopting its mean as an estimate, they find a posterior density for the loads, and are able to give credibility intervals for elements of the load vectors. The method appears to require an informative prior for the loads, and thus bypasses questions of the number and identification of latent factors.
Akaike (1987) describes his Information Criterion (AIC) as a means of choosing a model which has maximal predictive accuracy in the sense of minimizing the expected Kullback-Leibler distance between the population and model distributions. He proposed AIC, basically −2(log likelihood − number of estimated parameters), as a criterion for estimating the number of factors in a factor model, but recognised that the possible existence of improper ML solutions is a problem. He proposed use of a Normal prior for factor loads to suppress these solutions, but the proposal is not really Bayesian, as the "prior" is a function of sample size, and hence cannot be a representation of prior knowledge (or ignorance) about the model. In section 6.3 we show evidence that MML, although not specifically designed to do so, yields smaller Kullback-Leibler distances than does AIC.
2. Priors
In this paper, we attempt to adopt priors which are as nearly as possible uninformative.
In considering the choice of a prior for the latent variable effects (the "factor loadings" a_{j}) the estimate of σ_{k} for each measured variable may be taken, in default of any other measure, as determining a natural scale for that variable. The effects of hidden variable j on the measured variables, when expressed in these natural scales, are given by the vector t_{j} and the entire factor model is most "naturally" expressed in terms of measured and hidden variables represented in the "naturally scaled" space. We therefore suppose the prior density over the length and direction of a factor load vector to be colourless in the scaled space. In particular, we suppose t_{j} to have a spherically symmetric prior density, and each of the J such vectors to be an independent realization of this density.
This assumption does not carry over to the priors assumed for the estimated OLV vectors {β_{j}}. If the "true" scaled effects of the hidden variables are given by the TLV vectors {t_{j}}, we have no reason to suppose that these vectors will be mutually orthogonal. Indeed, we assume their directions to be independently uniformly distributed, this being the vaguest possible prior for their directions.
Pro tem, assume the TLVs to have independent spherically symmetric prior densities of K-dimensional Normal form with scale parameter ρ. If a set of J vectors {u_{j}} is drawn from this density, the joint density of their lengths u_{j} = |u_{j}| is
 (1/ρ)^{J} ∏_{j=1...J} H_{K}(u_{j}/ρ), where

(2.1)  H_{K}(z) = (K / {2^{K/2} (K/2)!}) z^{K−1} e^{−z^{2}/2}
If now a set of J mutuallyorthogonal vectors {c_{j}} is constructed such that
 Σ_{j} c_{j} c_{j}^{T} = Σ_{j} u_{j} u_{j}^{T}
the joint distribution of the lengths {c_{j}} is proportional to (Muirhead, 1982, p.107)

(2.2)  (1/ρ)^{J} ∏_{j=1...J} [ H_{K}(c_{j}/ρ) ∏_{l<j} |c_{j}^{2} − c_{l}^{2}| / (c_{j} c_{l}) ]
The density (2.2) has the same form as (2.1) save for a "correction" factor which suppresses the density if two or more vectors have similar length, and inflates the density when one or more vectors are small.
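H_{K} in (2.1) is the density of the length of a K-dimensional standard Normal vector (a chi density), so it should integrate to one. A quick numerical check (our verification, not part of the paper):

```python
import math

def H(K, z):
    # Equation (2.1): H_K(z) = K / (2^(K/2) (K/2)!) * z^(K-1) * exp(-z^2/2),
    # reading (K/2)! as Gamma(K/2 + 1).
    return (K / (2 ** (K / 2) * math.gamma(K / 2 + 1))
            * z ** (K - 1) * math.exp(-z * z / 2))

def total_mass(K, hi=20.0, n=200000):
    # Trapezoidal integration over (0, hi); the tail beyond 20 is negligible.
    dz = hi / n
    s = 0.5 * (H(K, 0.0) + H(K, hi))
    s += sum(H(K, i * dz) for i in range(1, n))
    return s * dz
```

For K = 1 the form reduces to the half-Normal density √(2/π) e^{−z²/2}, a useful spot check.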
Thus, starting from the rather vague prior over the TLVs, namely that they have independent spherically symmetric Normal densities, we are led to conclude that an appropriate prior for the OLVs actually estimated shows their lengths not to be independently distributed. It has been suggested that one might prefer to model the OLV prior directly, presumably assuming the OLV lengths to be independent. However, our conceptual model is that the observed data are linearly affected by several distinct hidden variables, and there seems no prior reason to suppose that the effects of these variables on the measured variables should bear any relation to one another such as orthogonality. If this argument is accepted, then we expect the lengths of the OLVs rarely to be close to equality. A pair of OLVs can have equal length only in the unlikely coincidence that a pair of hidden variables (TLVs) happens to have equal lengths and to be orthogonal. Form (2.2) expresses the prior belief that such coincidences are unlikely.
The lengths of the v and β vectors for an OLV are confounded in the model, as only their product affects the likelihood. Thus, the estimator really estimates the product v_{j} β_{j}, where v_{j} = |v_{j}| and β_{j} = |β_{j}|. We therefore consider that the model quantity most closely analogous to the magnitude of an OLV is v_{j} β_{j} (or v_{j} β_{j} / N), and hence adopt for the joint prior density of the lengths of {β_{j}} and {v_{j}} a form which is the assumed density of the "true" load and score vectors, times a "correction" factor of the form above (2.2) depending on the v_{j} β_{j} products of the OLVs:

(2.3)  h(β_{j}, v_{j}) = C (1/ρ)^{J} ∏_{j=1...J} [ H_{K}(β_{j}/ρ) H_{N}(v_{j}) ∏_{l<j} |v_{j}^{2} β_{j}^{2} − v_{l}^{2} β_{l}^{2}| / {v_{j} β_{j} v_{l} β_{l}} ] 2^{J} J!
where C is a normalization constant independent of ρ.
The factor 2^{J} J! is included because the model is unchanged by simultaneous negation of β_{j} and v_{j}, and the labelling of the OLVs is arbitrary.
An estimator was constructed using the prior (2.3). It performed well in trials on simulated data provided the lengths of the load vectors were not greatly dissimilar. However, when the lengths ranged over a ratio of five or more, the estimator tended to overestimate the shortest and underestimate the longest. This bias arises because, for large K, the multivariate Normal prior implies a TLV length prior (2.1) with a relatively narrow peak at ρ √K. Even if ρ is treated as a hyperparameter to be estimated from the data, (2.1) still expresses a prior expectation that all TLVs will have similar lengths. We therefore sought an alternative TLV prior with vaguer implications for TLV lengths.
Unfortunately, the Normal TLV prior is the only form for which we were able to find an explicit expression for the corresponding OLV prior. As a compromise, we adopted the following OLV prior for load and score lengths:
 h(β_{j}, v_{j}) = C_{NKJ} ∏_{j=1...J} [ T_{K}(β_{j}) H_{N}(v_{j}) ∏_{l<j} |v_{j}^{2} β_{j}^{2} − v_{l}^{2} β_{l}^{2}| / {v_{j} β_{j} v_{l} β_{l} √[(1+β_{j}^{2}) (1+β_{l}^{2})]} ] 2^{J} J!

(2.4)  where T_{K}(β) = {2 Γ((K+1)/2)} / {√π Γ(K/2)} · β^{K−1} / {(1+β^{2})^{(K+1)/2}}
and C_{NKJ} is a normalization constant discussed later.
This form assumes a very diffuse load length prior, and retains a "correction" factor similar to that of the Normal form (2.3). Simple geometric arguments show that the joint OLV length density derived from any smooth, independent, TLV length density will exhibit a "correction factor" of roughly this form. The prior (2.4) is a generalization of the "Cauchy" prior used in WF, and is far more diffuse than the Normal form.
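T_{K} is a proper density on (0, ∞): substituting β = tan θ reduces the integrand to a constant times sin^{K−1}θ on (0, π/2), which makes a numerical check straightforward (our verification, not in the paper). For K = 1, T_{1}(β) = 2/(π(1+β²)) is exactly the half-Cauchy density, matching the "Cauchy" prior of WF:

```python
import math

def t_total_mass(K, n=200000):
    # Integrate T_K(beta) over (0, inf) via the substitution beta = tan(t):
    # T_K(tan t) * sec^2(t)  ==  c_K * sin(t)^(K-1)  on (0, pi/2).
    c = 2 * math.gamma((K + 1) / 2) / (math.sqrt(math.pi) * math.gamma(K / 2))
    dt = (math.pi / 2) / n
    s = 0.5 * (math.sin(0.0) ** (K - 1) + math.sin(math.pi / 2) ** (K - 1))
    s += sum(math.sin(i * dt) ** (K - 1) for i in range(1, n))
    return c * s * dt
```

The heavy β^{−2} tail of T_{K} is what makes this prior so much more diffuse than the Normal form (2.1).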
The joint prior density of the directions of the OLVs is assumed to have the uniform value

(2.5)  ∏_{j=1...J} 1/S_{K−j+1}
 where S_{D} = D π^{D/2} / (D/2)!
is the surface area of the unit Dsphere, and we similarly assume the joint prior density of the directions of the score vectors to be uniformly

(2.6)  ∏_{j=1...J} 1/S_{N−j+1}
As in WF, we assume the specific variances {σ_{k}} to have independent prior densities proportional to 1/σ_{k} in some finite range, and the means {μ_{k}} to have independent uniform priors in some finite range.
The assumed priors import little useful information into the analysis. Those for means and specific variances have the conventional "uninformative" forms, the priors of load and score vectors presume no preference for any direction, and the prior for load strengths expresses only the vaguest expectation that they will be of the same order as the specific variations.
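The surface-area formula in (2.5) can be checked against familiar cases (a sanity check of ours): S_2 = 2π is the circumference of the unit circle, and S_3 = 4π the area of the unit sphere.

```python
import math

def S(D):
    # S_D = D * pi^(D/2) / (D/2)!  with (D/2)! read as Gamma(D/2 + 1):
    # the surface area of the unit sphere in D dimensions.
    return D * math.pi ** (D / 2) / math.gamma(D / 2 + 1)
```

S(1) = 2 also checks out: the "surface" of the unit interval is its two endpoints.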
3. The Information Matrix
MML chooses the model and parameter estimate θ which minimize
 T(θ) = −log[ (h(θ) / √I(θ)) Prob(data|θ) ]
where h(θ) is the prior density of the parameter vector θ, and 1/√I(θ) is an approximation to the volume in θ-space spanned by the expected estimation error. I(θ) is the determinant of the matrix of expected partial second derivatives of T with respect to the components of θ. For many estimation problems, I(θ) is well approximated by the Fisher Information, but for the factor model, the derivatives of h() cannot be neglected. As in WF (section 3), we take into account the derivatives of log h() with respect to the factor scores. We depart from WF by choosing to express the model in terms of the scaled OLV components {β_{kj}} rather than the unscaled components {σ_{k} β_{kj}}. Equation 4.2 of WF gives the sensitivity determinant for a single OLV β and score vector v = (v_{1}, ..., v_{N}) as
 I_{1}(θ) = (2N)^{K} (N v^{2} − (Σ_{n} v_{n})^{2})^{K} (1 + β^{2})^{N−2} / ∏_{k} σ_{k}^{6}
 where β = |β|, v = |v|
Using Σ_{n} v_{n} = 0 from WF p.198, and changing to scaled components gives

(3.1)  I_{1}(θ) = 2^{K} N^{2K} v^{2K} (1+β^{2})^{N−2} / ∏_{k} σ_{k}^{4}
The calculation in WF generalizes directly to J>1 factors giving

(3.2)  I_{2}(θ) = 2^{K} N^{2K} ∏_{j=1...J} [ v_{j}^{2K} (1+β_{j}^{2})^{N−2} ] / ∏_{k} σ_{k}^{4}
Note that (3.2) is the determinant of a matrix which includes derivatives with respect to all K components of each OLV β_{j} and all N components of each score vector v_{j}. Thus, it includes the sensitivity of T to differential changes of {β_{j}} and {v_{j}} which do not preserve the mutual orthogonality of OLVs and score vectors. However, in computing (3.2), use has been made of the fact that the derivatives are evaluated for mutuallyorthogonal vectors. The question of enforcing the derivatives to preserve orthogonality is addressed below. First, we switch to the polar parameterization of the model, using length and direction rather than Cartesian components for the OLVs and score vectors. In this change of parameters, the sensitivity determinant transforms as the square of a density, giving

(3.3)  I_{3}(θ) = 2^{K} N^{2K} ∏_{j} [ v_{j}^{2(K+N−1)} (1+β_{j}^{2})^{N−2} β_{j}^{2(K−1)} ] / ∏_{k} σ_{k}^{4}
Consider some pair of OLVs β_{j} and β_{l}, and their associated score vectors v_{j} and v_{l}. The expression (3.3) contains contributions from the sensitivity of T to changes in direction of β_{j}, β_{l}, v_{j} and v_{l}. Let us consider changes in the directions of β_{j} and β_{l} corresponding to differential rotations of these vectors in the plane containing them both. Let θ_{j} be an angle giving the direction of β_{j} in the plane, and θ_{l} be an angle giving the direction of β_{l} in the plane. Similarly, let φ_{j}, φ_{l} be angles giving the directions of v_{j}, v_{l} in the plane containing them both. Then I_{3}(θ) contains a contribution due to the expected second differentials of T with respect to θ_{j}, θ_{l}, φ_{j} and φ_{l}. It can be shown that this contribution amounts to a multiplicative factor of I_{3} given by

(3.4)  v_{j}^{2} (1 + β_{j}^{2}) · v_{l}^{2} (1 + β_{l}^{2}) · v_{j}^{2} β_{j}^{2} · v_{l}^{2} β_{l}^{2}
and that there are no crossderivatives with other parameters of the model with nonzero expectation.
As noted above, (3.4) arises from the unconstrained variation of the four parameters θ_{j}, θ_{l}, φ_{j} and φ_{l}. In fact, the four parameters are constrained by the requirement of orthogonality, so that there are really only two parameters, say θ and φ, with

(3.5)θ_{j} = θ, φ_{j} = φ, θ_{l} = θ + π / 2, φ_{l} = φ + π / 2.
When T is expressed in terms of these two parameters, and its second differentials with respect to them calculated, it is found that together they contribute to the sensitivity determinant a multiplicative factor of the form

(3.6)  (v_{j}^{2} β_{j}^{2} − v_{l}^{2} β_{l}^{2})^{2}
rather than the form (3.4).
The above expression shows that when factors j and l have equal size, a rotation can be applied to the OLVs and score vectors which has no effect on the likelihood or the priors, and hence no effect on T. The reader might be surprised that this degeneracy becomes apparent only when the orthogonality constraint is taken into account, since such rotations are possible, and have no effect on T, whether or not the OLVs are explicitly constrained to be orthogonal. The reason the degeneracy is not apparent in the expression I_{3}(θ) is that I_{3}(θ) indicates the effects of perturbations of the parameters on T only to second order. When the OLVs and score vectors are not explicitly constrained, simultaneous variation of two OLVs and two score vectors is required to produce a perturbation having no effect on T, and so the degeneracy would be revealed only by a fourth-order analysis treating differentials up to the fourth. Once the orthogonality constraint is applied, however, the simultaneous variation involves only two, rather than four, dimensions of parameter space, and so is revealed by a second-order analysis using second partial differentials.
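The degeneracy can be seen in a small numeric sketch (ours; the score vectors rotate correspondingly and are omitted for brevity). The matrix B Bᵀ, which is all the likelihood sees, is preserved under any in-plane rotation of the load pair, but the rotated pair remains mutually orthogonal only when the two factor sizes are equal:

```python
import numpy as np

def rotation_preserves_orthogonality(len_j, len_l, angle=0.3):
    # Columns of B are two orthogonal OLVs in K = 4 dimensions.
    B = np.zeros((4, 2))
    B[0, 0] = len_j
    B[1, 1] = len_l
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    B2 = B @ R                                # rotate the pair in its plane
    assert np.allclose(B2 @ B2.T, B @ B.T)    # B B^T is always unchanged
    return abs(B2[:, 0] @ B2[:, 1]) < 1e-12   # does orthogonality survive?
```

With equal lengths the rotation moves the model along the zero of (3.6); with unequal lengths it leaves the orthogonally-constrained parameter space, which is why only the constrained analysis sees the degeneracy at second order.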
Modifying I_{3}(θ) by replacing a factor of form (3.4) by a factor of form (3.6) for every pair of OLVs gives the final sensitivity determinant as

(3.7)  I(θ) = 2^{K} N^{2K} ∏_{j} [ (v_{j}^{2} + (1+β_{j}^{2}) / ρ^{2}) v_{j}^{2(K+N−2J)} (1 + β_{j}^{2})^{N−J−1} β_{j}^{2(K−J)} ] · [ ∏_{j} ∏_{l<j} (v_{j}^{2} β_{j}^{2} − v_{l}^{2} β_{l}^{2})^{2} ] / ∏_{k} σ_{k}^{4}
4. The MML Estimator
In this section, σ_{k} , β_{j} , v_{j}, etc. refer to estimates rather than "true" parameter values.
The MML estimator is chosen to minimize MessLen = (1/2) log I − log(prior density) − log(likelihood).
Omitting constant terms, we have

(4.1)  MessLen = (N−1) Σ_{k} log σ_{k} + (1/2) (N+K−1) Σ_{j} log(1 + β_{j}^{2}) + (1/2) (K−J+1) Σ_{j} log v_{j}^{2} + (1/2) Σ_{j} v_{j}^{2} + (1/2) Σ_{n} Σ_{k} (y_{nk} − Σ_{j} v_{nj} β_{kj})^{2}
where {y_{nk}} are the scaled data {(x_{nk} − μ_{k}) / σ_{k}} and we assume the estimate μ_{k} = Σ_{n} x_{nk} / N. (The proof of this equality follows the proof in WF.) Equation (4.1) is a simple generalization of the equation for MessLen given in WF p.198.
Note that the factors of the form |v_{j}^{2} β_{j}^{2} − v_{l}^{2} β_{l}^{2}|, which appear both in the prior and in I, cancel out and do not appear in MessLen.
This cancellation is not an accident, as these factors arise from a singularity in the mapping of TLVs onto OLVs which affects both the prior and I in similar ways. Consider the example of a model with only two latent vectors (J=2) in K dimensions. The space of possible TLVs has 2K dimensions, but the space of possible OLVs has only 2K−1, as the OLVs lose one degree of freedom by being constrained to be orthogonal. For each OLV pair, there is a one-dimensional manifold of TLV pairs which map onto the OLV pair. Normally, the mapping is unique: such a manifold of TLV pairs maps to a single OLV pair. However, consider a set of TLV pairs in which all pairs lie in the same fixed plane of K-space, both TLVs of all pairs have the same fixed length, and the two TLVs of each pair are orthogonal. This set is a one-dimensional manifold, as there is only one degree of freedom: the angular orientation of the TLV pair in the fixed plane.
The matrix
 ∑_{j} t_{j} t_{j}^{T}
has the same value for all pairs in the set. However, rather than mapping into a single OLV pair, such a special set of TLV pairs maps into a one-dimensional manifold of OLV pairs identical to the set of TLV pairs. Normally, the prior density in a one-dimensional manifold of TLV pairs condenses onto a single OLV pair, but for a special TLV set, as above, the prior density is spread over a manifold of OLV pairs, so these OLV pairs acquire a zero prior density.
Similarly, it is normally the case that if two OLV pairs differ, then they are images of two different TLV manifolds, giving rise to two different data distributions. However, if the two OLV pairs are members of the image of the same special TLV set, they give rise to the same data distribution, and so the perturbation which changes one pair into the other (the θ of (3.6)) has no effect on T, and so I is zero. An OLV pair is in the image of a special TLV set just when the two OLVs have equal length, so just in this case the prior density of the pair drops to zero, and I becomes zero.
Also see the [book] p.302 
We now proceed to describe the MML estimator.
 Define y_{n} as the Kvector {y_{nk}, k=1...K},
 w_{nk} as x_{nk}  μ_{k} = σ_{k} y_{nk},
 Y as the K × K matrix Σ_{n} y_{n} y_{n}^{T}
 and introduce variables {R_{j}, Q_{j}, j=1...J} defined in 4.3, 4.5 below.
Then the estimates minimizing MessLen satisfy (Wallace and Freeman 1992):

(4.2)  σ_{k}^{2} = Σ_{n} w_{nk}^{2} / (N−1 + Σ_{j} R_{j} β_{kj}^{2})

(4.3)  Q_{j} = 1 + β_{j}^{2} + (K−J+1) / v_{j}^{2}

(4.4)v_{j}^{2} = β_{j}^{T} Y β_{j} / Q_{j}^{2}

(4.5)  R_{j} = v_{j}^{2} + (N+K−1) / (1 + β_{j}^{2}), or R_{j} = N+J−2

(4.6)  β_{j} = Y β_{j} / (Q_{j} R_{j})
Equations (4.2) to (4.6) may be solved by functional iteration. After each iteration, the revised vectors {β_{j}} are reorthogonalized by evaluating

(4.7)  β_{j}(new) = β_{j} − Σ_{l<j} β_{l} (β_{l} · β_{j}) / |β_{l}|^{2}   (j = 2...J)
To start the iteration one may set R_{j} = v_{j}^{2} = N (all j), and set the β_{j} vectors parallel to the J dominant eigenvectors of the data correlation matrix, with squared lengths equal to the eigenvalues.
As in WF, the iteration may drive an OLV to zero if the data are consistent with a model having fewer than J factors. In that case, the iteration is restarted with reduced J.
The iteration does not directly require or yield estimates of the score vectors. These may be calculated after convergence as

(4.8)v_{nj} = Σ_{k} y_{nk} β_{kj} / Q_{j}
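The whole scheme can be sketched in code. This is an illustrative implementation of equations (4.2)-(4.8), with initialization and convergence details of our own choosing; it is not the author's program, and it does not handle the restart-with-reduced-J case:

```python
import numpy as np

def mml_factor(x, J, iters=200, tol=1e-9):
    """Functional iteration for the MML factor estimator, eqs (4.2)-(4.8).
    Rows of `beta` are the scaled load vectors beta_j."""
    N, K = x.shape
    mu = x.mean(axis=0)                      # mu_k = sum_n x_nk / N
    w = x - mu                               # w_nk = x_nk - mu_k
    # Start from the J dominant eigenvectors of the correlation matrix,
    # with squared lengths equal to the eigenvalues, and R_j = v_j^2 = N.
    corr = np.corrcoef(x, rowvar=False)
    evals, evecs = np.linalg.eigh(corr)
    top = np.argsort(evals)[::-1][:J]
    beta = (evecs[:, top] * np.sqrt(evals[top])).T   # shape (J, K)
    R = np.full(J, float(N))
    v2 = np.full(J, float(N))
    for _ in range(iters):
        beta_old = beta.copy()
        # (4.2): specific variances
        sigma = np.sqrt((w ** 2).sum(axis=0) /
                        (N - 1 + (R[:, None] * beta ** 2).sum(axis=0)))
        y = w / sigma
        Y = y.T @ y                          # Y = sum_n y_n y_n^T
        b2 = (beta ** 2).sum(axis=1)         # |beta_j|^2
        # (4.3): Q_j, using the previous v_j^2
        Q = 1.0 + b2 + (K - J + 1) / v2
        # (4.4): v_j^2 = beta_j^T Y beta_j / Q_j^2
        quad = np.einsum('jk,kl,jl->j', beta, Y, beta)
        v2 = quad / Q ** 2
        # (4.5): R_j
        R = v2 + (N + K - 1) / (1.0 + b2)
        # (4.6): beta_j = Y beta_j / (Q_j R_j)
        beta = (Y @ beta.T).T / (Q * R)[:, None]
        # (4.7): re-orthogonalize the revised loads by Gram-Schmidt
        for j in range(1, J):
            for l in range(j):
                beta[j] -= beta[l] * (beta[l] @ beta[j]) / (beta[l] @ beta[l])
        if np.abs(beta - beta_old).max() < tol:
            break
    # (4.8): factor scores, computed after convergence
    scores = (y @ beta.T) / Q
    return mu, sigma, beta, scores
```

On data from a single strong factor the iteration settles near the true scaled load, and the recovered scores track the true scores closely.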
5. Message Length
The full calculation of the message length is not needed in estimating the parameters of a model with a specified number of OLVs. It is needed for the comparison of models with differing numbers of OLVs, and can be used for comparison of a factor model with a model of different functional form.
The full expression for the message length of data vector x using a model with estimated vector parameter θ, prior density h(θ), sensitivity matrix I(θ) and probability law P(x|θ) is (Wallace and Freeman 1987):

(5.1)  MessLen = −log[ h(θ) / √I(θ) ] − log P(x|θ) − (1/2) D log(2π) + (1/2) log(Dπ)
where D is the number of scalar parameters, i.e. the dimension of θ.
Using (2.4, 2.5, 2.6, 3.7 and 4.1), we obtain

(5.2)  MessLen = (K/2 − 2J) log 2 + K log N + (1/2) JN log(2π) − log(J!)
 − J log S_{N} + Σ_{l=0...J−1} log S_{N−l} − J log(S_{K} / S_{K+1}) + Σ_{l=0...J−1} log S_{K−l}
 + (1/2) (K−J+1) Σ_{j} log v_{j}^{2} + (1/2) (N+K−1) Σ_{j} log(1 + β_{j}^{2})
 + (N−1) Σ_{k} log σ_{k} + (1/2) Σ_{j} v_{j}^{2} + (1/2) Σ_{n} Σ_{k} y_{nk}^{2}
 + (1/2) Σ_{j} v_{j}^{2} β_{j}^{2} − Σ_{j} β_{j}^{T} Y β_{j} / Q_{j} − (1/2) D log(2π) + (1/2) log(Dπ) − log C_{NKJ}
where D = 2K + J(N+K−J+1), S_{m} is defined in (2.5), y_{nk} in (4.1), and Y is defined before (4.2). The finite ranges used in normalizing the priors for {μ_{k}} and {σ_{k}} are omitted from (5.2) as they affect all models equally.
The normalization constant C_{NKJ} (from (2.4)) is not known in closed form. Numerical integration of (2.4) for various N, K and J has shown that log C_{NKJ} depends little on N for N>>1, and for N≥40 is well approximated by the empirical expression

(5.3)  −log C_{NKJ} ≈ A_{J} + (1/2) J(J−1) log(K − B_{J})
where the constants A_{J}, B_{J} are:
  J     1     2     3     4     5     6
  A_J   0     0.24  1.35  3.68  7.45  12.64
  B_J   —     0.58  0.64  0.73  0.86  1.21
6. Testing
6.1 Artificial Data
Test runs were made on artificial data sets randomly selected from populations with known TLV (and hence OLV) structure. In each run, 1000 populations were tried, all with the same number K of dimensions, the same number J of TLVs, and all having the same set of TLV lengths. With no loss of generality, all specific variances {σ_{k}^{2}} were set to one. The populations differed in that, in each population, the TLV directions were randomly and independently selected from a uniform distribution over the K-sphere. For each population, a sample of constant size N was selected and the covariance matrix about the sample mean computed. This matrix was then analysed by both ML and MML methods, seeking estimated models with J−1, J and J+1 OLVs. The MML method would sometimes fail to find a model with the requested number of factors, instead finding a reduced number.
For most runs, we chose K=16, N=300, and J in the range 2 to 5. The sets of factor lengths were chosen arbitrarily, with most runs including at least one factor small enough to give some difficulty in estimating it. Note that the TLV lengths, which were held constant in each of the 1000 trials in a run, were not selected from the assumed prior distribution of TLV lengths. Hence the population OLV lengths, which varied from population to population, were not drawn from the assumed OLV length prior.
In some runs, the population OLVs of each population were generated directly, with a fixed set of equal lengths and random mutuallyorthogonal directions. These "orthogonal" runs were designed to test the MML estimator on populations for which the assumed prior OLV density (2.4) was zero.
The score vectors {v_{j}} were not explicitly estimated. ML cannot consistently estimate them together with the load vectors {β_{j}} and specific variances {σ_{k}^{2}}. MML can do so, but the directions of the score vectors were eliminated from the iteration equations (4.2–4.7), and only the covariance matrix was made available to the two estimators.
The OLVs and specific variances estimated by the two methods for each population were compared with the population values using criteria described below, and the average error criteria for the run were compared.
6.2 Error Criteria
We chose criteria which are meaningful even if the estimated number of OLVs Ĵ differs from the population number J. Recall that all test populations have σ_{k} = 1 for all k.
S_{1} = Σ_{k} log σ_{k} is a measure of general over- or under-estimation of specific variance (and so under- or over-estimation of OLV effects).
S_{2} = Σ_{k} (log σ_{k})^{2} is a simple measure of error in estimation of specific variances. Both S_{1} and S_{2} are zero for exact estimates.
In comparing ML and MML estimators, S_{1} and S_{2} are relevant because both estimators agree in estimating the OLVs as eigenvectors of the σ-scaled covariance Y. Hence good estimates of {σ_{k}} are essential for their estimation of OLVs.
KL = ∫ dx P(x|θ) [log P(x|θ) − log P(x|θ̂)] is the non-symmetric Kullback-Leibler distance between the true population data density and that implied by the estimate θ̂. It is zero for an exact estimate, otherwise positive. It does not require P(x|θ) and P(x|θ̂) to have the same number of parameters, and is invariant under nonlinear transformations of the parameter space.
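For Gaussian population and model this distance has a closed form (the standard multivariate-Normal KL expression; for the factor model the unscaled covariance is diag(σ)(I + Σ_{j} β_{j} β_{j}ᵀ)diag(σ)). A sketch of the computation used as the error criterion:

```python
import numpy as np

def gauss_kl(mu0, S0, mu1, S1):
    # KL( N(mu0, S0) || N(mu1, S1) ) for K-variate Gaussians.
    K = len(mu0)
    S1inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - K
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def factor_cov(sigma, betas):
    # Covariance implied by estimates (sigma_k, scaled load vectors beta_j).
    K = len(sigma)
    M = np.eye(K) + sum(np.outer(b, b) for b in betas)
    return np.diag(sigma) @ M @ np.diag(sigma)
```

Because the criterion compares full densities, it applies even when the two models carry different numbers of factors.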
For each run, the ML and MML estimators were compared on two bases. First ("fixed J"), we compared the estimates produced when each method was asked to find a model with the true number J of OLVs. The MML method sometimes returned a model with fewer OLVs. Second ("chosen J"), we compared the results when the number of OLVs was estimated. The three ML models for Ĵ = J+1, J and J−1 were compared on the Akaike Information Criterion (AIC) (Bozdogan 1987):
 AIC = −2 (log likelihood − number of free parameters)
     = −2 log likelihood + 2 [ J ( 2K − J + 1 )/2 + 2K ]
 and the model with the least AIC was chosen. Note that the number of free parameters here is not the D of (5.2), since ML does not estimate the factor scores. Of the three (or sometimes fewer) MML models, that with the smallest message length (5.2) was chosen.
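The selection rule above can be sketched in a few lines (a minimal illustration, not the original code; the function names are ours, and the maximised log likelihoods for the candidate models are assumed given):

```python
def aic_free_params(j, k):
    # Free parameters of a K-variate, J-factor ML model:
    # K means + K specific variances + KJ loadings, less J(J-1)/2
    # for rotational indeterminacy, i.e. J(2K - J + 1)/2 + 2K.
    return j * (2 * k - j + 1) // 2 + 2 * k

def choose_j_by_aic(log_liks, k):
    # log_liks: dict {candidate J: maximised log likelihood},
    # e.g. for J - 1, J and J + 1 as in the text.
    def aic(j):
        return -2.0 * (log_liks[j] - aic_free_params(j, k))
    return min(log_liks, key=aic)
```

With equal likelihoods the penalty term alone decides, so the smallest candidate J is chosen; a sufficiently large likelihood gain is needed before an extra factor is accepted.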
6.3 Results on Artificial Data
All runs showed MML to give the lower average values for all three error criteria on both fixed J and chosen J bases. ML typically underestimates the specific variances (S_{1} < 0), with the bias increasing with increasing J. MML tends to overestimate the specific variances. Again, the bias increases with J, but is typically less than half as large as the ML bias.
The criterion S_{2}, measuring squared error in {σ_{k}}, was typically less than half as large for MML as for ML.
The KL criterion was typically about 0.01 to 0.03 less for MML. This may seem a small difference. However, a difference of 0.02 shows that if the log likelihoods of the ML and MML models were compared on another sample with N=300 from the same population, we would expect the MML model to have the greater likelihood, by a factor of about exp(300 × 0.02) ≈ 400. Thus, a likelihood ratio test on a second sample would be expected to lead to a rather decisive rejection of the ML model in favour of the MML.
Overall, the chosen J and fixed J comparisons gave similar results, with the MML advantage increasing with increasing J.
Using the "chosen J" basis, we compared the estimates Ĵ of the number of factors. Generally, ML was far more likely than MML to find Ĵ > J, and, particularly when one or more population TLVs was small, MML more frequently found Ĵ < J. For instance, in a run with K=16, N=300, and four TLVs each of length 1.0, the distributions of Ĵ in the 1000 trials of the run were:
Ĵ        2     3     4     5
ML       0   323   565   112
MML     88   649   261     2
This result might suggest that ML is superior to MML in estimating the number of TLVs. However, closer examination shows the apparent ML advantage to be an illusion. When the 347 trials were examined in which ML estimated Ĵ = 4 and MML estimated Ĵ < 4, it was found that the MML models, despite having only 3 or fewer OLVs, were still the more accurate by all three criteria. Moreover, in these cases, if we form ML models with 3 factors, they are on average about as accurate as the ML models with four factors. Thus, it appears that the fourth factors found by ML conveyed little about the population, and were as much misleading as informative.
For runs with strong population TLVs of markedly differing lengths, it is possible to identify estimated OLVs with population OLVs with little confusion. In several such runs, additional error criteria were assessed. One summed the squared differences between the (unscaled) population and estimated OLVs, another summed the squared sine of the angles between population and estimated OLV directions. When Ĵ ≠ J, an appropriate allowance in the error criteria was made. According to these two additional error criteria, MML was more accurate than ML. In a typical run, with J=4 and TLV lengths of 1, 2, 3 and 6, MML gave average values of 0.81 and 0.34 for the two criteria, while ML gave 1.11 and 0.50.
Runs in which the population TLVs had equal length and were mutually orthogonal (and hence gave equallength OLVs) showed no unusual behaviour, despite the fact that such OLV sets have zero prior density in our scheme. The superiority of MML remained evident.
Details of the above tests are available in a Technical Report (Wallace 1995).
6.4 An Example
Press and Shigemasu (1989) analyse some aptitude data from Kendall (1980) with N=48, K=15, using their explicit Bayesian method. They assume four factors, with an informative prior on the load vectors based on experience with similar data. After determining by unspecified methods some hyperparameters appearing in their model priors, they find estimates of the factor scores and loads. (We denote their results "P&S".) The same data (p.282) was analysed by ML and our MML methods, attempting to find models with 3, 4 and 5 factors.
The three ML models gave AIC values of −512.0, −543.6 and −540.8 respectively, weakly preferring four factors over five. MML found no 5-factor solution, and gave message lengths for 3 and 4 factors that were respectively 195.3 and 197.1 less than the no-factor message length. Thus MML also prefers four factors, but the posterior odds in favour of four factors over three are only e^{1.8}, or about 6:1.
The agreement among all three fourfactor models was assessed by evaluating the RMS sine of the angles (in unscaled space) between corresponding load vectors of two models, where the correspondence was chosen to minimize the RMS sine. The angles given by these RMS sines for each pair of models, and the average cosines between matched loads, were:

P & S vs. MML: 36°, av. cosine 0.80
P & S vs. ML:  42°, av. cosine 0.71
MML vs. ML:    38°, av. cosine 0.75
There is reasonable agreement between the MML and Press-Shigemasu results, with little doubt as to the correspondence between load vectors. The ML model is hard to reconcile with either of the other two, and has one load vector deviating by at least 65° from all loads of the other two models. This result, details of which are given in Wallace (1995), suggests that the two Bayesian methods, despite their different priors and approaches, recover similar views of the data. The weak MML preference for four, rather than three, factors, together with the behaviour of ML on artificial data, gives grounds for doubting that the aberrant ML factor has any validity.
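The RMS-sine matching used for the comparison above can be sketched as follows. This is our reconstruction, not the original code: for small J the correspondence minimising the RMS sine can be found by simply trying all permutations (assumes NumPy; the function name is ours).

```python
import numpy as np
from itertools import permutations

def rms_sine(loads_a, loads_b):
    # Match the load vectors of two models (rows of each (J, K) array)
    # one-to-one so as to minimise the RMS sine of the angle between
    # matched pairs, and return that minimum RMS sine.
    def sin2(u, v):
        c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return 1.0 - c * c
    j = loads_a.shape[0]
    best = min(
        np.mean([sin2(loads_a[i], loads_b[p[i]]) for i in range(j)])
        for p in permutations(range(j))
    )
    return np.sqrt(max(best, 0.0))   # guard against tiny negatives
```

Because only angles enter, the measure is insensitive to the lengths and the listed order of the load vectors, which is what makes it suitable for comparing models fitted by different methods.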
7. Discussion
7.1 The Impact of the Priors
The MML method has been shown by extensive testing to be more accurate than Maximum Likelihood estimation when the data is drawn from a population consistent with the Normal Factor model. It may be asked whether the improved accuracy is partly or wholly the result of the prior information assumed in MML and other Bayesian methods. To some extent, the assumption of a proper prior for TLV lengths must help suppress the improper solutions sometimes observed using ML, as these improper solutions are characterized by large β_{j} and small σ_{k}. However, the effect of our prior is small. It is very diffuse, having infinite second moments, and at least for N>>K, its effect on the message length is swamped by the effects of the sensitivity determinant. We have found experimentally that even when an improper flat prior for TLV lengths is assumed, MML still outperforms ML, and still gives on average larger estimates of specific variance despite the 1/σ prior.
The assumed prior density for TLV lengths could have been of little assistance to the MML method on the test runs reported in section 6.3, which mostly used data drawn from populations with TLV lengths of 1.0 or 2.0. Under the assumed prior, with K=16, the prior probability of a TLV length being 2.0 or less is only about 0.06, and for lengths 1.0 or less is only 0.001. Hence the prior was misleading rather than helpful under the test conditions. On the same data, a modified MML method assuming the K-dimensional Normal prior (2.1) was also tested. This prior has a hyperparameter ρ determining the length scale of the prior, and the method estimated ρ from the data. The modified method gave almost the same estimates as the original MML method, but because factor lengths around 1.0 are less severely penalized by this prior (with ρ estimated at about 0.25), the minimum message length was more frequently achieved with Ĵ = J. The result was a further small improvement in all error measures. However, as noted in section 2, use of the Normal prior is unwise unless there are prior grounds for expecting all factors to have roughly similar lengths.
The reason MML suppresses improper models is not that they occur in regions of parameter space having low prior density. Rather, it is that their high likelihood is achieved only in a very small region of parameter space. A message using such an improper solution to encode the data would thus have to specify the parameters to high accuracy (many digits of precision) making the message long.
No other aspect of our assumed priors can be reasonably considered as importing useful information. The prior over load vector directions is uniform, representing total ignorance. The Normal prior assumed for factor scores serves only to disambiguate the confusion between load and score lengths, and is in any case tacitly assumed in the conventional ML Normal model.
7.2 Choice of Number of Factors
MML appears to provide a fairly sound basis for estimating the number of factors. It is conservative, erring on the side of rejecting factors for which the evidence is weak. If an MML solution with J factors yields a correlation matrix among the residuals {y_{nk} − Σ_{j} v_{nj} β_{kj}} whose largest eigenvalue is no larger than expected from a population with J factors, then the data give no evidence for the existence of a (J + 1)th factor. In such cases, the MML functional iteration (4.2–4.7) will drive one load vector to zero if an attempt is made to find (J + 1) factors. This property is proved in WF for the single-factor problem, and has been observed experimentally for the present method. In rare cases, the iteration may find an additional factor, but the model with the additional factor gives a larger message length and is therefore rejected.
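The residual check described above can be sketched as follows (our illustration, assuming NumPy; the threshold for "no larger than expected from a population with J factors" is sampling-dependent and not shown here):

```python
import numpy as np

def residual_top_eigenvalue(y, scores, loads):
    # Largest eigenvalue of the correlation matrix of the residuals
    # y_nk - sum_j v_nj * beta_kj.
    #   y: (N, K) data, scores: (N, J), loads: (J, K).
    # A value no larger than expected under a J-factor population
    # suggests no evidence for a (J+1)-th factor.
    resid = y - scores @ loads
    corr = np.corrcoef(resid, rowvar=False)
    return np.linalg.eigvalsh(corr)[-1]   # eigvalsh sorts ascending
```

Since a K-variate correlation matrix has trace K, its largest eigenvalue always lies between 1 and K; a (J+1)th factor would show up as that eigenvalue sitting well above its value for uncorrelated residuals.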
When compared with AIC selection among ML solutions, MML may more frequently select a small number of factors, but none the less still gives a better model of the population density by KL measure.
With the prior here assumed, the MML choice of Ĵ may underestimate J when an OLV is small, simply because the prior probability assigned to small (scaled) OLV lengths is small. Experiments reported in (Wallace 1995) showed that, rather than choosing the MML model of smallest message length, one may with advantage choose the MML model with the highest J for which a solution exists. For populations with scaled OLV lengths around 1.0, this scheme may find a model with an additional factor (and slightly higher message length) and on average was slightly better both at estimating J and on our error measures. Where this scheme found a different model, the message length increase over the minimum message length model was on average only about 2.0, which is of little significance. For data with larger factors, both schemes almost always agree. That is, the model with the smallest message length is almost always the model with the largest number of factors found by the MML iteration.
7.3 Assessment of Error
Unlike some other Bayesian methods, MML does not directly compute or use the posterior density. However, the elements of the sensitivity matrix may be used as indications of probable estimation error. For instance, it is easily shown that the probable error in estimating any one factor score v_{nj} is of order 1/√{1+β_{j}^{2}}.
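The order of magnitude quoted above can be recovered from the scaled model (a sketch under the WF conventions, not from the original text; it holds the loads and the other scores fixed):

```latex
% With prior v_{nj} ~ N(0,1) and scaled record
% x_n = v_{nj} b_j + (terms in the other factors) + unit-variance noise,
% the log posterior of v_{nj} is, up to a constant, quadratic:
%   -2 log p(v_{nj} | x_n) = v_{nj}^2 + | x_n' - v_{nj} b_j |^2 + const,
% with curvature 1 + b_j^T b_j = 1 + \beta_j^2, whence
\operatorname{Var}(v_{nj} \mid x_n) = \frac{1}{1 + \beta_j^2},
\qquad
\text{probable error} \;\sim\; \frac{1}{\sqrt{1 + \beta_j^2}} .
```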
8. Acknowledgement
This work owes much to the assistance of P. R. Freeman. The work was supported by an Australian Research Council grant (A49030439).
References

[1] AKAIKE, H. (1987) Factor Analysis and AIC, Psychometrika 52, 317–332.

[2] BARRON, A. R. and COVER, T. M. (1991) Minimum Complexity Density Estimation, IEEE Transactions on Information Theory 37, 1034–1054.

[3] BOZDOGAN, H. (1987) Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions, Psychometrika 52, 345–370.

[4] HARMAN, H. H. (1967) Modern Factor Analysis, Second Edition, Univ. of Chicago Press.

[5] KENDALL, M. (1980) Multivariate Analysis, Second Edition, Charles Griffin Pub., 34.

[6] MUIRHEAD, R. J. (1982) Aspects of Multivariate Statistical Theory, Wiley, New York.

[7] PRESS, S. J. and SHIGEMASU, K. (1989) Bayesian Inference in Factor Analysis, in Contributions to Probability and Statistics, Gleser et al. (eds.), Springer-Verlag, 271–287.

[8] SCHWARZ, G. (1978) Estimating the Dimension of a Model, Annals of Statistics 6, 461–464.

[9] WALLACE, C. S. and FREEMAN, P. R. (1987) Estimation and Inference by Compact Coding, J. R. Statist. Soc. B 49, 240–252.

[10] WALLACE, C. S. and FREEMAN, P. R. (1992) Single Factor Analysis by MML Estimation, J. R. Statist. Soc. B 54, 195–209.

[11] WALLACE, C. S. (1989) False Oracles and MML Estimation, Technical Report TR128, Computer Science, Monash University, Melbourne.

[12] WALLACE, C. S. (1995) Multiple Factor Analysis by MML Estimation, Technical Report TR218, Computer Science, Monash University, Melbourne.
[LA: A copy of the TR may also be obtained by writing to the Faculty of I.T. (Clayton campus), Monash University, Vic. 3800, Australia.]

[13] WITTEN, I. H., NEAL, R. M. and CLEARY, J. G. (1987) Arithmetic Coding for Data Compression, Comm. A.C.M. 30, 520–540.