Multiple Factor Analysis by MML Estimation
C. S. Wallace
Abstract: Previous work by Wallace and Freeman (1992) on the application of Minimum Message Length (MML) estimation to the Factor Analysis model of a multivariate Gaussian population is extended to allow several common factors. The extension is not trivial, raising problems in the choice of an appropriate prior for the factors, and in the evaluation of the Fisher information of the model. However, the resulting complications appear to cancel out, giving an estimator similar to the single-factor case.
The estimator has been extensively tested on simulated data, and compared with the maximum likelihood and Akaike Information Criterion (AIC) estimator. The MML estimator is found to be substantially more accurate, to provide consistent estimates of factor scores, and to recover the number of common factors more reliably than a likelihood-ratio test among maximum likelihood models.
Keywords: Minimum Message Length, MML, factor analysis, multivariate Gaussian.
1. Introduction
1.1 The model
We have developed and tested a Minimum Message Length (MML) estimator for a version of the well known Factor Analysis model for a multivariate population, here assumed to have a Gaussian density. The work extends an earlier work (Wallace and Freeman 1992), which addressed the restricted problem with a single factor. Here, we consider models with an unknown number of factors. In some respects the earlier work generalizes in a straightforward way, and we assume familiarity with it. For consistency, we retain and extend the perhaps unusual notation of the single-factor paper. Hereafter, that paper is referred to as WF.
The data is a sample of N data vectors {xn, n=1...N}, each of K components {xnk, k=1...K}, independently drawn from the population. A version of the Factor model is:
(1.1) xnk = μk + Σj vnj akj + σk rnk   (j = 1...J, J < K)
where parameter μk is the mean in dimension k, parameter σk is the "specific variation" in dimension k, parameters {aj, j=1...J} are a set of J "Factor load" vectors each of dimension K, parameters {vj, j=1...J} are a set of J "score" vectors each of dimension N, and the NK variates {rnk, n=1...N, k=1...K} are regarded as having independent N(0,1) distributions (Harman 1967).
The conceptual model underlying factor analysis is that there exist some hidden, or latent, variables which have a linear effect on the K variables actually measured. The aim of a factor estimation method is to discover, to the extent revealed by the data, the number J of these variables, their values {vnj} for each member of the sample, and their effects on the measured variables, represented in the model by the factor load vectors {aj}. The model further assumes that each measured variable is subject, independently of all other variables, to some Normally-distributed variation or measurement error, rnk, which is not affected by the hidden variables. The magnitude of this "specific" variance is modelled by the set of parameters {σk}.
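For concreteness, the following minimal Python sketch (an illustration added here, not part of the original analysis) draws a sample from model (1.1); the sizes N, K, J and all parameter values are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, J = 300, 16, 2                     # sample size, dimensions, factors

    mu = rng.normal(size=K)                  # means mu_k
    sigma = np.abs(rng.normal(1.0, 0.2, K))  # specific deviations sigma_k
    a = rng.normal(size=(K, J))              # factor load vectors a_j (columns)
    v = rng.normal(size=(N, J))              # factor scores v_nj ~ N(0,1)
    r = rng.normal(size=(N, K))              # specific variation r_nk ~ N(0,1)

    # (1.1): x_nk = mu_k + sum_j v_nj a_kj + sigma_k r_nk
    x = mu + v @ a.T + sigma * r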
In the Maximum Likelihood (ML) treatment of the factor model, the model actually estimated is less complete. As ML is incapable of consistent simultaneous estimation of both factor loads and factor scores, the latter are treated not as parameters but as random variables with N(0,1) distributions, and the inferred model contains no inference of their individual values.
The model (1.1) contains ambiguities. First, the factor load vector aj and factor score vector vj appear in the model only via their product. Thus the data cannot reveal their individual lengths. (By "length", we mean throughout the Euclidean vector length √(aTa) = |a|.) This ambiguity is conventionally resolved by assuming the scores {vnj} to be parameters with independent unit Normal priors. An alternative resolution is to constrain the scores so that Σn vnj2=N for all j. This alternative is mathematically more difficult and will not be pursued.
A more troublesome ambiguity is that the data cannot reveal the individual factor loads. For each load vector aj, define the scaled load tj given by tkj = akj / σk (k=1...K). We will call the vectors tj the "true latent vectors" (TLVs). Assuming Normal distributions of scores, the latent variable loads affect the population density only through the matrix Σj tj tjᵀ, so the data cannot distinguish among the infinitely many sets of TLV vectors giving the same matrix. It is conventional and convenient to design the estimator to estimate a set of orthogonal latent vector (OLV) loads {βj} such that Σj βj βjᵀ = Σj tj tjᵀ and βjᵀ βl = 0 for l ≠ j. We similarly require the estimated score vectors to be orthogonal.
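The reduction from TLVs to OLVs can be computed by an eigendecomposition of Σj tj tjᵀ, as in the following sketch (ours; the function name is arbitrary):

    import numpy as np

    def olv_from_tlv(T):
        """T: K x J matrix whose columns are the TLVs t_j. Returns a K x J
        matrix B of mutually orthogonal OLVs beta_j with B B^T = T T^T."""
        M = T @ T.T                                # sum_j t_j t_j^T
        w, U = np.linalg.eigh(M)                   # eigenvalues, ascending
        J = T.shape[1]
        w, U = w[-J:], U[:, -J:]                   # the J non-null directions
        return U * np.sqrt(np.maximum(w, 0.0))     # beta_j = sqrt(w_j) u_j

    T = np.random.default_rng(1).normal(size=(16, 3))
    B = olv_from_tlv(T)
    assert np.allclose(B @ B.T, T @ T.T)           # same implied matrix
    assert np.allclose(B.T @ B, np.diag((B * B).sum(axis=0)))  # orthogonal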
1.2 MML
Minimum Message Length (Wallace and Freeman 1987), or MML, is a form of Bayesian model selection and estimation principle. Using the result from Information Theory that the most efficient binary encoding of an event or value of probability P has length -log2 P binary digits, MML considers a particular efficient encoding of the given data x as a message having two parts. The first part specifies a model probability distribution over data P(x | θ), and specifies estimates θ̂ of the parameters θ. The second part, of length -log2 P(x | θ̂), encodes x using a code which is efficient if θ = θ̂, e.g., a Huffman or Arithmetic code (Witten et al., 1987). The length of the first part depends on the prior parameter density h(θ), but also on the precision, or round-off quantum, δ, with which the parameter is specified, and is approximately -log2 (h(θ̂) δ). The total message length is minimized by an appropriate choice of δ, which may be a function of the sample size and of θ̂. In sufficiently regular cases, the optimum δ is close to 1/√F(θ̂), where F() is the Fisher information. Hereafter, we use natural logs for convenience.
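As a toy numeric illustration of the two-part construction (our example, not the paper's): for N samples from N(θ, 1) with a flat prior over a range of width 10, the Fisher information for the mean is F = N, so δ = 1/√N and the total length can be computed directly.

    import numpy as np

    def message_length(data, theta_hat):
        N = len(data)
        delta = 1.0 / np.sqrt(N)               # quantum 1/sqrt(F), F = N
        part1 = -np.log((1.0 / 10.0) * delta)  # states theta_hat to precision delta
        part2 = (0.5 * np.sum((data - theta_hat) ** 2)
                 + 0.5 * N * np.log(2 * np.pi))  # -log P(x | theta_hat)
        return part1 + part2                   # natural logs throughout

    data = np.random.default_rng(2).normal(1.3, 1.0, 100)
    print(message_length(data, data.mean()))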
MML chooses the model and parameter estimates which minimize the total message length. Among its virtues are its invariance under non-linear transformations of the parameter space, and its avoidance of the over-fitting to which Maximum Likelihood (ML) is prone. For convergence theorems see Barron and Cover (1991), Wallace (1989).
If the set of models contemplated is a family of models of increasing order and number of parameters, as is the case in the Factor problem, MML chooses the order and parameter values in one step. If regarded simply as an order-selection process, it shares with Schwartz (1978) the property of penalizing the log-likelihood by an amount of order (1/2)log(sample size) for every free parameter, rather than the constant unit penalty of AIC. MML differs from both in that it estimates parameters as well as selecting an order. Note that Bozdogan (1987) argues that AIC should similarly depend logarithmically on sample size.
Both Bozdogan and Schwartz thus argue that the choice of order should be based on maximizing a quantity like (log likelihood) - (D/2) log N where N is the sample size and D the number of free parameters, but their arguments ignore terms of lower order.
For certain parameters of the factor model, (μk, akj, σk), the round-off quanta used in MML vary as 1/√N, so the message length contains a contribution of order (1/2)log N for each such parameter. However, for certain other parameters (the factor scores vnj) the effective sample size is only K, the number of components of each data vector. MML automatically accounts for this difference in the nature of the parameters, and also includes relevant terms of order one which are ignored by Bozdogan and Schwartz.
The aim of this paper is not to show how prior information can be included in a Factor analysis (although the work presented here could be extended to do so) but rather to show that even when an uninformative prior is assumed, MML gives results different from, and superior to, the results from maximizing the likelihood.
1.3 Related Work
Press and Shigemasu (1989) surveyed several Bayesian approaches to Factor analysis. A property shared by them all is that point estimates of factor scores and loads are obtained as either modes or means of the posterior density, or, in most methods, modes of a marginal posterior density. Such estimators are not invariant under non-linear transformations of the parameter space. For instance, they would give different results were the factor load vectors to be specified in Polar rather than Cartesian co-ordinates.
Press and Shigemasu described a new formulation which allows an explicit expression for the marginal posterior of factor scores. Adopting its mean as an estimate, they find a posterior density for the loads, and are able to give credibility intervals for elements of the load vectors. The method appears to require an informative prior for the loads, and thus by-passes questions of the number and identification of latent factors.
Akaike (1987) describes his Information Criterion (AIC) as a means of choosing a model which has maximal predictive accuracy in the sense of minimizing the expected Kullback-Leibler distance between the population and model distributions. He proposed AIC, basically 2(log likelihood - number of estimated parameters), as a criterion for estimating the number of factors in a factor model, but recognised that the possible existence of improper ML solutions is a problem. He proposed use of a Normal prior for factor loads to suppress these solutions, but the proposal is not really Bayesian, as the "prior" is a function of sample size, and hence cannot be a representation of prior knowledge (or ignorance) about the model. In section 6.3 we show evidence that MML, although not specifically designed to do so, yields smaller Kullback-Leibler distances than does AIC.
2. Priors
In this paper, we attempt to adopt priors which are as nearly as possible uninformative.
In considering the choice of a prior for the latent variable effects (the "factor loadings" aj) the estimate of σk for each measured variable may be taken, in default of any other measure, as determining a natural scale for that variable. The effects of hidden variable j on the measured variables, when expressed in these natural scales, are given by the vector tj and the entire factor model is most "naturally" expressed in terms of measured and hidden variables represented in the "naturally scaled" space. We therefore suppose the prior density over the length and direction of a factor load vector to be colourless in the scaled space. In particular, we suppose tj to have a spherically symmetric prior density, and each of the J such vectors to be an independent realization of this density.
This assumption does not carry over to the priors assumed for the estimated OLV vectors {βj}. If the "true" scaled effects of the hidden variables are given by the TLV vectors {tj}, we have no reason to suppose that these vectors will be mutually orthogonal. Indeed, we assume their directions to be independently uniformly distributed, this being the vaguest possible prior for their directions.
Pro tem, assume the TLVs to have independent spherically symmetric prior densities of K-dimensional Normal form with scale parameter ρ. If a set of J vectors {uj} is drawn from this density, the joint density of their lengths {uj} is
(1/ρ)^J ∏j=1...J HK(uj/ρ)
where
(2.1) HK(z) = (K / {2^(K/2) (K/2)!}) z^(K-1) e^(-z²/2)
If now a set of J mutually-orthogonal vectors {cj} is constructed such that
Σj cj cjᵀ = Σj uj ujᵀ
the joint distribution of the lengths {cj} is proportional to (Muirhead, 1982, p.107)
(2.2) (1/ρ)^J ∏j=1...J [ HK(cj/ρ) ∏l<j |cj² - cl²| / (cj cl) ]
The density (2.2) has the same form as (2.1) save for a "correction" factor which suppresses the density if two or more vectors have similar length, and inflates the density when one or more vectors are small.
Thus, starting from the rather vague prior over the TLVs, namely that they have independent spherically-symmetric Normal densities, we are led to conclude that an appropriate prior for the OLVs actually estimated shows their lengths not to be independently distributed. It has been suggested that one might prefer to model the OLV prior directly, presumably assuming the OLV lengths to be independent. However, our conceptual model is that the observed data are linearly affected by several distinct hidden variables, and there seems no prior reason to suppose that the effects of these variables on the measured variables should bear any relation to one another such as orthogonality. If this argument is accepted, then we expect the lengths of the OLVs rarely to be close to equality. A pair of OLVs can have equal length only in the unlikely coincidence that a pair of hidden variables (TLVs) happens to have equal lengths and to be orthogonal. Form (2.2) expresses the prior belief that such coincidences are unlikely.
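The repulsion of OLV lengths in (2.2) is easy to observe numerically. The sketch below (ours) draws pairs of Gaussian TLVs, orthogonalizes them as above, and counts how often the two squared lengths lie within 20% of each other; since the orthogonalized lengths always straddle the raw ones, the OLV count can never exceed the TLV count, and is typically well below it.

    import numpy as np

    rng = np.random.default_rng(3)
    K, trials = 16, 20000
    close_tlv = close_olv = 0
    for _ in range(trials):
        T = rng.normal(size=(K, 2))              # two TLVs with rho = 1
        u = np.sort((T * T).sum(axis=0))         # squared TLV lengths
        w = np.linalg.eigvalsh(T @ T.T)[-2:]     # squared OLV lengths
        close_tlv += u[1] / u[0] < 1.2
        close_olv += w[1] / w[0] < 1.2
    print(close_tlv / trials, close_olv / trials)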
The lengths of the v and β vectors for an OLV are confounded in the model, as only their product affects the likelihood. Thus, the estimator really estimates the product vj βj = |vj| |βj|. We therefore consider that the model quantity most closely analogous to the magnitude of an OLV is vj βj (or vj βj / N), and hence adopt for the joint prior density of the lengths of {βj} and {vj} a form which is the assumed density of the "true" load and score vectors, times a "correction" factor of the form above (2.2) depending on the vj βj products of the OLVs:
(2.3) h(βj, vj) = C (1/ρ)^J ∏j=1...J [ HK(βj/ρ) HN(vj) ∏l<j |vj² βj² - vl² βl²| / {vj βj vl βl} ] 2^J J!
where C is a normalization constant independent of ρ.
The factor 2J J! is included because the model is unchanged by simultaneous negation of βj and vj, and the labelling of the OLVs is arbitrary.
An estimator was constructed using the prior (2.3). It performed well in trials on simulated data provided the lengths of the load vectors were not greatly dissimilar. However, when the lengths ranged over a ratio of five or more, the estimator tended to over-estimate the shortest and under-estimate the longest. This bias arises because, for large K, the multivariate Normal prior implies a TLV length prior (2.1) with a relatively narrow peak at ρ √K. Even if ρ is treated as a hyperparameter to be estimated from the data, (2.1) still expresses a prior expectation that all TLVs will have similar lengths. We therefore sought an alternative TLV prior with vaguer implications for TLV lengths.
Unfortunately, the Normal TLV prior is the only form for which we were able to find an explicit expression for the corresponding OLV prior. As a compromise, we adopted the following OLV prior for load and score lengths:
(2.4) h(βj, vj) = CNKJ ∏j=1...J [ TK(βj) HN(vj) ∏l<j |vj² βj² - vl² βl²| / {vj βj vl βl √((1+βj²)(1+βl²))} ] 2^J J!
where TK(β) = {2 Γ((K+1)/2) / (√π Γ(K/2))} β^(K-1) / (1+β²)^((K+1)/2)
and CNKJ is a normalization constant discussed later.
This form assumes a very diffuse load length prior, and retains a "correction" factor similar to that of the Normal form (2.3). Simple geometric arguments show that the joint OLV length density derived from any smooth, independent, TLV length density will exhibit a "correction factor" of roughly this form. The prior (2.4) is a generalization of the "Cauchy" prior used in WF, and is far more diffuse than the Normal form.
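Coded directly (a transcription of the displayed formula, valid for β > 0):

    from math import lgamma, log, log1p, pi, exp

    def T_K(beta, K):
        """Load-length prior T_K(beta) of (2.4); requires beta > 0."""
        logc = log(2.0) + lgamma((K + 1) / 2) - 0.5 * log(pi) - lgamma(K / 2)
        return exp(logc + (K - 1) * log(beta)
                   - ((K + 1) / 2) * log1p(beta ** 2))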
The joint prior density of the directions of the OLVs is assumed to have the uniform value
(2.5) ∏j=1...J 1/SK-j+1
where SD = D π^(D/2) / (D/2)! is the surface area of the unit D-sphere, and we similarly assume the joint prior density of the directions of the score vectors to be uniformly
(2.6) ∏j=1...J 1/SN-j+1
As in WF, we assume the specific variances {σk} to have independent prior densities proportional to 1/σk in some finite range, and the means {μk} to have independent uniform priors in some finite range.
The assumed priors import little useful information into the analysis. Those for means and specific variances have the conventional "uninformative" forms, the priors of load and score vectors presume no preference for any direction, and the prior for load strengths expresses only the vaguest expectation that they will be of the same order as the specific variations.
3. The Information Matrix
MML chooses the model and parameter estimate θ̂ which minimize
T(θ̂) = - log[ (h(θ̂) / √I(θ̂)) Prob(data|θ̂) ]
where h(θ) is the prior density of the parameter vector θ, and 1/√I(θ) is an approximation to the volume in θ-space spanned by the expected estimation error. I(θ) is the determinant of the matrix of expected partial second derivatives of T with respect to the components of θ. For many estimation problems, I(θ) is well approximated by the Fisher Information, but for the factor model, the derivatives of h() cannot be neglected. As in WF (section 3), we take into account the derivatives of log h() with respect to the factor scores. We depart from WF by choosing to express the model in terms of the scaled OLV components {βkj} rather than the unscaled component {σk βkj}. Equation 4.2 of WF gives the sensitivity determinant for a single OLV β and score vector v = (v1, ..., vN) as
I1(θ) = (2N)^K (Nv² - (Σn vn)²)^K (1+β²)^(N-2) / ∏k σk⁶
where β = |β|, v = |v|
Using Σn vn = 0 from WF p.198, and changing to scaled components gives
(3.1) I1(θ) = 2^K N^(2K) v^(2K) (1+β²)^(N-2) / ∏k σk⁴
The calculation in WF generalizes directly to J>1 factors giving
(3.2) I2(θ) = 2^K N^(2K) ∏j=1...J [ vj^(2K) (1+βj²)^(N-2) ] / ∏k σk⁴
Note that (3.2) is the determinant of a matrix which includes derivatives with respect to all K components of each OLV βj and all N components of each score vector vj. Thus, it includes the sensitivity of T to differential changes of {βj} and {vj} which do not preserve the mutual orthogonality of OLVs and score vectors. However, in computing (3.2), use has been made of the fact that the derivatives are evaluated for mutually-orthogonal vectors. The question of enforcing the derivatives to preserve orthogonality is addressed below. First, we switch to the polar parameterization of the model, using length and direction rather than Cartesian components for the OLVs and score vectors. In this change of parameters, the sensitivity determinant transforms as the square of a density, giving
(3.3) I3(θ) = 2^K N^(2K) ∏j [ vj^(2(K+N-1)) (1+βj²)^(N-2) βj^(2(K-1)) ] / ∏k σk⁴
Consider some pair of OLVs βj and βl, and their associated score vectors vj and vl. The expression (3.3) contains contributions from the sensitivity of T to changes in direction of βj, βl, vj and vl. Let us consider changes in the directions of βj and βl corresponding to differential rotations of these vectors in the plane containing them both. Let θj be an angle giving the direction of βj in the plane, and θl be an angle giving the direction of βl in the plane. Similarly, let φj, φl be angles giving the directions of vj, vl in the plane containing them both. Then I3(θ) contains a contribution due to the expected second differentials of T with respect to θj, θl, φj and φl. It can be shown that this contribution amounts to a multiplicative factor of I3 given by
(3.4) vj² (1+βj²) · vl² (1+βl²) · vj² βj² · vl² βl²
and that there are no cross-derivatives with other parameters of the model with non-zero expectation.
As noted above, (3.4) arises from the unconstrained variation of the four parameters θj, θl, φj and φl. In fact, the four parameters are constrained by the requirement of orthogonality, so that there are really only two parameters, say θ and φ, with
(3.5) θj = θ,  φj = φ,  θl = θ + π/2,  φl = φ + π/2
When T is expressed in terms of these two parameters, and its second differentials with respect to them calculated, it is found that together they contribute to the sensitivity determinant a multiplicative factor of the form
(3.6) (vj² βj² - vl² βl²)²
rather than the form (3.4).
The above expression shows that when factors j and l have equal size, a rotation can be applied to the OLVs and score vectors which has no effect on the likelihood or the priors, and hence no effect on T. The reader might be surprised by the fact that this degeneracy becomes apparent only when the orthogonality constraint is taken into account, since such rotations are possible, and have no effect on T, whether or not the OLVs are explicitly constrained to be orthogonal. The reason that the degeneracy is not apparent in the expression I3(θ) is that I3(θ) indicates the effects of perturbations of the parameters on T only to second order. When the OLVs and score vectors are not explicitly constrained, simultaneous variation of two OLVs and two score vectors is required to produce a perturbation having no effect on T, and so the degeneracy would be revealed only by a fourth-order analysis treating differentials up to the fourth. Once the orthogonality constraint is applied, however, the simultaneous variation involves only two, rather than four, dimensions of parameter space, and so is revealed by a second-order analysis using second partial differentials.
Modifying I3(θ) by replacing a factor of form (3.4) by a factor of form (3.6) for every pair of OLVs gives the final sensitivity determinant as
(3.7) I(θ) = 2^K N^(2K) ∏j [ (vj² + (1+βj²)/ρ²) vj^(2(K+N-2J)) (1+βj²)^(N-J-1) βj^(2(K-J)) ] · [ ∏j ∏l<j (vj² βj² - vl² βl²)² ] / ∏k σk⁴
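For reference, (3.7) transcribes directly into code; the following sketch (ours) returns log I(θ) for numerical stability, taking the load and score lengths, the specific deviations, and the prior scale ρ as inputs.

    import numpy as np

    def log_I(v, beta, sigma, N, rho=1.0):
        """v, beta: length-J arrays of |v_j|, |beta_j|; sigma: length-K array."""
        J, K = len(v), len(sigma)
        v2, b2 = v ** 2, beta ** 2
        out = K * np.log(2.0) + 2 * K * np.log(N) - 4 * np.sum(np.log(sigma))
        out += np.sum(np.log(v2 + (1 + b2) / rho ** 2)
                      + (K + N - 2 * J) * np.log(v2)
                      + (N - J - 1) * np.log1p(b2)
                      + (K - J) * np.log(b2))
        for j in range(J):                        # the pairwise repulsion term
            for l in range(j):
                out += 2 * np.log(abs(v2[j] * b2[j] - v2[l] * b2[l]))
        return out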
4. The MML Estimator
In this section, σk , βj , vj, etc. refer to estimates rather than "true" parameter values.
The MML estimator is chosen to minimize MessLen = (1/2) log I - log(prior density) - log(likelihood).
Omitting constant terms, we have
(4.1) MessLen = (N-1) Σk log σk + (1/2)(N+K-1) Σj log(1+βj²) + (1/2)(K-J+1) Σj log vj² + (1/2) Σj vj² + (1/2) Σn Σk (ynk - Σj vnj βkj)²
where {ynk} are the scaled data {(xnk - μk) / σk} and we assume the estimate μ̂k = Σn xnk / N. (The proof of this equality follows the proof in WF.) Equation (4.1) is a simple generalization of the equation for MessLen given in WF p.198.
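A direct transcription of (4.1) (a sketch, with the data as an N × K matrix x, loads B (K × J), scores V (N × J)):

    import numpy as np

    def messlen_41(x, B, V, sigma):
        """MessLen of (4.1), constant terms omitted as in the text."""
        N, K = x.shape
        J = B.shape[1]
        y = (x - x.mean(axis=0)) / sigma       # scaled data y_nk
        b2 = (B * B).sum(axis=0)               # squared load lengths
        v2 = (V * V).sum(axis=0)               # squared score lengths
        resid = y - V @ B.T                    # y_nk - sum_j v_nj beta_kj
        return ((N - 1) * np.sum(np.log(sigma))
                + 0.5 * (N + K - 1) * np.sum(np.log1p(b2))
                + 0.5 * (K - J + 1) * np.sum(np.log(v2))
                + 0.5 * np.sum(v2)
                + 0.5 * np.sum(resid ** 2))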
Note that the factors of the form (vj2 βj2 - vl2 βl2), which appear both in the prior and in I, cancel out and do not appear in MessLen.
This cancellation is not an accident, as these factors arise from a singularity in the mapping of TLVs onto OLVs which affects both the prior and I in similar ways. Consider the example of a model with only two latent vectors (J=2) in K dimensions. The space of possible TLVs has 2K dimensions, but the space of possible OLVs has only 2K-1, as the OLVs lose one degree of freedom by being constrained to be orthogonal. For each OLV pair, there is a one-dimensional manifold of TLV pairs which map onto the OLV pair. Normally, the mapping is unique: such a manifold of TLV pairs maps to a single OLV pair. However, consider a set of TLV pairs in which all pairs lie in the same fixed plane of K-space, both TLVs of all pairs have the same fixed length, and the two TLVs of each pair are orthogonal. This set is a 1-dimensional manifold, as there is only one degree of freedom: the angular orientation of the TLV pair in the fixed plane.
The matrix Σj tj tjᵀ has the same value for all pairs in the set. However, rather than mapping into a single OLV pair, such a special set of TLV pairs maps into a 1-dimensional manifold of OLV pairs identical to the set of TLV pairs. Normally, the prior density in a 1-dimensional manifold of TLV pairs condenses onto a single OLV pair, but for a special TLV set, as above, the prior density is spread over a manifold of OLV pairs, so these OLV pairs acquire a zero prior density.
Similarly, it is normally the case that if two OLV pairs differ, then they are images of two different TLV manifolds, giving rise to two different data distributions. However, if the two OLV pairs are members of the image of the same special TLV set, they give rise to the same data distribution, and so the perturbation which changes one pair into the other (the θ of (3.6)) has no effect on T, and so I is zero. An OLV pair is in the image of a special TLV set just when the two OLVs have equal length, so just in this case the prior density of the pair drops to zero, and I becomes zero.
We now proceed to describe the MML estimator.
Define:
- yn as the K-vector {ynk, k=1...K},
- wnk as xnk - μk = σk ynk,
- Y as the K × K matrix Σn yn ynᵀ,
- and the variables {Rj, Qj, j=1...J} defined in (4.3), (4.5) below.
Then the estimates minimizing MessLen satisfy (Wallace and Freeman 1992):
(4.2) σk² = Σn wnk² / (N - 1 + Σj Rj βkj²)
(4.3) Qj = 1 + βj² + (K-J+1) / vj²
(4.4) vj² = βjᵀ Y βj / Qj²
(4.5) Rj = vj² + (N+K-1) / (1+βj²)   or   Rj = N+J-2
(4.6) βj = Y βj / (Qj Rj)
Equations (4.2) to (4.6) may be solved by functional iteration. After each iteration, the revised vectors {βj} are re-orthogonalized by evaluating
(4.7) βj(new) = βj - Σl<j βl (βl · βj) / βl²   (j = 2...J)
To start the iteration one may set Rj = vj = N (all j), and set the βj vectors parallel to the J dominant eigenvectors of the data correlation matrix, with squared lengths equal to the eigenvalues.
As in WF, the iteration may drive an OLV to zero if the data are consistent with a model having fewer than J factors. In that case, the iteration is restarted with reduced J.
The iteration does not directly require or yield estimates of the score vectors. These may be calculated after convergence as
(4.8) vnj = Σk ynk βkj / Qj
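Gathering (4.2)-(4.8), the whole estimator fits in a short routine. The following sketch (ours, under the stated initialization; it does not handle the collapse-and-restart case beyond stopping) iterates the equations on a raw N × K data matrix:

    import numpy as np

    def mml_factor(x, J, iters=200, tol=1e-8):
        N, K = x.shape
        w = x - x.mean(axis=0)                        # w_nk = x_nk - mu_k
        sigma = w.std(axis=0)                         # initial sigma_k
        y = w / sigma
        evals, evecs = np.linalg.eigh((y.T @ y) / N)  # correlation matrix
        B = evecs[:, -J:] * np.sqrt(np.maximum(evals[-J:], 0.0))
        R = np.full(J, float(N))                      # start: R_j = N
        v2 = np.full(J, float(N) ** 2)                # start: v_j = N
        for _ in range(iters):
            B_old = B.copy()
            b2 = (B * B).sum(axis=0)
            # (4.2) specific variances
            sigma = np.sqrt((w * w).sum(axis=0) / (N - 1 + (B * B) @ R))
            y = w / sigma
            Y = y.T @ y
            # (4.3)-(4.5); a vanishing v2 here signals fewer than J factors
            Q = 1 + b2 + (K - J + 1) / v2
            v2 = np.einsum('kj,kl,lj->j', B, Y, B) / Q ** 2
            R = v2 + (N + K - 1) / (1 + b2)
            # (4.6) update, then (4.7) re-orthogonalize (Gram-Schmidt)
            B = (Y @ B) / (Q * R)
            for j in range(1, J):
                for l in range(j):
                    B[:, j] -= B[:, l] * (B[:, l] @ B[:, j]) / (B[:, l] @ B[:, l])
            if np.max(np.abs(B - B_old)) < tol:
                break
        V = (y @ B) / Q                               # (4.8) factor scores
        return B, sigma, V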
5. Message Length
The full calculation of the message length is not needed in estimating the parameters of a model with a specified number of OLVs. It is needed for the comparison of models with differing numbers of OLVs, and can be used for comparison of a factor model with a model of different functional form.
The full expression for the message length of data vector x using a model with estimated vector parameter θ, prior density h(θ), sensitivity matrix I(θ) and probability law P(x|θ) is (Wallace and Freeman 1987):
(5.1) MessLen = - log[ h(θ) / √|I(θ)| ] - log P(x|θ) - (1/2) D log(2π) + (1/2) log(Dπ)
where D is the number of scalar parameters, i.e. the dimension of θ.
Using (2.4, 2.5, 2.6, 3.7 and 4.1), we obtain
(5.2) MessLen = (K/2 - 2J) log 2 + K log N + (1/2) JN log(2π) - log(J!)
   - J log SN + Σl=0...J-1 log SN-l - J log(SK / SK+1) + Σl=0...J-1 log SK-l
   + (1/2)(K-J+1) Σj log vj² + (1/2)(N+K-1) Σj log(1+βj²)
   + (N-1) Σk log σk + (1/2) Σj vj² + (1/2) Σn Σk ynk²
   + (1/2) Σj vj² βj² - Σj βjᵀ Y βj / Qj - (1/2) D log(2π) + (1/2) log(Dπ) - log CNKJ
where D = 2K + J (N+K-J+1), Sm is defined in (2.5), ynk in (4.1), and Y is defined before (4.2). The finite ranges used in normalizing the priors for {μk} and {σk} are omitted from (5.2) as they affect all models equally.
The normalization constant CNKJ (from (2.4)) is not known in closed form. Numerical integration of (2.4) for various N, K and J has shown that log CNKJ depends little on N for N>>1, and for N≥40 is well approximated by the empirical expression
(5.3) - log CNKJ ≈ AJ + (1/2) J(J-1) log(K - BJ)
where the constants AJ, BJ are:

J     1      2      3      4      5      6
AJ    0      0.24   1.35   3.68   7.45   12.64
BJ    -      0.58   0.64   0.73   0.86   1.21
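As code (a direct transcription of (5.3) and the table; B1 is undefined since the J(J-1)/2 term vanishes for J = 1):

    from math import log

    A = {1: 0.0, 2: 0.24, 3: 1.35, 4: 3.68, 5: 7.45, 6: 12.64}
    B = {2: 0.58, 3: 0.64, 4: 0.73, 5: 0.86, 6: 1.21}

    def neg_log_C(K, J):
        """Empirical approximation (5.3) to -log C_NKJ, for N >= 40."""
        if J == 1:
            return A[1]
        return A[J] + 0.5 * J * (J - 1) * log(K - B[J])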
6. Testing
6.1 Artificial Data
Test runs were made on artificial data sets randomly selected from populations with known TLV (and hence OLV) structure. In each run, 1000 populations were tried, all with the same number K of dimensions, the same number J of TLVs and all having the same set of TLV lengths. With no loss of generality, all specific variances {σk2} were set to one. The populations differed in that, in each population, the TLV directions were randomly and independently selected from a uniform distribution over the K-sphere. For each population, a sample of constant size N was selected and the covariance matrix about the sample mean computed. This matrix was then analysed by both ML and MML methods, seeking estimated models with J-1, J and J+1 OLVs. The MML method would sometimes fail to find a model with the requested number of factors, instead finding a reduced number.
For most runs, we chose K=16, N=300, and J in the range 2 to 5. The sets of factor lengths were chosen arbitrarily, with most runs including at least one factor small enough to give some difficulty in estimating it. Note that the TLV lengths, which were held constant in each of the 1000 trials in a run, were not selected from the assumed prior distribution of TLV lengths. Hence the population OLV lengths, which varied from population to population, were not drawn from the assumed OLV length prior.
In some runs, the population OLVs of each population were generated directly, with a fixed set of equal lengths and random mutually-orthogonal directions. These "orthogonal" runs were designed to test the MML estimator on populations for which the assumed prior OLV density (2.4) was zero.
The score vectors {vj} were not explicitly estimated. ML cannot consistently estimate them together with the load vectors {βj} and specific variances {σk2}. MML can do so, but the directions of the score vectors were eliminated from the iteration equations (4.2-4.7), and only the covariance matrix made available to the two estimators.
The OLVs and specific variances estimated by the two methods for each population were compared with the population values using the criteria described below, and the average error criteria for each run were compared.
6.2 Error Criteria
We chose criteria which are meaningful even if the estimated number of OLVs Ĵ differs from the population number J. Recall that all test populations have σk = 1 for all k.
S1 = Σk log σ̂k is a measure of general over- or under-estimation of specific variance (and so under- or over-estimation of OLV effects).
S2 = Σk (log σ̂k)² is a simple measure of error in estimation of specific variances. Both S1 and S2 are zero for exact estimates.
In comparing ML and MML estimators, S1 and S2 are relevant because both estimators agree in estimating the OLVs as eigenvectors of the σ-scaled covariance Y. Hence good estimates of {σk} are essential for their estimation of OLVs.
KL = ∫ dx P(x|θ) [log P(x|θ) - log P(x|θ̂)] is the non-symmetric Kullback-Leibler distance between the true population data density and that implied by the estimate θ̂. It is zero for an exact estimate, otherwise positive. It does not require P(x|θ) and P(x|θ̂) to have the same number of parameters, and is invariant under non-linear transformations of the parameter space.
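For the Gaussian factor model, KL has a closed form: the implied covariance is Σ = diag(σ)(I + B Bᵀ)diag(σ), where the columns of B are the scaled loads, and between two zero-mean Gaussians KL = (1/2)[tr(Σ̂⁻¹Σ) - K + log(det Σ̂ / det Σ)]. A sketch of both steps (ours):

    import numpy as np

    def factor_cov(sigma, B):
        """Covariance implied by specific deviations sigma and scaled loads B."""
        D = np.diag(sigma)
        return D @ (np.eye(len(sigma)) + B @ B.T) @ D

    def kl_gauss(S_true, S_est):
        """KL distance between zero-mean Gaussians N(0, S_true), N(0, S_est)."""
        K = S_true.shape[0]
        _, logdet_t = np.linalg.slogdet(S_true)
        _, logdet_e = np.linalg.slogdet(S_est)
        return 0.5 * (np.trace(np.linalg.solve(S_est, S_true)) - K
                      + logdet_e - logdet_t)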
For each run, the ML and MML estimators were compared on two bases. First ("fixed J"), we compared the estimates produced when each method was asked to find a model with the true number J of OLVs. The MML method sometimes returned a model with fewer OLVs. Second ("chosen J"), we compared the results when the number of OLVs was itself estimated. The three ML models for Ĵ = J+1, J and J-1 were compared on the Akaike Information Criterion (AIC) (Bozdogan 1987):
AIC = -2 (log likelihood - number of free parameters)
so AIC/2 = - log likelihood + J(2K - J + 1)/2 + 2K
and the model with least AIC was chosen. Note that the number of free parameters here is not the D of (5.2), since ML does not estimate the factor scores. Of the three (or sometimes fewer) MML models, the one with the smallest message length (5.2) was chosen.
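In code, the ML selection rule amounts to minimizing the following (a sketch; the parameter count is the 2K + J(2K-J+1)/2 of the displayed formula):

    def half_aic(neg_log_likelihood, K, J):
        """AIC/2 for an ML factor model with J factors in K dimensions."""
        return neg_log_likelihood + 2 * K + J * (2 * K - J + 1) / 2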
6.3 Results on Artificial Data
All runs showed MML to give the lower average values for all three error criteria on both fixed J and chosen J bases. ML typically underestimates the specific variances (S1 < 0), with the bias increasing with increasing J. MML tends to overestimate the specific variances. Again, the bias increases with J, but is typically less than half as large as the ML bias.
The criterion S2, measuring squared error in {σk}, was typically less than half as large for MML as for ML.
The KL criterion was typically about 0.01 to 0.03 less for MML. This may seem a small difference. However, a difference of 0.02 shows that if the log likelihoods of the ML and MML models were compared on another sample with N=300 from the same population, we would expect the MML model to have the greater likelihood, by a factor of about exp(300 × 0.02) ≈ 400. Thus, a likelihood ratio test on a second sample would be expected to lead to a rather decisive rejection of the ML model in favour of the MML.
Overall, the chosen J and fixed J comparisons gave similar results, with the MML advantage increasing with increasing J.
Using the "chosen J" basis, we compared the estimates J of the number of factors. Generally, ML was far more likely than MML to find J > J, and, particularly when one or more population TLVs was small, MML more frequently found J < J. For instance, in a run with K=16, N=300, and four TLVs each of length 1.0, the distributions of J in the 1000 trials of the run were:
J 2 3 4 5 ML 0 323 565 112 MML 88 649 261 2
This result might suggest that ML is superior to MML in estimating the number of TLVs. However, closer examination shows the apparent ML advantage to be an illusion. When the 347 trials were examined in which ML estimated Ĵ = 4 and MML estimated Ĵ < 4, it was found that the MML models, despite having only 3 or fewer OLVs, were still the more accurate by all three criteria. Moreover, in these cases, if we form ML models with 3 factors, they are on average about as accurate as the ML models with four factors. Thus, it appears that the fourth factors found by ML conveyed little about the population, and were as much misleading as informative.
For runs with strong population TLVs of markedly differing lengths, it is possible to identify estimated OLVs with population OLVs with little confusion. In several such runs, additional error criteria were assessed. One summed the squared differences between the (unscaled) population and estimated OLVs; another summed the squared sines of the angles between population and estimated OLV directions. When Ĵ ≠ J, an appropriate allowance was made in the error criteria. According to these two additional criteria, MML was more accurate than ML. In a typical run, with J=4 and TLV lengths of 1, 2, 3 and 6, MML gave average values of 0.81 and 0.34 for the two criteria, while ML gave 1.11 and 0.50.
Runs in which the population TLVs had equal length and were mutually orthogonal (and hence gave equal-length OLVs) showed no unusual behaviour, despite the fact that such OLV sets have zero prior density in our scheme. The superiority of MML remained evident.
Details of the above tests are available in a Technical Report (Wallace 1995).
6.4 An Example
Press and Shigemasu (1989) analyse some aptitude data from Kendall (1980) with N=48, K=15, using their explicit Bayesian method. They assume four factors, with an informative prior on the load vectors based on experience with similar data. After determining by unspecified methods some hyperparameters appearing in their model priors, they find estimates of the factor scores and loads. (We denote their results "P&S".) The same data (p.282) was analysed by ML and our MML methods, attempting to find models with 3, 4 and 5 factors.
The three ML models gave AIC values of -512.0, -543.6 and -540.8 respectively, weakly preferring four factors over five. MML found no 5-factor solution, and gave message lengths for 3 and 4 factors respectively 195.3 and 197.1 less than the no-factor message length. Thus MML also prefers four factors, but the posterior odds in favour of four factors over three are only e^1.8, or about 6:1.
The agreement among all three four-factor models was assessed by evaluating the RMS sine of the angles (in unscaled space) between corresponding load vectors of two models, where the correspondence was chosen to minimize the RMS sine. The angles given by these RMS sines for each pair of models, and the average cosines between matched loads, were:
P & S vs. MML: 36°, av. cosine 0.80
P & S vs. ML:  42°, av. cosine 0.71
MML vs. ML:    38°, av. cosine 0.75
There is reasonable agreement between the MML and Press-Shigemasu results, with little doubt as to the correspondence between load vectors. The ML model is hard to reconcile with either of the other two, and has one load vector deviating by at least 65° from all loads of the other two models. This result, details of which are given in Wallace (1995), suggests that the two Bayesian methods, despite their different priors and approaches, recover similar views of the data. The weak MML preference for four, rather than three, factors, together with the behaviour of ML on artificial data, gives grounds for doubting that the aberrant ML factor has any validity.
7. Discussion
7.1 The Impact of the Priors
The MML method has been shown by extensive testing to be more accurate than Maximum Likelihood estimation when the data is drawn from a population consistent with the Normal Factor model. It may be asked whether the improved accuracy is partly or wholly the result of the prior information assumed in MML and other Bayesian methods. To some extent, the assumption of a proper prior for TLV lengths must help suppress the improper solutions sometimes observed using ML, as these improper solutions are characterized by large |βj| and small σk. However, the effect of our prior is small. It is very diffuse, having infinite second moments, and at least for N>>K, its effect on the message length is swamped by the effects of the sensitivity determinant. We have found experimentally that even when an improper flat prior for TLV lengths is assumed, MML still outperforms ML, and still gives on average larger estimates of specific variance despite the 1/σ prior.
The assumed prior density for TLV lengths could have been of little assistance to the MML method on the test runs reported in section 6.3, which mostly used data drawn from populations with TLV lengths of 1.0 or 2.0. Under the assumed prior, with K=16, the prior probability of a TLV length being 2.0 or less is only about 0.06, and for lengths 1.0 or less is only 0.001. Hence the prior was misleading rather than helpful under the test conditions. On the same data, a modified MML method assuming the K-dimensional Normal prior (2.1) was also tested. This prior has a hyper-parameter ρ determining the length scale of the prior, and the method estimated ρ from the data. The modified method gave almost the same estimates as the original MML method, but because factor lengths around 1.0 are less severely penalized by this prior (with ρ estimated at about 0.25), the minimum message length was more frequently achieved with Ĵ = J. The result was a further small improvement in all error measures. However, as noted in section 2, use of the Normal prior is unwise unless there are prior grounds for expecting all factors to have roughly similar lengths.
The reason MML suppresses improper models is not that they occur in regions of parameter space having low prior density. Rather, it is that their high likelihood is achieved only in a very small region of parameter space. A message using such an improper solution to encode the data would thus have to specify the parameters to high accuracy (many digits of precision) making the message long.
No other aspect of our assumed priors can be reasonably considered as importing useful information. The prior over load vector directions is uniform, representing total ignorance. The Normal prior assumed for factor scores serves only to disambiguate the confusion between load and score lengths, and is in any case tacitly assumed in the conventional ML Normal model.
7.2 Choice of Number of Factors
MML appears to provide a fairly sound basis for estimating the number of factors. It is conservative, erring on the side of rejecting factors for which the evidence is weak. If an MML solution with J factors yields a correlation matrix among the residuals {ynk - Σj vnj βkj} whose largest eigenvalue is no larger than expected from a population with J factors, then the data give no evidence for the existence of a (J + 1)th factor. In such cases, the MML functional iteration (4.2...4.7) will drive one load vector to zero if an attempt is made to find (J + 1) factors. This property is proved in WF for the single-factor problem, and has been observed experimentally for the present method. In rare cases, the iteration may find an additional factor, but the model with the additional factor gives a larger message length and is therefore rejected.
When compared with AIC selection among ML solutions, MML may more frequently select a small number of factors, but none the less still gives a better model of the population density by KL measure.
With the prior assumed here, the MML choice of Ĵ may underestimate J when an OLV is small, simply because the prior probability assigned to small (scaled) OLV lengths is small. Experiments reported in Wallace (1995) showed that, rather than choosing the MML model of smallest message length, one may with advantage choose the MML model with the highest Ĵ for which a solution exists. For populations with scaled OLV lengths around 1.0, this scheme may find a model with an additional factor (and slightly higher message length), and on average it was slightly better both at estimating J and on our error measures. Where this scheme found a different model, the message length increase over the minimum message length model was on average only about 2.0, which is of little significance. For data with larger factors, both schemes almost always agree. That is, the model with the smallest message length is almost always the model with the largest number of factors found by the MML iteration.
7.3 Assessment of Error
Unlike some other Bayesian methods, MML does not directly compute or use the posterior density. However, the elements of the sensitivity matrix may be used as indications of probable estimation error. For instance, it is easily shown that the probable error in estimating any one factor score vnj is of order 1/√(1+βj²).
8. Acknowledgement
This work owes much to the assistance of P. R. Freeman. The work was supported by an Australian Research Council grant (A49030439).
References
- [1] AKAIKE, H. (1987) Factor Analysis and AIC, Psychometrika 52, 317-332.
- [2] BARRON, A. R. and Cover, T.M. (1991) Minimum Complexity Density Estimation, IEEE Transactions on Information Theory, 37, 1034-1054.
- [3] BOZDOGAN, H. (1987) Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions, Psychometrika 52, 345-370.
- [4] HARMAN, H. H. (1967) Modern Factor Analysis Second Edition, Uni. Chicago Press.
- [5] KENDALL, M. (1980) Multivariate Analysis, Second Edition, Charles Griffin Pub., 34.
- [6] MUIRHEAD, R. J. (1982) Aspects of Multivariate Statistical Theory, Wiley, New York.
- [7] PRESS, S. J. and SHIGEMASU, K. (1989) Bayesian Inference in Factor Analysis, in Contributions to Probability and Statistics, Gleser et. al. (eds.), Springer-Verlag, 271-287.
- [8] SCHWARTZ, G. (1978) Estimating the Dimension of a Model, Annals of Statistics 6, 461-464.
- [9] WALLACE, C. S. AND FREEMAN, P. R. (1987) Estimation and Inference by Compact Coding, J.R. Statist. Soc. B 49, 240-252.
- [10] WALLACE, C. S. AND FREEMAN, P. R. (1992) Single Factor Analysis by MML Estimation, J.R. Statist. Soc. B 54, 195-209.
- [11] WALLACE, C. S. (1989) False Oracles and MML Estimation, Technical Report TR128, Computer Science, Monash University, Melbourne.
- [12] WALLACE, C. S. (1995) Multiple Factor Analysis by MML Estimation, Technical Report TR218, Computer Science, Monash University, Melbourne.
- [13] WITTEN, I. H., NEAL, R. M. AND CLEARY, J. G. (1987) Arithmetic Coding for Data Compression, Comm. A.C.M., 30, 520-540.