Mixtures
The probability density functions of two normal distributions, N(μ1,σ1) and N(μ2,σ2), scaled by w1 and w2 respectively (w1+w2=1), can be plotted together with their mixture in the ratio w1:w2; varying μ1, σ1, μ2, σ2 and the wi shows how the shape of the mixture changes.
[Interactive plot: the w1:w2 mixture of N(μ1,σ1) and N(μ2,σ2), plotted with w1·N(μ1,σ1) and w2·N(μ2,σ2).]
The above is a very simple example. It is possible to have mixtures of more than two component classes, mixtures of multi-variate distributions, mixtures of different kinds of distribution, and so on.
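As a minimal concrete sketch of the two-component case above (the function names are illustrative, not from this page), the following Python evaluates the mixture density f(x) = w1·N(x; μ1,σ1) + w2·N(x; μ2,σ2) with w2 = 1-w1:

import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma) at x.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, w1, mu1, sigma1, mu2, sigma2):
    # Density of the w1:w2 mixture, where w2 = 1 - w1.
    w2 = 1.0 - w1
    return w1 * normal_pdf(x, mu1, sigma1) + w2 * normal_pdf(x, mu2, sigma2)

# e.g. a 50:50 mixture of N(-1,1) and N(1,1), evaluated at x = 0.0:
print(mixture_pdf(0.0, 0.5, -1.0, 1.0, 1.0, 1.0))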
The Model
Given S things, each thing having D attributes (measurements), a mixture model attempts to describe the things as coming from a mixture of T classes (clusters). The description (message) states:
- the number of classes, T,
- a dictionary of class names,
- the distribution function for each class,
- for each thing, the class to which it belongs (although see fractional assignment later),
- for each thing, the attribute values using a code based on the class to which the thing belongs.
The following are assumed to be common knowledge: the number of things, S; the number of attributes, D; and the nature of each attribute.
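One way to picture these parts is as a small data structure; the following sketch (all names are illustrative assumptions, not taken from the source) separates what the message must state from what is common knowledge:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ClassModel:
    name: str                          # entry in the class dictionary
    # Per-attribute parameters, e.g. ("normal", mu, sigma) or
    # ("multistate", (p1, ..., pk)); attributes omitted here fall back
    # on the "general population" class properties.
    params: Dict[str, Tuple] = field(default_factory=dict)

@dataclass
class MixtureModel:
    classes: List[ClassModel]          # the T classes; T itself must be stated
    proportions: List[float]           # relative abundance of each class, summing to 1
    # Common knowledge, not part of the message: S, D and the nature of each attribute.
    # For each thing, the message also states its class (or, later, its fractional
    # memberships) and its attribute values, coded under that class's distributions.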
Number of Classes
The number of classes can be coded in any suitable code for (smallish) integers.
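For instance (an illustration only, not necessarily the code intended here), the Elias gamma code for positive integers uses 2·floor(log2 T) + 1 bits, which is short for small T:

def elias_gamma_length(t: int) -> int:
    # Length in bits of the Elias gamma code word for a positive integer t.
    assert t >= 1
    return 2 * (t.bit_length() - 1) + 1

# e.g. stating T = 2 costs 3 bits and T = 5 costs 5 bits under this code.
print(elias_gamma_length(2), elias_gamma_length(5))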
Class Dictionary
This specifies a code word for each class; the choice of class, i.e. the class to which a thing belongs, is described by a multistate distribution over the T classes.
Class Distributions
Each class distribution is defined by the distribution parameters for those attributes that are important to it, e.g. the mean and standard deviation for a normal distribution, or the value probabilities for a multistate distribution, etc. A class need not specify all attributes; it can optionally fall back on "general population" class properties.
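Putting the dictionary and the class distributions together: stating the class of a thing costs -log2 of that class's probability under the multistate distribution, and stating a continuous attribute measured to precision eps costs roughly -log2(density × eps) under the class's distribution. A rough sketch with illustrative names, ignoring the parameter-precision terms of a full MML treatment:

import math

def class_code_length(proportions, k):
    # Bits to state that a thing belongs to class k, given the classes'
    # relative abundances (the multistate distribution over classes).
    return -math.log2(proportions[k])

def normal_attribute_code_length(x, mu, sigma, eps=0.01):
    # Approximate bits to state a continuous attribute x, measured to
    # precision eps, under the class's normal distribution N(mu, sigma).
    density = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(density * eps)

# e.g. a thing in a class of proportion 0.5, with attribute 0.0 under N(-1, 1):
print(class_code_length([0.5, 0.5], 0) + normal_attribute_code_length(0.0, -1.0, 1.0))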
Fractional Assignment
The discussion above assumed that each thing was assigned wholly to one class or another. If two or more classes overlap strongly, e.g. if they are not several standard deviations apart, such a 'definite assignment' description and code are not optimal.
Nuisance Parameters in Definite Assignment
For example, consider a 50:50 mixture of C0 = N(-1,1) and C1 = N(1,1), and a thing ti = 0.0. Now ti could be in either class, and we do not care which. However, with definite assignment, as above, we are forced to specify that ti is in class C0, or that it is in class C1, at a cost of 1 bit. Because of the position of ti, the subsequent cost of stating its one attribute is the same in either case.
There are actually two alternative (sub-)hypotheses here: H0, that ti is in C0, and H1, that ti is in C1. Since we do not care about H0 v. H1, we should add their probabilities together. This shortens the message length, and similar considerations apply to every thing, tj, that could be in more than one class.
Class memberships have the typical characteristics of nuisance parameters: their number increases in proportion to the amount of data. If classes are close enough then, regardless of the amount of data, an inference method that uses definite assignment (such as the 1968 Snob) will not be able to detect the separate classes. A second problem is that estimates of class parameters are biased under definite assignment because the model, in effect, becomes a mixture of non-overlapping, truncated distributions.
No-Nuisance Coding
There are two ways to look at a method of coding the things efficiently.
The first view is to "borrow bits" from later in the message. The transmitter considers the code for things ti+1, ... . If this starts with a '0', ti is coded as being in class C0, otherwise C1. Either way, the receiver decodes ti, then considers the fact that the transmitter had placed it in Ci, where i = 0 or 1, and therefore understands i to be the first bit (which need not therefore be transmitted explicitly) of the rest of the message. Thus a bit is saved.
The second view of the matter is to consider the distributions for C0, C1, and their mixture. Thing ti has some probability, p, of being in C0 and taking its observed value (that is, p is the probability of sub-hypothesis H0 together with the data); by the symmetry of this example it has the same probability p under H1. Under the mixture it therefore has probability p+p = 2p, and in terms of code lengths -log2(2p) = -log2(p) - 1, a saving of one bit.
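Checking the arithmetic of this example numerically (a sketch only, not the Snob program): with p the probability of sub-hypothesis H0 together with ti's attribute value stated to precision eps, definite assignment costs -log2(p) bits while coding ti under the mixture costs -log2(2p) = -log2(p) - 1 bits:

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

x, eps = 0.0, 0.01                    # the thing ti and a nominal measurement precision
f0 = normal_pdf(x, -1.0, 1.0)         # density under C0 = N(-1,1)
f1 = normal_pdf(x, +1.0, 1.0)         # density under C1 = N(+1,1); equals f0 by symmetry

p = 0.5 * f0 * eps                    # P(H0 and the attribute value): 1 bit for the class, then the attribute
definite_bits = -math.log2(p)         # definite assignment: state C0, then the attribute
mixture_bits = -math.log2(p + 0.5 * f1 * eps)   # add the probabilities of H0 and H1: -log2(2p)

print(definite_bits - mixture_bits)   # 1.0 bit saved for this thing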
Generalisation
We have been using an example where thing ti has equal probability of coming from C0 and C1; this was only to keep the arithmetic simple. Similar considerations apply when a thing is not exactly mid-way between classes, and when there are two or more attributes, three or more classes, etc.
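In general, with class weights wk and class densities fk, coding a thing x under the mixture costs -log2(Σk wk·fk(x)·eps) bits, and its fractional membership of class k is wk·fk(x) / Σj wj·fj(x). A sketch under those assumptions, for normal classes only and with illustrative names:

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def memberships_and_cost(x, weights, params, eps=0.01):
    # Fractional memberships of thing x in each (normal) class, and the
    # length in bits of coding x under the mixture.
    joint = [w * normal_pdf(x, mu, sigma) for w, (mu, sigma) in zip(weights, params)]
    total = sum(joint)
    return [j / total for j in joint], -math.log2(total * eps)

# e.g. a thing at x = 0.3 under a 60:40 mixture of N(-1,1) and N(1,1):
memberships, bits = memberships_and_cost(0.3, [0.6, 0.4], [(-1.0, 1.0), (1.0, 1.0)])
print(memberships, bits)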
Benefits of Fractional Assignment
Using fractional assignment of things to classes, and given enough data, it is possible to distinguish, i.e. infer, classes that are arbitrarily close together, and even classes that have the same mean but different variances. Inferred class distribution parameters are also unbiased.
Notes
- C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11(2), pp.185-194, August 1968. Introduced the minimum message length (MML) theory of unsupervised classification (mixture modelling, clustering) and a resulting computer program [SNOB], perhaps the first concrete application of Shannon's theory of communication to inductive inference. The first version (only) of Snob used definite assignment.
- C. S. Wallace and P. R. Freeman. Estimation and inference by compact encoding. J. Royal Stat. Soc. B, 49, pp.240-265, 1987 [paper].
- T. Edgoose and L. Allison. Minimum message length hidden Markov modelling. IEEE Data Compression Conf., pp.169-178, 1998. An extension to sequences, or time series, of multi-variate data.
- L. Allison. Models for machine learning and data mining in functional programming. J. Functional Programming, 15(1), pp.15-32, Jan. 2005. Includes a generalised expectation-maximization (EM) algorithm for mixture modelling over arbitrary probability distributions; also see [IP].
- L. Allison. Mixture Models, Ch.7 in Coding Ockham's Razor, Springer, 2018 [doi:10.1007/978-3-319-76433-7_7].