On the Selection of the Order of a Polynomial Model
C. S. Wallace
Abstract:
A report by Cherkassy, Mulier and Vapnik [1] has compared the
performance of several methods for selecting the order of a polynomial
approximation to a function t(x) given only the values of t(x) for some
set of xvalues, where the given values are corrupted by i.i.d. Gaussian
noise of unknown variance. They compare various "Classical" methods with
a new method based on the concept of VapnikChernovenkis (VC) dimension, and
conclude that the latter gives the most reliable prediction of the value
of t(x) at unseen values of x. This work replicates their investigation
and extends it by including a method based on
Minimum Message Length (MML)
estimation [2,3]. The results largely confirm the previous results, but
show that MML is generally superior to the VC method in terms of average
squared prediction error.

Chris Wallace wrote on this topic
(i) while visiting Royal Holloway College, London, 1997, and
(ii) in Proc. of the 14th Biennial Australian Statistical Conf., Queensland, 1998.
Also^{ }see: C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, SpringerVerlag, Information Science and Statistics, 2005,
for more on MML in general, and §6.7.2, pp.272275, for function approximation in particular.
1 The problem
 The framework of the problem follows that considered by Cherkassy et al[1] (hereafter referred to as CMV). There is an unknown target function t(x) defined on the interval [1,1]. (CMV use the interval [0,1] but the choice is immaterial.) We are given some training data comprising N pairs {(x_{n}, y_{n}) : n = 1 ... N)} where the xvalues are chosen randomly from the Uniform distribution on [1,1], and for each n,
 y_{n} = t(x_{n}) + e_{n}
 where the noise values {e_{n} : n = 1 ... N} are i.i.d. as Normal_{(0,v)}. The target function and the noise variance v are regarded as unknown, but the xvalues and N are given.
 The task is to construct some polynomial function f(d,x) of degree d which may be used to predict the value of t(x) in the interval 1 < x < 1. In what follows, we will only consider degrees up to 20, and for any degree will consider only the polynomial which minimizes the squared error on the training data,
 S(d) = ∑_{n} (y_{n}  f(d,x_{n}))^{2}
 Thus, the model selection problem is reduced to the choice of the degree d.
 The success of a chosen approximation is measured by its expected squared error, i.e. the average value of (f(d,x)  t(x))^{2} in [1, 1]. Use of the approximation for extrapolation beyond this interval is not considered. Because in this work we wished to study approximations to both regular and irregular target functions, we have estimated this average squared prediction error by a MonteCarlo estimate
 ESPE(f(d, *)) = (1 / M) ∑_{m} (f(d,x_{m})  t(x_{m}))^{2}
 where the test points {x_{m} : m = 1 ... M} were chosen randomly and uniformly in [1, 1]. The number of test points used is the Max (N, 100). Of course, in comparing the performance of different methods on some training data, the same set of test points was used for the estimation of the error of all the methods.
2 The experimental protocol
 An experiment is defined by a target function, the number N of training data points, and a signaltonoise ratio R, where R is defined as the variance of the target function about zero, divided by the noise variance v. In an experiment, 1000 training data sets are generated. For each data set, or case, all leastsquares polynomial approximations of degrees up to 20 are found by standard means, and the training data error S(d) computed for each of the 21 polynomials. (If N < 22, only degrees up to N2 are used.) These error values are then given to each of the selection methods being compared, and each method selects its preferred choice among the polynomials. The test error for a method in the current case is then the estimated prediction error achieved by the polynomial the method has chosen. Note again that all selection methods must choose from the same set of 21 polynomials, and their choices are "scored" on the same set of test data. Thus if two selection methods choose the same degree in some case, they are choosing the same polynomial and will incur the same (estimated) prediction error for this case.
 After the 1000 cases of an experiment have been completed, we obtain
for each selection method:
 Its average ESPE, averaged over all cases;
 The Standard Deviation of its ESPE;
 For each degree, the number of cases in which it chose that degree and its average ESPE on those cases;
 The 5%, 25%, 50%, 75%, 95% and 99% percentiles and the maximum of its ESPE distribution over the 1000 cases.
3 The Selection Methods
 For all methods other than MML, the selection process as described by CMV is to choose the polynomial which minimizes
 g(p,N) S(d)
 where p = (d+1)/N, and g(,) is a function characteristic of the selection method.
 CMV consider three "classical" methods derived from the literature:
 Finite Prediction Error (FPE), Akaike [4]
 g(p,N) = (1+p)/(1p)
 Schwartz's criterion (SCH) [5]
 g(p,N) = 1 + 0.5 ln(N) p / (1p)
 Generalized CrossValidation (GCV), Craven & Wahba [6]
 g(p,N) = 1 / (1p)^{2}
 Finite Prediction Error (FPE), Akaike [4]
 and a fourth, due to Shibata, which we did not include as it seemed uncompetitive on CMV's results.
 CMV introduced a method based on VC dimension [7]:
 g(p,N) = 1 / (1  √(p  p ln(p) + ln(N)/(2N)))
 Note that for VC, g(p,N) diverges for some degree less than N2, which for small N may limit the maximum degree considered by VC to less than the maximum degree considered by other methods.
3.1 Selection by MML
 Minimum Message Length[2] seeks to minimize the length of a "message" which encodes the data (in this problem the training yvalues) by first stating a probabilistic model for the data, and then encoding the data using a code which would be optimal (in Shannon's sense) were the stated model true. Here, the model is assumed to be polynomial with Gaussian noise, so the model description need only specify:
 the degree d of the polynomial,
 the coefficients of the polynomial, and
 the estimated variance (or SD) of the noise.
 The optimal encoding of these features of the model depends on our prior beliefs (or ignorance) about their likely values. In this work, we attempt to make neutral, or uninformative, prior assumptions, specifically:
3.1.1
All degrees from 0 to 20 are considered equally likely a priori, so all degrees will be coded with a code word of length ln(21) 'nits', or "natural bits." As all degrees have the same code length, the coding of the model degree has no influence on the choice of model.
3.1.2
 In coding estimates of the noise variance, v, and the polynomial coefficients, some scale of magnitude may be assumed. Here, we use the variance of the given yvalues to determine such a scale. Define
 V = (^{1}/_{N}) ∑_{n} y_{n}^{2}
 In encoding a polynomial model of degree d, we suppose that the noise and each of the (d+1) degrees of freedom of the polynomial may be expected a priori to contribute equally (in very rough terms) to the observed variance V of the yvalues. Defining u = √(V / (d+2)), we assume the Standard Deviation of the "noise" has a Negative Exponential prior density with mean u, and that each of the coefficients of the polynomial has independently a N_{(0,u2)} prior density. If the coefficients were the usual coefficients of the successive powers of x, the latter assumption would be unreasonable, and highly informative. Instead, we represent a ddegree polynomial as
 f(d,x) = ∑_{j=0..d} (a_{j} Q_{j}(x))
 where the set {Q_{j}(*) : j = 0 ... d} is a set of polynomials, Q_{j}(*) being of degree j, which are orthonormal on the interval [1, 1]. That is, the average on [1, 1] of Q_{j}(x)Q_{k}(x) = 1 if j = k, and zero otherwise. The orthonormal polynomials represent effectively independent modes of contributing to the variance of f(d,*), and it seems reasonable to assume independent prior densities for the coefficients {a_{j} : j = 0 ... d}.
 With these assumptions, the overall prior density for the unknown parameters {s, {a_{j}}} (where s = √v) is
 h(s,{a}) = (1/u) e^{s/u} * ∏_{j} (1/(√(2π) u)) e^{aj2 / (2 u2)}
 The amount of information needed to encode the estimates s and {a_{j}} depends on the precision to which these are stated in the message. As shown in[2], the optimum precision is inversely proportional to the square root of the Fisher information F(s,{a_{j}}), which in this case is given by
 F(s,{a_{j}}) = 2 (N / s^{2})^{d+2} M
 where M is the covariance matrix of the orthonormal polynomials evaluated at the given xvalues:
 M_{j,k} = (1/N) ∑_{n} Q_{j}(x_{n}) Q_{k}(x_{n})
 As the given xvalues are randomly and uniformly distributed in [1, 1], M is typically neardiagonal.
3.1.3 Coding the data values
 Once the model polynomial f(d,*) and the noise SD s have been stated, the given yvalues can be coded simply by their differences from f(), these differences being coded as random values from the Normal density N_{(0,v)}. The message length required for this is then given by
 L2 = (N/2) log (2 π v) + (1/2v) S(d)
3.1.4 The total message length
 As shown in [2], the total message length is approximately given by
 I1 = (1/2) log F(s,{a_{j}})  log h(s, {a_{j}}) + L2  ((d+2)/2) log(2 π) + (1/2) log ((d+2) π)
 where the last two terms arise from the geometry of an optimum quantizing lattice in (d+2)space.
 The noise variance v = s^{2} is estimated as
 v = (1 / (Nd1)) ∑_{n} (f(d,x_{n})  y_{n})^{2}
 and the coefficients {a_{j}} are estimated by conventional leastsquares fit. These estimates do not exactly minimize the message length I1, but the difference is small.
 The MML method selects that degree d which minimizes I1 as calculated above.
4 Test results
The C/Unix programs used to obtain these results are obtainable from the author.
4.1 CMV's test function
 The CMV report contains only lowprecision boxplots of their results, on the single target function
 y = (sin (2 π x))^{2}, for x in [0, 1]
 As our program works in the xinterval [1, 1], we transformed this to
 y = (sin (π (x+1)))^{2}.
 Edited program output is shown below for values of N and signalnoise ratio copying those shown in CMV. The editing is restricted to removing some of the detailed output for all but the first test case, which is unedited to exhibit the information provided in our tests.
 The output provides a picture of the target function, and a copy of the C code statements which evaluate it. Then, each test condition gives an output headed by a line of '=' characters. The next line gives the test conditions: number of replications (here always 1000), number of data points (N), and signal/noise ratio. It also gives the resulting noise Standard Deviation, and the maximum degree of polynomial considered (MaxD) and the possibly smaller maximum degree (MaxD(VC)) used by the VC method.
 Following lines give, in a column for each selection method, the average ESPE(AV), its standard deviation (SD), and percentile points and maximum of the ESPE distribution over the cases. Finally, the average ESPE is shown for those cases where a method has selected a polynomial of particular degree (avERR), together with the number of times it chose that degree (CNT).
 Note that our results may not be directly comparable with CMV's: there may be a difference in our respective definitions of signal/noise ratio. In our work, the "signal" strength is calculated as the variance of the target function about zero, whereas CMV may use the variance about its mean. However, much the same range of test conditions is covered in both works.
 The columns for the different methods are headed by their abbreviated names, MML, VC etc.. The first column, headed BEST, gives the results which would be obtained if in each case, the polynomial were selected which gives the smallest ESPE as calculated on test data. Thus, BEST shows the best results which could possibly be obtained, but these results are of course unrealizable, as they are based on knowledge of the (noisefree) values of the target function at the test xvalues.
 a = sin (π * (x + 1.0)); y = a * a; Target mean = 0.500, SD about mean = 0.354, SD about 0 = 0.612 Minimum = 0.0000, Maximum = 1.0000 ,''', ,''', , , , , ' ' ' ' , ' ' ' , , , , , , , , ' ' ' ' ' ' ' ' , , ' ' , , , , ' ' ' ' ,,' ',,,' ', ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 10, S/N ratio 10.00 NoiseSD 0.061 MaxD 8 MaxD(VC) 4 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0804; 0.1857; 3.5005; 15.8055; 16.3748; 13.0748; SD 0.0561; 0.2633; 24.2763; 63.8077; 65.4801; 56.2695; 5pc 0.0054; 0.0078; 0.1099; 0.0091; 0.0091; 0.0082; 25pc 0.0206; 0.0385; 0.1289; 0.0863; 0.0911; 0.0733; 50pc 0.0822; 0.1236; 0.1471; 0.7974; 0.8637; 0.5827; 75pc 0.1297; 0.1880; 0.4338; 5.4448; 5.5033; 4.2212; 95pc 0.1615; 0.6075; 9.4489; 60.7315; 63.7380; 48.9659; 99pc 0.1886; 1.3700; 73.2666; 306.5231; 378.0335; 277.2285; Max 0.2481; 3.0411; 543.0019; 771.4965; 771.4965; 736.6776;  DEG avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; 0 0.137 312; 0.141 222; 0.137 564; ..... 0; ..... 0; ..... 0; 1 0.142 51; 0.281 33; 0.205 32; ..... 0; ..... 0; ..... 0; 2 0.128 38; 0.406 27; 0.815 35; 7.587 2; 7.587 2; 2.365 7; 3 0.132 6; 0.698 23; 6.320 138; 13.099 17; 14.732 12; 12.184 26; 4 0.084 113; 0.303 177; 10.891 231; 39.709 78; 40.725 75; 24.368 113; 5 0.085 8; 0.421 30;   18.996 214; 22.030 210; 15.918 249; 6 0.033 373; 0.106 426;   10.882 340; 10.883 344; 6.819 363; 7 0.028 56; 0.112 52;   18.547 195; 18.577 196; 21.611 138; 8 0.018 43; 0.095 10;   7.068 154; 6.939 161; 5.450 104; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 10, S/N ratio 5.00 NoiseSD 0.122 MaxD 8 MaxD(VC) 4 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0932; 0.1751; 2.0351; 47.3772; 48.5253; 25.5923; 50pc 0.1068; 0.1339; 0.1422; 1.7633; 1.9880; 0.9442; 95pc 0.1656; 0.5032; 6.6747; 168.5623; 177.3814; 109.8585; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 10, S/N ratio 3.30 NoiseSD 0.186 MaxD 8 MaxD(VC) 4 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.1042; 0.1924; 1.5155; 56.3134; 61.4636; 40.5831; 50pc 0.1154; 0.1402; 0.1392; 2.5108; 2.8616; 0.9503; 95pc 0.1789; 0.5076; 3.1223; 230.9929; 252.1516; 163.1735; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 10, S/N ratio 2.50 NoiseSD 0.245 MaxD 8 MaxD(VC) 4 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.1119; 0.1823; 0.7567; 97.0134; 100.8881; 55.3633; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 20, S/N ratio 10.00 NoiseSD 0.061 MaxD 18 MaxD(VC) 11 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0223; 0.0652; 1.1778; 2.2633; 2.3883; 1.5057; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 20, S/N ratio 5.00 NoiseSD 0.122 MaxD 18 MaxD(VC) 11 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0319; 0.0743; 1.9007; 7.2120; 8.0623; 3.9577; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 20, S/N ratio 3.30 NoiseSD 0.186 MaxD 18 MaxD(VC) 11 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0434; 0.0917; 2.6293; 16.1960; 16.7600; 10.8869; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 20, S/N ratio 2.50 NoiseSD 0.245 MaxD 18 MaxD(VC) 11 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0533; 0.1169; 1.8853; 36.5171; 38.5030; 24.9102; 18 ..... 0; ..... 0;   ..... 0; ..... 0; ..... 0; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 30, S/N ratio 10.00 NoiseSD 0.061 MaxD 20 MaxD(VC) 19 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0064; 0.0187; 0.2288; 1.1916; 1.2417; 0.6701; 50pc 0.0021; 0.0047; 0.0043; 0.0099; 0.0116; 0.0053; 95pc 0.0234; 0.0748; 0.3239; 3.6852; 4.2520; 0.9907; 99pc 0.1026; 0.1919; 1.9990; 26.2156; 26.2156; 15.0542; Max 0.1604; 2.1525; 92.2390; 106.6274; 106.6274; 106.6274; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 30, S/N ratio 5.00 NoiseSD 0.122 MaxD 20 MaxD(VC) 19 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0125; 0.0281; 0.2706; 2.1147; 2.2608; 0.9889; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 30, S/N ratio 3.30 NoiseSD 0.186 MaxD 20 MaxD(VC) 19 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0211; 0.0382; 0.2834; 2.7250; 3.3901; 1.4695; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 30, S/N ratio 2.50 NoiseSD 0.245 MaxD 20 MaxD(VC) 19 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0288; 0.0543; 0.3746; 11.0609; 12.3142; 1.8985; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 100, S/N ratio 2.00 NoiseSD 0.306 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0085; 0.0115; 0.0137; 0.3339; 0.3204; 0.3212; 50pc 0.0076; 0.0094; 0.0093; 0.0110; 0.0106; 0.0106; Max 0.0621; 0.1866; 2.0669; 122.5578; 122.5578; 122.5578; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 100, S/N ratio 1.00 NoiseSD 0.612 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0301; 0.0525; 0.1094; 0.1180; 0.0697; 0.0684; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 100, S/N ratio 0.50 NoiseSD 1.225 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0921; 0.1581; 0.1409; 3.2680; 3.1393; 3.1387; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 100, S/N ratio 0.25 NoiseSD 2.449 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.1755; 0.2703; 0.1863; 1.2425; 1.1484; 1.1897; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 100, S/N ratio 0.18 NoiseSD 3.402 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.2394; 0.3867; 0.2454; 13.1019; 0.7891; 0.8816; ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 100, S/N ratio 0.13 NoiseSD 4.711 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.3508; 0.6396; 0.3554; 2.9207; 1.7376; 2.4722; 
 The results clearly show that none of the methods FPE, SCH, or GCV is competitive with VC or MML, except under conditions of low noise and large N, when all methods give similar results. Of the three, GCV comes closest. These three methods will not be further discussed.
 Comparing MML and VC, MML gives lower average errors in all tests except for N = 100, S/N ≤ 0.5. In these tests, VC almost always chose degree 0 or 1, whereas MML often selected a higher degree. Other results on other target functions suggest that VC outperforms MML only when, by virtue of the shape of the target function, low N or large noise, the best approximating polynomial is of degree 0 or 1. Even then, the MML average error is only about twice the VC average, whereas (as in most of the tests shown above), the VC error can be 10 or 20 times the MML error on average, indeed hundreds of times greater on some targets, as will appear below.
 The VC method is based on theory which does not assume that the target function belongs to the model family from which the approximation is drawn (in this case, the polynomials). MML, however, is in part motivated by theory which does make this assumption. It is curious that the only target/test conditions under which VC consistently betters MML are ones where this assumption is largely valid.
 Examination of the percentile points of the error distributions shows that the MML and VC median errors are usually close. The VC average error is increased by a small number of cases giving very large VC errors. For instance, the test with N = 30, S/N = 10 shows VC actually with a slightly smaller median error than MML, but with an average error about 20 times greater.
 The claim in CMV that although VC is based on worstcase bounds, it generally provides the best average error performance, is not supported by these results. Further, although CMV hints that the VC method has been designed to minimize the error exceeded in only 5% of cases, our results show that the 95th percentile of the VC error distribution is usually larger (often much larger) than the MML 95% point. For instance, the N = 10, S/N = 10 test shows 95% points of 9.4 (VC) versus 0.6 (MML).
4.2 A logarithmic function
Here, the target function is a segment of the Log function close to its divergence at zero. It is therefore quite difficult to approximate well with a polynomial of modest degree. The full results are presented.
 y = log (x + 1.01); Target mean = 0.275, SD about mean = 0.927, SD about 0 = 0.967 Minimum = 4.6052, Maximum = 0.6981 ,,,,,,,''''''' ,,,,,''''' ,,,,''''' ,,,''' ,,''' ,,'' ,' ,'' , ,' , ' ' , ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 20, S/N ratio 30.00 NoiseSD 0.032 MaxD 18 MaxD(VC) 11 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0117; 0.0510; 0.5492; 1.4009; 1.4758; 1.0794; SD 0.0263; 0.2047; 8.2916; 11.0360; 11.1183; 10.6686; 5pc 0.0006; 0.0018; 0.0014; 0.0013; 0.0012; 0.0013; 25pc 0.0016; 0.0063; 0.0045; 0.0065; 0.0065; 0.0048; 50pc 0.0038; 0.0146; 0.0132; 0.0227; 0.0246; 0.0163; 75pc 0.0109; 0.0409; 0.0371; 0.1179; 0.1399; 0.0590; 95pc 0.0475; 0.1525; 0.2229; 2.6842; 3.0647; 1.2720; 99pc 0.1138; 0.5033; 7.3907; 23.5344; 24.0526; 13.8202; Max 0.4116; 4.0862; 240.2782; 240.2782; 240.2782; 240.2782;  DEG avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; 0 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 1 0.363 2; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 2 0.175 2; 0.168 13; 0.151 6; 0.156 3; 0.156 3; 0.156 3; 3 0.088 16; 0.069 137; 0.084 54; 0.084 22; 0.069 15; 0.088 32; 4 0.037 41; 0.029 234; 0.035 151; 0.038 68; 0.053 58; 0.040 82; 5 0.029 70; 0.063 216; 0.020 216; 0.030 127; 0.028 114; 0.027 165; 6 0.012 137; 0.047 191; 0.114 192; 0.125 106; 0.155 100; 0.087 146; 7 0.009 189; 0.029 102; 0.871 162; 0.700 149; 0.723 145; 0.600 171; 8 0.006 192; 0.046 69; 2.321 116; 4.370 121; 3.881 137; 4.014 126; 9 0.004 149; 0.102 19; 1.126 73; 1.640 125; 1.678 127; 1.420 118; 10 0.003 104; 0.075 10; 0.775 25; 1.865 90; 1.820 96; 0.888 68; 11 0.003 58; 0.050 7; 0.066 5; 2.735 80; 2.504 88; 4.020 40; 12 0.003 27; 0.089 1;   1.219 40; 2.212 46; 1.063 24; 13 0.002 10; 0.504 1;   1.207 39; 1.246 38; 2.173 13; 14 0.002 2; ..... 0;   1.783 16; 1.586 18; 0.561 7; 15 0.001 1; ..... 0;   0.537 9; 0.537 9; 0.097 3; 16 ..... 0; ..... 0;   0.049 1; 0.251 2; ..... 0; 17 ..... 0; ..... 0;   8.295 3; 8.295 3; 0.229 2; 18 ..... 0; ..... 0;   0.004 1; 0.004 1; ..... 0;
 The above results again show a marked improvement of MML over VC, with the other three methods worse again. Note that the VC error quantiles up to the 75% mark are actually less than the MML quantiles: again it is the presence of a small number of very bad choices which lets VC down. It is interesting that VC's bad choices are not neccessarily of excessive degree. It made poor choices of polynomials of degrees 7, 8 and 9, and even some of degree 6, although MML also used these degrees frequently.
 The very large average VC error for degree 8 suggests that VC sometimes chose a polynomial which had a reasonable fit to the training data, but which had excursions far outside the range of the training data. MML is dissuaded from such choices because, to maintain a good fit to the training data, the coefficients of the estimated polynomial would have to be specified to high relative precision, requiring a long model description.
 Another test on the same target function is shown below, where the training data contained much more information.
::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 300, S/N ratio 10.00 NoiseSD 0.097 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0006; 0.0013; 0.0012; 0.0009; 0.0009; 0.0009;
In this test, MML and VC performed about equally, but were bettered by all the "classical" methods. These methods should not be dismissed in situations when there are abundant training data, although in no case that we have found do the classical methods give notably lower errors than MML. Note also that in the test above, the classical methods fairly often used degree 20, and may well have chosen higher degrees had the program permitted. That is, the limitation to degree 20 may well have been the only thing preventing these methods from the serious overfitting which they exhibit in other tests.
4.3 A Function with Discontinuous Derivative
This target, a shifted version of the absolute value function, was used to test performance on a nonanalytic form.
 y = fabs (x + 0.3)  0.3; Target mean = 0.245, SD about mean = 0.355, SD about 0 = 0.432 Minimum = 0.3000, Maximum = 1.0000 , ,' ,' ,' ,' ,' ,' ,' ,' ', ,' ', ,' ', ,' ', ,' ', ,' ', ,' ', ,' ', ,' ', ,' ', ,' ',' ::::::::::::::::::::::::::::::::::::::::::::;;;;; (Moderate data available. MML somewhat better than VC, both better than classical methods.) 1000 Cases, N = 50, S/N ratio 10.00 NoiseSD 0.043 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0007; 0.0024; 0.0093; 0.0963; 0.0965; 0.0531; ::::::::::::::::::::::::::::::::::::::::::::;;;;; (Plentiful data, all methods comparable.) 1000 Cases, N = 300, S/N ratio 3.00 NoiseSD 0.144 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0008; 0.0012; 0.0015; 0.0012; 0.0011; 0.0012; ::::::::::::::::::::::::::::::::::::::::::::;;;;; (Sparse data, MML much better than VC, classical methods poor.) 1000 Cases, N = 20, S/N ratio 10.00 NoiseSD 0.043 MaxD 18 MaxD(VC) 11 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0024; 0.0110; 0.1448; 1.2638; 1.4441; 0.7647; SD 0.0030; 0.0281; 1.6201; 9.0421; 9.1974; 8.2531; 5pc 0.0007; 0.0011; 0.0010; 0.0011; 0.0012; 0.0010; 25pc 0.0011; 0.0025; 0.0019; 0.0028; 0.0035; 0.0021; 50pc 0.0016; 0.0036; 0.0034; 0.0166; 0.0260; 0.0054; 75pc 0.0026; 0.0062; 0.0081; 0.1836; 0.2495; 0.0467; 95pc 0.0062; 0.0444; 0.1684; 4.0199; 4.8170; 1.8431; 99pc 0.0151; 0.1589; 2.0720; 23.9615; 28.7447; 12.3919; Max 0.0444; 0.3620; 33.7324; 238.2127; 238.2127; 238.2127; (Note the huge maximum errors of the classical methods.)  DEG avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; 0 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 1 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 2 0.015 12; ..... 0; 0.080 1; ..... 0; ..... 0; ..... 0; 3 0.006 111; 0.005 510; 0.005 252; 0.006 29; 0.005 17; 0.006 64; 4 0.004 43; 0.032 112; 0.042 87; 0.060 36; 0.060 26; 0.047 54; 5 0.002 369; 0.003 278; 0.012 419; 0.042 180; 0.050 148; 0.026 294; 6 0.002 120; 0.036 59; 0.410 126; 0.439 124; 0.469 118; 0.361 151; 7 0.001 167; 0.045 34; 0.213 72; 2.332 132; 2.336 132; 2.286 132; 8 0.001 103; 0.072 5; 0.250 30; 0.702 135; 1.029 144; 0.551 107; 9 0.001 37; 0.114 1; 5.028 11; 2.095 110; 2.010 124; 2.277 79; 10 0.001 27; 0.018 1; 2.466 2; 0.851 83; 1.010 88; 0.586 48; 11 0.001 9; ..... 0; ..... 0; 2.372 60; 2.677 71; 0.480 29; 12 0.001 2; ..... 0;   3.225 60; 3.159 69; 3.684 22; 13 ..... 0; ..... 0;   2.496 28; 2.318 37; 0.673 12; 14 ..... 0; ..... 0;   2.366 8; 2.021 10; 8.503 2; 15 ..... 0; ..... 0;   4.235 10; 4.235 10; 1.436 3; 16 ..... 0; ..... 0;   7.112 4; 5.719 5; 2.161 3; 17 ..... 0; ..... 0;   0.166 1; 0.166 1; ..... 0; 18 ..... 0; ..... 0;   ..... 0; ..... 0; ..... 0; (Note the large VC errors for degrees 6...9, especially 9.) ::::::::::::::::::::::::::::::::::::::::::::;;;;; (Large amount of noisy data. All methods do well, but VC surprisingly the worst, even at the median.) 1000 Cases, N = 900, S/N ratio 1.00 NoiseSD 0.432 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0018; 0.0029; 0.0040; 0.0026; 0.0025; 0.0026; SD 0.0019; 0.0054; 0.0403; 0.0951; 0.0959; 0.0909; 5pc 0.0008; 0.0013; 0.0025; 0.0011; 0.0011; 0.0011; 25pc 0.0012; 0.0024; 0.0029; 0.0017; 0.0017; 0.0017; 50pc 0.0016; 0.0030; 0.0032; 0.0023; 0.0025; 0.0023; 75pc 0.0022; 0.0035; 0.0038; 0.0032; 0.0032; 0.0032; 95pc 0.0032; 0.0044; 0.0098; 0.0052; 0.0043; 0.0051; 99pc 0.0039; 0.0054; 0.0106; 0.0069; 0.0053; 0.0069; Max 0.0054; 0.0062; 0.0119; 0.0175; 0.0091; 0.0175;  DEG avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; 0 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 1 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 2 ..... 0; ..... 0; 0.010 113; ..... 0; ..... 0; ..... 0; 3 0.003 9; 0.003 588; 0.003 882; 0.003 63; 0.003 243; 0.003 65; 4 0.003 4; 0.004 58; 0.006 1; 0.003 39; 0.003 67; 0.003 40; 5 0.002 319; 0.002 315; 0.004 4; 0.002 375; 0.002 485; 0.002 382; 6 0.002 106; 0.003 17; ..... 0; 0.002 96; 0.003 55; 0.002 95; 7 0.002 162; 0.004 18; ..... 0; 0.002 140; 0.003 90; 0.002 138; 8 0.002 262; 0.004 4; ..... 0; 0.002 90; 0.003 35; 0.002 88; 9 0.002 39; ..... 0; ..... 0; 0.003 41; 0.003 9; 0.003 41; 10 0.002 77; ..... 0; ..... 0; 0.003 56; 0.004 9; 0.003 55; 11 0.002 12; ..... 0; ..... 0; 0.004 30; 0.005 4; 0.004 28; 12 0.001 7; ..... 0; ..... 0; 0.005 15; 0.006 3; 0.005 15; 13 0.001 1; ..... 0; ..... 0; 0.004 18; ..... 0; 0.004 18; 14 ..... 0; ..... 0; ..... 0; 0.005 7; ..... 0; 0.005 7; 15 ..... 0; ..... 0; ..... 0; 0.006 12; ..... 0; 0.006 11; 16 0.001 1; ..... 0; ..... 0; 0.006 4; ..... 0; 0.006 4; 17 ..... 0; ..... 0; ..... 0; 0.006 6; ..... 0; 0.006 5; 18 0.002 1; ..... 0; ..... 0; 0.006 3; ..... 0; 0.006 3; 19 ..... 0; ..... 0; ..... 0; 0.012 3; ..... 0; 0.012 3; 20 ..... 0; ..... 0; ..... 0; 0.010 2; ..... 0; 0.010 2; (VC appears unduly reluctant to use degrees over 3.)
4.4 A Discontinuous function
This discontinuous function gives real problems for a polynomial approximation.
 if (x < 0.0) y = 0.1; else y = 2*x  1; Target mean = 0.050, SD about mean = 0.411, SD about 0 = 0.414 Minimum = 1.0000, Maximum = 1.0000 , ,' , ,' , ,' , ,' ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, , ,' , ' ,' ,' , ' ,' ,' , ,' ::::::::::::::::::::::::::::::::::::::::::::;;;;; 1000 Cases, N = 50, S/N ratio 10.00 NoiseSD 0.041 MaxD 20 MaxD(VC) 20 KEY BEST; MML; VC; FPE; SCH; GCV; AV 0.0225; 0.0482; 2.6485; 4.3638; 4.3657; 4.0354; SD 0.0115; 0.0653; 13.6189; 22.4116; 22.4113; 22.1482; 5pc 0.0091; 0.0124; 0.0142; 0.0131; 0.0131; 0.0129; 25pc 0.0147; 0.0197; 0.0241; 0.0307; 0.0307; 0.0276; 50pc 0.0203; 0.0283; 0.0468; 0.1728; 0.1751; 0.1327; 75pc 0.0277; 0.0524; 0.4985; 1.5407; 1.5407; 1.1032; 95pc 0.0431; 0.1567; 12.2793; 17.2450; 17.2450; 15.9468; 99pc 0.0667; 0.2634; 46.1423; 67.2347; 67.2347; 64.1314; Max 0.0914; 0.9774; 265.7027; 515.2273; 515.2273; 515.2273;  DEG avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; avERR CNT; 0 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 1 ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 2 0.083 3; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 3 0.068 5; ..... 0; 0.063 1; ..... 0; ..... 0; ..... 0; 4 0.053 16; ..... 0; ..... 0; ..... 0; ..... 0; ..... 0; 5 0.040 51; 0.072 34; 0.034 31; ..... 0; ..... 0; ..... 0; 6 0.035 20; 0.115 4; 0.066 4; ..... 0; ..... 0; ..... 0; 7 0.030 92; 0.034 76; 0.038 93; 0.025 5; 0.022 4; 0.026 11; 8 0.029 36; 0.033 3; 3.592 10; 8.971 4; 8.971 4; 8.971 4; 9 0.024 105; 0.057 247; 0.147 125; 0.898 18; 0.898 18; 0.519 32; 10 0.025 32; 0.067 11; 5.823 28; 9.092 20; 9.092 20; 9.608 19; 11 0.021 121; 0.044 120; 0.873 100; 1.886 45; 1.973 43; 1.115 79; 12 0.021 35; 0.050 20; 12.382 46; 14.389 52; 14.670 51; 13.100 56; 13 0.018 118; 0.047 185; 0.685 117; 1.027 115; 1.035 114; 0.869 128; 14 0.019 44; 0.062 32; 14.069 51; 22.168 64; 21.839 65; 20.454 69; 15 0.016 105; 0.034 87; 1.334 84; 1.579 129; 1.628 125; 1.378 134; 16 0.018 38; 0.044 30; 6.240 56; 8.462 64; 8.098 67; 6.298 62; 17 0.014 66; 0.045 75; 1.482 79; 1.947 124; 1.916 126; 1.851 112; 18 0.015 25; 0.052 17; 1.834 51; 3.954 82; 3.954 82; 3.613 79; 19 0.015 61; 0.038 41; 0.583 81; 0.478 158; 0.478 158; 0.523 135; 20 0.016 27; 0.043 18; 5.869 43; 3.113 120; 3.041 123; 3.967 80;
This test shows all methods other than MML making very poor choices among the available polynomials. Even VC gives an average error about 50 times worse than MML, and a median error twice as bad. VC's poor choices are not always in the models of highest degree: it gives poor average errors for all degrees above 7. MML often chose higher degrees, but managed to keep low average errors. Similar results were found on this target function for other test conditions.
5 Discussion
5.1 Comparison with CMV
Our results largely corroborate CMV regarding the comparison of VC and "classical" methods. Except when there is abundant information in the data, VC is more reliable and gives lower median and average errors. CMV claims that their implementation of the VC principle minimizes a guaranteed bound on the error which is exceeded with some specified risk. CMV does not explicitly state what level of risk is accepted in their implementation, but hints in the last paragraph that the risk is 5%, i.e., that they seek to minimize a bound on the 95th percentile of the error distribution. Our results show the VC method usually gives the lowest median (50th percentile) error of all methods, but its 95th percentile often exceeds that of MML. It seems that either the CMV implementation is designed with a 50% risk in mind, or the bound on the 5% risk derived from the VC principle is not very tight.
The CMV report does not show average errors for the methods compared, only the 5, 25, 50, 75 and 95% percentiles. Their failure to compute average errors conceals the tendency of VC to give a small number of extremely large errors. The CMV statement that "... the best worstcase estimates generally imply the best averagecase estimates" (of error) cannot be supported.
5.2 The MML method
It may be thought that the obvious superiority of MML over a wide range of target functions and test conditions is due to its being a Bayesian method, and as such, (unfairly?) using knowledge not available in the data. This is a misconception.
In fact, the implementation of MML tested here was designed to assume as little 'prior' information as possible. The 'prior' assumed for the parameters of the polynomial model is not actually a genuine prior: the scale assumed for the parameter distributions is determined by the variance V of the data yvalues, and so is determined by the data themselves, not any prior beliefs. The only genuinely prior beliefs used in the MML method are the assumptions that (a) all degrees are equally likely, and (b) components of the explained variance (i.e. the coefficients) and the unexplained variance are all likely to be of similar size. These beliefs are misleading rather than helpful with respect to the tests reported here.
In the majority of tests, the best polynomial is of modest degree, and more importantly, the coefficients of the best polynomial decrease in size with increasing power of x. Only in the last, discontinuous, target function is the spectrum of coefficients anything like uniform.
MML does use information not used by the other methods. It uses the variance of the observed yvalues (as a natural scale for encoding estimates of the parameters of the model) and the determinant of the covariance matrix M of the given xvalues (as indicating the precision to which the MML message should state the estimated parameters). This information is inherent in the data, not imported from any prior knowledge of what the best approximating polynomial is likely to be.
If, as our results strongly suggest, use of these features of the given data improves the quality of the approximating model, there seems no reason for refusing to use them. Indeed, use of an inference principle which ignores them seems almost perverse.
There is an argument in favour of using only the information from the errors produced by each of the candidate polynomial models. It can be argued that, if a large, loworder polynomial (e.g. a linear function of x) is added to the target, keeping the absolute noise variance unchanged, then the estimates of the existence and coefficients of higherorder polynomial components should be unaffected. In other words, if much of the variance of the given data can be explained by some loworder polynomial, our estimates of the higherorder components should be based only on what is left unexplained. The classical and VC methods have this "additive" behaviour, whereas MML does not. In the present implementation, the prior used in coding the coefficient of a highorder polynomial component is independent of the values estimated for lowerorder components. To the extent that this prior affects the messagelength cost of including and coding a highorder coefficient, the MML choice of whether or not to choose the higher order will be affected by the size of the lowerorder components of the target function, which contribute to the data variance and hence to the prior used for higher orders.
The MML implementation could easily be modified to restore the "additive" property if this argument is found convincing. The prior used for the coefficient of order 0 could be based on the observed data variance about zero, as at present, but the prior used for the order1 component could then be based on the residual variance of the data about the estimated order0 model, and so on. That is, the prior density assumed for the coefficient a_{j} would be conditional on the values, not only of the observed yvalue variance V, but also on the estimated values of the coefficients of degree less than j. The prior assumed for the unexplained variance v in a model of degree d would be conditional on V and all estimated coefficients. However, we doubt if such a modification would much affect the results on the tests reported here. If anything, the MML performance should be improved, since the above conditional prior on the parameters is in closer accord with with the parameters of the best models than is the independant prior form assumed in this work.
6 Acknowledgements
This work was assisted by an Australian Research Council grant A49330662 and with facilities and assistance provided by Royal Holloway College.
7. References
 [1] V. Cherkassky, F. Mulier and V. Vapnik,
Comparison of VCmethod with classical methods for model selection,
Proc. 1996 World Congress on Neural Networks.
 [2] C. S. Wallace and P. R. Freeman,
Estimation and inference by compact coding,
J. of the Royal Statistical Society series B,
49(3), pp.240265, 1987.
 [3] ??? but there is a bibliography of all
Chris Wallace's publications
[here].
 [4] H. Akaike, Statistical predictor information,
Annals of the Inst. of Statistical Mathematics, 22, pp.203217, 1970.
 [5] G. Schwarz, Estimating the dimension of a model,
Annals of Statistics, 6(2), pp.461464, 1978.
 [6] P. Craven and G. Wahba,
Smoothing noisy data with spline functions,
Numerische Mathematik, 31, pp.377403, 1979.
 [7] V. Vapnik, Statistical Learning Theory, Springer Verlag, 1995.