POlyfit

Curve and Polygon Fitting.

Curve fitting: Given a sequence of `n' points, [<x₀,y₀>, ..., <x_n-1,y_n-1>], what is the optimal piece-wise linear curve that can be fitted to the points? We want "optimal" to mean "most probable" given the data, i.e. having the largest posterior probability. We must therefore formulate a model for the data and the noise / measurement inaccuracy in the data.

Polygon Fitting: The sequence of points is considered to be closed, circular, i.e. <x_n,y_n> = <x₀,y₀>, and we fit a polygon to them.

L
.
A
l
l
i
s
o
n

The data might, for example, come from digitising the boundary of an object in an image. Error can be introduced by original photography, scanning, identification of the boundary, and sampling of points on the boundary.

Noise or measurement error in the data is assumed to be due to some random process, but it can be modelled. e.g. We might assume that if a point "really" comes from a certain line segment then the error perpendicular to the line is modelled by a normal distribution, say.

Hypothesis complexity should clearly be taken into account because as the number of line segments increases towards n-1 it is possible to fit the data with zero error and, without care, this could lead to over-fitting. Minimum message length (MML) inference takes the hypothesis, H, into account with a two-part message: msgLen(H&D) = msgLen(H) + msgLen(D | H). A complex hypothesis with many segments might fit D very well, i.e. P(D | H) is high and msgLen(D | H) is low, but it must also pay for its own cost, i.e. msgLen(H).

Related problems are:

Use of basis functions other than straight lines, e.g.
- Fourier components, as in Patrick (1979), Patrick & Wallace (1982).
Point-sets: we are given an unordered set of points to be described by one or more straight lines etc..
Function approximation: here x_i < x_i+1, and we usually consider the x_i to be exact and only the y_i to contain possible noise or error. e.g.
- Segmentation: fitting piece-wise constant functions
- Approximation by piece-wise straight lines or other basis curves etc.

MML / MDL: Patrick (1979) used Fourier components to describe megalithic stone circles. He used MML to come to the conclusion that they really are rough circles rather than more elaborate geometries that had been proposed, and still are proposed regularly. Georgeff and Wallace (1983, 1984) described a minimum message length criterion for fitting one or more (straight) line segments through a set of data points. Banerjee et al (1996) gave an MDL method for polygon fitting and the Java applet below implements their algorithm. They made the simplifying assumptions that (i) the polygon's vertices (knots) are a subset of the given data points, (ii) the first data point is a vertex, and thereby achieved an O(n²)-time algorithm.

- L. Allison

Notes

W. D. Fisher. On Grouping for Maximum Homegeneity. Jrnl. Am. Stat. Soc. 53 pp.789-798, 1958
Gave an O(s.n²)-time algorithm for finding the optimum (minimum sum of squared errors) segmentations of a sequence of n points into 1, 2, ..., s segments. Did not have a criterion for stopping, i.e. for choosing the best `s'.
J. D. Patrick. An Information Measure Comparative Analysis of Megalithic Geometries. (i.e. stone-circles) Ph.D Thesis, Monash University, 1979
It is a good read. See also ...
J. D. Patrick & C. S. Wallace. Stone Circle Geometries: An Information Theory Approach. In Archaeoastronomy in the Old World, D. Heggie (ed), C.U.P., 1982
C. S. Wallace & M. P. Georgeff. A General Objective for Inductive Inference. TR#32, Dept. Computer Science, Monash University, Australia 3800, March 1983.
[HTML] Fuller version of Georgeff and Wallace (1984). Includes PFSA's.
M. P. Georgeff & C. S. Wallace. A General Selection Criterion for Inductive Inference. European Conf. on Artificial Intelligence, Pisa, pp.473-482, Sept. 1984
Presented minimum message/description length (MML | MDL) methods for the inference of probabilistic finite state automata (PFSA), fitting curves through points, and fitting straight lines through points. See also ...
M. P. Georgeff & C. S. Wallace. A General Selection Criterion for Inductive Inference. TR#44, Dept. Computer Science (later School of Comp. Sci. and Software Eng.), Monash University, Australia 3800, June 1984
S. Itoh. An Algorithm for the Piecewise Linear Approximation of Planar Curves. Proc. IEEE Symp. on Information Theory (ISIT), pp.62, 1990
S. Banerjee, W. Niblack & M. Flickner. A Minimum Description Length Polygon Approximation method. IBM Research report RJ 10007 (89096) pp.19 Dec' 1996
The authors credit Itoh's work (1990) as being "of particular relevance" to their paper.
L. J. Fitzgibbon, L. Allison & D. L. Dowe. Minimum message length grouping of ordered data. Proc. 11th Int. Workshop on Algorithmic Learning Theory, H. Arimura, S. Jain & A. Sharma (eds), pp.56-70, Sydney, Springer Verlag, LNCS, Dec' 2000.
Uses an MML cost criterion to give a stopping condition to Fisher's (1958) algorithm and hence find an optimal segmentation of multivariate data. Means and variances can differ from segment to segment.

(NB. Needs Java on)

Draw shapes in the window (hold the left mouse button down). The faster and slower buttons vary the sampling rate. Closed or open curves can be drawn. The MDL button fits a polygon through the last shape drawn using the MDL fitting method.

NEEDS Sun MicroSystem's JAVA ON!

precomputed example.

[example].