Least-Squares Segmentation of a Series

Given a series, A[1], A[2], ..., A[N], and a number, G<=N, what is the optimal way to partition the series into G segments (i.e. groups, clusters, ...):

A[1..P1], A[P1+1..P2], ..., A[PG-1+1..N]
The aim is to minimise the sum of squared differences (SSD) of values from the means of their respective segments, summed over all of the segments.

SSDi..j = Σk=i..j (A[k] - meani..j)2

= sumSqi..j - (sumi..j)2/(j-i+1)

where meani..j = sumi..j/(j-i+1)

and sumi..j = (Σk=i..j A[k])

and sumSqi..j = (Σk=i..j A[k]2)
- see [μ&σ].
Now note that
sumi..j = sum1..j - sum1..i-1 and that

sumSqi..j = sumSq1..j - sumSq1..i-1
We can compute sum1..i and sumSq1..i, for all i, in O(N)-time.

Now treat SSDi+1..j like an "edge-weight" or a distance, d(i,j) if j>i, in a graph having vertices 0, 1, 2, ..., N, and edges <i,j> for j>i. Seek a path of exactly G edges (i.e. segments) from vertex 0 to vertex N. A dynamic programming approach can be used (Fisher 1958). Let Pk[j] be the minimum total SSD achievable for A[1..j] using exactly k segments.

P1[j] = SSD1..j, for j=1..N

Pk[j] = MINi<j{Pk-1[i] + SSDi+1..j}, for k = 2..G
This suggests an O(G*N2)-time algorithm.

Choosing the Number of Segments.

There are   N-1CG   segmentations of [1..N] into G segments. This gives 2N-1 possible segmentations in total if the maximum allowed number of segments equals N. Least-squares segmentation does not indicate how to choose an optimal value for G. As G increases towards N, the lowest achievable total SSD must decrease towards zero because d(i-1,i) = SSDi..i = 0. Some other technique must be used to decide at what value to stop G. Various statistical techniques can be used including some based on information-theory.

Having a large number of segments is clearly a more complex hypothesis than having a small number of segments. This can be made explicit by introducing a cost for the hypothesis, i.e. the cost of stating G, P1, P2, ..., PG-1, and the mean of each segment. Minimum Message Length [MML] is a general machine-learning technique that takes this approach.

If the total cost of stating a segment can be attributed to the segment (and consequently d(i,j)>0 for all i, j where i<j) then, for example, a variation on Dijkstra's algorithm for single-source, shortest paths can be used to find an optimal segmentation (over all possible values of G) in O(N2)-time.

Notes

© L.A. 1999