Change-points and segmented regression: a brief introduction
Marian Scott, NERC workshop, University of Glasgow, December 2013
What shall we cover?
• What do we mean by a changepoint?
• Segmented or dog leg regression
• Non-parametric changepoint analysis
• Other resources
What is a changepoint?
By a changepoint we mean a point in time, or a value of an explanatory variable, before and after which some or all model parameters might change. The change can be a shift in the mean, in the variance (or both), or in some other parameter such as the slope of a simple regression. The changepoint may be known or unknown, and there may be more than one.
Changepoints can occur for a variety of reasons, including:
• A change in the equipment
• A new regulation
• A change in emissions
A non-linear regression model
• A change-point model can be considered as a special non-linear model.
• A change-point can be defined as a point that separates a series of observations into two groups, each following a different model.
• The retrospective change-point problem occurs when all the data have already been collected: a hypothesis testing and estimation problem.
• The sequential change-point problem is more typical in industrial process control: has the system gone beyond its design tolerances?
What is a change-point?
• The effects of the change-point(s) may be seen as a change in trend (gradual or abrupt change in the mean), as discontinuities, as a change in the variation (heteroscedastic vs homoscedastic), or as changes in model parameters (e.g. single vs two-phase regression).
• An important defining characteristic is whether the location of the change-point is known.
• Is the problem about inference (is there a change-point?) or about estimation (where is the change-point?)
What is a "change-point"? What kind of "change"? Three simple examples: 1) Step down (up) of the mean level. 2) Temporary increase (decrease) of the mean level. 3) A change in a model parameter, e.g. the slope.
The hypothesis testing approach • a change in mean at unknown or known time • if the time is known, test the hypothesis that the two mean levels are the same (a two-sample t-test, but what if the data are correlated?) • if the time is unknown, it needs to be estimated as well, which is more difficult. Often a sequential testing approach is used.
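For the unknown-time case, one simple way to estimate the changepoint is to try every possible split and keep the one with the smallest residual sum of squares. The following is a minimal sketch on simulated data, assuming independent observations and a single change in mean; the series, the helper `rss_split` and the range of candidate splits are illustrative choices, not part of the original slides.

```r
# Least-squares estimate of a single change in mean (illustrative sketch).
set.seed(1)
y <- c(rnorm(30, mean = 0, sd = 1), rnorm(30, mean = 3, sd = 1))  # simulated series
n <- length(y)

# RSS of a two-mean model with the split placed after observation k
rss_split <- function(k) {
  first  <- y[1:k]
  second <- y[(k + 1):n]
  sum((first - mean(first))^2) + sum((second - mean(second))^2)
}

candidates <- 2:(n - 2)              # keep at least two points in each segment
rss <- sapply(candidates, rss_split)
k_hat <- candidates[which.min(rss)]  # estimated last index of the first segment
k_hat
```

The estimate only says where a split fits best; whether the change is real still needs a test that allows for the search over splits.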
two phase or dog-leg regression • a change in model parameters- at unknown or known time • if known time, then two straight lines, slope changes at time t0 – relatively straightforward • if unknown time, then needs to be estimated as well – more difficult, iterative again
The nature of the problem: the statistical model $Y_i = f(x_i) + \varepsilon_i$ • is f known parametrically or non-parametrically? • are there single or multiple change-points? • are they known or unknown?
How hard is the simple problem of a change in mean? • here we imagine a series with two mean levels • 20 observations $N(10, 2^2)$ and 20 observations $N(20, 2^2)$ • our ability to detect a change depends on the size of the change and the variability in the series
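The series described above can be simulated to see how detectability depends on variability. This sketch assumes independent normal observations and a known change time after observation 20; the noisier comparison series (standard deviation 15) is an illustrative choice that is not on the slide.

```r
set.seed(42)
# Two segments of 20 observations each, as on the slide: N(10, 2^2) then N(20, 2^2)
y <- c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 20, sd = 2))
segment <- factor(rep(c("before", "after"), each = 20),
                  levels = c("before", "after"))

# Known change time: Welch two-sample t-test (assumes independent observations)
test_small_sd <- t.test(y ~ segment)

# Same shift in mean, much larger variability (sd = 15 is an assumed value)
y_noisy <- c(rnorm(20, mean = 10, sd = 15), rnorm(20, mean = 20, sd = 15))
test_large_sd <- t.test(y_noisy ~ segment)

# p-values for the two scenarios
c(small_sd = test_small_sd$p.value, large_sd = test_large_sd$p.value)
```

With the small standard deviation the shift of 10 units is unmistakable; with the large one the same shift is much harder to separate from noise.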
the nature of the problem • quite extensive literature for time series analysis, sometimes known as intervention analysis • change-point or discontinuity analysis also developed for spatial problems- edge detection in image analysis
The bent cable (Grace Chiu): the model does not need to be a linear regression, and it can be extended to have a transition phase between the two regressions.
Segmented or dog leg regression. Type 1: a known change-point • Writing the model • A constraint to make the regressions meet at the changepoint • Using the lm command
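Segmented or dog leg regression
Data $(y_i, x_i)$, $i = 1, \dots, n$, with known changepoint $c$:
$Y_i = \alpha_1 + \beta_1 x_i$ for $x_i \le c$
$Y_i = \alpha_2 + \beta_2 x_i$ for $x_i > c$
To be continuous at $c$:
$\alpha_1 + \beta_1 c = \alpha_2 + \beta_2 c$, or $\alpha_2 = \alpha_1 + c(\beta_1 - \beta_2)$
To give:
$Y_i = \alpha_1 + \beta_1 x_i$ for $x_i \le c$
$Y_i = \alpha_1 + c(\beta_1 - \beta_2) + \beta_2 x_i$ for $x_i > c$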
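The constrained model can also be written as a single equation, which is the form a linear-model fit works with. This is a sketch under the assumption that $\alpha_1, \alpha_2$ denote the two intercepts, $\beta_1, \beta_2$ the two slopes and $c$ the known changepoint; the notation $(u)_+$ is introduced here for illustration.

```latex
\begin{aligned}
Y_i &= \alpha_1 + \beta_1 x_i + (\beta_2 - \beta_1)\,(x_i - c)_+ ,
      \qquad (u)_+ = \max(u, 0), \\
x_i \le c:&\quad Y_i = \alpha_1 + \beta_1 x_i , \\
x_i > c:&\quad Y_i = \alpha_1 + \beta_1 x_i + (\beta_2 - \beta_1)(x_i - c)
         = \alpha_1 + c(\beta_1 - \beta_2) + \beta_2 x_i .
\end{aligned}
```

The three free parameters are therefore $\alpha_1$, $\beta_1$ and the change in slope $\beta_2 - \beta_1$.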
Segmented or dog leg regression
With a known c, this is simply a linear model problem.
How to fit in R: create an indicator variable z and a shifted variable zi, then fit with lm:
    z <- ifelse(x > c, 1, 0)
    zi <- (x - c) * z
    Modeldog <- lm(y ~ x + zi)
We can then use all the resources of the lm command to investigate model properties.
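A self-contained version of this fit on simulated data might look as follows. The data, the changepoint value and the names `cp`, `model_dog` and `slope_after` are assumptions made for illustration; `cp` is used instead of `c` only to avoid masking R's `c()` function.

```r
set.seed(123)
cp <- 5                                   # known changepoint (called c on the slide)
x  <- seq(0, 10, length.out = 60)
# Simulated truth: slope 1 up to cp, slope 3 afterwards, continuous at cp
y  <- 2 + 1 * x + 2 * pmax(x - cp, 0) + rnorm(length(x), sd = 0.5)

z  <- ifelse(x > cp, 1, 0)                # indicator for x beyond the changepoint
zi <- (x - cp) * z                        # zero up to cp, then x - cp
model_dog <- lm(y ~ x + zi)

coef(model_dog)                           # intercept, first slope, change in slope
# Slope of the second segment = first slope + change in slope
slope_after <- unname(coef(model_dog)["x"] + coef(model_dog)["zi"])
slope_after
```

The coefficient on `zi` is the change in slope at the changepoint, so its t-test in `summary(model_dog)` is a direct test of whether the slope changes.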
The model definition • The two linear parts are guaranteed to meet at c. Notice that this model uses only three parameters in contrast to the four parameters used in the two separate regression models. A parameter has been saved by insisting on the continuity of the fit at c.
The model definition • We can have more than one knotpoint simply by defining more pairs of basis functions with different knotpoints. • Broken stick regression is sometimes called segmented regression. Allowing the knotpoints to be parameters is worth considering but this will result in a nonlinear model.
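With more than one known knotpoint, the same idea extends by adding one term per knot. The sketch below uses a single hinge function $(x - k)_+$ per knot rather than pairs of basis functions, which is an equivalent but differently parameterised choice; the knot locations, the data and the helper `hinge` are hypothetical.

```r
set.seed(7)
knots <- c(3, 7)                          # two assumed, known knotpoints
x <- seq(0, 10, length.out = 90)
y <- 1 + 0.5 * x + 1.5 * pmax(x - 3, 0) - 2.5 * pmax(x - 7, 0) +
  rnorm(length(x), sd = 0.4)

# One hinge basis function (x - k)_+ per knot
hinge <- function(x, k) pmax(x - k, 0)
fit_two_knots <- lm(y ~ x + hinge(x, knots[1]) + hinge(x, knots[2]))

# Intercept, initial slope, and the change in slope at each knot
coef(fit_two_knots)
```

Because the knots are fixed, the model stays linear in its parameters; it becomes nonlinear only when the knot positions themselves are estimated.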
Model testing
• Choosing between a model with a single straight line and the broken stick model
• When the change-point is known, the test between the two regression models is an F-test based on the residual sums of squares
• $F = (RSS_1 - RSS_2) / (RSS_2 / (T - 3))$
• $RSS_i$ is the residual sum of squares of the $i$-line model; $T$ is the number of observations
• $F \sim F(1, T - 3)$ under the single-line model
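The F statistic above can be computed by hand and checked against R's `anova` comparison of nested linear models. This is a sketch on simulated data with an assumed known changepoint `cp`; the variable names are illustrative.

```r
set.seed(10)
cp <- 5                                    # assumed known changepoint
x <- seq(0, 10, length.out = 50)
y <- 1 + 0.5 * x + 1.5 * pmax(x - cp, 0) + rnorm(length(x), sd = 0.5)

fit1 <- lm(y ~ x)                          # one-line model
fit2 <- lm(y ~ x + pmax(x - cp, 0))        # two-line (broken stick) model

n_obs <- length(y)                         # T on the slide
rss1 <- sum(resid(fit1)^2)
rss2 <- sum(resid(fit2)^2)
f_stat <- (rss1 - rss2) / (rss2 / (n_obs - 3))
p_val  <- pf(f_stat, df1 = 1, df2 = n_obs - 3, lower.tail = FALSE)

anova(fit1, fit2)                          # built-in comparison of the same two models
c(F = f_stat, p = p_val)
```

The hand-computed statistic and the `anova` table agree because the broken stick model has exactly one more parameter than the single line.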
Unknown time
• the problem is no longer linear: an extra parameter, the changepoint, is now needed
• the solution is numerical, not analytical
• can follow a simple algorithm (Julious, 2001)
• $RSS_i$ is the residual sum of squares of the $i$-line model; $T$ is the number of observations
• $F = 0.5 \times (RSS_1 - RSS_2) / (RSS_2 / (T - 4))$
• $F \sim F(2, T - 4)$ approximately (but this may not be a particularly good approximation)
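A numerical solution can be sketched as a grid search: fit the broken stick model at each candidate changepoint and keep the one with the smallest residual sum of squares. This is a generic profile search on simulated data, not necessarily the algorithm in Julious (2001); the grid, the data and the helper `rss_at` are assumptions.

```r
set.seed(11)
x <- seq(0, 10, length.out = 80)
# Simulated truth: changepoint at x = 6
y <- 1 + 0.5 * x + 1.5 * pmax(x - 6, 0) + rnorm(length(x), sd = 0.5)

# Profile the residual sum of squares over candidate changepoints
grid <- seq(1, 9, by = 0.05)               # interior of the x range (illustrative)
rss_at <- function(cp) sum(resid(lm(y ~ x + pmax(x - cp, 0)))^2)
rss_grid <- sapply(grid, rss_at)
cp_hat <- grid[which.min(rss_grid)]

# Approximate F statistic from the slide: two extra parameters
# (the change in slope and the changepoint itself)
n_obs <- length(y)
rss1 <- sum(resid(lm(y ~ x))^2)
rss2 <- min(rss_grid)
f_stat <- 0.5 * (rss1 - rss2) / (rss2 / (n_obs - 4))
p_approx <- pf(f_stat, 2, n_obs - 4, lower.tail = FALSE)
c(cp_hat = cp_hat, F = f_stat, p = p_approx)
```

The p-value inherits the caveat on the slide: the F(2, T-4) reference distribution is only an approximation when the changepoint has been estimated.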
Segmented regression
Within R, the segmented package fits linear models with an unknown changepoint:
    linmod <- lm(y ~ x)
    Segmented.mod <- segmented(linmod, seg.Z = ~x, psi = ???)
psi is the starting value of the breakpoint.
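A complete call might look like the sketch below. It assumes the segmented package is installed (the code is skipped otherwise), uses simulated data with a true breakpoint at 6, and takes `psi = 5` as an arbitrary illustrative starting value.

```r
if (requireNamespace("segmented", quietly = TRUE)) {
  set.seed(12)
  x <- seq(0, 10, length.out = 80)
  y <- 1 + 0.5 * x + 1.5 * pmax(x - 6, 0) + rnorm(length(x), sd = 0.5)
  dat <- data.frame(x = x, y = y)

  linmod <- lm(y ~ x, data = dat)          # starting linear model
  # psi = 5 is an assumed starting value for the breakpoint search
  seg_mod <- segmented::segmented(linmod, seg.Z = ~x, psi = 5)

  print(seg_mod$psi)                       # starting value, estimate, standard error
  print(segmented::slope(seg_mod))         # slopes of the fitted segments
}
```

A poor starting value can lead the iterative search to a local solution, so it is worth trying a few values of `psi` and comparing the fits.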
Smooths and discontinuities
One of the most common smoothers is local linear regression, which consists of solving the least squares problem
$\min_{\alpha, \beta} \sum_{i=1}^{n} \{y_i - \alpha - \beta (x_i - x)\}^2 \, w(x_i - x; h)$
where $w(x_i - x; h)$ is called the kernel function, generally a smooth positive function which peaks at 0 and decreases monotonically as $|x_i - x|$ increases in size, and $h$ is the smoothing parameter. Another widely used smoother is LOESS, which computes a locally weighted straight-line smooth using the k nearest neighbours.
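Both smoothers can be tried in a few lines of base R. The sketch below writes local linear regression as a kernel-weighted least squares fit at each evaluation point, assuming a normal kernel with standard deviation `h`; the helper `local_linear`, the simulated data, and the values of `h` and `span` are illustrative assumptions.

```r
# Local linear estimate at a single point x0 with a normal kernel of bandwidth h
local_linear <- function(x0, x, y, h) {
  w <- dnorm(x - x0, sd = h)                 # kernel weights w(x_i - x0; h)
  fit <- lm(y ~ I(x - x0), weights = w)      # weighted least squares line centred at x0
  unname(coef(fit)[1])                       # intercept = fitted value at x0
}

set.seed(3)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)
x_grid <- seq(0.5, 9.5, by = 0.5)
smooth_ll <- sapply(x_grid, local_linear, x = x, y = y, h = 0.5)

# LOESS: locally weighted straight line using a nearest-neighbour span
smooth_lo <- predict(loess(y ~ x, span = 0.3, degree = 1),
                     newdata = data.frame(x = x_grid))

head(cbind(x = x_grid, local_linear = smooth_ll, loess = smooth_lo))
```

The main difference is how the neighbourhood is set: a fixed bandwidth `h` for the kernel smoother, and a fixed proportion of nearest neighbours (`span`) for LOESS.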
Theory of the Discontinuity Test
The test used is based on one proposed by Hall & Titterington (1992). At each data point $x_1, x_2, \dots, x_n$ we observe the data $y_1, y_2, \dots, y_n$, where $y_i = g(x_i) + e_i$ for $i = 1, \dots, n$ and $g$ is a smooth function. The test statistic is based on the difference between the left smooth $g_l(x_i)$ and the right smooth $g_r(x_i)$, which are built from the weights $w_j$ given by the local linear smoother and the indicator function $I\{\}$. The criterion for detection is $|g_l(x_i) - g_r(x_i)| > 3$ standard errors.
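The idea of comparing left and right smooths can be illustrated with one-sided local linear fits. This is only a rough sketch of the principle, not the Hall & Titterington statistic: it assumes a normal kernel, uses a hypothetical helper `one_sided`, and reports the raw gap without the standard errors needed for the 3 standard error criterion.

```r
# One-sided local linear estimate at x0, using only data to the left or right
one_sided <- function(x0, x, y, h, side = c("left", "right")) {
  side <- match.arg(side)
  keep <- if (side == "left") x < x0 else x > x0   # indicator I{.}
  w <- dnorm(x - x0, sd = h) * keep                # kernel weights on one side only
  fit <- lm(y ~ I(x - x0), weights = w)
  unname(coef(fit)[1])                             # fitted value at x0
}

set.seed(5)
x <- sort(runif(200, 0, 10))
y <- sin(x / 2) + 2 * (x > 5) + rnorm(200, sd = 0.3)   # simulated jump of size 2 at x = 5

x_grid  <- seq(1, 9, by = 0.1)
g_left  <- sapply(x_grid, one_sided, x = x, y = y, h = 0.7, side = "left")
g_right <- sapply(x_grid, one_sided, x = x, y = y, h = 0.7, side = "right")

gap <- abs(g_left - g_right)
x_grid[which.max(gap)]          # location of the largest left/right disagreement
```

Where the underlying function is smooth the two estimates agree up to noise; near a jump they disagree by roughly the size of the jump.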
Discontinuity test statistic. The examples that follow are of SO2 at various monitoring stations around Europe (EMEP).
First type of discontinuity: first example (figure)
Second type of discontinuity: second example (figure: SO4 in precipitation, not corrected)
sm.discontinuity • The sm package in R includes a command, sm.discontinuity, to carry out the non-parametric discontinuity test that was described earlier. • It does not assume that we know whether a discontinuity exists or where it occurs.
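A call to this command might look like the sketch below. It assumes the sm package is installed (the code is skipped otherwise), uses R's built-in `Nile` series as example data, leaves the smoothing parameter at the package default, and suppresses plotting with `display = "none"`; the component name `p` for the p-value is taken from my reading of the package documentation and should be checked against `?sm.discontinuity`.

```r
if (requireNamespace("sm", quietly = TRUE)) {
  year <- as.numeric(time(Nile))           # built-in annual Nile flow series
  flow <- as.numeric(Nile)

  # Non-parametric test for the presence of a discontinuity in the regression curve
  disc <- sm::sm.discontinuity(year, flow, display = "none")

  print(names(disc))                       # components returned by the test
  print(disc$p)                            # assumed name of the p-value component
}
```

Dropping `display = "none"` produces the package's plot, which is usually the quickest way to see where the evidence for a discontinuity lies.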
Example: River Nile flow
• These data record historical data on the water level of the River Nile.
• The variables are:
• Volume: annual volume of the Nile River (discharge at Aswan, $10^8$ m$^3$)
• Year: 1871 to 1970
• Cobb, G. (1978). The problem of the Nile: conditional solution to a change-point problem. Biometrika 65, 243-25
Example: The river Nile data • Volume of the river for approx 100 year period. • is there evidence of a change? • if yes, when and in what way?
a non-parametric model for the Nile • a smooth function (LOESS) or non-parametric regression model • OK? • any suggestion that there may be a change-point?
An alternative model for the Nile • two smooth sections, broken at roughly 1900. • different mean levels in the two periods • so modelling the two periods separately
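The two descriptions of the Nile series can be compared using R's built-in `Nile` dataset (annual flow at Aswan, 1871 to 1970). This sketch simplifies the alternative model to two constant mean levels rather than two smooth sections, and takes the break at 1900 from the slide; the `span` value is an illustrative choice.

```r
year <- as.numeric(time(Nile))             # 1871 to 1970
flow <- as.numeric(Nile)

# One smooth curve through the whole series
smooth_all <- loess(flow ~ year, span = 0.5)

# Alternative: separate mean levels before and after a break at 1900
period <- factor(ifelse(year <= 1900, "early", "late"))
fit_means <- lm(flow ~ period)
period_means <- tapply(flow, period, mean)
period_means

# Residual sums of squares of the two descriptions
c(loess = sum(resid(smooth_all)^2), two_means = sum(resid(fit_means)^2))
```

A single smooth curve has to bend steeply to follow an abrupt drop, which is one informal sign that a model with a break may describe the data more simply.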
References
• Julious S. (2001). Inference and estimation in a changepoint regression problem. JRSS (D), 50(1), 51-61.
• Toms J., Lesperance M. (2003). Piecewise regression: a tool for identifying ecological thresholds. Ecology 84(8), 2034-2041.
• Kamenos N. (2010). North Atlantic summers have warmed more than winters since 1353 and the response of marine zooplankton. PNAS.
• Bowman, Pope and Ismail (2006)