R-programming on Linear Time Series
Hung Chen
Department of Mathematics, National Taiwan University
12/10/2002
Outline
• Overview
  • Build a probability model for the data.
  • Decomposition of signal and noise
    • a trend component
    • a seasonal component
    • a residual (stationary process)
• Stationary process
  • ARMA model
• Data Analysis
  • Box-Jenkins approach
  • Model Selection Criteria
  • Differencing
  • Nonstationary models
Overview of time series analysis
• The goal in analyzing a particular time series is to
  • make forecasts;
  • understand the underlying mechanism.
• Start by building a probability model for the data.
• When do we stop the model-building process?
  • The general end result of probability modeling is a method for "reducing" the series to some kind of standard "random noise".
  • The point is that once we have such "noise", we have extracted all the useful information.
• For forecasting, the utility of the "reduction to random noise" notion is that "noise" cannot be predicted except with a probability statement, usually in the form of a so-called prediction interval.
  • We can then reverse the "reduction to random noise" procedure to obtain a prediction interval for the original series.
• In regression analysis, what is the noise?
  • Recall the assumptions on the noise.
  • In regression, how do we reverse the "reduction to random noise" procedure to obtain a prediction interval?
Overview of time series analysis
• To understand the mechanism that generates the series, the notion is that "noise" is not understandable, so all the useful information is in the mechanism by which the process is reduced to noise.
• Issues:
  • What tools are available to develop the "reduction to noise"?
  • How do we recognize "noise" when we see it?
• Three typical steps in the "reduction-to-noise" process:
  • a data transformation, such as taking logarithms of the data;
  • removing seasonality and trend to obtain a stationary process;
  • fitting a standard time series model (generally a so-called autoregressive moving average, or ARMA, model).
• The "reduction-to-noise" procedure does not always proceed in a linear fashion (transformation, then removal of seasonality and trend, then an ARMA fit). One usually jumps around from one attempt to another at each of the three components, assessing the results of each attempt, backing up, and jumping ahead. A rough illustration of the three steps in R follows.
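A minimal sketch of the three steps on the AirPassengers series shipped with R (the log transformation, the moving-average decomposition, and the ARMA(1,1) order are illustrative assumptions, not a prescription from these slides):

    # Step 1: data transformation (log stabilizes the growing variance)
    data(AirPassengers)
    x <- log(AirPassengers)
    # Step 2: remove trend and seasonality to get a (roughly) stationary residual
    dec <- decompose(x)
    res <- na.omit(x - dec$trend - dec$seasonal)  # moving-average trend leaves NAs at the ends
    # Step 3: fit a standard ARMA model to the residual
    fit <- arima(res, order = c(1, 0, 1))

In practice one would inspect the residual series and its ACF after step 2 before committing to any particular ARMA order in step 3.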
Initial data analysis
• The first step in analyzing any data set is to plot it.
• For time series, the first plot we generally look at is the "time series plot": a plot of the observed series xt against time t.
• Consider three time series.
  • data(nhtemp): the mean annual temperature, in degrees Fahrenheit, in New Haven, Connecticut, from 1912 to 1971.
  • data(lh): a series of 48 observations, at 10-minute intervals, of luteinizing hormone levels in blood samples from a human female.
  • data(nottem): a time series object containing average air temperatures at Nottingham Castle, in degrees Fahrenheit, for the 20 years 1920-1939.
• Load the package ts first, then load the three data sets.
• Produce a time series plot, i.e. plot the observed series xt as a function of time t:
  • plot(nhtemp, main = "nhtemp data", ylab = "Mean annual temperature in New Haven, CT (deg. F)")
  • plot(lh, main = "lh data", ylab = "Hormone level", xlab = "Time in 10-minute intervals")
Data analysis
• plot(nottem, main = "nottem data", ylab = "Average monthly temperature at Nottingham Castle (deg. F)")
• plot.ts: plotting time-series objects.
• Do you notice that some of these series exhibit a marked "periodicity", or seasonality?
  • Basically, within a given year the maximum of the series occurs in the summer and the minimum in the winter.
  • This is perhaps not so clear in the plots of the whole series, as the "cycles" are rather squeezed.
  • Show a "blow up" of a small number of years within each series.
• We know the maximum temperature occurs in the summer, so the seasonality of a temperature series is not unexpected.
• Some economic series also show seasonality. Can you relate it to the weather as well? Consider housing sales.
• The seasonal effect can also be demonstrated by computing averages across years within months.
  • For the temperature series nottem there are 20 years of data, so there are 20 January temperatures. We average these 20 January temperatures to estimate the overall temperature in January.
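Collected into one runnable chunk (a sketch; the four-year window chosen for the "blow up" is arbitrary):

    # library(ts)  # needed only in old R versions; these functions are now in the default stats package
    data(nhtemp); data(lh); data(nottem)
    plot(nhtemp, main = "nhtemp data",
         ylab = "Mean annual temperature in New Haven, CT (deg. F)")
    plot(lh, main = "lh data", ylab = "Hormone level",
         xlab = "Time in 10-minute intervals")
    plot(nottem, main = "nottem data",
         ylab = "Average monthly temperature at Nottingham Castle (deg. F)")
    # "blow up" of a few years so the annual cycles are visible
    plot(window(nottem, start = c(1930, 1), end = c(1933, 12)))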
Further analysis on nottem data
• Removing the seasonal component by subtracting seasonal means.
• The approach demonstrated here is to decompose the series into a trend, a seasonal component, and a residual (stationary process).
• Many business and economic time series contain a seasonal phenomenon that repeats itself after a regular period of time.
• As a first cut at removing the seasonality of the nottem data, we compute average temperatures across years within a month. This is done below and the results are plotted.
• Computing within-month averages and subtracting:
  • nottem.matrix <- matrix(nottem, ncol = 12, byrow = T)
    • Arrange the data in a matrix. The columns are the data in a given month (the first column is all the Januaries) and the rows are the data in a given year.
  • boxplot(split(nottem.matrix, col(nottem.matrix)))
  • nottem.monthly.means <- apply(nottem.matrix, 2, mean)
    • Output: 39.695 39.190 42.195 46.290 52.560 58.040 61.900 60.520 56.480 49.495 42.580 39.530
  • plot(1:12, nottem.monthly.means)
    • 1:12 is the sequence of integers 1 to 12.
  • nottem.dev.mm <- nottem - rep(nottem.monthly.means, 20)
    • There are 20 years of data, so we need to repeat the sequence of monthly means 20 times.
  • ts.plot(nottem.dev.mm)
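An equivalent, somewhat more idiomatic route uses cycle() to index months directly (a sketch, not what the slide does; it produces the same deseasonalized series as the matrix approach above):

    m <- cycle(nottem)                          # month index (1..12) for each observation
    monthly.means <- tapply(nottem, m, mean)    # same values as the apply() result above
    nottem.dev.mm <- nottem - monthly.means[m]  # subtract each month's mean
    ts.plot(nottem.dev.mm)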
Further analysis on nottem data
• Other than the obvious seasonal variation, the nottem series shows irregular fluctuations.
• Are those irregular fluctuations generated by a stationary process?
  • Nonstationarity in the mean and/or variance (e.g. a time trend) can often be detected from the graph.
• Produce a lagged scatterplot, i.e. plot xt+h versus xt for some lag h, typically h = 1 or 2, but not much more.
  • lag.plot(nottem.dev.mm, 12, diag.col = "forest green"): time series lag plots.
  • Plot a time series against lagged versions of itself. This helps visualize "auto-dependence" even when autocorrelations vanish.
• Calculate and plot the sample autocorrelation, together with the bands that indicate significant sample autocorrelations.
  • acf: computes (and by default plots) estimates of the autocovariance or autocorrelation function.
  • pacf: the corresponding function for partial autocorrelations.
  • acf(nottem.dev.mm, type = "covariance", plot = TRUE)
  • acf(nottem.dev.mm, type = "correlation", plot = TRUE); type may also be "partial".
  • plot.acf: plotting autocovariance and autocorrelation functions.
  • pacf(nottem.dev.mm, lag.max = NULL, plot = TRUE, na.action = na.fail, ...)
• Can the residuals be modeled as an AR(1) process? See the sketch below.
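A quick way to probe the AR(1) question (a sketch; fitting AR(1) here is a hypothesis to check, not a conclusion of these slides):

    lag.plot(nottem.dev.mm, 12)            # x_t against x_{t-h}, h = 1..12
    acf(nottem.dev.mm)                     # sample ACF with significance bands
    pacf(nottem.dev.mm)                    # a cut-off at lag 1 would suggest AR(1)
    fit1 <- arima(nottem.dev.mm, order = c(1, 0, 0))
    acf(residuals(fit1))                   # residuals should resemble white noise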
The Box-Jenkins approach to time series modeling
• It consists of iterations among three crucial stages: model identification, model estimation, and diagnostic checking.
• Model identification
  • Match the patterns of the sample ACF and sample PACF from the data with the known patterns of the ACF and PACF for the ARMA models:

    Model        ACF                       PACF
    White noise  clean (no spikes)         clean (no spikes)
    AR(p)        tails off                 cuts off at lag p
    MA(q)        cuts off at lag q         tails off
    ARMA(p,q)    tails off after lag q-p   tails off after lag p-q

  • Identify the values of p and q for an ARMA(p,q) model for the data based on the sample autocorrelation function (SACF) and the sample partial autocorrelation function (SPACF), which can be computed from the SACF.
  • De Gooijer, J.G., et al. (1985). Methods for determining the order of an autoregressive-moving average process: a survey. International Statistical Review 53, 301-329.
• If the data are generated from white noise, then the approximate standard error of the SACF and SPACF is n^(-1/2).
• If the SACF decays very slowly and the SPACF has a spike at lag 1 only, differencing is needed. Try d = 1 first and re-examine the SACF and SPACF of the differenced series. If the SACF is still not decaying, try higher values of d. Note that d is seldom greater than 2 in practice.
• A simulation sketch of these patterns follows.
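One way to internalize the identification patterns is to simulate from known models and inspect the sample ACF/PACF (a sketch; the orders, coefficients, and sample size are arbitrary illustrative choices):

    set.seed(1)
    x.ar2 <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 200)  # a stationary AR(2)
    x.ma1 <- arima.sim(model = list(ma = 0.7), n = 200)           # an MA(1)
    op <- par(mfrow = c(2, 2))
    acf(x.ar2); pacf(x.ar2)   # ACF tails off, PACF cuts off at lag 2
    acf(x.ma1); pacf(x.ma1)   # ACF cuts off at lag 1, PACF tails off
    par(op)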
The Box-Jenkins approach to time series modeling
• Model estimation: after specifying an ARMA(p,q) model, estimate the parameters in the model.
  • There are p + q + 1 parameters: φ1, …, φp, θ1, …, θq, and σ².
  • The method of moments:
    • The Yule-Walker equations for an AR(p) process can be solved easily.
    • Unlike the AR(p) case, the moment equations for MA(q) models are NONLINEAR. They are very difficult to solve, especially when q is large.
    • The moment estimators for mixed ARMA models are also very complicated.
  • Conditional maximum likelihood estimation.
  • Unconditional maximum likelihood estimation.
• Diagnostic checking
  • Assess model adequacy by checking whether the model assumptions are satisfied.
  • If the model is inadequate, this stage provides information for re-identifying the model.
  • Check the normality, constant-variance, and independence assumptions on the residuals.
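In R these estimation routes map onto ar() and arima() (a sketch on the lh series; the AR(3) order is taken from the analysis later in these slides):

    ar(lh, aic = FALSE, order.max = 3)             # method of moments via Yule-Walker (the default)
    arima(lh, order = c(3, 0, 0), method = "CSS")  # conditional sum of squares
                                                   # (conditional ML under Gaussian errors)
    arima(lh, order = c(3, 0, 0), method = "ML")   # full (unconditional) maximum likelihood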
Data analysis on housing starts
• Getting the data into R:
  • hstart: U.S. housing starts, monthly, January 1966 to December 1974.
  • Read into R via hstart <- scan("D:/teaching/time series/hstart.txt").
  • The function ts is used to create time-series objects.
  • Made into an R time series via hstart.data <- ts(hstart, start = c(1966, 1), frequency = 12).
• Model building:
  • plot(hstart.data, main = "Housing start data", ylab = "Monthly housing starts in the USA")
  • hstart.matrix <- matrix(hstart.data, ncol = 12, byrow = T)
  • boxplot(split(hstart.matrix, col(hstart.matrix)))
  • hstart.monthly.means <- apply(hstart.matrix, 2, mean)
  • hstart.dev.mm <- hstart.data - rep(hstart.monthly.means, 9)
  • ts.plot(hstart.dev.mm)
• For the hstart series, there is a trend which dominates even the seasonal component.
• We see that there was a clear increase in housing starts from 1966 to 1973, followed by a decline after 1973.
Differencing
• The Box-Jenkins methodology uses differencing:
  • diff(hstart.data, 1, 1)
• acf: computes (and by default plots) estimates of the autocovariance or autocorrelation function.
• pacf: the corresponding function for partial autocorrelations.
  • acf(x, lag.max = NULL, type = c("correlation", "covariance", "partial"), plot = TRUE, na.action = na.fail, demean = TRUE, ...)
  • plot.acf: plotting autocovariance and autocorrelation functions.
  • pacf(x, lag.max = NULL, plot = TRUE, na.action = na.fail, ...)
• Some time series exhibit correlations which never decay exponentially, as they would for an ARMA process.
  • Fractional differencing (Brockwell & Davis, Ch. 13.2).
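A sketch of the differencing step on the housing-starts series built above (the seasonal difference at lag 12 is an additional illustrative step, common for monthly data, not a step taken on the slide):

    hstart.diff <- diff(hstart.data, lag = 1, differences = 1)  # remove the trend
    acf(hstart.diff); pacf(hstart.diff)       # re-examine: does the SACF now decay?
    hstart.diff12 <- diff(hstart.diff, lag = 12)  # seasonal differencing for monthly data
    acf(hstart.diff12)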
Data analysis on luteinizing hormone levels
• data(lh)
  • A series of 48 observations, at 10-minute intervals, of luteinizing hormone levels in blood samples from a human female.
  • lh: Time Series: Start = 1; End = 48; Frequency = 1 (the number of observations per unit of time).
• acf(lh); pacf(lh)
• a <- ar(lh)
  • Estimates of the unknown parameters are obtained by solving the Yule-Walker equations with autocorrelations replaced by sample autocorrelations.
  • AIC is used to choose the order of the autoregressive model.
  • It removes the overall mean (2.4) of the series before fitting the AR model.
  • The order p is selected to be 3. (Try a$aic.)
  • The fitted model is Xt - 2.4 = 0.6534 (Xt-1 - 2.4) - 0.0636 (Xt-2 - 2.4) - 0.2269 (Xt-3 - 2.4) + et.
  • σ² is estimated as 0.1959.
• Model diagnostics
  • ar(lh, FALSE, 1) # fit an AR(1), or arima(lh, order = c(1,0,0))
  • Examine the residuals; see the sketch below.
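A minimal diagnostic sketch (the Ljung-Box lag of 10 is an arbitrary illustrative choice):

    a <- ar(lh)          # Yule-Walker fit; order chosen by AIC
    a$order; a$ar        # 3, and the coefficients 0.6534 -0.0636 -0.2269
    fit1 <- arima(lh, order = c(1, 0, 0))   # the AR(1) alternative
    acf(residuals(fit1))                    # residual ACF: white noise?
    Box.test(residuals(fit1), lag = 10, type = "Ljung-Box")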
Principles for deciding on a model
• Principle of parsimony
  • If several models adequately represent a given data set, choose the fitted model with the smallest possible number of parameters.
• Model selection criteria
  • AIC (Akaike's information criterion): AIC(M) = -2 ln(L) + 2M, where M is the number of parameters in the model and L is the likelihood function.
  • BIC (Schwarz's Bayesian criterion) is quite similar to AIC: BIC(M) = -2 ln(L) + M ln(n).
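In R, both criteria are available through AIC() on a fitted arima object (a sketch on the lh series; the candidate orders are arbitrary):

    fit <- arima(lh, order = c(3, 0, 0))
    AIC(fit)                          # Akaike's criterion: -2 ln(L) + 2M
    AIC(fit, k = log(length(lh)))     # Schwarz's BIC: penalty ln(n) per parameter
    # compare a few candidate AR orders: smaller is better
    sapply(1:4, function(p) AIC(arima(lh, order = c(p, 0, 0))))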
Nonstationary Time Series Models
• ARIMA stands for "autoregressive integrated moving average" models.
• Differencing
• The random walk model: Xt = Xt-1 + et
  • The present observation is the last observation plus a random shock.
  • The model has been widely used to describe the behaviour of security prices.
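A simulation sketch showing why a random walk defeats the stationary ACF heuristics (the sample size and seed are arbitrary):

    set.seed(1)
    rw <- cumsum(rnorm(200))   # X_t = X_{t-1} + e_t, starting from 0
    plot(rw, type = "l")
    acf(rw)                    # sample ACF decays very slowly: the signal to difference
    acf(diff(rw))              # one difference recovers white noise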
Commands in the ts package
• ar: fit an autoregressive time series model to the data, by default selecting the order by AIC.
• arima: fit an ARIMA model to a univariate time series.
• predict: compute model predictions.
  • predict is a generic function for predictions from the results of various model-fitting functions. It invokes the particular method appropriate to the class of its first argument.
• arima.sim: simulate from an ARIMA model.
• ARMAtoMA: convert an ARMA process to an infinite MA process.
• Box.test: Box-Pierce and Ljung-Box tests.
  • Computes the Box-Pierce or Ljung-Box test statistic for examining the null hypothesis of independence in the time series xt.
• decompose:
  • Decompose a time series into seasonal, trend and irregular components using moving averages.
  • Deals with additive or multiplicative seasonal components.
• stl: decompose a time series into seasonal, trend and irregular components using loess.
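A few of these commands in action, tying the pieces of this section together (a sketch reusing the lh fit and the nottem series from earlier slides; the forecast horizon and test lag are arbitrary):

    fit <- arima(lh, order = c(3, 0, 0))
    predict(fit, n.ahead = 6)                    # forecasts with standard errors
    ARMAtoMA(ar = fit$coef[1:3], lag.max = 10)   # weights of the infinite MA representation
    Box.test(residuals(fit), lag = 10, type = "Ljung-Box")
    plot(decompose(nottem))                      # moving-average decomposition
    plot(stl(nottem, s.window = "periodic"))     # loess-based decomposition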