310 likes | 445 Views
STATS 330: Lecture 12. Diagnostics 4. Diagnostics 4. Aim of today’s lecture To discuss diagnostics for independence. Independence. One of the regression assumptions is that the errors are independent.
E N D
STATS 330: Lecture 12 Diagnostics 4 330 lecture 12
Diagnostics 4 Aim of today’s lecture To discuss diagnostics for independence 330 lecture 12
Independence • One of the regression assumptions is that the errors are independent. • Data that is collected sequentially over time often have errors that are not independent. • If the independence assumption does not hold, then the standard errors will be wrong and the tests and confidence intervals will be unreliable. • Thus, we need to be able to detect lack of independence. 330 lecture 12
Types of dependence • If large positive errors have a tendency to follow large positive errors, and large negative errors a tendency to follow large negative errors, we say the data has positive autocorrelation • If large positive errors have a tendency to follow large negative errors, and large negative errors a tendency to follow large positive errors, we say the data has negative autocorrelation 330 lecture 12
Diagnostics • If the errors are positively autocorrelated, • Plotting the residuals against time will show long runs of positive and negative residuals • Plotting residuals against the previous residual (ie ei vs ei-1) will show a positivetrend • A correlogram of the residuals will show positive spikes, gradually decaying 330 lecture 12
Diagnostics (2) If the errors are negatively autocorrelated, • Plotting the residuals against time will show alternating positive and negative residuals • Plotting residuals against the previous residual (ie ei vs ei-1) will show a negative trend • A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying 330 lecture 12
Residuals against time res<-residuals(lm.obj) plot(1:length(res),res, xlab=“time”,ylab=“residuals”, type=“b”) lines(1:length(res),res) abline(h=0, lty=2) Can omit the “x” vector if it is sequence numbers Dots/lines Dotted line at 0 (mean residual) 330 lecture 12
Residuals against previous res<-residuals(lm.obj) n<-length(res) plot.res<-res[-1] # element 1 has no previous prev.res<-res[-n] # have to be equal length plot(prev.res,plot.res, xlab=“previous residual”,ylab=“residual”) 330 lecture 12
Plots for different degrees of autocorrelation 330 lecture 12
Correlogram acf(residuals(lm.obj)) • Correlogram (autocorrelation function, acf) is plot of lag k autocorrelation versus k • Lag k autocorrelation is correlation of residuals k time units apart 330 lecture 12
Durbin-Watson test • We can also do a formal hypothesis test, (the Durbin-Watson test) for independence • The test assumes the errors follow a model of the form NB where the ui’s are independent, normal and have constant variance. r is the lag 1 correlation: this is the autoregressive model of order 1 330 lecture 12
Durbin-Watson test (2) • When r = 0, the errors are independent • The DW test tests independence by testing r = 0 • ris estimated by 330 lecture 12
Durbin-Watson test (3) DW test statistic is • Value of DW is between 0 and 4 • Values of DW around 2 are consistent with independence • Values close to 4 indicate negative serial correlation • Values close to 0 indicate positive serial correlation 330 lecture 12
Durbin-Watson test (4) • There exist values dL, dU depending on the number of variables k in the regression and the sample size n – see table on next slide • Use the value of DW to decide on independence as follows: Positive autocorrelation Negative autocorrelation Independence 0 dL dU 4-dU 4-dL 4 Inconclusive 330 lecture 12
Durbin-Watson table 330 lecture 12
Example: the advertising data • Sales and advertising data • Data on monthly sales and advertising spend for 35 months • Model is Sales ~ spend + prev.spend (prev.spend = spend in previous month) 330 lecture 12
Advertising data • > ad.df • spend prev.spend sales • 1 16 15 20.5 • 2 18 16 21.0 • 3 27 18 15.5 • 4 21 27 15.3 • 5 49 21 23.5 • 6 21 49 24.5 • 7 22 21 21.3 • 8 28 22 23.5 • 9 36 28 28.0 • 10 40 36 24.0 • 11 3 40 15.5 • 21 3 17.3 • … 35 lines in all 330 lecture 12
R code for residual vs previous plot advertising.lm<-lm(sales~spend + prev.spend, data = ad.df) res<-residuals(advertising.lm) n<-length(res) plot.res<-res[-1] prev.res<-res[-n] plot(prev.res,plot.res, xlab="previous residual",ylab="residual",main="Residual versus previous residual \n for the advertising data") abline(coef(lm(plot.res~prev.res)), col="red", lwd=2) 330 lecture 12
Time series plot, correlogram – R code par(mfrow=c(2,1)) plot(res, type="b", xlab="Time Sequence", ylab = "Residual", main = "Time series plot of residuals for the advertising data") abline(h=0, lty=2, lwd=2,col="blue") acf(res, main ="Correlogram of residuals for the advertising data") 330 lecture 12
Increasing trend? 330 lecture 12
Calculating DW > rhohat<-cor(plot.res,prev.res) > rhohat [1] 0.4450734 > DW<-2*(1-rhohat) > DW [1] 1.109853 For n=35 and k=2, dL = 1.34. Since DW = 1.109 < dL = 1.34 , strong evidence of positive serial correlation 330 lecture 12
Durbin-Watson table use (1.28 + 1.39)/2 = 1.34 330 lecture 12
Remedy (1) • If we detect serial correlation, we need to fit special time series models to the data. • For full details see STATS 326/726. • Assuming that the AR(1) model is ok, we can use the arima function in R to fit the regression 330 lecture 12
Fitting a regression with AR(1) errors > arima(ad.df$sales,order=c(1,0,0), xreg=cbind(spend,prev.spend)) Call: arima(x = ad.df$sales, order = c(1, 0, 0), xreg = cbind(spend, prev.spend)) Coefficients: ar1 intercept spend prev.spend 0.4966 16.9080 0.1218 0.1391 s.e. 0.1580 1.6716 0.0308 0.0316 sigma^2 estimated as 9.476: log likelihood = -89.16, aic = 188.32 330 lecture 12
Comparisons 330 lecture 12
Remedy (2) • Recall there was a trend in the time series plot of the residuals, these seem related to time • Thus, time is a “lurking variable” , a variable that should be in the regression but isn’t • Try model Sales ~ spend + prev.spend + time 330 lecture 12
Fitting new model time=1:35 new.advertising.lm<-lm(sales~spend + prev.spend + time, data = ad.df) res<-residuals(new.advertising.lm) n<-length(res) plot.res<-res[-1] prev.res<-res[-n] DW = 2*(1-cor(plot.res,prev.res)) 330 lecture 12
DW Retest • DW is now 1.73 • For a model with 3 explanatory variables, du is about 1.66 (refer to the table), so no evidence of serial correlation • Time is a highly significant variable in the regression • Problem is fixed! 330 lecture 12