Stat 6601 Project: Regression Diagnostics (V&R 6.3)
Presenters: Anthony Britto, Kathy Fung, Kai Koo
Basic Definition of Regression Diagnostics
• A long-established robust method
• Developed to measure and iteratively detect possibly wrong data points, and to reject them through analysis of the globally fitted model
Regression Diagnostics
• Goal: detection of possibly wrong data through analysis of the globally fitted model.
• Typical approach (sketched in R below):
(1) Determine an initial fitted model
(2) Compute the residuals
(3) Identify and reject outliers
(4) Rebuild the model, or track down the source of the errors
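A minimal sketch of this loop in R, assuming a hypothetical data frame dat with columns x and y (the cutoff of 2 on the standardized residuals is a common rule of thumb, not from the slides):

> fit <- lm(y ~ x, data = dat)              # (1) initial fitted model
> r <- rstandard(fit)                       # (2) standardized residuals
> out <- abs(r) > 2                         # (3) flag suspect observations
> fit2 <- lm(y ~ x, data = dat[!out, ])     # (4) rebuild the model without them
> coef(fit) - coef(fit2)                    # how far the estimates moved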
Influence and Leverage (1)
• Influence: an observation is influential if the estimates change substantially when it is omitted.
• Leverage: the "horizontal" distance of the x-value from the mean of x; the further from the mean, the more leverage an observation has.
• y-discrepancy: the vertical distance between y_obs and y_pred.
Conceptual formula: Influence = Leverage × y-Discrepancy
Influence and Leverage (2)
High-influence point (5, 60): (x − mean of x)² = 830, y_obs − y_pred = 45
Low-influence point (30, 105): (x − mean of x)² = 15, y_obs − y_pred = 45
Both points have the same y-discrepancy; only the high-leverage point is influential.
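This effect can be reproduced with a hedged R sketch (made-up data, not the slide's dataset): an observation that combines high leverage with a large y-discrepancy dominates the change in the fitted slope on deletion:

> x <- c(1, 2, 3, 4, 5, 30)          # obs 6 lies far from mean(x): high leverage
> y <- c(3, 5, 7, 9, 11, 70)         # obs 6 also has a large y-discrepancy
> fit <- lm(y ~ x)
> hatvalues(fit)                     # leverage is by far largest for obs 6
> dfbeta(fit)[, "x"]                 # change in the slope when each obs is deleted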
Detecting Outliers
• Distinguish between two types of outliers:
• 1st type: outliers in the response variable represent model failure; such observations are called outliers.
• 2nd type: outliers with respect to the predictors are called leverage points.
• Both types can affect the regression model. High-leverage observations may almost uniquely determine the regression coefficients, and may also make the standard errors of the coefficients much smaller than they would be if those observations were excluded (see the sketch below).
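A hedged R illustration of the standard-error effect, using simulated data (all values assumed): adding one high-leverage point that sits exactly on the true line sharply shrinks the slope's standard error:

> set.seed(1)
> x <- rnorm(20); y <- 2 + 3*x + rnorm(20)
> coef(summary(lm(y ~ x)))["x", "Std. Error"]
> x2 <- c(x, 15); y2 <- c(y, 2 + 3*15)             # leverage point on the true line
> coef(summary(lm(y2 ~ x2)))["x2", "Std. Error"]   # much smaller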
Methods to Detect Outliers in R
Outliers in the predictors can often be detected by simply examining the distribution of the predictors, e.g. with:
• Dot plots
• Stem-and-leaf plots
• Box plots
• Histograms
(see the sketch below)
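A sketch of the four displays in base R, using the elevation column of the project data introduced later in the deck:

> pd <- read.csv("projdata.csv")
> stripchart(pd$elevation, method = "jitter")   # dot plot
> stem(pd$elevation)                            # stem-and-leaf plot
> boxplot(pd$elevation)                         # box plot
> hist(pd$elevation)                            # histogram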
Linear Model
Y = b0 + b1x1 + b2x2 + ... + bkxk + e
Matrix form: Y = Xb + e, where
  Y = (y1, ..., yn)' is the n × 1 vector of responses,
  X is the n × (k+1) design matrix whose ith row is (1, xi1, ..., xik),
  b = (b0, b1, ..., bk)' is the vector of coefficients, and
  e = (e1, ..., en)' is the vector of errors.
R Functions for Regression Diagnostics

Package  Function                Description
base     plot(model)             Basic diagnostic plots
         ls.diag(lsfit(x, y))    Diagnostics for a least-squares fit
car      cr.plots(model)         Partial-residual plots
         av.plots(model)         Partial-regression (added-variable) plots
         hatvalues(model)        Hat values (leverages)
         outlier.test(model)     Test for the largest residual
         df.betas(model)         DFBETAs measure of influence
         cookd(model)            Cook's D measure of influence
         rstudent(model)         Studentized residuals
         vif(model)              VIF or GVIF for each term in the model
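A hedged usage sketch tying a few of these together on an lm fit of the deck's later housing-price model. Note that current versions of the car package rename some of these functions: cr.plots() is now crPlots(), av.plots() is avPlots(), outlier.test() is outlierTest(), and cookd() has been superseded by stats::cooks.distance().

> library(car)
> model <- lm(log10price ~ elevation + date + flood + distance,
+             data = project.data)
> plot(model)          # basic diagnostic plots
> hatvalues(model)     # hat values (leverages)
> rstudent(model)      # Studentized residuals
> vif(model)           # variance inflation factors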
R Functions for Robust Regression

Package  Function          Description
MASS     rlm(y ~ x)        M-estimation
lqs      ltsreg(y ~ x)     Least-trimmed squares
         lmsreg(y ~ x)     Least-median-of-squares regression
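A minimal usage sketch; y and x here stand for any response and predictor. In current R, ltsreg() and lmsreg() ship with MASS as wrappers around lqs():

> library(MASS)
> rlm(y ~ x)                    # M-estimation
> lqs(y ~ x, method = "lts")    # least-trimmed squares (ltsreg)
> lqs(y ~ x, method = "lms")    # least-median-of-squares (lmsreg)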
Example: Linear Regression (One Independent Variable), Part 1
Matrix form: Y = Xb + e
R / S-plus script:
> xd <- c(rep(1,5), 1,3,4,5,7)
> yd <- c(6,14,10,14,26)
> x <- matrix(xd, 5, 2, byrow=F)    # column 1 = intercept, column 2 = predictor
> y <- matrix(yd, 5, 1, byrow=T)
> xtrp <- t(x)                      # matrix transpose
> xxtrp <- xtrp %*% x               # matrix multiplication
> inxxtrp <- solve(xxtrp)           # matrix inverse
> b.hat <- inxxtrp %*% xtrp %*% y   # b.hat = (X'X)^(-1) X'y
> b.hat
     [,1]
[1,]    2
[2,]    3
> H <- x %*% inxxtrp %*% xtrp       # hat matrix
> H
      [,1] [,2] [,3] [,4]  [,5]
[1,]  0.65 0.35  0.2 0.05 -0.25
[2,]  0.35 0.25  0.2 0.15  0.05
[3,]  0.20 0.20  0.2 0.20  0.20
[4,]  0.05 0.15  0.2 0.25  0.35
[5,] -0.25 0.05  0.2 0.35  0.65
Example: Linear Regression (One Independent Variable), Part 2
Extraction of leverages and predicted values
Leverage of the ith observation (for one independent variable; n = # of obs., p = 1):
h_ii = 1/n + (x_i − mean(x))² / Σ_j (x_j − mean(x))²
(h_ij is the leverage of (x_i, y_i) when i = j)
> n <- 5
> lev <- numeric(n)
> for (i in 1:n) {
+   lev[i] <- H[i,i]
+ }
> lev
[1] 0.65 0.25 0.20 0.25 0.65
> h <- lm.influence(lm(y ~ x))$hat
> h
[1] 0.65 0.25 0.20 0.25 0.65
> ls.diag(lsfit(x[,2], y))$hat
[1] 0.65 0.25 0.20 0.25 0.65
> y1.pred <- 0
> for (i in 1:n) {
+   y1.pred <- y1.pred + H[1,i] * y[i]
+ }
> y1.pred   # intercept (2) + slope (3) * x1, with x1 = 1
[1] 5
Example: Linear Regression (Measurement of Residuals)
From y-discrepancy to influence:
• Raw residual (y-discrepancy): e_i = y_i − ŷ_i
• Standardized residual (influence): r_i = e_i / (s · sqrt(1 − h_ii)), where s² = Σ_j e_j² / (n − p − 1)
• Studentized residual (influence): t_i = e_i / (s_(i) · sqrt(1 − h_ii)), where s_(i) is s computed with observation i omitted
(see the sketch below)
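Base R computes all three directly; a sketch using the one-variable example's data from earlier, where x[,2] is the predictor column:

> fit <- lm(yd ~ x[, 2])
> residuals(fit)     # raw residuals e_i
> rstandard(fit)     # standardized residuals, e_i / (s * sqrt(1 - h_ii))
> rstudent(fit)      # Studentized residuals, using leave-one-out s_(i)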
Influence, Leverage and Discrepancy
The influence of an observation can be determined from its residual value and its leverage.
Calculation of Residual Values
# Do it yourself in R
> y.pred <- numeric(n)
> for (i in 1:n) {
+   for (j in 1:n) {
+     y.pred[i] <- y.pred[i] + H[i,j] * yd[j]
+   }
+ }
> res <- yd - y.pred                 # raw residuals
> Sy <- sqrt(sum(res^2)/(n-2))       # residual standard error s
> resstd <- res/(Sy*sqrt(1-lev))     # standardized residuals
> resstd
[1]  0.4413674  0.9045340 -1.1677484 -0.9045340  1.3241022
# Using ls.diag to get residuals
> ls.diag(lsfit(x[,2], y))$std.res    # standardized residuals
[1]  0.4413674  0.9045340 -1.1677484 -0.9045340  1.3241022
> ls.diag(lsfit(x[,2], y))$stud.res   # Studentized residuals
[1]  0.3726780  0.8660254 -1.2909944 -0.8660254  1.6770510
Example: Multiple Regression
R / S-plus script:
project.data <- read.csv("projdata.csv")
model1 <- glm(log10price ~ elevation + date + flood + distance, data = project.data)
summary(model1)

R output:
Call:
glm(formula = log10price ~ elevation + date + flood + distance, data = project.data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.22145  -0.09075  -0.04765   0.07475   0.43564

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.226620   0.092763  13.223 4.74e-13 ***
elevation    0.032394   0.007304   4.435 0.000149 ***
date         0.008065   0.001168   6.902 2.50e-07 ***
flood       -0.338254   0.087451  -3.868 0.000659 ***
distance     0.025659   0.007177   3.575 0.001401 **
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for gaussian family taken to be 0.02453675)

    Null deviance: 2.90725 on 30 degrees of freedom
Residual deviance: 0.63796 on 26 degrees of freedom
AIC: -20.414

Number of Fisher Scoring iterations: 2
Example: Multiple Regression (Measurement of Influence Using R / S-plus)
R / S-plus script:
# Measurement of influence (columns of project.data assumed attached)
y <- matrix(log10price, 31, 1, byrow=T)
x <- matrix(c(elevation, date, flood, distance), 31, 4, byrow=F)
lesi <- ls.diag(lsfit(x, y))   # regression diagnostics
lesi$stud.res                  # extraction of Studentized residuals
plot(lesi$stud.res, ylab="Studentized residuals", xlab="obs #")   # residual plot
lesi$cooks                     # extraction of Cook's distances
 [1] 1.392863e-02 3.528960e-01 8.396778e-02 1.518977e-01 1.390608e-01
 [6] 1.145438e-02 2.437453e-03 1.972966e-03 1.705327e-01 9.386767e-02
[11] 7.468621e-03 1.134031e-06 1.945352e-04 1.678359e-03 8.794873e-03
[16] 5.150404e-03 2.257051e-05 4.193730e-03 1.961141e-02 1.120336e-03
[21] 1.075247e-01 1.071167e-02 2.825819e-02 2.193734e-03 5.710213e-02
[26] 7.024345e-02 1.166287e-03 1.322331e-02 2.616666e-03 1.411050e-01
[31] 1.06727e-02
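The same quantities can also be read straight off the fitted model1, skipping the manual matrix setup; a hedged sketch, and for this Gaussian model the values should match the ls.diag output above:

> rstudent(model1)                        # Studentized residuals
> cooks.distance(model1)                  # Cook's distances
> summary(influence.measures(model1))     # flags potentially influential cases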
Example: Multiple Regression (SAS)
SAS script:
data land(drop=county sewer);
  infile "c:\stat 6401\projdata.csv" delimiter=',' firstobs=2;
  input price county size elevation sewer date flood distance;
  log10price=log10(price);
run;
proc reg data=land;
  model log10price=elevation size date flood /r;
  plot rstudent.*log10price='+';
  output out=pred pred=phat;
  title 'linear regression for housing prices';
run;

Output:
The REG Procedure
Model: MODEL1
Dependent Variable: log10price

Analysis of Variance
                           Sum of       Mean
Source            DF      Squares     Square    F Value    Pr > F
Model              4      2.00013    0.50003      14.33    <.0001
Error             26      0.90712    0.03489
Corrected Total   30      2.90725

Root MSE          0.18679    R-Square   0.6880
Dependent Mean    0.98126    Adj R-Sq   0.6400
Coeff Var        19.03533

Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1       1.38737      0.09877      14.05      <.0001
size          1    0.00012958   0.00011481       1.13      0.2694
elevation     1       0.02820      0.00866       3.26      0.0031
flood         1      -0.23779      0.09837      -2.42      0.0229
date          1       0.00881      0.00150       5.88      <.0001
Example: Multiple Regression (SAS)
Output Statistics
         Dep Var  Predicted   Std Error               Std Error    Student                      Cook's
 Obs  log10price      Value  Mean Predict    Residual   Residual   Residual   -2 -1 0 1 2           D
   1      0.6532     0.7276       0.0698     -0.0744      0.140     -0.530   |     *|     |     0.014
   2      1.0253     0.5897       0.0628      0.4356      0.144      3.036   |      |******|    0.353
   3      0.2304     0.3623       0.0850     -0.1319      0.132     -1.002   |    **|     |     0.084
   4      0.6990     0.8682       0.0872     -0.1692      0.130     -1.301   |    **|     |     0.152
   5      0.6990     0.5380       0.0875      0.1609      0.130      1.239   |      |**   |     0.139
   6      0.5185     0.5978       0.0623     -0.0793      0.144     -0.552   |     *|     |     0.011
   7      0.7559     0.8078       0.0474     -0.0519      0.149     -0.348   |      |     |     0.002
   8      0.7924     0.8400       0.0466     -0.0477      0.150     -0.319   |      |     |     0.002
   9      1.2878     1.3972       0.1082     -0.1094      0.113     -0.966   |     *|     |     0.171
  10      0.5051     0.7266       0.0635     -0.2215      0.143     -1.546   |   ***|     |     0.094
  11      0.6721     0.7347       0.0634     -0.0626      0.143     -0.437   |      |     |     0.007
  12      0.8388     0.8399       0.0495   -0.001063      0.149    -0.0072   |      |     |     0.000
  13      0.9085     0.9256       0.0416     -0.0171      0.151     -0.113   |      |     |     0.000
  14      1.0645     1.1228       0.0364     -0.0584      0.152     -0.383   |      |     |     0.002
  15      1.2856     1.1825       0.0457      0.1031      0.150      0.688   |      |*    |     0.009
  16      1.0682     1.1578       0.0409     -0.0896      0.151     -0.593   |     *|     |     0.005
  17      1.1239     1.1305       0.0368   -0.006693      0.152    -0.0440   |      |     |     0.000
  18      1.1790     1.0709       0.0315      0.1081      0.153      0.704   |      |*    |     0.004
  19      1.0934     1.2252       0.0519     -0.1317      0.148     -0.891   |     *|     |     0.020
  20      1.1847     1.1382       0.0373      0.0464      0.152      0.305   |      |     |     0.001
  21      1.0864     1.2145       0.0920     -0.1281      0.127     -1.011   |    **|     |     0.108
  22      1.2577     1.0865       0.0318      0.1712      0.153      1.116   |      |**   |     0.011
  23      1.2253     1.0051       0.0393      0.2202      0.152      1.452   |      |**   |     0.028
  24      0.7709     0.7985       0.0728     -0.0277      0.139     -0.200   |      |     |     0.002
  25      0.6021     0.7314       0.0769     -0.1293      0.136     -0.948   |     *|     |     0.057
  26      1.5705     1.2588       0.0431      0.3117      0.151      2.070   |      |**** |     0.070
  27      1.2601     1.2152       0.0391      0.0449      0.152      0.296   |      |     |     0.001
  28      1.1790     1.2709       0.0589     -0.0919      0.145     -0.633   |     *|     |     0.013
  29      1.3598     1.4007       0.0590     -0.0408      0.145     -0.281   |      |     |     0.003
  30      1.1818     1.0540       0.0980      0.1279      0.122      1.047   |      |**   |     0.141
  31      1.3404     1.4003       0.0737     -0.0598      0.138     -0.433   |      |     |     0.011
Example: Multiple Regression (SAS)
[Figures: residual plot and Studentized residual plot]
Further Studies for Regression Analysis
• Analysis of models (see the sketch below):
  • Multicollinearity
  • Heteroscedasticity
  • Autocorrelation
• Validation of models
• R functions for modern regression:
http://socserv.socsci.mcmaster.ca/andersen/ICPSR/RFunctions.pdf
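For reference, the car package offers quick checks for the first three topics; a hedged sketch on an lm refit of the earlier model, since ncvTest() and durbinWatsonTest() expect an lm object:

> library(car)
> m <- lm(log10price ~ elevation + date + flood + distance,
+         data = project.data)
> vif(m)                  # multicollinearity
> ncvTest(m)              # heteroscedasticity (score test)
> durbinWatsonTest(m)     # autocorrelation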
The End