350 likes | 569 Views
STATS 330: Lecture 11. Diagnostics 3. Outliers and high-leverage points. An outlier is a point that has a larger or smaller y value that the model would suggest Can be due to a genuine large error e Can be caused by typographical errors in recording the data
E N D
STATS 330: Lecture 11 Diagnostics 3 330 lecture 11
Outliers and high-leverage points • An outlier is a point that has a larger or smaller y value that the model would suggest • Can be due to a genuine large error e • Can be caused by typographical errors in recording the data • A high leverage point is a point with extreme values of the explanatory variables 330 lecture 11
Outliers • The effect of an outlier depends on whether it is also a high leverage point • A “high leverage” outlier • Can attract the fitted plane, distorting the fit, sometimes extremely • In extreme cases may not have a big residual • In extreme cases can increase R2 • A “low leverage” outlier • Does not distort the fit to the same extent • Usually has a big residual • Inflates standard errors, decreases R2 330 lecture 11
No outliers No high-leverage points Low leverage Outlier: big residual High leverage Not an outlier High-leverage outlier 330 lecture 11
Example: the education data (ignoring urban) High leverage point 330 lecture 11
An outlier also? Residual somewhat extreme 330 lecture 11
Measuring leverage It can be shown (see eg STATS 310) that the fitted value of case i is related to the response data y1,…,yn by the equation The hij depend on the explanatory variables. The quantities hii are called “hat matrix diagonals” (HMD’s) and measure the influence yi has on the ith fitted value. They can also be interpreted as the distance between the X-data for the ith case and the average x-data for all the cases. Thus, they directly measure how extreme the x-values of each point are 330 lecture 11
Interpreting the HMDs • Each HMD lies between 0 and 1 • The average HMD is p/n • (p=no of reg coefficients, p=k+1) • An HMD more than 3p/n is considered extreme 330 lecture 11
Example: the education data educ.lm<-lm(educ~percapita+under18, data=educ.df) hatvalues(educ.lm)[50] 50 0.3428523 > 9/50 [1] 0.18 Clearly extreme! n=50, p=3 330 lecture 11
Studentized residuals • How can we recognize a big residual? How big is big? • The actual size depends on the units in which the y-variable is measured, so we need to standardize them. • Can divide by their standard deviations • Variance of a typical residual e is var(e) = (1-h) s2 where h is the hat matrix diagonal for the point. 330 lecture 11
Studentized residuals (2) • “Internally studentised” (Called “standardised” in R) • “Externally studentised” (Called “studentised” in R) s2 is Usual Estimate of s2 s2i is estimate of s2 after deleting the ith data point 330 lecture 11
Studentized residuals (3) • How big is big? • Both types of studentised residual are approximately distributed as standard normals when the model is OK and there are no outliers. (in fact the externally studentised one has a t-distribution) • Thus, studentised residuals should be between -2 and 2 with approximately 95% probability. 330 lecture 11
Studentized residuals (4) Calculating in R: library(MASS) # load the MASS library stdres(educ.lm) #internally studentised (standardised in R) studres(educ.lm) #externally studentised (studentised in R) > stdres(educ.lm)[50] 50 3.275808 > studres(educ.lm)[50] 50 3.700221 330 lecture 11
What does studentised mean? 330 lecture 11
Recognizing outliers • If a point is a low influence outlier, the residual will usually be large, so large residual and a low HMD indicates an outlier • If a point is a high leverage outlier, then a large error usually will cause a large residual. • However, in extreme cases, a high leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD, and the residual is not particularly big, we can’t always tell if a point is an outlier or not. 330 lecture 11
High-leverage outlier Small residual! Small residual! 330 lecture 11
Leverage-residual plot plot(educ.lm, which=5) Can plot standardised residuals versus leverage (HMD’s): the leverage-residual plot (LR plot) Point 50 is high leverage, big residual, is an outlier 330 lecture 11
Interpreting LR plots Low leverage outlier 3p/n High leverage outlier Standardised residual 2 Possible high leverage outlier OK 0 -2 High leverage, outlier Low leverage outlier leverage 330 lecture 11
Residuals and HMD’s No big studentized residuals, no big HMD’s (3p/n = 0.2 for this example) 330 lecture 11
Residuals and HMD’s (2) Point 24 Point 24 One big studentized residual, no big HMD’s (3p/n = 0.2 for this example). Line moves a bit 330 lecture 11
Residuals and HMD’s (3) Point 1 Point 1 No big studentized residual, one big HMD, pt 1. (3p/n = 0.2 for this example). Line hardly moves.Pt 1 is high leverage but not influential. 330 lecture 11
Residuals and HMD’s (4) Point 1 Point 1 One big studentized residual, one big HMD, both pt 1. (3p/n = 0.2 for this example). Line moves but residual is large. Pt 1 is influential 330 lecture 11
Residuals and HMD’s (5) Point 1 Point 1 No big studentized residuals, big HMD, 3p/n= 0.2. Point 1 is high leverage and influential 330 lecture 11
Influential points • How can we tell if a high-leverage point/outlier is affecting the regression? • By deleting the point and refitting the regression: a large change in coefficients means the point is affecting the regression • Such points are called influential points • Don’t want analysis to be driven by one or two points 330 lecture 11
“Leave-one out” measures • We can calculate a variety of measures by leaving out each data point in turn, and looking at the change in key regression quantities such as • Coefficients • Fitted values • Standard errors • We discuss each in turn 330 lecture 11
Example: education data 330 lecture 11
Standardized difference in coefficients: DFBETAS Formula: Problem when: Greater than 1 in absolute value This is the criterion coded into R 330 lecture 11
Standardized difference in fitted values: DFFITS Formula: Problem when: Greater than 3Ö(p/(N-p) ) in absolute value (p=number of regression coefficients) 330 lecture 11
COV RATIO & Cooks D • Cov Ratio: • Measures change in the standard errors • of the estimated coefficients • Problem indicated: when Cov Ratio more than • 1+3p/n or less than 1-3p/n • Cook’s D • Measures overall change in the coefficients • Problem when: More than qf(0.50, p,n-p) • (lower 50% of F distribution), roughly 1 in • most cases 330 lecture 11
Calculations > influence.measures(educ.lm) Influence measures of lm(formula = educ ~ under18 + percap, data = educ.df) dfb.1. dfb.un18 dfb.prcp dffit cov.r cook.d hat inf 10 0.06381 -0.02222 -0.16792 -0.3631 0.803 4.05e-02 0.0257 * 44 0.02289 -0.02948 0.00298 -0.0340 1.283 3.94e-04 0.1690 * 50 -2.36876 2.23393 1.501812.4733 0.821 1.66e+000.3429 * > p=3, n=50, 3p/n=0.18, 3Ö(p/(n-p)) =0.758, qf(0.5,3,47)= 0.8002294 330 lecture 11
Plotting influence # set up plot window with 2 x 4 array of plots par(mfrow=c(2,4)) # plot dffbetas, dffits, cov ratio, # cooks D, HMD’s influenceplots(educ.lm) 330 lecture 11
Remedies for outliers • Correct typographical errors in the data • Delete a small number of points and refit (don’t want fitted regression to be determined by one or two influential points) • Report existence of outliers separately: they are often of scientific interest • Don’t delete too many points (1 or 2 max) 330 lecture 11
Summary: Doing it in R • LR plot: plot(educ.lm, which=5) • Full diagnostic display plot(educ.lm) • Influence measures: influence.measures(educ.lm) • Plots of influence measures par(mfrow=c(2,4)) influenceplots(educ.lm) 330 lecture 11
HMD Summary • Hat matrix diagonals • Measure the effect of a point on its fitted value • Measure how outlying the x-values are (how “high –leverage” a point is) • Are always between 0 and 1 with bigger values indicating high leverage • Points with HMD’s more than 3p/n are considered “high leverage” 330 lecture 11