
Outliers and influential data points


Presentation Transcript


  1. Outliers and influential data points

  2. No outliers?

  3. An outlier? Influential?

  4. An outlier? Influential?

  5. An outlier? Influential?

  6. An outlier? Influential?

  7. An outlier? Influential?

  8. An outlier? Influential?

  9. Impact on regression analyses
      • Not every outlier strongly influences the estimated regression function.
      • Always check whether the estimated regression function is unduly influenced by one or a few cases.
      • For simple linear regression, simple plots are usually enough.
      • For multiple linear regression, use summary measures.

  10. The hat matrix H

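  11. The hat matrix H
      • The regression model: $Y = X\beta + \varepsilon$
      • Least squares estimates: $b = (X'X)^{-1}X'Y$
      • Fitted values: $\hat{Y} = Xb = X(X'X)^{-1}X'Y = HY$, where $H = X(X'X)^{-1}X'$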
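As a rough illustration of the hat matrix, the sketch below builds $H = X(X'X)^{-1}X'$ in plain Python for a straight-line model (intercept plus one predictor). The four (x, y) pairs are the ones used in the small deleted t residual example later in the transcript. The function name `hat_matrix` and the restriction to a two-column design matrix are assumptions made for this example, not something the slides specify.

```python
# Data: the four-point example that appears later in the transcript.
x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]


def hat_matrix(xs):
    """Return H = X (X'X)^{-1} X' for the design matrix X = [1, x].

    Hypothetical helper: it only handles an intercept plus one predictor,
    so X'X is 2x2 and can be inverted in closed form.
    """
    n = len(xs)
    X = [[1.0, xi] for xi in xs]
    # X'X = [[n, sum x], [sum x, sum x^2]]
    a, b, d = float(n), sum(xs), sum(xi * xi for xi in xs)
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    H = []
    for i in range(n):
        # Row i of X times (X'X)^{-1}
        r0 = X[i][0] * inv[0][0] + X[i][1] * inv[1][0]
        r1 = X[i][0] * inv[0][1] + X[i][1] * inv[1][1]
        H.append([r0 * X[j][0] + r1 * X[j][1] for j in range(n)])
    return H


H = hat_matrix(x)
# Fitted values are a linear combination of the observed y: Y_hat = H Y.
fitted = [sum(H[i][j] * y[j] for j in range(len(y))) for i in range(len(y))]
# The diagonal elements of H are the leverages.
leverages = [H[i][i] for i in range(len(x))]

print("fitted values:", [round(v, 4) for v in fitted])
print("leverages:", [round(h, 4) for h in leverages])
print("sum of leverages:", round(sum(leverages), 4))
```

With two parameters (intercept and slope) the leverages should sum to 2, and the point at x = 10 should carry almost all of the leverage.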

  12. Identifying outlying Y values

  13. Identifying outlying Y values
      • Residuals
      • Standardized residuals (also called internally studentized residuals)
      • Deleted residuals
      • Deleted t residuals (also called studentized deleted residuals or externally studentized residuals)

  14. Using matrix notation: Residuals
      Ordinary residuals, defined for each observation i = 1, …, n: $e_i = y_i - \hat{y}_i$. In matrix notation, $e = Y - \hat{Y} = (I - H)Y$.

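  15. Variance of the residuals
      • Residual vector: $e = (I - H)Y$
      • Variance matrix: $\mathrm{Var}(e) = \sigma^2 (I - H)$
      • Variance of the ith residual: $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$
      • Estimated variance of the ith residual: $s^2(e_i) = MSE\,(1 - h_{ii})$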

  16. Standardized residuals
      Standardized residuals, defined for each observation i = 1, …, n: $e_i^* = e_i / s(e_i) = e_i / \sqrt{MSE\,(1 - h_{ii})}$.
      They express each residual in standard deviation units. A standardized residual larger than 2 or smaller than -2 suggests that the corresponding y value is unusual.

  17. An outlying y value?

  18. S = 4.711

      x          y         FITS1     HI1        s(e)      RESI1     SRES1
      0.10000    -0.0716    3.4614   0.176297   4.27561   -3.5330   -0.82635
      0.45401     4.1673    5.2446   0.157454   4.32424   -1.0774   -0.24916
      1.09765     6.5703    8.4869   0.127014   4.40166   -1.9166   -0.43544
      1.27936    13.8150    9.4022   0.119313   4.42103    4.4128    0.99818
      2.20611    11.4501   14.0706   0.086145   4.50352   -2.6205   -0.58191
      ...
      8.70156    46.5475   46.7904   0.140453   4.36765   -0.2429   -0.05561
      9.16463    45.7762   49.1230   0.163492   4.30872   -3.3468   -0.77679
      4.00000    40.0000   23.1070   0.050974   4.58936   16.8930    3.68110

      Unusual Observations
      Obs   x      y       Fit     SE Fit   Residual   St Resid
            4.00   40.00   23.11   1.06     16.89      3.68R
      R denotes an observation with a large standardized residual.
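The s(e) and SRES1 columns above can be reproduced from S, HI1 and RESI1 alone. The sketch below does that for three of the listed rows, assuming that S is the square root of MSE and that s(e) is computed as S·sqrt(1 - h). The function name `standardized_residual` is made up for the example. Because S is only reported to three decimals, the results agree with the table to about three decimal places rather than exactly.

```python
import math

S = 4.711  # residual standard error reported on the slide (assumed to be sqrt(MSE))

# (leverage h_ii, ordinary residual e_i), copied from the HI1 and RESI1 columns
rows = [
    (0.176297, -3.5330),   # first listed observation
    (0.157454, -1.0774),   # second listed observation
    (0.050974, 16.8930),   # the suspicious point at x = 4, y = 40
]


def standardized_residual(e, h, s):
    """Return (s(e_i), e_i / s(e_i)) for one observation."""
    se = s * math.sqrt(1.0 - h)   # estimated standard deviation of the residual
    return se, e / se


results = [standardized_residual(e, h, S) for h, e in rows]

for (h, e), (se, sres) in zip(rows, results):
    note = "  <- |standardized residual| > 2" if abs(sres) > 2 else ""
    print(f"h={h:.6f}  e={e:8.4f}  s(e)={se:.5f}  sres={sres:8.5f}{note}")
```

Only the last row exceeds the ±2 guideline, which matches the R flag in the Unusual Observations output.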

  19. Deleted residuals
      If the observed $y_i$ is extreme, it may “pull” the fitted equation toward itself, yielding a small ordinary residual. To avoid this, delete the ith case, estimate the regression function from the remaining n-1 cases, and use the x values of the ith case to predict its response, $\hat{y}_{i(i)}$. The deleted residual is $d_i = y_i - \hat{y}_{i(i)}$.

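  20. Deleted t residuals
      A deleted t residual is just a standardized deleted residual: $t_i = d_i / s(d_i) = e_i \sqrt{(n-p-1) / (SSE\,(1 - h_{ii}) - e_i^2)}$, where $d_i$ is the deleted residual for the ith case.
      The deleted t residuals follow a t distribution with (n-1)-p degrees of freedom.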

  21.  x     y    RESI1      TRES1
       1   2.1    -1.59    -1.7431
       2   3.8     0.24     0.1217
       3   5.2     1.77     1.6361
      10   2.1    -0.42   -19.7990
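The four-point table above is small enough to check by hand. The sketch below computes the deleted t residuals two ways: by literally refitting the line without each case, and by the shortcut $t_i = e_i \sqrt{(n-p-1)/(SSE(1-h_{ii}) - e_i^2)}$, which needs only the full fit. The helper names (`fit_line`, `deleted_t_by_refit`, `deleted_t_shortcut`) are invented for the example, and the closed-form leverage $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$ used here holds only for simple linear regression.

```python
import math

x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
p = 2            # parameters: intercept and slope
n = len(x)


def fit_line(xs, ys):
    """Least squares intercept and slope for a straight line."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1


def leverage(i):
    """h_ii for simple linear regression (closed form)."""
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    return 1.0 / n + (x[i] - xbar) ** 2 / sxx


def deleted_t_by_refit(i):
    """Drop case i, refit, and standardize the prediction error for case i."""
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    b0, b1 = fit_line(xs, ys)
    d = y[i] - (b0 + b1 * x[i])                       # deleted residual
    sse_del = sum((yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(xs, ys))
    mse_del = sse_del / (len(xs) - p)                 # (n-1)-p degrees of freedom
    return d / math.sqrt(mse_del / (1.0 - leverage(i)))


def deleted_t_shortcut(i):
    """Same quantity from the full fit only."""
    b0, b1 = fit_line(x, y)
    resid = [yj - (b0 + b1 * xj) for xj, yj in zip(x, y)]
    sse = sum(e * e for e in resid)
    e = resid[i]
    return e * math.sqrt((n - p - 1) / (sse * (1.0 - leverage(i)) - e * e))


tres_refit = [deleted_t_by_refit(i) for i in range(n)]
tres_short = [deleted_t_shortcut(i) for i in range(n)]
for xi, a, b in zip(x, tres_refit, tres_short):
    print(f"x={xi:4.0f}  refit={a:10.4f}  shortcut={b:10.4f}")
```

The point at x = 10 has a small ordinary residual (-0.42) but a very large deleted t residual, which is the "pulling" effect described on the deleted residuals slide.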

  22. Row   x         y         RESI1     SRES1      TRES1
        1   0.10000   -0.0716   -3.5330   -0.82635   -0.81916
        2   0.45401    4.1673   -1.0774   -0.24916   -0.24291
        3   1.09765    6.5703   -1.9166   -0.43544   -0.42596
      ...
       19   8.70156   46.5475   -0.2429   -0.05561   -0.05413
       20   9.16463   45.7762   -3.3468   -0.77679   -0.76837
       21   4.00000   40.0000   16.8930    3.68110    6.69012

  23. Identifying outlying X values

  24. Identifying outlying X values
      • Use the diagonal elements, $h_{ii}$, of the hat matrix H to identify outlying X values.
      • The $h_{ii}$ are called leverages.

  25. Properties of the leverages ($h_{ii}$)
      • $h_{ii}$ measures the distance between the X values for the ith case and the means of the X values for all n cases.
      • $h_{ii}$ is a number between 0 and 1, inclusive.
      • The $h_{ii}$ sum to p, the number of parameters.

  26. HI1
      0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
      0.061276  0.048147  0.049628  0.049313  0.051829  0.055760  0.069311
      0.072580  0.109616  0.127489  0.141136  0.140453  0.163492  0.050974
      Sum of HI1 = 2.0000

  27. Properties of the leverages ($h_{ii}$)
      If the ith case is outlying in terms of its X values, it has a large leverage value $h_{ii}$ and therefore exerts substantial leverage in determining the fitted value $\hat{y}_i$.

  28. Using leverages to identify outlying X values
      Minitab flags any observation whose leverage value, $h_{ii}$, is more than 3 times the mean leverage value, $\bar{h} = p/n$ (that is, $h_{ii} > 3p/n$), or is greater than 0.99.
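The flagging rule can be tried on the 21 HI1 values listed earlier in the transcript. The sketch below assumes p = 2 (intercept and slope) and takes the mean leverage as p/n, which follows from the leverages summing to p. The function name `flag_high_leverage` is invented for the example, and the code only mirrors the rule as described on the slide; it is not Minitab's implementation.

```python
# HI1 values copied from the transcript (n = 21 observations).
hi1 = [
    0.176297, 0.157454, 0.127014, 0.119313, 0.086145, 0.077744, 0.065028,
    0.061276, 0.048147, 0.049628, 0.049313, 0.051829, 0.055760, 0.069311,
    0.072580, 0.109616, 0.127489, 0.141136, 0.140453, 0.163492, 0.050974,
]
p = 2  # assumed: simple linear regression with an intercept


def flag_high_leverage(leverages, n_params):
    """Return 1-based indices of cases with h_ii > 3*p/n or h_ii > 0.99."""
    n = len(leverages)
    cutoff = 3.0 * n_params / n        # three times the mean leverage p/n
    return [i for i, h in enumerate(leverages, start=1)
            if h > cutoff or h > 0.99]


cutoff = 3.0 * p / len(hi1)
flagged = flag_high_leverage(hi1, p)
print(f"sum of leverages = {sum(hi1):.4f}")
print(f"cutoff 3p/n      = {cutoff:.4f}")
print("flagged cases    =", flagged)
```

None of these 21 cases is flagged, since the largest leverage (0.176297) is well below 3p/n ≈ 0.2857. A leverage such as the 0.357535 reported for the x = 14 case on the next slide would exceed the same cutoff if that data set also has n = 21 and p = 2.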

  29. x       y       HI1
      14.00   68.00   0.357535

      Unusual Observations
      Obs   x      y       Fit      SE Fit   Residual   St Resid
      21    14.0   68.00   71.449   1.620    -3.449     -1.59 X
      X denotes an observation whose X value gives it large influence.

  30. x       y       HI2
      13.00   15.00   0.311532

      Unusual Observations
      Obs   x      y       Fit     SE Fit   Residual   St Resid
      21    13.0   15.00   51.66   5.83     -36.66     -4.23RX
      R denotes an observation with a large standardized residual.
      X denotes an observation whose X value gives it large influence.

  31. Identifying influential cases

  32. Influence • A case is influential if its exclusion causes major changes in the estimated regression function.

  33. Identifying influential cases
      • Difference in fits, DFITS
      • Cook’s distance measure

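  34. DFITS
      The difference in fits, $DFITS_i = (\hat{y}_i - \hat{y}_{i(i)}) / \sqrt{MSE_{(i)}\, h_{ii}}$, represents the number of standard deviations by which the fitted value increases or decreases when the ith case is included. Here $\hat{y}_{i(i)}$ and $MSE_{(i)}$ are the fitted value for case i and the mean squared error from the fit with the ith case deleted.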

  35. DFITS
      A case is influential if the absolute value of its DFITS value is
      • greater than 1 for small to medium data sets
      • greater than $2\sqrt{p/n}$ for large data sets
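To make the DFITS definition concrete, the sketch below computes it by refitting, using the four-point example from the deleted t residual slide (x = 1, 2, 3, 10). The helper names are invented, and the closed-form leverage applies only to simple linear regression. The large-sample cutoff $2\sqrt{p/n}$ is the commonly quoted guideline and is treated as an assumption here. With n = 4 the data set is small, so the cutoff of 1 is the one that applies.

```python
import math

x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
p = 2
n = len(x)


def fit_line(xs, ys):
    """Least squares intercept and slope for a straight line."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1


def leverage(i):
    """h_ii for simple linear regression (closed form)."""
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    return 1.0 / n + (x[i] - xbar) ** 2 / sxx


def dfits(i):
    """(fit with case i) - (fit without case i), in standard deviation units."""
    b0, b1 = fit_line(x, y)                           # all n cases
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    c0, c1 = fit_line(xs, ys)                         # case i deleted
    sse_del = sum((yj - (c0 + c1 * xj)) ** 2 for xj, yj in zip(xs, ys))
    mse_del = sse_del / (n - 1 - p)
    change = (b0 + b1 * x[i]) - (c0 + c1 * x[i])
    return change / math.sqrt(mse_del * leverage(i))


dfits_values = [dfits(i) for i in range(n)]
small_cutoff = 1.0
large_cutoff = 2.0 * math.sqrt(p / n)   # assumed guideline for large data sets
influential = [i + 1 for i, v in enumerate(dfits_values) if abs(v) > small_cutoff]

for xi, v in zip(x, dfits_values):
    print(f"x={xi:4.0f}  DFITS={v:10.4f}")
print("influential cases (|DFITS| > 1):", influential)
```

The case at x = 10 moves its own fitted value by more than a hundred standard deviations, so it dominates the fit.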

  36. x       y       DFIT1
      14.00   68.00   -1.23841

  37. x       y       DFIT2
      13.00   15.00   -11.4670

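  38. Cook’s distance
      Cook’s distance measure, $D_i = \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 / (p \cdot MSE)$, considers the influence of the ith case on all n fitted values. Here $\hat{y}_{j(i)}$ is the fitted value for case j when the ith case is deleted.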

  39. Cook’s distance
      • Relate $D_i$ to the F(p, n-p) distribution.
      • If $D_i$ is greater than the 50th percentile, F(0.50, p, n-p), then the ith case is considered highly influential.
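A minimal sketch of Cook's distance, again on the four-point example (x = 1, 2, 3, 10) from the deleted t residual slide. It computes $D_i$ directly from the definition (the summed squared change in all n fitted values when case i is dropped, divided by p·MSE) and checks it against the equivalent shortcut $D_i = e_i^2 h_{ii} / (p \cdot MSE \cdot (1 - h_{ii})^2)$. The helper names are assumptions for illustration. The comparison with F(0.50, p, n-p) is left out because it needs an F quantile function, which the Python standard library does not provide.

```python
x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
p = 2
n = len(x)


def fit_line(xs, ys):
    """Least squares intercept and slope for a straight line."""
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1


b0, b1 = fit_line(x, y)
residuals = [yj - (b0 + b1 * xj) for xj, yj in zip(x, y)]
mse = sum(e * e for e in residuals) / (n - p)


def cooks_distance_by_refit(i):
    """Sum over all n cases of the squared change in fit when case i is dropped."""
    c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    change = sum(((b0 + b1 * xj) - (c0 + c1 * xj)) ** 2 for xj in x)
    return change / (p * mse)


def cooks_distance_shortcut(i):
    """Same quantity from the full fit, using residual and leverage only."""
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    h = 1.0 / n + (x[i] - xbar) ** 2 / sxx   # leverage, simple regression only
    return residuals[i] ** 2 * h / (p * mse * (1.0 - h) ** 2)


cook_refit = [cooks_distance_by_refit(i) for i in range(n)]
cook_short = [cooks_distance_shortcut(i) for i in range(n)]
for xi, a, b in zip(x, cook_refit, cook_short):
    print(f"x={xi:4.0f}  D (refit)={a:9.4f}  D (shortcut)={b:9.4f}")
```

The shortcut shows why both a large residual and a large leverage are needed for a large $D_i$: the case at x = 10 has a modest residual but a leverage of 0.97, and that alone drives its distance far above the others.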

  40. x       y       COOK1
      14.00   68.00   0.701960

  41. x       y       COOK2
      13.00   15.00   4.04801
