
Outliers and influential data points

Outliers and influential data points. The distinction. An outlier is a data point whose response y does not follow the general trend of the rest of the data.

plato-russo

Presentation Transcript


  1. Outliers and influential data points

  2. The distinction • An outlier is a data point whose response y does not follow the general trend of the rest of the data. • A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated slope coefficients, hypothesis test results, etc.

  3. No outliers? No influential data points?

  4. Any outliers? Any influential data points?

  5. Any outliers? Any influential data points?

  6. Without the blue data point:

     The regression equation is y = 1.73 + 5.12 x

     Predictor    Coef  SE Coef      T      P
     Constant    1.732    1.121   1.55  0.140
     x          5.1169   0.2003  25.55  0.000

     S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

     With the blue data point:

     The regression equation is y = 2.96 + 5.04 x

     Predictor    Coef  SE Coef      T      P
     Constant    2.958    2.009   1.47  0.157
     x          5.0373   0.3633  13.86  0.000

     S = 4.711   R-Sq = 91.0%   R-Sq(adj) = 90.5%

  7. Any outliers? Any influential data points?

  8. Any outliers? Any influential data points?

  9. Without the blue data point:

     The regression equation is y = 1.73 + 5.12 x

     Predictor    Coef  SE Coef      T      P
     Constant    1.732    1.121   1.55  0.140
     x          5.1169   0.2003  25.55  0.000

     S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

     With the blue data point:

     The regression equation is y = 2.47 + 4.93 x

     Predictor    Coef  SE Coef      T      P
     Constant    2.468    1.076   2.29  0.033
     x          4.9272   0.1719  28.66  0.000

     S = 2.709   R-Sq = 97.7%   R-Sq(adj) = 97.6%

  10. Any outliers? Any influential data points?

  11. Any outliers? Any influential data points?

  12. Without the blue data point:

      The regression equation is y = 1.73 + 5.12 x

      Predictor    Coef  SE Coef      T      P
      Constant    1.732    1.121   1.55  0.140
      x          5.1169   0.2003  25.55  0.000

      S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

      With the blue data point:

      The regression equation is y = 8.50 + 3.32 x

      Predictor    Coef  SE Coef      T      P
      Constant    8.505    4.222   2.01  0.058
      x          3.3198   0.6862   4.84  0.000

      S = 10.45   R-Sq = 55.2%   R-Sq(adj) = 52.8%

  13. Impact on regression analyses • Not every outlier strongly influences the regression analysis. • Always determine if the regression analysis is unduly influenced by one or a few data points. • For simple linear regression, simple plots are enough to check this. • For multiple linear regression, use summary measures.

  14. The leverages $h_{ii}$

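  15. The leverages $h_{ii}$: The predicted response can be written as a linear combination of the n observed values $y_1, y_2, \ldots, y_n$:

      $$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$$

      where the weights $h_{i1}, h_{i2}, \ldots, h_{ii}, \ldots, h_{in}$ depend only on the predictor values. For example, for i = 1:

      $$\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n$$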

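  16. The leverages $h_{ii}$: Because the predicted response can be written as:

      $$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$$

      the leverage, $h_{ii}$, quantifies the influence that the observed response $y_i$ has on its predicted value $\hat{y}_i$.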

  17. Properties of the leverages $h_{ii}$ • The leverage $h_{ii}$ is a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points. • The leverage $h_{ii}$ is a number between 0 and 1, inclusive. • The sum of the $h_{ii}$ equals p, the number of parameters (regression coefficients, including the intercept).
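The properties above can be checked numerically. The following is a minimal sketch, not part of the original slides: it assumes the standard hat matrix H = X (X'X)^-1 X', whose entries are the weights $h_{ij}$, and borrows the four-point data set (x = 1, 2, 3, 4; y = 2, 5, 6, 9) from the residuals example later in the deck. The variable names are illustrative.

```python
import numpy as np

# Four-point example taken from the residuals slides later in the deck.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])

# Design matrix with an intercept column, so p = 2 parameters.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^-1 X'; row i holds the weights h_i1, ..., h_in.
H = X @ np.linalg.inv(X.T @ X) @ X.T

leverages = np.diag(H)   # the h_ii
fitted = H @ y           # each fitted value is a weighted sum of the observed y's

print("leverages:", leverages)
print("sum of leverages:", leverages.sum())
print("fitted values:", fitted)
```

The two end points (x = 1 and x = 4) lie farthest from the mean of x and get the largest leverages, and the leverages sum to p = 2.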

  18. Any high leverages $h_{ii}$?

  19. HI1

      0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
      0.061276  0.050974  0.049628  0.048147  0.049313  0.051829  0.055760
      0.069311  0.072580  0.109616  0.127489  0.140453  0.141136  0.163492

      Sum of HI1 = 2.0000

  20. Any high leverages $h_{ii}$?

  21. HI1

      0.153481  0.139367  0.116292  0.110382  0.084374  0.077557  0.066879
      0.063589  0.050033  0.052121  0.047632  0.048156  0.049557  0.055893
      0.057574  0.078121  0.088549  0.096634  0.096227  0.110048  0.357535

      Sum of HI1 = 2.0000

  22. Identifying data points whose x values are extreme ... and therefore potentially influential

  23. Using leverages to identify extreme x values: Minitab flags any observation whose leverage value, $h_{ii}$, is more than 3 times the mean leverage value, $\bar{h} = p/n$, or greater than 0.99 (whichever is smaller).
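As a rough illustration of this flagging rule, the sketch below applies it to the 21 leverages listed on slide 21. It assumes p = 2 (intercept plus slope), which agrees with "Sum of HI1 = 2.0000"; the variable names are illustrative and this is not Minitab's own code.

```python
# Leverages (HI1) copied from slide 21; observation numbers are 1-based.
leverages = [
    0.153481, 0.139367, 0.116292, 0.110382, 0.084374, 0.077557, 0.066879,
    0.063589, 0.050033, 0.052121, 0.047632, 0.048156, 0.049557, 0.055893,
    0.057574, 0.078121, 0.088549, 0.096634, 0.096227, 0.110048, 0.357535,
]

n = len(leverages)
p = 2  # assumed: intercept + one slope

mean_leverage = p / n                      # the leverages sum to p
cutoff = min(3 * mean_leverage, 0.99)      # whichever threshold is smaller

flagged = [i + 1 for i, h in enumerate(leverages) if h > cutoff]

print(f"cutoff = {cutoff:.4f}")
print("flagged observations:", flagged)
```

With n = 21 the cutoff is 3(2/21), about 0.286, so only observation 21 (leverage 0.357535) is flagged.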

  24. x      y      HI1
      14.00  68.00  0.357535

      Unusual Observations
      Obs     x      y     Fit  SE Fit  Residual  St Resid
       21  14.0  68.00  71.449   1.620    -3.449    -1.59 X

      X denotes an observation whose X value gives it large influence.

  25. x      y      HI2
      13.00  15.00  0.311532

      Unusual Observations
      Obs     x      y    Fit  SE Fit  Residual  St Resid
       21  13.0  15.00  51.66    5.83    -36.66    -4.23RX

      R denotes an observation with a large standardized residual.
      X denotes an observation whose X value gives it large influence.

  26. Important distinction! • The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis. • The leverage depends only on the predictor values. • Whether the data point is influential or not depends on the observed value yi.

  27. Identifying outliers (unusual y values)

  28. Identifying outliers • Residuals • Standardized residuals • also called internally studentized residuals

  29. Residuals: Ordinary residuals defined for each observation, i = 1, …, n:

      $$e_i = y_i - \hat{y}_i$$

      x  y  FITS1  RESI1
      1  2    2.2   -0.2
      2  5    4.4    0.6
      3  6    6.6   -0.6
      4  9    8.8    0.2

  30. Standardized residuals: Standardized residuals defined for each observation, i = 1, …, n:

      $$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

      MSE1 = 0.400000

      x  y  FITS1  RESI1  HI1     SRES1
      1  2    2.2   -0.2  0.7  -0.57735
      2  5    4.4    0.6  0.3   1.13389
      3  6    6.6   -0.6  0.3  -1.13389
      4  9    8.8    0.2  0.7   0.57735
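The table on this slide can be reproduced in a few lines. This is a sketch under stated assumptions: it uses the closed-form leverage for simple linear regression, h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which is a standard result rather than something stated on the slides, and the variable names simply mirror the Minitab column labels.

```python
import numpy as np

# Data from the slide.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
n, p = len(x), 2  # p = intercept + slope

# Least-squares fit of y on x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fits = X @ beta                  # FITS1
resid = y - fits                 # RESI1: ordinary residuals e_i
mse = resid @ resid / (n - p)    # MSE1

# Leverages for simple linear regression (HI1).
dev = x - x.mean()
h = 1.0 / n + dev**2 / (dev**2).sum()

# Standardized (internally studentized) residuals (SRES1).
sres = resid / np.sqrt(mse * (1.0 - h))

print("MSE :", mse)
print("HI1 :", h)
print("SRES:", sres)
```

Note that the two interior points have the larger raw residuals (±0.6) but also the smaller leverages, so dividing by sqrt(MSE(1 - h_ii)) changes their size relative to the end points.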

  31. Standardized residuals • Standardized residuals quantify how large the residuals are in standard deviation units. • An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier. • Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).

  32. An outlier?

  33. S = 4.711

      x        y        FITS1    HI1       s(e)     RESI1    SRES1
      0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
      0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
      1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
      1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
      2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
      ...
      8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
      9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
      4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110

      Unusual Observations
      Obs     x      y    Fit  SE Fit  Residual  St Resid
       21  4.00  40.00  23.11    1.06     16.89     3.68R

      R denotes an observation with a large standardized residual.

  34. Why should we care? (Regression of y on x with outlier)

      The regression equation is y = 2.95763 + 5.03734 x

      S = 4.71075   R-Sq = 91.0 %   R-Sq(adj) = 90.5 %

      Analysis of Variance
      Source      DF       SS       MS        F      P
      Regression   1  4265.82  4265.82  192.230  0.000
      Error       19   421.63    22.19
      Total       20  4687.46

  35. Why should we care? (Regression of y on x without outlier)

      The regression equation is y = 1.73217 + 5.11687 x

      S = 2.5919   R-Sq = 97.3 %   R-Sq(adj) = 97.2 %

      Analysis of Variance
      Source      DF       SS       MS        F      P
      Regression   1  4386.07  4386.07  652.841  0.000
      Error       18   120.93     6.72
      Total       19  4507.00
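To see where the differences between the two outputs come from, the summary figures can be recomputed from the ANOVA sums of squares on slides 34 and 35. This is only an arithmetic sketch: it uses the usual definitions MSE = SSE / df, S = sqrt(MSE), R-Sq = SSR / (SSR + SSE) and F = MSR / MSE, and small discrepancies from the printed values are expected because the printed sums of squares are rounded.

```python
import math

# Sums of squares and error degrees of freedom copied from the two ANOVA tables.
anova = {
    "with outlier": {"ssr": 4265.82, "sse": 421.63, "df_error": 19},
    "without outlier": {"ssr": 4386.07, "sse": 120.93, "df_error": 18},
}

summary = {}
for label, a in anova.items():
    mse = a["sse"] / a["df_error"]           # mean squared error
    summary[label] = {
        "mse": mse,
        "s": math.sqrt(mse),                 # residual standard deviation S
        "r2": a["ssr"] / (a["ssr"] + a["sse"]),
        "f": a["ssr"] / mse,                 # MSR = SSR because regression df = 1
    }

for label, stats in summary.items():
    print(label, {k: round(v, 3) for k, v in stats.items()})
```

A single outlier roughly triples the error sum of squares here, which inflates S, lowers R-Sq, and shrinks the F statistic.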

  36. Identifying influential data points

  37. Identifying influential data points • Deleted residuals • Deleted t residuals • also called studentized deleted residuals • also called externally studentized residuals • Difference in fits, DFITS • Cook’s distance measure

  38. Basic idea of these four measures • Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations. • Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.
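The delete-and-refit idea can be sketched directly. The example below is illustrative only: it uses the four-point data set that appears a few slides later (x = 1, 2, 3, 10; y = 2.1, 3.8, 5.2, 2.1), and the helper `fit_line` is a hypothetical name, implemented here with `numpy.polyfit`.

```python
import numpy as np

# Four-point data set from the deleted t residuals slide.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line (hypothetical helper)."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope


# Fit using all n observations.
full = fit_line(x, y)

# Refit n times, each time leaving out one observation.
loo = [fit_line(np.delete(x, i), np.delete(y, i)) for i in range(len(x))]

print(f"all data : intercept={full[0]:.3f}, slope={full[1]:.3f}")
for i, (b0, b1) in enumerate(loo, start=1):
    print(f"without {i}: intercept={b0:.3f}, slope={b1:.3f}")
```

Dropping the fourth point (x = 10) flips the slope from about -0.13 to about 1.55, while dropping any of the other points leaves the slope negative, which is the kind of contrast the four measures are meant to quantify.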

  39. Deleted residuals: $y_i$ = the observed response for the ith observation; $\hat{y}_{(i)}$ = the predicted response for the ith observation based on the estimated model with the ith observation deleted. Deleted residual:

      $$d_i = y_i - \hat{y}_{(i)}$$

  40. Deleted t residuals: A deleted t residual is just a standardized deleted residual:

      $$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$

      The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.

  41.  x    y  RESI1     TRES1
       1  2.1  -1.59   -1.7431
       2  3.8   0.24    0.1217
       3  5.2   1.77    1.6361
      10  2.1  -0.42  -19.7990
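The RESI1 and TRES1 columns above can be reproduced without refitting the model n times. This sketch relies on two standard identities that are not stated on the slides and should be read as assumptions of the example: d_i = e_i / (1 - h_ii), and t_i = e_i * sqrt((n - p - 1) / (SSE (1 - h_ii) - e_i²)).

```python
import numpy as np

# Data from the slide.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2  # p = intercept + slope

# Fit on all n observations.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta             # RESI1: ordinary residuals e_i
sse = resid @ resid

# Leverages for simple linear regression.
dev = x - x.mean()
h = 1.0 / n + dev**2 / (dev**2).sum()

# Deleted residuals via the shortcut d_i = e_i / (1 - h_ii).
deleted_resid = resid / (1.0 - h)

# Deleted t residuals (TRES1) via the shortcut formula.
tres = resid * np.sqrt((n - p - 1) / (sse * (1.0 - h) - resid**2))

print("RESI1  :", np.round(resid, 2))
print("deleted:", np.round(deleted_resid, 2))
print("TRES1  :", np.round(tres, 4))
```

The fourth point has the smallest ordinary residual (-0.42) but by far the largest deleted t residual, because its leverage of 0.97 pulls the fitted line toward it.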

  42. Do any of the deleted t residuals stick out like a sore thumb?

  43. Row  x        y        RESI1    SRES1     TRES1
        1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
        2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
        3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
      ...
       19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
       20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
       21  4.00000  40.0000  16.8930   3.68110   6.69012

  44. Do any of the deleted t residuals stick out like a sore thumb?

  45. DFITS: The difference in fits:

      $$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}$$

      is the number of standard deviations that the fitted value changes when the ith case is omitted.
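A direct, if inefficient, way to compute DFITS is to follow the definition literally and refit with each case omitted. The sketch below does that for the four-point data set from slide 41; it assumes the usual scaling by sqrt(MSE_(i) h_ii), and the helper `ols` is a hypothetical name.

```python
import numpy as np

# Four-point data set from slide 41.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.8, 5.2, 2.1])
n, p = len(x), 2  # p = intercept + slope

X = np.column_stack([np.ones(n), x])


def ols(X, y):
    """Least-squares coefficients (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


fits = X @ ols(X, y)                               # fitted values from all n cases
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages h_ii

dfits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                       # drop the ith case
    beta_i = ols(X[keep], y[keep])
    resid_i = y[keep] - X[keep] @ beta_i
    mse_i = resid_i @ resid_i / ((n - 1) - p)      # MSE with case i deleted
    fit_i = X[i] @ beta_i                          # prediction for case i without it
    dfits[i] = (fits[i] - fit_i) / np.sqrt(mse_i * h[i])

print("DFITS:", np.round(dfits, 3))
```

For this tiny data set the fourth point dominates: its fitted value moves by more than a hundred standard deviations when it is left out.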

  46. Using DFITS An observation is deemed influential … … if the absolute value of its DFIT value is greater than: … or if the absolute value of its DFIT value sticks out like a sore thumb from the other DFIT values.

  47. x      y      DFIT1
      14.00  68.00  -1.23841

  48. Row  x        y        DFIT1
        1   0.1000  -0.0716  -0.52503
        2   0.4540   4.1673  -0.08388
        3   1.0977   6.5703  -0.18232
        4   1.2794  13.8150   0.75898
        5   2.2061  11.4501  -0.21823
        6   2.5006  12.9554  -0.20155
        7   3.0403  20.1575   0.27774
        8   3.2358  17.5633  -0.08230
        9   4.4531  26.0317   0.13865
       10   4.1699  22.7573  -0.02221
       11   5.2847  26.3030  -0.18487
       12   5.5924  30.6885   0.05523
       13   5.9209  33.9402   0.19741
       14   6.6607  30.9228  -0.42449
       15   6.7995  34.1100  -0.17249
       16   7.9794  44.4536   0.29918
       17   8.4154  46.5022   0.30960
       18   8.7161  50.0568   0.63049
       19   8.7016  46.5475   0.14948
       20   9.1646  45.7762  -0.25094
       21  14.0000  68.0000  -1.23841

  49. x      y      DFIT2
      13.00  15.00  -11.4670
