Outliers and influential data points
The distinction
• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
Without the blue data point:

The regression equation is
y = 1.73 + 5.12 x

Predictor   Coef     SE Coef   T       P
Constant    1.732    1.121     1.55    0.140
x           5.1169   0.2003    25.55   0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is
y = 2.96 + 5.04 x

Predictor   Coef     SE Coef   T       P
Constant    2.958    2.009     1.47    0.157
x           5.0373   0.3633    13.86   0.000

S = 4.711   R-Sq = 91.0%   R-Sq(adj) = 90.5%
Without the blue data point:

The regression equation is
y = 1.73 + 5.12 x

Predictor   Coef     SE Coef   T       P
Constant    1.732    1.121     1.55    0.140
x           5.1169   0.2003    25.55   0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is
y = 2.47 + 4.93 x

Predictor   Coef     SE Coef   T       P
Constant    2.468    1.076     2.29    0.033
x           4.9272   0.1719    28.66   0.000

S = 2.709   R-Sq = 97.7%   R-Sq(adj) = 97.6%
Without the blue data point:

The regression equation is
y = 1.73 + 5.12 x

Predictor   Coef     SE Coef   T       P
Constant    1.732    1.121     1.55    0.140
x           5.1169   0.2003    25.55   0.000

S = 2.592   R-Sq = 97.3%   R-Sq(adj) = 97.2%

With the blue data point:

The regression equation is
y = 8.50 + 3.32 x

Predictor   Coef     SE Coef   T       P
Constant    8.505    4.222     2.01    0.058
x           3.3198   0.6862    4.84    0.000

S = 10.45   R-Sq = 55.2%   R-Sq(adj) = 52.8%
Impact on regression analyses
• Not every outlier strongly influences the regression analysis.
• Always determine if the regression analysis is unduly influenced by one or a few data points.
• Simple plots for simple linear regression.
• Summary measures for multiple linear regression.
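The leverages hii
The predicted response can be written as a linear combination of the n observed values y1, y2, …, yn:

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$$

where the weights hi1, hi2, …, hii, …, hin depend only on the predictor values. For example, for the first observation:

$$\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n$$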
The leverages hii
Because the predicted response can be written as:

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$$

the leverage, hii, quantifies the influence that the observed response yi has on its own predicted value ŷi.
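For simple linear regression with an intercept, the weights have a closed form, h_ij = 1/n + (x_i − x̄)(x_j − x̄) / Σ(x_k − x̄)². The sketch below is only an illustration of that special case, not a general hat-matrix routine. It uses the four-point data set from the residuals example in this deck, and the function name `hat_weights` is made up for the example. It checks that the weighted sums of the observed responses reproduce the fitted values.

```python
def hat_weights(x):
    """Weights h_ij for simple linear regression with an intercept:
    h_ij = 1/n + (x_i - xbar) * (x_j - xbar) / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [[1 / n + (xi - xbar) * (xj - xbar) / sxx for xj in x] for xi in x]


# Four-point data set from the residuals example (x, y columns).
x = [1, 2, 3, 4]
y = [2, 5, 6, 9]

H = hat_weights(x)

# Each fitted value is a linear combination of ALL observed responses.
fitted = [sum(h_ij * y_j for h_ij, y_j in zip(row, y)) for row in H]

# The leverages are the diagonal weights h_ii.
leverages = [H[i][i] for i in range(len(x))]

print([round(f, 4) for f in fitted])     # compare with the FITS1 column
print([round(h, 4) for h in leverages])  # compare with the HI1 column
```

The weights are computed from x alone, which is why the leverages do not change when a response value changes.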
Properties of the leverages hii
• The leverage hii is:
  • a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
  • a number between 0 and 1, inclusive.
• The sum of the hii equals p, the number of parameters.
HI1
0.176297   0.157454   0.127014   0.119313   0.086145   0.077744   0.065028
0.061276   0.050974   0.049628   0.048147   0.049313   0.051829   0.055760
0.069311   0.072580   0.109616   0.127489   0.140453   0.141136   0.163492

Sum of HI1 = 2.0000
HI1
0.153481   0.139367   0.116292   0.110382   0.084374   0.077557   0.066879
0.063589   0.050033   0.052121   0.047632   0.048156   0.049557   0.055893
0.057574   0.078121   0.088549   0.096634   0.096227   0.110048   0.357535

Sum of HI1 = 2.0000
Identifying data points whose x values are extreme ... and therefore potentially influential
Using leverages to identify extreme x values
Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value:

$$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$$

or whose leverage is greater than 0.99, whichever cutoff is smaller.
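As a rough check of this rule, the sketch below computes the leverages for simple linear regression from the closed form h_ii = 1/n + (x_i − x̄)² / Σ(x_k − x̄)² and applies the cutoff min(3p/n, 0.99). The x values are the 21 values listed in the DFIT1 table of this deck. The function names are hypothetical, and this is not Minitab's own code.

```python
def leverages(x):
    """Leverages h_ii for simple linear regression with an intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]


def flag_high_leverage(h, p):
    """Return 1-based observation numbers whose leverage exceeds
    the smaller of 3 * (mean leverage) = 3p/n and 0.99."""
    n = len(h)
    cutoff = min(3 * p / n, 0.99)
    return [i + 1 for i, h_ii in enumerate(h) if h_ii > cutoff]


# x values taken from the DFIT1 table (21 observations).
x = [0.1000, 0.4540, 1.0977, 1.2794, 2.2061, 2.5006, 3.0403,
     3.2358, 4.4531, 4.1699, 5.2847, 5.5924, 5.9209, 6.6607,
     6.7995, 7.9794, 8.4154, 8.7161, 8.7016, 9.1646, 14.0000]

h = leverages(x)
flagged = flag_high_leverage(h, p=2)  # p = 2: intercept and slope

print(round(sum(h), 4))   # the leverages sum to p = 2
print(round(h[-1], 4))    # leverage of the point at x = 14
print(flagged)            # [21]
```

With n = 21 and p = 2 the cutoff is 6/21 ≈ 0.286, so only the observation at x = 14 is flagged.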
x       y       HI1
14.00   68.00   0.357535

Unusual Observations
Obs   x      y       Fit      SE Fit   Residual   St Resid
21    14.0   68.00   71.449   1.620    -3.449     -1.59 X

X denotes an observation whose X value gives it large influence.
x       y       HI2
13.00   15.00   0.311532

Unusual Observations
Obs   x      y       Fit     SE Fit   Residual   St Resid
21    13.0   15.00   51.66   5.83     -36.66     -4.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
Important distinction!
• The leverage merely quantifies the potential for a data point to exert strong influence on the regression analysis.
• The leverage depends only on the predictor values.
• Whether the data point is influential or not depends on the observed value yi.
Identifying outliers
• Residuals
• Standardized residuals
  • also called internally studentized residuals
Residuals
Ordinary residuals defined for each observation, i = 1, …, n:

$$e_i = y_i - \hat{y}_i$$

x   y   FITS1   RESI1
1   2   2.2     -0.2
2   5   4.4     0.6
3   6   6.6     -0.6
4   9   8.8     0.2
Standardized residuals
Standardized residuals defined for each observation, i = 1, …, n:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

MSE1
0.400000

x   y   FITS1   RESI1   HI1   SRES1
1   2   2.2     -0.2    0.7   -0.57735
2   5   4.4     0.6     0.3   1.13389
3   6   6.6     -0.6    0.3   -1.13389
4   9   8.8     0.2     0.7   0.57735
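The SRES1 column can be reproduced by hand. The following is a minimal sketch for simple linear regression only, assuming the standardized residual is e_i / sqrt(MSE (1 − h_ii)) with MSE = SSE / (n − 2). The helper names are invented for this example.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def standardized_residuals(x, y):
    """Return (residuals, leverages, MSE, standardized residuals)."""
    n, p = len(x), 2  # p = 2 parameters: intercept and slope
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    mse = sum(e ** 2 for e in resid) / (n - p)
    sres = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, lev)]
    return resid, lev, mse, sres


x = [1, 2, 3, 4]
y = [2, 5, 6, 9]
resid, lev, mse, sres = standardized_residuals(x, y)

print(round(mse, 4))                  # compare with MSE1
print([round(r, 5) for r in sres])    # compare with the SRES1 column
```

Note that the two end points have the smallest raw residuals but, because of their higher leverage (0.7), their standardized residuals are scaled up more than those of the middle points.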
Standardized residuals
• Standardized residuals quantify how large the residuals are in standard deviation units.
• An observation with a standardized residual that is larger than 3 (in absolute value) is generally deemed an outlier.
• Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).
S = 4.711

x         y         FITS1     HI1        s(e)      RESI1     SRES1
0.10000   -0.0716   3.4614    0.176297   4.27561   -3.5330   -0.82635
0.45401   4.1673    5.2446    0.157454   4.32424   -1.0774   -0.24916
1.09765   6.5703    8.4869    0.127014   4.40166   -1.9166   -0.43544
1.27936   13.8150   9.4022    0.119313   4.42103   4.4128    0.99818
2.20611   11.4501   14.0706   0.086145   4.50352   -2.6205   -0.58191
...
8.70156   46.5475   46.7904   0.140453   4.36765   -0.2429   -0.05561
9.16463   45.7762   49.1230   0.163492   4.30872   -3.3468   -0.77679
4.00000   40.0000   23.1070   0.050974   4.58936   16.8930   3.68110

Unusual Observations
Obs   x      y       Fit     SE Fit   Residual   St Resid
21    4.00   40.00   23.11   1.06     16.89      3.68R

R denotes an observation with a large standardized residual.
Why should we care? (Regression of y on x with outlier)

The regression equation is
y = 2.95763 + 5.03734 x

S = 4.71075   R-Sq = 91.0 %   R-Sq(adj) = 90.5 %

Analysis of Variance
Source       DF   SS        MS        F         P
Regression   1    4265.82   4265.82   192.230   0.000
Error        19   421.63    22.19
Total        20   4687.46
Why should we care? (Regression of y on x without outlier)

The regression equation is
y = 1.73217 + 5.11687 x

S = 2.5919   R-Sq = 97.3 %   R-Sq(adj) = 97.2 %

Analysis of Variance
Source       DF   SS        MS        F         P
Regression   1    4386.07   4386.07   652.841   0.000
Error        18   120.93    6.72
Total        19   4507.00
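The reason to care shows up directly in the arithmetic of the two ANOVA tables. The sketch below recomputes MSE, S, F, and R-Sq from the printed sums of squares and degrees of freedom, using the standard relations MSE = SSE/DF, S = sqrt(MSE), F = MSR/MSE, and R-Sq = SSR/(SSR + SSE). The function name `anova_summary` is made up for the example, and small differences from the printed output are due to rounding in the tables.

```python
import math


def anova_summary(ss_regression, ss_error, df_regression, df_error):
    """Recompute summary statistics from an ANOVA table's SS and DF columns."""
    ms_regression = ss_regression / df_regression
    mse = ss_error / df_error
    return {
        "MSE": mse,
        "S": math.sqrt(mse),
        "F": ms_regression / mse,
        "R-Sq": ss_regression / (ss_regression + ss_error),
    }


# Values copied from the two ANOVA tables above.
with_outlier = anova_summary(4265.82, 421.63, 1, 19)
without_outlier = anova_summary(4386.07, 120.93, 1, 18)

for label, result in [("with outlier", with_outlier),
                      ("without outlier", without_outlier)]:
    print(label, {name: round(value, 3) for name, value in result.items()})

# How much the single outlier inflates the error variance estimate.
print(round(with_outlier["MSE"] / without_outlier["MSE"], 2))
```

One data point out of 21 more than triples the MSE, which widens every confidence and prediction interval built from it.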
Identifying influential data points
• Deleted residuals
• Deleted t residuals
  • also called studentized deleted residuals
  • also called externally studentized residuals
• Difference in fits, DFITS
• Cook’s distance measure
Basic idea of these four measures
• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.
• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.
Deleted residuals
yi = the observed response for the ith observation
ŷ(i) = the predicted response for the ith observation, based on the estimated model with the ith observation deleted

Deleted residual:

$$d_i = y_i - \hat{y}_{(i)}$$
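Deleted t residuals
A deleted t residual is just a standardized deleted residual:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$

where MSE(i) is the mean square error of the model fitted with the ith observation deleted. The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.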
x    y     RESI1   TRES1
1    2.1   -1.59   -1.7431
2    3.8   0.24    0.1217
3    5.2   1.77    1.6361
10   2.1   -0.42   -19.7990
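The four-point table can be reproduced by literally deleting each observation and refitting. The sketch below does that for simple linear regression, assuming t_i = e_i / sqrt(MSE_(i) (1 − h_ii)), where MSE_(i) comes from the fit without observation i. The function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def deleted_residuals(x, y):
    """For each i, return (e_i, d_i, t_i): ordinary residual,
    deleted residual, and deleted t residual."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    rows = []
    for i in range(n):
        # Refit on the remaining n - 1 observations.
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(x_del, y_del)
        sse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(x_del, y_del))
        mse_del = sse_del / ((n - 1) - p)

        e_i = y[i] - (b0 + b1 * x[i])   # residual from the full fit
        d_i = y[i] - (a0 + a1 * x[i])   # deleted residual
        h_ii = 1 / n + (x[i] - xbar) ** 2 / sxx
        t_i = e_i / math.sqrt(mse_del * (1 - h_ii))
        rows.append((e_i, d_i, t_i))
    return rows


x = [1, 2, 3, 10]
y = [2.1, 3.8, 5.2, 2.1]
rows = deleted_residuals(x, y)

for xi, (e_i, d_i, t_i) in zip(x, rows):
    print(xi, round(e_i, 2), round(d_i, 2), round(t_i, 4))
```

The point at x = 10 has the smallest ordinary residual in absolute value, yet its deleted residual is about -14 and its deleted t residual is about -19.8. The full fit is pulled toward that point, which is exactly what the ordinary residual hides.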
Do any of the deleted t residuals stick out like a sore thumb?
Row   x         y         RESI1     SRES1      TRES1
1     0.10000   -0.0716   -3.5330   -0.82635   -0.81916
2     0.45401   4.1673    -1.0774   -0.24916   -0.24291
3     1.09765   6.5703    -1.9166   -0.43544   -0.42596
...
19    8.70156   46.5475   -0.2429   -0.05561   -0.05413
20    9.16463   45.7762   -3.3468   -0.77679   -0.76837
21    4.00000   40.0000   16.8930   3.68110    6.69012
DFITS
The difference in fits:

$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}$$

is the number of standard deviations by which the fitted value changes when the ith case is omitted.
Using DFITS
An observation is deemed influential
• if the absolute value of its DFITS value is greater than a cutoff that depends on the number of parameters p and the sample size n,
• or if the absolute value of its DFITS value sticks out like a sore thumb from the other DFITS values.
x       y       DFIT1
14.00   68.00   -1.23841
Row   x         y         DFIT1
1     0.1000    -0.0716   -0.52503
2     0.4540    4.1673    -0.08388
3     1.0977    6.5703    -0.18232
4     1.2794    13.8150   0.75898
5     2.2061    11.4501   -0.21823
6     2.5006    12.9554   -0.20155
7     3.0403    20.1575   0.27774
8     3.2358    17.5633   -0.08230
9     4.4531    26.0317   0.13865
10    4.1699    22.7573   -0.02221
11    5.2847    26.3030   -0.18487
12    5.5924    30.6885   0.05523
13    5.9209    33.9402   0.19741
14    6.6607    30.9228   -0.42449
15    6.7995    34.1100   -0.17249
16    7.9794    44.4536   0.29918
17    8.4154    46.5022   0.30960
18    8.7161    50.0568   0.63049
19    8.7016    46.5475   0.14948
20    9.1646    45.7762   -0.25094
21    14.0000   68.0000   -1.23841
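The DFIT1 column can be approximated from the x and y columns of this table by leave-one-out refitting. The sketch below assumes the usual definition, (ŷ_i − ŷ_(i)) / sqrt(MSE_(i) h_ii), for simple linear regression. The cutoff used here, 2·sqrt((p+1)/(n−p−1)), is an assumption: it is one commonly quoted guideline and may not be the one intended in this deck. Function names are hypothetical, and results differ slightly from Minitab's because the tabled data are rounded.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def dfits(x, y):
    """Difference in fits for each observation, by deleting and refitting."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    values = []
    for i in range(n):
        x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(x_del, y_del)
        sse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(x_del, y_del))
        mse_del = sse_del / ((n - 1) - p)
        h_ii = 1 / n + (x[i] - xbar) ** 2 / sxx
        fit_all = b0 + b1 * x[i]   # fitted value using all n observations
        fit_del = a0 + a1 * x[i]   # prediction with observation i deleted
        values.append((fit_all - fit_del) / math.sqrt(mse_del * h_ii))
    return values


x = [0.1000, 0.4540, 1.0977, 1.2794, 2.2061, 2.5006, 3.0403,
     3.2358, 4.4531, 4.1699, 5.2847, 5.5924, 5.9209, 6.6607,
     6.7995, 7.9794, 8.4154, 8.7161, 8.7016, 9.1646, 14.0000]
y = [-0.0716, 4.1673, 6.5703, 13.8150, 11.4501, 12.9554, 20.1575,
     17.5633, 26.0317, 22.7573, 26.3030, 30.6885, 33.9402, 30.9228,
     34.1100, 44.4536, 46.5022, 50.0568, 46.5475, 45.7762, 68.0000]

values = dfits(x, y)
n, p = len(x), 2
cutoff = 2 * math.sqrt((p + 1) / (n - p - 1))  # ASSUMED guideline, see lead-in
flagged = [i + 1 for i, v in enumerate(values) if abs(v) > cutoff]

print(round(cutoff, 3))
print(round(values[-1], 3))   # compare with DFIT1 for row 21
print(flagged)
```

Under the assumed cutoff (about 0.82 here), only row 21 is flagged, and even then its DFITS value is modest next to the -11.4670 reported for the point at x = 13, y = 15.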
x       y       DFIT2
13.00   15.00   -11.4670