Outliers and influential data points

The distinction
• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point is influential if it unduly influences any part of a regression analysis, such as predicted responses, estimated beta coefficients, hypothesis test results, etc.

No outliers? No influential data points?

Any outliers? Any influential data points?

Impact on regression analyses
• Not every outlier strongly influences the regression analysis.
• Always determine if the regression analysis is unduly influenced by one or a few data points.
• Simple plots for simple linear regression.
• Summary measures for multiple linear regression.

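The leverages $h_i$

The leverage, $h_i$, quantifies the influence that the observed response $y_i$ has on its own predicted value $\hat{y}_i$.

The predicted response can be written as a linear combination of the n observed values $y_1, y_2, \ldots, y_n$:

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{ii} y_i + \cdots + h_{in} y_n$$

where the weights $h_{i1}, h_{i2}, \ldots, h_{ii}, \ldots, h_{in}$ depend only on the predictor values. The leverage $h_i$ is the weight $h_{ii}$ that the ith observed response receives in its own prediction.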

Properties of the leverages $h_i$
• The leverage $h_i$ is:
  • a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
  • a number between 0 and 1, inclusive.
• The sum of the $h_i$ equals p, the number of parameters (regression coefficients, including the intercept).
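The properties above can be checked numerically. The following is a minimal sketch for simple linear regression with an intercept, where the leverage reduces to $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The function name `leverages` and the four x values are assumptions chosen for illustration only.

```python
def leverages(x):
    """Leverages h_i for a simple linear regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # h_i grows with the squared distance of x_i from the mean of the x values.
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0]  # illustrative predictor values (assumed)
p = 2                     # parameters: intercept and slope
h = leverages(x)

print("leverages:", [round(hi, 4) for hi in h])
print("sum of leverages:", round(sum(h), 4), "(should equal p =", p, ")")
```

The x values furthest from the mean receive the largest leverages, and the leverages add up to the number of parameters.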

Any high leverages $h_i$?

HI1
0.176297 0.157454 0.127014 0.119313 0.086145 0.077744 0.065028
0.061276 0.050974 0.049628 0.048147 0.049313 0.051829 0.055760
0.069311 0.072580 0.109616 0.127489 0.140453 0.141136 0.163492
Sum of HI1 = 2.0000

Any high leverages $h_i$?

HI1
0.153481 0.139367 0.116292 0.110382 0.084374 0.077557 0.066879
0.063589 0.050033 0.052121 0.047632 0.048156 0.049557 0.055893
0.057574 0.078121 0.088549 0.096634 0.096227 0.110048 0.357535
Sum of HI1 = 2.0000

Identifying data points whose x values are extreme, and therefore potentially influential

Using leverages to identify extreme x values
Minitab flags any observation whose leverage value, $h_i$, is more than 3 times the mean leverage value,

$$\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i = \frac{p}{n},$$

or greater than 0.99, whichever is smaller.
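As a sketch of this flagging rule, the snippet below applies the cutoff min(3p/n, 0.99) to the second HI1 listing above (the one containing 0.357535), with p = 2. The function name `flag_high_leverage` is hypothetical; this imitates the rule as stated here and is not Minitab's own code.

```python
# Leverages copied from the second HI1 listing above (n = 21 observations).
hi1 = [
    0.153481, 0.139367, 0.116292, 0.110382, 0.084374, 0.077557, 0.066879,
    0.063589, 0.050033, 0.052121, 0.047632, 0.048156, 0.049557, 0.055893,
    0.057574, 0.078121, 0.088549, 0.096634, 0.096227, 0.110048, 0.357535,
]


def flag_high_leverage(h, p):
    """Return the cutoff and the 1-based indices of leverages above it."""
    n = len(h)
    mean_leverage = p / n                   # the leverages sum to p
    cutoff = min(3 * mean_leverage, 0.99)   # whichever is smaller
    flagged = [i + 1 for i, hi in enumerate(h) if hi > cutoff]
    return cutoff, flagged


cutoff, flagged = flag_high_leverage(hi1, p=2)
print("cutoff:", round(cutoff, 4))
print("flagged observations:", flagged)
```

With n = 21 and p = 2 the cutoff is 6/21, so only the last observation is flagged.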

x      y      HI1
14.00  68.00  0.357535

Unusual Observations
Obs  x     y      Fit     SE Fit  Residual  St Resid
21   14.0  68.00  71.449  1.620   -3.449    -1.59 X

X denotes an observation whose X value gives it large influence.

x      y      HI2
13.00  15.00  0.311532

Unusual Observations
Obs  x     y      Fit    SE Fit  Residual  St Resid
21   13.0  15.00  51.66  5.83    -36.66    -4.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Identifying outliers (unusual y values)

Identifying outliers
• Residuals
• Standardized residuals
  • also called internally studentized residuals

Residuals
Ordinary residuals are defined for each observation, i = 1, …, n:

$$e_i = y_i - \hat{y}_i$$

x  y  FITS1  RESI1
1  2  2.2    -0.2
2  5  4.4     0.6
3  6  6.6    -0.6
4  9  8.8     0.2

Standardized residuals
Standardized residuals are defined for each observation, i = 1, …, n:

$$r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_i)}}$$

MSE1 = 0.400000

x  y  FITS1  RESI1  HI1  SRES1
1  2  2.2    -0.2   0.7  -0.57735
2  5  4.4     0.6   0.3   1.13389
3  6  6.6    -0.6   0.3  -1.13389
4  9  8.8     0.2   0.7   0.57735
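The four-point example above can be reproduced by hand. This sketch fits the least-squares line, then divides each residual by $\sqrt{MSE\,(1 - h_i)}$. The helper name `fit_line` is an assumption; the data are the x and y columns of the table above.

```python
import math


def fit_line(x, y):
    """Least-squares intercept, slope, and leverages for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return b0, b1, h


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, 6.0, 9.0]
p = 2  # intercept and slope

b0, b1, h = fit_line(x, y)
fits = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fits)]
mse = sum(e * e for e in resid) / (len(x) - p)

# Standardized (internally studentized) residuals.
sres = [e / math.sqrt(mse * (1.0 - hi)) for e, hi in zip(resid, h)]

for row in zip(x, y, fits, resid, h, sres):
    print("  ".join(f"{v:9.5f}" for v in row))
```

Points with the same raw residual magnitude get different standardized residuals when their leverages differ, which is why the end points (leverage 0.7) and the middle points (leverage 0.3) scale differently.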

Standardized residuals
• Standardized residuals quantify how large the residuals are in standard deviation units.
• An observation with a standardized residual that is larger than 3 (in absolute value) is considered an outlier.
• Recall that Minitab flags any observation with a standardized residual that is larger than 2 (in absolute value).

An outlier?

S = 4.711

x        y        FITS1    HI1       s(e)     RESI1    SRES1
0.10000  -0.0716  3.4614   0.176297  4.27561  -3.5330  -0.82635
0.45401  4.1673   5.2446   0.157454  4.32424  -1.0774  -0.24916
1.09765  6.5703   8.4869   0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150  9.4022   0.119313  4.42103  4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930  3.68110

Unusual Observations
Obs  x     y      Fit    SE Fit  Residual  St Resid
21   4.00  40.00  23.11  1.06    16.89     3.68R

R denotes an observation with a large standardized residual.

Identifying influential data points

Identifying influential data points
• Deleted residuals
• Deleted t residuals
  • also called studentized deleted residuals
  • also called externally studentized residuals
• Difference in fits, DFITS
• Cook’s distance measure

Basic idea of these four measures
• Delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations.
• Compare the results using all n observations to the results with the ith observation deleted to see how much influence the observation has on the analysis.

Deleted residuals
• $y_i$ = the observed response for the ith observation
• $\hat{y}_{(i)}$ = the predicted response for the ith observation, based on the estimated model with the ith observation deleted
• Deleted residual: $d_i = y_i - \hat{y}_{(i)}$

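Deleted t residuals
A deleted t residual is just a standardized deleted residual:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_i)}}$$

where $d_i$ is the deleted residual, $e_i$ is the ordinary residual, $h_i$ is the leverage, and $MSE_{(i)}$ is the mean squared error of the model fitted with the ith observation deleted.

The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.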

x   y    RESI1  TRES1
1   2.1  -1.59  -1.7431
2   3.8   0.24   0.1217
3   5.2   1.77   1.6361
10  2.1  -0.42  -19.7990
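The delete-and-refit idea can be sketched directly on the four-point table above. For each i, the code refits the line without observation i, forms the deleted residual (observed response minus the prediction from the refitted model), and computes the deleted t residual as the ordinary residual divided by $\sqrt{MSE_{(i)}(1 - h_i)}$. The helper name `fit_line` is an assumption, and the refit loop is written for clarity rather than efficiency.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
n, p = len(x), 2

# Full-data fit: ordinary residuals and leverages.
b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

deleted_resid, t_resid = [], []
for i in range(n):
    # Refit on the remaining n - 1 observations.
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit_line(x_del, y_del)
    sse_del = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(x_del, y_del))
    mse_del = sse_del / (n - 1 - p)
    deleted_resid.append(y[i] - (a0 + a1 * x[i]))
    t_resid.append(resid[i] / math.sqrt(mse_del * (1.0 - h[i])))

for xi, e, d, t in zip(x, resid, deleted_resid, t_resid):
    print(f"x={xi:5.1f}  resid={e:7.2f}  deleted={d:8.2f}  t={t:9.4f}")
```

The point at x = 10 has a small ordinary residual because it pulls the line toward itself, but its deleted residual and deleted t residual are very large. For least-squares fits, the deleted residual also equals $e_i / (1 - h_i)$, so the refit loop can be avoided in practice.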

Row  x        y        RESI1    SRES1     TRES1
1    0.10000  -0.0716  -3.5330  -0.82635  -0.81916
2    0.45401  4.1673   -1.0774  -0.24916  -0.24291
3    1.09765  6.5703   -1.9166  -0.43544  -0.42596
...
19   8.70156  46.5475  -0.2429  -0.05561  -0.05413
20   9.16463  45.7762  -3.3468  -0.77679  -0.76837
21   4.00000  40.0000  16.8930  3.68110   6.69012

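DFITS
The difference in fits:

$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_i}}$$

is the number of standard deviations that the fitted value changes when the ith case is omitted. Here $\hat{y}_{(i)}$ and $MSE_{(i)}$ are the predicted response and the mean squared error from the model fitted with the ith observation deleted.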

DFITS
An observation is deemed influential if the absolute value of its DFITS value is:
• greater than 1 for small to medium data sets,
• greater than a size-dependent cutoff for large data sets (a commonly used guideline is $2\sqrt{p/n}$),
• or if it just sticks out like a sore thumb.

x      y      DFIT1
14.00  68.00  -1.23841

Row  x        y        DFIT1
1    0.1000   -0.0716  -0.52503
2    0.4540   4.1673   -0.08388
3    1.0977   6.5703   -0.18232
4    1.2794   13.8150  0.75898
5    2.2061   11.4501  -0.21823
6    2.5006   12.9554  -0.20155
7    3.0403   20.1575  0.27774
8    3.2358   17.5633  -0.08230
9    4.4531   26.0317  0.13865
10   4.1699   22.7573  -0.02221
11   5.2847   26.3030  -0.18487
12   5.5924   30.6885  0.05523
13   5.9209   33.9402  0.19741
14   6.6607   30.9228  -0.42449
15   6.7995   34.1100  -0.17249
16   7.9794   44.4536  0.29918
17   8.4154   46.5022  0.30960
18   8.7161   50.0568  0.63049
19   8.7016   46.5475  0.14948
20   9.1646   45.7762  -0.25094
21   14.0000  68.0000  -1.23841
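The DFIT1 column above can be approximated from its definition: the change in the ith fitted value when observation i is deleted, divided by $\sqrt{MSE_{(i)} h_i}$. The sketch below uses the (x, y) pairs from the table, which are rounded to four decimals, so the results agree with the listed values only to about two or three decimals. The names `fit_line` and `dffits` are hypothetical.

```python
import math

# (x, y) pairs copied from the DFIT1 table above (21 observations).
data = [
    (0.1000, -0.0716), (0.4540, 4.1673), (1.0977, 6.5703), (1.2794, 13.8150),
    (2.2061, 11.4501), (2.5006, 12.9554), (3.0403, 20.1575), (3.2358, 17.5633),
    (4.4531, 26.0317), (4.1699, 22.7573), (5.2847, 26.3030), (5.5924, 30.6885),
    (5.9209, 33.9402), (6.6607, 30.9228), (6.7995, 34.1100), (7.9794, 44.4536),
    (8.4154, 46.5022), (8.7161, 50.0568), (8.7016, 46.5475), (9.1646, 45.7762),
    (14.0000, 68.0000),
]


def fit_line(points):
    """Return (intercept, slope, Sxx, x_bar) of the least-squares line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxx, x_bar


def dffits(points, p=2):
    """Difference in fits for each observation, by deleting and refitting."""
    n = len(points)
    b0, b1, sxx, x_bar = fit_line(points)
    out = []
    for i, (xi, _) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        a0, a1, _, _ = fit_line(rest)
        mse_del = sum((y - (a0 + a1 * x)) ** 2 for x, y in rest) / (n - 1 - p)
        h_i = 1.0 / n + (xi - x_bar) ** 2 / sxx
        fit_all = b0 + b1 * xi   # fitted value using all n observations
        fit_del = a0 + a1 * xi   # fitted value with observation i deleted
        out.append((fit_all - fit_del) / math.sqrt(mse_del * h_i))
    return out


values = dffits(data)
influential = [i + 1 for i, v in enumerate(values) if abs(v) > 1.0]
print("DFITS for row 21:", round(values[20], 3))
print("rows with |DFITS| > 1:", influential)
```

Under the "greater than 1" guideline for small to medium data sets, only row 21 exceeds the cutoff here.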

x      y      DFIT2
13.00  15.00  -11.4670

Row  x        y        DFIT2
1    0.1000   -0.0716  -0.4028
2    0.4540   4.1673   -0.2438
3    1.0977   6.5703   -0.2058
4    1.2794   13.8150  0.0376
5    2.2061   11.4501  -0.1314
6    2.5006   12.9554  -0.1096
7    3.0403   20.1575  0.0405
8    3.2358   17.5633  -0.0424
9    4.4531   26.0317  0.0602
10   4.1699   22.7573  0.0092
11   5.2847   26.3030  0.0054
12   5.5924   30.6885  0.0782
13   5.9209   33.9402  0.1278
14   6.6607   30.9228  0.0072
15   6.7995   34.1100  0.0731
16   7.9794   44.4536  0.2805
17   8.4154   46.5022  0.3236
18   8.7161   50.0568  0.4361
19   8.7016   46.5475  0.3089
20   9.1646   45.7762  0.2492
21   13.0000  15.0000  -11.4670

Cook’s distance

$$D_i = \frac{e_i^2}{p \times MSE} \cdot \frac{h_i}{(1 - h_i)^2}, \qquad e_i = y_i - \hat{y}_i$$

• $D_i$ depends on both the residual $e_i$ and the leverage $h_i$.
• $D_i$ summarizes how much all of the estimated beta coefficients change when deleting the ith observation.
• A large $D_i$ indicates that $y_i$ has a strong influence on the estimated beta coefficients.

Cook’s distance
• Compare $D_i$ to the F(p, n-p) distribution.
• If $D_i$ is greater than the 50th percentile, F(0.50, p, n-p), then the ith observation has lots of influence.

x      y      COOK1
14.00  68.00  0.701960

x      y      COOK2
13.00  15.00  4.04801
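The two Cook's distances above can be approximated from the residual-and-leverage form $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$. This sketch uses the first 20 rows shared by the DFIT1 and DFIT2 tables, then appends either (14, 68) or (13, 15) as row 21. The function names are hypothetical. The closed-form F median used for the cutoff holds only when the numerator degrees of freedom equal 2 (that is, p = 2); for other p, an F quantile routine from a statistics library would be needed.

```python
# First 20 (x, y) pairs, shared by both data sets in the tables above.
base = [
    (0.1000, -0.0716), (0.4540, 4.1673), (1.0977, 6.5703), (1.2794, 13.8150),
    (2.2061, 11.4501), (2.5006, 12.9554), (3.0403, 20.1575), (3.2358, 17.5633),
    (4.4531, 26.0317), (4.1699, 22.7573), (5.2847, 26.3030), (5.5924, 30.6885),
    (5.9209, 33.9402), (6.6607, 30.9228), (6.7995, 34.1100), (7.9794, 44.4536),
    (8.4154, 46.5022), (8.7161, 50.0568), (8.7016, 46.5475), (9.1646, 45.7762),
]


def cooks_distances(points, p=2):
    """Cook's distance for every observation in a simple linear regression."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [y - (b0 + b1 * x) for x, y in points]
    mse = sum(e * e for e in resid) / (n - p)
    lev = [1.0 / n + (x - x_bar) ** 2 / sxx for x, _ in points]
    return [e * e / (p * mse) * h / (1.0 - h) ** 2 for e, h in zip(resid, lev)]


def f_median_num_df_2(den_df):
    """50th percentile of F(2, den_df); valid only for numerator df = 2."""
    return (den_df / 2.0) * (2.0 ** (2.0 / den_df) - 1.0)


cook1 = cooks_distances(base + [(14.0, 68.0)])[-1]
cook2 = cooks_distances(base + [(13.0, 15.0)])[-1]
cutoff = f_median_num_df_2(21 - 2)   # F(0.50, p = 2, n - p = 19)

print("COOK1:", round(cook1, 3), " COOK2:", round(cook2, 3))
print("F(0.50, 2, 19):", round(cutoff, 3))
```

Under these assumptions the 50th-percentile cutoff comes out near 0.72, so the point (13, 15) is far above it, while the point (14, 68) sits just below it.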
