Psych 5510/6510 Chapter Nine: Outliers and Data Having Undue Influence Spring, 2009
Effect of Outliers
Set 1: 1, 3, 5, 9, 14: Sample Mean = est. μ = 6.4, MSE = S² = 26.8; 95% confidence interval for the mean: 0 ≤ μ ≤ 12.8
Set 2: 1, 3, 5, 9, 140: Sample Mean = est. μ = 31.6, MSE = S² = 3680.8; 95% confidence interval for the mean: -43.7 ≤ μ ≤ 106.9
• The parameter estimate is greatly affected.
• The confidence interval based on the second data set is much wider, making it less likely we will reject the null hypothesis.
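Both intervals are easy to verify; a minimal sketch (not from the original slides) using scipy:

```python
# A minimal sketch (not from the original slides) verifying both
# 95% confidence intervals for the mean.
import numpy as np
from scipy import stats

for data in ([1, 3, 5, 9, 14], [1, 3, 5, 9, 140]):
    x = np.array(data, dtype=float)
    n = len(x)
    mean = x.mean()
    s2 = x.var(ddof=1)                     # MSE = S^2 (unbiased variance)
    se = np.sqrt(s2 / n)                   # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)  # two-tailed 95% critical value
    print(f"{data}: mean = {mean:.1f}, S^2 = {s2:.1f}, "
          f"CI = ({mean - t_crit * se:.1f}, {mean + t_crit * se:.1f})")
```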
Causes of Outliers • An error in measurement or in inputting the score, i.e. the score does not measure the variable. This could be a data-entry error, or the subject not understanding the instructions. • The sample is essentially being drawn from two distinct populations, with one population having a much greater frequency than the other. For example, sampling from a gym class consisting mainly of average students plus a few members of the track team. • Sampling from an error distribution with ‘thick tails’ (one that has a greater than normal chance of producing a score far from the mean).
1) Error in Measurement or Inputting the Score Such an error is less likely to be caught when entering a series of data for the computer to analyze than when computing the analysis yourself with a calculator.
2) Outliers Due to a Non-homogeneous Set Discovering that more than one kind of thing is being measured (e.g. that more than one population appears in the group) can be very interesting. Rather than get rid of such outliers, we need to identify them so they can be examined; possibly the existence of two different populations can be incorporated into the model.
3) Outliers Due to Thick Tails [Figure: two error distributions; the taller curve with thinner tails is the normal distribution] Thick tails lead to more frequent extreme scores and greater error variance than sampling from a normal distribution.
Example of Accidentally Reversing Scores Note the error in data entry for student #6 (the digits of the score were reversed). PRE and F don’t change much, but there is a huge difference in the parameter estimates, including a reversal in the direction of the slope.
With Outlier Note the reversal of the slope and the big change in the intercept; yet the model is still statistically significant, so without looking at the graph you might think you have a pretty good model.
Identifying Outliers • Is X unusual? Leverage • Is Yi unusual? Discrepancy • Would omission of the observation produce a dramatic change in the model (i.e. in the values of the parameter estimates b0, b1, b2, ...)? This happens when both X and Y are unusual. Influence
Identifying Outliers • Using graphs to look at your data should be your first and last resort. This gets increasingly complicated as the number of predictor variables increases, and it is nice to have some criteria for when to worry, so... • There are many, many approaches for using statistical analyses to flag potential outliers (which you can then look at with graphs to see what is going on).
Leverage Leverage involves determining whether any particular observation has unusual values for X. The approach we will use involves looking at the ‘lever’ that goes with each observation.
Levers Buried within the regression equation is the fact that all of the X and all of the Y scores in the data set go into computing the values of the b’s. This means that for any one observation, the X scores of all the observations influence its predicted value of Y.
Levers In a much more complicated but equivalent version of the regression equation you plug in all of the X scores for all of the observations to predict a particular value of Y. Alternative regression equation: Ŷi= a complicated formula that includes all of the X scores in the data set, not just the X score for observation ‘i’.
The Basis of Using Levers If an observation has unusual X scores then its predicted value of Y is not very strongly influenced by the X scores of the other observations, instead its prediction is influenced mainly by its own X scores. If, on the other hand, an observation has X scores similar to those of the other observations, then its predicted value of Y is influenced not only by its own X scores but also by those of the other observations.
Levers This gives us a way of determining whether an observation has unusual X scores. If its predicted value of Y is heavily influenced by its own X scores then those X scores must have been unusual. If its predicted value of Y is influenced by the X scores of other observations then its X scores must have been similar to those of the other observations.
Levers A ‘lever’ (symbolized as hii) is a measure of how much an observation’s own X scores influence its predicted value of Y (it measures how much leverage its own X scores had in the prediction). If the observation has unusual X scores (compared to the other observations) then its lever is a large value; if its X scores are similar to those of the other observations then its lever is smaller.
Interpreting levers Levers always have a value between 0 and 1. The mean (expected) value of a lever--if its values of X conform to those of the other observations--is PA/n (the number of parameters in Model A divided by the number of observations). If an observation has a lever much greater than that, it is a flag that the observation has some unusual X values.
Interpreting levers (cont.) The bigger the lever, the more its X scores stand out as different from the rest. How big does a lever have to be to draw our attention? • If the value of the lever is 2 or 3 times the mean of the levers (see the formula on the previous slide for the mean of the levers) then it deserves special attention, or, • The value of 1/hii can be thought of as roughly the number of observations that go into predicting that value of Y. So if hii is the maximum value of 1, then 1/hii = 1, and the X scores from just one observation (i.e. itself) went into making that prediction. If hii = .1, then 1/hii = 10 and the equivalent of ten observations went into making that prediction. If N is large, then any value of 1/hii ≤ 5 should grab our attention, as that predicted value of Y was based upon the X scores of the equivalent of only 5 or fewer observations.
Levers Table [Table: the lever hii for each of the 13 observations] The expected (mean) value of the levers is PA/n = 2/13 = .15
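The slide’s table of lever values did not survive, but a short sketch (with invented X values, one deliberately extreme) shows how the levers are computed and why their mean is PA/n:

```python
# A sketch with invented X values (the slide's table did not survive);
# one X is deliberately extreme to show how an unusual X inflates its lever.
import numpy as np

x = np.array([40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 140], float)
X = np.column_stack([np.ones_like(x), x])  # design matrix: intercept + X

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix (depends only on X)
levers = np.diag(H)                        # h_ii for each observation

print(f"mean lever = {levers.mean():.3f}") # always PA/n = 2/13 = .15
for i, h in enumerate(levers, start=1):
    flag = "  <-- unusual X" if h > 3 * levers.mean() else ""
    print(f"obs {i:2d}: h_ii = {h:.3f}{flag}")
```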
Discrepancy Discrepancy involves determining whether any particular observation has an unusual value of Y. Q: Unusual with respect to what? A: With respect to the model. In other words, look for observations that differ greatly from the regression line (giving them a large error). An examination of the error terms is referred to as an ‘analysis of the residuals’.
Two problems • The magnitude of an error depends upon the scale, e.g. an error of 12 is large when predicting the number of children in a household, but small when predicting the weight of a car in ounces. • Outliers grab the model (particularly with small n), greatly influencing the regression line in the model’s attempt to reduce squared error. Thus an unusual Y tends to pull the regression line toward itself, reducing its own apparent error.
Discrepancy Approach: if an observation is unusual (way off the regression line that would fit all of the other observations) then creating a parameter just to handle it should greatly reduce error (visually, think of the original regression line being freed from the pull of the outlier--look back at the original scatter plots with and without the outlier). We will start by looking at how that works with our outlier.
Discrepancy Approach: Model C is the original model, in this case SAT predicted from HSRANK. For Model A, add another variable (X2) that has a score of X2 = 0 everywhere except at the outlier, where X2 = 1. If PRE is significant, then it was worthwhile to handle that one outlier individually in the model, i.e. it doesn’t belong with the other scores.
Data With Dummy Variable The dummy variable is there to handle observation #6.
Studentized Deleted Residual
Model C: Ŷ = 96.55 - .50(HSRANK)
Model A: Ŷ = 6.71 + .50(HSRANK) + 55.49(Dummy)
PRE = 0.68, F*(1,10) = 21.4, t* = 4.6, p < .01
Thus it was worthwhile to introduce a dummy variable to account for the outlier. Stat programs will do this for each observation, one at a time, to determine whether or not it is worthwhile to create a dummy variable to handle just that observation. They report this as the Studentized Deleted Residual, which is the square root of the F* above.
Do this for each observation: compare a model with just HSRANK (X1) to one that also has a dummy variable (X2) just to handle that one value of Y, and report the t value for the PRE.
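A sketch of that procedure follows; the HSRANK/SAT numbers here are invented stand-ins, with a reversed score planted at observation #6. statsmodels reports the same t values directly as ‘externally studentized’ residuals:

```python
# A sketch of the dummy-variable procedure; the HSRANK/SAT numbers are
# invented stand-ins, with a reversed score planted at observation #6.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
hsrank = np.linspace(40, 90, 13)
sat = 7 + 0.5 * hsrank + rng.normal(0, 3, size=13)
sat[5] = 86                                # mimic the reversed score for #6

X = sm.add_constant(hsrank)                # Model C: intercept + HSRANK
for k in range(len(sat)):
    dummy = np.zeros(len(sat))
    dummy[k] = 1.0                         # Model A: handle observation k alone
    fit = sm.OLS(sat, np.column_stack([X, dummy])).fit()
    print(f"obs {k + 1:2d}: t = {fit.tvalues[-1]:6.2f}")

# Shortcut: the same t values in one call.
print(sm.OLS(sat, X).fit().get_influence().resid_studentized_external)
```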
The problem of error rate Problem: If each t test has a .05 chance of making a Type I error, then the overall error rate is too large with this approach. Solutions: • Use α/n as your per-test alpha (α = .05/13 = .0038 for this example), but note that the p values are not provided by SPSS, so, • Use the test only to draw your attention to possible outliers: look for |t| values greater than 2 (worth a look), 3 (careful attention), or 4 (alarm bells).
Influence The third approach to identifying unusual scores is to see whether dropping the score would dramatically change the model; this is known as influence. Procedure: compare the estimates of the parameters in the model with the outlier to the estimates of the parameters in the model without it.
Looking for Influence We want to compare the b’s estimated from all of the observations with the b’s estimated when the k’th observation is deleted. If deleting the k’th observation greatly changes the values of the b’s, then it must have been having a large influence on those values when it was included. As can be seen in the previous two slides, omitting the outlier greatly changes the b’s (notice how the slope and the intercept both changed).
Looking for Influence If the values of the b’s change between the two models, then the predictions made by the two models will also change. The easiest way to see if the models differ is by comparing their predictions, specifically looking at (Ŷi - Ŷi[k]) for each observation, where Ŷi[k] is the prediction for observation i when observation k has been deleted. To see the total difference between the two models across all observations we will use the sum of the squared differences, Σ(Ŷi - Ŷi[k])².
Cook’s D Cook’s D (distance) is based on that total difference:
Dk = Σ(Ŷi - Ŷi[k])² / (PA × MSE)
The size of D reflects how much the predictions change when observation k is removed. It ends up that you get the greatest change in predicted values when both: • the X value is unusual (leverage); and • the Y value is unusual (discrepancy). So Cook’s D is largest when both occur.
Evaluating Outliers with Cook’s D There are only informal guidelines for when Cook’s D is considered large: • D > 1 or 2, or, • a definite gap between the largest Dk’s and the rest. Again, this is used simply to draw your attention to where to look for outliers.
Values of Cook’s D

i   SAT  Cook’s D
1   42    .05
2   48    .01
3   58    .01
4   45    .09
5   45    .02
6   86   11.86
7   51    .01
8   56    .08
9   51    .02
10  58    .05
11  42    .06
12  55    .06
13  61    .21

The 11.86 really stands out; we may also want to take a look at the .21.
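Values like these come straight out of statsmodels (again with the invented stand-in data from the earlier sketch):

```python
# A sketch getting Cook's D from statsmodels (same invented stand-in
# data as in the earlier studentized-residual sketch).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
hsrank = np.linspace(40, 90, 13)
sat = 7 + 0.5 * hsrank + rng.normal(0, 3, size=13)
sat[5] = 86                                  # the planted outlier

influence = sm.OLS(sat, sm.add_constant(hsrank)).fit().get_influence()
cooks_d, _ = influence.cooks_distance        # (D values, approximate p-values)
for i, d in enumerate(cooks_d, start=1):
    print(f"obs {i:2d}: D = {d:.2f}")        # observation 6 dwarfs the rest
```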
Leverage, Discrepancy, & Influence [Figure: scatter plot with three marked observations] Identify what is exemplified by ‘A’, ‘B’, and ‘C’.
Usual Effects of Outliers Leverage: leads us to falsely think we have found something interesting. Unusual X scores inflate SSX, which leads to smaller confidence intervals, making it easier to reject H0. Discrepancy: shooting ourselves in the foot by causing us to miss something interesting. Scores far off the regression line add to SSE(A), which reduces SSR, making PRE smaller. Influence: all bets are off; the model just doesn’t fit the majority of the scores.
The Importance of Always Looking at Your Data The following four scatter plots all have the same model, PRE, and significance! Ŷi = 3.0 + .5Xi, PRE = .666, F* = 17.95, p < .01
Clearly it would be good to do something with the outlier so that the regression line would better fit the other scores.
The one point on the far right is completely responsible for determining the slope; what the heck is X doing?
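The model and PRE quoted two slides back match Anscombe’s (1973) well-known quartet; assuming that is the source of the four plots, a short sketch fitting each data set shows the identical line and PRE despite the very different pictures:

```python
# The four data sets appear to be Anscombe's (1973) quartet; fitting each
# one shows the identical line and PRE (r^2) despite very different plots.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8]*7 + [19] + [8]*3,
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
for k, (x, y) in enumerate(quartet, start=1):
    b1, b0 = np.polyfit(x, y, 1)             # slope, intercept
    pre = np.corrcoef(x, y)[0, 1] ** 2       # r^2 = PRE for simple regression
    print(f"set {k}: Yhat = {b0:.2f} + {b1:.2f}X, PRE = {pre:.3f}")
```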
Complex Models Partial regression plots can be of help in visually identifying outliers when there is more than one predictor variable.
Doing Something About Outliers • The authors argue against the bias that doing nothing about an outlier is somehow better than doing something about it. • Always mention in your report anything you do to handle outliers. • If you think others might question what you have done, provide both analyses (with and without your action to handle the outlier).
What To Do • If it seems reasonable to conclude that the outlier is an error of measurement or recording, or if it leads to a model that is less accurate than the model you get when you omit it, then omit it (and explain that in your report). Removing an error is far better than leaving it in. • If further exploration leads to the conclusion that the outliers essentially represent a second population in the sample, then find an independent way (other than the extreme score itself) to measure which population each observation came from, include that variable in your model, and test it in further research.
(From text): “We think it is far more honest to omit outliers from the analysis with the explicit admission in the report that there are some observations which we do not understand and to report a good model for those observations which we do understand....To ignore outliers by failing to detect and report them is dishonest and misleading.”