Influential Points and Outliers
Debbi Amanti
OUTLIERS:
• Data points more than two or three standard deviations from the mean of the data.
• Observations that differ significantly from the pattern of the REST OF THE DATA.
• Observations that lie outside the overall pattern of the other observations.
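The standard-deviation rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the data values, the function name `far_from_mean`, and the choice of 2 as the cutoff are assumptions made for the example, not part of the presentation.

```python
from statistics import mean, stdev

def far_from_mean(data, k=2):
    """Return the values lying more than k sample standard deviations from the mean."""
    center = mean(data)
    spread = stdev(data)  # sample standard deviation
    return [v for v in data if abs(v - center) > k * spread]

# Hypothetical data: nine values near 11 and one far away.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 11, 40]
flagged = far_from_mean(scores, k=2)
print(flagged)
```

Note that an extreme value also inflates the standard deviation itself, so in small data sets this rule can miss points that the IQR rule would catch.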
OUTLIERS IN TERMS OF REGRESSION:
• Observations with large (in absolute value) residuals.
• Observations falling far from the regression line while not following the pattern of the relationship apparent in the others.
• Residual = actual - fitted
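To make "residual = actual - fitted" concrete, here is a minimal Python sketch that fits a least-squares line and lists each point's residual. The data set and the helper names `least_squares_line` and `residuals` are hypothetical, chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys):
    """Residual = actual y minus the y fitted by the regression line."""
    slope, intercept = least_squares_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data: y follows 2x except for the point at x = 4.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2, 4, 6, 20, 10, 12, 14]
res = residuals(xs, ys)

# Index of the observation with the largest residual in absolute value.
largest = max(range(len(res)), key=lambda i: abs(res[i]))
print(xs[largest], ys[largest], round(res[largest], 2))
```

In this made-up data the point (4, 20) has by far the largest residual, so it is the regression outlier in the sense described above.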
To identify outliers mathematically in a univariate data set: find the interquartile range, a.k.a. IQR (Q3 - Q1), and multiply this value by 1.5. An outlier for the data set is any point:
• Greater than Q3 + 1.5*(IQR)
• Less than Q1 - 1.5*(IQR)
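The 1.5*IQR rule above could be coded as follows. This is a sketch under stated assumptions: quartiles are computed as the medians of the lower and upper halves of the sorted data (one common classroom convention; software packages may use slightly different quartile definitions), and the sample data and function names are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (median excluded when n is odd)."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(upper)

def iqr_outliers(data):
    """Return the points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    return [v for v in data if v < low_cutoff or v > high_cutoff]

# Hypothetical data set with one unusually large value.
sample = [1, 3, 4, 5, 6, 7, 8, 9, 30]
print(quartiles(sample))
print(iqr_outliers(sample))
```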
INFLUENTIAL POINTS ARE:
• Points whose removal would greatly affect the association of two variables.
• Points whose removal would significantly change the slope of the least-squares regression (LSR) line.
• Points with a large moment (i.e., they are far away from the rest of the data).
• Usually outliers in the x direction.
The two graphs below show the same data – the one on the right with the removal of the green data point. As you can see, the removal of this point significantly affects the slope of the regression line. This is an influential point!
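The same comparison can be made numerically. The sketch below computes the least-squares slope with and without one far-away point; the coordinates are invented for illustration and are not the data plotted in the graphs.

```python
def slope(points):
    """Least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Hypothetical cluster lying on a line of slope 1, plus one point far out in x.
cluster = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
far_point = (15, 2)

with_point = slope(cluster + [far_point])
without_point = slope(cluster)
print(round(with_point, 3), round(without_point, 3))
```

With these made-up values, a single point far out in the x direction is enough to turn a slope of 1 into a slightly negative slope, which is the behavior that marks a point as influential.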
Using the same data as shown on the previous slide, let's compare the x and y data sets for the presence of outliers:
X DATA: MIN = 1, Q1 = 3, Q3 = 8, MAX = 15.5, IQR = 5
An outlier is any point > Q3 + 1.5*IQR = 15.5 or < Q1 - 1.5*IQR = -4.5
(MAX = 15.5 sits exactly on the upper cutoff, not above it.)
THERE ARE NO OUTLIERS IN THIS DATA SET!!!
Y DATA: MIN = 2, Q1 = 4, Q3 = 9, MAX = 10, IQR = 5
An outlier is any point > Q3 + 1.5*IQR = 16.5 or < Q1 - 1.5*IQR = -3.5
THERE ARE NO OUTLIERS IN THIS DATA SET!!!
!!!REMEMBER!!! An observation does NOT have to be an Outlier to be an Influential Point!! Nor does an observation need to be an Influential Point in order to be an Outlier!!
Given the five-number summary {8 21 35 43 77}, which of the following is correct?
A. There are no outliers
B. There are at least two outliers
C. There is not enough data to make any conclusion
D. There is exactly one outlier
E. There is at least one outlier
The correct answer is E.
The five-number summary gives you {Min Q1 Median Q3 Max}.
The IQR is calculated by Q3 - Q1, so the IQR for the given data is 43 - 21 = 22.
An outlier for this data would be > Q3 + 1.5*IQR or < Q1 - 1.5*IQR:
> 43 + (22*1.5) = 76 or < 21 - (22*1.5) = -12
Since the max is 77, there must be at least one outlier in this data set, but we cannot conclude how many outliers without more data.
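The reasoning in this answer can be checked with a short Python sketch that works from the five-number summary alone. The function names `cutoffs` and `summary_has_outlier` are hypothetical; only the numbers come from the question.

```python
def cutoffs(q1, q3):
    """Return the lower and upper outlier cutoffs from the 1.5*IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def summary_has_outlier(minimum, q1, med, q3, maximum):
    """True if the min or max alone proves that at least one outlier exists.

    A five-number summary cannot tell us how many outliers there are,
    only whether the extremes fall outside the cutoffs.
    """
    low, high = cutoffs(q1, q3)
    return minimum < low or maximum > high

low, high = cutoffs(21, 43)
has_outlier = summary_has_outlier(8, 21, 35, 43, 77)
print(low, high, has_outlier)
```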
Given the following scatterplot and residual plot, which of the following is true about the yellow data point?
I. It is an influential point
II. It is an outlier with respect to the regression model
III. It appears to be an outlier in the x direction
A. I only
B. I and II
C. I and III
D. None of the above
E. All of the above
The correct answer is C.
I. Because this point has a large moment and is far from the rest of the data, it is an influential point. If this point were removed, the slope of the line would change markedly.
II. This point is not an outlier with respect to the model because, as you can see in the residual plot, it does not have a large residual (it follows the regression pattern of the data).
III. By looking at both the scatterplot and the residual plot, you can see that the yellow point is an outlier in the x direction (far to the right of the rest of the data).
Resources used in this presentation include:
• Workshop Statistics by Allan Rossman
• The Basic Practice of Statistics by David S. Moore
• AMSCO's AP Statistics by James Bohan
Any further questions, email me at: debora_amanti@bbns.org