
Strategies for Identifying Outliers and Managing Missing Data

Strategies for Identifying Outliers and Managing Missing Data. R. Michael Haynes, PhD rhaynes@tarleton.edu Tarleton State University A PRIORI MARCH 1, 2012 Assistant Vice President for Student Life Studies POST HOC FEBRUARY 29, 2012 Executive Director of Institutional Research


Presentation Transcript


  1. Strategies for Identifying Outliers and Managing Missing Data • R. Michael Haynes, PhD • rhaynes@tarleton.edu • Tarleton State University • A PRIORI MARCH 1, 2012: Assistant Vice President for Student Life Studies • POST HOC FEBRUARY 29, 2012: Executive Director of Institutional Research • Assistant Professor, Educational Leadership and Policy Studies

  2. A little background... • Outlier analysis in multiple regression class • Data inspection (missing data) was a key aspect of dissertation • Try to incorporate at the very least “a nod” to data inspection in any assessment/project completed

  3. Why is it important to evaluate your data set? Can help in... • Identifying input errors • Identifying spurious data points (e.g., an answer of “6” on a 1-5 Likert scale) • Making your findings more sound • Good practice, as recommended by the American Psychological Association (Wilkinson & Task Force on Statistical Inference, 1999)

  4. Desired Outcomes • Knowledge of various data inspection methods: visual; range of data set • Methods for managing missing data: listwise deletion; pairwise deletion; mean replacement; linear trend at point • Criteria for identifying outliers/spurious data points: standardized residuals/predicted values; standard deviation diagnostics; Cook’s D values

  5. Data inspection methods • Visual • Can alert you to missing cases • Most beneficial with smaller datasets where review of individual cases is possible

  6. Data inspection methods • SPSS minimum/maximum values function • Quick method of inspecting range of larger data sets
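Outside SPSS, the same minimum/maximum check can be sketched in a few lines of Python. This is only an illustration: the `responses` list and the `out_of_range` helper are hypothetical names made up for this example, not part of the presentation or of SPSS.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs for non-missing values outside [low, high]."""
    return [(i, v) for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

# Hypothetical responses to a 1-5 Likert item; None marks a missing answer.
responses = [4, 5, 3, 6, None, 2, 1]

# The observed range alone already reveals the problem (a maximum above 5).
observed = [v for v in responses if v is not None]
print("min:", min(observed), "max:", max(observed))

# Listing the offending cases shows where to look in the data file.
print("suspect:", out_of_range(responses, 1, 5))
```

A maximum above the top of the scale is the same signal the SPSS minimum/maximum output would give; the list of suspect cases simply points to the rows that need review.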

  7. What to do about missing values? • SPSS options • Exclude cases listwise: Only cases with valid values for all variables are included in the analyses. • Exclude cases pairwise: Cases with complete data for the pair of variables being correlated are used to compute the correlation coefficient on which the regression analysis is based. Degrees of freedom are based on the minimum pairwise N. • Replace with mean: All cases are used for computations, with the mean of the variable substituted for missing observations. (SPSS Inc., 233 S. Wacker Drive, Chicago, IL, 60606)
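To make the three options concrete, here is a minimal Python sketch of what each one does to a small data set. The variable names (`gpa`, `hours`, `age`), the four cases, and the helper functions are all hypothetical; this illustrates the ideas and is not SPSS's implementation.

```python
from statistics import mean

# Hypothetical cases; None marks a missing value.
cases = [
    {"gpa": 3.2, "hours": 15, "age": 19},
    {"gpa": None, "hours": 12, "age": 20},
    {"gpa": 2.8, "hours": None, "age": 22},
    {"gpa": 3.6, "hours": 9, "age": None},
]
variables = ["gpa", "hours", "age"]

def listwise(rows, names):
    """Keep only cases with valid values on every variable."""
    return [r for r in rows if all(r[n] is not None for n in names)]

def pairwise(rows, a, b):
    """Keep cases with valid values on both variables of one pair."""
    return [r for r in rows if r[a] is not None and r[b] is not None]

def replace_with_mean(rows, names):
    """Substitute each variable's mean for its missing observations."""
    means = {n: mean(r[n] for r in rows if r[n] is not None) for n in names}
    return [{n: (r[n] if r[n] is not None else means[n]) for n in names}
            for r in rows]

print(len(listwise(cases, variables)))        # cases left after listwise deletion
print(len(pairwise(cases, "gpa", "hours")))   # cases usable for the gpa/hours pair
print(len(replace_with_mean(cases, variables)))  # all cases are retained
```

In this made-up example each case is missing at most one value, yet listwise deletion keeps only a single case, which shows how quickly that option can shrink a data set.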

  8. Problems with these options… • Listwise excludes all values for a case missing even 1 variable value…throws the baby out with the bath water! • Pairwise only utilizes cases for which both values in a pair of variables are present • Can lead to distortion of findings through selection bias (King, Honaker, Joseph, & Scheve, 1998)

  9. More preferred options… • Choose “Transform” -> “Replace Missing Values” • Enter variables with missing values into the “New Variable” box • Under “Name and Method”, select one of the following: Series Mean; Mean of Nearby Points; Median of Nearby Points; Linear Interpolation; Linear Trend at Point • I prefer the last option, Linear Trend at Point

  10. Linear Trend at Point • Uses the theory of regression to calculate coefficients based upon existing values • Generates a replacement value for each missing value on each selected variable • More robust than simply replacing with the mean
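The idea can be sketched in Python under one stated assumption: that the observed values are regressed on their position in the series (1 to n) and each missing value is replaced by the fitted line's prediction at its position. The function name and the `scores` data are hypothetical, and the sketch is not claimed to reproduce SPSS output exactly.

```python
def linear_trend_at_point(series):
    """Fill None entries with predictions from a least-squares line fitted
    to the observed values against their 1..n position in the series."""
    points = [(i, y) for i, y in enumerate(series, start=1) if y is not None]
    if len(points) < 2:
        raise ValueError("need at least two observed values to fit a line")
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Observed values are kept as they are; only the gaps are predicted.
    return [y if y is not None else intercept + slope * i
            for i, y in enumerate(series, start=1)]

# Hypothetical series with one missing value in the third position.
scores = [10.0, 12.0, None, 16.0, 18.0]
print(linear_trend_at_point(scores))
```

Because the replacement depends on each case's position, a fill of this kind is most meaningful when the order of the cases carries information; in an arbitrarily sorted file the fitted trend may be close to flat, which brings the result near the series mean.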

  11. Identifying outliers… what is an outlier? • An unusual score in a distribution that is considered extreme and may warrant special consideration (Hinkle, Wiersma, & Jurs, 2003) • ...a data point distinct or deviant from the rest of the data (Pedhazur, 1997)

  12. Why is it important to identify potential outliers? • Can skew findings, which in turn can skew conclusions/decisions/programming • Can help identify cases in dire need of additional programming/resources…finding that lost raft at sea! • As mentioned earlier, can assist in identifying data entry errors

  13. Strategies for identifying outliers in your dataset • Standardized predicted and residual scores

  14. Strategies for identifying outliers in your dataset

  15. Strategies for identifying outliers in your dataset • Residuals more than 3 standard deviations away from the mean • Rule of thumb: “99% of your dataset should fall within + or – 3 standard deviations from the mean” (about 99.7% when the values are normally distributed)
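A minimal Python sketch of the 3-standard-deviation check on residuals, assuming a single predictor and assuming that each residual is standardized by dividing it by the standard error of estimate. The function name and the 20 hypothetical cases are made up for illustration.

```python
from math import sqrt

def standardized_residuals(xs, ys):
    """Residuals from a one-predictor least-squares fit, each divided by
    the standard error of estimate, sqrt(SSE / (n - 2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    see = sqrt(sum(e * e for e in residuals) / (n - 2))
    return [e / see for e in residuals]

# Hypothetical data: 20 cases close to y = 2x, with the case at index 9
# shifted upward by 30 to act as an outlier.
xs = list(range(1, 21))
ys = [2 * x + ((i % 3) - 1) * 0.5 for i, x in enumerate(xs)]
ys[9] += 30

z = standardized_residuals(xs, ys)
flagged = [i for i, value in enumerate(z) if abs(value) > 3]
print(flagged)
```

One caution with this scaling: because no squared residual can exceed the total sum of squared residuals, a standardized residual can never be larger in magnitude than sqrt(n - 2), so in very small samples nothing can reach the cutoff of 3.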

  16. Strategies for identifying outliers in your dataset • Cook’s D values • Considers each variable’s relationship to the other variables in the dataset (Pedhazur, 1997) • Cook’s D values greater than 1 could be suspect

  17. Strategies for identifying outliers in your dataset • Cook’s D values • Considers each variable’s relationship to the other variables in the dataset (Pedhazur, 1997) • Cook’s D values greater than 1 could be suspect • Saves values to dataset
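For readers who want to see what sits behind the saved values, here is a hedged Python sketch of Cook's D for a one-predictor regression. It uses the common textbook form, D = (e² / (p · MSE)) · (h / (1 − h)²), where e is the case's residual, h its leverage, and p the number of estimated parameters. The function name and data are hypothetical, and the sketch is not presented as SPSS's exact computation.

```python
def cooks_distance(xs, ys):
    """Cook's D for each case in a one-predictor least-squares regression."""
    n = len(xs)
    p = 2  # estimated parameters: intercept and slope
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each case in simple regression.
    leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]
    return [(e * e / (p * mse)) * (h / (1 - h) ** 2)
            for e, h in zip(residuals, leverage)]

# Hypothetical data: nine cases exactly on y = 2x, plus one case far out
# on x whose y value falls well below the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [2.0 * x for x in xs[:-1]] + [10.0]

d = cooks_distance(xs, ys)
suspect = [i for i, value in enumerate(d) if value > 1]
print(suspect)
```

The last case combines an extreme predictor value with a large residual, which is the combination Cook's D is designed to pick up; the remaining cases stay well below the cutoff of 1.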

  18. OK, so what if some of your cases don’t pass this three-pronged approach and it’s not a data entry error? • Discard the case? Rejects the notion that the data “is what it is…”; “tightens up” the model to be more representative of the norm • Keep it in? Distorts the whole for a special circumstance; depending upon your research question, could bring attention to a group needing special consideration • Either way, can be addressed in limitations/conclusions/need for further research

  19. References
  Hinkle, D. E., Wiersma, W., & Jurs, S. G. (2003). Applied statistics for the behavioral sciences (5th ed.). Boston, MA: Houghton Mifflin Company.
  King, G., Honaker, J., Joseph, A., & Scheve, K. (1998). Listwise deletion is evil: What to do about missing data in political science [Electronic version]. Society for Political Methodology: American Political Science Association, Washington University in St. Louis, St. Louis, MO. Retrieved February 2, 2009, from http://polmeth.wustl.edu/workingpapers.php?order=dateasc&title=1998&startdate=1998-01-01&enddate=1998-12-31
  Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). South Melbourne, Australia: Wadsworth.
  Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanation. American Psychologist, 54, 594-604.
