Basics of Data Cleaning

mendel

Presentation Transcript


  1. Basics of Data Cleaning

  2. Why Examine Your Data? • Gain a basic understanding of the data set • Ensure that the statistical and theoretical underpinnings of a given multivariate (m.v.) technique are met • Concerns about the data • Departures from distributional assumptions (e.g., normality) • Outliers • Missing data

  3. Testing Assumptions • Multivariate (MV) normality assumption • Solution is better when the assumption is met • Violations of MV normality • Skewness (symmetry) • Kurtosis (peakedness) • Heteroscedasticity • Non-linearity

  4. Negative Skew

  5. Positive Skew

  6. Kurtosis • Mesokurtic • Leptokurtic • Platykurtic

  7. Skewness & Kurtosis • SPSS syntax: FREQUENCIES VARIABLES=age /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT /ORDER=ANALYSIS. • z = Statistic / Std. Error • Skewness: z = .354 / .205 = 1.73 • Kurtosis: z = -.266 / .407 = -.654 • Critical values for z: p = .05 → ±1.96; p = .01 → ±2.58
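The z calculation on this slide can be sketched in a few lines of Python (not SPSS). The function names `z_ratio` and `exceeds` are illustrative, not from the deck; the statistics and standard errors are the ones reported for age above.

```python
def z_ratio(statistic, std_error):
    """Return statistic / standard error, the z value used on the slide."""
    return statistic / std_error


def exceeds(z, critical):
    """True if |z| is beyond the two-tailed critical value."""
    return abs(z) > critical


# Values reported on the slide for the variable "age"
skew_z = z_ratio(0.354, 0.205)    # about 1.73
kurt_z = z_ratio(-0.266, 0.407)   # about -0.654

# Two-tailed critical values from the slide
CRIT_05 = 1.96
CRIT_01 = 2.58

skew_flag_05 = exceeds(skew_z, CRIT_05)
kurt_flag_05 = exceeds(kurt_z, CRIT_05)
```

Neither z value exceeds ±1.96, so by this criterion age would not be flagged for skewness or kurtosis at the .05 level.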

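  8. Homoscedasticity • $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma_e^2$ • When there are multiple groups, each group has a similar level of variance (similar standard deviation) • [Figure: four group distributions with means $\mu_1$ to $\mu_4$ and variances $\sigma_1^2$ to $\sigma_4^2$]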
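As a rough descriptive check of equal group variances, one could compare the sample variance of each group. The Python sketch below uses hypothetical group data and a largest-to-smallest variance ratio; the ratio is an assumption added for illustration, not a test named in the deck.

```python
from statistics import variance

# Hypothetical scores for three groups (not from the presentation)
groups = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 3, 4, 5, 6],
    "wide": [1, 3, 5, 7, 9],
}

# Sample variance (n - 1 denominator) for each group
variances = {name: variance(scores) for name, scores in groups.items()}

# Ratio of largest to smallest variance; values near 1 suggest similar spread
variance_ratio = max(variances.values()) / min(variances.values())
```

A ratio well above 1, as for the "wide" group here, suggests the groups do not share a common variance and deserves a closer look.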

  9. Linearity

  10. Testing the Assumption of Absence of Correlated Errors • Correlated errors mean that an unmeasured variable is affecting the analysis • The key is to identify the unmeasured variable and include it in the analysis • How often do we meet this assumption?

  11. Data Cleaning • Examine • Individual items/scales (i.e., reliability) • Bivariate relationships • Multivariate relationships • Techniques to use • Graphs → non-normality, heteroscedasticity • Frequencies → missing data, out-of-bounds values • Univariate outliers (+/- 3 SD from mean) • Mahalanobis distance (p < .001)
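The frequencies step (missing data, out-of-bounds values) can be sketched in Python as follows. The responses, the 1 to 5 valid range, and the helper name `frequency_check` are hypothetical; missing values are assumed to be coded as `None`.

```python
from collections import Counter


def frequency_check(values, low, high):
    """Tabulate values, count missing entries, and list out-of-bounds values."""
    counts = Counter(values)
    missing = counts.get(None, 0)
    out_of_bounds = sorted(
        v for v in counts if v is not None and not (low <= v <= high)
    )
    return counts, missing, out_of_bounds


# Hypothetical responses to a 1-5 item; None marks a missing response
responses = [1, 2, 5, None, 3, 7, 4, None, 0, 2]
counts, missing, out_of_bounds = frequency_check(responses, 1, 5)
```

Here the table shows two missing responses and two values (0 and 7) that fall outside the valid range, which would need to be checked against the raw data.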

  12. Graphical Examination • Single Variable: Shape of Distribution • Histogram • Stem and leaf • Relationships between two+ variables • Scatterplot

  13. Histogram

  14. Scatterplot

  15. Frequencies

  16. Outliers • Where do outliers come from? • Inclusion of subjects who are not part of the population (e.g., an ESL respondent's answers on a vocabulary test) • Legitimate data points* • Extreme values of random error (X = t + e) • Error in observation • Error in data preparation

  17. Univariate Outliers • Criterion: Mean +/- 3 SD • Example: Age • Mean = 34.68 • SD = 10.05 • Out-of-range values: > 64.83 or < 4.53
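The cutoffs in the age example follow directly from the mean ± 3 SD rule. A minimal Python sketch, with illustrative function names and two hypothetical ages to screen:

```python
def outlier_bounds(mean, sd, k=3.0):
    """Return (lower, upper) cutoffs at mean -/+ k standard deviations."""
    return mean - k * sd, mean + k * sd


def is_outlier(x, low, high):
    """True if x falls outside the closed interval [low, high]."""
    return x < low or x > high


# Mean and SD for age as reported on the slide
low, high = outlier_bounds(34.68, 10.05)   # about 4.53 and 64.83

flag_70 = is_outlier(70, low, high)   # hypothetical age of 70
flag_40 = is_outlier(40, low, high)   # hypothetical age of 40
```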

  18. Univariate Outliers

  19. Multivariate Outliers • Mahalanobis Distance • SPSS syntax: REGRESSION VARIABLES=case VAR1 VAR2 /STATISTICS=COLLIN /DEPENDENT=case /METHOD=ENTER /RESIDUALS=OUTLIERS(MAHAL). • Critical values at p < .001 (a case with D > c.v. is a multivariate outlier) • Two variables: 13.82 • Three variables: 16.27 • Four variables: 18.47 • Five variables: 20.52 • Six variables: 22.46
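For two variables, the squared Mahalanobis distance of each case from the centroid can be computed by hand. The Python sketch below assumes the value compared with the critical values is the squared distance based on the sample covariance matrix; the data and the function name `mahalanobis_sq_2d` are hypothetical.

```python
from statistics import mean


def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # (d' S^-1 d) written out for the 2 x 2 case
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical data; the last case has an unusually large y value
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 1, 4, 3, 6, 5, 8, 20]

d2 = mahalanobis_sq_2d(xs, ys)
CRITICAL_TWO_VARS = 13.82   # two-variable critical value from the slide
flags = [d > CRITICAL_TWO_VARS for d in d2]
most_extreme_case = d2.index(max(d2))
```

With only eight cases no distance can reach 13.82, so nothing is flagged here; the point of the toy data is the computation, and the last case still has the largest distance.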

  20. Approaches to Outliers • Leave them alone • Delete entire case (listwise) • Delete only relevant variables (pairwise) • Trim – highest legitimate value • Mean substitution • Imputation

  21. Effects of Outliers r = .50 r = .32
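How a single case can change a correlation can be illustrated with a hand-rolled Pearson r in Python. The data below are hypothetical and do not reproduce the r = .50 and r = .32 shown on the slide.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical perfectly linear data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Same data plus one outlying case (x = 6, y = 0)
r_with_outlier = pearson_r(xs + [6], ys + [0])
```

One added case is enough to pull the correlation from 1.0 down to roughly .14 in this toy example.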

  22. Effects of Outliers

  23. Major Problems: Missing Data • Generalizability issues • Reduces power (sample size) • Impacts accuracy of results • Accuracy = dispersion around true score (can be under- or over-estimation) • Varies with the missing data technique (MDT) used

  24. Dealing with Missing Data • Listwise deletion • Pairwise deletion • Mean substitution • Regression imputation • Hot-deck imputation • Multiple imputation
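Two of the listed techniques, listwise deletion and mean substitution, are simple enough to sketch in Python. The records, field names, and helper names are hypothetical, and missing values are assumed to be coded as `None`.

```python
from statistics import mean

# Hypothetical records; None marks a missing value
rows = [
    {"age": 30, "score": 4.0},
    {"age": None, "score": 3.0},
    {"age": 40, "score": None},
    {"age": 50, "score": 5.0},
]


def listwise_delete(rows):
    """Keep only the cases with no missing values on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_substitute(rows, key):
    """Replace missing values of one variable with its observed mean."""
    fill = mean(r[key] for r in rows if r[key] is not None)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


complete_cases = listwise_delete(rows)
age_filled = mean_substitute(rows, "age")
```

Listwise deletion keeps two of the four cases, while mean substitution keeps all four but fills the missing age with the observed mean of 40, which shrinks the variable's variance.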

  25. Dealing with Missing Data In Order of Accuracy: • Pairwise deletion • Listwise deletion • Regression imputation • Mean substitution • Hot-deck imputation

  26. Dealing with Missing Data

  27. Transformations • Distribution → best transformation to try • Moderate deviation from normality → square root • Substantial deviation from normality → log • Severe deviation, esp. J-shape → inverse • Negative skew → "reflect" (mirror image), then transform • Interpretation of transformed variables?
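The four transformations can be sketched in Python as below. The function names are illustrative; the base-10 log and the reflection constant (largest score plus 1, so the smallest reflected value is 1) are assumed conventions, not details given in the deck. Square root and log require non-negative and positive scores respectively, and inverse requires non-zero scores.

```python
import math


def sqrt_transform(x):
    """Square root, for moderate deviation from normality."""
    return math.sqrt(x)


def log_transform(x):
    """Base-10 log, for substantial deviation from normality."""
    return math.log10(x)


def inverse_transform(x):
    """Reciprocal, for severe deviation from normality."""
    return 1.0 / x


def reflect(values):
    """Mirror the scores so negative skew becomes positive skew."""
    k = max(values) + 1
    return [k - v for v in values]


# Hypothetical negatively skewed scores: reflect first, then transform
scores = [1, 2, 10]
reflected = reflect(scores)
reflected_sqrt = [sqrt_transform(v) for v in reflected]
```

After reflection, high original scores become low transformed scores, which is one reason the slide raises the question of how to interpret transformed variables.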
