Basics of Data Cleaning
Why Examine Your Data? • Gain a basic understanding of the data set • Ensure the statistical and theoretical underpinnings of a given multivariate (MV) technique are met • Concerns about the data: departures from distributional assumptions (e.g., normality), outliers, missing data
Testing Assumptions • MV normality assumption: the solution is better when the assumption holds • Violations of MV normality: skewness (asymmetry), kurtosis (peakedness), heteroscedasticity, non-linearity
Kurtosis • Mesokurtic (peakedness of a normal distribution) • Leptokurtic (more peaked, heavier tails) • Platykurtic (flatter, lighter tails)
Skewness & Kurtosis • SPSS syntax:
FREQUENCIES VARIABLES=age /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT /ORDER=ANALYSIS.
• z = statistic / standard error • Skewness: z = .354/.205 = 1.73 • Kurtosis: z = -.266/.407 = -.654 • Critical values for z: +/- 1.96 (p = .05), +/- 2.58 (p = .01)
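The z ratios on this slide can be checked with a few lines of Python. This is a minimal sketch, not SPSS output: `z_ratio`, `se_skewness`, `se_kurtosis`, and `exceeds` are hypothetical helper names, and the standard-error formulas are the common small-sample ones, assumed here to correspond to what SPSS reports as SESKEW and SEKURT.

```python
import math

def se_skewness(n):
    """Standard error of skewness for sample size n (assumed small-sample formula)."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of kurtosis for sample size n (assumed small-sample formula)."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def z_ratio(statistic, std_error):
    """z = statistic / standard error."""
    return statistic / std_error

def exceeds(z, critical=1.96):
    """True when |z| is beyond the critical value (1.96 for .05, 2.58 for .01)."""
    return abs(z) > critical

# Statistic and standard-error values taken from the slide for the variable `age`.
z_skew = z_ratio(0.354, 0.205)
z_kurt = z_ratio(-0.266, 0.407)
print(round(z_skew, 2), exceeds(z_skew))
print(round(z_kurt, 3), exceeds(z_kurt))
```

Neither ratio is beyond +/- 1.96, so by this criterion age does not depart significantly from normality at the .05 level.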
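Homoscedasticity • $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 = \sigma_e^2$ • When there are multiple groups, each group has a similar level of variance (similar standard deviation) • [Figure: four group distributions with means $\mu_1$ to $\mu_4$ and variances $\sigma_1^2$ to $\sigma_4^2$]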
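A rough way to screen for unequal group variances is to compare the largest and smallest sample variances. The sketch below is only a descriptive check under stated assumptions: the slides do not name a specific test, `group_variances` and `variance_ratio` are hypothetical helpers, and the scores are made up for illustration.

```python
from statistics import variance

def group_variances(groups):
    """Sample variance (n - 1 denominator) for each named group."""
    return {name: variance(values) for name, values in groups.items()}

def variance_ratio(groups):
    """Largest group variance divided by the smallest (1.0 means identical)."""
    variances = list(group_variances(groups).values())
    return max(variances) / min(variances)

# Made-up scores for four groups; not data from the slides.
scores = {
    "g1": [4, 5, 6, 7, 8],
    "g2": [3, 5, 6, 7, 9],
    "g3": [5, 5, 6, 7, 7],
    "g4": [2, 4, 6, 8, 10],
}
print(group_variances(scores))
print(variance_ratio(scores))
```

A ratio near 1 is consistent with homoscedasticity; a large ratio suggests a formal test of equal variances is worth running before the main analysis.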
Testing the Assumption of Absence of Correlated Errors • Correlated errors mean that an unmeasured variable is affecting the analysis • The key is to identify the unmeasured variable and include it in the analysis • How often do we meet this assumption?
Data Cleaning • Examine: individual items/scales (e.g., reliability), bivariate relationships, multivariate relationships • Techniques to use • Graphs: non-normality, heteroscedasticity • Frequencies: missing data, out-of-bounds values • Univariate outliers (+/- 3 SD from the mean) • Mahalanobis distance (p < .001)
Graphical Examination • Single Variable: Shape of Distribution • Histogram • Stem and leaf • Relationships between two+ variables • Scatterplot
Outliers • Where do outliers come from? • Inclusion of subjects not part of the population (e.g., ESL response to vocabulary test) • Legitimate data points* • Extreme values of random error (X = t + e) • Error in observation • Error in data preparation
Univariate Outliers • Criteria: Mean +/- 3 SD • Example: Age • Mean = 34.68 • SD = 10.05 • Out of range values > 64.83 or < 4.53
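The cutoffs on this slide follow directly from mean +/- 3 SD. A minimal sketch, assuming hypothetical helper names `outlier_bounds` and `flag_univariate_outliers` and a made-up list of ages:

```python
def outlier_bounds(mean, sd, k=3.0):
    """Lower and upper cutoffs at mean -/+ k standard deviations."""
    return mean - k * sd, mean + k * sd

def flag_univariate_outliers(values, mean, sd, k=3.0):
    """Return the values that fall outside the mean +/- k SD band."""
    low, high = outlier_bounds(mean, sd, k)
    return [v for v in values if v < low or v > high]

# Mean and SD for age as given on the slide.
low, high = outlier_bounds(34.68, 10.05)
print(round(low, 2), round(high, 2))

# Made-up ages for illustration; not data from the slides.
ages = [19, 34, 52, 71, 3]
print(flag_univariate_outliers(ages, 34.68, 10.05))
```

The computed bounds reproduce the slide's 4.53 and 64.83; in the made-up list, 71 and 3 fall outside them.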
Multivariate Outliers • Mahalanobis distance • SPSS syntax:
REGRESSION VARIABLES=case VAR1 VAR2 /STATISTICS COLLIN /DEPENDENT=case /METHOD=ENTER /RESIDUALS=OUTLIERS(MAHAL).
• Critical values (chi-square at p = .001, df = number of variables; a case whose Mahalanobis distance exceeds the critical value is a multivariate outlier) • Two variables: 13.82 • Three variables: 16.27 • Four variables: 18.47 • Five variables: 20.52 • Six variables: 22.46
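For the two-variable case, the distance and its critical value can be worked out without any statistics package. The sketch below is an illustration under assumptions, not a replacement for the SPSS run: `mahalanobis_sq_2d` and `chi2_critical_df2` are hypothetical helpers, the data are made up, and the closed-form critical value holds only for df = 2.

```python
import math

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d' S^-1 d written out with the closed-form inverse of a 2x2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

def chi2_critical_df2(alpha=0.001):
    """Chi-square critical value for df = 2 only, where the upper tail is exp(-x / 2)."""
    return -2.0 * math.log(alpha)

# Made-up scores on two variables; not data from the slides.
var1 = [2, 3, 4, 5, 6, 7, 8, 9]
var2 = [1, 3, 2, 5, 4, 6, 8, 7]
d_squared = mahalanobis_sq_2d(var1, var2)
cutoff = chi2_critical_df2(0.001)
flagged = [i for i, d in enumerate(d_squared) if d > cutoff]
print(round(cutoff, 2), flagged)
```

The cutoff rounds to the 13.82 listed for two variables. With more variables the same idea needs a general matrix inverse and a chi-square quantile function such as `scipy.stats.chi2.ppf`.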
Approaches to Outliers • Leave them alone • Delete the entire case (listwise) • Delete only the relevant variables (pairwise) • Trim: recode to the highest legitimate value • Mean substitution • Imputation
Effects of Outliers • [Figure: two scatterplots, one with r = .50 and one with r = .32]
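How much a single case can move a correlation is easy to demonstrate. The sketch below uses made-up data (not the data behind the slide's figure) and a hypothetical helper `pearson_r` to compare r with and without one extreme case.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up scores with a strong positive relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

r_without = pearson_r(x, y)
# Add one case that is high on x but at the bottom of y.
r_with = pearson_r(x + [9], y + [0])
print(round(r_without, 2), round(r_with, 2))
```

In this example one added case out of nine drops r from about .90 to about .33.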
Major Problems: Missing Data • Generalizability issues • Reduces power (smaller sample size) • Affects the accuracy of results • Accuracy = dispersion around the true score (can be under- or over-estimation) • Varies with the missing data technique (MDT) used
Dealing with Missing Data • Listwise deletion • Pairwise deletion • Mean substitution • Regression imputation • Hot-deck imputation • Multiple imputation
Dealing with Missing Data In Order of Accuracy: • Pairwise deletion • Listwise deletion • Regression imputation • Mean substitution • Hot-deck imputation
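Three of the simpler techniques can be sketched in plain Python to show how they differ in the cases they keep. This is an illustration under assumptions: `None` marks a missing value, the function names are hypothetical, and the data are made up. Regression, hot-deck, and multiple imputation are not covered here.

```python
def listwise_delete(rows):
    """Keep only cases with no missing values (None marks missing)."""
    return [row for row in rows if None not in row]

def pairwise_complete(rows, i, j):
    """Cases usable for an analysis involving only columns i and j."""
    return [(row[i], row[j]) for row in rows
            if row[i] is not None and row[j] is not None]

def mean_substitute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]

# Made-up data: four cases, three variables.
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 4.0],
    [3.0, 6.0, None],
    [4.0, 8.0, 5.0],
]
print(len(listwise_delete(data)))
print(len(pairwise_complete(data, 0, 1)))
print(mean_substitute(data))
```

Listwise deletion keeps two of the four cases, while pairwise deletion keeps three for an analysis of the first two variables, which is the power trade-off the slides point to. Mean substitution keeps every case but shrinks the variance of the filled-in variable.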
Transformations • Distribution: best transformation to try • Moderate deviation from normality: square root • Substantial deviation from normality: log • Severe deviation, especially J-shaped: inverse • Negative skew: "reflect" (mirror image), then transform • Interpretation of transformed variables?
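The four transformations can be written as small functions. This is a minimal sketch with hypothetical names; it assumes all scores are positive (add a constant first if they are not), uses base-10 logs as one common choice, and reflects with (max + 1) - x so the smallest reflected score is 1.

```python
import math

def sqrt_transform(values):
    """Square root: for moderate deviation from normality."""
    return [math.sqrt(v) for v in values]

def log_transform(values):
    """Base-10 log: for substantial deviation from normality."""
    return [math.log10(v) for v in values]

def inverse_transform(values):
    """Inverse (1 / x): for severe deviation, especially J-shaped."""
    return [1.0 / v for v in values]

def reflect(values):
    """Mirror the scores: (max + 1) - x, so the smallest reflected score is 1."""
    top = max(values) + 1
    return [top - v for v in values]

# Made-up negatively skewed scores: reflect first, then transform.
scores = [1, 6, 8, 9, 9, 10, 10, 10]
reflected = reflect(scores)
print(reflected)
print([round(v, 2) for v in sqrt_transform(reflected)])
```

Reflection reverses the direction of the scale, so a high transformed score corresponds to a low original score. That reversal has to be kept in mind when interpreting results on the transformed variable.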