Nasty data … When killer data can ruin your analyses. JENA GRADUATE ACADEMY Dr. Friedrich Funke. Learning Objectives. What will you have learnt today? Why to inspect your data Why data become nasty
Nasty data … When killer data can ruin your analyses JENA GRADUATE ACADEMY Dr. Friedrich Funke
Learning Objectives What will you have learnt today? • Why to inspect your data • Why data become nasty • How to inspect your data • Coping strategies
Why to inspect your data? Assumptions of parametric tests (e.g. ANOVA) The error terms are… • randomly, independently, and normally distributed, • with a mean of zero and • a common variance (homoscedasticity)
Why to inspect your data? Basic statistical method – Ordinary least squares (OLS)
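OLS chooses the line that minimises the sum of squared residuals, so a single extreme value can dominate the fit. The sketch below is a minimal illustration in plain Python (not the SPSS used in these slides). The data are hypothetical, and the input error mimics a value of 55 keyed in instead of 5.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: y rises by exactly 1 per unit of x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_clean = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Same data with one input error: 55 typed instead of 5.
y_typo = list(y_clean)
y_typo[3] = 55

clean_intercept, clean_slope = ols_fit(x, y_clean)
typo_intercept, typo_slope = ols_fit(x, y_typo)

print("clean:", clean_intercept, clean_slope)
print("typo: ", typo_intercept, typo_slope)
```

With these made-up numbers the slope drops from 1 to below 0.1 because of one mistyped case, which is why squared-error methods need screened data.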
Where are we? • Why to inspect your data: violation of assumptions • Why data become nasty • How to inspect your data • Coping strategies
Where are we? • Why to inspect your data: violation of assumptions • Why data become nasty • How to inspect your data • Coping strategies • Input errors (55 instead of 5) • Dropout/non-response • Human nature keeps the game interesting
Am I allowed to alter my data? • It is unethical to alter data for any reason. Or • Data points should be removed if they are outliers and there is an identifiable reason for invalidity. Or • Data points should be removed if they are outliers. Extremity is reason enough. 29% 67% 4%
Am I allowed to alter my data? • It is unethical to alter data for any reason • A good model for most data is better than a poor model for all of your data.
Where are we? • Why to inspect your data: violation of assumptions • Why data become nasty • How to inspect your data • Coping strategies
Test of normality • Access e.g. via EXPLORE
My data are skewed – what shall I do? • Objection: transformed variables are difficult to interpret • But scales are often arbitrary, so interpretation is no problem • Find a transformation that produces the prettiest picture and skewness and kurtosis near 0 (iterative)
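To judge whether skewness and kurtosis are "near 0", both can be computed from the central moments. The sketch below is a plain-Python illustration with hypothetical data. It uses the simple moment-based formulas, which is an assumption: statistics packages often report bias-corrected versions, so their values can differ slightly in small samples.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 4, 9, 20]

sym_skew, sym_kurt = skewness_and_excess_kurtosis(symmetric)
rs_skew, rs_kurt = skewness_and_excess_kurtosis(right_skewed)

print("symmetric:   ", sym_skew, sym_kurt)
print("right-skewed:", rs_skew, rs_kurt)
```

A positive skewness indicates a long right tail, a negative one a long left tail. Rerunning the function after each candidate transformation gives the iterative check described above.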
Common data transformations • Before/after
COMPUTE after = sqrt(before).
or
COMPUTE after = lg10(before+constant).
or
COMPUTE after = 1/(before+constant).
Common data transformations • Add a constant to make the smallest value > 1 • For left-skewed variables reverse the variables (reversed = max+1-old_var)
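The transformation recipe can be tried out on a small example. The following is a sketch in plain Python rather than SPSS syntax. The function names `reflect`, `transform`, and `skewness` are hypothetical helpers, the scores are made up, and the constant 2 is chosen only so that the smallest value ends up above 1.

```python
import math


def reflect(values):
    """Reverse a left-skewed variable: reversed = max + 1 - old_var."""
    top = max(values)
    return [top + 1 - v for v in values]


def transform(values, kind, constant=0.0):
    """Add a constant, then apply square root, log10 or reciprocal."""
    shifted = [v + constant for v in values]
    if kind == "sqrt":
        return [math.sqrt(v) for v in shifted]
    if kind == "lg10":
        return [math.log10(v) for v in shifted]
    if kind == "inverse":
        return [1.0 / v for v in shifted]
    raise ValueError(f"unknown transformation: {kind}")


def skewness(values):
    """Moment-based skewness, used here only to compare before and after."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


before = [0, 0, 1, 1, 2, 3, 8, 19]  # hypothetical right-skewed scores
after_sqrt = transform(before, "sqrt", constant=2)
after_lg10 = transform(before, "lg10", constant=2)

print("before:", skewness(before))
print("sqrt:  ", skewness(after_sqrt))
print("lg10:  ", skewness(after_lg10))
print("reflected:", reflect([1, 4, 5]))
```

On these made-up scores the square root reduces the skewness somewhat and the logarithm reduces it further. Note that the reciprocal reverses the order of the values, which matters when interpreting the direction of effects.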
Rules of thumb • Studentized deleted residuals with an absolute value greater than 2 deserve a look (greater than 4: alarm bells) • Cook's D is problematic if D is large. One recommendation is to treat values exceeding 4/n as large (n = number of cases). • Another suggested rule is to consider any value greater than 1 or 2 as indicating that an observation requires a careful look. • Finally, some researchers look for gaps between the D values.
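These diagnostics can be computed by hand for a simple regression with one predictor. The sketch below uses plain Python and the standard textbook formulas for leverage, studentized deleted residuals, and Cook's D. The data are hypothetical (one case keyed in as 55), and reading the cutoff for Cook's D as 4/n, with n the number of cases, is an assumption.

```python
import math


def influence_diagnostics(x, y):
    """Leverage, studentized deleted residuals and Cook's D for y = a + b*x."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    mse = sse / (n - p)

    # Leverage: how unusual each case is on the predictor.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Studentized deleted residual: residual scaled by the error
    # variance estimated without the case itself.
    deleted = [
        e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
        for e, h in zip(residuals, leverage)
    ]
    # Cook's D: combines residual size and leverage into one influence value.
    cooks_d = [
        e ** 2 / (p * mse) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return leverage, deleted, cooks_d


# Hypothetical data; the fourth case was keyed in as 55.
x = list(range(1, 11))
y = [2.1, 2.9, 4.2, 55, 5.8, 7.1, 8.0, 9.2, 9.9, 11.1]

leverage, deleted, cooks_d = influence_diagnostics(x, y)
n = len(x)
look_closer = [i for i, t in enumerate(deleted) if abs(t) > 2]
large_d = [i for i, d in enumerate(cooks_d) if d > 4 / n]

print("cases with |studentized deleted residual| > 2:", look_closer)
print("cases with Cook's D > 4/n:", large_d)
```

Both rules single out the mistyped case here. In real data the rules can disagree, which is one reason to also look for gaps between the D values.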
Checklist For Screening Data
• Inspect univariate descriptive statistics for accuracy of input
  - out-of-range values, be aware of measurement scales
  - plausible means and standard deviations
  - coefficient of variation
• Evaluate amount and distribution of missing data: deal with problem
• Independence of variables
• Identify and deal with nonnormal variables
  - check skewness and kurtosis, probability plots
  - transform variables (if desirable)
  - check results of transformations
• Identify and deal with outliers
  - univariate outliers
  - multivariate outliers
• Check pairwise plots for nonlinearity and heteroscedasticity
• Evaluate variables for multicollinearity and singularity
• Check for spatial autocorrelation
Adapted from Tabachnick & Fidell
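The first two checklist items (accuracy of input, missing data) can be sketched in a few lines. This is a plain-Python illustration, not part of the original checklist. The helper `screen_variable`, the 1 to 5 response scale, and the data are hypothetical, and missing values are assumed to be coded as `None`.

```python
import statistics


def screen_variable(values, valid_min, valid_max):
    """Count missing values, list out-of-range values, and summarise the rest."""
    present = [v for v in values if v is not None]
    out_of_range = [v for v in present if not valid_min <= v <= valid_max]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)  # sample standard deviation
    return {
        "n_missing": len(values) - len(present),
        "out_of_range": out_of_range,
        "mean": mean,
        "sd": sd,
        # Coefficient of variation: only meaningful for ratio-scaled
        # variables with a non-zero mean.
        "cv": sd / mean,
    }


# Hypothetical item answered on a 1 to 5 scale; None marks a missing answer.
item = [3, 4, 55, 2, None, 5, 1, 4, None, 3]
report = screen_variable(item, valid_min=1, valid_max=5)

for key, value in report.items():
    print(key, value)
```

An implausible mean or an inflated standard deviation in such a report is often the first visible trace of an input error.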
Best practice flow chart • Plausible range, missing, normality, outliers, homoscedasticity • Pairwise linearity (differential skewness?) • Studentized deleted residuals, leverage, Cook's D • … e.g. square root, lg10, arcsin
Take home message • Detecting nasty data is important • Knowing how to handle them is better • Understanding WHY they are there is most important
Francis Bacon in Novum Organum: » For whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways «