
Nasty data … When killer data can ruin your analyses

Nasty data … When killer data can ruin your analyses. JENA GRADUATE ACADEMY Dr. Friedrich Funke. Learning Objectives. What will you have learnt today? Why to inspect your data Why data become nasty


Presentation Transcript


  1. Nasty data … When killer data can ruin your analyses JENA GRADUATE ACADEMY Dr. Friedrich Funke

  2. Learning Objectives What will you have learnt today? • Why to inspect your data • Why data become nasty • How to inspect your data • Coping strategies

  3. Why to inspect your data? Assumptions of parametric tests (e.g. ANOVA) The error terms are… • randomly, independently, and normally distributed, • with a mean of zero and • a common variance (homoscedasticity)
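Written out for a one-way ANOVA, the three assumptions on this slide amount to one statement about the error term. The notation is an assumption for illustration and does not come from the slides: $y_{ij}$ is observation $j$ in group $i$, $\mu$ the grand mean, $\alpha_i$ the group effect.

```latex
y_{ij} = \mu + \alpha_i + \varepsilon_{ij},
\qquad
\varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)
```

Independence and normality are carried by the i.i.d. normal statement, the mean of zero by the first parameter, and homoscedasticity by the single $\sigma^2$ shared across all groups.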

  4. Why to inspect your data? Basic statistical method – Ordinary least squares (OLS)
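A minimal Python sketch of why OLS is sensitive to nasty data: because the method minimises squared residuals, one mistyped value can dominate the fit. The data and the helper name `ols_fit` are made up for illustration.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the line minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: a perfect line, and the same line with one input error.
x = [1, 2, 3, 4, 5, 6]
y_clean = [1, 2, 3, 4, 5, 6]
y_typo = [1, 2, 3, 4, 55, 6]  # 55 typed instead of 5

clean_fit = ols_fit(x, y_clean)
typo_fit = ols_fit(x, y_typo)
print("clean (intercept, slope):", clean_fit)
print("typo  (intercept, slope):", typo_fit)
```

With these numbers the single typo moves the slope from 1 to above 5, even though five of the six points are unchanged.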

  5. Where are we? • Why to inspect your data: violation of assumptions • Why data become nasty • How to inspect your data • Coping strategies

  6. Where are we? • Why to inspect your data: violation of assumptions • Why data become nasty • How to inspect your data • Coping strategies • Input errors (55 instead of 5) • Dropout/non-response • Human nature keeps the game interesting

  7. Am I allowed to alter my data? • It is unethical to alter data for any reason. Or • Data points should be removed if they are outliers and there is an identifiable reason for invalidity. Or • Data points should be removed if they are outliers. Extremity is reason enough. 29% 67% 4%

  8. Am I allowed to alter my data? • It is unethical to alter data for any reason • A good model for most data is better than a poor model for all of your data.

  9. Where are we? • Why to inspect your data: violation of assumptions • Why data become nasty • How to inspect your data • Coping strategies

  10. Graphical data screening

  11. Normal q-q plot
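As a rough sketch of what a normal q-q plot draws: each sorted observation is paired with the standard-normal quantile expected at its rank. The function name, the sample values, and the plotting positions `(i - 0.5) / n` are assumptions; statistics packages may use a different plotting-position formula.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each sorted observation with its expected standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are one common convention among several.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample with one suspiciously large value.
points = normal_qq_points([2.1, 3.4, 1.9, 2.8, 9.7, 2.5, 3.0])
for q, v in points:
    print(f"{q:6.2f}  {v:5.1f}")
```

If the data were normal, the pairs would fall close to a straight line; a point far above the line at the upper end, like 9.7 here, hints at right skew or an outlier.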

  12. Test on normality • Access e.g. via EXPLORE

  13. My data are skewed – what shall I do? • Transformed variables are difficult to interpret • Scales are often arbitrary, so there is no problem of interpretation • Find a transformation that produces the prettiest picture and skewness and kurtosis near 0 (iterative)
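To judge whether skewness and kurtosis are "near 0" before and after a transformation, the simple moment-based versions can be computed as below. This is a sketch under assumptions: the function name is hypothetical, and statistics packages often report bias-adjusted variants, so their values may differ slightly from these.

```python
from statistics import fmean

def skewness_kurtosis(values):
    """Return (skewness, excess kurtosis) from the central moments of the sample."""
    n = len(values)
    m = fmean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurt

symmetric = skewness_kurtosis([1, 2, 3, 4, 5])
right_skewed = skewness_kurtosis([1, 1, 2, 2, 3, 10])
print("symmetric:   ", symmetric)
print("right-skewed:", right_skewed)
```

Rerunning the function after each candidate transformation gives the iterative loop the slide describes.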

  14. Common data transformations • Before/after COMPUTE after = sqrt(before). or COMPUTE after = lg10(before+constant). or COMPUTE after = 1/(before+constant).

  15. Common data transformations • Add a constant to make the smallest value > 1 • For left-skewed variables reverse the variables (reversed = max+1-old_var)
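The transformations from the last two slides can be sketched in Python as a rough analogue of the SPSS `COMPUTE` statements. All function names are hypothetical, and the default target of exactly 1 for the smallest value is an assumption; raise `target_min` if the minimum should be strictly above 1.

```python
import math

def shift_constant(values, target_min=1.0):
    """Constant that lifts the smallest value to target_min (0 if it is already there)."""
    return max(0.0, target_min - min(values))

def sqrt_transform(values):
    return [math.sqrt(v) for v in values]

def log10_transform(values, constant=0.0):
    return [math.log10(v + constant) for v in values]

def reciprocal_transform(values, constant=0.0):
    return [1.0 / (v + constant) for v in values]

def reflect(values):
    """Reverse a left-skewed variable: reversed = max + 1 - old_var."""
    top = max(values)
    return [top + 1 - v for v in values]

before = [0, 3, 8, 15, 99]          # hypothetical right-skewed scores
c = shift_constant(before)          # 1.0, so the smallest value becomes 1
after_log = log10_transform(before, c)
after_inv = reciprocal_transform(before, c)
mirrored = reflect([1, 2, 5])       # [5, 4, 1]
print(c, after_log, after_inv, mirrored)
```

After reflecting, a left-skewed variable is right-skewed and can go through the same square-root, logarithm, or reciprocal step; note that reflecting, like the reciprocal, reverses the direction of the scale.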

  16. To be completed with residual analysis

  17. Rules of thumb • Studentized deleted residuals with an absolute value greater than 2 deserve a look (greater than 4: alarm bells) • Cook's D: one recommendation is to consider values large if they exceed 4/n. • Another suggested rule is to consider any value greater than 1 or 2 as indicating that an observation requires a careful look. • Finally, some researchers look for gaps between the D values.
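The diagnostics behind these rules can be sketched for the simplest case, a regression with one predictor. The function name and data are hypothetical, and the formulas are the standard textbook ones for leverage, studentized deleted residuals, and Cook's D. The sketch assumes the fit is not perfect, otherwise the mean squared error is zero.

```python
import math
from statistics import fmean

def influence_diagnostics(x, y):
    """Leverage, studentized deleted residual and Cook's D for each case (one predictor)."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx               # leverage
        r = e / math.sqrt(mse * (1 - h))                  # studentized residual
        t = r * math.sqrt((n - p - 1) / (n - p - r ** 2)) # studentized deleted residual
        d = r ** 2 / p * h / (1 - h)                      # Cook's D
        rows.append({"leverage": h, "t_deleted": t, "cooks_d": d})
    return rows

# Hypothetical data: roughly y = x, with a wild last value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 20.0]
rows = influence_diagnostics(x, y)
worst = max(range(len(rows)), key=lambda i: rows[i]["cooks_d"])
flagged = [i for i, row in enumerate(rows) if abs(row["t_deleted"]) > 2]
print("largest Cook's D at case", worst, rows[worst])
print("cases with |t| > 2:", flagged)
```

In this made-up example the last case combines high leverage with a large residual, so it trips every rule on the slide at once.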

  18. Checklist For Screening Data • Inspect univariate descriptive statistics for accuracy of input • out-of-range values, be aware of measurement scales • plausible means and standard deviations • coefficient of variation • Evaluate amount and distribution of missing data: deal with problem • Independence of variables • Identify and deal with nonnormal variables • check skewness and kurtosis, probability plots • transform variables (if desirable) • check results of transformations • Identify and deal with outliers • univariate outliers • multivariate outliers • Check pairwise plots for nonlinearity and heteroscedasticity • Evaluate variables for multicollinearity and singularity • Check for spatial autocorrelation Adapted from Tabachnick & Fidell
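The first items of the checklist (input accuracy, out-of-range values, plausible mean and SD, coefficient of variation, amount of missing data) can be sketched for a single variable as follows. The function name, the use of `None` for missing values, and the 1 to 5 rating scale are assumptions for illustration.

```python
from statistics import fmean, stdev

def screen_variable(values, low, high):
    """Summarise one variable: missing count, out-of-range values, mean, SD, CV."""
    present = [v for v in values if v is not None]
    out_of_range = [v for v in present if not (low <= v <= high)]
    m = fmean(present)
    sd = stdev(present)  # sample standard deviation (n - 1 in the denominator)
    return {
        "n_missing": len(values) - len(present),
        "out_of_range": out_of_range,
        "mean": m,
        "sd": sd,
        "cv": sd / m if m != 0 else float("nan"),  # coefficient of variation
    }

# Hypothetical ratings on a 1-5 scale; None marks a missing response.
ratings = [3, 4, None, 5, 55, 2, 4, None, 1]
report = screen_variable(ratings, low=1, high=5)
print(report)
```

Here the range check catches the 55 directly, and the implausible mean (above the scale maximum) and the large coefficient of variation point to the same input error.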

  19. Best practice flow chart • Plausible range, missing, normality, outliers, homoscedasticity • Pairwise linearity (differential skewness?) • Studentized deleted residuals, leverage, Cook's D … • e.g. square root, lg10, arcsin

  20. Take home message • Detecting nasty data is important • Knowing how to handle them is better • Understanding WHY they are there is most important

  21. Francis Bacon in Novum Organum: » For whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways «
