
Statistical Data Editing and Imputation

Presented by Sander Scholtus, Statistics Netherlands. Data arrive at a statistical institute containing errors, implausible values, and missing values.


Presentation Transcript


  1. Statistical Data Editing and Imputation

  2. Presented by • Sander Scholtus • Statistics Netherlands

  3. Introduction • Data arrive at a statistical institute...

  4. Introduction • Data arrive at a statistical institute... • …containing errors and implausible values • …containing missing values • To produce statistical output of sufficient quality, these data problems have to be treated • Statistical data editing deals with errors • Imputation deals with missing values

  5. Statistical data editing • Overview • Goals • Edit rules • Different editing methods and how to combine them • Modules in the handbook

  6. Statistical data editing – goals • Traditional goal of editing: • Detect and correct all errors in the collected data • Problems: • Very labour-intensive • Very time-consuming • Highly inefficient: measurement error is not the only source of error in statistical output

  7. Statistical data editing – goals • Modern goals of editing: • To identify possible sources of errors so that the statistical process may be improved in the future. • To provide information about the quality of the data collected and published. • To detect and correct influential errors in the collected data. • If necessary, to provide complete and consistent micro-data. sources: Granquist (1997), EDIMBUS (2007)

  8. Statistical data editing – edit rules • Edit rules (edits, edit checks, checking rules) • Used to detect errors • Can be either hard or soft • General form: IF (unit ∈ edit group) THEN (test variable ∈ acceptance region)

  9. Statistical data editing – edit rules • Examples of edit rules: • Turnover ≥ 0 (non-negativity edit, hard) • Profit = Turnover – Total costs (balance edit, hard) • IF (Size class = “Small”) THEN (0 ≤ Number of employees < 10) (conditional edit, soft) • IF (Economic activity = “Construction”) THEN (a < Turnover / Number of employees < b) (ratio edit, soft)
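
The edit rules on this slide can be expressed as predicates that a record either passes or fails. The sketch below is only an illustration: the record, the field names and the `failed_edits` helper are hypothetical, and the ratio edit is left out because its bounds a and b are not given in the slides.

```python
# Hypothetical record; field names mirror the variables on the slide.
record = {
    "turnover": 1200.0,
    "total_costs": 900.0,
    "profit": 300.0,
    "size_class": "Small",
    "employees": 14,
}

# Each edit: (name, kind, predicate that is True when the record passes).
EDITS = [
    ("non-negativity", "hard", lambda r: r["turnover"] >= 0),
    ("balance", "hard",
     lambda r: r["profit"] == r["turnover"] - r["total_costs"]),
    # Conditional edit: only units in the edit group "Small" are tested.
    ("size class", "soft",
     lambda r: r["size_class"] != "Small" or 0 <= r["employees"] < 10),
]


def failed_edits(rec, edits=EDITS):
    """Return (name, kind) for every edit that the record violates."""
    return [(name, kind) for name, kind, passes in edits if not passes(rec)]


violations = failed_edits(record)
```

A failed hard edit means the record certainly contains an error, whereas a failed soft edit (as for this record) only marks it as suspicious.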

  10. Statistical data editing – methods • [Flow chart: raw microdata → deductive editing → selective editing → manual editing (selected records) or automatic editing (not selected records) → macro-editing → statistical microdata]

  11. Statistical data editing – methods • Deductive editing • Directed at systematic errors • Deterministic detection and amendment • if-then rules • algorithms • Examples: • unit of measurement errors (e.g. “4,000,000” instead of “4,000”) • sign errors (e.g. “–10” instead of “10”) • simple typing errors (e.g. “192” instead of “129”) • subject-matter specific errors
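
Two of the systematic errors listed above can be handled with simple if-then rules. The following is a minimal sketch, not the handbook's algorithm: the function names are hypothetical, the reference value is assumed to come from, say, a previous period, and the bounds 300 and 3000 are arbitrary illustrative thresholds.

```python
def fix_thousand_error(reported, reference, low=300.0, high=3000.0):
    """Divide by 1000 when the reported value is roughly 1000 times
    the reference value (assumed unit-of-measurement error)."""
    if reference > 0 and low < reported / reference < high:
        return reported / 1000.0
    return reported


def fix_sign_error(profit, turnover, total_costs):
    """Flip the sign of profit when doing so restores the balance edit
    profit = turnover - total_costs."""
    balance = turnover - total_costs
    if profit != balance and -profit == balance:
        return -profit
    return profit


corrected_turnover = fix_thousand_error(4_000_000, reference=4_100)
corrected_profit = fix_sign_error(-10, turnover=50, total_costs=40)
```

Both rules are deterministic: the same input always yields the same amendment, which is what distinguishes deductive editing from the model-based steps later in the process.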

  12. Statistical data editing – methods • [Flow chart of the editing process, repeated from slide 10]

  13. Statistical data editing – methods • Selective editing • Prioritise records according to the expected benefit of their manual amendment on target estimates • Records can be selected as they arrive (input editing) • Common approach based on score functions • Local scores for key target variables (formula not preserved in this transcript) • Use a global score to summarise local scores (e.g., sum or maximum)
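
As an illustration of score functions, the sketch below uses one commonly seen form of local score: the weighted absolute difference between the observed and an anticipated value, scaled by a reference total. This form is an assumption, not necessarily the formula shown on the original slide, and all names, data and the threshold are hypothetical.

```python
def local_score(weight, observed, anticipated, reference_total):
    """Weighted deviation from the anticipated value, relative to a total."""
    return weight * abs(observed - anticipated) / reference_total


def global_score(local_scores, how="max"):
    """Summarise local scores by their maximum or their sum."""
    if how == "max":
        return max(local_scores)
    if how == "sum":
        return sum(local_scores)
    raise ValueError("how must be 'max' or 'sum'")


def select_for_manual_editing(scores, threshold):
    """Record ids whose global score exceeds the threshold."""
    return [rid for rid, score in scores.items() if score > threshold]


# record id -> list of (weight, observed, anticipated, reference total)
records = {
    "A": [(10, 500.0, 400.0, 10_000.0), (10, 12.0, 12.0, 200.0)],
    "B": [(10, 4_000_000.0, 4_000.0, 10_000.0), (10, 8.0, 9.0, 200.0)],
}
scores = {
    rid: global_score([local_score(*item) for item in items])
    for rid, items in records.items()
}
selected = select_for_manual_editing(scores, threshold=1.0)
```

Because each score needs only the record itself plus reference values, records can be scored as they arrive, which is what makes input editing possible.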

  14. Statistical data editing – methods • [Flow chart of the editing process, repeated from slide 10]

  15. Statistical data editing – methods • Manual editing • Requires: • Human editors (subject-matter specialists) • Dedicated software (interactive editing) • Edit rules (hard and soft) • Editing instructions • Re-contacts with businesses are sometimes used • Important as a source for improvements in future rounds of a repeated survey

  16. Statistical data editing – methods • [Flow chart of the editing process, repeated from slide 10]

  17. Statistical data editing – methods • Automatic editing • Obtain consistent micro-data for non-influential records • Paradigm of Fellegi and Holt (1976): Data should be made consistent with the edit rules by changing the fewest possible (weighted) number of items. • Leads to error localisation as a mathematical optimisation problem • Imputation of new values as a separate step • Requires: • (Hard) edit rules • Dedicated software (e.g.: Banff by Statistics Canada; SLICE by Statistics Netherlands; R package editrules)
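
The Fellegi-Holt paradigm can be illustrated with a brute-force search over sets of variables, ordered by total weight. This is only a toy sketch for a few categorical variables; the domains, edits and function names are hypothetical, and the dedicated software named above can be expected to use much more efficient algorithms.

```python
from itertools import combinations, product

# Hypothetical categorical domains and hard edits.
DOMAINS = {
    "age": ["under16", "16plus"],
    "marital": ["married", "unmarried"],
    "relation": ["spouse", "other"],
}
EDITS = [
    lambda r: not (r["age"] == "under16" and r["marital"] == "married"),
    lambda r: not (r["marital"] != "married" and r["relation"] == "spouse"),
]


def consistent(rec):
    """True when the record satisfies every edit."""
    return all(edit(rec) for edit in EDITS)


def localise_errors(rec, weights):
    """Return a minimum-weight set of variables that can be changed so that
    all edits are satisfied, or None if no such set exists."""
    names = list(rec)
    subsets = [s for k in range(len(names) + 1)
               for s in combinations(names, k)]
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in subsets:
        # Try every combination of new values for the chosen variables.
        for values in product(*(DOMAINS[v] for v in subset)):
            if consistent({**rec, **dict(zip(subset, values))}):
                return set(subset)
    return None


record = {"age": "under16", "marital": "married", "relation": "spouse"}
unit_weights = {"age": 1, "marital": 1, "relation": 1}
to_change = localise_errors(record, unit_weights)
```

With unit weights the search singles out `age`: changing `marital` alone would repair the first edit but break the second. The search only identifies which items to change; choosing their new values is left to the imputation step.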

  18. Statistical data editing – methods • [Flow chart of the editing process, repeated from slide 10]

  19. Statistical data editing – methods • Macro-editing • Also known as output editing • Same purpose as selective editing • Uses data from all available records at once • Aggregate method: • Compute high-level aggregates • Check their plausibility • Drill down to suspicious lower-level aggregates • Eventually: Drill down to suspicious individual records • Feedback to manual editing • Graphical aids (scatter plots, etc.) to find outliers
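
The aggregate method can be sketched as three small steps: compute totals per group, flag groups whose total looks implausible, and drill down to the records that contribute most. The comparison against previous-period totals, the 25% threshold and all names and data below are assumptions made for the example.

```python
from collections import defaultdict


def aggregate(records, key, value):
    """Sum `value` for each level of `key`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[value]
    return dict(totals)


def suspicious_groups(current, previous, max_rel_change=0.25):
    """Groups whose total changed by more than max_rel_change."""
    return [
        group for group, total in current.items()
        if previous.get(group, 0) > 0
        and abs(total - previous[group]) / previous[group] > max_rel_change
    ]


def drill_down(records, key, group, value, top=2):
    """Largest contributors to a suspicious group's total."""
    members = [rec for rec in records if rec[key] == group]
    return sorted(members, key=lambda rec: rec[value], reverse=True)[:top]


records = [
    {"id": 1, "industry": "Construction", "turnover": 120.0},
    {"id": 2, "industry": "Construction", "turnover": 4000.0},
    {"id": 3, "industry": "Retail", "turnover": 300.0},
    {"id": 4, "industry": "Retail", "turnover": 310.0},
]
previous_totals = {"Construction": 250.0, "Retail": 600.0}

totals = aggregate(records, "industry", "turnover")
flagged = suspicious_groups(totals, previous_totals)
suspects = drill_down(records, "industry", flagged[0], "turnover")
```

The records returned by the drill-down are the ones that would be fed back to manual editing.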

  20. Statistical data editing – modules • Modules in the handbook: • Main theme module • Deductive editing • Selective editing • Automatic editing • Manual editing • Macro-editing • Editing administrative data • Editing for longitudinal data

  21. Imputation • Overview • Missing data • Imputation methods • Special topics • Modules in the handbook

  22. Imputation – missing data • Missing data may occur because of • Logical reasons • A particular question does not apply to a particular unit • Unit non-response • No data observed at all for a particular unit • Item non-response • Unit is not able to answer a particular question • Unit is not willing to answer a particular question • Editing • Originally observed value discarded during automatic editing

  23. Imputation – missing data • Imputation: filling in new (estimated) values for data items that are missing • Commonly used for missing data due to item non-response and editing • Obtain a completed micro-data file prior to estimation • Simplifies the estimation step • Prevents inconsistencies in the output

  24. Imputation – methods • Deductive imputation • Model-based imputation • Donor imputation • Assumption: All observed values are correct • Imputation applied after error localisation

  25. Imputation – methods • Deductive imputation • Derive (rather than estimate) missing values from observed values based on • logical relations (edit rules) • substantive imputation rules • Can be very useful as a first imputation step
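
A balance edit such as Profit = Turnover – Total costs determines any one of its variables once the other two are observed. The sketch below derives the single missing item from that edit; the function name and the use of `None` for missing values are assumptions of the example.

```python
def impute_from_balance(rec):
    """Derive the single missing item in profit = turnover - total_costs.
    Records with no or several missing items are returned unchanged."""
    keys = ("profit", "turnover", "total_costs")
    missing = [k for k in keys if rec[k] is None]
    out = dict(rec)
    if len(missing) != 1:
        return out
    if missing[0] == "profit":
        out["profit"] = rec["turnover"] - rec["total_costs"]
    elif missing[0] == "turnover":
        out["turnover"] = rec["profit"] + rec["total_costs"]
    else:
        out["total_costs"] = rec["turnover"] - rec["profit"]
    return out


completed = impute_from_balance(
    {"profit": None, "turnover": 1200.0, "total_costs": 900.0}
)
```

Such values are derived rather than estimated, so under the assumption that the observed values are correct they introduce no imputation error, which is why this step is useful before any model-based or donor method is applied.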

  26. Imputation – methods • Model-based imputation • Imputations based on a predictive model • Model fitted on the observed data, then used to impute the missing data

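  27. Imputation – methods • Model-based imputation • Special cases: • Mean imputation. Model: $y_i = \mu + \varepsilon_i$, with $\hat{\mu} = \bar{y}_{\mathrm{obs}}$. Imputed value: $\tilde{y}_i = \bar{y}_{\mathrm{obs}}$ • Ratio imputation. Model: $y_i = R x_i + \varepsilon_i$, with $\hat{R} = \bar{y}_{\mathrm{obs}} / \bar{x}_{\mathrm{obs}}$. Imputed value: $\tilde{y}_i = \hat{R} x_i$ • (Linear) regression imputation. Model: $y_i = \alpha + \beta x_i + \varepsilon_i$. Imputed value: $\tilde{y}_i = \hat{\alpha} + \hat{\beta} x_i$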
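
The three special cases can be sketched in a few lines each, assuming their usual textbook forms: the observed mean, the ratio of observed totals applied to an auxiliary variable x, and a least-squares line in x. The function names are hypothetical, `None` marks a missing value, and x is assumed to be fully observed.

```python
def mean_impute(y):
    """Replace missing y by the mean of the observed y."""
    obs = [v for v in y if v is not None]
    m = sum(obs) / len(obs)
    return [m if v is None else v for v in y]


def ratio_impute(x, y):
    """Replace missing y by R * x, with R estimated on item respondents."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    r = sum(yi for _, yi in pairs) / sum(xi for xi, _ in pairs)
    return [r * xi if yi is None else yi for xi, yi in zip(x, y)]


def regression_impute(x, y):
    """Replace missing y by a + b * x from a least-squares fit."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mx = sum(xi for xi, _ in pairs) / n
    my = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    b = sxy / sxx
    a = my - b * mx
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]
by_mean = mean_impute(y)
by_ratio = ratio_impute(x, y)
by_regression = regression_impute(x, y)
```

In each case the model is fitted on the item respondents only and then used to predict the missing values, without adding a disturbance term.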

  28. Imputation – methods • Model-based imputation • Choice of model depends on intended use of data • Estimating means and totals: mean or ratio imputation may be sufficient • General purpose micro-data: important to model relationships • Multivariate model-based imputation • Multivariate regression imputation (joint model for all variables) • Sequential regression / chained equations (separate model for each variable, conditional on the other variables)

  29. Imputation – methods • Donor imputation • Missing values imputed by ‘borrowing’ observed values from other (similar) units • Unit with observed value: donor • Unit with missing value: recipient • Hot deck: donor and recipient in the same data file

  30. Imputation – methods • Donor imputation • Special cases: • Random hot deck imputation: donor selected at random (within classes); auxiliary variables define the imputation classes • Nearest-neighbour imputation: donor selected with minimal distance to recipient; auxiliary variables define the distance • Predictive mean matching: special case of nearest-neighbour imputation; distance based on predicted values from a regression model
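
Random hot deck and nearest-neighbour imputation differ only in how the donor is chosen. The sketch below shows both for a single target variable; the record layout, function names and the absolute-difference distance on one auxiliary variable are assumptions of the example. Predictive mean matching would use the same nearest-neighbour step with regression predictions as the distance variable.

```python
import random


def random_hot_deck(records, target, class_var, rng):
    """Impute from a randomly chosen donor in the same imputation class."""
    out = [dict(r) for r in records]
    for rec in out:
        if rec[target] is None:
            donors = [d[target] for d in records
                      if d[class_var] == rec[class_var]
                      and d[target] is not None]
            if donors:
                rec[target] = rng.choice(donors)
    return out


def nearest_neighbour(records, target, aux):
    """Impute from the donor closest to the recipient on `aux`."""
    donors = [d for d in records if d[target] is not None]
    out = [dict(r) for r in records]
    for rec in out:
        if rec[target] is None and donors:
            donor = min(donors, key=lambda d: abs(d[aux] - rec[aux]))
            rec[target] = donor[target]
    return out


records = [
    {"size": "Small", "employees": 5, "turnover": 100.0},
    {"size": "Small", "employees": 8, "turnover": 160.0},
    {"size": "Large", "employees": 50, "turnover": 900.0},
    {"size": "Small", "employees": 7, "turnover": None},
]
hot_deck = random_hot_deck(records, "turnover", "size", random.Random(1))
nearest = nearest_neighbour(records, "turnover", "employees")
```

Because donor values are actually observed values, donor imputation tends to produce realistic imputations, at the cost of depending on how well the classes or the distance capture similarity.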

  31. Imputation – special topics • Choice of method/model/auxiliary variables • General problem in multivariate analysis • Auxiliary variables should explain • the target variable(s) • the missing data mechanism • Compare model fit among item respondents • Can be misleading (“imputation bias”) • Simulation experiments with historical data

  32. Imputation – special topics • Imputation for longitudinal data • Repeated cross-sectional surveys • Panel studies • Special imputation methods for longitudinal data • Last observation carried forward • Interpolation • Extrapolation • Little and Su method
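
Two of the simpler longitudinal methods, last observation carried forward and linear interpolation, can be sketched for one unit's time series as follows. The function names are hypothetical and `None` marks a missing period; the Little and Su method is not shown.

```python
def locf(series):
    """Last observation carried forward; leading gaps stay missing."""
    out, last = [], None
    for value in series:
        if value is not None:
            last = value
        out.append(last)
    return out


def interpolate(series):
    """Linear interpolation between neighbouring observed periods;
    gaps before the first or after the last observation stay missing."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out


series = [10.0, None, None, 16.0, None]
carried = locf(series)
interpolated = interpolate(series)
```

Interpolation needs an observation on both sides of the gap, so the final period here stays missing; filling it would require extrapolation or carrying the last value forward.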

  33. Imputation – special topics • Imputations are estimates • Imputed values should be flagged • Variance estimation with imputed data • Variance likely to be underestimated when… • …imputations are treated as observed variables • …model predictions are imputed without a disturbance term • …single imputation is used • Alternative approach: Multiple imputation • Not often used in official statistics (yet)
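
A small numerical illustration of the variance problem, using mean imputation without a disturbance term on made-up data: treating the imputed values as if they were observed shrinks the sample variance. The imputed values carry a flag, as recommended above.

```python
from statistics import mean, variance

observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 2

# Each completed item is (value, imputed_flag).
fill = mean(observed)
completed = [(v, False) for v in observed] + [(fill, True)] * n_missing

# Naive variance: imputed values treated as if they were observed.
naive_var = variance([v for v, _ in completed])
# Variance among the item respondents only.
respondent_var = variance(observed)
```

Here the naive variance is clearly smaller than the variance among respondents, because every imputed value sits exactly at the mean. The flags make it possible to tell imputed from observed values when a more suitable variance estimator is applied later.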

  34. Imputation – special topics • Imputed values may be invalid/inconsistent • Examples: • Turnover = –100 (invalid) • Labour costs = 0, Number of employees = 15 (inconsistent) • Need not be a problem for estimating aggregates • Can be a problem if micro-data are distributed further • Imputation under edit constraints • One-step method: constrained imputation model • Two-step method: imputation followed by data reconciliation
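
The second step of the two-step method can be illustrated with prorating, one simple form of data reconciliation: imputed components are clipped at zero and rescaled so that all components add up to an observed total, while observed components are left untouched. This is a sketch under those assumptions, not necessarily the method described in the handbook; names and data are hypothetical.

```python
def reconcile(components, imputed, total):
    """Adjust imputed components so that all components sum to `total`.
    Observed components are never changed."""
    values = {k: (max(v, 0.0) if k in imputed else v)
              for k, v in components.items()}
    fixed = sum(v for k, v in values.items() if k not in imputed)
    free = sum(values[k] for k in imputed)
    if free <= 0 or total < fixed:
        # Prorating cannot restore the balance edit in this case.
        return values
    factor = (total - fixed) / free
    return {k: (v * factor if k in imputed else v)
            for k, v in values.items()}


# Balance edit: labour + materials + other = total costs = 100.
costs = {"labour": 60.0, "materials": 30.0, "other": 20.0}
reconciled = reconcile(costs, imputed={"materials", "other"}, total=100.0)
```

After reconciliation the imputed components keep their relative sizes but the record satisfies the balance edit and the non-negativity edits.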

  35. Imputation – modules • Modules in the handbook: • Main theme module • Deductive imputation • Model-based imputation • Donor imputation • Imputation for longitudinal data • Little and Su method • Imputation under edit constraints

  36. Thank you for your attention!

  37. References • EDIMBUS (2007), Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys. • Fellegi, I.P. and D. Holt (1976), A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71, pp. 17–35. • Granquist, L. (1997), The New View on Editing. International Statistical Review 65, pp. 381–387.
