Statistical Data Editing and Imputation
Presented by • Sander Scholtus • Statistics Netherlands
Introduction • Data arrive at a statistical institute... • …containing errors and implausible values • …containing missing values • To produce statistical output of sufficient quality, these data problems have to be treated • Statistical data editing deals with errors • Imputation deals with missing values
Statistical data editing • Overview • Goals • Edit rules • Different editing methods and how to combine them • Modules in the handbook
Statistical data editing – goals • Traditional goal of editing: • Detect and correct all errors in the collected data • Problems: • Very labour-intensive • Very time-consuming • Highly inefficient: measurement error is not the only source of error in statistical output
Statistical data editing – goals • Modern goals of editing: • To identify possible sources of errors so that the statistical process may be improved in the future. • To provide information about the quality of the data collected and published. • To detect and correct influential errors in the collected data. • If necessary, to provide complete and consistent micro-data. sources: Granquist (1997), EDIMBUS (2007)
Statistical data editing – edit rules • Edit rules (edits, edit checks, checking rules) • Used to detect errors • Can be either hard or soft • General form: IF (unit ∈ edit group) THEN (test variable ∈ acceptance region)
Statistical data editing – edit rules • Examples of edit rules: • Turnover ≥ 0 (non-negativity edit, hard) • Profit = Turnover – Total costs (balance edit, hard) • IF (Size class = “Small”) THEN (0 ≤ Number of employees < 10) (conditional edit, soft) • IF (Economic activity = “Construction”) THEN (a < Turnover / Number of employees < b) (ratio edit, soft)
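The four example edits above can be evaluated mechanically on a record. A minimal sketch in Python; the field names, rule labels, and the example record are illustrative, not taken from the handbook:

```python
# A minimal sketch of evaluating edit rules on a single record.
# Field names and rule definitions are illustrative assumptions.

def check_edits(record):
    """Return the names of the edit rules that the record violates."""
    violations = []
    if not record["turnover"] >= 0:                       # non-negativity edit (hard)
        violations.append("turnover >= 0")
    if record["profit"] != record["turnover"] - record["total_costs"]:
        violations.append("profit = turnover - total_costs")  # balance edit (hard)
    if record["size_class"] == "Small" and not (0 <= record["employees"] < 10):
        violations.append("small units have 0-9 employees")   # conditional edit (soft)
    return violations

record = {"turnover": 120_000, "total_costs": 100_000, "profit": 15_000,
          "size_class": "Small", "employees": 4}
print(check_edits(record))   # the balance edit fails: 120000 - 100000 != 15000
```

Hard edits like these must hold exactly; a soft edit failure would only flag the record as suspicious rather than certainly wrong.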
Statistical data editing – methods • [Flow diagram] raw microdata → deductive editing → selective editing → selected records: manual editing; not selected records: automatic editing → macro-editing → statistical microdata
Statistical data editing – methods • Deductive editing • Directed at systematic errors • Deterministic detection and amendment • if-then rules • algorithms • Examples: • unit of measurement errors (e.g. “4,000,000” instead of “4,000”) • sign errors (e.g. “–10” instead of “10”) • simple typing errors (e.g. “192” instead of “129”) • subject-matter specific errors
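Because systematic errors have a known mechanism, they can be detected and amended deterministically. A sketch of two such rules, assuming a reliable reference value is available; the thresholds and variable names are invented for illustration:

```python
# Illustrative deductive rules for two common systematic errors;
# thresholds and variable names are assumptions, not from the slides.

def correct_unit_of_measurement(reported, reference):
    """If the reported value is roughly 1000x a reliable reference value,
    assume a unit-of-measurement error and divide by 1000."""
    if reference > 0 and 900 < reported / reference < 1100:
        return reported / 1000
    return reported

def correct_sign_error(turnover, costs, profit):
    """If the balance edit fails but holds after flipping the sign of
    profit, assume a sign error and flip it."""
    if profit != turnover - costs and -profit == turnover - costs:
        return -profit
    return profit

print(correct_unit_of_measurement(4_000_000, 4_100))  # -> 4000.0
print(correct_sign_error(100, 110, 10))               # -> -10
```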
Statistical data editing – methods • Selective editing • Prioritise records according to the expected benefit of their manual amendment on target estimates • Records can be selected as they arrive (input editing) • Common approach based on score functions • Local scores for key target variables, e.g., the (weighted) deviation of the reported value from an anticipated value, relative to the estimated total • Use global score to summarise local scores (e.g., sum or maximum)
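A minimal sketch of one common score function: a local score measuring the influence of a suspicious value on the target estimate, and a global score taken as the maximum. The specific formula and all figures are illustrative assumptions:

```python
# A sketch of selective-editing score functions; the weighted-deviation
# formula and the figures are illustrative, not from the slides.

def local_score(reported, anticipated, design_weight, estimated_total):
    """Influence of a suspicious value on the target estimate."""
    return design_weight * abs(reported - anticipated) / estimated_total

def global_score(local_scores):
    """Summarise local scores; the maximum is one common choice."""
    return max(local_scores)

scores = [local_score(5000, 1200, 10, 100_000),   # suspicious record
          local_score(1250, 1200, 10, 100_000)]   # close to anticipated
print([round(s, 3) for s in scores])  # -> [0.38, 0.005]
print(global_score(scores))           # high score: prioritise for manual editing
```

Records whose global score exceeds a cut-off are routed to manual editing; the rest go to automatic editing.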
Statistical data editing – methods • Manual editing • Requires: • Human editors (subject-matter specialists) • Dedicated software (interactive editing) • Edit rules (hard and soft) • Editing instructions • Re-contacts with businesses are sometimes used • Important as a source for improvements in future rounds of a repeated survey
Statistical data editing – methods • Automatic editing • Obtain consistent micro-data for non-influential records • Paradigm of Fellegi and Holt (1976): Data should be made consistent with the edit rules by changing as few (weighted) items as possible. • Leads to error localisation as a mathematical optimisation problem • Imputation of new values as a separate step • Requires: • (Hard) edit rules • Dedicated software (e.g.: Banff by Statistics Canada; SLICE by Statistics Netherlands; R package editrules)
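The optimisation problem can be illustrated by brute force: find a minimum-weight set of fields to free so that every edit that can still be evaluated (none of its fields freed) is satisfied. This is only a sketch under simplifying assumptions; the edits, weights, and field names are invented, and production systems such as Banff, SLICE, or editrules use far more efficient algorithms:

```python
from itertools import combinations

# Brute-force sketch of Fellegi-Holt error localisation. Assumes any
# freed field can later be imputed consistently, which holds for these
# simple non-negativity and balance edits.

def error_localisation(record, edits, weights):
    fields = list(record)
    feasible = [s for r in range(len(fields) + 1)
                for s in combinations(fields, r)
                if all(check(record) for flds, check in edits
                       if not set(flds) & set(s))]
    return min(feasible, key=lambda s: sum(weights[f] for f in s))

edits = [
    (("turnover",), lambda r: r["turnover"] >= 0),
    (("profit", "turnover", "costs"),
     lambda r: r["profit"] == r["turnover"] - r["costs"]),
]
record = {"turnover": 100, "costs": 60, "profit": 50}   # balance edit fails
weights = {"turnover": 3, "costs": 2, "profit": 1}      # reliability weights
print(error_localisation(record, edits, weights))       # -> ('profit',)
```

With these weights, changing profit alone is the cheapest way to restore consistency; the new value for profit is then filled in by the separate imputation step.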
Statistical data editing – methods • Macro-editing • Also known as output editing • Same purpose as selective editing • Uses data from all available records at once • Aggregate method: • Compute high-level aggregates • Check their plausibility • Drill down to suspicious lower-level aggregates • Eventually: Drill down to suspicious individual records • Feedback to manual editing • Graphical aids (scatter plots, etc.) to find outliers
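The aggregate method above can be sketched as follows: compare each industry total with the previous period, flag implausible movements, and drill down to the record contributing most. The data, industry labels, and the 20% threshold are invented for illustration:

```python
# Sketch of the aggregate method of macro-editing; all figures,
# industries and the 20% threshold are illustrative assumptions.

def macro_edit(current, previous_totals, threshold=0.20):
    """current: list of (industry, unit id, turnover) records."""
    totals = {}
    for industry, _, value in current:
        totals[industry] = totals.get(industry, 0) + value
    flagged = []
    for industry, total in totals.items():
        change = (total - previous_totals[industry]) / previous_totals[industry]
        if abs(change) > threshold:                 # implausible movement
            top = max((r for r in current if r[0] == industry),
                      key=lambda r: r[2])           # drill down to records
            flagged.append((industry, top[1]))      # feed back to manual editing
    return flagged

current = [("construction", "unit_a", 120), ("construction", "unit_b", 950),
           ("retail", "unit_c", 210), ("retail", "unit_d", 190)]
previous_totals = {"construction": 400, "retail": 410}
print(macro_edit(current, previous_totals))  # -> [('construction', 'unit_b')]
```

In practice the drill-down would pass through intermediate aggregates and be supported by graphical tools rather than a single `max`.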
Statistical data editing – modules • Modules in the handbook: • Main theme module • Deductive editing • Selective editing • Automatic editing • Manual editing • Macro-editing • Editing administrative data • Editing for longitudinal data
Imputation • Overview • Missing data • Imputation methods • Special topics • Modules in the handbook
Imputation – missing data • Missing data may occur because of • Logical reasons • A particular question does not apply to a particular unit • Unit non-response • No data observed at all for a particular unit • Item non-response • Unit is not able to answer a particular question • Unit is not willing to answer a particular question • Editing • Originally observed value discarded during automatic editing
Imputation – missing data • Imputation: filling in new (estimated) values for data items that are missing • Commonly used for missing data due to item non-response and editing • Obtain a completed micro-data file prior to estimation • Simplifies the estimation step • Prevents inconsistencies in the output
Imputation – methods • Deductive imputation • Model-based imputation • Donor imputation • Assumption: All observed values are correct • Imputation applied after error localisation
Imputation – methods • Deductive imputation • Derive (rather than estimate) missing values from observed values based on • logical relations (edit rules) • substantive imputation rules • Can be very useful as a first imputation step
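For example, the balance edit Profit = Turnover – Total costs determines any one of the three values once the other two are observed. A sketch with illustrative field names (`None` marks a missing item):

```python
# Deductive imputation from the balance edit
# profit = turnover - total_costs; field names are illustrative.

def deductive_impute(record):
    t, c, p = record["turnover"], record["total_costs"], record["profit"]
    if p is None and t is not None and c is not None:
        record["profit"] = t - c          # value follows logically, no model needed
    elif t is None and p is not None and c is not None:
        record["turnover"] = p + c
    elif c is None and t is not None and p is not None:
        record["total_costs"] = t - p
    return record

print(deductive_impute({"turnover": 500, "total_costs": 420, "profit": None}))
# -> {'turnover': 500, 'total_costs': 420, 'profit': 80}
```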
Imputation – methods • Model-based imputation • Imputations based on a predictive model • Model fitted on the observed data, then used to impute the missing data
Imputation – methods • Model-based imputation • Special cases: • Mean imputation — Model: y_i = μ + ε_i, with E(ε_i) = 0. Imputed value: ŷ_i = ȳ_obs (the mean of the observed values) • Ratio imputation — Model: y_i = βx_i + ε_i, with E(ε_i) = 0. Imputed value: ŷ_i = R̂x_i, with R̂ = Σ_obs y_i / Σ_obs x_i • (Linear) regression imputation — Model: y_i = β_0 + β_1x_1i + … + β_px_pi + ε_i. Imputed value: ŷ_i = β̂_0 + β̂_1x_1i + … + β̂_px_pi
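The three special cases can be sketched in a few lines of pure Python (single-predictor regression only, fitted by ordinary least squares); all data are invented for illustration:

```python
# Sketches of mean, ratio and (single-predictor) regression imputation;
# the data are illustrative assumptions.

def mean_impute(y_obs):
    return sum(y_obs) / len(y_obs)              # y_hat = mean of observed values

def ratio_impute(x_obs, y_obs, x_new):
    return sum(y_obs) / sum(x_obs) * x_new      # y_hat = R_hat * x_new

def regression_impute(x_obs, y_obs, x_new):
    n = len(x_obs)
    xm, ym = sum(x_obs) / n, sum(y_obs) / n
    b1 = (sum((x - xm) * (y - ym) for x, y in zip(x_obs, y_obs))
          / sum((x - xm) ** 2 for x in x_obs))  # least-squares slope
    b0 = ym - b1 * xm                           # least-squares intercept
    return b0 + b1 * x_new

print(mean_impute([10, 20, 30]))                   # -> 20.0
print(ratio_impute([2, 5], [4, 10], 3))            # -> 6.0
print(regression_impute([1, 2, 3], [3, 5, 7], 4))  # -> 9.0
```

Mean imputation ignores the auxiliary variable entirely; ratio and regression imputation exploit it, which is why they tend to preserve relationships in the data better.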
Imputation – methods • Model-based imputation • Choice of model depends on intended use of data • Estimating means and totals: mean or ratio imputation may be sufficient • General purpose micro-data: important to model relationships • Multivariate model-based imputation • Multivariate regression imputation (joint model for all variables) • Sequential regression / chained equations (separate model for each variable, conditional on the other variables)
Imputation – methods • Donor imputation • Missing values imputed by ‘borrowing’ observed values from other (similar) units • Unit with observed value: donor • Unit with missing value: recipient • Hot deck: donor and recipient in the same data file
Imputation – methods • Donor imputation • Special cases: • Random hot deck imputation — Donor selected at random (within classes). Use auxiliary variables to define imputation classes • Nearest-neighbour imputation — Donor selected with minimal distance to recipient. Use auxiliary variables to define the distance • Predictive mean matching — Special case of nearest-neighbour imputation. Distance based on predicted values from a regression model
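Nearest-neighbour imputation reduces to a minimisation over candidate donors. A sketch with one auxiliary variable and an absolute-difference distance; the data are invented for illustration:

```python
# Nearest-neighbour hot-deck sketch: the donor is the responding unit
# whose auxiliary value is closest to the recipient's. Data and the
# distance function are illustrative assumptions.

def nn_hot_deck(recipient_x, donors):
    """donors: list of (auxiliary value, observed target value)."""
    donor = min(donors, key=lambda d: abs(d[0] - recipient_x))
    return donor[1]                     # impute the donor's observed value

donors = [(10, 200), (25, 480), (40, 790)]
print(nn_hot_deck(27, donors))          # -> 480
```

Predictive mean matching would use the same mechanism, but with the distance computed between regression predictions instead of raw auxiliary values.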
Imputation – special topics • Choice of method/model/auxiliary variables • General problem in multivariate analysis • Auxiliary variables should explain • the target variable(s) • the missing data mechanism • Compare model fit among item respondents • Can be misleading (“imputation bias”) • Simulation experiments with historical data
Imputation – special topics • Imputation for longitudinal data • Repeated cross-sectional surveys • Panel studies • Special imputation methods for longitudinal data • Last observation carried forward • Interpolation • Extrapolation • Little and Su method
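Two of the methods above can be sketched on one unit's series of quarterly values, with `None` marking a missing quarter; the data are invented, and the interpolation handles only single interior gaps for brevity:

```python
# Sketches of two longitudinal imputation methods; the series is an
# illustrative assumption.

def locf(series):
    """Last observation carried forward."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out

def interpolate(series):
    """Linear interpolation of interior gaps (single-gap case only)."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None and 0 < i < len(out) - 1:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out

print(locf([100, None, 120, None]))        # -> [100, 100, 120, 120]
print(interpolate([100, None, 120, 130]))  # -> [100, 110.0, 120, 130]
```

Extrapolation and the Little and Su method additionally exploit trend and seasonal structure across units, which a sketch this small cannot show.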
Imputation – special topics • Imputations are estimates • Imputed values should be flagged • Variance estimation with imputed data • Variance likely to be underestimated when… • …imputations are treated as observed variables • …model predictions are imputed without a disturbance term • …single imputation is used • Alternative approach: Multiple imputation • Not often used in official statistics (yet)
Imputation – special topics • Imputed values may be invalid/inconsistent • Examples: • Turnover = –100 (invalid) • Labour costs = 0, Number of employees = 15 (inconsistent) • Need not be a problem for estimating aggregates • Can be a problem if micro-data are distributed further • Imputation under edit constraints • One-step method: constrained imputation model • Two-step method: imputation followed by data reconciliation
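The two-step method can be illustrated with prorating, one simple reconciliation rule: after imputation, scale the imputed components so that they satisfy a balance edit against an observed total. The figures are invented:

```python
# Sketch of the two-step approach: impute first, then reconcile the
# imputed components against an observed total. Prorating is one simple
# reconciliation rule; the figures are illustrative assumptions.

def prorate(imputed_components, observed_total):
    factor = observed_total / sum(imputed_components)
    return [round(c * factor, 2) for c in imputed_components]

# imputed cost components add up to 80, but total costs were observed as 100
print(prorate([30, 50], 100))   # -> [37.5, 62.5]
```

A one-step method would instead build the constraint into the imputation model, so no separate reconciliation is needed.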
Imputation – modules • Modules in the handbook: • Main theme module • Deductive imputation • Model-based imputation • Donor imputation • Imputation for longitudinal data • Little and Su method • Imputation under edit constraints
References • EDIMBUS (2007), Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys. • Fellegi, I.P. and D. Holt (1976), A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71, pp. 17–35. • Granquist, L. (1997), The New View on Editing. International Statistical Review 65, pp. 381–387.