This paper explores the common statistical functions used in the overall data editing process and proposes a decomposition of that process for improved process design and reusability. It includes a classification of functions by purpose, examples of process flows, and concluding remarks.
On the general flow of editing
Jeroen Pannekoek and Li-Chun Zhang
Work Session on Statistical Data Editing
Oslo, Norway, 24-26 September 2012
Introduction
• An overall data editing process involves all activities needed to transform raw micro-data, with errors and missing values, into edited statistical micro-data that are suitable for the production of publication figures. In GSBPM terms: review, validate and edit; impute; output control.
• For the implementation of an editing and imputation (E&I) system we need more detailed descriptions, called statistical functions, each of which performs some action on the data.
• This paper aims to identify common statistical functions that are used as building blocks in different overall E&I processes or strategies.
• This decomposition of the overall process can facilitate process design, re-use of methodological components, documentation, and generic software tools.
Contents
• Some classifications of data editing functions that are relevant for process design.
• A summary of statistical data editing functions in some detail.
• Some process flow examples, using the statistical functions as building blocks, from the Netherlands and Norway.
• Concluding remarks.
Classification of functions by purpose
• Verification: checking of hard and soft edit rules, calculation of scores, detection of systematic errors (a minimal sketch follows below). Input: rules and data → Output: quality indicators and measures. Less formal: graphical macro-editing, output control.
• Selection (for further processing): selection of units for manual editing; selection of variables to change (error localisation). Input: quality indicators and data → Output: selection of records or fields.
• Amending: modifying selected data values to resolve problems detected by verification, including imputation of missing values.
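As an illustration, here is a minimal Python sketch of a verification function: hard edits expressed as predicates over a record, returning one quality indicator (a pass/fail flag) per rule. The rule names and the example record are hypothetical, chosen to match the balance and non-negativity edits discussed later.

```python
def verify(record, edit_rules):
    """Return one pass/fail quality indicator per edit rule (unit-mode)."""
    return {name: rule(record) for name, rule in edit_rules.items()}

edit_rules = {
    "balance_profit":  lambda r: r["profit"] == r["turnover"] - r["total_costs"],
    "nonneg_turnover": lambda r: r["turnover"] >= 0,
}

record = {"turnover": 100, "total_costs": 80, "profit": 15}
print(verify(record, edit_rules))
# {'balance_profit': False, 'nonneg_turnover': True}  -> the balance edit fails
```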
Unit-mode versus batch-mode operation
Since manual editing is time-consuming, it should start during the sometimes lengthy data collection period. The same must then hold for any automatic editing function that is applied before manual editing.
• Unit-mode functions proceed on a record-by-record basis and can be applied during the data collection phase.
• Batch-mode functions use all of the data (or a large subset) and can only be applied near the end of the data collection phase.
The contrast is sketched below.
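A minimal sketch of the contrast, assuming a hypothetical turnover check in both modes; the 50% deviation criterion is an arbitrary illustration, not from the paper.

```python
def unit_mode_check(record, anticipated):
    """Needs only one record plus external/historical reference data,
    so it can run while data collection is still ongoing."""
    return abs(record["turnover"] - anticipated) > 0.5 * anticipated

def batch_mode_check(records):
    """Needs (nearly) all records, because the reference value is the
    median of the current data; it runs near the end of collection."""
    values = sorted(r["turnover"] for r in records)
    median = values[len(values) // 2]
    return [abs(r["turnover"] - median) > 0.5 * median for r in records]

print(unit_mode_check({"turnover": 900}, anticipated=500))          # True
print(batch_mode_check([{"turnover": v} for v in (480, 500, 900)])) # [False, False, True]
```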
Editing functions: verification (1/2)
• Edit rules (unit-mode): systems of connected balance edits, e.g. profit = turnover - total costs, total costs = costs of employees + costs of purchases + …; non-negativity edits and inequalities; ratio edits (soft).
• Score functions (sketched below) measure the potential effect that editing a unit may have on estimates of totals or other aggregate parameters of interest. They are based on measures of the deviation between observed values and predicted or "anticipated" values: s_i = f(x_j, x_j^a). Unit-mode: x_j^a is based on historical data or another external source. Batch-mode: x_j^a is based on the current data.
• Score functions are also applied to measure and check the actual effect of (automatic) editing, rather than its potential effect; x_j^a is then the edited value.
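A minimal sketch of a score function s_i = f(x_j, x_j^a), assuming f is the design-weighted absolute deviation between observed and anticipated values, scaled by an estimated total so that it approximates the unit's potential impact on the publication figure. The paper leaves f unspecified, so this particular form is only illustrative.

```python
def score(x_obs, x_ant, weight, total_estimate):
    """Potential impact of editing one unit on the estimated total."""
    return weight * abs(x_obs - x_ant) / total_estimate

# Unit-mode: x_ant comes from historical data.
# Batch-mode: x_ant is derived from the current data.
print(score(x_obs=520_000, x_ant=480_000, weight=3.0, total_estimate=2_000_000))
# 0.06 -> compare against a threshold, or rank units on this value
```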
Editing functions: verification (2/2)
• Extended score functions: score functions can be extended with indicator codes for further processing, based on simple criteria other than the regular score. For instance:
>0: regular score value
-9: "crucial" (dominates the totals in its branch) → manual editing
-8: influential and main variables are missing → re-contact
-7: non-influential and main variables missing → unit nonresponse
• Macro-verification: macro-verification functions are batch-mode by definition. They include all macro-editing activities: verifying aggregates, graphical inspection of distributions, graphical or model-based outlier detection, etc.
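A sketch of an extended score function using the special codes above; treating "influential" as "regular score above a threshold", and the unit flags themselves, are assumptions made for illustration.

```python
def extended_score(unit, regular_score, threshold):
    """Return the regular score, or a special routing code."""
    if unit["crucial"]:                                    # dominates its branch
        return -9   # -> manual editing
    if regular_score > threshold and unit["main_missing"]:
        return -8   # influential, main variables missing -> re-contact
    if unit["main_missing"]:
        return -7   # non-influential, main variables missing -> unit nonresponse
    return regular_score                                   # > 0: regular score

print(extended_score({"crucial": False, "main_missing": True}, 4.2, 3.0))  # -8
```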
Editing functions: selection
• Selection of units for manual editing using regular scores: by comparing scores to a predetermined threshold value (unit-mode), or by ordering units on their scores and selecting the highest ranking (batch-mode).
• Selection of variables for amendment: error localisation (unit-mode). To resolve edit failures, some values need to be changed; the error localisation problem is selecting which variables to change. A generic automatic approach (Fellegi-Holt, sketched below): change the smallest (weighted) number of variables.
• Macro-selection (batch-mode) of units for manual editing: implausible aggregates lead, by drilling down, to suspect units; graphical verification leads to selection of the most extraordinary units.
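A simplified Python sketch of Fellegi-Holt error localisation: it returns a minimum-weight set of variables that covers all violated edits. The full method also generates implied edits to guarantee that a consistent record exists after the change; that step is omitted in this toy version.

```python
from itertools import combinations

def localise(record, edits, weights):
    """edits: {name: (check(record) -> bool, names of involved variables)}.
    Returns the cheapest variable set touching every violated edit."""
    violated = [vars_ for check, vars_ in edits.values() if not check(record)]
    variables = sorted(weights)
    for k in range(len(variables) + 1):                 # smallest sets first
        candidates = [s for s in combinations(variables, k)
                      if all(set(s) & set(v) for v in violated)]
        if candidates:
            return min(candidates, key=lambda s: sum(weights[x] for x in s))

edits = {
    "balance": (lambda r: r["profit"] == r["turnover"] - r["costs"],
                ("profit", "turnover", "costs")),
    "nonneg":  (lambda r: r["costs"] >= 0, ("costs",)),
}
weights = {"profit": 1.0, "turnover": 2.0, "costs": 2.0}
print(localise({"turnover": 100, "costs": 80, "profit": 15}, edits, weights))
# ('profit',) -> the cheapest single change that can restore consistency
```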
Editing functions: amendment
• Amendment of systematic errors (unit-mode): errors with a detectable cause and a reliable correction mechanism. Generic: thousand-errors (factor-1000 unit-of-measure errors), recognizable typos, rounding errors. Subject-related: specific "if-then" type correction rules.
• Deductive imputation of missing values (unit-mode): some missing values are uniquely determined by the hard edit rules, which gives the only feasible imputation (sketched below).
• Model-based imputation (batch- or unit-mode): for most missing values we need model-based predicted values to impute; batch-mode if the current data are used to estimate the model parameters.
• Adjustment for inconsistency (unit-mode): adjustment of imputed values to ensure consistency with the edit rules.
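A minimal sketch of deductive imputation from a single balance edit: when exactly one term of the edit is missing, the edit determines its value uniquely. The variable names are assumed for illustration.

```python
def deductive_impute(record, total, parts):
    """If total = sum(parts) has exactly one missing part, deduce it."""
    missing = [p for p in parts if record[p] is None]
    if record[total] is not None and len(missing) == 1:
        record[missing[0]] = record[total] - sum(
            record[p] for p in parts if record[p] is not None)
    return record

r = {"total_costs": 100, "cost_employees": 60, "cost_purchases": None}
print(deductive_impute(r, "total_costs", ["cost_employees", "cost_purchases"]))
# cost_purchases is deduced as 40, the only value consistent with the edit
```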
Illustration of automatic editing
Data from child day care institutions: 500 records with 68 SBS-type variables and 40 hard edit rules.
Process flow. Scenario A: Selective editing
Input: raw micro data.
1. Primary automated processing: 1a. treatment of systematic errors; 1b. evaluation of scores.
2. Micro-selection: 2a. selection of units using scores; 2b. (FH-)selection of fields.
3. Clerical interactive editing of the units selected in step 2 (yes-branch).
4. Automatic amendment of the uncritical units (no-branch): 4a. imputation of missings; 4b. adjustments.
5. Macro-verification and selection (macro-selection): suspect units (yes-branch) return to step 3; otherwise (no-branch) the result is the edited micro data.
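Scenario A can be read as a small orchestration of the statistical functions above. The following Python sketch makes the control flow explicit; every component is a trivial stand-in, since the paper fixes the flow, not the implementations.

```python
def correct_systematic(r): return dict(r)      # 1a. systematic errors (stub)
def score_of(r): return r.get("score", 0.0)    # 1b. evaluation of scores (stub)
def impute(r): return r                        # 4a. imputation of missings (stub)
def adjust(r): return r                        # 4b. adjustments (stub)
def clerical_edit(units): return units         # 3. interactive editing (manual step)
def macro_select(units): return []             # 5. suspect units (none in this stub)

def scenario_a(raw, threshold=3.0):
    edited = []
    for r in raw:                               # 1. primary automated processing
        r = correct_systematic(r)
        if score_of(r) > threshold:             # 2. micro-selection on the score
            edited += clerical_edit([r])        # 3. critical units go to clerks
        else:
            edited.append(adjust(impute(r)))    # 4. automatic amendment
    while True:                                 # 5. macro-verification and selection
        suspects = macro_select(edited)
        if not suspects:
            return edited                       # -> edited micro data
        edited = [u for u in edited if u not in suspects] + clerical_edit(suspects)

print(len(scenario_a([{"score": 1.0}, {"score": 5.0}])))  # 2
```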
Process flow. Scenario B: More automatic editing
Input: raw micro data.
1. Primary automated processing, all of it unit-mode automatic editing: 1a. treatment of systematic errors; 1b. (FH-)selection of fields; 1c. imputation; 1d. adjustments; 1e. evaluation of scores.
2. Micro-selection using the scores.
3. Clerical interactive editing of the units selected in step 2 (yes-branch).
4. Automatic amendment of the uncritical units (no-branch): 4a. batch-mode imputation; 4b. adjustments.
5. Macro-verification and selection (macro-selection): suspect units (yes-branch) return to step 3; otherwise (no-branch) the result is the edited micro data.
Process flow. Scenario C: No timeliness problems
Input: raw micro data.
1. Primary automated processing: treatment of systematic errors.
2. Macro-verification and selection (macro-selection), including batch-mode scores.
3. (Partial) clerical interactive editing of the suspect units (yes-branch), returning to step 2.
4. Automatic amendment once no suspect units remain (no-branch): 4a. imputation of missings; 4b. adjustments. The result is the edited micro data.
Concluding remarks
• The description of the overall process given here can be helpful in the communication between editing staff, project managers, process designers and methodologists. It clarifies the organization of the process and the choices that must be made.
• It also helps to define the functionalities and interfaces of generic software components, by placing them in the context of the overall process scheme.
• Increasing the use of automatic editing can greatly reduce the amount of manual editing. This may involve automatic editing of influential units and subject-specific "if-then" rules.