Integrated Data Editing and Imputation. Ton de Waal, Department of Methodology, Voorburg, Statistics Netherlands. ICES III conference, Montréal, June 19, 2007.
What is statistical data editing and imputation? • Observed data generally contain errors and missing values • Statistical Data Editing (SDE): • process of checking observed data, and, when necessary, correcting them • Imputation: • process of estimating missing data and filling these values into the data set
What is integrated SDE and imputation? • Integration of error localization and imputation • Integration of several edit and imputation techniques to optimize edit and imputation process • Integration of statistical data editing into rest of statistical process
SDE and the survey process • We will focus on identifying and correcting errors • Other goals of SDE are • identify error sources in order to provide feedback on entire survey process • provide information about the quality of incoming and outgoing data • Role of SDE is slowly shifting towards these goals • feedback on other survey phases can be used to improve those phases and reduce amount of errors arising in these phases
Edits • Edit rules, or edits for short, often used to determine whether record is consistent or not • Inconsistent records are considered to contain errors • Consistent records that are also not suspicious otherwise, e.g. are not outlying with respect to the bulk of the data, are considered error-free • Example of edits (T turnover, P profit, and C costs): • T = P + C (balance edit) • T ≥ 0
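The two example edits can be checked mechanically. The sketch below is only an illustration: the function name `violated_edits`, the edit labels, and the numerical tolerance are assumptions, not part of the presentation.

```python
def violated_edits(record, tol=1e-9):
    """Return the labels of the example edits that a record violates.

    `record` maps "T" (turnover), "P" (profit) and "C" (costs) to numbers.
    The labels and the tolerance `tol` are illustrative assumptions.
    """
    failed = []
    # Balance edit: T = P + C
    if abs(record["T"] - (record["P"] + record["C"])) > tol:
        failed.append("balance")
    # Non-negativity edit: T >= 0
    if record["T"] < 0:
        failed.append("nonnegative_turnover")
    return failed


consistent = {"T": 100.0, "P": 25.0, "C": 75.0}
inconsistent = {"T": 100.0, "P": 40.0, "C": 75.0}
print(violated_edits(consistent))    # no edit fails
print(violated_edits(inconsistent))  # the balance edit fails
```

A record with an empty list of violated edits is consistent; whether it is also error-free depends on the further checks for suspicious values mentioned above.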
SDE and imputation • Three related problems: • Error localization: determine which values are erroneous • Correction: correct missing and erroneous data in best possible way • Consistency: adjust values such that all edits become satisfied • Correction often done by means of imputation
SDE and imputation • Three related problems: • Error localization: determine which values are erroneous • Imputation: impute missing and erroneous data in best possible way • Consistency: adjust imputed values such that all edits become satisfied • Most SDE techniques focus on error localization
SDE in the “old” days • Use of computers in SDE started many years ago • In early years role of computers restricted to checking which edits were violated • Subject-matter specialists retrieved paper questionnaires that did not pass all edits and corrected them • After correction, data were again entered into computer, and again checked whether all edits were satisfied • Major problem: during manual correction process records were not checked for consistency
Modern SDE techniques • Interactive editing • Selective editing • Automatic editing • Macro-editing
Interactive editing • During interactive editing a modern survey processing system (e.g. BLAISE) is used • Such a system allows one to check and – if necessary – correct in a single step • Advantages: • number of variables, edits and records may be high • quality of interactively edited data is generally high • Disadvantages: • all records have to be edited: costly in terms of budget and time • not transparent
Selective editing • Umbrella term for several methods to identify the influential errors • Aim is to split data into two streams: • critical stream: records that are the most likely ones to contain influential errors • non-critical stream: records that are unlikely to contain influential errors • Records in critical stream are edited interactively • Records in non-critical stream are either not edited or are edited automatically
Selective editing • Many selective editing methods are based on common sense • Most often applied basic idea is to use a score function • Two important components • influence: measures relative influence of record on publication figure • risk: measures deviation of observed values from “anticipated” values (e.g. medians or values from previous years)
Selective editing • Local score for single variable within record • usually defined as distance between observed and anticipated values, taking influence of record into account • Example: W × |Y – Y*| • W raising weight, Y observed value, Y* anticipated value • influence component: W × Y* • risk component: |Y – Y*| / Y* • Local scores combined into global score for entire record by • sum of local scores • maximum of local scores • Records with global score above certain cut-off value edited interactively
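The score function can be sketched as follows. Note that the local score is the product of the two components: (W × Y*) × (|Y – Y*| / Y*) = W × |Y – Y*|. The record layout (keys `weight`, `observed`, `anticipated`) and the function names are hypothetical, chosen only for this illustration.

```python
def local_score(weight, observed, anticipated):
    """Local score W x |Y - Y*| for one variable in one record."""
    return weight * abs(observed - anticipated)


def global_score(record, how="sum"):
    """Combine the local scores of a record by their sum or their maximum."""
    scores = [
        local_score(record["weight"], record["observed"][v], record["anticipated"][v])
        for v in record["observed"]
    ]
    if how == "sum":
        return sum(scores)
    if how == "max":
        return max(scores)
    raise ValueError("how must be 'sum' or 'max'")


def split_streams(records, cutoff, how="sum"):
    """Split records into a critical and a non-critical stream."""
    critical, non_critical = [], []
    for r in records:
        (critical if global_score(r, how) > cutoff else non_critical).append(r)
    return critical, non_critical


r1 = {"weight": 10, "observed": {"T": 120, "C": 80}, "anticipated": {"T": 100, "C": 75}}
r2 = {"weight": 2, "observed": {"T": 105, "C": 75}, "anticipated": {"T": 100, "C": 75}}
critical, non_critical = split_streams([r1, r2], cutoff=100)
```

With these made-up numbers, `r1` has a global score of 250 (sum) and goes to the critical stream, while `r2` scores 10 and goes to the non-critical stream. The cut-off value is a tuning choice, not something the score function determines.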
Selective editing: (dis)advantages • Advantage: • selective editing improves efficiency in terms of budget and time • Disadvantage: • no good techniques for combining local scores into global score are available if there are many variables • Selective editing has gradually become a popular method to edit business data
Automatic editing • Two kinds of errors: systematic ones and random ones • Systematic error: error reported consistently among (some) responding units • gross values reported instead of net values • values reported in units instead of requested thousands of units (so-called “thousand-errors”) • Random error: error that is not systematic but caused by accident • observed value where respondent by mistake typed in a digit too many
Automatic editing of systematic errors • Can often be detected by • comparing respondent’s present values with those from previous years • comparing responses to questionnaire variables with values of register variables • using subject-matter knowledge • Once detected, systematic error is often simple to correct
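As an illustration of detecting a systematic error by comparison with the previous year, the sketch below flags a possible thousand-error when the ratio of the present value to last year's value is close to 1000. The function name and the bounds `low` and `high` are assumptions made for this example; in practice they would be chosen per survey.

```python
def correct_thousand_error(current, previous, low=300.0, high=3000.0):
    """Detect and correct a suspected thousand-error.

    Returns (value, corrected_flag). The ratio bounds are illustrative
    assumptions, not values from the presentation.
    """
    if previous <= 0:
        # No usable reference value: leave the observation untouched.
        return current, False
    ratio = current / previous
    if low <= ratio <= high:
        # Value was probably reported in units instead of thousands of units.
        return current / 1000.0, True
    return current, False


print(correct_thousand_error(2_500_000, 2400))  # ratio is about 1042: corrected
print(correct_thousand_error(2600, 2400))       # ratio is about 1.08: unchanged
```

The correction itself is a simple division by 1000, which matches the point that a systematic error is often simple to correct once it has been detected.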
Automatic editing of random errors • Three classes of methods: • methods based on statistical models (e.g. outlier detection techniques and neural networks) • methods based on deterministic checking rules • methods based on solving a mathematical optimization problem
Deterministic checking rules • State which values are considered erroneous when record violates edits • Example: if component variables do not sum up to total, total variable is considered to be erroneous • Advantages: • drastically improves efficiency in terms of budget and time • transparency and simplicity • Disadvantages: • many rules have to be specified, maintained and checked for validity • bias may be introduced as one aims to detect random errors in a systematic manner
Error localization as mathematical optimization problem • Guiding principle is needed • Freund and Hartley (1967): minimize sum of the distance between observed and “corrected” data and a measure for violation of edits • Casado Valera et al. (90’s): minimize quadratic function measuring distance between observed and “corrected” data such that “corrected” data satisfy all edits • Bankier (90’s): impute missing data and potentially erroneous values by means of donor imputation, and select imputed record that satisfies all edits and that is “closest” to original record
Fellegi-Holt paradigm (1976) • Data should be made to satisfy all edits by changing values of fewest possible number of variables • Generalization: data should be made to satisfy all edits by changing values of variables with smallest possible sum of reliability weights • reliability weight expresses how reliable one considers values of this variable to be • high reliability weight corresponds to variable of which values are considered trustworthy
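For a very small edit set the generalized paradigm can be illustrated by brute force. The sketch below is hard-wired to the two example edits T = P + C and T ≥ 0 and enumerates all subsets of variables; it is not the algorithm used in production systems, which rely on more efficient optimization techniques. All function names are hypothetical.

```python
from itertools import combinations

VARS = ("T", "P", "C")


def satisfies(r, tol=1e-9):
    """Example edits: T = P + C and T >= 0."""
    return abs(r["T"] - r["P"] - r["C"]) <= tol and r["T"] >= 0


def feasible(record, free):
    """Can all edits be satisfied by changing only the variables in `free`?"""
    free = set(free)
    if not free:
        return satisfies(record)
    if len(free) == 1:
        # The balance edit fixes the single free variable.
        r = dict(record)
        (v,) = free
        if v == "T":
            r["T"] = r["P"] + r["C"]
        elif v == "P":
            r["P"] = r["T"] - r["C"]
        else:
            r["C"] = r["T"] - r["P"]
        return satisfies(r)
    # With two or more free variables the balance edit can always be met;
    # only a fixed, negative T makes the record infeasible.
    return "T" in free or record["T"] >= 0


def localize_errors(record, weights):
    """Return (variables to change, total weight) with minimal weight sum.

    If several subsets share the minimal weight, the first one found is
    returned; the paradigm itself does not choose between them.
    """
    best, best_weight = None, float("inf")
    for k in range(len(VARS) + 1):
        for subset in combinations(VARS, k):
            w = sum(weights[v] for v in subset)
            if w < best_weight and feasible(record, subset):
                best, best_weight = set(subset), w
    return best, best_weight


record = {"T": 100, "P": 40, "C": 75}
print(localize_errors(record, {"T": 3, "P": 1, "C": 2}))
```

In this made-up record any single variable could be changed to restore the balance edit. Because P has the lowest reliability weight, P is the variable identified as erroneous.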
Fellegi-Holt paradigm: (dis)advantages • Advantages: • drastically improves efficiency in terms of budget and time • in comparison to deterministic checking rules fewer, and less detailed, rules have to be specified • Disadvantages: • class of errors that can safely be treated is limited to random errors • class of edits that can be handled is restricted to so-called hard (or logical) edits which hold true for all correctly observed records • risky to treat influential errors by means of automatic editing
Macro-editing • Macro-editing techniques often examine potential impact on survey estimates to identify suspicious data in individual records • Two forms of macro-editing • aggregation method • distribution method
Macro-editing: aggregation method • Verification whether figures to be published seem plausible • Compare quantities in publication tables with • same quantities in previous publications • quantities based on register data • related quantities from other sources
Macro-editing: distribution method • Available data used to characterize distribution of variables • Individual values compared with this distribution • Records containing values that are considered uncommon given the distribution are candidates for further inspection and possibly for editing
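One simple way to compare individual values with the distribution is a robust score based on the median and the median absolute deviation (MAD). This particular rule, the scaling constant 1.4826 (which makes the MAD comparable to a standard deviation under normality), and the cut-off of 3.5 are assumptions chosen for the sketch; the presentation does not prescribe a specific method.

```python
from statistics import median


def flag_uncommon(values, cutoff=3.5):
    """Flag values that lie far from the median, measured in scaled MADs."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate distribution: no meaningful scale, flag nothing.
        return [False] * len(values)
    return [abs(v - med) / (1.4826 * mad) > cutoff for v in values]


turnover = [10, 11, 9, 10, 12, 10, 95]
print(flag_uncommon(turnover))  # only the last value is flagged
```

Flagged records are candidates for further inspection only; an uncommon value may still be correct.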
Macro-editing: graphical techniques • Exploratory Data Analysis techniques can be applied • box plots • scatter plots • (outlier robust) fitting • Other often used techniques in software applications • anomaly plots: graphical overviews of important estimates, where unusual estimates are highlighted • time series analysis • outlier detection methods • Once suspicious data have been detected on a macro-level one can drill down to sub-populations and individual units
Macro-editing: (dis)advantages • Advantages: • directly related to publication figures or distribution • efficient in terms of budget and time • Disadvantages: • records that are considered non-suspicious may still contain influential errors • publication of unexpected (but true) changes in trend may be prevented • for data sets with many important variables graphical macro-editing is not the most suitable SDE method • most persons cannot interpret 10 scatter plots at the same time
Integrating SDE techniques • We advocate an SDE approach that consists of the following phases: • correction of “evident” systematic errors • application of selective editing to split records in critical stream and non-critical stream • editing of data: • records in critical stream edited interactively • records in non-critical stream edited automatically • validation of the publication figures by means of (graphical) macro-editing
Imputation • Expert guess • Deductive imputation • Multivariate regression imputation • Nearest neighbor hot-deck imputation • Ratio hot-deck imputation
Deductive imputation • Sometimes missing values can be determined unambiguously from edits • Examples: • single missing value involved in balance edit • for non-negative variables: if a total variable has zero value all missing subtotal (component) variables are zero
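Both examples can be written down as rules on a total and its components. In the sketch below missing values are represented by `None`; the function name and the `nonnegative` flag are hypothetical, and observed values are assumed to be consistent with the edits.

```python
def deductive_impute(total, components, nonnegative=True):
    """Fill in values that follow unambiguously from the balance edit.

    Returns (total, components); values that cannot be deduced stay None.
    """
    comps = list(components)
    missing = [i for i, c in enumerate(comps) if c is None]
    if total is None:
        # The total is the single missing value in the balance edit.
        if not missing:
            total = sum(comps)
    elif len(missing) == 1:
        # One missing component: total minus the observed components.
        observed = sum(c for c in comps if c is not None)
        comps[missing[0]] = total - observed
    elif missing and nonnegative and total == 0:
        # Non-negative components of a zero total must all be zero.
        for i in missing:
            comps[i] = 0
    return total, comps


print(deductive_impute(100, [25, None]))       # missing component deduced
print(deductive_impute(0, [None, None, None])) # all components set to zero
print(deductive_impute(100, [None, None]))     # nothing can be deduced
```

When nothing can be deduced, the record is left for one of the other imputation methods.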
Regression imputation • Regression model per variable to be imputed: $Y = A + BX + e$ • Imputations for missing data can be obtained from $Y = A_{\text{est}} + B_{\text{est}}X$ or from $Y = A_{\text{est}} + B_{\text{est}}X + e^*$, where $e^*$ is drawn from appropriate distribution
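A minimal sketch of both variants for a single predictor X, using ordinary least squares on the records where Y is observed. Drawing e* from the empirical residuals is one possible choice of "appropriate distribution" and is an assumption of this example, as are the function names.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares estimates (A_est, B_est) for Y = A + B X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def regression_impute(xs, ys, stochastic=False, rng=None):
    """Impute missing Y values (None) from X; optionally add a residual e*."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    residuals = [y - (a + b * x) for x, y in observed]
    rng = rng or random.Random(0)
    imputed = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x
            if stochastic:
                # e* drawn from the observed residuals (an assumption).
                y += rng.choice(residuals)
        imputed.append(y)
    return imputed


xs = [1, 2, 3, 4]
ys = [3, 5, None, 9]
print(regression_impute(xs, ys))
```

The deterministic variant reproduces the fitted line, which tends to understate the variability of the imputed variable; adding e* is the usual way to restore some of that variability.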
Regression imputation • Imputation can also be based on multivariate regression model that relates each missing value to all observed values: $Y_{\text{mis}} = \text{Mean}_{\text{mis}} + B(Y_{\text{obs}} - \text{Mean}_{\text{obs}}) + e$ • Estimates of model parameters can be obtained by using EM-algorithm • Imputations for missing data can be obtained from $Y_{\text{mis}} = \text{Mean}_{\text{est,mis}} + B_{\text{est}}(Y_{\text{obs}} - \text{Mean}_{\text{est,obs}})$ or from $Y_{\text{mis}} = \text{Mean}_{\text{est,mis}} + B_{\text{est}}(Y_{\text{obs}} - \text{Mean}_{\text{est,obs}}) + e^*$, where $e^*$ is drawn from appropriate distribution
Nearest neighbor hot deck imputation • For each receptor record with missing values on some (target) variables a donor record is selected that has • no missing values on auxiliary and target variables • smallest distance to receptor • Replace missing values by values from donor • Often used distance measure is minimax distance • $Z_{si}$: value of scaled auxiliary variable $i$ in record $s$ • distance between records $s$ and $t$: $D(s,t) = \max_i |Z_{si} - Z_{ti}|$
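The donor search with the minimax distance can be sketched as below. Records are represented as pairs of (scaled auxiliary values, target values); this layout and the function names are assumptions made for the illustration, and donors are assumed to be complete.

```python
def minimax_distance(z_s, z_t):
    """D(s, t) = max_i |Z_si - Z_ti| on scaled auxiliary variables."""
    return max(abs(a - b) for a, b in zip(z_s, z_t))


def nearest_donor(receptor_aux, donors):
    """Pick the complete donor record closest to the receptor."""
    return min(donors, key=lambda d: minimax_distance(receptor_aux, d[0]))


def hot_deck_impute(receptor_aux, receptor_targets, donors):
    """Replace missing target values (None) by the donor's values."""
    _, donor_targets = nearest_donor(receptor_aux, donors)
    return [d if r is None else r for r, d in zip(receptor_targets, donor_targets)]


donors = [
    ((0.1, 0.9), (10, 20)),
    ((0.5, 0.5), (30, 40)),
]
print(hot_deck_impute((0.4, 0.6), [None, 99], donors))
```

Here the second donor is closest to the receptor, so the missing first target value is taken from it, while the observed value 99 is kept.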
Ratio hot deck imputation • Modified version of nearest neighbor hot-deck for variables that are part of balance edit • Calculate difference between total variable and sum of observed components • this difference equals the sum of the missing components • Sum of missing components is distributed over missing components using ratios (of missing components to sum of missing components) from donor record • level of imputed components is determined by total variable but their ratios are determined by donor • imputed and observed components add up to total
Example of ratio hot deck • P + C = T • Record to be imputed given by • T = 400, P = ?, C = ? • Donor record • T = 100, P = 25, C = 75 • Imputed record • T = 400, P = 100, C = 300
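The example can be reproduced with a few lines of code. The function below is a sketch with a hypothetical name and argument layout; it assumes that the donor's values for the missing components do not sum to zero.

```python
def ratio_hot_deck(total, components, donor_components):
    """Distribute the unexplained part of the total over missing components.

    `components` holds the receptor's values (None where missing);
    `donor_components` holds the donor's values in the same order.
    """
    observed_sum = sum(c for c in components if c is not None)
    remainder = total - observed_sum  # equals the sum of the missing components
    missing = [i for i, c in enumerate(components) if c is None]
    donor_sum = sum(donor_components[i] for i in missing)
    return [
        remainder * donor_components[i] / donor_sum if c is None else c
        for i, c in enumerate(components)
    ]


# Receptor: T = 400, P and C missing. Donor: P = 25, C = 75.
print(ratio_hot_deck(400, [None, None], [25, 75]))  # [100.0, 300.0]
```

The imputed values keep the donor's 1:3 ratio between P and C while adding up to the receptor's own total of 400.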
Consistency • If imputed values violate edits, adjust them slightly • Observed values not adjusted • Minimize $\sum_i w_i |Y_{i,\text{final}} - Y_{i,\text{imp}}|$ subject to restriction that $Y_{i,\text{final}}$ in combination with observed values satisfy all edits • $Y_{i,\text{imp}}$: imputed values (possibly failing edits) • $Y_{i,\text{final}}$: final values • $w_i$: user-specified weights • As numerical edits are generally linear (in)equalities, resulting problem is a linear programming problem
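The absolute values in the objective can be removed with the standard device of splitting each adjustment into a positive and a negative part, which makes the linear programming form explicit. The auxiliary variables $u_i$, $v_i$ and the generic notation $a_k$, $b_k$, $c_k$, $d_k$ for the edits are introduced here for illustration only and do not appear in the presentation.

```latex
\begin{align*}
\text{minimize}\quad   & \sum_i w_i \,(u_i + v_i) \\
\text{subject to}\quad & Y_{i,\text{final}} = Y_{i,\text{imp}} + u_i - v_i
                         && \text{for every imputed variable } i, \\
                       & u_i \ge 0,\; v_i \ge 0
                         && \text{for every imputed variable } i, \\
                       & \sum_i a_{ki}\, Y_{i,\text{final}} + \sum_j c_{kj}\, Y_{j,\text{obs}} \le b_k
                         && \text{for every inequality edit } k, \\
                       & \sum_i a_{ki}\, Y_{i,\text{final}} + \sum_j c_{kj}\, Y_{j,\text{obs}} = d_k
                         && \text{for every equality (balance) edit } k.
\end{align*}
```

With positive weights, at the optimum at most one of $u_i$ and $v_i$ is non-zero, so $u_i + v_i = |Y_{i,\text{final}} - Y_{i,\text{imp}}|$ and the objective equals the weighted sum of absolute adjustments. The observed values $Y_{j,\text{obs}}$ enter only as constants, which reflects that they are not adjusted.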
Consistency • Prerequisite: • it should be possible to find values Yi,final such that all edits become satisfied • this is the case if Fellegi-Holt paradigm has been applied to identify errors • Instead of first imputing and then adjusting values, better (but more complicated) approach is to impute under restriction that edits become satisfied • see doctoral thesis by Caren Tempelman (Statistics Netherlands, www.cbs.nl)
Conclusion • All editing and imputation methods have their own (dis)advantages • Integrated use of editing techniques (selective editing, interactive editing, automatic editing, and macro-editing) as well as various imputation techniques can improve efficiency of SDE and imputation process while at same time maintaining or even enhancing statistical quality of produced data