Data Editing. United Nations Statistics Division (UNSD), 16 March 2011, Santiago, Chile. Editing and Imputation Defined. Data editing: identification and flagging of missing, invalid, inconsistent or anomalous entries. Imputation: resolves problems identified in editing.
Data Editing United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile
Editing and Imputation Defined
• Data editing: identification and flagging of missing, invalid, inconsistent or anomalous entries
• Imputation: resolves problems identified in editing
Editing and Imputation Process Flow [figure: process flow diagram with stages numbered 1, 2 and 3]
A General Editing and Imputation Process
• Identify and treat initial errors
  • At the data capture stage
  • At the data entry stage
  • Ex: Data entered into a table is shifted by a row
• Identify and treat errors
  • a: Interactively/manually treat influential errors
  • b: Automatically treat non-influential errors
• Check the aggregated output
Editing Errors
• Two categories of errors
  • Systematic – reported consistently by some of the respondents
    • Ex: Gross values are reported instead of net values
    • Ex: Units are reported in thousands
  • Random – non-systematic or caused by accident
    • Ex: An extra digit is accidentally typed in the response
• Manifestations of errors can be systematic or random
  • Missing
    • Ex: A variable is left blank because the respondent does not know the answer to the question, does not want to answer the question or does not understand the question
  • Outliers – values that deviate from a model
    • Ex: Unanticipated large values as compared to historic trend
  • Violation of logical or consistency rules
    • Ex: A total value is larger than the sum of its components
• Edit rules are used to detect errors and often define how they should be treated
Systematic Errors
• Errors that are reported consistently over time
  • Unit error
    • Ex: x_{t-1} / x_t <= 300
  • Sign error
  • Bugs in the collection vehicle
  • Misunderstanding a question or skip rules
    • Ex: systematic missing values
• Detection
  • High failure rates of edits
  • Outlier detection (e.g. for unit errors)
  • Knowledge of the survey and the raw data processing
Systematic Errors (2)
• Suggestions
  • Improvements in the survey or processing procedures should be made
  • When systematic errors are identified, they should be turned into edit rules
  • Detecting and correcting systematic errors is cost-effective
  • Systematic errors should be treated before random errors
Missing Values
• Stem from questions a respondent did not answer
• Detection is usually simple
• Suggestions
  • Do not ignore missing values (→ bias and loss of estimate precision)
  • Missing values may not be missing at random
  • Do not replace with zeros (→ inaccurate results)
  • Nonresponse indicators should be compiled and analyzed because missing values may be systematic
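A minimal sketch of why zero-filling distorts results and how a simple nonresponse indicator can be compiled. The data and the function names (`nonresponse_rate`, `mean_zero_filled`, `mean_of_respondents`) are hypothetical, and missing entries are assumed to be stored as `None`. The respondent-only mean is shown for comparison only; as the slide notes, it can still be biased when values are not missing at random.

```python
import statistics

def nonresponse_rate(values):
    """Share of entries that are missing (None): a basic nonresponse indicator."""
    return sum(v is None for v in values) / len(values)

def mean_zero_filled(values):
    """Mean after replacing missing entries with zero (the practice to avoid)."""
    return statistics.fmean([0.0 if v is None else v for v in values])

def mean_of_respondents(values):
    """Mean over reported entries only; still biased if data are not missing at random."""
    return statistics.fmean([v for v in values if v is not None])

# Hypothetical responses: two of five units did not answer.
reported = [120.0, None, 95.0, None, 110.0]

print(nonresponse_rate(reported))     # share of missing entries
print(mean_zero_filled(reported))     # pulled toward zero by the fake zeros
print(mean_of_respondents(reported))  # mean of the three reported values
```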
Outliers
• Observations that do not fit well to a model
  • Ex: Median - k*IQR < value < Median + k*IQR
  • Ex: Month-on-month change <= 50%
• May be defined by one variable (univariate) or a set of variables (multivariate)
• Two types
  • Representative: correct with similar units in population
  • Non-representative: either incorrect or correct but unique
    • Ex: correct – isolated labor strike at a plant
Outliers (2)
• Detection
  • Univariate
  • Multivariate
  • Periodic data (e.g. Hidiroglou-Berthelot)
  • Regression models or tree-models
Edit Rules
• Edit rules are used to determine whether a value is consistent or may be erroneous
• Surveys are often created to allow these rules
• Edit rules flag data in two ways
  • Fatal edit – indicates a value that is (almost) certainly in error
  • Query edit – indicates values that may be in error
Types of Edit Rules
• Validation edits – often in the form of if-then statements
  • Ex: if total hours worked > 0 then employees > 0
  • Ex: if Σ production quantity > 0 then Σ production value > 0
  • Ex: if revenue from manufacturing plant > 0 then
    • hours worked by machinery technicians > 0
    • plant capacity utilization > 0
    • Σ production volume > 0
    • Σ production value > 0
• Balance edits – detail items must add to total
  • Ex: total employee remuneration = wages + salaries + employer contributions to social security + welfare benefits + profits distributed to workers
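A validation edit and a balance edit can each be written as a small predicate that returns True when the record passes. This is a sketch under assumptions: the record layout, the field names and the `tolerance` parameter are hypothetical and not taken from the slides.

```python
def validation_edit(record):
    """If total hours worked > 0 then employees > 0 (an if-then edit)."""
    if record["hours_worked"] > 0:
        return record["employees"] > 0
    return True  # the condition does not apply, so the edit passes

def balance_edit(record, total_field, component_fields, tolerance=0.0):
    """Detail items must add up to the total, within an assumed rounding tolerance."""
    detail_sum = sum(record[name] for name in component_fields)
    return abs(record[total_field] - detail_sum) <= tolerance

REMUNERATION_PARTS = ["wages", "salaries", "social_security",
                      "welfare_benefits", "profits_distributed"]

# Hypothetical record that passes both edits.
good = {"hours_worked": 1600, "employees": 10, "remuneration": 500,
        "wages": 200, "salaries": 200, "social_security": 60,
        "welfare_benefits": 30, "profits_distributed": 10}
# Hypothetical record that fails both: hours without employees, total too high.
bad = dict(good, employees=0, remuneration=650)

for rec in (good, bad):
    print(validation_edit(rec),
          balance_edit(rec, "remuneration", REMUNERATION_PARTS))
```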
Types of Edit Rules (2)
• Ratio edits – the ratio of two data items is bounded by lower and upper bounds. The pairs should be correlated.
  • Ex: total hours/employee/day is between 6 and 10 (very correlated)
  • Ex: plant capacity utilization <= 20% change from previous month
  • Ex: wages (W) should change within 10% of the change in total employment (E): (E_t/E_{t-1} - 1) - 0.1 <= W_t/W_{t-1} - 1 <= (E_t/E_{t-1} - 1) + 0.1
  • Ex: Σ product value / Σ product quantity <= 10% change from previous month
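The ratio edits above translate directly into bounded comparisons. The sketch below implements the hours-per-employee bound and the wage/employment inequality; the function names and the sample figures are hypothetical, and previous-period values are assumed to be strictly positive.

```python
def ratio_edit(numerator, denominator, lower, upper):
    """Pass when lower <= numerator/denominator <= upper."""
    if denominator == 0:
        return False  # ratio undefined; flag for review (an assumed convention)
    return lower <= numerator / denominator <= upper

def wage_employment_edit(w_prev, w_cur, e_prev, e_cur, band=0.10):
    """Pass when the wage change stays within +/- band of the employment change.

    Implements (E_t/E_{t-1} - 1) - band <= W_t/W_{t-1} - 1 <= (E_t/E_{t-1} - 1) + band.
    """
    wage_change = w_cur / w_prev - 1
    employment_change = e_cur / e_prev - 1
    return employment_change - band <= wage_change <= employment_change + band

# Hours per employee per day: 1600 hours over 200 employee-days is 8, within [6, 10].
print(ratio_edit(1600, 200, 6, 10))
# Wages up 5% while employment is up 4%: inside the band.
print(wage_employment_edit(100, 105, 50, 52))
# Wages up 30% while employment is up 4%: outside the band, so the edit fails.
print(wage_employment_edit(100, 130, 50, 52))
```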
Types of Edit Rules (3)
• Hidiroglou-Berthelot is a particular type of ratio edit
  • Ex: Employee month-on-month change
    • <= 100 employees: <= 50% change from previous month
    • 100 < employees <= 200: <= 20% change from previous month
    • > 200 employees: <= 10% change from previous month
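The size-dependent tolerances in this example can be sketched as a lookup followed by a relative-change check. This illustrates only the tiered rule on the slide, not the full Hidiroglou-Berthelot procedure. Two assumptions are made here: the size class is taken from the previous month's employment, and change is measured as an absolute relative change.

```python
def allowed_change(previous_employees):
    """Maximum month-on-month relative change allowed for the unit's size class."""
    if previous_employees <= 100:
        return 0.50
    if previous_employees <= 200:
        return 0.20
    return 0.10

def employment_change_edit(previous_employees, current_employees):
    """Pass when the relative change is within the size-class tolerance."""
    if previous_employees <= 0:
        return False  # no base for a relative change; flag for review (assumed)
    change = abs(current_employees - previous_employees) / previous_employees
    return change <= allowed_change(previous_employees)

print(employment_change_edit(80, 110))   # small unit, 37.5% change
print(employment_change_edit(150, 190))  # mid-size unit, about 26.7% change
print(employment_change_edit(500, 540))  # large unit, 8% change
```

The tolerance narrows as units grow, so large units are held to a tighter band than small ones.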
Editing & Imputation Process
• Interactive/Manual – a record with flagged data is manually reviewed, preferably by a subject matter expert
• Automatic – a record with flagged data is automatically reviewed and corrected by a computer
• Selective – designed to route edits/imputations into interactive or automatic streams
  • based on influential vs. non-influential errors
• Macro-editing
Selective Editing
• Distinguishes between errors in values that have a significant influence on the survey estimate and those that are insignificant to the estimate
• Selective editing splits raw data into two streams:
  • critical stream: records that most likely contain influential errors, and records of large companies
  • non-critical stream: records that are unlikely to contain influential errors
• A score function determines which responses go into which stream
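Selective Editing (2)
• Local score function = influence * risk
• For example: Influence and Risk are each defined by a formula in terms of the raw value, the anticipated value and the sampling weight [formulas not recoverable from the source]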
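Since the slide's formulas did not survive, the sketch below uses one common formulation as an explicit assumption, not as the presenter's definition: influence = sampling weight * anticipated value, and risk = |raw value - anticipated value| / anticipated value. The function name `local_score` and the sample figures are hypothetical.

```python
def local_score(raw_value, anticipated_value, sampling_weight):
    """Local score = influence * risk, under the assumed definitions below.

    ASSUMPTION (not from the slides):
      influence = sampling_weight * anticipated_value
      risk      = |raw_value - anticipated_value| / anticipated_value
    """
    if anticipated_value <= 0:
        raise ValueError("anticipated_value must be positive for this sketch")
    influence = sampling_weight * anticipated_value
    risk = abs(raw_value - anticipated_value) / anticipated_value
    return influence * risk

# A unit reporting 1200 against an anticipated 1000, with sampling weight 5.
print(local_score(1200, 1000, 5))
# A unit reporting exactly the anticipated value has zero risk and zero score.
print(local_score(1000, 1000, 5))
```

Under these definitions the score reduces to the weighted absolute deviation from the anticipated value, which is the potential effect of the suspect value on a weighted total.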
Selective Editing (3)
• Local score functions are aggregated into global score functions for each record
• First, local scores are scaled, e.g. by dividing observed values by mean values
• Scaled local scores are combined into a global score. For example: Minkowski metric (a common approach)
• The influence of large local scores increases with α
  • α = 1: simple sum of local scores
  • α = 2: Euclidean metric
  • α -> ∞: max local score
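A sketch of the Minkowski combination, written in its usual form GS = (Σ_j LS_j^α)^(1/α) over a record's scaled local scores LS_j. The function name `global_score` is hypothetical, and local scores are assumed to be non-negative and already scaled.

```python
import math

def global_score(local_scores, alpha=2.0):
    """Combine scaled, non-negative local scores with the Minkowski metric."""
    if math.isinf(alpha):
        return max(local_scores)  # limiting case: the largest local score
    return sum(score ** alpha for score in local_scores) ** (1.0 / alpha)

scores = [3.0, 4.0]
print(global_score(scores, alpha=1))         # simple sum
print(global_score(scores, alpha=2))         # Euclidean metric
print(global_score(scores, alpha=math.inf))  # max local score
```

Raising α shifts the global score toward the single largest local score, so one badly failing variable is enough to prioritise a record.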
Selective Editing (4)
• A global score (GS) cut-off threshold must be determined
  • All records above the cut-off are selected for interactive editing
• A simulation can be performed on previous data to determine a threshold
  • Raw unedited values and corresponding edited values are used
  • The first p% of records, ranked by global score, are edited and the resultant estimate is compared with the fully edited estimate
  • Trial and error over p finds the point at which the two estimates (nearly) agree, which gives the corresponding cut-off value
• Alternatively, a threshold doesn't need to be used
  • Records can be edited in priority order until time or budget constraints tell one to stop
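The simulation on previous data can be sketched as follows. Records are ranked by global score, edited values replace raw values one record at a time, and the loop stops once the weighted total is close enough to the fully edited total. The record layout, the function name `simulate_cutoff` and the relative `tolerance` are assumptions for illustration; a weighted total is used as the estimate.

```python
def simulate_cutoff(records, tolerance=0.01):
    """Return (number of records edited, score of the last edited record).

    Each record is a dict with keys "raw", "edited", "weight" and "score".
    Returns (0, None) when the raw estimate is already within tolerance.
    """
    ranked = sorted(records, key=lambda r: r["score"], reverse=True)
    full_estimate = sum(r["weight"] * r["edited"] for r in ranked)
    # Start from the all-raw estimate, then swap in edited values in score order.
    estimate = sum(r["weight"] * r["raw"] for r in ranked)
    if abs(estimate - full_estimate) <= tolerance * abs(full_estimate):
        return 0, None
    for n_edited, record in enumerate(ranked, start=1):
        estimate += record["weight"] * (record["edited"] - record["raw"])
        if abs(estimate - full_estimate) <= tolerance * abs(full_estimate):
            return n_edited, record["score"]
    return len(ranked), ranked[-1]["score"]

# Hypothetical previous-period data: one large error and two small ones.
history = [
    {"raw": 1000, "edited": 100, "weight": 1, "score": 9.0},
    {"raw": 52, "edited": 50, "weight": 1, "score": 0.4},
    {"raw": 30, "edited": 30, "weight": 1, "score": 0.0},
    {"raw": 21, "edited": 20, "weight": 1, "score": 0.5},
]
print(simulate_cutoff(history, tolerance=0.02))
print(simulate_cutoff(history, tolerance=0.005))
```

A looser tolerance stops after fewer records and therefore yields a higher cut-off score, which is the trade-off between editing cost and accuracy.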
Selective Editing (5)
• A score function can be augmented in many ways
  • E.g. Size criteria where large enterprises are always selected for critical stream (influence irrespective of risk)
• Selective editing improves efficiency
Macro-Editing
• Macro-editing techniques account for the distribution of variables and for the plausibility of estimates
• Two forms of macro-editing
  • Aggregation method
  • Distribution method
Macro-Editing - Aggregation
• Verify whether the figures to be published seem plausible
• Compare estimates with
  • Previous estimate values
  • Values from other related sources
  • Related estimates (such as electricity production and consumption)
Macro-Editing - Distribution
• Available data are used to characterize the distribution of variables
• Individual values are compared with this distribution
• Records that contain uncommon values may require further inspection and possibly editing
Macro-Editing Example: Graphical Editing
• Univariate plot
• Bivariate scatter plot