Automatic Editing with Soft Edits
Sander Scholtus (Statistics Netherlands)

Automatic editing
• Goal: detect and correct errors and missing values without human intervention
• Data is made consistent with respect to a set of edits
• Two steps:
  • detecting erroneous and missing values (error localisation)
  • imputation of new values

Automatic editing (2)
• Fellegi-Holt paradigm for error localisation: find the smallest subset of the variables that can be imputed to satisfy all edits
• Generalised version uses confidence weights
• At Statistics Netherlands: SLICE software

SLICE
• Branch-and-bound algorithm:
[Tree diagram: the root node x1 branches into "x1 erroneous" and "x1 correct"; each of the two x2 nodes branches into "x2 erroneous" and "x2 correct", leading to four x3 nodes.]

SLICE
• Branch-and-bound algorithm:
[Tree diagram: the root node x1 branches into "eliminate x1" and "fix x1"; each of the two x2 nodes branches into "eliminate x2" and "fix x2", leading to four x3 nodes.]

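For a small number of variables, the leaves of this fix/eliminate tree can be listed directly. The Python sketch below enumerates every leaf for x1, x2, x3 and reports which variables each leaf eliminates. It is exhaustive enumeration only and omits the bounding step of branch-and-bound; the function name `leaves` is an assumption made for illustration and is not part of SLICE.

```python
from itertools import product

def leaves(variables):
    """Yield, for each leaf of the tree, the tuple of eliminated variables."""
    # Each variable is either eliminated or fixed, so there are 2**n leaves.
    for choices in product(("eliminate", "fix"), repeat=len(variables)):
        yield tuple(v for v, c in zip(variables, choices) if c == "eliminate")

all_leaves = list(leaves(["x1", "x2", "x3"]))
for eliminated in all_leaves:
    print(eliminated)
```

With three variables this gives eight leaves, ranging from "eliminate everything" to "fix everything".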
SLICE (2)
• Leaf nodes of the tree:
  • all variables have been either fixed or eliminated
  • interpretation: eliminated variables are incorrect
• Associated sets of edits:
  • contain no variables
  • either empty or contain only trivial statements
• Theorem (De Waal and Quere, 2003): A leaf node corresponds to a feasible solution of the error localisation problem if and only if the associated set of edits contains no contradictions

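The leaf check in the theorem can be illustrated on a tiny case. The sketch below assumes that "fix" means substituting the observed value and that "eliminate" means Fourier-Motzkin elimination for linear edits; the slides do not spell this out. The edit P = T − C and the record come from the slides, while the edits T ≥ 0 and C ≥ 0, the data representation, and all function names are hypothetical.

```python
from fractions import Fraction

# An edit is (coeffs, const, op), read as: sum(coeffs[v] * v) + const  op  0,
# where op is "==" or ">=". This representation is an assumption of the sketch.

def combine(e1, e2, m1, m2, op):
    """Return the edit m1*e1 + m2*e2, dropping variables with coefficient 0."""
    coeffs = {}
    for v in set(e1[0]) | set(e2[0]):
        a = m1 * e1[0].get(v, 0) + m2 * e2[0].get(v, 0)
        if a != 0:
            coeffs[v] = a
    return (coeffs, m1 * e1[1] + m2 * e2[1], op)

def fix(edits, var, value):
    """Substitute the observed value of var into every edit."""
    return [({v: a for v, a in c.items() if v != var}, k + c.get(var, 0) * value, op)
            for c, k, op in edits]

def eliminate(edits, var):
    """Remove var, producing the implied edits on the remaining variables."""
    with_var = [e for e in edits if e[0].get(var, 0) != 0]
    out = [e for e in edits if e[0].get(var, 0) == 0]
    eq = next((e for e in with_var if e[2] == "=="), None)
    if eq is not None:
        # Use the equality to substitute var out of the other edits.
        for e in with_var:
            if e is not eq:
                factor = Fraction(e[0][var]) / Fraction(eq[0][var])
                out.append(combine(e, eq, 1, -factor, e[2]))
        return out
    # Inequalities only: pair each lower bound on var with each upper bound.
    pos = [e for e in with_var if e[0][var] > 0]
    neg = [e for e in with_var if e[0][var] < 0]
    for p in pos:
        for n in neg:
            out.append(combine(p, n, -Fraction(n[0][var]), Fraction(p[0][var]), ">="))
    return out

def no_contradiction(edits):
    """Check a leaf: every edit must be a variable-free statement that is true."""
    for coeffs, const, op in edits:
        if any(coeffs.values()):
            raise ValueError("not a leaf: an edit still contains variables")
        if (op == "==" and const != 0) or (op == ">=" and const < 0):
            return False
    return True

hard_edits = [
    ({"P": 1, "T": -1, "C": 1}, 0, "=="),  # P = T - C (from the slides)
    ({"T": 1}, 0, ">="),                   # T >= 0 (hypothetical)
    ({"C": 1}, 0, ">="),                   # C >= 0 (hypothetical)
]
record = {"T": 100, "P": 40000, "C": 60000}

# Leaf where every variable is fixed (nothing is considered erroneous).
leaf = hard_edits
for v in ("T", "P", "C"):
    leaf = fix(leaf, v, record[v])
all_fixed_ok = no_contradiction(leaf)

# Leaf where T and C are fixed and P is eliminated.
leaf = fix(fix(hard_edits, "T", record["T"]), "C", record["C"])
only_p_ok = no_contradiction(eliminate(leaf, "P"))

print(all_fixed_ok, only_p_ok)
```

Fixing all three variables leaves the false statement 99900 = 0, so that leaf is infeasible. Eliminating P leaves only the true statements 100 ≥ 0 and 60000 ≥ 0, so imputing P alone can satisfy these hard edits.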
SLICE (3)
• Application of SLICE in the production process:
  • automatic editing of micro data for the Dutch structural business statistics
  • approximately 100 variables and 100 edits
  • evaluation studies: sometimes large differences between automatic and manual editing

Hard edits and soft edits
• Examples of edits:
  • Profit = Turnover – Costs
  • Profit < 0.6 × Turnover
• First example:
  • hard edit
  • has to hold by definition
• Second example:
  • soft edit
  • can also be failed by correct values

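The two example edits can be written as simple checks on a record. The sketch below is illustrative only: the function names and the dictionary layout are assumptions, and the two records are borrowed from the worked example later in the talk (the observed record and one possible corrected version).

```python
def hard_edit(r):
    """Profit = Turnover - Costs: has to hold by definition."""
    return r["P"] == r["T"] - r["C"]

def soft_edit(r):
    """Profit < 0.6 x Turnover: a failure is suspicious, not necessarily an error."""
    return r["P"] < 0.6 * r["T"]

original = {"T": 100, "P": 40000, "C": 60000}
corrected = {"T": 100, "P": 40, "C": 60}

for name, r in (("original", original), ("corrected", corrected)):
    print(name, "hard:", hard_edit(r), "soft:", soft_edit(r))
```

The original record fails both checks, while the corrected record passes both. A failed hard edit proves that some value is wrong; a failed soft edit only makes an error more plausible.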
Hard edits and soft edits (2)
• Manual editing uses both hard and soft edits
• Current methods for automatic editing can only handle hard edits
• Practical solutions:
  • ignore all soft edits
  • treat soft edits as hard edits
• Can this be improved?

Error localisation with soft edits
• Current error localisation problem: minimise, among subsets of variables that can be imputed to satisfy all edits, the sum of the confidence weights
• Suggested new error localisation problem: minimise, among subsets of variables that can be imputed to satisfy all hard edits, the sum of the confidence weights plus a cost term for failed soft edits

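One possible way to write the two problems down is sketched below. The notation is assumed here and does not appear on the slides: y_j ∈ {0, 1} indicates whether variable j is imputed, w_j is its confidence weight, z_k ∈ {0, 1} indicates whether soft edit k is failed, and s_k ≥ 0 is the cost of failing it. The additive per-edit cost is only one choice of cost term; it matches the example in this talk, where each failed soft edit contributes a fixed amount.

```latex
% Current problem: all edits are treated as hard.
\min_{y} \; \sum_{j} w_j \, y_j
\quad \text{s.t. the variables with } y_j = 1
\text{ can be imputed so that all edits hold.}

% Suggested problem: only the hard edits are constraints;
% failed soft edits are penalised in the target function.
\min_{y,\,z} \; \sum_{j} w_j \, y_j + \sum_{k} s_k \, z_k
\quad \text{s.t. the variables with } y_j = 1
\text{ can be imputed so that all hard edits hold.}
```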
Error localisation with soft edits (2)
• The new error localisation problem can be solved by an extended version of the SLICE algorithm
[Tree diagram: the root node x1 branches into "eliminate x1" and "fix x1"; each of the two x2 nodes branches into "eliminate x2" and "fix x2", leading to four x3 nodes.]

Example
• Variables: Turnover (T), Profit (P), Costs (C), Number of Employees (N)
• Edits:
  • Hard edits: [formulas not preserved]
  • Soft edits: [formulas not preserved]
• Confidence weights: Turnover: 2; Profit: 1; Costs: 1; Number of Employees: 3
• Contribution of each failed soft edit: 2

Example (2)
• Original data and edits: T = 100; P = 40000; C = 60000; N = 5
  • Hard edits: [formulas not preserved]
  • Soft edits: [formulas not preserved]

Example (3)
• Original data and edits: T = 100; P = 40000; C = 60000; N = 5
  • Hard edits: [formulas not preserved]
  • Soft edits: [formulas not preserved]
• Eliminate P from the original edits:
  • Implied hard edits: [formulas not preserved]
  • Implied soft edits: [formulas not preserved]

Example (4)
• According to the theory, P can be imputed to satisfy all hard edits, but the second soft edit is failed
• Imputing only P is a feasible solution to the error localisation problem
• The value of the target function equals 1 + 2 = 3 (the confidence weight of P plus the contribution of one failed soft edit)

Example (5)
• Data and edits after eliminating P: T = 100; C = 60000; N = 5
  • Implied hard edits: [formulas not preserved]
  • Implied soft edits: [formulas not preserved]
• Eliminate C from these edits:
  • Implied hard edits: [formulas not preserved]
  • Implied soft edits: [formulas not preserved]

Example (6)
• According to the theory, P and C can be imputed to satisfy all hard and soft edits
• Imputing P and C is a feasible solution to the error localisation problem
• The value of the target function equals 1 + 1 = 2 (the confidence weights of P and C, with no failed soft edits)
• This turns out to be the optimal solution
• Possible corrected version of the record: T = 100; P = 40; C = 60; N = 5

Example (7)
• Imputing only P is the optimal solution if the soft edits are ignored
• Corrected version of the record: T = 100; P = -59900; C = 60000; N = 5

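The arithmetic behind the two outcomes can be reproduced in a few lines of Python. This sketch compares only the two candidate solutions discussed in the example, using the confidence weights and the number of failed soft edits stated on the slides; it does not search over all subsets, and the names `target`, `candidates`, and `best` are assumptions made for illustration.

```python
# Confidence weights from the example.
WEIGHTS = {"T": 2, "P": 1, "C": 1, "N": 3}

def target(imputed, failed_soft, soft_cost):
    """Sum of confidence weights plus a fixed cost per failed soft edit."""
    return sum(WEIGHTS[v] for v in imputed) + soft_cost * failed_soft

# Candidate solutions mapped to the number of soft edits they leave failed,
# as stated on the slides.
candidates = {("P",): 1, ("P", "C"): 0}

def best(soft_cost):
    """Return the candidate with the smallest target value."""
    return min(candidates, key=lambda s: target(s, candidates[s], soft_cost))

with_soft = {s: target(s, n, 2) for s, n in candidates.items()}
without_soft = {s: target(s, n, 0) for s, n in candidates.items()}
print(with_soft, best(2))
print(without_soft, best(0))
```

With a cost of 2 per failed soft edit, imputing P and C (value 2) beats imputing only P (value 3). With the soft edits ignored, imputing only P (value 1) wins, which yields the implausible profit of 100 − 60000 = −59900.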
Discussion
• Future work:
  • Implementation of the algorithm in R (in progress)
  • Test on realistic data (Dutch structural business statistics)
  • How to model the costs of failed soft edits

Thank you for your attention!