Deliverable 2.6: Selective Editing. Hannah Finselbach (ONS, UK) and Orietta Luzi (ISTAT, Italy)
Overview • Introduction • Related projects • Combining data sources • Selective editing – data sources and tools • Selective editing in SDWH Framework • Proposed case studies • Deliverable outcomes and recommendations
Introduction • Selective editing options for a Statistical Data Warehouse (SDWH), including options for weighting the importance of different outputs • Deliverable led by the UK and Italy • Review and quality assurance by Sweden (SELEKT) • Q1: Would you like to review and give comments? (Yes/No)
Statistical Data Warehouse (SDWH) • Benefits: • Decreased cost of data access and analysis • Common data model • Common tools • Drive increased use of administrative data • Faster and more automated data management and dissemination
Statistical Data Warehouse (SDWH) • Drawbacks: • Can have high costs for maintenance and implementing changes • Tools may need to be developed for statistical processes • Methodological issues of the SDWH framework – covered by WP2 • Phase 1 (SGA-1): “Work in progress” for most NSIs
Combining data sources • Many NSIs use admin data or registers to produce statistics • Advantages include: • Reduction in data collection and statistical production costs; large amounts of data available; re-use of data to reduce respondent burden. • Drawbacks include: • Different unit types (statistical and legal); timeliness; discrepancies in variable definitions. • A mixed-source approach is usually required
Editing • UNECE Glossary of terms on Statistical Data Editing: • “an activity that involves assessing and understanding data, and the three phases of detection, resolving, and treating anomalies…” • Large amount of literature on: • Editing business surveys • Editing administrative data
Aims and related projects • This deliverable aims to add value by investigating how to apply (selective) editing when combining sources • Mapping with other projects: • ESSnet on Data Integration • ESSnet on Administrative Data • MEMOBUST • EDIMBUS Project (2007) • EUREDIT Project (2000-2003) • BLUE-ETS • Q2: Do you know of any other relevant projects? (Yes/No)
Editing combined data sources • SDWH will combine survey, register and admin data sources • Editing required for: • maintaining the business register and its quality; • a specific output and its integrated sources; • improving the statistical system. • Part of quality control in SDWH • Split processes for data sources? (e.g. France)
Combined sources - Questions… • Q3: Do you currently combine data sources? • A. Yes; B. No; C. Unsure. • Q4: Do you have separate editing processes for each data source? • A. Only survey data edited (do not edit admin data); • B. Data sources edited separately; • C. Data sources edited separately, but units/variables in both sources edited for coherence; • D. Other.
Selective editing • Editing – traditionally time consuming and expensive • Selective / significance editing: • Prioritises records based on a score function that expresses the impact of their potential errors on estimates • The score should consist of a risk (suspicion) component and an influence (potential impact) component • Divides anomalies into a critical and a non-critical stream for possible clerical or manual resolution (possibly including follow-up) • A more efficient editing process (see the sketch below)
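As an illustration of the score just described, here is a minimal Python sketch of a unit-level score in a risk × influence form, with units above a threshold routed to the critical stream. The field names, the exact formula and the threshold value are hypothetical choices for this sketch, not something the deliverable prescribes.

```python
# Minimal sketch of a selective editing score: risk (suspicion) times
# influence (potential impact on the estimate). All names and the exact
# formula are illustrative.

def selective_editing_score(reported, expected, weight, domain_total):
    """Unit-level score = risk x influence."""
    deviation = abs(reported - expected)
    risk = deviation / max(abs(expected), 1.0)     # relative suspicion
    influence = weight * deviation / domain_total  # weighted impact on the domain total
    return risk * influence

# Units scoring above the threshold go to the critical stream for clerical
# review or follow-up; the rest are accepted or treated automatically.
records = [
    {"id": 1, "reported": 1200.0, "expected": 1000.0, "weight": 5.0},
    {"id": 2, "reported": 9000.0, "expected": 1100.0, "weight": 5.0},
]
THRESHOLD = 0.05
DOMAIN_TOTAL = 50_000.0

critical = [r for r in records
            if selective_editing_score(r["reported"], r["expected"],
                                       r["weight"], DOMAIN_TOTAL) > THRESHOLD]
noncritical = [r for r in records if r not in critical]
```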
Selective editing – Survey and Admin data • Use admin data as auxiliary data in the selective editing score function for survey data (e.g. UK, Italy) • Use a score of the differences between data sources to determine which units need manual intervention (e.g. France; see the sketch below) • Use scores based on historical data • Apply selective editing to admin data using the same score function as for survey data, but with weights = 1 (e.g. the French SBS system)
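A hedged sketch of the between-source score in the second bullet, with hypothetical names and values: score the discrepancy between the survey and admin value for the same unit, and flag only influential differences for manual intervention. The final comment notes how the weights = 1 option for admin data fits the same pattern.

```python
# Sketch of a discrepancy score between two sources (names illustrative).

def discrepancy_score(survey_value, admin_value, weight, domain_total):
    """Weighted absolute difference between sources, relative to the domain total."""
    return weight * abs(survey_value - admin_value) / domain_total

# Flag the unit for manual intervention only if the discrepancy is influential.
needs_review = discrepancy_score(880.0, 910.0, weight=4.2,
                                 domain_total=25_000.0) > 0.01

# For admin data with no sampling (cf. the French SBS system), the same kind
# of score function can be applied with weight = 1.
```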
Selective editing – question • Q5: Is selective editing used in the processing of admin/register data at your organisation? • A. No; • B. No, but admin data used as auxiliary for selective editing of survey data; • C. No, but a score function is used to compare data sources; • D. Yes, selective editing is applied to admin data; • E. Not sure.
Selective editing – tools • SELEMIX – ISTAT • SELEKT – Statistics Sweden • Significance Editing Engine (SEE) – ABS • SLICE – Statistics Netherlands • Q6: Are you aware of any other selective editing tools? • A. Yes, I can provide documentation; • B. Yes; • C. No.
Selective editing in SDWH • Methodological issues: • Survey weights not meaningful in SDWH • Weight = 1? • Several sets of weights tailored for different uses? • Selective editing of data “without purpose” • Importance weights for all potential uses? • Alternative editing approach? • Scores to compare data sources • Should score functions be used, or should all discrepancies be followed up, or automatically corrected? • Selective editing of admin data – manual intervention? • Is selective editing appropriate if manual intervention is not possible? • Should automatic correction be applied to admin data identified as suspicious?
Any solutions? … • Survey weights used in the selective editing score are not meaningful • Q7: What do you think would be the best option: • A. Everything in the SDWH represents itself and therefore weights = 1 • B. Calculate several survey weights for all known uses of a unit's data item and incorporate into one global score • C. Calculate separate scores for all outputs, and combine them (max, average, sum; see the sketch below) • D. Other – discuss!
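A small sketch of options B/C above (all names hypothetical): compute a separate local score for each output a data item feeds, then combine the local scores into one global score by max, mean or sum.

```python
# Sketch: combine per-output selective editing scores into one global score.

def combine_scores(local_scores, method="max"):
    """Combine local (per-output) scores; method is 'max', 'mean' or 'sum'."""
    if method == "max":
        return max(local_scores)
    if method == "mean":
        return sum(local_scores) / len(local_scores)
    if method == "sum":
        return sum(local_scores)
    raise ValueError(f"unknown combination method: {method}")

# e.g. one unit's scores against three outputs that re-use the same data item
global_score = combine_scores([0.02, 0.40, 0.11], method="max")  # -> 0.40
```

The choice of combination rule matters: max guarantees no single output is badly affected, while mean/sum trade off editing effort across outputs.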
Any solutions? … • Selective editing of data “without purpose” • Q8: Is selective editing appropriate if the data will be used multiple times, with unknown purpose at collection? • A. No; • B. No, another editing approach would be better; • C. Yes, we would use key known/likely outputs to calculate the score; • D. Yes, I can suggest/recommend a solution; • E. Not sure.
Any solutions? … • Scores to compare data sources • Q9: Should score functions be used to compare sources, or all discrepancies be followed up, or automatically corrected? • A. All discrepancies need to be investigated by a data expert; • B. All discrepancies need to be flagged, and can then be corrected automatically; • C. Scores should be used to flag only significant/influential discrepancies, which should be investigated by a data expert; • D. Scores should be used to flag only significant/influential discrepancies, which can then be corrected automatically; • E. Other – discuss? • F. Not sure.
Any solutions? … • Selective editing of admin data • Q10: Is selective editing appropriate if manual intervention is not possible? • A. No, only correct fatal errors, systematic errors (e.g. unit errors), and suspicious reporting patterns; • B. No, identify all errors/suspicious values and automatically correct/impute; • C. Yes, identify only influential errors to avoid over-editing/over-imputing the admin source; • D. Yes, as well as fatal errors, systematic errors and suspicious reporting patterns – to also identify influential errors; • E. Other; • F. Not sure.
Experimental studies • ISTAT: Prototype DWH for SBS • Use SELEMIX • Combine statistical and admin data sources at the micro level to estimate economic accounts variables, for known domains • Evaluate the quality of model-based selective editing and automatic correction • Re-use available data for other outputs • ONS: Combined sources for STS • Use SELEKT • Monthly Business Survey and VAT turnover data • Compare selective editing with traditional editing of admin data (followed by automatic correction), for known domains • Re-use available data for other outputs
Deliverable outcome – recommendations • Draft report published on the CROS portal – will include input from this workshop • Provide recommendations for methodological issues of using selective editing in SDWH • Using best practice from NSIs, and • Outcomes from experimental studies. • Metadata checklist
Metadata requirements • Input to editing: • Quality indicators (e.g. of the data source) • Threshold for selective editing score • Potential publication domains • Question number • Predictor/expected value for score (e.g. historical data, register data) • Domain total and/or standard error estimate for score • Edit identification • … • Output from editing: • Raw and edited value • Selective editing score • Error number/description/type • Flag if suspicious • Flag if changed • … • (One way to organise these items is sketched below)
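One possible way to hold this checklist per data item, sketched as a Python dataclass; the field names are illustrative, not a proposed metadata standard.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EditingMetadata:
    # Input to editing
    source_quality_indicator: float          # e.g. quality of the data source
    score_threshold: float                   # cut-off for the critical stream
    publication_domains: List[str]           # potential publication domains
    expected_value: Optional[float] = None   # predictor, e.g. historical/register value
    domain_total: Optional[float] = None     # for the influence component of the score
    # Output from editing
    raw_value: Optional[float] = None
    edited_value: Optional[float] = None
    selective_editing_score: Optional[float] = None
    error_type: Optional[str] = None         # error number/description/type
    flag_suspicious: bool = False
    flag_changed: bool = False
```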