New and Emerging Methods Maria Garcia and Ton de Waal UN/ECE Work Session on Statistical Data Editing, 16-18 May 2005, Ottawa
Introduction • New methods of data editing and imputation • Subdivided into 5 different themes: • Automatic editing • Imputation • E & I for demographic variables • Selective editing • Software
Invited Papers • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) • WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US) • WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)
Automatic editing: papers (1/2) Six papers: • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) • WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US) • WP 33: Data editing and logic (Australia)
Automatic editing: papers (2/2) • WP 43: Automatic editing system for the case of two short-term business surveys (Republic of Slovenia) • WP 44: A variable neighbourhood local search approach for the continuous data editing problem (Spain) • WP 46: Implicit linear inequality edits and error localization in the SPEER edit system (USCB, US)
Automatic Editing: main developments Methods based on Fellegi-Holt model • Developments at SORS • General system combines error localization with outlier detection • Plans for automation of implied edit generation • Further improvements of SPEER • Preprocessing program for generation of implied edits • Improve error localization
Automatic Editing: main developments • Framework of Fellegi-Holt theory in propositional logic • Generation of implied edits framed as logical deduction • Automatic tools that can potentially be used for finding minimal deletion set
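To make the idea of a minimal deletion set concrete, here is a minimal brute-force sketch in Python. It is not the method of any of the papers listed: the field names, domains and edits are hypothetical, and exhaustive search stands in for implied-edit generation or a logic-based solver. Each edit is written as a function that returns True when a record fails it.

```python
from itertools import combinations, product

def minimal_deletion_sets(record, domains, edits):
    """Return every smallest set of fields that can be changed so that no edit fails.

    Brute force: try field subsets in order of increasing size and stop at the
    first size for which at least one subset admits a consistent replacement.
    """
    def passes(rec):
        return not any(edit(rec) for edit in edits)

    if passes(record):
        return [()]  # nothing needs to change
    fields = list(record)
    for size in range(1, len(fields) + 1):
        found = []
        for subset in combinations(fields, size):
            # Does some assignment of new values to this subset satisfy all edits?
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if passes(candidate):
                    found.append(subset)
                    break
        if found:
            return found
    return []  # the edits are inconsistent for this record

# Hypothetical categorical domains and edits (an edit returns True on failure).
DOMAINS = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}
EDITS = [
    lambda r: r["age"] == "child" and r["marital"] == "married",
    lambda r: r["age"] == "child" and r["employed"] == "yes",
]

record = {"age": "child", "marital": "married", "employed": "yes"}
solutions = minimal_deletion_sets(record, DOMAINS, EDITS)
print(solutions)
```

In this toy record, changing age alone resolves both failed edits, whereas changing marital status or employment alone leaves one edit failing. The search is exponential in the number of fields, which is why approaches such as implied-edit generation or logical deduction are of interest for realistic records.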
Automatic Editing: main developments Methods based on some other approach • Erroneous unit measures • Model as cluster analysis problem • Ratio and balance constraints • Hybrid ratio editing and quadratic programming • Controlled rounding • Error localization as a combinatorial optimization problem • Continuous data • Successful on very large data sets
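As an illustration of how quadratic programming relates to balance edits, the sketch below handles only the simplest case: a single linear balance constraint, for which the weighted least-squares adjustment has a closed form. It is an assumption-laden toy, not the hybrid method of WP 32: the variable names, the weights and the example figures are hypothetical, and ratio (inequality) edits would require a general QP solver.

```python
def adjust_to_balance(values, coefficients, weights):
    """Minimise sum(w_i * (y_i - x_i)**2) subject to sum(a_i * y_i) == 0.

    Closed-form Lagrange solution: y_i = x_i - lam * a_i / w_i,
    with lam = (a . x) / sum(a_i**2 / w_i).
    A larger weight means the corresponding value is changed less.
    """
    residual = sum(a * x for a, x in zip(coefficients, values))
    denominator = sum(a * a / w for a, w in zip(coefficients, weights))
    lam = residual / denominator
    return [x - lam * a / w for x, a, w in zip(values, coefficients, weights)]

# Hypothetical balance edit: turnover - costs - profit = 0.
reported = [100.0, 60.0, 50.0]      # turnover, costs, profit (violates the edit by 10)
coefficients = [1.0, -1.0, -1.0]
weights = [1.0, 1.0, 1.0]           # equal trust in all three values

adjusted = adjust_to_balance(reported, coefficients, weights)
balance = sum(a * y for a, y in zip(coefficients, adjusted))
print(adjusted, balance)
```

With equal weights the violation is spread evenly over the three variables. Unequal weights would push more of the adjustment onto the values considered less reliable.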
Imputation: papers (1/2) Six papers: • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) • WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US) • WP 36: Integrated modeling approach to imputation and discussion on imputation variance (Statistics Finland)
Imputation: papers (2/2) • WP 40: Imputation of data subject to balance and inequality restrictions using the truncated normal distribution (Statistics Netherlands) • WP 41: On the imputation of categorical data subject to edit restrictions using loglinear models (Statistics Netherlands) • WP 48: Improving imputation: the plan to examine count, status, vacancy and item imputation in the decennial census (USCB, US)
Imputation: main developments Model based methods • Discrete Data • Constrained loglinear model • Linear regression model • Continuous Data • Truncated normal distribution followed by MCEM
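A rough sketch of the basic building block for the continuous case: drawing an imputed value from a normal distribution truncated to an interval implied by the edits. This is only the sampling step, using inverse-CDF sampling with Python's standard library. The parameters and bounds are hypothetical, estimating them (for example by MCEM) is not shown, and the sketch assumes the bounds do not lie so far in the tails that the CDF rounds to 0 or 1.

```python
import random
from statistics import NormalDist

def draw_truncated_normal(mu, sigma, lower, upper, rng):
    """Draw one value from N(mu, sigma**2) restricted to [lower, upper].

    Inverse-CDF method: sample a probability uniformly between the CDF values
    at the two bounds, then map it back through the inverse CDF.
    """
    dist = NormalDist(mu, sigma)
    p_lower, p_upper = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(p_lower, p_upper)
    value = dist.inv_cdf(u)
    # Guard against floating-point rounding at the interval ends.
    return min(max(value, lower), upper)

# Hypothetical case: the missing item must be non-negative and cannot exceed 12,
# e.g. because it is one component of a reported total.
rng = random.Random(2005)
draws = [draw_truncated_normal(10.0, 5.0, 0.0, 12.0, rng) for _ in range(1000)]
print(min(draws), max(draws))
```

Every draw respects the interval, so the imputed value cannot violate the corresponding inequality edits. Several variables linked by balance edits would require a multivariate version of this step.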
Imputation: main developments Implementation of imputation methods • Use Bayesian networks for imputation of discrete data • Development of QUIS for imputation of continuous data • written in SAS • uses EM algorithm, nearest neighbor, and MI
Imputation: main developments Implementation of imputation methods • Integrated Modeling Approach (IMAI) • Summary and analysis of principles of IMAI • Estimation of imputation variance • U.S. Decennial Census • Research on alternative imputation options • Administrative records, model based imputation, CANCEIS, hot deck • Development of a truth deck for evaluation
E & I for demographic variables: papers Three papers: • WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy) • WP 35: Edit and imputation for the 2006 Canadian Census (Statistics Canada) • WP 38: New procedures for editing and imputation of demographic variables (ISTAT, Italy)
E & I for demographic variables: main developments • Further improvement of CANCEIS • capability of processing all census variables • improved editing and imputation of alphanumeric, discrete, continuous and coded variables • improved user interface • Development of DIESIS • combined use of “data driven” approach (NIM) and “minimum change” approach (Fellegi-Holt)
E & I for demographic variables: main developments • Development of DIESIS • Use of graph theory to improve quality of sequential imputation • Optimization procedure to locate the household reference person • New approach for selection of donors • based on partitioning passed records into smaller subsets of similar characteristics • search for donor records within the smaller clusters
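The donor-selection idea in the last bullets can be sketched as follows. This is an illustrative toy, not the DIESIS procedure: the field names, the choice of partitioning key and the mismatch-count distance are all assumptions, and the key fields of the failed record are assumed to be reliable.

```python
from collections import defaultdict

def build_clusters(passed_records, key_fields):
    """Partition records that passed all edits into subsets sharing the same key values."""
    clusters = defaultdict(list)
    for rec in passed_records:
        clusters[tuple(rec[f] for f in key_fields)].append(rec)
    return clusters

def find_donor(failed, clusters, key_fields, match_fields):
    """Search for the closest donor only inside the failed record's own cluster."""
    candidates = clusters.get(tuple(failed[f] for f in key_fields), [])
    if not candidates:
        return None  # a real system would fall back to a wider search

    def distance(donor):
        # Number of matching fields on which donor and failed record disagree.
        return sum(donor[f] != failed[f] for f in match_fields)

    return min(candidates, key=distance)

# Hypothetical passed records and one record with a missing marital status.
passed = [
    {"id": 1, "hh_size": 2, "sex": "F", "age_group": "30-44", "marital": "married"},
    {"id": 2, "hh_size": 2, "sex": "M", "age_group": "15-29", "marital": "single"},
    {"id": 3, "hh_size": 4, "sex": "F", "age_group": "30-44", "marital": "married"},
]
failed = {"id": 9, "hh_size": 2, "sex": "M", "age_group": "15-29", "marital": None}

clusters = build_clusters(passed, key_fields=("hh_size",))
donor = find_donor(failed, clusters, ("hh_size",), ("sex", "age_group"))
imputed = {**failed, "marital": donor["marital"]}
print(donor["id"], imputed["marital"])
```

Restricting the search to one cluster means each failed record is compared with a small subset of donors rather than with every passed record, which is the source of the speed-up.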
Selective editing: papers Two papers: • WP 42: Evaluation of score functions for selective editing of annual structural business statistics (Statistics Netherlands) • WP 45: An editing procedure for low pay data in the annual survey of hours and earnings (Office for National Statistics, UK)
Selective editing: main developments • Continued use and development of selective editing • Evaluation of selective editing approaches • experiments with different sets of score functions • Development of “hybrid editing” • validate a sample of failed records • use associated data to impute remaining records
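For readers unfamiliar with score functions, here is a minimal sketch of one commonly used form: the weighted absolute difference between the reported and an anticipated value, scaled by an estimate of the total. It is not taken from WP 42 or WP 45. The record layout, the anticipated values and the threshold are hypothetical.

```python
def local_score(weight, reported, anticipated, total_estimate):
    """Approximate influence of a suspected error on the estimated total."""
    return weight * abs(reported - anticipated) / total_estimate

def select_for_review(records, total_estimate, threshold):
    """Return ids of records whose score exceeds the threshold (manual review);
    the remaining records would be left to automatic editing."""
    return [
        rec["id"]
        for rec in records
        if local_score(rec["weight"], rec["reported"], rec["anticipated"],
                       total_estimate) > threshold
    ]

# Hypothetical records: anticipated values might come from last year's data.
records = [
    {"id": "A", "weight": 10, "reported": 1200, "anticipated": 1000},
    {"id": "B", "weight": 5, "reported": 50000, "anticipated": 5000},
    {"id": "C", "weight": 20, "reported": 990, "anticipated": 1000},
]
to_review = select_for_review(records, total_estimate=100000, threshold=0.05)
print(to_review)
```

Only the record with a large weighted deviation is routed to clerical review. Evaluating different score functions, as described above, amounts to comparing how well such rankings capture the errors that matter for the published estimates.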
Software: papers Four papers: • WP 34: The transition from GEIS to BANFF (Statistics Canada) • WP 37: Concepts, materials and IT modules for data editing of German statistics (Destatis, Germany) • WP 39: SLICE 1.5: a software framework for automatic edit and imputation (Statistics Netherlands) • WP 47: Improving an edit and imputation system for the US Census of agriculture (NASS, US)
Software: main developments • Flexibility • modules rather than large systems are developed • standard statistical packages are used (SAS in BANFF and US Census of Agriculture) • Testing and implementation of the software • Quality control measures • e.g. for (donor) imputation • Integration of the edit and imputation software into the entire production process • process chain: planning, data collection, edit and imputation
General points for discussion • Are there any really new approaches? • Are the new approaches extensions of existing ideas? • Are the new approaches combinations of old ones? • Develop new approaches or consolidate old approaches? • development versus evaluation studies and testing • prototype software versus implementation of production software • Is our focus shifting? • from editing towards imputation? • from development towards implementation? • from computational aspects towards quality issues?
Automatic editing: points for discussion • Can operations research techniques be combined with techniques from mathematical logic? • What are the (dis)advantages of using SAT solvers when compared to direct integer programming methods? • What is the quality of the imputations when editing data using the quadratic programming approach?
Automatic editing: points for discussion • What is the quality of the solutions found by using the combinatorial optimization approach on real survey data? How fast is this approach on realistic data? • Can finite mixture models be used for detection of other types of systematic errors? • Should we invest in developing generic tools or software tools tailored to a particular application?
Automatic editing: points for discussion • Are there any other types of surveys that are worth the effort of generating implied edits prior to error localization? • What are the most cost-effective methods for edit/imputation in terms of resources, time, clerical intervention, quality of results?
Imputation: points for discussion • What are the (dis)advantages of using complex mathematical models for missing data imputation? Are these models too complex for survey practitioners? • What are the expected computational difficulties of applying complex models to real survey data? • What are the largest (most complex) surveys that can be imputed using these models?
Imputation: points for discussion • What is the quality of the imputations carried out using model based methods for filling in missing data? • Can we compare the different imputation models?
Imputation: points for discussion • Can more guidelines for the IMAI process be developed? • To what extent can we develop a systematic way of applying IMAI? • Is imputation variance an important issue at the moment, or should we (still) focus on imputation bias?
E & I for demographic variables: points for discussion • Can CANCEIS/DIESIS be used for other data besides demographic census data? • Can CANCEIS/DIESIS be further developed? • Should we use a combination of edit and imputation methods or a single method for demographic variables?
Selective editing: points for discussion • Can selective editing be successfully applied to large/complex surveys? • Can current methods for selective editing be further developed? • Can a general theory for selective editing be developed? • How promising is hybrid editing?
Software: points for discussion • Should we develop generic software or software tools for particular applications? • How can we ensure the flexibility of software? • Are the software tools fast enough for large/complex data sets? • To what extent should we aim to automate the editing process?