Data Editing: Introduction UNSD Regional Workshop on Census Data Processing for the English speaking African Countries: Contemporary technologies for data capture, methodology and practice of data editing Dar es Salaam, Tanzania, 9-13 June 2008
Objectives of Session
• Editing is the procedure for detecting and correcting errors in data.
• Imputation is the procedure for assigning values to missing or inconsistent data.
• The objective of the session is to present an overview of the concepts and definitions, and to discuss their application and related issues.
Summary
• Types of Errors in the census process
• Objectives of Editing: Why do it?
• How to and Why Edit? Some illustrative examples
• Principles of Editing: How to do it
• Fatal versus Query Edits
• Micro-editing versus Macro-editing
• Manual versus automatic editing
• Impact of capture mode on editing
• Pitfalls of Over-editing
• Other considerations
Types of Errors in the Census Process
• Coverage Errors
  • Incomplete/inaccurate maps or EAs
  • Incomplete canvassing of all units
  • Duplicate counting
  • Omission of persons unwilling to be enumerated
  • Erroneous treatment of visitors or non-resident aliens (especially in relation to de jure versus de facto methods)
  • Loss or destruction of census records after enumeration
  • ……
Types of Errors in the Census Process
• Content Errors
  • Errors in questionnaire design
  • Enumerator errors
  • Respondent errors
  • Coding errors
  • Data entry errors
  • Errors in computer editing
  • Errors in tabulation
Types of Errors in the Census Process
• Two types of errors at the processing stage:
  • Those that block further processing, and
  • Those that produce invalid or inconsistent results without interrupting the logical flow of subsequent processing operations
• ALL errors of the first kind must be corrected, and as many of the second kind as possible
Objectives of Editing: Why do it?
• Objectives of editing (Granquist, 1984):
  • “Tidy up data” so as to facilitate analysis (creation of a complete file)
  • Identify types and sources of errors (for reporting on data quality)
  • Improve the quality of census data (for current and future censuses)
• It is important not only to detect errors but also to identify their causes, in order to take appropriate corrective measures and improve overall quality
How to Edit?
TABLE 1: 2010 Population by Age and Sex, Unedited and Edited
How to Edit?
TABLE 1: Population by Age and Sex, Unedited and Edited
• How to deal with data “not reported”?
• Distribute the age unknowns and the sex unknowns in the same proportion as the corresponding known values
• For example, for 23 sex unknowns, distribute (2033/4147)*23 = 12 to males (and the remaining 11 to females by subtraction); see RHS of Table 1
• Similarly, distribute the 15 age unknowns across the 6 age groups in proportion to the known values; see RHS of Table 1
• This method can give biased results if the number of unknowns (non-responses) is high, since the distributions of knowns and unknowns may be very different
• An improved strategy would be to use multivariate distributions involving other variables, such as the relationship between spouses, a positive entry for number of children born, etc.
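The proportional distribution of unknowns described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `distribute_unknowns` is hypothetical, the counts are invented rather than taken from Table 1, and largest-remainder rounding is one possible choice for making the allocated whole numbers add up to the number of unknowns.

```python
from fractions import Fraction


def distribute_unknowns(known, unknown):
    """Add `unknown` cases to the categories in `known`, pro rata to the known counts.

    Largest-remainder rounding (an assumption of this sketch) guarantees that
    the whole-number allocations sum exactly to `unknown`.
    """
    total_known = sum(known.values())
    if total_known == 0:
        raise ValueError("no known values to base the distribution on")

    # Exact (fractional) share of the unknowns owed to each category.
    exact = {k: Fraction(v * unknown, total_known) for k, v in known.items()}
    # Start from the whole part of each share (counts are non-negative).
    alloc = {k: int(share) for k, share in exact.items()}
    # Hand out the remaining cases to the largest fractional remainders.
    leftover = unknown - sum(alloc.values())
    by_remainder = sorted(known, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    return {k: known[k] + alloc[k] for k in known}


# Invented counts, for illustration only (not the figures of Table 1).
reported = {"male": 480, "female": 520}
edited = distribute_unknowns(reported, unknown=20)
print(edited)
```

The same function applies to the age unknowns by passing the counts for each age group. It shares the limitation noted on the slide: when unknowns are numerous, a pro rata allocation may be biased.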
Why Edit?
TABLE 2: Population by Age with Unknowns for 2000 and 2010
Why Edit?
TABLES 2 and 3: Population by Age with and without Unknowns for 2000 and 2010
• Another problem is that unknowns may affect the analysis of trends
• In Table 2, if unknowns are not taken into account, the percentage of persons aged 15-29 years appears to increase from 27.2% in 2000 to 30.3% in 2010
• Redistributing unknowns may change this trend
• In Table 3, after distributing unknowns, the increase is only from 28.7% in 2000 to 29.3% in 2010
Why Edit?
TABLE 3: Population by Age without Unknowns for 2000 and 2010
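To see how unknowns can distort a trend, the short Python sketch below compares the share of one age group in two censuses, first with the unknowns left in the table as their own row and then after a pro rata redistribution. The counts are invented for illustration and do not reproduce Tables 2 and 3; the function name `share_pct` is hypothetical.

```python
def share_pct(group, known_total, unknown, redistribute):
    """Percentage share of one age group.

    redistribute=False: unknowns stay in the denominator as a separate row.
    redistribute=True: unknowns are spread pro rata over the known groups,
    which leaves each share equal to group / known_total.
    """
    denominator = known_total if redistribute else known_total + unknown
    return 100.0 * group / denominator


# Invented counts: the earlier census has many unknowns, the later one few.
earlier = {"group": 2700, "known_total": 9000, "unknown": 1000}
later = {"group": 3030, "known_total": 9900, "unknown": 100}

raw_change = (share_pct(**later, redistribute=False)
              - share_pct(**earlier, redistribute=False))
adjusted_change = (share_pct(**later, redistribute=True)
                   - share_pct(**earlier, redistribute=True))

print(f"apparent change with unknowns left in: {raw_change:.1f} points")
print(f"change after redistributing unknowns:  {adjusted_change:.1f} points")
```

With these invented figures most of the apparent increase comes from the drop in the number of unknowns, not from a real shift in the age structure.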
Principles of Editing: How to do it
• In general, the editing system should be:
  • Minimalist (change only obvious errors and as few as possible)
  • Automated (as much as possible, for both detection and correction)
  • Systematic
  • Consistent with other NSO statistical collections
  • Compliant with UN or other international standards
Fatal versus Query Edits
• Types of edits:
  • Fatal Edits: identify errors with certainty
  • Query Edits: identify suspected errors
• Fatal Edits identify fatal errors, which include invalid or missing entries as well as errors due to inconsistencies
• Query Edits identify data items that fall outside subjective data bounds, or items that are relatively high or low compared with other data on the same questionnaire
• Fatal edits must be resolved; query edits are more difficult to correct, bring fewer benefits than the detection and resolution of fatal edits, and add more to the cost of the process
• For query edits, subject-matter specialists should investigate the edits developed for pilot censuses and those developed during processing to make sure that each individual edit justifies its cost (e.g., look at hit rates, that is, the share of flags that result in changes to the original data)
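The distinction between fatal and query edits might be coded along the following lines. This is a minimal sketch: the field names (`sex`, `age`, `children_born`), the rules and the soft bounds are all hypothetical choices made for illustration, not rules taken from the workshop material.

```python
def check_person(rec, soft_max_age=100, soft_max_children=15):
    """Return (fatal, query) lists of messages for one person record (a dict)."""
    fatal, query = [], []
    sex = rec.get("sex")
    age = rec.get("age")
    children = rec.get("children_born") or 0  # treat a blank entry as zero

    # Fatal edits: invalid or missing entries, and outright inconsistencies.
    if sex not in ("M", "F"):
        fatal.append("sex missing or invalid")
    if not isinstance(age, int) or age < 0:
        fatal.append("age missing or invalid")
    if sex == "M" and children > 0:
        fatal.append("children born reported for a male")

    # Query edits: values outside subjective (soft) bounds, only suspected errors.
    if isinstance(age, int) and age > soft_max_age:
        query.append("age above soft bound")
    if children > soft_max_children:
        query.append("number of children unusually high")

    return fatal, query


def hit_rate(n_flags, n_changes):
    """Share of flags raised by an edit that led to a change in the data."""
    return n_changes / n_flags if n_flags else 0.0


clean = check_person({"sex": "F", "age": 34, "children_born": 2})
broken = check_person({"sex": "M", "age": None, "children_born": 1})
suspect = check_person({"sex": "F", "age": 104, "children_born": 3})
print(clean, broken, suspect, hit_rate(200, 30))
```

A query edit with a persistently low hit rate is a candidate for loosening or removal, since it adds review cost without changing much data.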
Micro-editing versus Macro-editing
• Micro-editing: concerns ways to ensure the validity and consistency of individual data records and of relationships between records in a household
• Macro-editing: checks aggregated data to make sure that they are reasonable
• Example: if census results show a large percentage of persons without a reported age, imputing age (at the micro level) will produce a complete data set
• BUT it is far more essential to make checks at the macro (aggregate) level to ensure that imputation does not skew the overall age distribution
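One possible macro-level check, sketched below in Python, compares the age-group shares among records with a reported age against the shares after imputation and flags any group that moved by more than a chosen amount. The function names, the age groups and the 2-percentage-point threshold are assumptions made for this illustration.

```python
from collections import Counter


def shares(groups):
    """Share of each age group in a list of age-group labels."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def macro_check(reported, after_imputation, max_shift=0.02):
    """Return the groups whose share moved by more than `max_shift`."""
    before = shares(reported)
    after = shares(after_imputation)
    flagged = {}
    for g in sorted(set(before) | set(after)):
        shift = after.get(g, 0.0) - before.get(g, 0.0)
        if abs(shift) > max_shift:
            flagged[g] = round(shift, 4)
    return flagged


# Invented data: 100 records with a reported age, 20 records needing imputation.
reported = ["0-14"] * 40 + ["15-29"] * 30 + ["30+"] * 30

# A poor imputation puts every missing age in one group and skews the distribution.
skewed = reported + ["15-29"] * 20
# An imputation that follows the reported distribution leaves the shares unchanged.
balanced = reported + ["0-14"] * 8 + ["15-29"] * 6 + ["30+"] * 6

print(macro_check(reported, skewed))
print(macro_check(reported, balanced))
```

A comparison against the reported distribution is only a first screen; if non-respondents genuinely differ from respondents, some shift is legitimate and needs subject-matter review.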
Impact of Capture Mode on Editing
• Types of capture modes typically used: manual (key-entry), OMR, OCR/ICR, PDA, Internet
• For key-entry, PDA and Internet: some limited detection and correction of errors can be done in “real time”
• This is not possible for OMR or OCR/ICR (scanning from paper questionnaires), where editing is limited to “batch editing” after the fact
Manual versus Automated Editing
• Manual edits may be done in several places along the editing chain: by the enumerator, supervisor, field office worker, coder, key entry clerk, etc.
• The disadvantage is that manual editing consumes an enormous amount of time (months or years), energy (human resources) and money
• If the data set is small, timing is not crucial and a work force is available, then manual editing may be feasible
• Automated editing reduces the time required, decreases the introduction of human error, and allows for the creation of an edit trail (and is therefore reproducible)
• Unlike manual editing, automated editing makes it feasible and efficient to impute responses based on other information in the questionnaire or on reported information for a unit with similar characteristics
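As an illustration of automated imputation from "a unit with similar characteristics" together with an edit trail, the Python sketch below uses a simple sequential hot deck: a missing age is filled with the age of the most recent record of the same sex. The technique, the function name `hot_deck_age`, the field names and the starting ("cold deck") values are assumptions of this sketch, and it presumes that sex has already been edited.

```python
def hot_deck_age(records, cold_deck):
    """Fill missing 'age' values from the latest donor of the same sex.

    records:   list of dicts with 'id', 'sex' and 'age' (None if missing)
    cold_deck: starting donor values per sex, used until a real donor is seen
    Returns (edited_records, edit_trail); the input records are not modified.
    """
    deck = dict(cold_deck)
    edited, trail = [], []
    for rec in records:
        rec = dict(rec)  # work on a copy so the unedited data are preserved
        sex = rec["sex"]
        if rec.get("age") is None:
            rec["age"] = deck[sex]
            # Every change is logged so the edit can be audited and reproduced.
            trail.append({"id": rec["id"], "item": "age",
                          "old": None, "new": rec["age"],
                          "rule": "hot deck by sex"})
        else:
            deck[sex] = rec["age"]  # this record becomes the current donor
        edited.append(rec)
    return edited, trail


# Invented records, for illustration only.
people = [
    {"id": 1, "sex": "F", "age": 34},
    {"id": 2, "sex": "M", "age": None},
    {"id": 3, "sex": "F", "age": None},
    {"id": 4, "sex": "M", "age": 52},
    {"id": 5, "sex": "M", "age": None},
]
edited, trail = hot_deck_age(people, cold_deck={"M": 30, "F": 30})
print([p["age"] for p in edited])
print(len(trail), "imputations logged")
```

Real hot-deck matrices usually condition on more variables than sex (for example relationship to head of household), in line with the multivariate approach mentioned earlier in the session.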
Pitfalls of Over-editing
• Reduced timeliness
• Increased costs
• Potential distortion of true values
• False sense of security
Other Considerations
• Determination of tolerance levels for error detection
  • For most items in a census, some small percentage of respondents will not give “acceptable” responses, for whatever reason
  • Not every failure is pervasive, and so not every failure is worthy of remedial action (see Pitfalls of Over-editing)
  • Tolerance levels indicate the number of invalid and inconsistent responses allowed before editing teams take remedial action
  • They are decided by an editing team that includes both subject-matter and data processing specialists
  • For key items such as age and sex, tolerance levels are typically low (1%-2%), whereas for less critical items such as literacy and disability they are typically higher (5%-10%)
  • Correction may occur by returning enumerators to the field, conducting telephone re-interviews or applying specific knowledge of an area
• Learning from the editing process/quality assurance systems
  • Positive and negative feedback loops need to be recorded to improve the quality of both the current census and future censuses and surveys
  • Audit trails, performance measures and diagnostic statistics are crucial
  • This is often the most important outcome of editing
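A tolerance check of the kind described above could look like the following Python sketch. The thresholds are hypothetical values picked from the upper ends of the typical ranges quoted on the slide (2% for key items, 10% for other items), and the failure counts are invented.

```python
# Hypothetical tolerance levels: share of invalid or inconsistent responses
# accepted per item before remedial action is considered.
TOLERANCE = {"age": 0.02, "sex": 0.02, "literacy": 0.10, "disability": 0.10}


def items_over_tolerance(failures, total_records, tolerance=TOLERANCE):
    """Return {item: failure rate} for items whose rate exceeds its tolerance."""
    over = {}
    for item, n_failed in failures.items():
        rate = n_failed / total_records
        if rate > tolerance[item]:
            over[item] = rate
    return over


# Invented edit-failure counts for one enumeration area of 10,000 records.
failures = {"age": 150, "sex": 250, "literacy": 700, "disability": 1200}
flagged = items_over_tolerance(failures, total_records=10_000)
print(flagged)
```

Items that stay within tolerance are left to routine automated editing, while flagged items are the ones an editing team would review for remedial action such as re-interviews.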
Other Considerations
• Cost of editing
  • The cost of editing has not decreased much in the last 20 years, although the process has been rationalized by continuous exploitation of technological developments
  • In general, editing activities take a disproportionate amount of time (and therefore staff costs) relative to other activities
  • Excessive editing can delay census results
• Archiving
  • Both edited and unedited data files should be preserved for later analysis, and in several places
  • Documentation should be complete enough for census planners to be able to reconstruct the same processes at a later date
THANK YOU!