This article discusses the steps of data validation and analysis of performance of imputation in the context of census data processing. It covers topics such as data quality checking, comparing statistics with previous censuses, checking population distribution, and dealing with missing data. The article also explores the performance assessment of imputation methods.
DATA VALIDATION and ANALYSIS OF PERFORMANCE OF IMPUTATION United Nations Statistics Division
Generic Statistical Business Process-Censuses Pre-enumeration operations Preliminary evaluation of data quality
Data validation during data processing • The steps of data processing depend on the technology used. In general, the process covers the following steps: • Review and validate data against predefined rules • Identify potential problems such as missing data, inconsistency and inappropriate editing/imputation BEFORE PRODUCING CENSUS OUTPUTS
Data analysis • Steps for Data Analysis • Checking data quality with appropriate methods, • Comparing the statistics with previous censuses and other relevant data sources (both internal and external) • Investigating inconsistencies in the statistics
Data validation • Checking population distribution by geographic areas • Checking the quality of editing/imputation • Checking internal consistency and missing data
Data validation-1 • Checking population distribution by geographic areas • Enumerated persons/households may not be fully captured (undercoverage) or may be captured twice (overcoverage) • Controlling captured records (people/housing units) against census documents such as: • Control forms, prepared by enumerators/supervisors • Reports, prepared by Local/Regional Census Committees • Number of questionnaires received from the field, prepared by the headquarters • Number of scanned questionnaires, if applicable • Ensuring the enumerated population is fully processed
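The control of captured records against census documents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the area codes, the counts and the function name `reconcile_counts` are hypothetical, and a real census system would read these figures from control forms and the processing database.

```python
def reconcile_counts(control_counts, captured_counts):
    """Compare control-form counts with captured record counts per area.

    Returns area -> (expected, captured, difference) for areas that disagree.
    A negative difference suggests records were not fully captured;
    a positive one suggests some records were captured twice.
    """
    discrepancies = {}
    for area in sorted(set(control_counts) | set(captured_counts)):
        expected = control_counts.get(area, 0)
        captured = captured_counts.get(area, 0)
        if expected != captured:
            discrepancies[area] = (expected, captured, captured - expected)
    return discrepancies


# Hypothetical enumeration areas and counts, for illustration only.
control = {"EA-001": 120, "EA-002": 95, "EA-003": 143}
captured = {"EA-001": 120, "EA-002": 91, "EA-003": 145}
issues = reconcile_counts(control, captured)
for area, (expected, got, diff) in issues.items():
    print(area, "expected", expected, "captured", got, "difference", diff)
```

Areas that appear in only one of the two sources are reported with a count of 0 on the other side, so a missing batch of questionnaires shows up as well.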
Data validation - 2 • Checking the quality of editing/imputation • Editing rules may be insufficient to identify all types of errors • Imputation may introduce new errors in data because of incorrect application • Some unexpected patterns may not be identified with editing/consistency rules
Basic definitions • Editing: List of rules to determine invalid and inconsistent data • Imputation: The process of resolving problems concerning invalid or inconsistent data, and missing values, identified during editing • All records must respect a set of editing rules formulated to correct errors and finally disseminate reliable data
Some examples of invalid data • Age • Equal to 99 • Instruction: if it is greater than or equal to 98, write 98 • If age is written in one digit, such as 1 5 • How to correct?
Some examples of inconsistent data • Children ever born alive, living and dead children • If the number of children ever born is not equal to the sum of the number of living children and the number of dead children • Last live birth and household deaths • The last live birth is an infant who is no longer alive, but no infant death is registered in the household deaths • Age of father/mother and children • If the age of the father/mother is lower than, or only a few years higher than, the age of a child • What will be the decision?
Dealing with missing data • Decisions to be made for dealing with missing data: • Will missing data (item non-response) be imputed? • Which variables will be imputed for missing data? • Which methods will be used for imputation?
Assessing the performance of imputation • Objectives • To compare the distribution of the observed values with the distribution of the imputed values • To compare the distribution of the observed values with the complete distribution including the imputed values • To analyze the effect of imputation on the original data set • To ensure the distribution of imputed values is reasonable or matches the expected pattern
Assessing the performance of imputation • Method for assessing the performance: After implementation of editing/imputation, data should be classified as follows: • Observed (consistent) data: the values which meet all editing rules • Non-response or unknown: no value • Inconsistent data: the values which failed at least one editing rule • Imputed data: the values imputed for inconsistency and for non-response • For this analysis, all procedures performed in the database should be identifiable
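One way to make these classes identifiable is to keep both the raw and the final value of each item and derive a status from them. The sketch below assumes that layout; the status labels, the `classify` function and the age rule `valid_age` (0 to 98) are hypothetical choices for illustration, not a prescribed scheme.

```python
from collections import Counter


def classify(raw, final, is_valid):
    """Return the status of one item, given its raw and final values.

    raw is None for non-response; is_valid applies the editing rules.
    """
    if raw is None:
        return "imputed (non-response)" if final is not None else "non-response"
    if not is_valid(raw):
        return "imputed (inconsistent)" if final != raw else "inconsistent"
    return "observed"


def valid_age(age):
    # Hypothetical editing rule: age must lie between 0 and 98.
    return 0 <= age <= 98


# (raw value, final value after editing/imputation), illustrative data only.
pairs = [(34, 34), (None, 27), (130, 41), (None, None), (7, 7)]
statuses = [classify(raw, final, valid_age) for raw, final in pairs]
counts = Counter(statuses)
print(counts)
```

Storing the status next to each value lets the observed and imputed subsets be tabulated separately later.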
Assessing the performance of imputation • Compare the distribution of the observed values with the distribution of the imputed values • If non-response and inconsistent data are distributed randomly, • no difference is expected between the distribution of the observed and the imputed values • If there are differences between the people who responded and those who did not respond or did not give accurate data, • the imputed data should not be expected to follow the same distribution as the observed data
Assessing the performance of the imputation • Compare the distribution of the observed values with the distribution of all values including the imputed values • In general, imputed values should have a minimal effect on the distribution of the complete data • Unless the non-response rate is particularly high or non-response is biased towards certain characteristics
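Both comparisons come down to computing percentage distributions for the observed values, the imputed values and the complete data. A minimal sketch, assuming a single categorical variable held in plain Python lists; the function name `percent_distribution` and the sample values are illustrative only.

```python
from collections import Counter


def percent_distribution(values, categories):
    """Percentage of values falling in each category (0.0 if values is empty)."""
    counts = Counter(values)
    total = len(values)
    if total == 0:
        return {c: 0.0 for c in categories}
    return {c: 100.0 * counts[c] / total for c in categories}


categories = ["M", "F"]
observed = ["M"] * 48 + ["F"] * 52   # values that passed all editing rules
imputed = ["M"] * 6 + ["F"] * 4      # values produced by imputation
complete = observed + imputed        # final data set

obs_dist = percent_distribution(observed, categories)
imp_dist = percent_distribution(imputed, categories)
all_dist = percent_distribution(complete, categories)

for c in categories:
    print(c, round(obs_dist[c], 1), round(imp_dist[c], 1), round(all_dist[c], 1))
```

In this made-up example the imputed values are distributed quite differently from the observed ones (60% against 48% male), yet the complete distribution moves by only about one percentage point because few values were imputed.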
Understanding data editing and potential errors • Data on deaths in the household – cases where age of deceased was hot-decked show different age pattern of mortality than cases that were not subject to imputation Source: Estimation of mortality using the 2001 South Africa census data, Rob Dorrington, Tom Moultrie and Ian Timaeus, Centre for Actuarial Research, University of Cape Town
Understanding data editing and potential errors [Figure; annotations mark the boundary of school age and the boundary of working age] Source: England and Wales, Office for National Statistics, 2011 Census:Item Edit and Imputation: Evaluation Report, June 2012
Max Change Source: England and Wales, Office for National Statistics, 2011 Census:Item Edit and Imputation: Evaluation Report, June 2012
Assessing the performance of imputation Maximum change
Assessing the performance of imputation Source: England and Wales, Office for National Statistics, 2011 Census:Item Edit and Imputation: Evaluation Report, June 2012
Assessing the performance of imputation • Summary indexes at the variable level • Maximum absolute percent change • Maximum absolute percent change across all categories for each variable • Dissimilarity Index • Degree of change of two distributions (observed and total including imputed values) at the variable level • Imputation rate • Share of the imputed records in the total records
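Two of these summary indexes can be sketched directly. The code assumes that "percent change" means the difference in percentage points between the observed and final distributions for a category; if a relative change is intended, the formula would differ. The function names and the marital status figures are hypothetical.

```python
def max_abs_percent_change(before, after):
    """Largest absolute difference, in percentage points, across categories."""
    categories = set(before) | set(after)
    return max(abs(after.get(k, 0.0) - before.get(k, 0.0)) for k in categories)


def imputation_rate(n_imputed, n_total):
    """Share of imputed records in the total records, as a percentage."""
    return 100.0 * n_imputed / n_total


# Percentage distributions of a variable before and after imputation
# (made-up numbers for illustration).
before = {"single": 40.0, "married": 50.0, "widowed": 10.0}
after = {"single": 41.5, "married": 49.0, "widowed": 9.5}

max_change = max_abs_percent_change(before, after)
rate = imputation_rate(30, 1200)
print(max_change, rate)
```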
Assessing the performance of imputation Maximum absolute percent change between the observed and final (imputed) distributions across all categories within each of the questions Source: England and Wales, Office for National Statistics, 2011 Census:Item Edit and Imputation: Evaluation Report, June 2012
Index of dissimilarity • To assess the degree of change induced by imputation on the initial distribution of variables: $ID = \frac{1}{2} \sum_{k} \left| f_k - f^*_k \right|$ • Where: k : categories of the variable; f_k : percentage distribution of the variable before imputation; f^*_k : percentage distribution of the variable after imputation
Index of dissimilarity 0 ≤ ID ≤ 100 • It takes the value 0 when the two distributions before and after imputation are equal • It is greater than 0 when they are different and reaches its maximum value of 100 when there is maximum dissimilarity between the two distributions • This happens when each distribution is concentrated in a single category and the two categories differ
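The bounds can be checked with a short calculation. The sketch assumes the usual form of the index, half the sum of absolute differences between the two percentage distributions; the function name and the sample distributions are illustrative.

```python
def dissimilarity_index(before, after):
    """Half the sum of absolute differences between two percentage distributions."""
    categories = set(before) | set(after)
    return 0.5 * sum(abs(before.get(k, 0.0) - after.get(k, 0.0)) for k in categories)


# Made-up percentage distributions before and after imputation.
before = {"single": 40.0, "married": 50.0, "widowed": 10.0}
after = {"single": 41.5, "married": 49.0, "widowed": 9.5}

print(dissimilarity_index(before, after))    # small change
print(dissimilarity_index(before, before))   # identical distributions
print(dissimilarity_index({"a": 100.0, "b": 0.0}, {"a": 0.0, "b": 100.0}))  # extreme case
```

The third call illustrates the maximum: all cases sit in category "a" before and in category "b" after, so the absolute differences sum to 200 and the index is 100.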
Index of dissimilarity ID 1.9 Source: England and Wales, Office for National Statistics, 2011 Census:Item Edit and Imputation: Evaluation Report, June 2012
Assessing the performance of imputation Source: Albania, Quality Dimensions of 2011 Population and Housing Census, May 2014
Data validation-3 Checking internal consistency • Objectives: • Ensuring all records meet the editing rules • Ensuring there are no unusual/unexpected values
How to validate • Prepare tables for preliminary analysis of census results • The list of tables should be prepared based on the editing rules and the relations between variables • To verify the results, tables should present all possible conditions in the data without eliminating any category, for example: • Marital status by all age groups • Completed level of education by all age groups • Tables should present missing data
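A validation table of this kind can be sketched as a cross-tabulation that keeps every category and counts missing values in their own cell. The record layout, the field names and the `crosstab` helper are assumptions made for illustration.

```python
from collections import Counter


def crosstab(records, row_key, col_key, missing_label="Missing"):
    """Count records by (row, column), keeping missing values as a category."""
    table = Counter()
    for rec in records:
        row = rec.get(row_key)
        col = rec.get(col_key)
        row = missing_label if row is None else row
        col = missing_label if col is None else col
        table[(row, col)] += 1
    return table


# Illustrative person records; None marks item non-response.
persons = [
    {"age_group": "0-4", "marital": "never married"},
    {"age_group": "0-4", "marital": "married"},       # unexpected combination
    {"age_group": "25-29", "marital": None},          # missing marital status
    {"age_group": "25-29", "marital": "married"},
]
table = crosstab(persons, "age_group", "marital")
for (age_group, marital), n in sorted(table.items()):
    print(age_group, marital, n)
```

Because no age group is filtered out, the married child aged 0-4 stays visible in the table rather than being hidden by a restriction to persons above a minimum age.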
Some examples of tables • Tables for analyzing age difference between members of households • Age interval between father/mother and children • At least 12-14 years and at most 65 for males, 50 for females • Age interval between grandparents and grandchildren • At least 30 years
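The parent and child age interval can be turned into a simple check. This sketch uses the limits quoted above (a minimum gap of 12 years as the lower end of the 12-14 range, a maximum of 65 for fathers and 50 for mothers); the function name and the sex codes "M" and "F" are hypothetical.

```python
def parent_child_gap_ok(parent_age, child_age, parent_sex, min_gap=12):
    """True if the parent-child age difference lies within the plausible range."""
    max_gap = 65 if parent_sex == "M" else 50
    gap = parent_age - child_age
    return min_gap <= gap <= max_gap


print(parent_child_gap_ok(30, 5, "F"))    # gap of 25 years
print(parent_child_gap_ok(20, 12, "F"))   # gap of 8 years, too small
print(parent_child_gap_ok(70, 10, "F"))   # gap of 60 years, too large for a mother
print(parent_child_gap_ok(70, 10, "M"))   # gap of 60 years, allowed for a father
```

Records that fail the check are candidates for review, not automatic errors, since the relationship code or either age may be the wrong item.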
Some examples of tables • Distribution of household size • Accuracy of household size considering the number of persons enumerated on one page, such as 5, 10, … • There might be errors in combining the census forms belonging to the same household
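One possible way to look for such errors is to compare the number of households at the page capacity with the numbers at the neighbouring sizes. The ratio below is an ad hoc indicator chosen for this sketch, not a standard measure, and the household sizes are made up.

```python
from collections import Counter


def heaping_ratio(household_sizes, size):
    """Count at `size` divided by the mean count at size-1 and size+1.

    Returns None when both neighbouring sizes are empty.
    """
    counts = Counter(household_sizes)
    neighbours = (counts[size - 1] + counts[size + 1]) / 2
    if neighbours == 0:
        return None
    return counts[size] / neighbours


# Illustrative data: forms hold 5 persons per page, and size 5 looks inflated.
sizes = [3] * 20 + [4] * 18 + [5] * 30 + [6] * 12 + [7] * 8
print(heaping_ratio(sizes, 5))
print(heaping_ratio(sizes, 4))
```

A ratio well above 1 at the page capacity may indicate that continuation forms were not linked to their household, which would split large households into several smaller ones.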
Some examples of tables • CEB, CS and CD • Relation between number of children ever-born, number of living children and number of dead children – CEB=CS+CD • Relation between age and number of children ever born
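The relation CEB=CS+CD translates directly into an edit check. A minimal sketch, assuming each woman's record carries the three counts under the hypothetical field names `ceb`, `cs` and `cd`, with None for item non-response.

```python
def ceb_inconsistent(records):
    """Return the ids of records where CEB differs from CS + CD."""
    flagged = []
    for rec in records:
        ceb, cs, cd = rec["ceb"], rec["cs"], rec["cd"]
        if ceb is None or cs is None or cd is None:
            continue  # missing items are a separate problem from inconsistency
        if ceb != cs + cd:
            flagged.append(rec["id"])
    return flagged


# Illustrative records only.
women = [
    {"id": 1, "ceb": 4, "cs": 3, "cd": 1},
    {"id": 2, "ceb": 5, "cs": 3, "cd": 1},      # 5 != 3 + 1
    {"id": 3, "ceb": 2, "cs": None, "cd": 0},   # non-response on CS
    {"id": 4, "ceb": 0, "cs": 0, "cd": 0},
]
flagged = ceb_inconsistent(women)
print(flagged)
```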
Fertility CEB – quality assessment Parities wrong ?
CEB – quality assessment Mongolia, 1989 Census (Source: IPUMS) Parities wrong ?
Quality assessment Age at death of children (in month) declared by the mother, Nepal 1975
Some examples of tables • Education • Educational attainment- highest level completed • Consistency with school attendance • Relation with age –minimum age for completing school • Usually it is calculated by taking minimum age for entering school plus number of years required for completing a school. Example: • Minimum age for primary education is age 6 • If primary education requires 8 years, minimum age for completing primary school would be age 13
School attendance – quality assessment Expected pattern ? Expected pattern ?