
DATA VALIDATION and ANALYSIS OF PERFORMANCE OF IMPUTATION United Nations Statistics Division

This article discusses the steps of data validation and the analysis of the performance of imputation in the context of census data processing. It covers topics such as data quality checking, comparing statistics with previous censuses, checking population distribution, and dealing with missing data. It also discusses how the performance of imputation methods can be assessed.


Presentation Transcript


  1. DATA VALIDATION and ANALYSIS OF PERFORMANCE OF IMPUTATION United Nations Statistics Division

  2. Generic Statistical Business Process – Censuses • Pre-enumeration operations • Preliminary evaluation of data quality

  3. Data validation during data processing • The steps of data processing depend on the technology used; in general, the process covers the following steps: • Review and validate data against predefined rules • Identify potential problems such as missing data, inconsistency and inappropriate editing/imputation BEFORE PRODUCING CENSUS OUTPUTS
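A minimal sketch of what such rule-based validation might look like in practice; the field names, the accepted age range and the child-marriage check are illustrative assumptions, not rules taken from any specific census system.

```python
# Illustrative rule-based validation of one person record.
# Field names and thresholds are assumptions for the sketch only.

def validate_record(record: dict) -> list:
    """Return a list of problems found in one person record."""
    problems = []

    # Missing data (item non-response)
    for field in ("age", "sex", "marital_status"):
        if record.get(field) is None:
            problems.append(f"missing: {field}")

    # Invalid value: age outside the accepted range (range is illustrative)
    age = record.get("age")
    if age is not None and not (0 <= age <= 98):
        problems.append("invalid: age out of range 0-98")

    # Inconsistency: marital status reported for a very young child (threshold is illustrative)
    if age is not None and age < 12 and record.get("marital_status") not in (None, "never married"):
        problems.append("inconsistent: marital status reported for a young child")

    return problems

# Example usage
print(validate_record({"age": 99, "sex": "F", "marital_status": "married"}))
# -> ['invalid: age out of range 0-98']
```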

  4. Data analysis • Steps for Data Analysis • Checking data quality with appropriate methods, • Comparing the statistics with previous censuses and other relevant data sources (both internal and external) • Investigating inconsistencies in the statistics

  5. Data validation • Checking population distribution by geographic areas • Checking the quality of editing/imputation • Checking internal consistency and missing data

  6. Data validation - 1 • Checking population distribution by geographic areas • Enumerated persons/households may not be fully captured (undercoverage) or may be captured twice (overcoverage) • Controlling captured records (people/housing units) against census documents such as: • Control forms – prepared by enumerators/supervisors • Reports – prepared by Local/Regional Census Committees • Number of questionnaires received from the field – prepared by the headquarters • Number of scanned questionnaires – if applicable • Ensuring the enumerated population is fully processed
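A minimal sketch of reconciling processed record counts against control totals, assuming counts per enumeration area are available from the control forms; the area codes and figures are made-up examples.

```python
# Illustrative reconciliation of processed counts against control documents.

control_totals = {"EA-001": 412, "EA-002": 388}      # from control forms / field reports
processed_counts = {"EA-001": 412, "EA-002": 380}    # records actually loaded into the database

for ea, expected in control_totals.items():
    got = processed_counts.get(ea, 0)
    if got < expected:
        print(f"{ea}: possible undercoverage, {expected - got} records missing")
    elif got > expected:
        print(f"{ea}: possible overcoverage, {got - expected} extra records")
```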

  7. Data validation - 2 • Checking the quality of editing/imputation • Editing rules may be insufficient to identify all types of errors • Imputation may introduce new errors in data because of incorrect application • Some unexpected patterns may not be identified with editing/consistency rules

  8. Basic definitions • Editing: a list of rules to determine invalid and inconsistent data • Imputation: the process of resolving problems concerning invalid or inconsistent data – and missing values – identified during editing • All records must respect a set of editing rules formulated to correct errors and finally disseminate reliable data

  9. Some examples of invalid data • Age • Equal to 99, while the instruction is: if age is greater than or equal to 98, write 98 • Age written in one digit (e.g. '1' or '5') – how to correct?

  10. Some examples of inconsistent data • Children ever born alive, living children and dead children • The number of children ever born is not equal to the sum of the number of living children and the number of dead children • Last live birth and household deaths • An infant birth is reported as no longer alive, but no infant death is registered among the household deaths • Age of father/mother and children • The age of the father/mother is lower than, or only a few years higher than, the age of a child – what will be the decision?
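The two consistency rules above could be coded along these lines; the variable names and the 12-year minimum parent-child age gap are illustrative assumptions.

```python
# Sketch of two consistency rules; the minimum parent-child gap is an assumption.

MIN_PARENT_CHILD_GAP = 12  # years; actual editing rules may differ

def check_children_totals(ceb, living, dead):
    """Children ever born should equal living plus dead children."""
    return ceb == living + dead

def check_parent_child_age(parent_age, child_age):
    """A parent should be at least MIN_PARENT_CHILD_GAP years older than the child."""
    return parent_age - child_age >= MIN_PARENT_CHILD_GAP

print(check_children_totals(ceb=5, living=3, dead=1))       # False -> inconsistent record
print(check_parent_child_age(parent_age=20, child_age=15))  # False -> inconsistent record
```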

  11. Dealing with missing data • Decisions to take for dealing with missing data: • Will missing data (item non-response) be imputed? • Which variables will be imputed for missing data? • Which methods will be used for imputation?

  12. Assessing the performance of imputation • Objectives • Comparing the distribution of the observed values with the distribution of the imputed values • Comparing the distribution of the observed values with the complete distribution including the imputed values • Analyzing the effect of imputation on the original data set • Ensuring the distribution of imputed values is reasonable or meets the expected pattern

  13. Assessing the performance of imputation • Method for assessing the performance: after implementation of editing/imputation, data should be classified as follows: • Observed (consistent) data: values which meet all editing rules • Non-response or unknown: no value • Inconsistent data: values which fail at least one editing rule • Imputed data: values imputed for inconsistency and non-response • For this analysis, all procedures performed on the database should be identifiable
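One simple way to keep those procedures identifiable is to store a status flag next to each value; a minimal sketch, with illustrative flag codes.

```python
# Keeping imputation traceable: each value carries a status flag so observed,
# non-response, inconsistent and imputed values can be separated later.
# The flag codes are illustrative, not a prescribed standard.

OBSERVED, NONRESPONSE, INCONSISTENT, IMPUTED = "O", "N", "I", "M"

record = {"age": None, "sex": "F"}
flags = {"age": NONRESPONSE, "sex": OBSERVED}

# After imputation the value is filled in and the flag records that it was imputed
record["age"] = 27
flags["age"] = IMPUTED

# Later analysis can split the file into observed vs imputed values
imputed_fields = [f for f, s in flags.items() if s == IMPUTED]
print(imputed_fields)  # ['age']
```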

  14. Assessing the performance of imputation • Compare the distribution of the observed values with the distribution of the imputed values • If non-response and inconsistent data are distributed randomly, no difference is expected between the distribution of the observed and the imputed values • If there are differences between the people who responded and those who did not respond or did not give accurate data, the imputed data should not follow the same distribution as the observed data

  15. Assessing the performance of the imputation • Compare the distribution of the observed values with the distribution of all values including the imputed values • In general, imputed values should have a minimal effect on the distribution of the complete data, unless the non-response rate is particularly high or there is bias for certain characteristics

  16. Understanding data editing and potential errors • Data on deaths in the household – cases where the age of the deceased was hot-decked show a different age pattern of mortality than cases that were not subject to imputation. Source: Estimation of mortality using the 2001 South Africa census data, Rob Dorrington, Tom Moultrie and Ian Timaeus, Centre for Actuarial Research, University of Cape Town

  17. Understanding data editing and potential errors (chart annotations: boundary of school age, boundary of working age). Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  18. Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  19. Max Change. Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  20. Assessing the performance of imputation Maximum change

  21. Assessing the performance of imputation. Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  22. Assessing the performance of imputation. Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  23. Assessing the performance of imputation • Summary indexes at the variable level • Maximum absolute percent change: the maximum absolute percent change across all categories of each variable • Dissimilarity index: the degree of change between the two distributions (observed, and total including imputed values) at the variable level • Imputation rate: the share of imputed records in the total number of records
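A minimal sketch of how the three indexes could be computed from percentage distributions; the categories and figures are made-up, and "percent change" is treated here as the difference in percentage points between the observed and final distributions, which is an assumption about the exact definition.

```python
# Illustrative computation of the three summary indexes from percentage distributions.

observed = {"single": 40.0, "married": 50.0, "widowed": 10.0}  # % distribution, observed values only
final    = {"single": 41.5, "married": 49.0, "widowed": 9.5}   # % distribution, including imputed values

# Maximum absolute change (in percentage points) across all categories of the variable
max_abs_change = max(abs(final[k] - observed[k]) for k in observed)

# Index of dissimilarity: half the sum of absolute differences (0 = identical, 100 = maximally different)
dissimilarity = 0.5 * sum(abs(final[k] - observed[k]) for k in observed)

# Imputation rate: share of imputed records in all records (counts are made-up)
records_total, records_imputed = 10_000, 350
imputation_rate = 100 * records_imputed / records_total

print(max_abs_change, dissimilarity, imputation_rate)  # 1.5 1.5 3.5
```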

  24. Assessing the performance of imputation • Maximum absolute percent change between the observed and final (imputed) distributions across all categories within each of the questions. Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  25. Assessing the performance of imputation • Maximum absolute percent change between the observed and final (imputed) distributions across all categories within each of the questions. Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  26. Index of dissimilarity • To assess the degree of change induced by imputation on the initial distribution of variables • ID = (1/2) Σ_k | f_k − f*_k |, where k : categories of the variable, f : percentage distribution of the variable before imputation, f* : percentage distribution of the variable after imputation

  27. Index of dissimilarity • 0 ≤ ID ≤ 100 • It takes the value 0 when the two distributions before and after imputation are equal • It is greater than 0 when they differ, and reaches its maximum value of 100 when there is maximum dissimilarity between the two distributions, i.e. when each distribution is concentrated in a single category and those categories differ

  28. Index of dissimilarity (example chart: ID = 1.9). Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012

  29. Assessing the performance of imputation Source: Albania, Quality Dimensions of 2011 Population and Housing Census, May 2014

  30. Assessing the performance of imputation Source: Albania, Quality Dimensions of 2011 Population and Housing Census, May 2014

  31. Data validation - 3 • Checking internal consistency • Objectives: • Ensuring all records meet the editing rules • Ensuring there are no unusual/unexpected values

  32. How to validate • Prepare tables for preliminary analysis of census results • The list of tables should be prepared based on the editing rules and the relations between variables • Tables should present all possible conditions in the data, without eliminating any category, so the results can be verified; for example: • Marital status by all age groups • Completed level of education by all age groups • Tables should also present missing data (see the sketch below)
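A minimal sketch of building such a table with an explicit "missing" category, using pandas; the column names and codes are illustrative.

```python
# Illustrative validation table: marital status by age group, keeping missing values visible.
import pandas as pd

df = pd.DataFrame({
    "age_group":      ["0-14", "15-29", "15-29", "30-49", "30-49", "50+"],
    "marital_status": ["never married", "married", None, "married", "widowed", None],
})

# Recode missing values to an explicit "missing" category first, so the table
# still shows item non-response instead of silently dropping it.
table = pd.crosstab(df["age_group"], df["marital_status"].fillna("missing"))
print(table)
```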

  33. Some examples of tables • Tables for analyzing age differences between members of households • Age interval between father/mother and children: at least 12-14 years, and at most 65 years for males, 50 years for females • Age interval between grandparents and grandchildren: at least 30 years

  34. Some examples of tables • Distribution of household size • Accuracy of household size, considering the number of persons enumerated on one page – such as 5, 10, … • There might be errors in combining census forms belonging to the same household

  35. Some examples of tables • CEB, CS and CD • Relation between the number of children ever born, the number of living children and the number of dead children – CEB = CS + CD • Relation between age and number of children ever born

  36. Fertility: CEB – quality assessment (chart annotation: Parities wrong?)

  37. CEB – quality assessment, Mongolia, 1989 Census (Source: IPUMS) (chart annotation: Parities wrong?)

  38. Quality assessment • Age at death of children (in months) declared by the mother, Nepal 1975

  39. Some examples of tables • Education • Educational attainment – highest level completed • Consistency with school attendance • Relation with age – minimum age for completing a school level • Usually it is calculated by taking the minimum age for entering school plus the number of years required for completing that level. Example: • Minimum age for primary education is age 6 • If primary education requires 8 years, the minimum age for completing primary school would be age 13
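A minimal sketch of the minimum-completion-age check; the threshold of 13 follows the slide's example and would in practice be set from the country's school entrance age and education system.

```python
# Illustrative consistency check between age and completed primary education.

def primary_completion_plausible(age: int, completed_primary: bool,
                                 min_completion_age: int = 13) -> bool:
    """Flag as implausible if primary education is reported as completed below the minimum age."""
    if not completed_primary:
        return True
    return age >= min_completion_age

print(primary_completion_plausible(age=10, completed_primary=True))  # False -> candidate for editing
print(primary_completion_plausible(age=16, completed_primary=True))  # True
```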

  40. School attendance – quality assessment (chart annotation: Expected pattern?)
