Census Data Editing: Structure and Within Record Editing UNSD-UNESCAP Regional Workshop on Census Data Processing: Contemporary technologies for data capture, methodology and practice of data editing, documentation and archiving Bangkok, Thailand, 15-19 September 2008
Part I: Structure Editing
Summary
Part I: Structure Edits
- What are structure edits?
- Geography edits
- Hierarchy of records
- Correspondence between housing and population records
- Editing relationships in a household
- Family nuclei
What are structure edits?
Structure edits check coverage and relationships between different units: persons, households, housing units, enumeration areas, etc. Specifically, they check that:
- all household and collective quarters records within an enumeration area are present and in the proper order;
- all occupied housing units have person records, and vacant units have none;
- households have neither duplicate nor missing person records;
- enumeration areas have neither duplicate nor missing housing records.
Geography edits
- Each EA must have the right geographic codes (city, province, region...)
- Every housing unit in an EA should be entered, and every record must have a valid EA code
- The capture process must check this before editing of the data commences
- If errors remain, it is best to find the right code, for example by returning to the enumeration documents and correcting manually
Hierarchy of records
1_EA
  2_Housing unit
    4_Individual
    4_Individual
  2_Housing unit
  3_Collective living quarter
    4_Individual
    4_Individual
1_EA
Hierarchy of records
- Type 1 (EA) followed by a new Type 1 (if the original EA is empty), or Type 2 (Housing Unit), or Type 3 (Collective Living Quarter)
  - Particular case of homeless people: create a dummy housing record to make structural checking easier
- Type 2 (Housing Unit) followed by Type 1, 2 or 3 (if the original dwelling is vacant), or Type 4 (if the original dwelling is occupied)
- Type 3 (Collective Living Quarter) followed by Type 4 (Individual)
  - If not occupied, is an empty CLQ allowed?
- Type 4 (Individual) followed by Type 4 (another individual in the same dwelling or collective living quarter), or Type 2 or 3 (another dwelling or CLQ), or Type 1 (new EA)
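The succession rules above can be written as a small lookup table. The following is a minimal sketch, not a production edit program: the function name and the tuple layout of the error list are illustrative, and two points are assumptions rather than statements from the slides (the file starts with an EA record, and empty CLQs are rejected, a question the slide leaves open).

```python
# Allowed successors for each record type, following the rules listed above.
# Type codes: 1 = EA, 2 = housing unit, 3 = collective living quarter, 4 = individual.
ALLOWED_NEXT = {
    1: {1, 2, 3},
    2: {1, 2, 3, 4},
    3: {4},            # ASSUMPTION: an empty CLQ is not allowed
    4: {1, 2, 3, 4},
}


def check_hierarchy(record_types):
    """Return a list of (position, previous_type, current_type) violations."""
    errors = []
    # ASSUMPTION: every file begins with an EA record.
    if record_types and record_types[0] != 1:
        errors.append((0, None, record_types[0]))
    for i in range(1, len(record_types)):
        prev, cur = record_types[i - 1], record_types[i]
        if cur not in ALLOWED_NEXT.get(prev, set()):
            errors.append((i, prev, cur))
    return errors
```

For example, check_hierarchy([1, 2, 4, 4, 2, 3, 4, 4, 1]) reports no violations, while an individual record placed directly after an EA record is flagged.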
Correspondence between housing and population records
- An occupied unit should have at least one person and a vacant unit should have no people:
  - if Type 2 (Housing Unit) & category (vacant) is followed by Type 4 (individual), then change the category to occupied
- The number of occupants recorded on the Housing Unit form should be exactly the same as the number of individual records in the household. If not, change the number on the Housing Unit form
- Population records should be sequenced (numbered)
- If Type 3 (CLQ) & category (Hospital) is followed by multiple Type 4 (individual) records of category “Retirement home”, then change the category of the CLQ to “Retirement home”
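The first two corrections can be sketched as follows. The record layout (a dict with "category" and "n_occupants" keys) and the function name are hypothetical, chosen only for illustration. The sketch does not handle an occupied unit with no person records, because the slide does not prescribe a correction for that case.

```python
def reconcile_housing_unit(unit, persons):
    """Apply the vacant/occupied and occupant-count corrections to one unit.

    unit: dict with hypothetical keys "category" ("occupied" or "vacant")
          and "n_occupants".
    persons: list of the individual records that follow the unit in the file.
    Returns the names of the fields that were changed.
    """
    changed = []
    # A "vacant" unit followed by person records is recoded as occupied.
    if unit["category"] == "vacant" and persons:
        unit["category"] = "occupied"
        changed.append("category")
    # The count on the housing form is aligned with the person records found.
    if unit["n_occupants"] != len(persons):
        unit["n_occupants"] = len(persons)
        changed.append("n_occupants")
    return changed
```

Returning the list of changed fields makes it easy to feed the corrections into an edit trail.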
Editing relationships in a household
Each individual has a relation to the first person:
- 1st person (or Head, or reference person)
- Spouse
- Child of the 1st person or of his/her spouse
- Parent
- Other relative
- Friend
- Lodger
- ...
Editing relationships in a household
[Table: Household with potential inconsistencies in age reporting]
Family nuclei
- Father: Sex should be male and Age should be > minimum age
- Mother: Sex should be female and Age should be > minimum age
- Child: Age under a maximum limit?
Part II: Within Record Editing
Summary
Part II: Within Record Edits
- Validity and Consistency Checks
- Top-down Editing versus Multiple-variable Editing
- Example of Multiple-Variable Editing
- Methods of Correcting and Imputing Data
- Example of Hot Deck for Sample Household (Sex Only)
- Example of Hot Deck for Sample Household (Sex and Age)
- Issues Related to Hot Deck
- Methods of Correcting and Imputing Data: General Principles
- Edit Trails and the Use of Imputation Flags
Validity and Consistency Checks
- Validity checks are performed to see if the values of individual variables are plausible or lie within a reasonable range
  - Examples:
    - 0<=AGE<=110
    - SEX=Female or SEX=Male
- Consistency checks are performed to ensure that there is coherence between two or more variables
  - Examples:
    - Head of Household should have AGE>=15
    - A child should be younger than the head of household
    - A person with AGE<15 should never be married
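The example checks above translate directly into predicates on a person record. This is a minimal sketch under assumed conventions: the field names, the category labels ("Male", "Female", "head", "child", "married") and the error messages are all hypothetical, and ages are assumed to be integers.

```python
def validity_errors(person):
    """Single-variable checks: each value must lie in its permitted range."""
    errors = []
    if not (0 <= person["age"] <= 110):
        errors.append("age out of range")
    if person["sex"] not in ("Male", "Female"):
        errors.append("invalid sex")
    return errors


def consistency_errors(person, head):
    """Multi-variable checks: values must be coherent with each other."""
    errors = []
    if person["relationship"] == "head" and person["age"] < 15:
        errors.append("head younger than 15")
    if person["relationship"] == "child" and person["age"] >= head["age"]:
        errors.append("child not younger than head")
    # ASSUMPTION: the rule is read as "status must not be 'married'".
    if person["age"] < 15 and person["marital_status"] == "married":
        errors.append("married under 15")
    return errors
```

Keeping the two kinds of check in separate functions mirrors the distinction on the slide: validity looks at one variable at a time, consistency at combinations.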
Top-down Editing versus Multiple-Variable Editing
- The top-down editing approach starts by editing the top-priority variable (not necessarily the first variable on the questionnaire) and moves sequentially through all items in decreasing priority
- During the editing process, some edits change the value of an item more than once; this can introduce one or more errors into the dataset
- Example: a child’s age is first imputed on the basis of the mother’s age. Later the child’s age is re-imputed on the basis of reported years of schooling, which might be inconsistent with the mother’s age
- In this case, the child’s age should keep being re-imputed until it is consistent
- Important to avoid circular editing!
Top-down Editing versus Multiple-Variable Editing
- The multiple-variable editing approach uses a set of rules that state the relationships between variables
- Each statement is tested against the data to see if it is true
- The edit system keeps track of all false statements relating to invalid entries or inconsistencies
- An assessment is then made of how to change the record so that it will pass all edits, and a decision is made
- The Fellegi-Holt principle of “minimum change” should be used
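The bookkeeping described above can be sketched as follows: every rule is tested, all failures are recorded, and the variables that appear in the most failed rules are surfaced as candidates for change. The rule set and names are illustrative assumptions. Counting how often a variable is involved is only a crude pointer toward a small change; it is not the Fellegi-Holt error-localization procedure itself.

```python
from collections import Counter

# Each edit: (name, variables involved, predicate that is True when the record passes).
# The rules are hypothetical examples, not an official edit specification.
EDITS = [
    ("age_range", ("age",), lambda r: 0 <= r["age"] <= 110),
    ("head_min_age", ("relationship", "age"),
     lambda r: r["relationship"] != "head" or r["age"] >= 15),
    ("married_min_age", ("marital_status", "age"),
     lambda r: r["marital_status"] != "married" or r["age"] >= 15),
]


def failed_edits(record):
    """Names of all edits the record fails; no value is changed at this stage."""
    return [name for name, _, passes in EDITS if not passes(record)]


def variable_involvement(record):
    """Count, per variable, how many failed edits it takes part in."""
    counts = Counter()
    for _, variables, passes in EDITS:
        if not passes(record):
            counts.update(variables)
    return counts
```

For a 9-year-old recorded as a married head of household, two edits fail and both involve age, so changing age alone would satisfy every rule in this small set.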
Example of Multiple-Variable Editing
Head of household and spouse have same sex
Methods of Correcting and Imputing Data
- The process of imputation changes one or more responses or missing values in a record, or in several records, to ensure that internally coherent records result
- Before using any imputation method, the best strategy is to start with a manual study of responses; imputation can then handle the remaining unresolved edit failures
- Two methods of imputation: Cold Deck and Hot Deck
- Cold Deck Imputation:
  - Used mainly for missing or unknown values (not for inconsistent/invalid values)
  - Values are imputed on a proportional basis from a distribution of valid responses (e.g., from the previous census)
  - In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values
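Cold deck imputation as described above amounts to a weighted draw from a fixed table. The sketch below assumes records are dicts in which a missing item is None or absent; the function name and the idea of passing the distribution as a dict of proportions are illustrative choices.

```python
import random


def cold_deck_impute(records, field, distribution, rng):
    """Fill missing `field` values by drawing from a fixed distribution.

    distribution: dict mapping each valid value to its proportion, e.g. taken
                  from a previous census (any figures passed in are the caller's).
    rng: a random.Random instance, so that runs can be reproduced.
    Returns the number of values imputed.
    """
    values = list(distribution)
    weights = [distribution[v] for v in values]
    n_imputed = 0
    for rec in records:
        if rec.get(field) is None:
            # The distribution never changes while the file is processed.
            rec[field] = rng.choices(values, weights=weights, k=1)[0]
            n_imputed += 1
    return n_imputed
```

Valid values are left untouched, which matches the slide's point that cold deck is mainly for missing or unknown values.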
Methods of Correcting and Imputing Data
- Hot Deck or Dynamic Imputation:
  - Used for both missing data and inconsistent/invalid items
  - Uses one or more variables to estimate the likely response based on data about individuals with similar characteristics
  - The “donor set” (or imputation matrix) constantly changes through updating; therefore, imputations dynamically change during the process of editing all the records
  - Thus, hot deck draws from a distribution that dynamically changes with each imputation and eventually (through modifications) “approaches” the distribution of the current data set
  - Caution: if two different items for a particular record have unknown values, hot deck may not use the same “donor” to impute both missing values; in this case, it is preferable to use the same donor for both items
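The dynamic behaviour described above can be sketched for a single item, age, with an imputation matrix keyed by sex and relationship. The field names, the key layout and the starting values used in any call are hypothetical, and the sketch assumes that the matrix already holds an initial value for every cell that can occur.

```python
def hot_deck_age(records, matrix):
    """Impute missing ages in file order from a dynamic imputation matrix.

    matrix: dict keyed by (sex, relationship) holding the age of the most
            recent valid donor for that cell. It starts from initial values
            and is overwritten as valid records are processed.
    Returns the indices of the records whose age was imputed.
    """
    imputed = []
    for i, rec in enumerate(records):
        key = (rec["sex"], rec["relationship"])
        if rec["age"] is None:
            rec["age"] = matrix[key]   # take the value left by the latest donor
            imputed.append(i)
        else:
            matrix[key] = rec["age"]   # this valid record becomes the new donor
    return imputed
```

Because the matrix is updated by every valid record, a missing age is filled from the nearest preceding person with the same sex and relationship, and the initial value is used only when no such donor has been seen yet.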
Example of Hot Deck for Sample Household (Sex Only)
Example of Hot Deck for Age (Sex and Relationship)
[Table: Initial Imputation Matrix for Age Based on Sex and Relationship]
Example of Hot Deck for Age (Sex and Relationship)
Example of Hot Deck for Age (Sex and Relationship)
[Table: Initial Imputation Matrix for Age Based on Sex and Relationship]
[Table: Dynamic Imputation Matrix After Multiple Changes]
Issues Related to Hot Deck
- Devise dynamic imputation matrices based on people living in the same small geographic area, since they tend to be homogeneous with respect to many characteristics; i.e., different imputation matrices should be created for different geographic areas
- Sometimes the simplest approaches are best: for example, for a missing housing attribute, it may be preferable to use the value of a neighboring household rather than a complex imputation matrix that may result in the assignment of a value from outside the neighborhood
- Before using dynamic imputation, an effort should be made to use related items instead. For example, if marital status is missing for an individual and there exists a spouse for that individual, then the value “married” should be assigned
- One should edit key items such as age and sex first so that these can be used in other imputation matrices for lower priority items
Issues Related to Hot Deck
- Construct imputation matrices based on research from administrative sources or previous censuses and surveys
- Standardized imputation matrices, i.e., matrices with standard dimensions such as age and sex (e.g., for language), can streamline the process since they can be tested and applied quickly
- BUT if language is missing, first look to the language of others in the same household, or to race, ethnicity or birthplace, before using dynamic imputation; i.e., an attempt should be made to use related information to assign values before resorting to imputation
- Some editing teams keep more than one value per cell in imputation matrices to protect against the same value being imputed multiple times; e.g., in the case of 4 male children in a household all with ages unknown, different values will be assigned
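One way to keep several values per cell is a small fixed-length queue that is read in rotation. This is a sketch of the idea only: the class name, the cell size of 4 and the rotation policy are assumptions, not a description of any particular editing package.

```python
from collections import deque


class DonorCell:
    """One imputation-matrix cell that keeps the last `size` donor values."""

    def __init__(self, initial, size=4):
        # With maxlen set, appending to a full deque drops the oldest value.
        self.values = deque(initial, maxlen=size)

    def add(self, value):
        """Record a new valid donor value."""
        self.values.append(value)

    def draw(self):
        """Return the next stored value, rotating so repeats are spread out."""
        value = self.values[0]
        self.values.rotate(-1)
        return value
```

With four stored ages, four consecutive children with unknown age each receive a different value instead of the same one four times.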
Issues Related to Hot Deck
- Imputation matrices that are too big (with too many dimensions) cannot be updated thoroughly, leading to inefficiencies and inaccuracies
- Imputation matrices that are too small (with too few dimensions or too few groupings within dimensions) may lead to the same donor value being used repeatedly in imputation before the matrix is updated
- Some items such as occupation and industry are notoriously difficult to edit, since the large number of categories can make dynamic imputation very cumbersome; in such cases, it may be counter-productive to impute and preferable to use “not stated”
Methods of Correcting and Imputing Data: General Principles
- The imputed record should closely resemble the failed edit record; impute for a minimum number of variables
- The imputed record should satisfy all edits
- All imputed values should be flagged, and the methods and sources of imputation should be clearly specified
- Both un-imputed and imputed values should be stored to allow for evaluation of the degree and effects of imputation
Edit Trails and the Use of Imputation Flags
- It is important to generate an edit trail showing all data changes and substituted values with their tallies
- Counters of several types are essential to process planning and management: i) number of cases of each type of error; ii) non-response rates for each item; iii) imputation rates for each item, ...
- Imputation flags are binary flags that change from an initial value of 0 to 1 if the original value of the data is changed in any way; a flag should be added to each item that is imputed
- Although a separate file with imputation flags takes up considerable space, this information is critical for the planning of future censuses; e.g., as a means to investigate the age threshold below which a female with “child ever born” triggers a query edit, and to decide whether the threshold should be modified for future rounds
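Flags, stored original values and per-item counters can be kept together in one small helper. The sketch below uses hypothetical "_flags" and "_original" keys inside each record. A real system might keep them in a separate file, as the slide suggests.

```python
from collections import Counter


def set_value(record, field, new_value, counts):
    """Change a field, keeping the original value and a 0/1 imputation flag.

    counts: a Counter of imputed items per field, shared across all records.
    """
    flags = record.setdefault("_flags", {})
    originals = record.setdefault("_original", {})
    if record.get(field) != new_value:
        # Keep only the first (reported) value, even if the item changes twice.
        originals.setdefault(field, record.get(field))
        record[field] = new_value
        if flags.get(field, 0) == 0:
            flags[field] = 1
            counts[field] += 1   # each item is counted once per record
    else:
        flags.setdefault(field, 0)


def imputation_rate(counts, field, n_records):
    """Share of records in which `field` was imputed."""
    return counts[field] / n_records if n_records else 0.0
```

Storing the reported value next to the flag allows the degree and effect of imputation to be evaluated after processing.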
THANK YOU!