Imputation in the 2011 Census NILS Brownbag Talk – 6 May 2014 Richard Elliott
Overview • Background • What is imputation • How did we impute the 2011 Census • Strategy • Process • Implementation • Considerations • Information • Next steps
Background • Legal obligation on the public to complete a Census questionnaire accurately • A minority did not provide complete and accurate information • Item non-response • Leaving questions unanswered • Making mistakes (e.g. neglecting to follow questionnaire instructions) • Providing values that are out of range (e.g. born in 1791) • Item inconsistency • Captured values not consistent with other values on the questionnaire (e.g. a 6-year-old mother) • Non-response • Not filling in the questionnaire at all
Background • It is NISRA’s policy to report estimates for the entire population. Imputation was therefore used to: • Correct for non-response – estimate the missing persons and households • Correct for item non-response – fill the gaps left by unanswered questions • Correct for item inconsistency – ensure that the information provided is consistent • These types of data quality issue apply to any data collection exercise, not just the Census • Census Office recognises that users need to be aware of them
Background • While imputation was used to “fill the gaps”, its strength comes from the information that was recorded • Therefore, it is important to recognise the following: • Responses to the Census represent a self-assessment of a respondent’s circumstances • Proxy responses were given by the main householder • Respondents did not always complete the questionnaire correctly • 85% of questionnaires were completed on paper forms • Handwriting had to be captured using an electronic character recognition system • While service levels were in place for capture, errors were still possible
Two Types of Imputation • Item Edit and Imputation • Correcting a dataset for inconsistencies and item non-response • Making each record “complete and consistent” • Record imputation • The addition of whole records to a dataset • Estimate and adjust for persons missed, duplicated and counted in the wrong place • Increases the accuracy of the overall estimates
Item Edit and Imputation Strategy • Primary Objective: • to produce a complete and consistent database where unobserved distributions were estimated accurately by the imputation process • There were three key principles • All changes made maintain the quality of the data • The number of changes to inconsistent data is minimised • As far as possible, missing data should be imputed for all variables to provide a complete and consistent database
Item Edit and Imputation Strategy • In adhering to these principles, the following key aims were defined • Editing must not introduce bias or distortion in the data • Editing facilitates the production of output data that is fit for purpose • Editing methods help to ensure that pre-determined levels of data quality are met • Highest priority given to variables which define population bases (e.g. Age and Sex) • Editing supports the production of the population estimates by ensuring that the basic population estimates are accurate
Item Edit and Imputation Strategy • Used a similar but enhanced version of the framework adopted in 2001 • One Number Census process • Tried and tested in 2001 • Was undertaken as part of the Downstream Processing (DSP) project at ONS • Included both item and record imputation • Supplemented by detailed QA at every stage by NISRA Census Office • NISRA benefitted from enhancements to the system found through ONS data processing • Ultimately, NISRA was responsible for processing NI data and for any parameter tweaking / re-runs
Imputation Process – 4 Key Stages [process diagram; the stages captured in the text are 1. Cleansing the Data and 2. Item Imputation]
Implementation – Capture and Coding • Capture and coding rules • Turning tick and text responses into data that could be edited and imputed • Complex coding was used to assign numerical values to written text and ticked boxes (e.g. occupation and industry coding) • Invalid responses were flagged for imputation (V, W, Y and Z) • Determinations were made to resolve combinations of tick and text responses • Ticks that could not be determined were set to W (failed multi-tick) • Text that was uncodeable was set to V (uncodeable text response) • Data were subject to checks to ensure each question response was within a predefined range (e.g. no year of birth before 1896 or after 2011) • Invalid responses were set to Z (out of range) • Missing data were flagged as Y (missing – requires imputation)
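To make the V / W / Y / Z conventions concrete, below is a minimal Python sketch of the flagging logic just described. It is illustrative only: the real capture rules were far richer, and the names (flag_response, COUNTRY_CODES) and the toy coding frame are assumptions, not the production system.

```python
# Illustrative flag codes: V = uncodeable text, W = failed multi-tick,
# Y = missing, Z = out of range (conventions as described on the slide).
COUNTRY_CODES = {"FRANCE": "250", "IRELAND": "372"}    # toy coding frame

def flag_response(ticks, text=None, valid_range=None, coding_frame=None):
    """Return a captured value, or a V/W/Y/Z flag for the imputation system."""
    if not ticks and text is None:
        return "Y"                               # missing: requires imputation
    if len(ticks) > 1:
        return "W"                               # failed (irresolvable) multi-tick
    if text is not None:
        value = (coding_frame or {}).get(text.strip().upper())
        if value is None:
            return "V"                           # uncodeable text response
    else:
        value = ticks[0]
    if valid_range is not None and value not in valid_range:
        return "Z"                               # out of range (e.g. born in 1791)
    return value

# Year of birth had to fall between 1896 and 2011:
years = {str(y) for y in range(1896, 2012)}
print(flag_response(["1990"], valid_range=years))                   # '1990'
print(flag_response(["1791"], valid_range=years))                   # 'Z'
print(flag_response([], text="SUGAR", coding_frame=COUNTRY_CODES))  # 'V'
print(flag_response([]))                                            # 'Y'
```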
Implementation – Capture and Coding • Determining combinations of ticks • Single tick – captured directly as the marked response • Resolvable multi-tick – resolved to a single response where the intended answer could be determined • Irresolvable multi-tick – could not be resolved, so assumed missing (W) and imputed
Implementation – Capture and Coding • Missing data – a question left blank is flagged (Y) and will be imputed
Implementation – Capture and Coding • Resolving write-ins • Numbers – a written “1” is captured as “01”; a written “two” is captured as “02” • Range check – a written “199” fails the range check, so it is assumed missing and imputed • Codeable text response – “FRANCE” is coded to 250 • Uncodeable text response – “SUGAR” is clearly not a country, so it is set to VVV; this will be assumed missing and imputed
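A hedged sketch of the write-in resolution illustrated above: numeric write-ins are normalised, word write-ins are mapped through a lookup, and failures fall back to "assumed missing" flags. resolve_number, NUMBER_WORDS and the flag widths are illustrative, not the production rules.

```python
NUMBER_WORDS = {"ONE": 1, "TWO": 2, "THREE": 3}          # toy lookup

def resolve_number(write_in, lo=0, hi=30):
    """Turn a write-in like '1', 'two' or '199' into a captured value."""
    text = write_in.strip().upper()
    if text.isdigit():
        value = int(text)
    elif text in NUMBER_WORDS:
        value = NUMBER_WORDS[text]
    else:
        return "VV"                                       # uncodeable: impute later
    if not lo <= value <= hi:
        return "ZZ"                                       # out of range: impute later
    return f"{value:02d}"                                 # '1' -> '01', 'two' -> '02'

print(resolve_number("1"))      # '01'
print(resolve_number("two"))    # '02'
print(resolve_number("199"))    # 'ZZ' (assumed missing and imputed)
```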
Implementation – RMR • Reconcile Multiple Responses (RMR) • Removal of false persons • Removal of persons generated by capture anomalies • For example: strike-throughs, inadequately completed questionnaires • Removal of duplicates (multiple persons / households) • Individuals who included themselves more than once • Separated parents who included their children at both addresses • Creating households / communals from multiple questionnaires • Consolidating H4 / HC4 / I4 etc. • Validation • Renumbering person records within households / communals
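As a simplified illustration of one RMR-style check, the sketch below groups person records that share a matching key to surface candidate duplicates. The real RMR rules were far richer; the key used here (name, date of birth, sex) and the function name are assumptions.

```python
from collections import defaultdict

def find_duplicate_persons(persons):
    """Group person records sharing a simple matching key (illustrative)."""
    groups = defaultdict(list)
    for person in persons:
        key = (person["name"].upper(), person["dob"], person["sex"])
        groups[key].append(person["record_id"])
    # Any key seen more than once is a candidate duplicate for reconciliation
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

persons = [
    {"record_id": 1, "name": "A Smith", "dob": "2004-03-01", "sex": "F"},
    {"record_id": 2, "name": "a smith", "dob": "2004-03-01", "sex": "F"},
]
print(find_duplicate_persons(persons))   # both records flagged for review
```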
Implementation – FRDVP • Filter Rules and Derived Variables for Processing (FRDVP) • Apply edits to correct for questionnaire routing errors • Apply hard edits to keep individual records consistent • Minimal at this stage (mostly applied within the imputation system) • Information not required was set to X • No imputation was done on any variable set to X • Create high-level variables to be used within the item imputation system • Blocking variables for donor searching • Makes it easier to find donors
Implementation – FRDVP • Surplus information – questionnaire routing • In this scenario the respondent should have skipped question 6 and gone straight to question 7. Therefore, since the respondent should not have answered question 6, it is set to X (not required)
Implementation – FRDVP • Surplus information – questionnaire consistency • In this scenario the respondent’s date of birth (01/01/2011) means they are aged under 1 on Census day, and therefore did not have a usual address one year ago, so the captured address information (“9 Lord Wardens Crescent, BT19 1YJ”) is set to X
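A minimal sketch of routing edits of the kind shown in these two examples: answers that the routing says were not required are set to X, and X values are never imputed. The field names (q5, q6, address_one_year_ago) are hypothetical.

```python
def apply_routing_edit(record):
    """Blank out surplus answers left by questionnaire routing errors."""
    # Example 1: a 'No' at question 5 routes the respondent past question 6,
    # so any captured answer to question 6 is not required.
    if record.get("q5") == "No":
        record["q6"] = "X"                       # not required; never imputed
    # Example 2: a person aged under 1 on Census day cannot have had a usual
    # address one year ago, so that address is not required.
    if record.get("age") is not None and record["age"] < 1:
        record["address_one_year_ago"] = "X"
    return record

print(apply_routing_edit({"q5": "No", "q6": "Some answer", "age": 0,
                          "address_one_year_ago": "9 Lord Wardens Crescent"}))
```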
Implementation – Item Imputation • Achieved using CANCEIS • Canadian Census Edit and Imputation System • Developed specifically for Census-type data • i.e. a mix of categorical and numerical variables • Donor-based edit and imputation system that can simultaneously: • Apply nearest-neighbour donor imputation • Apply deterministic edits and maintain consistency • Evaluated and endorsed as the 2011 Census imputation tool • Faster • Less resource intensive • Allowed for more joint imputation
Implementation – CANCEIS • How did CANCEIS work in practice • The database was divided up into processing units (within three geographic units) for the purposes of resource management and maximising donor pools: • Household questions – individual imputation; donor unit = household • Household persons 1 to 6 – joint household imputation, with between-person edits and relationships; donor unit = household of the same size • Household persons 7 to 30 – individual imputation of person questions, using relationship to Person 1; donor unit = individual person • Communal persons – individual imputation of person questions; donor unit = individual person
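As a rough sketch, routing a return's person records to the right processing unit might look like the following. The unit boundaries (household size 6, communal establishments) come from the slide; the function name, field names and treating household-level questions as a separate unit are assumptions.

```python
def assign_person_unit(household):
    """Route the person records of a return to an imputation unit (sketch)."""
    # Household-level questions form their own unit and are imputed separately.
    if household["communal"]:
        return "communal-persons"        # individual imputation, person donors
    if household["n_persons"] <= 6:
        return "household-persons-1-6"   # joint household imputation
    return "household-persons-7-30"      # individual imputation vs Person 1

print(assign_person_unit({"communal": False, "n_persons": 4}))
# -> 'household-persons-1-6'
```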
Implementation – CANCEIS • How did CANCEIS work in practice • The household questions were imputed within a single module • Person data was divided up into 4 modules • Aim was to group variables that help predict each other • Attempt to maximise the number of donors for a given group • Demographics – e.g. age, sex, marital status, student, activity last week • Culture – e.g. ethnicity, country of birth, language, passports • Health – e.g. general health, disability, long-term condition • Labour Market – e.g. economic activity, hours worked, qualifications
Implementation – CANCEIS • How were the donors selected? • Within each module a number of matching variables were used to select donors • Matching variables were weighted according to how well they predict other values and how highly they should be prioritised when resolving inconsistencies • For example, age is often a good predictor of other demographic variables • Age was given a high weight, so observed ages were prioritised over other values if there was an inconsistency and changes were required • Northings and Eastings were used to control for geographical differences and find donors from similar areas • These were given a small weight compared to demographic characteristics such as age, sex and marital status
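A toy version of weighted nearest-neighbour matching as described above. The weights, variable names and distance function are all illustrative; CANCEIS's actual distance measure is more sophisticated.

```python
WEIGHTS = {"age": 10.0, "sex": 8.0, "marital_status": 6.0,
           "easting": 0.5, "northing": 0.5}   # geography gets a small weight

def distance(recipient, donor):
    """Weighted mismatch score between a recipient and a candidate donor."""
    score = 0.0
    for var, weight in WEIGHTS.items():
        r, d = recipient.get(var), donor.get(var)
        if r is None:
            continue                       # missing values cannot mismatch
        if isinstance(r, (int, float)):
            score += weight * abs(r - d)   # numeric: scaled difference
        elif r != d:
            score += weight                # categorical: flat penalty
    return score

recipient = {"age": 10, "sex": "M", "easting": 330.0, "northing": 370.0}
donors = [{"age": 11, "sex": "M", "easting": 331.0, "northing": 369.0},
          {"age": 40, "sex": "M", "easting": 330.0, "northing": 370.0}]
print(min(donors, key=lambda d: distance(recipient, d)))  # the 11-year-old
```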
Implementation – CANCEIS • Matching variables (example) • Suppose someone omitted to fill in their occupation details • The record would be flagged for imputation under the Labour Market module • Donor pool identified by matching on (for example): • Economic Activity • Industry • Hours worked • Qualifications • These variables deemed to influence Occupation • Occupation information imputed from a donor with similar Labour Market characteristics
Implementation – CANCEIS • Editing and imputing were done simultaneously • Each record was checked for consistency before imputation • Any items that failed the checks were marked for imputation along with the missing items • A single donor was selected to resolve inconsistencies and non-response • Only values which satisfied the edit constraints were imputed into the recipient record • CANCEIS sought to minimise the number of changes required to repair a record when edit constraints were in place • There were 31 edit rules, broadly based on those used in 2001 • e.g. if aged between 5 and 15 then must be in full-time education • Some rules had to be updated to account for changes since 2001 • e.g. removal of the rule that did not allow same-sex couples • Replaced with rules that said married couples had to be opposite-sex and civil partners had to be same-sex
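Edit rules of this kind can be thought of as predicates over a record. The sketch below encodes the two rules used in the worked example on the following slides; the encoding and function names are illustrative, not the CANCEIS rule syntax.

```python
def rule_a(r):   # Rule A: must be aged 16+ to be married
    return r["marital_status"] != "Married" or r["age"] >= 16

def rule_b(r):   # Rule B: aged 5 to 15 must be a student
    return not (5 <= r["age"] <= 15) or r["student"] == "Yes"

def failed_rules(record):
    """Return the names of any edit rules the record violates."""
    rules = {"Rule A": rule_a, "Rule B": rule_b}
    return [name for name, rule in rules.items() if not rule(record)]

# The oversimplified example from the next slide: aged 10 and married.
print(failed_rules({"age": 10, "marital_status": "Married", "student": "Yes"}))
# -> ['Rule A']
```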
Implementation – CANCEIS • Say we have the following (oversimplified) example: a record with Age = 10, Marital Status = Married and Student missing • Student is missing • Requires imputation under the demographic module • This record is subject to two edit constraints • Rule A: must be aged 16+ to be married • Rule B: aged 5 to 15 must be a student • Fails Rule A since aged 10 and married • Therefore, both Age and Marital Status are also flagged for imputation
Implementation – CANCEIS • The system searches for potential donors • Matching on demographic variables • Uses Northings and Eastings to find a donor in the area • Two candidate donors, Donor1 and Donor2, are returned and considered in turn below
Implementation – CANCEIS • Rule A: must be aged 16+ to be married • Rule B: aged 5 to 15 must be a student • Donor1 • Using Donor1 would mean that “Single” is taken as well as “No” • The new record fails Rule B • Therefore Age is taken from the donor as well • Two observed value changes
Implementation – CANCEIS • Rule A: must be aged 16+ to be married • Rule B: aged 5 to 15 must be a student • Donor2 • Using Donor2 would mean that “Single” is taken as well as “Yes” • The new record passes both Rule A and Rule B • Only one observed value change • Donor2 given a higher probability of selection
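The Donor1 / Donor2 comparison above is an instance of the minimum-change principle. The sketch below counts, for each candidate donor, how many observed (non-missing) values must change before the repaired record passes every edit rule; donors needing fewer changes are favoured. The repair loop is a deliberately naive stand-in for the CANCEIS algorithm, and the donor values are invented to match the example.

```python
def rule_a(r):   # Rule A: must be aged 16+ to be married
    return r["marital_status"] != "Married" or r["age"] >= 16

def rule_b(r):   # Rule B: aged 5 to 15 must be a student
    return not (5 <= r["age"] <= 15) or r["student"] == "Yes"

def changes_needed(recipient, donor, fields):
    """Fill missing values from the donor, then copy further donor values
    (counting each one) until the record passes both rules."""
    repaired, changed = dict(recipient), 0
    for field in fields:
        if repaired[field] is None:              # missing: filled, not "changed"
            repaired[field] = donor[field]
    for field in fields:
        if not (rule_a(repaired) and rule_b(repaired)):
            if repaired[field] != donor[field]:
                repaired[field] = donor[field]   # overwrite an observed value
                changed += 1
    return changed, repaired

recipient = {"age": 10, "marital_status": "Married", "student": None}
donor1 = {"age": 35, "marital_status": "Single", "student": "No"}
donor2 = {"age": 9,  "marital_status": "Single", "student": "Yes"}
for donor in (donor1, donor2):
    print(changes_needed(recipient, donor, ["student", "marital_status", "age"]))
# Donor1 forces two observed changes; Donor2 only one, so it is favoured.
```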
Implementation – CANCEIS • Points to note • Variables were imputed in blocks of similar variables (modules) • There was no individual model for any one question • The modules are independent of each other • For example, cultural characteristics might come from a different donor to employment characteristics • Imputed person data was combined in a way that maintained relationship consistency within a household • Given the processing approach, quality was maintained at the geographic unit level
Implementation – Manual Imputation • Manual imputation was kept to a minimum but was necessary • Manual imputation – QA checks • Quality assurance at every stage of processing • Distributional checks and checks against comparator data sources • Edits made through Data File Amendments (DFAs) • DFAs not taken lightly • Involved detailed questionnaire image analysis • Mostly correcting for capture errors • e.g. centenarians • Manual imputation to increase the donor pool • Temporary changes were sometimes required when the donor pool was too small • e.g. postcode matching (which would have been done later in processing but was brought forward)