This presentation discusses strategies for editing a combination of census and tax data, including record linkage, data collection and processing prior to editing, and edit and imputation of the income question using the Canadian Census Edit and Imputation System (CANCEIS). Adjustments, coverage studies, and E&I modules are outlined.
Editing a Mixture of Canadian 2006 Census and Tax Data Mike Bankier Statistics Canada 2006 Work Session on Statistical Data Editing bankier@statcan.ca
Introduction • Census respondents can give permission to link to their tax form rather than answer the 13-part census income question on the 20% sample long form • Early returns indicate permission rate of 83%. • Done to reduce response burden and because the partial/total non-response (NR) rate for income was rising. • Also census responses often approximate while tax responses generally very accurate.
Overview of Talk • Brief review of census/tax record linkage. • Census data collection and processing prior to E&I. • Strategy to perform E&I on mixture of census and income tax data.
Census/Tax Record Linkage • STC’s Generalized Record Linkage System (GRLS) based on Fellegi/Sunter will be used. • Name, birthdate, address, telephone number, sex, marital status, disability status, labour activity status (but not SIN) used to link. • Nicknames, reordering names, accounting for typographic errors, search across Canada, more weight for agreement on uncommon names will be used to achieve expected 85% match rate.
Census/Tax Record Linkage • Only very good matches retained since incorrect matches can generate undesirable outliers. • No manual review done of all links because of large volume of data. • Parameters fine-tuned by running linkage several times and assessing quality of links for a sample of persons.
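The Fellegi/Sunter approach mentioned above scores each candidate pair by summing log-likelihood ratios of field agreement. The following is a minimal sketch of that scoring idea only, not of GRLS itself: the field names, the m- and u-probabilities, and the thresholds are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birthdate": (0.97, 0.003),
    "sex": (0.99, 0.5),
    "postal_code": (0.90, 0.02),
}


def match_weight(census_rec, tax_rec):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = census_rec.get(field), tax_rec.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, upper=10.0, lower=0.0):
    """Hypothetical thresholds: only high-weight pairs are kept as links."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible"


census = {"surname": "TREMBLAY", "birthdate": "1970-05-02", "sex": "F", "postal_code": "K1A0T6"}
tax_same = dict(census)
tax_other = {"surname": "SMITH", "birthdate": "1981-11-23", "sex": "M", "postal_code": "V5K0A1"}

print(classify(match_weight(census, tax_same)))
print(classify(match_weight(census, tax_other)))
```

Raising the upper threshold in a sketch like this corresponds to retaining only very good matches, at the cost of a lower match rate.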
Data Collection/Processing Prior E&I • In 2001, enumerators listed dwellings and dropped off a questionnaire. Questionnaires completed and mailed back by respondent. • In 2006, dwellings listed in advance and questionnaires were mailed to them for approximately 2/3 of dwellings. Other 1/3 treated the same way as in 2001. • 20% of questionnaires completed over Internet.
Data Collection/Processing Prior E&I • Completed questionnaires scanned and data captured using intelligent character recognition. • Any responses not captured, keyed from imaged questionnaire. • In 2001, corrections made before keying (for example cents recorded as dollars) but not feasible for 2006. • In 2004 test, error rate of 11% for income variables.
Data Collection/Processing Prior E&I • Non-respondents or partial respondents with non-response to many questions were phoned or visited. • Coverage edits applied at processing centre and persons were added or subtracted occasionally.
Data Collection/Processing Prior E&I • Edits flagged persons with income responses outside limits. • Reviewed manually by comparing to correlated characteristics, looking at questionnaire image and manually modifying if necessary.
Data Collection/Processing Prior E&I • Majority of income errors the result of • Decimals not recognized or not provided • Confusion between income sources • Monthly amounts reported • Occasionally erroneous amounts entered as prank • Tax forms excluded from manual process because linkage done later and tax data mostly error free.
Adjustments Done – Coverage Studies • Dwelling Classification Survey revisited sample of households to determine if they had been classified correctly as not part of housing stock, unoccupied or occupied. Census data base adjusted for estimated undercoverage and overcoverage. • Reverse Record Check measures undercoverage and overcoverage from all sources, is used to adjust the provincial population totals but does not adjust the Census data base.
E&I of the Income Questions • With completion of the tax/census linkage, income data from Census and tax sources will be available on the Census data base. • Canadian Census Edit and Imputation System (CANCEIS) will be used for all Census variables including income to perform • Deterministic imputation • Minimum change donor imputation • Derive new variables
E&I of the Income Questions • Assumed income data given by most respondents is correct so every attempt will be made to change as few responses as possible. • Some fields imputed deterministically. • Donor imputation used to resolve NR. • Also balance edits to make sure income components sum to within 10% of total income. • Total income is adjusted in later step to ensure perfect agreement with components.
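The balance edit and the later total adjustment described above can be illustrated with a short sketch. It assumes the 10% tolerance is measured relative to the reported total, which the slide does not spell out, and the function names are hypothetical rather than part of CANCEIS.

```python
def passes_balance_edit(components, total, tolerance=0.10):
    """True if the component sum is within `tolerance` of the reported total.

    Assumption: the tolerance is relative to the absolute value of the total.
    """
    component_sum = sum(components)
    if total == 0:
        return component_sum == 0
    return abs(component_sum - total) <= tolerance * abs(total)


def reconcile_total(components):
    """Later step: reset total income so it agrees exactly with the components."""
    return sum(components)


components = [30000, 5000]                      # e.g. wages and pension income
print(passes_balance_edit(components, 37000))   # off by 2000, inside 10% of 37000
print(passes_balance_edit(components, 50000))   # off by 15000, outside 10% of 50000
print(reconcile_total(components))
```

Allowing a tolerance at the edit stage means fewer reported values need to be changed, and the exact agreement is restored only once, at the end.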
E&I of the Income Questions • Series of CANCEIS modules used. • First three modules • Merge tax and census data together. • Calculate average employment income by occupation and geography (SAS) for later use as matching variable. • Define strata to be used in later modules. • Determine status for each income field (income with amount reported, income indicated, loss indicated, no income, non-response).
E&I of the Income Questions • Modules 4 to 6 impute missing income responses while ensuring total within 10% of sum of components. • Module 4 imputes partial respondents who provided total income. • Module 5 imputes partial respondents who did not provide total income but provided all the components of employment income. • Module 6 imputes all other partial and total non-respondents to the income question.
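The routing of records to Modules 4 to 6 can be read as a simple decision rule. The sketch below encodes that rule as stated on the slide; the field names are hypothetical stand-ins for the actual income components, and `None` is used here to represent non-response.

```python
# Hypothetical field names standing in for the real income components.
EMPLOYMENT_FIELDS = ("wages", "farm_self_emp", "nonfarm_self_emp")
OTHER_FIELDS = ("pensions", "investment", "other_income")
ALL_FIELDS = EMPLOYMENT_FIELDS + OTHER_FIELDS + ("total_income",)


def imputation_module(record):
    """Return 4, 5 or 6 for records needing imputation, or None if complete."""
    if all(record.get(f) is not None for f in ALL_FIELDS):
        return None  # complete response, nothing to impute
    if record.get("total_income") is not None:
        return 4  # partial respondent who provided total income
    if all(record.get(f) is not None for f in EMPLOYMENT_FIELDS):
        return 5  # no total, but all employment income components present
    return 6  # all other partial and total non-respondents


blank = {f: None for f in ALL_FIELDS}
print(imputation_module({**blank, "total_income": 42000, "wages": 40000}))
print(imputation_module({**blank, "wages": 40000, "farm_self_emp": 0, "nonfarm_self_emp": 0}))
print(imputation_module(blank))
```

Handling the records with the most reported information first lets later modules draw on a larger pool of complete records as potential donors, assuming earlier results are carried forward.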
E&I of the Income Questions • Modules 7 and 8 select a sample of respondents with no pension benefits and impute positive amounts through donor imputation. • Modules 9 and 10 do something similar but for employment insurance benefits. • Module 11 derives other government benefits such as old age security pension.
E&I of the Income Questions • Module 12 uses donor imputation to resolve non-response to the income tax field. • Module 13 derives total income after tax. • Other modules aggregate income to the family and household levels plus derive 2 low income flags.
E&I of the Income Questions • Donor selection edits extensively used to restrict which of the records that pass the edits can be used as donors. • Reduces the number of outliers generated through imputation.
E&I of the Income Questions • In search for donors, distance measure applies larger weights to income fields considered more important or reliable such as total income. • Numeric amount can be missing but boxes checked can indicate that amount should be negative. Distance measure can be configured to almost guarantee that negative quantity will then be imputed.
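A weighted distance of this kind can be sketched as follows. This is an illustration of the general idea of nearest-neighbour donor search, not the actual CANCEIS distance measure: the field names, weights, scales and the size of the sign penalty are all assumptions.

```python
# Hypothetical weights: total income counts more than age when comparing records.
WEIGHTS = {"total_income": 5.0, "age": 1.0}
# Hypothetical scales: differences at or beyond the scale count as fully different.
SCALES = {"total_income": 20000.0, "age": 10.0}
# Large penalty so a donor whose loss indicator differs is almost never chosen.
SIGN_WEIGHT = 100.0


def field_distance(a, b, scale):
    """Distance in [0, 1]: 0 for equal values, 1 once they differ by `scale`."""
    return min(abs(a - b) / scale, 1.0)


def distance(recipient, donor):
    """Weighted sum of field distances plus a penalty for a sign mismatch."""
    d = 0.0
    for field, weight in WEIGHTS.items():
        if recipient.get(field) is None:
            continue  # cannot match on a field the recipient did not report
        d += weight * field_distance(recipient[field], donor[field], SCALES[field])
    if recipient["loss_indicated"] != donor["loss_indicated"]:
        d += SIGN_WEIGHT
    return d


def nearest_donor(recipient, donors):
    """Return the donor with the smallest distance to the recipient."""
    return min(donors, key=lambda donor: distance(recipient, donor))


# Recipient checked the "loss" box but gave no self-employment amount.
recipient = {"total_income": 18000, "age": 40, "loss_indicated": True, "self_emp": None}
donors = [
    {"id": 1, "total_income": 18000, "age": 40, "loss_indicated": False, "self_emp": 5000},
    {"id": 2, "total_income": 25000, "age": 45, "loss_indicated": True, "self_emp": -3000},
]
chosen = nearest_donor(recipient, donors)
print(chosen["id"], chosen["self_emp"])
```

In this example donor 1 agrees exactly on income and age, yet donor 2 is selected because the sign penalty outweighs the other differences, so a negative amount is imputed.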
Other Changes in E&I Since 2001 • Number of strata will be reduced dramatically since some variables used for stratification in 2001 are now used in the distance measure to identify donors; this reduces boundary effects. • Also in the past, exact matches within a stratum were required while with CANCEIS near matches will be allowed (e.g. age difference of 3 years). • In the past default imputation was sometimes used while with CANCEIS a donor will always be found.
Differences in Processing of Census/Tax Data • During donor imputation, data from tax records will generally be treated the same as data from census forms. • For tax data, will derive • Quebec provincial tax • Child Benefits • GST Credits • For census form data, Child Benefits, GST Credits will be derived.
Differences in Processing of Census/Tax Data • When adjusting for under-reporting of pensions and employment insurance, tax responses are not adjusted because of policy not to modify them. • When imputing income tax field from census forms, donors restricted to tax forms because of poor quality of responses on census forms.
Evaluation of Income E&I • On experimental basis, income responses were blanked out and then CANCEIS imputed the blanks. • CANCEIS was quite effective at replicating responses and preserving distributions when matching variables were correlated with the variable being imputed.
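The blank-and-impute experiment can be mimicked on toy data. The sketch below uses a plain nearest-neighbour imputation on a single matching variable as a stand-in for CANCEIS; the records, the `hours` and `income` field names, and the error metric are all hypothetical.

```python
def impute_nearest(records, match_var, target):
    """Fill missing `target` values from the donor closest on `match_var`."""
    donors = [r for r in records if r[target] is not None]
    imputed = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: abs(d[match_var] - r[match_var]))
            imputed.append({**r, target: donor[target]})
        else:
            imputed.append(dict(r))
    return imputed


def mean_abs_error(truth, blank_ids, match_var, target):
    """Blank the chosen records, impute them, and compare with the true values."""
    blanked = [
        {**r, target: None} if i in blank_ids else dict(r)
        for i, r in enumerate(truth)
    ]
    imputed = impute_nearest(blanked, match_var, target)
    errors = [abs(imputed[i][target] - truth[i][target]) for i in blank_ids]
    return sum(errors) / len(errors)


# Toy data in which the matching variable is strongly correlated with income.
truth = [
    {"hours": 10, "income": 10000},
    {"hours": 20, "income": 21000},
    {"hours": 21, "income": 22000},
    {"hours": 40, "income": 40000},
    {"hours": 41, "income": 42000},
]
print(mean_abs_error(truth, [2, 4], "hours", "income"))
```

With a matching variable that carries little information about income, the same experiment would be expected to give larger errors, which is consistent with the observation above.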
Future Evaluations of Income • Some people provide permission to link and also answer the income question on the census form. • It will be interesting to compare the tax and census responses for these people. • In 2004 test, census income data often rounded to nearest thousand or five thousand. • Mode effects (paper versus internet) may also be studied.
Future Changes to E&I • It is hoped that we can eliminate certain deterministic modules by obtaining Child Benefits, for example, from other sources. • Using CANCEIS, it may be possible to reduce the number of modules used in later censuses and improve consistency with labour and education variables.
Conclusions • Many changes to processing including use of tax data, new questionnaire layout for scanning, use of new E&I system. • These changes will require careful monitoring during production and may require fine-tuning. • Given high quality of tax data, its availability should prove useful.