Shortcomings of Census Interaction Data Oliver Duke-Williams o.w.duke-williams@leeds.ac.uk
Shortcomings • Overall data quality • Statistical Disclosure Control • Variant geographies • Lack of comparability over time
Overall data quality • Generic issues • Unit non-response • Item non-response • Interaction data issues • Problems of address recall for migration data • Problems of address accuracy for workplace data • Changing concept of usual residence
Non-response • Unit non-response – under-enumeration – is a problem for all Census data • It particularly affects migration data • Migrants are 2-10 times more likely to be missed from a Census than residents who have not moved – Simpson & Middleton (1997) • Item non-response refers to those people who have completed a Census form, but not answered a specific question
Patterns of non-response: 2001 • Address one year ago, non-response quantiles
Patterns of non-response: 2001 • Workplace postcode, non-response quantiles
Patterns of non-response: 2001 • Method of travel, non-response quantiles
Item non-response • Various possibilities for former residence and workplace addresses • Address correct but no postcode • Part postcode given (e.g. ‘LS1’) • No information given • The 1991 interaction data included the categories ‘address not stated’ and ‘workplace not stated’
Migrant origin not stated • Migrants with origin unstated as % of total inflow, 1990-91 • Limited spatial patterns • Significant numbers for most districts
Item non-response • In 2001, unknown or incomplete addresses were imputed using donor records • First, select possible donors on the basis of predictive variables • SWS: Industry, occupation, establishment size, mode of transport • SMS: Other migrants in household, country of birth, marital status • Use partial information if available • Then, select geographically nearest donor
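The two-stage donor selection above can be sketched as follows. This is a minimal illustration, not the census offices' actual procedure: the record layout, the field names (`industry`, `mode`, `x`, `y`, `partial_postcode`), the use of straight-line distance, and the postcodes are all hypothetical.

```python
from math import hypot

def impute_postcode(record, donors, predictors):
    """Return the postcode of the nearest donor matching on all predictors.

    Hypothetical sketch: field names and distance measure are assumptions.
    """
    # Step 1: keep donors that agree with the record on every predictive variable.
    candidates = [d for d in donors
                  if all(d[p] == record[p] for p in predictors)]
    # Use partial information if available, e.g. an outward code such as 'LS1'.
    partial = record.get("partial_postcode")
    if partial:
        narrowed = [d for d in candidates
                    if d["postcode"].split()[0] == partial]
        candidates = narrowed or candidates
    if not candidates:
        return None
    # Step 2: select the geographically nearest remaining donor.
    nearest = min(candidates,
                  key=lambda d: hypot(d["x"] - record["x"], d["y"] - record["y"]))
    return nearest["postcode"]

# Made-up donor records with complete workplace postcodes.
donors = [
    {"industry": "retail", "mode": "bus", "x": 0, "y": 0, "postcode": "LS1 1AA"},
    {"industry": "retail", "mode": "bus", "x": 5, "y": 5, "postcode": "LS2 2BB"},
    {"industry": "health", "mode": "car", "x": 1, "y": 1, "postcode": "LS6 3CC"},
]
record = {"industry": "retail", "mode": "bus", "x": 4, "y": 4}
no_info = impute_postcode(record, donors, ["industry", "mode"])
with_partial = impute_postcode({**record, "partial_postcode": "LS1"},
                               donors, ["industry", "mode"])
```

With no address information the nearest matching donor is used; when a part postcode is present it narrows the donor pool first, so a more distant donor can be chosen.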
Shortcomings • Overall data quality • Statistical Disclosure Control • Variant geographies • Lack of comparability over time
Statistical Disclosure Control • Methods applied to interaction data • 1981 • 1991 • 2001
SDC: 1981 • Workplace data – based on 10% sample, therefore no further modification required • Migration data • Set 1 • Within ward • Ward to rest of district (for flows > 25 persons) or ward to rest of county etc. • Set 2 • Ward level, total males and females only
SDC: 1991 • Workplace data – based on 10% sample, therefore no further modification required • Migration data • Suppression applied to some tables
SDC: 1991 – SMS • Set 1: Flows within and between wards • Set 2: Flows within and between districts
Extent of suppression • Districts are grouped by county: Greater London, metropolitan counties, then other counties (sorted alphabetically) • Shading, based on per-district totals: • Red: Total migrants >= 10 • Blue: 0 < total migrants < 10 • White: Total migrants = 0
Effect of suppression • White migrants, 1990-91 • Published value as % of estimated correct value
Effect of suppression • Black migrants, 1990-91 • Published value as % of estimated correct value
Effect of suppression • Indian, Pakistani and Bangladeshi migrants, 1990-91 • Published value as % of estimated correct value
Effect of suppression • Chinese and other migrants, 1990-91 • Published value as % of estimated correct value
Effect of suppression • Mis-reporting of largest non-white migrant group
Coping with problems - 1991 • Under-enumeration • Suppression
The MIGPOP data set • Produced by Simpson and Middleton (1999) • Available from CIDER through WICID • Allows for • ‘Missing million’ • Under-reporting of migrants • Migrants with unknown origin • Contains one age by sex table
Suppression • Migration from Mid-Bedfordshire to Avon, 1990-91
SMSGAPS • SMSGAPS dataset incorporates recovered and estimated data for most suppressed tables • Produced by Rees and Duke-Williams (1997) • Contains versions of all SMS Set 2 tables except 11S and 11W • Available from CIDER through WICID
SDC: 2001 • Outputs of the 2001 Census were subject to Small Cell Adjustment Methodology • Initial version of cross-tabulation produced from raw data • ‘Small values’ were then modified • Sub-totals and totals for each table were then recalculated from the modified values
SCAM example
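The three steps of the adjustment (raw table, modification of small values, recalculated totals) can be illustrated with a toy sketch. The rule used here is an assumption made for illustration only: counts of 1 or 2 are replaced by 0 or 3, with probabilities chosen so that the expected value is unchanged. The actual rule applied to the 2001 outputs is not given in these slides, and the function names and sample values are hypothetical.

```python
import random

def adjust_cell(value, rng):
    """Assumed rule: counts of 1 or 2 become 0 or 3; other counts are unchanged."""
    if value in (1, 2):
        # Rounding up with probability value/3 keeps the expected value equal to value.
        return 3 if rng.random() < value / 3 else 0
    return value

def adjust_table(cells, rng):
    """Modify the interior cells, then recalculate the total from the modified values."""
    adjusted = [adjust_cell(v, rng) for v in cells]
    return adjusted, sum(adjusted)

rng = random.Random(2001)          # fixed seed so the sketch is repeatable
raw = [0, 1, 2, 5, 1, 0, 2, 7]     # made-up interior cells of one flow
adjusted, total = adjust_table(raw, rng)
```

Because the total is rebuilt from the adjusted cells, it can differ from the true total, and a flow made up only of small cells always ends up with a total that is a multiple of 3.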
SCAM • SCAM was applied differentially across the UK • This is particularly confusing for the interaction data, as they are explicitly presented as a UK-level data set • SCAM was applied on the basis of where the data were collected • Migration data were collected at the destination • Flows with destinations in England, Wales and Northern Ireland were subject to SCAM • Workplace data were collected at the residence (origin) • Flows with origins in England, Wales and Northern Ireland were subject to SCAM • In addition, OA level workplace data with origins in Scotland were subject to SCAM • OA level workplace data were not published for Northern Ireland
Effects of SCAM • Interaction data are characterised by: • Sparse matrices • Dominance of small values • 2001 data characterised by over-reporting of multiples of 3
Frequency of flow totals, 2001 SMS Table MG301
Frequency of flow totals, 2001 SMS Table MG301: detail
Frequency of flow totals, 2001 SWS Table W301: detail
2001 data and multiples of 3 • It is the interior cells that are modified • Flow totals are re-calculated from these modified values
Coping with problems: 2001 • Tactics for using SCAM affected data • Use average values? • Useful in some situations, but could lead to errors if rates are calculated • Use minimum number of cells to calculate required value
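The advice to use the minimum number of cells can be motivated with a small exact calculation. It assumes, purely for illustration, that counts of 1 or 2 are replaced by 0 or 3 with unbiased probabilities and that cells are adjusted independently; neither assumption is confirmed by these slides, and the function names are hypothetical.

```python
from fractions import Fraction

def cell_error_variance(value):
    """Variance of (adjusted - true) for one cell under the assumed rule."""
    if value not in (1, 2):
        return Fraction(0)         # larger counts and zeros are left unchanged
    p_up = Fraction(value, 3)      # probability of being raised to 3
    # Error is (3 - value) with probability p_up and (0 - value) otherwise;
    # the mean error is zero, so the variance is the expected squared error.
    return p_up * (3 - value) ** 2 + (1 - p_up) * value ** 2

def sum_error_variance(cells):
    """Variance of the error in a sum of independently adjusted cells."""
    return sum((cell_error_variance(v) for v in cells), Fraction(0))

# Two routes to the same true value of 9 (made-up counts):
many_small_cells = [1, 2, 1, 1, 2, 2]   # six adjusted cells
few_cells = [2, 7]                      # one adjusted cell, one untouched
variance_many = sum_error_variance(many_small_cells)
variance_few = sum_error_variance(few_cells)
```

Under these assumptions every adjusted cell adds the same amount of variance, so a value built from fewer small cells carries less accumulated error.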
Shortcomings • Overall data quality • Statistical Disclosure Control • Variant geographies • Lack of comparability over time
Variant geographies • Changes between Censuses • A problem that is common across all Census outputs • Differences compared to other Census products • Problems specific to the interaction data, in particular the 2001 data
Differences between Census products • The 2001 interaction data have geographies that do not always match those in the other aggregate data • Level 1: Output Areas • Interaction data are the same as other outputs • Level 2: ‘Wards’ • Interaction data are an amalgam of • CAS wards in England and Wales • ST wards in Scotland • Standard wards in Northern Ireland • Level 3: ‘Districts’ • Interaction data are an amalgam of • London boroughs, metro and other districts, Unitary authorities, Scottish Council Areas • Parliamentary constituencies in Northern Ireland
Problems of different geographies • When mapping data, correct boundary sets are time consuming to assemble • When constructing rates, correct denominators are time consuming to gather • Not all area data are easily available for all of these geographies
Shortcomings • Overall data quality • Statistical Disclosure Control • Variant geographies • Lack of comparability over time
Lack of comparability over time • As well as changes in geography, there are significant changes in data structure over time • General issues • Changes in population base, inclusion of students etc. • Handling of unknown migrant origins or workplace locations • Migration data • Handling of overseas origins • Use of ‘no usual residence’ • Workplace data • Handling of off-shore workers • Handling of home-workers
No usual residence in 2001 migration data • Mean: 6.9% • Minimum: 3.7% - Ribble Valley • Maximum: 19% - Newham • 19/20 districts with highest levels are in London • Percentage of all migrants 2000-1, by district, who had ‘no usual residence’ one year prior to the Census
Home-workers • 1981 – Workplace at home is part of general ‘within ward’ flow • Home-workers can only be distinguished from others in the ‘mode of transport’ table • 1991 – Workplace at home is a distinct workplace location • All tables can be extracted separately for home-workers • 2001 – Workplace at home is part of general ‘within ward’ flow • Home-workers can only be distinguished from others in the ‘mode of transport’ table
Coping with compatibility issues • Various data sets exist that attempt to bridge some of these gaps • Re-estimate for newer geographies • e.g. 1981 data on 1991 and 2001 boundaries (Boyle and Feng, 2002) • Create hybrid sets • e.g. merge home-workers into main flow for 1991 • Create best-fit geographies that span time periods • e.g. CIDS common geographies
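The hybrid-set idea for home-workers can be sketched in a few lines: 1991 home-worker counts, held as a distinct workplace location, are folded into the within-ward flow so that the result resembles the 1981 and 2001 structure. The data layout (dictionaries keyed by ward codes) and the function name are hypothetical, not the format of any published data set.

```python
def merge_home_workers(flows, home_workers):
    """Add each ward's home-workers to its within-ward flow.

    flows: {(origin_ward, destination_ward): count}
    home_workers: {ward: count}, the 1991-style separate category
    """
    merged = dict(flows)                  # leave the input untouched
    for ward, count in home_workers.items():
        key = (ward, ward)                # within-ward flow
        merged[key] = merged.get(key, 0) + count
    return merged

# Made-up counts for two wards.
flows_1991 = {("A", "A"): 10, ("A", "B"): 4}
home_1991 = {"A": 3, "B": 2}
hybrid = merge_home_workers(flows_1991, home_1991)
```

Flows between different wards are unchanged; only the within-ward cells grow, and a ward with home-workers but no recorded within-ward flow gains one.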
Summary • The interaction data suffer from problems related to • Disclosure control modifications • Changes over time • Awkward geographies in 2001 • These have been addressed by • Estimated and re-worked data sets • Data estimated for different boundary sets
References • Boyle, P.J. and Feng, Z. (2002) A method for integrating the 1981 and 1991 GB Census interaction data, Computers, Environment and Urban Systems, 26: 241-56 • Rees, P.H. and Duke-Williams, O. (1997) Methods for estimating missing data on migrants in the 1991 British Census, International Journal of Population Geography, 3: 323-368 • Simpson, S. and Middleton, E. (1997) Who is missed by a national Census? A review of empirical results from Australia, Britain, Canada and the USA, CCSR Working Paper No 2, Centre for Census and Survey Research, University of Manchester • Simpson, S. and Middleton, E. (1999) Undercount of migration in the UK 1991 Census and its impact on counterurbanisation and population projections, International Journal of Population Geography, 5: 387-405