Access microdata from a New Zealand census for research purposes. Curate a subset of the census dataset with 100 output variables such as demographics, ethnicity, income, and employment to ensure usability and safety. Explore controlled sampling methods for improved data accuracy. Learn about the importance of variable selection and sample size for reliable results.
A ‘Microdata for Research’ sample from a New Zealand census Mike Camden mike.camden@stats.govt.nz Statistics New Zealand www.stats.govt.nz
Researchers accessing microdata in NZ: • Datalab • Offsite • Remote Access • CURFs [Diagram labels: the researchers; detail in data; output; dataset; Govt only; code; output; ease of access; CURF on CD]
The census dataset The CURF is a subset of this … where CURF = Confidentialised Unit Record File About 100 output variables (all categorical): ID Geographic Family Demographics: Location & Household Sex Age Residence Ethnicity Origin Income Employment 3 820 749 people (the census-night population)
The census dataset and its CURF We want this CURF to be both useful and safe! But there are several possible results … Census dataset: about 100 output variables (all categorical); 3 820 749 people (the census-night population). CURF: 33 variables, all categorical, some collapsed; 76 415 people (2%); 250 values modified
The possible results … [Chart: Usability plotted against Safety, with four regions: Useful, Safe; Useful, Unsafe; Useless, Unsafe; Useless, Safe. Annotations: “CURFs start along here”; “We hope we’ve got ours up here!”]
How to get safety (we hope) and not lose usefulness: • Choose the variables carefully, but location and household are sad losses • Collapse variables carefully to preserve important groups • Choose a small sample size and still get good estimates • Use Special Uniques to find rogue records and variables, and change a tiny fraction of the dataset
Example: carefully collapsed categories: AgeGroup has: a few (8) large (5%+) categories; useful life-stage categories
For future census CURFs, we’ll rethink: • Including location and household variables at expense of others • Collapsing of categories • Sample size
One measure of usefulness: reliability of counts • Here’s a cell in a table: 5% of the population is in it. • The CURF will give an answer ± its sampling error: 5% ± 0.08%. • This is what the sampling error looks like for other population %s:
Let’s fix CURF size at 2%: • Let pop proportion go from 0% to 5%: What happens to sampling error of p? A 2% sample, with 76 000 people, gives good estimates
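The ± 0.08% figure can be roughly reproduced with the textbook standard error of a proportion. The sketch below assumes simple random sampling without replacement, which is not the controlled method the CURF actually used, so treat it as an approximation; the function name `se_proportion` is made up for this example.

```python
import math

N = 3_820_749   # census-night population (from the slides)
n = 76_415      # CURF size, about 2% of N (from the slides)

def se_proportion(p, n, N):
    """Standard error of a sample proportion under simple random sampling
    without replacement: binomial variance times a finite population
    correction. This is an assumed model, not the CURF's own design."""
    fpc = (N - n) / (N - 1)
    return math.sqrt(p * (1 - p) / n * fpc)

# Sampling error for population proportions from 0.5% to 5%
for pct in [0.5, 1, 2, 3, 4, 5]:
    p = pct / 100
    print(f"p = {pct:>4}%   SE(p) = {100 * se_proportion(p, n, N):.3f}%")
```

Under this assumption the standard error at p = 5% rounds to 0.08%, so the slide's ± appears to be about one standard error.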
The CURF expresses NZ’s diversity: • We have 5 Yes/No Ethnicity variables • Special Uniques process set some to No • The CURF adds some sampling error • 5 variables give 32 (= 2^5) combinations …
The 32 combined ethnicities: Single ethnicities only • This variable makes some of us unique. • Differences come from: - Special Uniques process - Sampling error
Census CURF unique records • 74.4% of records for NZ adults have unique combinations of values across all 33 vars. They’re Population Uniques (PU) • If someone is unique in the CURF (a Sample Unique, SU), are they also unique in the population? What is Pr(PU|SU)?
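One way to get a feel for Pr(PU|SU) is to count uniques directly on a toy dataset. The sketch below is purely illustrative: the population is synthetic, the number of variables and their category counts are invented, and a simple random 2% sample stands in for the CURF. It is not the Special Uniques method mentioned in the slides.

```python
import random
from collections import Counter

random.seed(1)

# Synthetic population: each record is a tuple of categorical values.
# The category counts below are made up for illustration only.
N_POP = 50_000
CATEGORY_SIZES = [2, 8, 5, 10, 6, 4]
population = [
    tuple(random.randrange(size) for size in CATEGORY_SIZES)
    for _ in range(N_POP)
]

# A 2% simple random sample stands in for the CURF.
sample = random.sample(population, N_POP // 50)

pop_counts = Counter(population)
sample_counts = Counter(sample)

# Sample uniques (SU): combinations seen exactly once in the sample.
sample_uniques = [rec for rec, c in sample_counts.items() if c == 1]
# Of those, the ones that are also population uniques (PU).
also_pop_unique = [rec for rec in sample_uniques if pop_counts[rec] == 1]

pr_pu_given_su = len(also_pop_unique) / len(sample_uniques)
print(f"Sample uniques: {len(sample_uniques)}")
print(f"Estimated Pr(PU|SU): {pr_pu_given_su:.3f}")
```

With real data the full population is usually not available to the researcher, which is why this probability has to be assessed by the data custodian.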
We volunteered ten researchers to assess safety and usefulness … • Useful? “no more variables needed” “keep the household relationships” “I’d like Region” “I’d never use Region” “I get down to small numbers (employment by ethnicity) and worry about small sample size” “the CURF will be a major asset to researchers” • Safe? “I am quite confident that our identity has been protected as much as possible”
How big a sample? • 1% is Too Small: we have sample surveys like that already: - Household Labour Force Survey - Income Survey - SoFIE • 3% is Too Big: disclosure risk up, variability down only a bit • Whole-number %s are best for the sampling method. Suggestions please???
Our ‘controlled’ sampling method: We used: a sort on Sex, AgeGroup (8 groups), AreaUnit (not in CURF); then a grouping into 100s; then a systematic sample from each 100. This gives great proportions for these ‘controlled’ variables and may help related variables (ethnicities, urban/rural etc)
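The sort, group and sample steps can be sketched as follows. The slides do not say how the systematic sample is drawn within each group of 100, so this version assumes a random start and a fixed step of 50 (giving 2 records per 100). The function name, field names and synthetic people are all hypothetical.

```python
import random

def controlled_sample(records, key, block_size=100, rate_pct=2, rng=random):
    """Sort records, cut them into consecutive blocks, and take a
    systematic sample (random start, fixed step) from each block.
    Assumes rate_pct divides block_size evenly."""
    if block_size % rate_pct != 0:
        raise ValueError("rate_pct must divide block_size evenly")
    step = block_size // rate_pct          # 50 for a 2% sample
    ordered = sorted(records, key=key)
    chosen = []
    for block_start in range(0, len(ordered), block_size):
        block = ordered[block_start:block_start + block_size]
        start = rng.randrange(step)        # random start within the step
        chosen.extend(block[start::step])  # every step-th record after it
    return chosen

# Synthetic people, for illustration only
rng = random.Random(42)
people = [
    {"sex": rng.choice("MF"),
     "age_group": rng.randrange(8),
     "area_unit": rng.randrange(50)}
    for _ in range(10_000)
]

curf = controlled_sample(
    people,
    key=lambda r: (r["sex"], r["age_group"], r["area_unit"]),
    rng=rng,
)
print(len(curf))  # 2 records from each of the 100 blocks
```

Because the records are sorted on the controlled variables before being cut into blocks, each category of the leading sort variable can miss its expected sample count only where a block straddles a category boundary.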
Diffs in Counts: CURF – Expected Variation with random sampling and independence: ±1, ±2 SDs. Controlled variables give tiny differences (within ±1). Others show little drop in variation
Conclusions: • Making a CURF both Useful and Safe needs: Cunning Contracts, Co-operation, Confidence • Controlling the sampling improves counts: for controlled variables, spectacularly; for other variables, minimally! • See www.stats.govt.nz/CURF
The slides from here on are for background NEI = Not Elsewhere Indicated; X = Missing (structural)
2001 Census Statement of Confidentiality Only people authorised by the Statistics Act 1975 are allowed to see your individual information. They must use it only for statistical purposes, such as the preparation of summary statistics about groups. We’re working within this.
How big are NZ’s Area Units? Mean = 2 000 people. There are lots of Area Units with very small populations
Diffs in Counts: CURF – Expected • Counts in the cells for controlled variables (Sex * AgeGroup * AreaUnit) are within ±1 of expected
Diffs in Counts: CURF – Expected IncomeGroup is related to Sex and AgeGroup but is still quite variable
CURF behaviour: Tables of counts (From the Minicurf) • For the population we have: Population size = N, Sample size = n • For any cell we have: Population count = k, Sample count = x, and let p = x/n • So x behaves as a …. Ummmmm …
… is x a hypergeometric?? [Diagram: the population (N) contains the people with the property of interest (k); the curf (n) includes x of them]
What happens to sampling error of p as CURF size increases? • If we believe: x is hypergeometric, parameters N, n, k; x is approx binomial, parameters k/N, n; x is approx Poisson, parameter nk/N (when k/N is small) • Let p = x/n = proportion of curf in cell. Then SE(p) = expression /√n so it declines gracefully as n increases
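The three models above each give a standard textbook expression for SE(p), and the sketch below compares them for a cell holding 5% of the population at CURF sizes of 1%, 2% and 3%. The function names are invented for this example, and the models are the slide's "if we believe" assumptions rather than a description of the controlled sample actually drawn.

```python
import math

N = 3_820_749          # census-night population (from the slides)
k = round(0.05 * N)    # a cell holding 5% of the population

def se_hypergeometric(N, n, k):
    # Var(x) = n * pi * (1 - pi) * (N - n) / (N - 1); SE(p) = sqrt(Var(x)) / n
    pi = k / N
    return math.sqrt(pi * (1 - pi) / n * (N - n) / (N - 1))

def se_binomial(N, n, k):
    # Var(x) = n * pi * (1 - pi)
    pi = k / N
    return math.sqrt(pi * (1 - pi) / n)

def se_poisson(N, n, k):
    # Var(x) = n * pi, reasonable only when pi = k/N is small
    return math.sqrt((k / N) / n)

results = []
for pct in (1, 2, 3):
    n = round(N * pct / 100)
    row = (pct,
           se_hypergeometric(N, n, k),
           se_binomial(N, n, k),
           se_poisson(N, n, k))
    results.append(row)
    print(f"{pct}% CURF (n = {n:,}): "
          f"hypergeometric {100 * row[1]:.4f}%, "
          f"binomial {100 * row[2]:.4f}%, "
          f"Poisson {100 * row[3]:.4f}%")
```

Each expression is a quantity divided by √n, so going from 2% to 3% shrinks the standard error by a factor of roughly √(2/3), which is the "variability down only a bit" trade-off.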