
A ‘Microdata for Research’ sample from a New Zealand census

Mike Camden, mike.camden@stats.govt.nz, Statistics New Zealand, www.stats.govt.nz


Presentation Transcript


  1. A ‘Microdata for Research’ sample from a New Zealand census Mike Camden mike.camden@stats.govt.nz Statistics New Zealand www.stats.govt.nz

  2. Researchers accessing microdata in NZ: • Datalab • Offsite • Remote Access • CURFs [Diagram labels: the researchers; detail in data; output; dataset; Govt only; code; output; ease of access; CURF on CD]

  3. The census dataset The CURF is a subset of this … where CURF = Confidentialised Unit Record File About 100 output variables (all categorical): ID Geographic Family Demographics: Location & Household Sex Age Residence Ethnicity Origin Income Employment 3 820 749 people (the census-night population)

  4. The census dataset and its CURF. We want this CURF to be both useful and safe! But there are several possible results … Census dataset: about 100 output variables (all categorical): ID Geographic Family Demographics: Location & Household Sex Age Residence Ethnicity Origin Income Employment; 3 820 749 people (the census-night population). CURF: 33 variables, all categorical, some collapsed; 76 415 people (2%); 250 values modified.

  5. The possible results … [Diagram: Usability plotted against Safety, with four regions: Useful, Safe; Useful, Unsafe; Useless, Unsafe; Useless, Safe.] CURFs start along here. We hope we’ve got ours up here!

  6. How to get safety (we hope) and not lose usefulness: • Choose the variables carefully, but location and household are sad losses • Collapse variables carefully, to preserve important groups • Choose a small sample size, and still get good estimates • Use Special Uniques to find rogue records and variables, and change a tiny fraction of the dataset

  7. Example: carefully collapsed categories: AgeGroup has: a few (8) large (5%+) categories; useful life-stage categories

  8. For future census CURFs, we’ll rethink: • Including location and household variables at expense of others • Collapsing of categories • Sample size

  9. One measure of usefulness: reliability of counts • Here’s a cell in a table: 5% of the population is in it. • The CURF will give an answer ± its sampling error: 5% ± 0.08%. • This is what the sampling error looks like for other population %’s:

  10. Let’s fix CURF size at 2%: • Let the population proportion go from 0% to 5%: what happens to the sampling error of p? A 2% sample, with 76 000 people, gives good estimates
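A minimal sketch of the calculation behind slides 9 and 10, assuming the CURF behaves like a simple random sample without replacement (the actual CURF used a controlled sampling method, so this is only an approximation). The function name `se_proportion` is hypothetical; N and n are the figures quoted on the slides.

```python
import math

N = 3_820_749   # census-night population (from the slides)
n = 76_415      # CURF size, about 2% of N (from the slides)

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion under simple random sampling.

    If N is given, the finite population correction (N - n) / (N - 1)
    is applied; otherwise the plain binomial formula is used.
    """
    var = p * (1 - p) / n
    if N is not None:
        var *= (N - n) / (N - 1)
    return math.sqrt(var)

# Sweep the population proportion up to 5%, with CURF size fixed at 2%.
for pct in (0.5, 1, 2, 3, 4, 5):
    p = pct / 100
    se = se_proportion(p, n, N)
    print(f"p = {pct:>4}%  SE = {100 * se:.3f}%  relative SE = {se / p:.1%}")
```

Under these assumptions a cell holding 5% of the population has a standard error of roughly 0.08 percentage points, which agrees with the figure on slide 9. The absolute error shrinks as p falls, but the relative error grows.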

  11. The CURF expresses NZ’s diversity: • We have 5 Yes/No Ethnicity variables • The Special Uniques process set some to No • The CURF adds some sampling error • 5 variables give 32 (= 2^5) combinations …
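The count of 32 can be checked by enumerating every Yes/No pattern across five indicators. This is only an illustration: the names in `FLAGS` are placeholders, since the slides do not list the actual ethnicity variable names.

```python
from itertools import product

# Hypothetical names for the five Yes/No ethnicity indicators;
# the slides do not give the real variable names.
FLAGS = ["eth_1", "eth_2", "eth_3", "eth_4", "eth_5"]

# Every possible Yes/No pattern across the five indicators.
combos = list(product([False, True], repeat=len(FLAGS)))

def combo_label(combo):
    """Join the names of the indicators set to Yes, or 'none' if all are No."""
    chosen = [name for name, yes in zip(FLAGS, combo) if yes]
    return "+".join(chosen) if chosen else "none"

# Patterns with exactly one Yes correspond to single ethnicities.
single = [c for c in combos if sum(c) == 1]

print(len(combos), "combinations,", len(single), "of them single ethnicities")
```

Most of the 32 patterns are multiple-ethnicity combinations, which is where rare (and potentially unique) records are most likely to appear.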

  12. The 32 combined ethnicities: Single ethnicities only • This variable makes some of us unique. • Differences come from: the Special Uniques process; sampling error

  13. Census CURF unique records • 74.4% of records for NZ adults have unique combinations of values across all 33 vars. They’re Population Uniques • If someone is unique in the CURF, are they also unique in the population? What is Pr(PU|SU)?

  14. Is a Sample Unique also a Population Unique?
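One way to get a feel for Pr(PU|SU) is to simulate it. The sketch below is not the method used for the census CURF: it draws a simple random sample from a small synthetic population with made-up categorical variables, then counts how many sample uniques are also population uniques. The function name `pr_pu_given_su` and all the data are assumptions for illustration.

```python
import random
from collections import Counter

def pr_pu_given_su(population, fraction, seed=0):
    """Estimate Pr(PU | SU) from one simple random sample.

    population: list of hashable keys, one per person, where each key is
    that person's combination of values across the variables of interest.
    """
    rng = random.Random(seed)
    n = round(fraction * len(population))
    sample = rng.sample(population, n)

    pop_counts = Counter(population)
    sample_counts = Counter(sample)

    # Sample uniques: combinations seen exactly once in the sample.
    sample_uniques = [key for key, c in sample_counts.items() if c == 1]
    if not sample_uniques:
        return float("nan")

    # Of those, how many are also unique in the whole population?
    also_pop_unique = sum(1 for key in sample_uniques if pop_counts[key] == 1)
    return also_pop_unique / len(sample_uniques)

# Synthetic population: 100 000 people, five made-up categorical variables.
rng = random.Random(1)
population = [
    tuple(rng.randrange(levels) for levels in (2, 8, 6, 10, 12))
    for _ in range(100_000)
]
print(pr_pu_given_su(population, 0.02))
```

With only a handful of coarse variables most sample uniques are not population uniques; adding variables or finer categories pushes the ratio up, which is the risk the Special Uniques process is meant to manage.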

  15. We volunteered ten researchers to assess safety and usefulness … • Useful? “no more variables needed”; “keep the household relationships”; “I’d like Region”; “I’d never use Region”; “I get down to small numbers (employment by ethnicity) and worry about small sample size”; “the CURF will be a major asset to researchers” • Safe? “I am quite confident that our identity has been protected as much as possible”

  16. How big a sample? • 1% is Too Small: we have sample surveys like that already (Household Labour Force Survey, Income Survey, SoFIE) • 3% is Too Big: disclosure risk up, variability down only a bit • Whole-number %s are best for the sampling method. Suggestions please???

  17. Usability, Safety and Sample Size:

  18. Our ‘controlled’ sampling method: We used: a sort on Sex, AgeGroup (8 groups), AreaUnit (not in CURF); then a grouping into 100s; then a systematic sample from each 100. This gives great proportions for these ‘controlled’ variables, and may help related variables (ethnicities, urban/rural etc)
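The sort, group and sample steps can be sketched as follows. The slide does not say how the systematic sample within each 100 was drawn, so this sketch assumes a random start and a fixed step of 50 (two records per block, giving a 2% sample). The function name, field names and synthetic data are all hypothetical.

```python
import random

def controlled_sample(records, sort_key, block_size=100, per_block=2, seed=0):
    """Sort, cut into blocks, then take a systematic sample from each block.

    Assumption: within each block we pick a random start and then every
    (block_size // per_block)-th record. A short final block may yield
    fewer than per_block records.
    """
    rng = random.Random(seed)
    ordered = sorted(records, key=sort_key)
    step = block_size // per_block
    chosen = []
    for start in range(0, len(ordered), block_size):
        block = ordered[start:start + block_size]
        offset = rng.randrange(step)          # random start within the block
        chosen.extend(block[offset::step])    # then every step-th record
    return chosen

# Synthetic population of 10 000 people with made-up values.
rng = random.Random(42)
people = [
    {"sex": rng.choice("FM"), "age_group": rng.randrange(8),
     "area_unit": rng.randrange(50)}
    for _ in range(10_000)
]

sample = controlled_sample(
    people, sort_key=lambda r: (r["sex"], r["age_group"], r["area_unit"])
)

pop_f = sum(r["sex"] == "F" for r in people)
sample_f = sum(r["sex"] == "F" for r in sample)
print(len(sample), sample_f, 0.02 * pop_f)
```

Because each category of the leading sort variable occupies one contiguous run of the sorted file, its sample count can differ from 2% of its population count only through the blocks that straddle a category boundary. That is why the differences for the controlled variables stay so small.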

  19. Diffs in Counts: CURF – Expected. Variation with random sampling and independence; ±1, ±2 SDs. Controlled variables give tiny differences (≤ ±1). Others show little drop in variation

  20. Conclusions: • Making a CURF both Useful and Safe needs: Cunning, Contracts, Co-operation, Confidence • Controlling the sampling improves counts: for controlled variables: spectacularly; for other variables: minimally! • See www.stats.govt.nz/CURF

  21. The slides from here on are for background. NEI = Not Elsewhere Indicated; X = Missing (structural)

  22. 2001 Census Statement of Confidentiality Only people authorised by the Statistics Act 1975 are allowed to see your individual information. They must use it only for statistical purposes, such as the preparation of summary statistics about groups. We’re working within this.

  23. Overseas Practice

  24. How big are NZ’s Area Units?? Mean = 2 000. There are lots of very small population Area Units

  25. Diffs in Counts: CURF – Expected • Counts in the cells for controlled variables: Sex * AgeGroup * AreaUnit are ≤ ±1 out

  26. Diffs in Counts: CURF – Expected. IncomeGroup is related to Sex and AgeGroup but is still quite variable

  27. CURF behaviour: Tables of counts (From the Minicurf) • For the population we have: Population size = N; Sample size = n • For any cell we have: Population count = k; Sample count = x; and let p = x/n • So x behaves as a …. Ummmmm …

  28. … is x a hypergeometric?? [Diagram: the population (N); the people with the property of interest (k); the CURF (n); the people in the CURF with the property (x)]

  29. What happens to sampling error of p as CURF size increases? • If we believe: x is hypergeometric, parameters N, n, k; x is approx binomial, parameters k/N, n; x is approx Poisson, parameter nk/N (when k/N is small) • Let p = x/n = proportion of CURF in cell. Then SE(p) = expression / √n, so it declines gracefully as n increases
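The three models on this slide give slightly different expressions for SE(p). The sketch below uses the standard textbook variance formulas and assumes simple random sampling; the function names and the choice of a cell holding 1% of the population are illustrative only.

```python
import math

N = 3_820_749  # census-night population (from the slides)

def se_hypergeometric(k, n, N):
    """SE of p = x/n when x is hypergeometric with parameters N, n, k."""
    p = k / N
    return math.sqrt(p * (1 - p) / n * (N - n) / (N - 1))

def se_binomial(k, n, N):
    """SE of p = x/n when x is binomial with parameters n and k/N."""
    p = k / N
    return math.sqrt(p * (1 - p) / n)

def se_poisson(k, n, N):
    """SE of p = x/n when x is Poisson with mean n*k/N.

    Var(x) = n*k/N, so Var(x/n) = (k/N) / n.
    """
    return math.sqrt((k / N) / n)

k = round(0.01 * N)  # illustrative cell: 1% of the population
for pct in (1, 2, 3):
    n = round(N * pct / 100)
    print(
        f"{pct}% CURF: "
        f"hypergeometric {100 * se_hypergeometric(k, n, N):.4f}%  "
        f"binomial {100 * se_binomial(k, n, N):.4f}%  "
        f"Poisson {100 * se_poisson(k, n, N):.4f}%"
    )
```

For small cells and small sampling fractions the three agree closely, and each falls in proportion to 1/√n (apart from the finite population correction in the hypergeometric case). Tripling the sample from 1% to 3% therefore cuts the standard error by a factor of only about 1.7.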
