
John M. Abowd U.S. Census Bureau and Cornell University

Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1. John M. Abowd, U.S. Census Bureau and Cornell University. CNSTAT Panel on the Census Bureau’s Dynamics of Economic Well-being System, January 26, 2007.



Presentation Transcript


  1. Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1 • John M. Abowd, U.S. Census Bureau and Cornell University • CNSTAT Panel on the Census Bureau’s Dynamics of Economic Well-being System • January 26, 2007

  2. Background • Longstanding goal of the Census Bureau • Statutory mandate to provide survey data used to study critical policy issues • Focus of a longstanding internal Census Bureau survey improvement project that is part of the LEHD Program • This is the first Title 13/Chapter 5 predominant purpose for using IRS data • Treasury Regulation Change, February 2001 (final regulation February 2003) • New W-2 items authorized: SSN, EIN, Box 1, Box 3, Box 13, number of quarters, 1099R • Creation of a public use data set that integrates survey and administrative data is the other predominant Title 13/Chapter 5 purpose

  3. Team and Sponsorship • The project was conducted by a team of researchers from the Census Bureau, IRS, Social Security Administration, and a consortium of university partners • Main financial support provided by the Census Bureau, Social Security Administration, and the National Science Foundation • Primary design decisions made by an inter-agency team led by Martha Stinson at the Census Bureau, with the participation of SSA, IRS, the Congressional Budget Office, and the Joint Committee on Taxation

  4. Acknowledgements: Research Team • Martha Stinson (Census Bureau), project manager • Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan Ricchetti (Census Bureau) • Karen Masken (IRS) • Simon Woodcock (Simon Fraser University), Jerry Reiter (Duke University), Josep Domingo-Ferrer (University of Rovira and Virgili), Vicenc Torra (University of Barcelona), Lars Vilhuber (Cornell University and Census Bureau), consultants

  5. Acknowledgements I: Agencies • Kenneth Prewitt, C. Louis Kincannon, Hermann Habermann, Paula Schneider, Nancy Gordon, Frederick Knickerbocker, Cynthia Clark, Howard Hogan, and Thomas Mesenbourg, senior management Census Bureau • Susan Grad, Howard Iams, and Paul van de Water, senior management SSA • Mark Mazur and Nicholas Greenia, IRS senior management and IRS/SOI Census Bureau disclosure liaison • Daniel Newlon, NSF project officer

  6. Acknowledgements II: Agencies • Chet Bowie, Al Tupek, Barry Sessamen, Dan Weinberg, Ron Prevost, Jeremy Wu, division and program management Census Bureau • Brian Greenberg, Dawn Haynes, SSA technical support, contract management, and disclosure officers • Patricia Doyle, Judith Eargle, and Nancy Bates, Census Bureau SIPP research direction • Charlene Leggieri and Sally Obenski, Census Bureau administrative records management • Laura Zayatz, Census Bureau statistical disclosure research direction • John Sabelhaus, Congressional Budget Office research direction

  7. Conceptual Framework • Link all SIPP panels from the 1990s • Five panels: 1990, 1991, 1992, 1993, 1996 • Link to IRS data • Summary Earnings Records (FICA taxable earnings 1937-1950, and 1951-2003 annual) • Detailed Earnings Record (job level data, uncapped, 1978-2003 annual) • SSA benefits data • Master Beneficiary Record, Supplemental Security Record, Payment History Update System, 831 file (all available historical data through 2002) • Create product that prevents individuals from being re-identified in the current public use SIPP files

  8. Major Design Decisions • Limit number of SIPP variables included • Target national retirement and disability research communities • Investigate disclosure avoidance methods to protect both survey and administrative data • But, note that a re-identification in the current SIPP public use files is not a disclosure since those files have also been subjected to extensive disclosure avoidance procedures • Very high hurdle

  9. Latest Versions • Gold Standard confidential file at release 4.0 • All confidential data (person-level), all sources • Beta Public Use File 4.1 • All person-level SIPP, IRS variables from the Gold Standard Version 4.0 • Benefit and type of benefit measures for initial SSA benefit (if any), benefit and type of benefit as of April 1, 2000 • Consistent panel weight for civilian, non-institutional population as of April 1, 2000 (synthesized on each implicate) • Four missing data implicates with four synthetic implicates each (16 implicates total)

  10. Summary of Discussion Today • A tour of the methods used to complete and synthesize the SIPP-PUF • Some disclosure avoidance results • Selected analytical validity results

  11. Multiple Imputation Confidentiality Protection History • Rubin (1993): treat unsampled individuals in the population as missing the survey data, impute the missing values (synthetic population), then sample and release (fully synthetic data) • Little (1993): treat sensitive values as missing, impute and release the imputed values (partially synthetic data) • Fienberg (1994): parametric Bayesian procedure that eliminated the use of any actual values in synthetic data • Raghunathan, Reiter, and Rubin (2003): adapted the Sequential Regression Multivariate Imputation method to synthetic data • Reiter (2004): inference-valid combination of multiple imputation for missing and synthetic data • Abowd and Woodcock (2001): applied SRMI to confidentiality protection of longitudinally linked employer-employee synthetic micro-data

  12. Multiple Imputation Confidentiality Protection Methods • Denote confidential data by Y and nonconfidential data by X (may be empty) • Both Y and X may contain missing data, so that Y=(Yobs , Ymis) and X=(Xobs , Xmis) • Assume database can be represented by joint density p(Y,X,θ) • Estimate the posterior predictive distribution p(Ynew, Xnew | Yobs, Xobs) • Sample multiple times from the posterior predictive distribution, release these samples
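
The posterior predictive distribution named on this slide can be written out as an integral over the model parameters. The form below is the standard textbook identity, not a formula taken from the slides, and it assumes that new and observed data are conditionally independent given θ:

```latex
p\left(Y^{new}, X^{new} \mid Y^{obs}, X^{obs}\right)
  = \int p\left(Y^{new}, X^{new} \mid \theta\right)\,
         p\left(\theta \mid Y^{obs}, X^{obs}\right)\, d\theta
```

Sampling "multiple times" then amounts to drawing θ from its posterior and, for each draw, drawing new data from the model given that θ.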

  13. Sequential Regression Multivariate Imputation (SRMI) Method • Synthetic data values are draws from the posterior predictive density p(Ynew, Xnew | Yobs, Xobs) • In practice, use a two-step procedure: 1) complete the missing data using SRMI; 2) draw synthetic data from the predictive density given the completed data • Repeating the procedure yields multiple synthetic data implicates

  14. SRMI Method Details • Specifying the joint density p(Y,X,θ) is unrealistic in most applications • Instead, approximate the joint density by a sequence of conditional densities defined by generalized linear models • Synthetic values of some variable y_k are draws from a conditional density p_k(y_k | Ym_~k, Xm), where Ym, Xm are the completed data, Ym_~k denotes the completed confidential variables other than y_k, and the densities p_k are defined by an appropriate generalized linear model and prior, a Dirichlet-multinomial model, or a Bayesian Bootstrap
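
The idea of replacing a joint model with a sequence of conditional regressions can be sketched in a few lines. The example below is a deliberately simplified illustration, not the project's implementation: it handles only two continuous variables, uses ordinary least squares with normal noise, and does not draw the regression parameters from their posterior as a proper SRMI would. The function names `fit_line` and `srmi_complete` are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return a, b, sd

def srmi_complete(rows, n_iter=9, seed=0):
    """rows: list of [y1, y2] with None marking a missing value.

    Cycles through the two variables, re-imputing each missing cell from a
    regression on the other variable (simplified: parameters are not sampled).
    """
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Start by filling each missing cell with the observed column mean.
    for k in range(2):
        col_mean = statistics.fmean(r[k] for r in rows if r[k] is not None)
        for r in data:
            if r[k] is None:
                r[k] = col_mean
    for _ in range(n_iter):
        for k in range(2):
            other = 1 - k
            # Fit only on rows where variable k was actually observed.
            fit_rows = [r for r, mis in zip(data, missing) if not mis[k]]
            a, b, sd = fit_line([r[other] for r in fit_rows],
                                [r[k] for r in fit_rows])
            for r, mis in zip(data, missing):
                if mis[k]:
                    r[k] = a + b * r[other] + rng.gauss(0.0, sd)
    return data
```

Calling `srmi_complete` several times with different seeds would give several completed implicates; observed cells are never altered.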

  15. Maintaining Relationships in the Underlying Data • Define a multilevel parent-child tree to describe the exact relationships in the data • Variables at the root of this tree should have values for all individuals, completed and synthesized first (but as a function of all data) • Child variables only completed or synthesized when appropriate given the parent variable • For missing data, iterate nine times to complete all missing data, sample 4 implicates • For synthetic data, condition on values from the completed data, sample 4 implicates per completed implicate
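
A rough sketch of how a parent-child tree can gate which variables are generated, and how the 4 x 4 nesting of implicates arises. Everything here is hypothetical: the variable names (`married`, `spouse_age`, `age`), the toy `draw_value` model, and the function names are placeholders chosen for illustration, and the nine-iteration completion step is collapsed into a single call.

```python
import random

# Hypothetical tree: each variable names its parent and the parent value
# that makes the child applicable. Root variables have parent None.
TREE = {
    "married": {"parent": None, "when": None},
    "spouse_age": {"parent": "married", "when": True},
}

def draw_value(name, record, rng):
    """Placeholder draw; a real system would sample from a fitted model."""
    if name == "married":
        return rng.random() < 0.6
    return record["age"] + rng.randint(-5, 5)

def generate(record, rng):
    """Fill variables root-first, skipping children whose parent rules them out."""
    out = dict(record)
    for name, node in TREE.items():  # insertion order lists parents first
        parent = node["parent"]
        if parent is None or out.get(parent) == node["when"]:
            out[name] = draw_value(name, out, rng)
        else:
            out[name] = None  # structurally not applicable
    return out

def nested_implicates(record, n_complete=4, n_synth=4, seed=0):
    """Return {(m, r): record} for m completed and r synthetic implicates each."""
    rng = random.Random(seed)
    result = {}
    for m in range(1, n_complete + 1):
        completed = generate(record, rng)  # stage 1: complete missing data
        for r in range(1, n_synth + 1):
            # Stage 2: synthesize, starting from the completed record.
            result[(m, r)] = generate(completed, rng)
    return result
```

With the defaults, `nested_implicates({"age": 45})` returns 16 records keyed by (completed implicate, synthetic implicate), and `spouse_age` is `None` whenever `married` is false.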

  16. Maintaining Multivariate Distributions • Automated creation and management of stratifying (grouping) variables and conditioning variables • Bayesian bootstrap procedure for sets of related discrete variables estimated using the automated grouping • SRMI procedure for most continuous variables using automated grouping, conditioning variable management, Bayesian model selection

  17. Maintaining Univariate Distributions • Automated management of sets of related continuous variables (e.g., earnings histories) • Within stratifying groups, automated management of a non-parametric transform with inverse transform to preserve the univariate distribution of all continuous variables within group
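
One way to realize a non-parametric transform with an inverse is to push each value through an estimated CDF to a standard normal score, then map back through the same CDF after modeling. The sketch below assumes a Gaussian kernel density estimate with a fixed, user-supplied bandwidth `h`; the bandwidth choice, the bisection inverse, and the function names are illustrative assumptions rather than details from the slides.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def kde_cdf(x, sample, h):
    """CDF of a Gaussian kernel density estimate with bandwidth h."""
    return sum(STD_NORMAL.cdf((x - s) / h) for s in sample) / len(sample)

def to_normal(x, sample, h):
    """Map x to a standard normal score through the estimated CDF."""
    p = min(max(kde_cdf(x, sample, h), 1e-12), 1 - 1e-12)
    return STD_NORMAL.inv_cdf(p)

def from_normal(z, sample, h, tol=1e-9):
    """Invert to_normal by bisection on the monotone KDE CDF."""
    target = STD_NORMAL.cdf(z)
    lo, hi = min(sample) - 10 * h, max(sample) + 10 * h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kde_cdf(mid, sample, h) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Because values drawn on the normal scale are mapped back through the inverse of the estimated CDF, the back-transformed values follow the estimated within-group distribution of the original variable.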

  18. SRMI Example: Date of Birth • Link administrative birth date (more accurate) • Take birth date from Bayesian bootstrap link of couple administrative records when SSN is not available • Formulate grouping and control variable lists and hierarchy (two sets) • Perform overall stratifications, sample size checks

  19. SRMI Example: Date of Birth • By unique values of the grouping variables • Estimate the pdf of birth date using a kernel density estimator • Transform birth date to normal using the estimated KDE • Estimate a linear regression of transformed birth date on the master list of control variables for this group • Use Bayesian model selection to prune variable list • Re-estimate the linear regression using the Bayesian Normal-Inverse Gamma natural conjugate posterior (flat priors) • Sample from the posterior distribution of β and σ² • Given (β, σ²), sample from the predictive distribution of transformed birth date • Invert the transformation on birth date
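
The regression steps above (posterior draw of β and σ², then a predictive draw) can be sketched for the simplest case of one control variable. This is an illustration under stated assumptions, not the production code: it takes already-transformed values `z`, uses a single control `x`, skips Bayesian model selection, and uses the standard flat-prior result that σ² given the data is the residual sum of squares divided by a chi-square draw with n - 2 degrees of freedom. The function name `draw_synthetic` is hypothetical.

```python
import random
from statistics import fmean

def draw_synthetic(z, x, x_new, rng):
    """One posterior-predictive draw of transformed values at x_new."""
    n = len(z)
    xbar, zbar = fmean(x), fmean(z)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b_hat = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
    sse = sum((zi - zbar - b_hat * (xi - xbar)) ** 2 for xi, zi in zip(x, z))
    # sigma^2 | data = SSE / chi-square(n - 2); a chi-square with v degrees
    # of freedom is a Gamma with shape v/2 and scale 2.
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)
    sigma = sigma2 ** 0.5
    # With x centred, intercept and slope are independent given sigma^2.
    a = rng.gauss(zbar, sigma / n ** 0.5)
    b = rng.gauss(b_hat, sigma / sxx ** 0.5)
    # Predictive draw: regression mean plus fresh noise.
    return [a + b * (xi - xbar) + rng.gauss(0.0, sigma) for xi in x_new]
```

Each call produces one synthetic implicate on the transformed scale; the final slide step, inverting the transformation, would then map these values back to birth dates.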

  20. SRMI Example: Critical Dates

  21. Bayesian Bootstrap Method Details • The BB is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables (Rubin 1981) • Automated stratification into homogeneous groups • Within groups do a Bayesian bootstrap of all variables to be synthesized at the same time • Similar to a standard bootstrap except that it accounts for the fact that the multivariate distribution is measured with error in the sample.
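
A minimal sketch of a Bayesian bootstrap draw in the sense of Rubin (1981): instead of giving every observed record probability 1/n as the standard bootstrap does, the selection probabilities are themselves drawn from a Dirichlet(1, ..., 1) distribution, which can be generated by normalizing independent Exponential(1) draws. The function name and record layout are assumptions made for illustration.

```python
import random

def bayesian_bootstrap(records, n_draws, rng):
    """Resample whole records using Dirichlet(1, ..., 1) selection weights."""
    # Normalized Exponential(1) draws are Dirichlet(1, ..., 1) distributed.
    gammas = [rng.expovariate(1.0) for _ in records]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # Drawing entire records keeps the joint relationships among the
    # variables that are synthesized together.
    return rng.choices(records, weights=weights, k=n_draws)
```

For example, `bayesian_bootstrap(group_records, len(group_records), random.Random(0))` gives one synthetic copy of a homogeneous group; repeating with fresh weights gives further implicates.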

  22. BB example: Missing Administrative Data • Stratify households with missing IRS and SSA data (no SSN) into • Single • Married missing both SSNs • Married missing one SSN • For each set above, form grouping variable lists and hierarchy • Check overall sample sizes and establish by-groups

  23. BB example: Missing Administrative Data • For each unique value of variables in the grouping set • Impute the complete set of missing administrative records using BB from the sample of complete records in the same group • Couples are BB imputed together • When only one member of a couple has missing administrative data, the donor comes from a BB of couples with similar spouses (based on the grouping variables)
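
The grouped imputation described above can be sketched as a Bayesian-bootstrap donor draw within each group. In this hypothetical layout each record is a dictionary, `group_key` names the field holding the grouping value, and `admin_key` holds the whole administrative block (for a couple, the block would contain both spouses' records so that they are imputed together). Field and function names are assumptions, not taken from the project.

```python
import random
from collections import defaultdict

def bb_impute(records, group_key, admin_key, rng):
    """Fill missing admin blocks with a Bayesian-bootstrap donor from the same group."""
    donors = defaultdict(list)
    for rec in records:
        if rec[admin_key] is not None:
            donors[rec[group_key]].append(rec[admin_key])
    # One set of Dirichlet(1, ..., 1) weights per group of complete records.
    weights = {}
    for group, pool in donors.items():
        draws = [rng.expovariate(1.0) for _ in pool]
        total = sum(draws)
        weights[group] = [d / total for d in draws]
    completed = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec[admin_key] is None:
            group = rec[group_key]
            # A group with no complete records raises KeyError here; in
            # practice the grouping would need to be coarsened first.
            rec[admin_key] = rng.choices(donors[group],
                                         weights=weights[group], k=1)[0]
        completed.append(rec)
    return completed
```

Records that already have administrative data pass through unchanged, and a donor is always taken from the recipient's own group.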

  24. Steps after Synthesizing • Two criteria for judging success • Confidentiality protection • Statistical usefulness (Analytical validity) • Perform two types of tests • Probabilistic record linkage re-identification tests: can SIPP respondents in synthetic data be linked back to already existing public use data? • Use synthetic data for analyses and compare results to results obtained using non-synthetic data

  25. Confidentiality Protection • Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based • This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF • Goals: • re-identification of SIPP records from the PUF should result in very few true matches • any candidate match should have substantial uncertainty regarding its status as true or false

  26. Disclosure Avoidance Analysis • Uses probabilistic record linking and two types of distance-based record linking • Each synthetic implicate is matched back to the gold standard • All unsynthesized variables are used as blocking variables • Different matching variable sets are used in the probabilistic record linking • All synthesized variables are used in the distance-based record linking
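
A distance-based re-identification test of the kind outlined above can be sketched as follows: block on the unsynthesized variables, then, within each block, link every synthetic record to the nearest gold-standard record on the synthesized variables and count how often that nearest record is the true source. The sketch assumes a shared `id` field that exists only for evaluation, numeric matching variables, and plain squared Euclidean distance without standardization; the slides do not say which two distance measures were used.

```python
def reidentification_rate(synthetic, gold, block_keys, match_keys):
    """Share of synthetic records whose nearest in-block gold record is their source."""
    blocks = {}
    for g in gold:
        blocks.setdefault(tuple(g[k] for k in block_keys), []).append(g)
    hits = 0
    for s in synthetic:
        candidates = blocks.get(tuple(s[k] for k in block_keys), [])
        if not candidates:
            continue  # no gold record shares this block
        nearest = min(
            candidates,
            key=lambda g: sum((s[k] - g[k]) ** 2 for k in match_keys),
        )
        hits += nearest["id"] == s["id"]
    return hits / len(synthetic)
```

A rate close to the inverse of the typical block size would suggest that the synthesized variables carry little information for linking; a rate near 1 would indicate that synthesis left records easy to re-identify.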

  27. Matching Variables and Associated M and U Probabilities
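
For readers unfamiliar with M and U probabilities: in probabilistic record linkage of the Fellegi-Sunter type, m is the probability that a variable agrees given that the two records are a true match, and u is the probability that it agrees given a non-match. The sketch below shows how such probabilities are conventionally turned into a match score; the variable names and numeric values in the usage note are made up for illustration and are not the values from this slide's table.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log2 likelihood-ratio weights over the matching variables.

    agreements maps variable name -> True if the pair agrees on it.
    """
    total = 0.0
    for var, agrees in agreements.items():
        if agrees:
            total += math.log2(m[var] / u[var])
        else:
            total += math.log2((1 - m[var]) / (1 - u[var]))
    return total
```

With hypothetical values `m = {"birth_year": 0.9}` and `u = {"birth_year": 0.1}`, agreement on birth year adds log2(9) to the score and disagreement subtracts the same amount; candidate pairs are then ranked or thresholded on the total.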

  28. Probabilistic Record Linking Results

  29. Distance-based Linking Results

  30. Analytical Validity • All univariate distributions • Selected first, second and third-order interactions • Selected linear and non-linear multivariate models • Small micro-simulations

  31. Log Total Earnings White Males

  32. Log Total Earnings Black Males

  33. Log AIME/AMW All Individuals

  34. Log Initial MBA All Retired Individuals

  35. Log Initial MBA Disabled Individuals

  36. Logistic Regression: Has a DB or DC Pension

  37. Lifetime Total FICA Earnings

  38. Lifetime Total FICA Work Years

  39. Micro-simulation of Retirement Accounts

  40. Next Steps • Census DRB has approved release • IRS Disclosure Officer has completed review and will approve release • SSA is negotiating with the Census Bureau the terms of the Beta and Final releases • Released data will be fully supported on the Cornell Virtual Research Data Center • Some models estimated on the Beta release will be re-estimated on the Gold Standard to further assess its analytical validity
