Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1. John M. Abowd, U.S. Census Bureau and Cornell University. CNSTAT Panel on the Census Bureau’s Dynamics of Economic Well-being System, January 26, 2007.
Background • Long-standing goal of the Census Bureau • Statutory mandate to provide survey data used to study critical policy issues • Focus of a long-standing internal Census Bureau survey improvement project that is part of the LEHD Program • This is the first Title 13/Chapter 5 predominant purpose for using IRS data • Treasury Regulation Change, February 2001 (final regulation February 2003) • New W-2 items authorized: SSN, EIN, Box 1, Box 3, Box 13, number of quarters, 1099R • Creation of a public use data set that integrates survey and administrative data is the other predominant Title 13/Chapter 5 purpose
Team and Sponsorship • The project was conducted by a team of researchers from the Census Bureau, IRS, Social Security Administration, and a consortium of university partners • Main financial support provided by the Census Bureau, Social Security Administration, and the National Science Foundation • Primary design decisions made by an inter-agency team led by Martha Stinson at the Census Bureau, with the participation of SSA, IRS, the Congressional Budget Office, and the Joint Committee on Taxation
Acknowledgements: Research Team • Martha Stinson (Census Bureau), project manager • Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan Ricchetti (Census Bureau) • Karen Masken (IRS) • Simon Woodcock (Simon Fraser University), Jerry Reiter (Duke University), Josep Domingo-Ferrer (Rovira i Virgili University), Vicenç Torra (University of Barcelona), Lars Vilhuber (Cornell University and Census Bureau), consultants
Acknowledgements I: Agencies • Kenneth Prewitt, C. Louis Kincannon, Hermann Habermann, Paula Schneider, Nancy Gordon, Frederick Knickerbocker, Cynthia Clark, Howard Hogan, and Thomas Mesenbourg, senior management Census Bureau • Susan Grad, Howard Iams, and Paul van de Water, senior management SSA • Mark Mazur and Nicholas Greenia, IRS senior management and IRS/SOI Census Bureau disclosure liaison • Daniel Newlon, NSF project officer
Acknowledgements II: Agencies • Chet Bowie, Al Tupek, Barry Sessamen, Dan Weinberg, Ron Prevost, Jeremy Wu, division and program management Census Bureau • Brian Greenberg, Dawn Haynes, SSA technical support, contract management, and disclosure officers • Patricia Doyle, Judith Eargle, and Nancy Bates, Census Bureau SIPP research direction • Charlene Leggieri and Sally Obenski, Census Bureau administrative records management • Laura Zayatz, Census Bureau statistical disclosure research direction • John Sabelhaus, Congressional Budget Office research direction
Conceptual Framework • Link all SIPP panels from the 1990s • Five panels: 1990, 1991, 1992, 1993, 1996 • Link to IRS data • Summary Earnings Records (FICA taxable earnings 1937-1950, and 1951-2003 annual) • Detailed Earnings Record (job level data, uncapped, 1978-2003 annual) • SSA benefits data • Master Beneficiary Record, Supplemental Security Record, Payment History Update System, 831 file (all available historical data through 2002) • Create product that prevents individuals from being re-identified in the current public use SIPP files
Major Design Decisions • Limit number of SIPP variables included • Target national retirement and disability research communities • Investigate disclosure avoidance methods to protect both survey and administrative data • But, note that a re-identification in the current SIPP public use files is not a disclosure since those files have also been subjected to extensive disclosure avoidance procedures • Very high hurdle
Latest Versions • Gold Standard confidential file at release 4.0 • All confidential data (person-level), all sources • Beta Public Use File 4.1 • All person-level SIPP, IRS variables from the Gold Standard Version 4.0 • Benefit and type of benefit measures for initial SSA benefit (if any), benefit and type of benefit as of April 1, 2000 • Consistent panel weight for civilian, non-institutional population as of April 1, 2000 (synthesized on each implicate) • Four missing data implicates with four synthetic implicates each (16 implicates total)
Summary of Discussion Today • A tour of the methods used to complete and synthesize the SIPP-PUF • Some disclosure avoidance results • Selected analytical validity results
Multiple Imputation Confidentiality Protection History • Rubin (1993): treat unsampled individuals in the population as missing the survey data, impute missing values (synthetic population), sample and release (fully synthetic data) • Little (1993): treat sensitive values as missing, impute and release imputed values (partially synthetic data) • Fienberg (1994): parametric Bayesian procedure eliminated the use of any actual values in synthetic data • Raghunathan, Reiter, and Rubin (2003): adapted the Sequential Regression Multivariate Imputation method to synthetic data • Reiter (2004): inference-valid combination of multiple imputation for missing and synthetic data • Abowd and Woodcock (2001): applied SRMI to confidentiality protection of longitudinally linked employer-employee synthetic micro-data
Multiple Imputation Confidentiality Protection Methods • Denote confidential data by Y and nonconfidential data by X (may be empty) • Both Y and X may contain missing data, so that Y = (Yobs, Ymis) and X = (Xobs, Xmis) • Assume the database can be represented by the joint density p(Y, X, θ) • Estimate the posterior predictive distribution p(Ynew, Xnew | Yobs, Xobs) • Sample multiple times from the posterior predictive distribution, release these samples
Sequential Regression Multivariate Imputation (SRMI) Method • Synthetic data values are draws from the posterior predictive density p(Ynew, Xnew | Yobs, Xobs) • In practice, use a two-step procedure: 1) complete the missing data using SRMI; 2) draw synthetic data from the predictive density given the completed data • Repeating the procedure yields multiple synthetic data implicates
SRMI Method Details • Specifying the joint density p(Y,X,θ) is unrealistic in most applications • Instead, approximate the joint density by a sequence of conditional densities defined by generalized linear models • Synthetic values of a given variable are draws from its conditional density given the other variables, where Ym, Xm are the completed data and the densities pk are defined by an appropriate generalized linear model and prior, a Dirichlet-multinomial model, or a Bayesian bootstrap
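The sequence-of-conditionals idea can be illustrated with a deliberately small sketch. It assumes only two continuous variables, a simple linear regression for each conditional, and imputations drawn as prediction plus normal noise; parameter uncertainty is ignored, so this is a simplification and not the project's procedure. The names `ols`, `srmi_complete`, and the toy data are hypothetical.

```python
import random
import statistics

def ols(xs, ys):
    """Intercept, slope, and residual sd of a simple regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - intercept - slope * x for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def srmi_complete(rows, n_iter=9, rng=None):
    """rows: list of [a, b] with None marking missing cells. Returns a completed copy."""
    rng = rng or random.Random(0)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Initialize every missing cell with the observed column mean.
    for j in range(2):
        mean_j = statistics.fmean(r[j] for r in rows if r[j] is not None)
        for i, r in enumerate(data):
            if missing[i][j]:
                r[j] = mean_j
    # Cycle through the variables, re-imputing each from its conditional model.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j  # the other variable acts as the predictor
            fit = [r for i, r in enumerate(data) if not missing[i][j]]
            b0, b1, sd = ols([r[k] for r in fit], [r[j] for r in fit])
            for i, r in enumerate(data):
                if missing[i][j]:
                    r[j] = b0 + b1 * r[k] + rng.gauss(0.0, sd)
    return data

# Toy data (assumption): b depends linearly on a, with some cells deleted.
gen = random.Random(1)
rows = []
for i in range(200):
    a = gen.gauss(0.0, 1.0)
    b = 2.0 * a + gen.gauss(0.0, 0.5)
    rows.append([None if i % 7 == 0 else a, None if i % 5 == 0 else b])

completed = srmi_complete(rows, rng=random.Random(2))
```

Calling `srmi_complete` several times with different random seeds would give several completed implicates; observed cells are never altered.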
Maintaining Relationships in the Underlying Data • Define a multilevel parent-child tree to describe the exact relationships in the data • Variables at the root of this tree should have values for all individuals, completed and synthesized first (but as a function of all data) • Child variables only completed or synthesized when appropriate given the parent variable • For missing data, iterate nine times to complete all missing data, sample 4 implicates • For synthetic data, condition on values from the completed data, sample 4 implicates per completed implicate
Maintaining Multivariate Distributions • Automated creation and management of stratifying (grouping) variables and conditioning variables • Bayesian bootstrap procedure for sets of related discrete variables estimated using the automated grouping • SRMI procedure for most continuous variables using automated grouping, conditioning variable management, Bayesian model selection
Maintaining Univariate Distributions • Automated management of sets of related continuous variables (e.g., earnings histories) • Within stratifying groups, automated management of a non-parametric transform with inverse transform to preserve the univariate distribution of all continuous variables within group
SRMI Example: Date of Birth • Link administrative birth date (more accurate) • Take birth date from Bayesian bootstrap link of couple administrative records when SSN is not available • Formulate grouping and control variable lists and hierarchy (two sets) • Perform overall stratifications, sample size checks
SRMI Example: Date of Birth • By unique values of the grouping variables • Estimate the pdf of birth date using a kernel density estimator • Transform birth date to normal using the estimated KDE • Estimate a linear regression of transformed birth date on the master list of control variables for this group • Use Bayesian model selection to prune variable list • Re-estimate the linear regression using the Bayesian Normal-Inverse Gamma natural conjugate posterior (flat priors) • Sample from the posterior distribution of β and σ² • Given β and σ², sample from the predictive distribution of transformed birth date • Invert the transformation on birth date
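The transform, regress, sample, and invert steps above can be sketched for a single group as follows. Several simplifications are assumptions of this sketch: an empirical-CDF (rank) transform stands in for the kernel density estimate, there is one control variable and no model selection, and the flat-prior posterior is sampled in the centered parameterization. The function names and the toy data are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Map values to normal scores through their empirical CDF (ranks)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def from_normal_score(z, sorted_values):
    """Invert the transform: return the empirical quantile at Phi(z)."""
    n = len(sorted_values)
    idx = min(n - 1, max(0, int(STD_NORMAL.cdf(z) * n)))
    return sorted_values[idx]

def synthesize(values, control, rng):
    """One synthetic value per record, from a regression of normal scores on one control."""
    n = len(values)
    z = to_normal_scores(values)
    mx, mz = fmean(control), fmean(z)
    sxx = sum((x - mx) ** 2 for x in control)
    slope_hat = sum((x - mx) * (zi - mz) for x, zi in zip(control, z)) / sxx
    ssr = sum((zi - mz - slope_hat * (x - mx)) ** 2 for x, zi in zip(control, z))
    # Flat-prior posterior for the variance: sigma^2 = SSR / chi-square(n - 2).
    chi2 = 2.0 * rng.gammavariate((n - 2) / 2.0, 1.0)
    sigma = math.sqrt(ssr / chi2)
    # With a centered control, the level and slope are independent given sigma.
    level = rng.gauss(mz, sigma / math.sqrt(n))
    slope = rng.gauss(slope_hat, sigma / math.sqrt(sxx))
    ordered = sorted(values)
    # Predictive draw on the normal scale, then back-transform.
    return [from_normal_score(level + slope * (x - mx) + rng.gauss(0.0, sigma), ordered)
            for x in control]

# Toy data (assumption): birth date in days relative to a reference date, age as control.
rng = random.Random(7)
ages = [rng.uniform(30.0, 70.0) for _ in range(300)]
birth_days = [-365.25 * a + rng.gauss(0.0, 200.0) for a in ages]
synthetic_days = synthesize(birth_days, ages, rng)
```

Because the back-transform reads off empirical quantiles, every synthetic value lies in the support of the original values, which is how this sketch preserves the univariate distribution within the group.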
Bayesian Bootstrap Method Details • The BB is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables (Rubin 1981) • Automated stratification into homogeneous groups • Within groups do a Bayesian bootstrap of all variables to be synthesized at the same time • Similar to a standard bootstrap except that it accounts for the fact that the multivariate distribution is measured with error in the sample.
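A minimal sketch of one Bayesian bootstrap draw within a single group: selection probabilities are drawn from a flat Dirichlet distribution (here via normalized exponential draws), and whole records are then resampled with those probabilities. The function name and the toy records are hypothetical, and the automated stratification is not shown.

```python
import random

def bayesian_bootstrap(records, n_draws, rng):
    """Resample whole records using Dirichlet(1, ..., 1) selection probabilities."""
    # Normalized exponential draws are distributed as a flat Dirichlet.
    gaps = [rng.expovariate(1.0) for _ in records]
    total = sum(gaps)
    weights = [g / total for g in gaps]
    # Drawing complete records keeps the joint relationship among their variables.
    return rng.choices(records, weights=weights, k=n_draws), weights

# Toy group (assumption): (benefit type, monthly benefit) pairs.
group = [("retirement", 1200), ("disability", 900), ("retirement", 1500), ("none", 0)]
draws, weights = bayesian_bootstrap(group, n_draws=4, rng=random.Random(3))
```

A standard bootstrap would fix every selection probability at 1/n; drawing the probabilities themselves is what reflects uncertainty about the distribution being sampled.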
BB Example: Missing Administrative Data • Stratify households with missing IRS and SSA data (no SSN) into • Single • Married missing both SSNs • Married missing one SSN • For each set above, form grouping variable lists and hierarchy • Check overall sample sizes and establish by-groups
BB Example: Missing Administrative Data • For each unique value of variables in the grouping set • Impute the complete set of missing administrative records using BB from the sample of complete records in the same group • Couples are BB imputed together • When only one member of a couple has missing administrative data, the donor comes from a BB of couples with similar spouses (based on the grouping variables)
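The group-wise donor imputation described above might look like the following sketch. It assumes each record is a dictionary with one grouping field and one administrative field that is `None` when missing; the field names, the function `bb_impute`, and the toy records are hypothetical, and couples are not handled jointly here.

```python
import random
from collections import defaultdict

def bb_impute(records, group_key, admin_key, rng):
    """Fill missing administrative values from Bayesian bootstrap donors in the same group."""
    donors = defaultdict(list)
    for rec in records:
        if rec[admin_key] is not None:
            donors[rec[group_key]].append(rec[admin_key])
    # One set of flat-Dirichlet weights per group (random.choices normalizes them).
    weights = {g: [rng.expovariate(1.0) for _ in pool] for g, pool in donors.items()}
    completed = []
    for rec in records:
        new = dict(rec)
        g = new[group_key]
        # Groups without any complete record are left missing.
        if new[admin_key] is None and g in donors:
            new[admin_key] = rng.choices(donors[g], weights=weights[g], k=1)[0]
        completed.append(new)
    return completed

# Toy records (assumption): marital status as the grouping variable.
people = [
    {"id": 1, "status": "single", "earnings": 31000},
    {"id": 2, "status": "single", "earnings": None},
    {"id": 3, "status": "single", "earnings": 45000},
    {"id": 4, "status": "married", "earnings": 52000},
    {"id": 5, "status": "married", "earnings": None},
]
filled = bb_impute(people, "status", "earnings", random.Random(11))
```

In this sketch a group with no complete records keeps its missing values, which is one reason to check sample sizes before establishing the by-groups.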
Steps after Synthesizing • Two criteria for judging success • Confidentiality protection • Statistical usefulness (Analytical validity) • Perform two types of tests • Probabilistic record linkage re-identification tests: can SIPP respondents in synthetic data be linked back to already existing public use data? • Use synthetic data for analyses and compare results to results obtained using non-synthetic data
Confidentiality Protection • Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based • This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF • Goals: • re-identification of SIPP records from the PUF should result in very few true matches • any candidate match should have substantial uncertainty regarding its status as true or false
Disclosure Avoidance Analysis • Uses probabilistic record linking and two types of distance-based record linking • Each synthetic implicate is matched back to the gold standard • All unsynthesized variables are used as blocking variables • Different matching variable sets are used in the probabilistic record linking • All synthesized variables are used in the distance-based record linking
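One of the distance-based checks can be sketched as a nearest-neighbor match within blocks. The sketch assumes records are dictionaries that carry a shared `id` for scoring only, one unsynthesized blocking field, and numeric synthesized fields compared by Euclidean distance; the field names, the function, and the data are hypothetical, and no standardization of the matching variables is applied.

```python
import math

def reidentification_rate(gold, synthetic, block_key, match_keys):
    """Share of synthetic records whose nearest gold-standard record is their true source."""
    by_block = {}
    for g in gold:
        by_block.setdefault(g[block_key], []).append(g)
    hits = 0
    for s in synthetic:
        candidates = by_block.get(s[block_key], [])
        if not candidates:
            continue
        point = [s[k] for k in match_keys]
        nearest = min(candidates, key=lambda g: math.dist(point, [g[k] for k in match_keys]))
        hits += nearest["id"] == s["id"]  # the id is used only to score the match
    return hits / len(synthetic)

# Toy files (assumption): "sex" is unsynthesized, "earnings" is synthesized.
gold = [
    {"id": 1, "sex": "F", "earnings": 10.0},
    {"id": 2, "sex": "F", "earnings": 20.0},
    {"id": 3, "sex": "M", "earnings": 30.0},
]
synthetic = [
    {"id": 1, "sex": "F", "earnings": 19.0},  # nearest gold record is id 2: false match
    {"id": 2, "sex": "F", "earnings": 21.0},  # nearest gold record is id 2: true match
    {"id": 3, "sex": "M", "earnings": 5.0},   # only one candidate in the block: true match
]
rate = reidentification_rate(gold, synthetic, "sex", ["earnings"])
```

The third record shows why small blocks matter: when a block holds a single candidate, the match is correct regardless of how far the synthetic values have moved.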
Analytical Validity • All univariate distributions • Selected first, second and third-order interactions • Selected linear and non-linear multivariate models • Small micro-simulations
Next Steps • Census DRB has approved release • IRS Disclosure Officer has completed review and will approve release • SSA is negotiating with the Census Bureau the terms of the Beta and Final releases • Released data will be fully supported on the Cornell Virtual Research Data Center • Some models estimated on the Beta release will be re-estimated on the Gold Standard to further assess its analytical validity