John M. Abowd U.S. Census Bureau and Cornell University

Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1 • John M. Abowd, U.S. Census Bureau and Cornell University • CNSTAT Panel on the Census Bureau’s Dynamics of Economic Well-being System • January 26, 2007

Background • Long-standing goal of the Census Bureau • Statutory mandate to provide survey data used to study critical policy issues • Focus of a long-standing internal Census Bureau survey improvement project that is part of the LEHD Program • This is the first Title 13/Chapter 5 predominant purpose for using IRS data • Treasury Regulation Change, February 2001 (final regulation February 2003) • New W-2 items authorized: SSN, EIN, Box 1, Box 3, Box 13, number of quarters, 1099R • Creation of a public use data set that integrates survey and administrative data is the other predominant Title 13/Chapter 5 purpose

Team and Sponsorship • The project was conducted by a team of researchers from the Census Bureau, IRS, Social Security Administration, and a consortium of university partners • Main financial support provided by the Census Bureau, Social Security Administration, and the National Science Foundation • Primary design decisions made by an inter-agency team led by Martha Stinson at the Census Bureau, with the participation of SSA, IRS, the Congressional Budget Office, and the Joint Committee on Taxation

Acknowledgements: Research Team • Martha Stinson (Census Bureau), project manager • Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan Ricchetti (Census Bureau) • Karen Masken (IRS) • Simon Woodcock (Simon Fraser University), Jerry Reiter (Duke University), Josep Domingo-Ferrer (Rovira i Virgili University), Vicenc Torra (University of Barcelona), Lars Vilhuber (Cornell University and Census Bureau), consultants

Acknowledgements I: Agencies • Kenneth Prewitt, C. Louis Kincannon, Hermann Habermann, Paula Schneider, Nancy Gordon, Frederick Knickerbocker, Cynthia Clark, Howard Hogan, and Thomas Mesenbourg, senior management Census Bureau • Susan Grad, Howard Iams, and Paul van de Water, senior management SSA • Mark Mazur and Nicholas Greenia, IRS senior management and IRS/SOI Census Bureau disclosure liaison • Daniel Newlon, NSF project officer

Acknowledgements II: Agencies • Chet Bowie, Al Tupek, Barry Sessamen, Dan Weinberg, Ron Prevost, Jeremy Wu, division and program management Census Bureau • Brian Greenberg, Dawn Haynes, SSA technical support, contract management, and disclosure officers • Patricia Doyle, Judith Eargle, and Nancy Bates, Census Bureau SIPP research direction • Charlene Leggieri and Sally Obenski, Census Bureau administrative records management • Laura Zayatz, Census Bureau statistical disclosure research direction • John Sabelhaus, Congressional Budget Office research direction

Conceptual Framework • Link all SIPP panels from the 1990s • Five panels: 1990, 1991, 1992, 1993, 1996 • Link to IRS data • Summary Earnings Records (FICA taxable earnings 1937-1950, and 1951-2003 annual) • Detailed Earnings Record (job level data, uncapped, 1978-2003 annual) • SSA benefits data • Master Beneficiary Record, Supplemental Security Record, Payment History Update System, 831 file (all available historical data through 2002) • Create product that prevents individuals from being re-identified in the current public use SIPP files

Major Design Decisions • Limit number of SIPP variables included • Target national retirement and disability research communities • Investigate disclosure avoidance methods to protect both survey and administrative data • But, note that a re-identification in the current SIPP public use files is not a disclosure since those files have also been subjected to extensive disclosure avoidance procedures • Very high hurdle

Latest Versions • Gold Standard confidential file at release 4.0 • All confidential data (person-level), all sources • Beta Public Use File 4.1 • All person-level SIPP, IRS variables from the Gold Standard Version 4.0 • Benefit and type of benefit measures for initial SSA benefit (if any), benefit and type of benefit as of April 1, 2000 • Consistent panel weight for civilian, non-institutional population as of April 1, 2000 (synthesized on each implicate) • Four missing data implicates with four synthetic implicates each (16 implicates total)

Summary of Discussion Today • A tour of the methods used to complete and synthesize the SIPP-PUF • Some disclosure avoidance results • Selected analytical validity results

Multiple Imputation Confidentiality Protection History • Rubin (1993): treat unsampled individuals in population as missing the survey data, impute missing values (synthetic population), sample and release (fully synthetic data) • Little (1993): treat sensitive values as missing, impute and release imputed values (partially synthetic data) • Fienberg (1994): parametric Bayesian procedure eliminated the use of any actual values in synthetic data • Raghunathan, Reiter, and Rubin (2003): adapted the Sequential Regression Multivariate Imputation method to synthetic data • Reiter (2004): Inference-valid combination of multiple imputation for missing and synthetic data • Abowd and Woodcock (2001): Applied SRMI to confidentiality protection of longitudinally linked employer-employee synthetic micro-data

Multiple Imputation Confidentiality Protection Methods • Denote confidential data by Y and nonconfidential data by X (may be empty) • Both Y and X may contain missing data, so that Y = (Yobs, Ymis) and X = (Xobs, Xmis) • Assume database can be represented by joint density p(Y, X, θ) • Estimate the posterior predictive distribution p(Ynew, Xnew | Yobs, Xobs) • Sample multiple times from the posterior predictive distribution, release these samples
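As a minimal sketch of "sample multiple times from the posterior predictive distribution", the example below uses a deliberately simple model that is not the project's: a single confidential variable assumed to be normal, with the standard noninformative prior. The function name `synthesize_normal` and the toy values are hypothetical.

```python
import random
import statistics

def synthesize_normal(y_obs, n_implicates=4, seed=0):
    """Draw synthetic implicates from the posterior predictive of a normal model.

    Assumption for illustration: y_i ~ N(mu, sigma^2) with prior
    p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    rng = random.Random(seed)
    n = len(y_obs)
    ybar = statistics.fmean(y_obs)
    s2 = statistics.variance(y_obs)  # sample variance, divisor n - 1
    implicates = []
    for _ in range(n_implicates):
        # sigma^2 | y  ~  (n - 1) s^2 / chi^2_{n-1}; chi^2_k is Gamma(k/2, scale=2)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, y  ~  N(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # Y_new | mu, sigma^2  ~  N(mu, sigma^2): one full synthetic data set
        implicates.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)])
    return implicates

confidential = [10.2, 11.5, 9.8, 10.9, 12.1, 10.4, 11.0, 9.5]  # toy values
released = synthesize_normal(confidential)
```

Because the parameters are redrawn for every implicate, the released data sets differ from one another by more than sampling noise, which is what lets an analyst account for the uncertainty introduced by synthesis.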

Sequential Regression Multivariate Imputation (SRMI) Method • Synthetic data values are draws from the posterior predictive density p(Ynew, Xnew | Yobs, Xobs) • In practice, use a two-step procedure: 1) complete the missing data using SRMI; 2) draw synthetic data from the predictive density given the completed data • Repeating the procedure yields multiple synthetic data implicates

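SRMI Method Details • Specifying the joint density p(Y,X,θ) is unrealistic in most applications • Instead, approximate the joint density by a sequence of conditional densities defined by generalized linear models • Synthetic values of some variable $y_k$ are draws from $p(\tilde{y}_k \mid Y^m, X^m) = \int p_k(\tilde{y}_k \mid Y^m_{\sim k}, X^m, \theta_k)\, p_k(\theta_k \mid Y^m, X^m)\, d\theta_k$, where $Y^m$, $X^m$ are completed data, $Y^m_{\sim k}$ denotes the completed confidential variables other than $y_k$, and the densities $p_k$ are defined by an appropriate generalized linear model and prior, a Dirichlet-multinomial model, or a Bayesian Bootstrap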
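The idea of approximating a joint density by a sequence of conditional models can be sketched with two continuous variables and simple linear regressions. This is an illustration under stated assumptions, not the production code: the helper names, the flat prior, and the two-variable setup are all hypothetical, and the real system uses many variables and richer conditional models.

```python
import random

def draw_regression(x, y, rng):
    """Posterior draw for y = alpha + beta * (x - xbar) + e under a flat prior."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    ssr = sum((yi - ybar - b_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))
    # sigma^2 | data ~ SSR / chi^2_{n-2}; chi^2_k is Gamma(k/2, scale=2)
    sigma2 = ssr / rng.gammavariate((n - 2) / 2.0, 2.0)
    # With x centred, intercept and slope are independent given sigma^2.
    alpha = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    beta = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)
    return alpha, beta, sigma2 ** 0.5, xbar

def srmi_complete(data, n_iter=9, seed=0):
    """data: rows [y1, y2] with None for missing. Returns one completed copy."""
    rng = random.Random(seed)
    rows = [list(r) for r in data]
    missing = [[v is None for v in r] for r in data]
    # Start from column means so every conditioning value is defined.
    for k in range(2):
        observed = [r[k] for r in data if r[k] is not None]
        mean_k = sum(observed) / len(observed)
        for r in rows:
            if r[k] is None:
                r[k] = mean_k
    for _ in range(n_iter):
        for k in range(2):      # variable currently being imputed
            j = 1 - k           # variable it is conditioned on
            fit = [(r[j], r[k]) for r, m in zip(rows, missing) if not m[k]]
            a, b, s, xbar = draw_regression(
                [f[0] for f in fit], [f[1] for f in fit], rng)
            for r, m in zip(rows, missing):
                if m[k]:
                    r[k] = a + b * (r[j] - xbar) + rng.gauss(0.0, s)
    return rows
```

Each pass refits every conditional model on the currently completed data and redraws only the cells that were originally missing, so observed values are never altered.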

Maintaining Relationships in the Underlying Data • Define a multilevel parent-child tree to describe the exact relationships in the data • Variables at the root of this tree should have values for all individuals, completed and synthesized first (but as a function of all data) • Child variables only completed or synthesized when appropriate given the parent variable • For missing data, iterate nine times to complete all missing data, sample 4 implicates • For synthetic data, condition on values from the completed data, sample 4 implicates per completed implicate
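A parent-child tree of this kind can be sketched as a dictionary in which each child names its parent and the parent value that makes it applicable. The variable names and the placeholder samplers below are hypothetical stand-ins for the fitted conditional models.

```python
import random

# Hypothetical variable tree: roots have parent None; a child is only
# synthesized when its parent takes the value in "applies_when".
TREE = {
    "married": {"parent": None, "applies_when": None},
    "spouse_birth_year": {"parent": "married", "applies_when": True},
    "has_pension": {"parent": None, "applies_when": None},
    "pension_balance": {"parent": "has_pension", "applies_when": True},
}

# Placeholder samplers standing in for the estimated models.
SAMPLERS = {
    "married": lambda rec, rng: rng.random() < 0.6,
    "spouse_birth_year": lambda rec, rng: rng.randint(1930, 1975),
    "has_pension": lambda rec, rng: rng.random() < 0.4,
    "pension_balance": lambda rec, rng: round(rng.lognormvariate(10.0, 1.0), 2),
}

def parents_first(tree):
    """Order variables so that every parent precedes its children."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        parent = tree[name]["parent"]
        if parent is not None:
            visit(parent)
        ordered.append(name)

    for name in tree:
        visit(name)
    return ordered

def synthesize_record(tree, samplers, rng):
    record = {}
    for name in parents_first(tree):
        parent = tree[name]["parent"]
        if parent is None or record[parent] == tree[name]["applies_when"]:
            record[name] = samplers[name](record, rng)
        else:
            record[name] = None  # child not applicable for this parent value
    return record
```

Processing parents first guarantees that a synthetic record never carries, for example, a spouse birth year for someone synthesized as unmarried.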

Maintaining Multivariate Distributions • Automated creation and management of stratifying (grouping) variables and conditioning variables • Bayesian bootstrap procedure for sets of related discrete variables estimated using the automated grouping • SRMI procedure for most continuous variables using automated grouping, conditioning variable management, Bayesian model selection

Maintaining Univariate Distributions • Automated management of sets of related continuous variables (e.g., earnings histories) • Within stratifying groups, automated management of a non-parametric transform with inverse transform to preserve the univariate distribution of all continuous variables within group
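One way to picture a non-parametric transform with an inverse is a rank-based normal-scores transform within a group. Note that this sketch swaps in the empirical distribution function for whatever estimator the production system uses; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Map each value to z = Phi^{-1}(rank / (n + 1)) within its group."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return z

def from_normal_scores(z_new, reference):
    """Map z values back to the original scale by interpolating the
    order statistics of the reference (confidential) values."""
    ref = sorted(reference)
    n = len(ref)
    out = []
    for z in z_new:
        pos = STD_NORMAL.cdf(z) * (n + 1) - 1   # fractional index into ref
        pos = min(max(pos, 0.0), n - 1.0)       # stay inside the observed range
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append(ref[lo] + (pos - lo) * (ref[hi] - ref[lo]))
    return out
```

Models are then fitted and sampled on the transformed scale, and the inverse maps the synthetic draws back so that their marginal distribution within the group resembles the original one.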

SRMI Example: Date of Birth • Link administrative birth date (more accurate) • Take birth date from Bayesian bootstrap link of couple administrative records when SSN is not available • Formulate grouping and control variable lists and hierarchy (two sets) • Perform overall stratifications, sample size checks

SRMI Example: Date of Birth • By unique values of the grouping variables • Estimate the pdf of birth date using a kernel density estimator • Transform birth date to normal using the estimated KDE • Estimate a linear regression of transformed birth date on the master list of control variables for this group • Use Bayesian model selection to prune variable list • Re-estimate the linear regression using the Bayesian Normal-Inverse Gamma natural conjugate posterior (flat priors) • Sample from the posterior distribution of β and σ² • Given the sampled parameters, sample from the predictive distribution of transformed birth date • Invert the transformation on birth date
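The steps for one group can be sketched end to end as follows. This is a simplified illustration, not the project's implementation: it uses a single hypothetical control variable, skips the Bayesian model selection step, assumes a Gaussian kernel with a normal-reference bandwidth, and measures birth date in fractional years.

```python
import random
from statistics import NormalDist, stdev

PHI = NormalDist()

def kde_cdf(x, sample, h):
    """CDF implied by a Gaussian kernel density estimate with bandwidth h."""
    return sum(PHI.cdf((x - s) / h) for s in sample) / len(sample)

def kde_inverse_cdf(p, sample, h, tol=1e-6):
    """Invert the KDE CDF by bisection."""
    lo, hi = min(sample) - 10 * h, max(sample) + 10 * h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kde_cdf(mid, sample, h) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clip(p, eps=1e-6):
    return min(max(p, eps), 1 - eps)

def synthesize_group(birth, control, seed=0):
    rng = random.Random(seed)
    n = len(birth)
    h = 1.06 * stdev(birth) * n ** (-0.2)   # normal-reference bandwidth (assumption)
    # Transform birth date to normal using the estimated KDE.
    z = [PHI.inv_cdf(clip(kde_cdf(b, birth, h))) for b in birth]
    # Least-squares fit of z on the centred control variable.
    cbar, zbar = sum(control) / n, sum(z) / n
    sxx = sum((c - cbar) ** 2 for c in control)
    b_hat = sum((c - cbar) * (zi - zbar) for c, zi in zip(control, z)) / sxx
    ssr = sum((zi - zbar - b_hat * (c - cbar)) ** 2 for c, zi in zip(control, z))
    # Flat-prior posterior: sigma^2 ~ SSR / chi^2_{n-2}, then coefficients given sigma^2.
    sigma2 = ssr / rng.gammavariate((n - 2) / 2.0, 2.0)
    alpha = rng.gauss(zbar, (sigma2 / n) ** 0.5)
    beta = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)
    synthetic = []
    for c in control:
        # Predictive draw on the normal scale, then invert the transformation.
        z_new = alpha + beta * (c - cbar) + rng.gauss(0.0, sigma2 ** 0.5)
        synthetic.append(kde_inverse_cdf(clip(PHI.cdf(z_new)), birth, h))
    return synthetic
```

Because the predictive draw is mapped back through the inverse of the same estimated distribution, synthetic birth dates stay within a plausible range for the group even though no original value is copied.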

SRMI Example: Critical Dates

Bayesian Bootstrap Method Details • The BB is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables (Rubin 1981) • Automated stratification into homogeneous groups • Within groups do a Bayesian bootstrap of all variables to be synthesized at the same time • Similar to a standard bootstrap except that it accounts for the fact that the multivariate distribution is measured with error in the sample.
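A minimal sketch of a Bayesian bootstrap draw within one homogeneous group, following the usual description of Rubin (1981): instead of resampling records with equal probabilities, the selection probabilities are themselves drawn from a Dirichlet(1, ..., 1) distribution. Function names and the toy records are hypothetical.

```python
import random

def dirichlet_weights(n, rng):
    """Dirichlet(1, ..., 1) weights via normalised Exponential(1) draws."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [gi / total for gi in g]

def bayesian_bootstrap(records, k, rng):
    """Draw k whole records; each record keeps all of its variables together."""
    weights = dirichlet_weights(len(records), rng)
    return rng.choices(records, weights=weights, k=k)

rng = random.Random(0)
group = [("M", 1950, 32000.0), ("F", 1948, 0.0), ("F", 1961, 18500.0)]  # toy records
synthetic = bayesian_bootstrap(group, k=3, rng=rng)
```

Drawing the weights first is what distinguishes this from a standard bootstrap: the randomness in the weights reflects uncertainty about the group's distribution, and sampling whole records preserves the relationships among the variables synthesized together.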

BB example: Missing Administrative Data • Stratify households with missing IRS and SSA data (no SSN) into • Single • Married missing both SSNs • Married missing one SSN • For each set above, form grouping variable lists and hierarchy • Check overall sample sizes and establish by-groups

BB example: Missing Administrative Data • For each unique value of variables in the grouping set • Impute the complete set of missing administrative records using BB from the sample of complete records in the same group • Couples are BB imputed together • When only one member of a couple has missing administrative data, the donor comes from a BB of couples with similar spouses (based on the grouping variables)
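The group-wise imputation can be sketched as below. The record layout (a dict with an `admin` field that is `None` when the administrative data are missing) and the grouping variable names are hypothetical; a couple is represented as one unit whose `admin` value holds both spouses' records, so the pair is imputed together.

```python
import random
from collections import defaultdict

def bb_impute_admin(units, group_vars, seed=0):
    """Fill missing 'admin' payloads from complete donors in the same group."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for u in units:
        if u["admin"] is not None:
            donors[tuple(u[v] for v in group_vars)].append(u["admin"])
    # One set of Dirichlet(1, ..., 1) donor weights per group.
    weights = {}
    for key, pool in donors.items():
        g = [rng.expovariate(1.0) for _ in pool]
        total = sum(g)
        weights[key] = [gi / total for gi in g]
    completed = []
    for u in units:
        filled = dict(u)  # leave the input untouched
        if filled["admin"] is None:
            key = tuple(u[v] for v in group_vars)
            if key not in weights:
                raise ValueError(f"no complete donors in group {key}")
            filled["admin"] = rng.choices(donors[key], weights=weights[key], k=1)[0]
        completed.append(filled)
    return completed
```

The explicit error for an empty donor pool corresponds to the sample-size checks described above: a group without complete records has to be merged with a coarser group before imputation.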

Steps after Synthesizing • Two criteria for judging success • Confidentiality protection • Statistical usefulness (Analytical validity) • Perform two types of tests • Probabilistic record linkage re-identification tests: can SIPP respondents in synthetic data be linked back to already existing public use data? • Use synthetic data for analyses and compare results to results obtained using non-synthetic data

Confidentiality Protection • Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based • This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF • Goals: • re-identification of SIPP records from the PUF should result in very few true matches • any candidate match should have substantial uncertainty regarding its status as true or false

Disclosure Avoidance Analysis • Uses probabilistic record linking and two types of distance-based record linking • Each synthetic implicate is matched back to the gold standard • All unsynthesized variables are used as blocking variables • Different matching variable sets are used in the probabilistic record linking • All synthesized variables are used in the distance-based record linking
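A distance-based re-identification test of this kind can be sketched as a nearest-neighbour search within blocks. The choice of standardized Euclidean distance and all field names below are assumptions for illustration; the two distance measures actually used are not specified here.

```python
import math
from collections import defaultdict

def reidentification_rate(gold, synthetic, block_vars, match_vars):
    """Share of synthetic records whose nearest gold-standard record, within the
    same block, is the record they were derived from.

    gold and synthetic are lists of dicts aligned by position: record i in both
    lists corresponds to the same respondent.
    """
    # Standardise each matching variable by its gold-standard std deviation.
    scale = {}
    for v in match_vars:
        vals = [g[v] for g in gold]
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((x - mean) ** 2 for x in vals) / len(vals))
        scale[v] = sd or 1.0
    # Block on the unsynthesized variables.
    blocks = defaultdict(list)
    for i, g in enumerate(gold):
        blocks[tuple(g[v] for v in block_vars)].append(i)
    correct = 0
    for i, s in enumerate(synthetic):
        candidates = blocks.get(tuple(s[v] for v in block_vars), [])
        if not candidates:
            continue
        nearest = min(
            candidates,
            key=lambda j: sum(((s[v] - gold[j][v]) / scale[v]) ** 2 for v in match_vars),
        )
        correct += nearest == i
    return correct / len(synthetic)
```

A rate close to the reciprocal of the typical block size indicates that the nearest record is no more likely to be the true source than a random candidate in the block.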

Matching Variables and Associated M and U Probabilities
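For context, m- and u-probabilities enter a Fellegi-Sunter style match weight as sketched below, where m is the probability that a variable agrees for a true match and u the probability that it agrees for a non-match. The variable names and probability values in the usage lines are hypothetical, not those used in the project.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log2 agreement or disagreement weights over the matching variables.

    agreements: dict variable -> True if the two records agree on it.
    m, u: dicts variable -> probability of agreement among true matches (m)
          and among non-matches (u).
    """
    total = 0.0
    for var, agrees in agreements.items():
        if agrees:
            total += math.log2(m[var] / u[var])
        else:
            total += math.log2((1 - m[var]) / (1 - u[var]))
    return total

# Hypothetical values for two matching variables.
m = {"birth_year": 0.90, "earnings_decile": 0.70}
u = {"birth_year": 0.05, "earnings_decile": 0.10}
weight = match_weight({"birth_year": True, "earnings_decile": False}, m, u)
```

Candidate pairs with a high total weight are declared matches, so heavy synthesis, which pushes m toward u, drives the weights toward zero and makes true and false matches hard to distinguish.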

Probabilistic Record Linking Results

Distance-based Linking Results

Analytical Validity • All univariate distributions • Selected first, second and third-order interactions • Selected linear and non-linear multivariate models • Small micro-simulations

Log Total Earnings White Males

Log Total Earnings Black Males

Log AIME/AMW All Individuals

Log Initial MBA All Retired Individuals

Log Initial MBA Disabled Individuals

Logistic Regression: Has a DB or DC Pension

Lifetime Total FICA Earnings

Lifetime Total FICA Work Years

Micro-simulation of Retirement Accounts

Next Steps • Census DRB has approved release • IRS Disclosure Officer has completed review and will approve release • SSA and the Census Bureau are negotiating the terms of the Beta and Final releases • Released data will be fully supported on the Cornell Virtual Research Data Center • Some models estimated on the Beta release will be re-estimated on the Gold Standard to further assess the Beta release's analytical validity
