Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck

Jennifer Huckett, Iowa State University. June 20, 2007.


Presentation Transcript


  1. Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck Jennifer Huckett Iowa State University June 20, 2007

  2. Outline • Motivation • Disclosure Limitation Methods • Risk Assessment • Simulation Study • Results & Conclusions

  3. Motivation • Iowa Department of Revenue (IDR) • Collects and maintains individual tax return data • Legislative Services Agency (LSA) • Examines impact of tax law changes on liability • Current system • LSA submits requests to IDR • IDR computes liability, reports to LSA • Occurs several times each year • Inefficient for both IDR and LSA

  4. Solutions • Secure/remote access server • Data are not released • Some analyses suppressed • Statistical disclosure limitation (SDL) • Tabular • Microdata • enable IDR to provide LSA with data set • allow LSA to compute liability with ease and accuracy • MUST ENSURE CONFIDENTIALITY of RECORDS!

  5. Establishment Connection • Very skewed distributions, unusual associations among distributions • Groups of variables are related to one another in unusual ways • Similar to business tax data or business expenditure/revenue data • Confidentiality is critical

  6. Traditional Approaches • Recoding (e.g. aggregation) • Noise addition • Data swapping • Data suppression • Imputation • Combinations of these

  7. Our Approach • Synthetic microdata simulation • Retain key demographic variables • Simulate values for some variables • Quantile regression conditional on key variables • Compute fitted values at selected quantiles • Impute values for remaining variables • Hot deck + rank swap • Hot deck based on simulated income variables

  8. Quantile Regression • $\rho_\tau$ = “tilted absolute value function” for quantile $\tau$ • $x_i^\top \beta$ = linear function of predictors ($x_i$) • performed in R • quantreg package • rq function • Quantile Regression, Koenker (2005)
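
The "tilted absolute value function" on this slide can be illustrated with a small sketch. It is written in Python rather than R and does not reproduce what `rq` does internally; it only shows, on a made-up sample, that minimising the summed tilted loss over a constant predictor recovers a sample quantile. The function names and the data are assumptions introduced for illustration.

```python
def check_loss(u, tau):
    """Tilted absolute value: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(c, ys, tau):
    """Summed check loss when every y is predicted by the constant c."""
    return sum(check_loss(y - c, tau) for y in ys)


def best_constant(ys, tau):
    """Brute-force minimiser over the observed values.

    For a constant predictor a minimiser always lies on a data point,
    so searching the sample itself is enough for this illustration.
    """
    return min(sorted(ys), key=lambda c: total_loss(c, ys, tau))


ys = [1.0, 2.0, 3.0, 4.0, 100.0]      # made-up, heavily skewed sample
median_fit = best_constant(ys, 0.5)   # minimiser for tau = 0.5
upper_fit = best_constant(ys, 0.9)    # minimiser for tau = 0.9
```

With a linear function of predictors in place of the constant, the same loss defines the quantile regression fit at each quantile.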

  9. Simulate via Quantile Regression • Estimate $\hat{\beta}(\tau)$ for quantiles $\tau$ from a selected set • For each record on variable y • Randomly select $u$ ~ Uniform(0,1) • Compute fitted $\hat{y}$ given x at the quantiles $\tau$ just above and below $u$ • Interpolate to obtain $\tilde{y}$ = simulated value
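
The simulation step for a single record can be sketched as follows, assuming the fitted values at each quantile level have already been computed. The quantile grid, the fitted wage values, linear interpolation, and the clamping of draws that fall outside the grid are all assumptions made for illustration; the slide does not state them.

```python
import bisect
import random


def simulate_value(taus, fitted, u):
    """Interpolate linearly between the fitted quantiles that bracket u.

    taus   -- increasing quantile levels at which the model was fitted
    fitted -- fitted values for one record at those levels (same length)
    u      -- a draw from Uniform(0, 1)
    """
    if u <= taus[0]:      # assumption: clamp draws below the grid
        return fitted[0]
    if u >= taus[-1]:     # assumption: clamp draws above the grid
        return fitted[-1]
    hi = bisect.bisect_right(taus, u)   # first grid level strictly above u
    lo = hi - 1                         # last grid level at or below u
    weight = (u - taus[lo]) / (taus[hi] - taus[lo])
    return fitted[lo] + weight * (fitted[hi] - fitted[lo])


taus = [0.1, 0.25, 0.5, 0.75, 0.9]                        # hypothetical grid
fitted = [12000.0, 20000.0, 31000.0, 47000.0, 80000.0]    # hypothetical fitted wages

rng = random.Random(2007)
simulated = [simulate_value(taus, fitted, rng.random()) for _ in range(5)]
```

A finer grid of quantiles near 0 and 1 would reduce how often the clamping rule applies, which matters for skewed variables such as income.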

  10. IDR Application: Key Demographic Variables • Number of dependents: 0, 1, 2, … • Categorized into 0, 1, ≥2 • County: 1, …, 99 • Categorized into 4 population size groups • State filing status: (1) single, (2) married filing joint, (3) married filing separate on combined return, (4) married filing separate returns, (5) head of household, (6) widow(er) with dependent child • Categorized into 1; 2 and 3; 4, 5, and 6

  11. IDR Application: Quantile Regression for wages

  12. IDR Application: Hot Deck and Rank Swap for Federal Tax • Hot Deck • Mahalanobis distance • closest 20 records • Rank Swap • compute sample rank, r • draw random rank, r*, from discrete Uniform[r-10, r+10] • impute value from record with rank r*
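
The two steps on this slide can be sketched separately. The sketch below restricts the Mahalanobis distance to two matching variables so that the covariance matrix can be inverted by hand; drawing the donor at random among the 20 closest records, clamping r* to the valid range of ranks, and all data values are assumptions, since the slide does not specify them.

```python
import random


def closest_donors(points, target, k=20):
    """Indices of the k points nearest to target in Mahalanobis distance.

    Restricted to two variables so the 2x2 covariance matrix can be
    inverted in closed form.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    def dist2(p):
        dx, dy = p[0] - target[0], p[1] - target[1]
        return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

    return sorted(range(n), key=lambda i: dist2(points[i]))[:k]


def rank_swap(values, window=10, rng=None):
    """Replace each value by that of a record whose rank is within +/- window."""
    if rng is None:
        rng = random.Random()
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # order[r] = index of rank-r record
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r
    swapped = []
    for i in range(n):
        r_star = rng.randint(rank[i] - window, rank[i] + window)
        r_star = min(max(r_star, 0), n - 1)   # assumption: clamp at the ends
        swapped.append(values[order[r_star]])
    return swapped


rng = random.Random(2007)
# Made-up original records: (wages, interest income) and federal tax.
incomes = [(rng.gauss(40000, 12000), rng.gauss(1500, 600)) for _ in range(200)]
taxes = [rng.uniform(0, 50000) for _ in range(200)]

simulated_record = (42000.0, 1400.0)        # hypothetical simulated incomes
donors = closest_donors(incomes, simulated_record, k=20)
hot_deck_tax = taxes[rng.choice(donors)]    # assumption: donor drawn at random

swapped_taxes = rank_swap(taxes, window=10, rng=rng)
```

Note that `rank_swap` here imputes from the record at rank r*, as the slide describes, rather than exchanging values pairwise, so the released values need not be a permutation of the originals.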

  13. Disclosure Risk Measurement • Using methods detailed in Reiter (2005) and Duncan and Lambert (1986, 1989) • Examine specific records • Original records • Released records • Model intruder behavior to assess disclosure risk • Simulation Study

  14. Original and Released Records

  15. Intruder Behavior • Target record, t • Intruder has information on target • Attempts to match t in released records • Released records j=1,…,r in Z • Compute the probability that record j belongs to target t • As probability decreases, disclosure risk decreases

  16. Simulation Study • Schemes for SDL influence the division of A into Ap (available, perturbed) and Ad (available, unperturbed).

  17. SDL Schemes in Simulation Study • No SDL • Swap 30% marital status • Swap 30% marital status and minority • Recode age into 5 year intervals • Recode age into 5 year intervals and swap 30% marital status and minority • Simulation via quantile regression and hot deck
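
Two of the traditional schemes listed above can be sketched briefly. The slide does not say how the 30% swap was carried out or where the age intervals start, so the sketch assumes intervals beginning at multiples of 5 and a swap that shuffles values among a randomly selected 30% of records; the function names and data are hypothetical.

```python
import random


def recode_age(age, width=5):
    """Map an age to its interval, e.g. 37 -> (35, 39) for width 5."""
    low = (age // width) * width
    return (low, low + width - 1)


def swap_fraction(values, fraction=0.30, rng=None):
    """Shuffle the values of a randomly chosen fraction of records among themselves.

    Some selected records may receive their own value or an identical one,
    so the share of records that actually change can be below `fraction`.
    """
    if rng is None:
        rng = random.Random()
    n_swap = round(fraction * len(values))
    chosen = rng.sample(range(len(values)), n_swap)
    donors = chosen[:]
    rng.shuffle(donors)
    out = list(values)
    for i, j in zip(chosen, donors):
        out[i] = values[j]
    return out


# Made-up records for illustration.
ages = [23, 37, 40, 64, 59]
marital = ["married", "single", "single", "married", "widowed",
           "single", "married", "divorced", "married", "single"]

recoded_ages = [recode_age(a) for a in ages]
swapped = swap_fraction(marital, 0.30, random.Random(17))
```

Because the swap only rearranges existing values, marginal counts of marital status are unchanged while the link between marital status and the other variables is weakened for the selected records.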

  18. Targets • Intruder has information on target, t, and wants to match with released records • Consider a few targets • Unique record • Rare record • Common record

  19. Results from Simulation Study

      target   No SDL   Marital swap   Marital and minority swap   Age recode   Swaps and recode   Quantile regression and hot deck
      unique   1        1              0.1046                      1            0.0178             0.0895
      rare     0.3333   0.1044         0.1304                      0.0526       0.0225             0.0016
      common   0.0385   0.0320         0.0320                      0.0068       0.0055             0.0008

  20. Conclusions & Future Work • Risk behaves as we expect • Increased SDL leads to decreased disclosure risk (except for the unique record!) • Apply SDL techniques to American Community Survey data at US Census Bureau • Compare traditional techniques to quantile regression and hot deck by computing risk • Measure utility of released data

  21. Acknowledgements • Iowa Department of Revenue • Iowa’s Legislative Services Agency • National Institute of Statistical Sciences • US Census Bureau Dissertation Fellowship Award

  22. References • Duncan, G.T. and Lambert, D. 1986. “Disclosure-Limited Data Dissemination,” Journal of the American Statistical Association, 81, 10-28. • Duncan, G.T. and Lambert, D. 1989. “The Risk of Disclosure for Microdata,” Journal of Business and Economic Statistics, 7, 207-217. • Koenker, R. 2005. “Introduction,” Quantile Regression, Econometric Society Monograph Series, Cambridge University Press. • Reiter, J.P. 2005. “Estimating Risks of Identification Disclosure in Microdata,” Journal of the American Statistical Association, 100(472), 1103-1113.
