
Some aspects concerning analytical validity and disclosure risk of CART generated synthetic data

Hans-Peter Hafner (Research Data Centre of the Statistical Offices of the Länder) and Rainer Lenz (Saarland State University of Applied Sciences). UNECE Work Session on Statistical Data Confidentiality, Tarragona, 27 October 2011.


Presentation Transcript


  1. Some aspects concerning analytical validity and disclosure risk of CART generated synthetic data
     Hans-Peter Hafner, Research Data Centre of the Statistical Offices of the Länder
     Rainer Lenz, Saarland State University of Applied Sciences
     UNECE Work Session on Statistical Data Confidentiality, Tarragona, 27 October 2011

  2. Overview
     • Project background
     • CART for synthetic data: Methodology and model modifications
     • Case Study: Sample of Monthly Report Manufacturing Sector
     • Analytical potential of the synthetic data
     • Testing confidentiality: Matching simulations
     • Prospects
     © Statistische Ämter der Länder, Forschungsdatenzentrum

  3. Project Background
     • German project InfinitE
     • Controlled remote data execution that is easier to handle for both scientists and RDC staff
     • Need for semantically valid data structure files
     • Automation of output checking

  4. CART for synthetic Data: Methodology
     • Step 1: Generation of the tree using the original data
     • Step 2: Drawing of synthetic values in each leaf of the tree (Bayesian bootstrap)
     • For previously synthesized variables: original values are replaced by synthetic ones
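
Step 2 on this slide can be illustrated with a minimal sketch. It assumes Step 1 has already been done elsewhere and that every record carries the id of the leaf it fell into; the names `leaf_ids`, `bayesian_bootstrap` and `synthesize_by_leaf` are hypothetical and not taken from the presentation. The Bayesian bootstrap is implemented in its usual form: Dirichlet(1, ..., 1) weights over the donors of a leaf, followed by a weighted draw.

```python
import random
from collections import defaultdict


def bayesian_bootstrap(values, n_draws, rng):
    """Draw n_draws values from `values` using Bayesian bootstrap weights."""
    # Normalised Exp(1) draws are Dirichlet(1, ..., 1) distributed.
    gammas = [rng.expovariate(1.0) for _ in values]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    return rng.choices(values, weights=weights, k=n_draws)


def synthesize_by_leaf(leaf_ids, original_values, seed=0):
    """Replace each record's value by a draw from the donors in its leaf.

    leaf_ids[i] is the (assumed, precomputed) leaf of record i.
    """
    rng = random.Random(seed)
    donors = defaultdict(list)      # leaf -> original values in that leaf
    positions = defaultdict(list)   # leaf -> record indices in that leaf
    for i, (leaf, value) in enumerate(zip(leaf_ids, original_values)):
        donors[leaf].append(value)
        positions[leaf].append(i)

    synthetic = [None] * len(original_values)
    for leaf, idx in positions.items():
        draws = bayesian_bootstrap(donors[leaf], len(idx), rng)
        for i, drawn in zip(idx, draws):
            synthetic[i] = drawn
    return synthetic


leaf_ids = [0, 0, 0, 1, 1]
turnover = [10, 11, 12, 100, 110]
synthetic_turnover = synthesize_by_leaf(leaf_ids, turnover)
```

Every synthetic value is an original value from the same leaf, so the tree controls how much of the relationship to the predictors is preserved.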

  5. CART for synthetic Data: Model Modifications
     • Synthesis order of the variables
     • Model specification: all variables versus only variables previously synthesized or not synthesized at all
     • Stopping rules: complexity parameter cp of the R package rpart
     • A further split is made only if it decreases the overall lack of fit by a factor of cp
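
The role of cp as a stopping rule can be sketched as follows. This is a simplified reading for a regression tree, assuming lack of fit is measured by the sum of squared errors (SSE) and that the improvement of a split is compared with cp times the SSE of the root node; it is not rpart's actual code, and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (lack of fit of a node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_allowed(left, right, root_sse, cp):
    """Return True if splitting a node into `left` and `right` is worthwhile.

    The gain in fit is expressed relative to the root node's SSE,
    so cp acts as a minimum relative improvement.
    """
    parent = left + right
    gain = sse(parent) - sse(left) - sse(right)
    return gain / root_sse >= cp


root = [1, 2, 10, 11]
root_sse = sse(root)

# Separating small from large values removes almost all of the error.
useful = split_allowed([1, 2], [10, 11], root_sse, cp=0.01)

# Splitting two identical halves brings no gain and is rejected.
useless = split_allowed([1, 2], [1, 2], root_sse, cp=0.01)
```

Under this reading, a smaller cp lets the tree grow deeper, which yields smaller leaves and synthetic values that stay closer to the original records.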

  6. Case Study: Data
     Monthly report on local units of the manufacturing sector (15% sample of the longitudinal section 1999 – 2002: 6483 units)
     • Full survey of local units with focus on economic activity in the manufacturing sector and at least 20 employees
     • Attributes: Location, turnover, wages and salaries, working hours

  7. Case Study: Anonymisation – Prerequisites
     • Transformation of continuous variables: cube root
     • Coarsening of the NACE code to 2 digits and of the regional key to federal state level
     • Different CART trees for five subsets of the data (turnover size classes)
     • Two variants:
       (i) Absolute numerical values for all variables and all years
       (ii) Absolute values for 1999 and rates of change for 2000-2002
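
The cube root transformation and variant (ii) can be sketched as below. The slide does not define "rate of change"; the sketch assumes the year-over-year relative change (current minus previous, divided by previous), and all function names are hypothetical.

```python
import math


def cube_root(x):
    """Sign-preserving cube root, usable for negative values as well."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def to_variant_ii(levels):
    """[v1999, v2000, v2001, v2002] -> [v1999, r2000, r2001, r2002].

    Assumes no year has a value of zero (division by the previous year).
    """
    rates = [(cur - prev) / prev for prev, cur in zip(levels, levels[1:])]
    return [levels[0]] + rates


def from_variant_ii(encoded):
    """Rebuild absolute values from the base year and the rates of change."""
    levels = [encoded[0]]
    for rate in encoded[1:]:
        levels.append(levels[-1] * (1.0 + rate))
    return levels


turnover = [100.0, 110.0, 99.0, 99.0]       # 1999 to 2002
encoded = to_variant_ii(turnover)            # base value plus three rates
restored = from_variant_ii(encoded)          # back to absolute values
```

Synthesizing rates instead of levels keeps each unit's trajectory tied to its own 1999 value, which is one plausible reason for the two variants to behave differently in longitudinal analyses.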

  8. Case Study: Analytical Validity
     • NEC = rate of job creation – rate of job destruction (net employment change)
     • JT = rate of job creation + rate of job destruction (job turnover)
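
The two indicators can be computed from unit-level employment in two consecutive years, as in the sketch below. The slide gives only the NEC and JT formulas; the definitions of the underlying rates used here (sum of employment gains, respectively losses, divided by total employment in the earlier year) are an assumption, and `job_flows` is a hypothetical name.

```python
def job_flows(emp_prev, emp_cur):
    """Job creation/destruction rates, NEC and JT for matched local units.

    emp_prev[i] and emp_cur[i] are the employees of unit i in two
    consecutive years. Rates are relative to total employment in the
    earlier year (an assumed convention).
    """
    changes = [cur - prev for prev, cur in zip(emp_prev, emp_cur)]
    base = sum(emp_prev)
    creation = sum(d for d in changes if d > 0) / base
    destruction = sum(-d for d in changes if d < 0) / base
    return {
        "JC": creation,
        "JD": destruction,
        "NEC": creation - destruction,   # net employment change
        "JT": creation + destruction,    # job turnover
    }


flows = job_flows(emp_prev=[100, 50, 50], emp_cur=[110, 40, 50])
```

Running the same function on the original and on a synthetic data set and comparing NEC and JT gives one simple check of analytical validity.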

  9. Case Study: Analytical Validity
     • Results using rates of change are much better than those using absolute values
     • Results for smaller values of the parameter cp tend to be better
     • Problem: Variation between different synthetic data sets is very large
     • Aim: Only one synthetic data structure file for researchers

  10. Case Study: Matching Experiments
     Database cross match
     • External data: Original data
     • Blocking variables: Size classes of turnover and number of employees (mean 1999 – 2002)
     • Key variables:
       • Number of employees
       • Turnover
     • First results for cp = 0.00001: For one block 27% hits, for other blocks no more than 15%
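
A blocked matching experiment of this kind can be sketched as follows. The slide does not describe the matching algorithm, so the sketch swaps in a plain nearest-neighbour match on unscaled Euclidean distance within each block, and counts a hit when the nearest synthetic record was generated from the original record being matched. The record layout (`id`, `block`, `emp`, `turnover`) and the function name are hypothetical.

```python
import math
from collections import defaultdict


def hit_rates(original, synthetic, block_key, key_vars):
    """Share of original records whose nearest synthetic record in the
    same block carries the same id (known only to the evaluator)."""
    by_block = defaultdict(list)
    for rec in synthetic:
        by_block[rec[block_key]].append(rec)

    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in original:
        block = rec[block_key]
        totals[block] += 1
        candidates = by_block.get(block)
        if not candidates:
            continue  # no synthetic record in this block: counts as a miss
        point = [rec[k] for k in key_vars]
        nearest = min(
            candidates,
            key=lambda s: math.dist(point, [s[k] for k in key_vars]),
        )
        if nearest["id"] == rec["id"]:
            hits[block] += 1
    return {block: hits[block] / totals[block] for block in totals}


original = [
    {"id": 1, "block": "A", "emp": 20, "turnover": 100.0},
    {"id": 2, "block": "A", "emp": 50, "turnover": 500.0},
    {"id": 3, "block": "B", "emp": 30, "turnover": 200.0},
    {"id": 4, "block": "B", "emp": 31, "turnover": 204.0},
]
synthetic = [
    {"id": 1, "block": "A", "emp": 22, "turnover": 110.0},
    {"id": 2, "block": "A", "emp": 48, "turnover": 480.0},
    {"id": 3, "block": "B", "emp": 30, "turnover": 201.0},
    {"id": 4, "block": "B", "emp": 30, "turnover": 198.5},
]
rates = hit_rates(original, synthetic, "block", ["emp", "turnover"])
```

In practice the key variables would usually be standardised before computing distances, since turnover otherwise dominates the number of employees.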

  11. Prospects
     • Examination of synthesis order and model specification
     • Further analyses regarding the optimal value of cp
     • Better adaptation of the matching procedure to synthetic data

  12. Thank you for your attention!
     Contact
     Hans-Peter Hafner, Research Data Center of the Statistical Offices of the Länder, hhafner@statistik-hessen.de
     Rainer Lenz, Saarland State University of Applied Sciences, rainer.lenz@htw-saarland.de
     www.forschungsdatenzentrum.de
